SIGMAP 2022 Abstracts


Area 1 - Multimedia and Deep Learning

Short Papers
Paper Nr: 13
Title:

Improving Car Detection from Aerial Footage with Elevation Information and Markov Random Fields

Authors:

Kevin Qiu, Dimitri Bulatov and Lukas Lucks

Abstract: Convolutional neural networks are often trained on RGB images because standard practice is to apply transfer learning with a pre-trained model. Satellite and aerial imagery, however, usually contain additional bands, such as infrared or elevation channels. Especially for the detection of small objects, such as cars, this additional information can provide a significant benefit. We developed a semantic segmentation model trained on combined optical and elevation data. Moreover, a post-processing routine using Markov Random Fields was developed and compared to a sequence of pixel-wise and object-wise filtering steps. The models are evaluated on the Potsdam dataset at the pixel and object level, with accuracies around 90%.
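The kind of input fusion this abstract describes can be sketched by stacking the optical channels and a co-registered elevation channel into one multi-channel tensor before feeding a segmentation network. This is a minimal illustration; the function name and the per-channel normalisation scheme are assumptions, not details from the paper:

```python
import numpy as np

def stack_rgb_elevation(rgb, elevation):
    """Fuse an RGB tile with a co-registered elevation channel
    into a single 4-channel input for a segmentation network."""
    if rgb.shape[:2] != elevation.shape:
        raise ValueError("RGB and elevation tiles must be co-registered")
    # Normalise both sources to [0, 1] so optical and elevation
    # values share a comparable range.
    rgb_n = rgb.astype(np.float32) / 255.0
    elev = elevation.astype(np.float32)
    rng = elev.max() - elev.min()
    elev_n = (elev - elev.min()) / rng if rng > 0 else np.zeros_like(elev)
    return np.concatenate([rgb_n, elev_n[..., None]], axis=-1)
```

The first convolutional layer of the network then simply takes four input channels instead of three.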

Area 2 - Multimedia Signal Processing

Full Papers
Paper Nr: 4
Title:

Using Video Motion Vectors for Structure from Motion 3D Reconstruction

Authors:

Richard C. Turner, Natasha K. Banerjee and Sean Banerjee

Abstract: H.264 video compression has become the prevalent choice for devices that require live video streaming, including mobile phones, laptops, and Micro Aerial Vehicles (MAVs). H.264 uses motion estimation to predict the displacement of pixels, grouped together as macroblocks, between two or more video frames. H.264 is ideal for live video compression because each frame contains much of the information found in previous and future frames; by estimating the motion vector of each macroblock for every frame, significant compression can be obtained. Combined with System on Chip (SoC) encoders, high-quality video at low power and bandwidth is now achievable. 3D scene reconstruction using structure from motion (SfM) is a highly computationally intensive process, typically performed offline on high-performance computing devices. A significant portion of the computation required for SfM lies in the feature detection, matching, and correspondence tracking necessary for the 3D scene reconstruction. We present an SfM pipeline that uses H.264 motion vectors to replace much of the processing required to detect, match, and track correspondences across video frames. Our results show a significant decrease in computation while accurately reconstructing the 3D scene.
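The core idea of reusing motion vectors in place of feature matching can be illustrated by chaining per-macroblock displacements across frames into a correspondence track. The data layout below (one dict per frame, mapping a macroblock's grid position to its displacement) is a simplified assumption for illustration, not the actual decoder interface used in the paper:

```python
def chain_motion_vectors(mv_frames, start, block=16):
    """Chain per-macroblock motion vectors across frames into a
    correspondence track, standing in for explicit feature matching.

    mv_frames: list of dicts, one per frame transition, mapping a
               macroblock's top-left (x, y) to its (dx, dy)
               displacement into the next frame.
    start:     (x, y) position tracked from the first frame.
    """
    track = [start]
    x, y = start
    for mvs in mv_frames:
        # Snap the current position to its macroblock grid cell.
        key = (x - x % block, y - y % block)
        if key not in mvs:
            break  # track lost: macroblock was intra-coded or skipped
        dx, dy = mvs[key]
        x, y = x + dx, y + dy
        track.append((x, y))
    return track
```

Tracks built this way can then be handed to the usual SfM triangulation and bundle-adjustment stages.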

Paper Nr: 5
Title:

CECNN: A Convergent Error Concealment Neural Network for Videos

Authors:

Razib Iqbal, Shashi Khanal and Mohammad Kazemi

Abstract: In video error concealment, missing information in video frames is estimated as close to the actual data as possible. In this paper, we present a video error concealment technique, named Convergent Error Concealment Neural Network (CECNN), based on a Convolutional Neural Network (CNN). CECNN is a two-stage process: it first learns to predict voxel information from the training dataset, and then applies transfer learning using the pre-trained model from the first stage to produce intermediate outputs. CECNN consists of dedicated paths for the past and future frames to produce these intermediate outputs, which are then combined to fill in the missing information in the erroneous frame. The quality of the outputs from CECNN is compared with that of other techniques, such as motion vector estimation, error concealment using neighboring motion vectors, and generative image inpainting. The evaluation results suggest that our CECNN approach is a good candidate for error concealment in video decoders.

Paper Nr: 7
Title:

Merged Pitch Histograms and Pitch-duration Histograms

Authors:

Hui Liu, Tingting Xue and Tanja Schultz

Abstract: The traditional pitch histogram and the various features extracted from it play a pivotal role in music information retrieval. In our research on songs, especially when applying pitch statistics to investigate the main melody, we found that the pitch histogram does not necessarily reflect the pitch characteristics of a whole song's notes perfectly. Therefore, we took note duration into account to propose two advanced versions of the pitch histogram and validated their applicability. This paper introduces these two novel histograms: the merged pitch histogram, obtained by merging consecutively repeated pitches, and the pitch-duration histogram, obtained by utilizing each pitch's duration information. Complemented by a description of their calculation algorithms, a discussion of their advantages and limitations, an analysis of their application to songs from various languages and cultures, and a demonstration of their use cases in state-of-the-art research, the proposed histograms' characteristics and usefulness are intuitively revealed.
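The two histogram variants can be sketched directly from their definitions. This is a minimal illustration assuming notes arrive as (pitch, duration) pairs in score order; the function names are hypothetical:

```python
from collections import Counter

def merged_pitch_histogram(notes):
    """Histogram over pitches after merging consecutively repeated
    pitches into a single occurrence.

    notes: list of (pitch, duration) pairs in score order.
    """
    merged = []
    for pitch, _ in notes:
        if not merged or merged[-1] != pitch:
            merged.append(pitch)
    return Counter(merged)

def pitch_duration_histogram(notes):
    """Histogram in which each pitch is weighted by its total
    duration rather than by its count of occurrences."""
    hist = Counter()
    for pitch, duration in notes:
        hist[pitch] += duration
    return hist
```

For a repeated note, the merged variant counts one occurrence, while the duration-weighted variant accumulates the full sounding time.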

Paper Nr: 14
Title:

STIFS: Spatio-Temporal Input Frame Selection for Learning-based Video Super-Resolution Models

Authors:

Arbind Agrahari Baniya, Tsz-Kwan Lee, Peter W. Eklund and Sunil Aryal

Abstract: Deep learning Video Super-Resolution (VSR) methods rely on learning spatio-temporal correlations between a target frame and its neighbouring frames in a given temporal radius to generate a high-resolution output. Among recent VSR models, a sliding-window mechanism is popularly adopted, picking a fixed number of consecutive frames as neighbouring frames for a given target frame. This results in a single frame being used multiple times in the input space during the super-resolution process. Moreover, adopting fixed consecutive frames directly does not allow deep learning models to learn the full extent of the spatio-temporal inter-dependencies between a target frame and its neighbours along a video sequence. To mitigate these issues, this paper proposes a Spatio-Temporal Input Frame Selection (STIFS) algorithm based on image analysis to adaptively select the neighbouring frame(s) according to the spatio-temporal context dynamics with respect to the target frame. STIFS is the first dynamic selection mechanism proposed for VSR methods. It aims to enable VSR models to better learn spatio-temporal correlations in a given temporal radius and consequently maximise the quality of the high-definition output. The proposed STIFS algorithm achieved remarkable PSNR improvements in the high-resolution output for VSR models on benchmark datasets.

Short Papers
Paper Nr: 2
Title:

A Data Augmentation Approach for Improving the Performance of Speech Emotion Recognition

Authors:

Georgia Paraskevopoulou, Evaggelos Spyrou and Stavros Perantonis

Abstract: Recognizing human emotions is crucial for various applications related to human-computer interaction and for understanding users' mood in several tasks. Typical machine learning approaches toward this goal first extract a set of linguistic features from raw data, which are then used to train supervised learning models. Recently, Convolutional Neural Networks (CNNs), which, unlike traditional approaches, learn to extract the appropriate features from their inputs, have also been applied as emotion recognition classifiers. In this work, we adopt a CNN architecture that uses spectrograms extracted from audio signals as inputs, and we propose data augmentation techniques to boost classification performance. The proposed data augmentation approach includes noise addition, shifting of the audio signal, and changing its pitch or speed. Experimental results indicate that the presented approach outperforms previous work that does not use augmented data.
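The augmentation operations listed in the abstract can be sketched on a raw waveform array. This is a minimal NumPy illustration; the function names and parameter choices are assumptions, and the naive interpolation-based resampling changes speed and pitch together rather than independently:

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio."""
    power = np.mean(signal ** 2)
    noise_power = power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), signal.shape)

def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples."""
    return np.roll(signal, shift)

def change_speed(signal, factor):
    """Resample by linear interpolation; factor > 1 speeds the
    signal up (shorter output), also raising its pitch."""
    n_out = int(round(len(signal) / factor))
    src = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(src, np.arange(len(signal)), signal)
```

Each augmented waveform would then be converted to a spectrogram before being fed to the CNN.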

Paper Nr: 6
Title:

Visual RSSI Fingerprinting for Radio-based Indoor Localization

Authors:

Giuseppe Puglisi, Daniele Di Mauro, Antonino Furnari, Luigi Gulino and Giovanni M. Farinella

Abstract: The problem of localizing objects by exploiting RSSI signals has been tackled using both geometric and machine-learning-based methods. Machine-learning-based solutions have the advantage of coping better with noise, but they require many radio signal observations associated with the correct position in the target space. This data collection and labeling process is not trivial and typically requires building a dense grid of observations, which can be resource-intensive. To overcome this issue, we propose a pipeline that uses an autonomous robot to collect RSSI-image pairs and Structure from Motion to associate 2D positions with the RSSI values based on the inferred position of each image. This method, as we show in the paper, allows large quantities of data to be acquired inexpensively. Using the collected data, we experiment with machine learning models based on RNNs and propose an optimized model composed of a set of LSTMs that specialize in the RSSI observations coming from different antennas. The proposed method shows promising results, outperforming different baselines and suggesting that the proposed pipeline for collecting and automatically labeling observations is useful in real scenarios. Furthermore, to aid research in this area, we publicly release the collected dataset comprising 57,158 RSSI observations paired with RGB images.

Paper Nr: 9
Title:

HERO: An Artificial Conversational Assistant to Support Humans in Industrial Scenarios

Authors:

Claudia Bonanno, Francesco Ragusa, Rosario Leonardi, Antonino Furnari and Giovanni M. Farinella

Abstract: We present HERO, a Conversational Intelligent Assistant to support workers in industrial domains. The proposed system interacts with humans using natural language and observes the surrounding world in order to resolve language ambiguity. HERO is composed of four modules: 1) an input module that processes both text and visual signals, 2) an NLP module that predicts user intent and extracts relevant entities from text, 3) an object detection module that extracts entities by analyzing images captured by the user, and 4) an output module responsible for choosing the best answer to send to the user. To assess its usefulness in a real scenario, the proposed system was implemented and evaluated in an industrial laboratory setting. Preliminary experiments show that HERO achieves good performance in predicting intents and entities by exploiting both text and visual signals.

Paper Nr: 11
Title:

Spline Modeling and Level of Detail for Audio

Authors:

Matt Klassen

Abstract: In this paper, we propose spline models of audio as the first step toward a hierarchical level-of-detail (LOD) system for audio rendering. We describe methods, such as cycle interpolation, that produce spline models and approximations of audio data. These models can be used to render output in real time, but can also be mixed prior to rendering. Audio data simplified to spline models with cycle interpolation can be reduced to less than 2% of the full-resolution data size, with only minor impact on audio quality. We present a sequence of such examples with instrument models. We also introduce the idea of pre-rendered filtering and mixing, based on the B-spline coefficients of the models.

Area 3 - Multimedia Systems and Applications

Full Papers
Paper Nr: 10
Title:

Statistical Analysis of Color Differences on Iris Images for Supporting Cluster Headache Diagnosis

Authors:

Inmaculada Mora-Jiménez, Andrés Iglesias-Rojano, Mohammed El-Yaagoubi, José L. Rojo-Álvarez and Juan A. Pareja-Grande

Abstract: The existence of certain headaches in humans caused by sympathetic hypofunction, either congenital or acquired at birth, is well known. These pathologies, called cluster headaches, manifest physically as changes in the texture, color, and/or intensity of the iris on the painful side. Automatic study of these variations would make it possible to provide quantitative measures of such pathology from color images of the left and right irises of a particular individual. In this context, this work analyzes the color of the left and right irises to identify chromatic differences between the irises of the same individual across three color spaces. We studied the iris color distribution within the same eye, as well as the degree of similarity and divergence between the chromatic distributions of the irises in both eyes. Cross-correlation between color feature vectors exhibited low detection capability, whereas a relative measure based on the Kullback-Leibler divergence performed well in revealing color differences between the irises. No single color space was identified as the most appropriate for evidencing color differences in all the scrutinized cases. The results, obtained on a dataset of eight patients, are promising and can be considered a proof of concept whose analysis must be extended with a larger database. From a practical viewpoint, this characterization could help to discriminate patients who attend the neurology department suffering from headache.
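The Kullback-Leibler measure used to compare the chromatic distributions of the two irises can be sketched as follows. This is a minimal illustration over normalised histograms; the symmetrised variant and the smoothing constant are assumptions for robustness, not details taken from the paper:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two histograms;
    eps avoids log(0) on empty bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q):
    """Symmetrised divergence, convenient when neither iris is a
    natural reference distribution."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

A large divergence between the left-iris and right-iris color histograms would then flag a candidate chromatic asymmetry.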

Short Papers
Paper Nr: 12
Title:

Client-driven Lightweight Method to Generate Artistic Media for Feature-length Sports Videos

Authors:

Ghulam Mujtaba, Jaehyuk Choi and Eun-Seok Ryu

Abstract: This paper proposes a lightweight methodology to attract users and increase video views through personalized artistic media, i.e., static thumbnails and animated Graphics Interchange Format (GIF) images. The proposed method analyzes lightweight thumbnail containers (LTC) using the computational resources of the client device to recognize personalized events in feature-length sports videos. In addition, instead of processing the entire video, small video segments are used to generate the artistic media. This makes our approach more computationally efficient than existing methods that use the entire video. Further, the proposed method retrieves and uses thumbnail containers and video segments, which reduces the required transmission bandwidth as well as the amount of locally stored data used during artistic media generation. In experiments on the NVIDIA Jetson TX2, the computational complexity of our method was 3.78 times lower than that of the state-of-the-art method. To the best of our knowledge, this is the first technique that uses LTC to generate artistic media while providing lightweight, high-performance services on resource-constrained devices.

Paper Nr: 8
Title:

Improvement of Privacy Prevented Person Tracking System using Artificial Fiber Pattern

Authors:

Hiroki Urakawa, Kitahiro Kaneda and Keiichi Iwamura

Abstract: Owing to their low equipment cost, the number of installed surveillance cameras has increased significantly; however, most of them are not used effectively. These cameras could serve various purposes, such as marketing, if behavior tracking were possible from the obtained images. Previously, we proposed a tracking method that embeds information in an "Artificial Fiber Pattern." However, the wearer's body shape and wrinkles in the clothes affect the accuracy of the results. To overcome this drawback, in this study we combined PIFuHD, a technology that generates a full three-dimensional model from a single image of a person, with modeling and calculation of the subject's body shape, in order to verify the conditions under which the wearer's body shape and clothing wrinkles affect accuracy. Consequently, we improved precision by removing data that met these unsuitable conditions.