Building semantic understanding beyond deep learning from sound and vision

Authors :: Guillermo Cámara-Chávez; Sudeep Sarkar; Fillipe D. M. de Souza
Source :: ICPR
Publication Year :: 2016
Publisher :: IEEE, 2016.
Abstract :: Deep learning-based models have recently outperformed traditional approaches in several computer vision applications, such as image classification, object recognition, and action recognition. However, those models are not naturally designed to learn structural information that can be important to tasks such as human pose estimation and structured semantic interpretation of video events. In this paper, we demonstrate how to build structured semantic understanding of audio-video events by reasoning on the multiple-label decisions of deep visual models and auditory models, using Grenander's structures to impose semantic consistency. The proposed structured model does not require joint training of the structural semantic dependencies and the deep models; instead, they are independent components linked by Grenander's structures. Furthermore, we exploit Grenander's structures to enrich the model through fusion of multimodal sensory data, in particular auditory features with visual features. Overall, we observed improvements in the quality of semantic interpretations when using deep models and auditory features in combination with Grenander's structures, with gains of up to 11.5% in precision and 12.3% in recall.