
Seeing and hearing egocentric actions: How much can we learn?

Authors :
Cartas, Alejandro
Luque, Jordi
Radeva, Petia
Segura, Carlos
Dimiccoli, Mariella
Sponsors :
Ministerio de Ciencia, Innovación y Universidades (España)
Agencia Estatal de Investigación (España)
Fundació La Marató de TV3
Generalitat de Catalunya
Consejo Nacional de Ciencia y Tecnología (México)
NVIDIA Corporation
Publication Year :
2019

Abstract

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality, and only a limited number of works have integrated the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
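The abstract describes the approach only at a high level, so the following is a minimal sketch of what "sparse temporal sampling with late fusion of audio, spatial, and temporal streams" could look like in PyTorch. It assumes pre-extracted snippet-level features and small stand-in stream networks (StreamNet, the feature dimensions, and the segment count are hypothetical illustrations, not the architecture or hyperparameters used in the paper).

```python
import torch
import torch.nn as nn


class StreamNet(nn.Module):
    """Stand-in per-modality network: maps one sampled snippet's feature to class scores."""

    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.net(x)


class LateFusionModel(nn.Module):
    """Sparse temporal sampling + late fusion of audio, spatial, and temporal streams."""

    def __init__(self, feature_dims, num_classes, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        self.streams = nn.ModuleDict(
            {name: StreamNet(dim, num_classes) for name, dim in feature_dims.items()}
        )

    def forward(self, inputs):
        # inputs[name]: (batch, num_segments, feat_dim), one feature per sparsely sampled snippet.
        fused = 0.0
        for name, x in inputs.items():
            b, s, d = x.shape
            scores = self.streams[name](x.reshape(b * s, d)).reshape(b, s, -1)
            fused = fused + scores.mean(dim=1)  # consensus over sampled snippets
        return fused / len(inputs)  # late fusion: average the per-stream class scores


# Toy usage with random tensors standing in for snippet-level descriptors.
dims = {"audio": 128, "spatial": 512, "temporal": 512}
model = LateFusionModel(dims, num_classes=125, num_segments=3)
batch = {name: torch.randn(2, 3, dim) for name, dim in dims.items()}
print(model(batch).shape)  # torch.Size([2, 125])
```

Averaging per-stream scores is only one plausible reading of "late fusion"; the paper may weight or learn the combination differently.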

Details

Database :
OAIster
Publication Type :
Electronic Resource
Accession number :
edsoai.on1286543451
Document Type :
Electronic Resource