Seeing and hearing egocentric actions: How much can we learn?
- Publication Year :
- 2019
Abstract
- Our interaction with the world is an inherently multi-modal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
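- The abstract describes late fusion of audio, spatial, and temporal streams. As a minimal sketch of that fusion scheme (the per-stream heads, feature dimension, class count, and equal-weight score averaging below are illustrative assumptions, not the paper's exact architecture):

```python
# Hypothetical sketch: late fusion of three modality streams by averaging
# per-stream class scores. In the paper each stream would be a full network
# operating on sparsely sampled snippets; here each is a simple linear head.
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # One classification head per modality (audio, spatial/RGB, temporal/flow).
        self.audio_head = nn.Linear(feat_dim, num_classes)
        self.spatial_head = nn.Linear(feat_dim, num_classes)
        self.temporal_head = nn.Linear(feat_dim, num_classes)

    def forward(self, audio_feat, spatial_feat, temporal_feat):
        # Late fusion: each stream is classified independently, then the
        # class scores are combined (equal-weight average assumed here).
        return (
            self.audio_head(audio_feat)
            + self.spatial_head(spatial_feat)
            + self.temporal_head(temporal_feat)
        ) / 3.0


# Usage with dummy per-stream features for a batch of 4 clips
# (feature size and number of verb classes are placeholder values).
model = LateFusionClassifier(feat_dim=1024, num_classes=32)
scores = model(torch.randn(4, 1024), torch.randn(4, 1024), torch.randn(4, 1024))
print(scores.shape)  # torch.Size([4, 32])
```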
Details
- Database :
- OAIster
- Publication Type :
- Electronic Resource
- Accession number :
- edsoai.on1286543451
- Document Type :
- Electronic Resource