
Toward Long Form Audio-Visual Video Understanding.

Authors :
Hou, Wenxuan
Li, Guangyao
Tian, Yapeng
Hu, Di
Source :
ACM Transactions on Multimedia Computing, Communications & Applications; Sep 2024, Vol. 20, Issue 9, p1-26, 26p
Publication Year :
2024

Abstract

We live in a world filled with never-ending streams of multimodal information. As a more natural recording of real scenarios, long form audio-visual videos (LFAVs) are expected to serve as an important bridge for better exploring and understanding the world. In this article, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale LFAV dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diversified modality-aware events in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases at different levels: a snippet prediction phase to learn snippet features, an event extraction phase to extract event-level features, and an event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of LFAV understanding. Project page: https://gewu-lab.github.io/LFAV/. [ABSTRACT FROM AUTHOR]

Subjects

Subjects :
VIDEOS
HOPE
FORECASTING

Details

Language :
English
ISSN :
1551-6857
Volume :
20
Issue :
9
Database :
Complementary Index
Journal :
ACM Transactions on Multimedia Computing, Communications & Applications
Publication Type :
Academic Journal
Accession number :
179790512
Full Text :
https://doi.org/10.1145/3672079