
Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network

Authors :
Pan Zhou
Xiaoye Qu
Jianfeng Dong
Daizong Liu
Source :
COLING
Publication Year :
2020
Publisher :
International Committee on Computational Linguistics, 2020.

Abstract

Temporal sentence localization in videos aims to ground the best-matched segment in an untrimmed video according to a given sentence query. Previous works in this field mainly rely on attentional frameworks to align the temporal boundaries via a soft selection. Although they focus on the visual content relevant to the query, such single-step attention is insufficient to model complex video contents and falls short of the higher-level reasoning this task demands. In this paper, we propose a novel deep rectification-modulation network (RMN), which transforms this task into a multi-step reasoning process by repeating rectification and modulation. In each rectification-modulation layer, unlike existing methods that directly conduct cross-modal interaction, we first devise a rectification module to correct implicit attention misalignment, where attention focuses on wrong positions during the cross-modal interaction process. Then, a modulation module is developed to capture frame-to-frame relations with the help of sentence information, so as to better correlate and compose the video contents over time. With multiple such layers cascaded in depth, our RMN progressively refines the video and query interactions, thus enabling more precise localization. Experimental evaluations on three public datasets show that the proposed method achieves state-of-the-art performance. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method.
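
To make the cascaded rectification-modulation idea in the abstract concrete, the following is a minimal PyTorch-style sketch of how such layers could be stacked: a rectification step that gates cross-modal attention against the pooled sentence context, followed by a modulation step in which sentence information scales the frames before frame-to-frame self-attention. All module names, dimensions, and the exact attention and gating formulas here are illustrative assumptions, not the authors' released implementation.

# Illustrative sketch only; the concrete attention, gating, and modulation
# formulas are assumptions and not taken from the paper's implementation.
import torch
import torch.nn as nn


class RectificationModulationLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Rectification module (assumed form): cross-modal attention from
        # frames to query words, plus a gate that down-weights attended
        # features that disagree with the pooled sentence context.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.rectify_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Modulation module (assumed form): sentence-conditioned scaling of
        # frames, then frame-to-frame self-attention over time.
        self.modulate = nn.Linear(dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (B, T, dim) frame features; query: (B, L, dim) word features.
        # Rectification: attend to the query, then gate the attended features
        # against the pooled sentence vector to correct misaligned attention.
        attended, _ = self.cross_attn(video, query, query)
        sentence = query.mean(dim=1, keepdim=True).expand_as(attended)
        gate = self.rectify_gate(torch.cat([attended, sentence], dim=-1))
        video = self.norm1(video + gate * attended)
        # Modulation: scale frames with sentence information, then let frames
        # interact with each other to compose video content over time.
        modulated = video * torch.sigmoid(self.modulate(sentence))
        related, _ = self.self_attn(modulated, modulated, modulated)
        return self.norm2(video + related)


class RMN(nn.Module):
    """Cascade of rectification-modulation layers (multi-step reasoning)."""

    def __init__(self, dim: int = 256, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            RectificationModulationLayer(dim) for _ in range(num_layers)
        )

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            video = layer(video, query)
        return video  # refined frame features, ready for boundary prediction


if __name__ == "__main__":
    frames = torch.randn(2, 64, 256)   # 2 videos, 64 frames, 256-d features
    words = torch.randn(2, 12, 256)    # 2 queries, 12 words, 256-d features
    print(RMN()(frames, words).shape)  # torch.Size([2, 64, 256])

In this sketch, the repeated application of the layer stack stands in for the paper's multi-step reasoning: each pass re-corrects the cross-modal attention and re-composes the frame features, so the boundary predictor (not shown) operates on progressively refined video-query interactions.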

Details

Database :
OpenAIRE
Journal :
Proceedings of the 28th International Conference on Computational Linguistics
Accession number :
edsair.doi...........924cce67a10a3401f7ff771a5b18a09a
Full Text :
https://doi.org/10.18653/v1/2020.coling-main.167