1. Learning Video Moment Retrieval Without a Single Annotated Video
- Author
- Changsheng Xu and Junyu Gao
- Subjects
- Matching (graph theory), Natural language user interface, Computer science, Task (project management), Moment (mathematics), Media Technology, Embedding, Artificial intelligence, Electrical and Electronic Engineering, Representation (mathematics), Sentence, Natural language processing, Generator (mathematics)
- Abstract
Video moment retrieval, which aims to locate the moment in a video that is most relevant to a given natural language query, has progressed significantly over the past few years. Most existing methods are trained in a fully-supervised or weakly-supervised manner, which requires a time-consuming and expensive manual labeling process. In this work, we propose an alternative approach to video moment retrieval that requires no textual annotations of videos, instead leveraging existing visual concept detectors and a pre-trained image-sentence embedding space. Specifically, we design a video-conditioned sentence generator that produces a suitable sentence representation from the visual concepts mined in videos. We then design a GNN-based relation-aware moment localizer that selects a portion of video clips under the guidance of the generated sentence. Finally, the pre-trained image-sentence embedding space is adopted to evaluate the matching scores between the generated sentence and moment representations, with knowledge transferred from the image domain. By maximizing these scores, the sentence generator and moment localizer enhance and complement each other to accomplish the moment retrieval task. Experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of our proposed method.
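
The training signal described in the abstract, maximizing matching scores between a generated sentence representation and a selected moment representation, can be illustrated with a minimal sketch. The module names (`ConceptSentenceGenerator`, `RelationAwareLocalizer`), feature dimensions, and the simple attention-based clip selection below are hypothetical stand-ins, not the paper's actual architecture; the cosine loss is only a proxy for scoring in the pre-trained image-sentence embedding space.

```python
# Illustrative sketch of an annotation-free training step (assumed shapes and modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 512  # assumed dimensionality of the image-sentence embedding space


class ConceptSentenceGenerator(nn.Module):
    """Maps mined visual-concept embeddings to a sentence-like representation."""
    def __init__(self, concept_dim=300, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(concept_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, EMB)

    def forward(self, concepts):             # concepts: (B, N_concepts, concept_dim)
        _, h = self.rnn(concepts)
        return self.proj(h.squeeze(0))        # (B, EMB) sentence representation


class RelationAwareLocalizer(nn.Module):
    """Scores video clips conditioned on the generated sentence; the paper's GNN-based
    relational reasoning is replaced here by plain dot-product attention."""
    def __init__(self, clip_dim=1024):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, EMB)

    def forward(self, clips, sentence):       # clips: (B, T, clip_dim), sentence: (B, EMB)
        clip_emb = self.clip_proj(clips)                          # (B, T, EMB)
        attn = torch.softmax((clip_emb * sentence.unsqueeze(1)).sum(-1), dim=-1)
        moment = (attn.unsqueeze(-1) * clip_emb).sum(1)           # (B, EMB) moment representation
        return moment, attn                    # attn approximates the selected clip span


def matching_loss(sentence, moment):
    """Proxy for the matching score: minimizing this maximizes cosine similarity."""
    return 1.0 - F.cosine_similarity(sentence, moment).mean()


# One illustrative training step with random tensors standing in for real features.
gen, loc = ConceptSentenceGenerator(), RelationAwareLocalizer()
opt = torch.optim.Adam(list(gen.parameters()) + list(loc.parameters()), lr=1e-4)
concepts, clips = torch.randn(2, 8, 300), torch.randn(2, 32, 1024)
sentence = gen(concepts)
moment, attn = loc(clips, sentence)
loss = matching_loss(sentence, moment)
opt.zero_grad(); loss.backward(); opt.step()
```

In this toy loop the generator and localizer are optimized jointly against the same matching objective, mirroring how the two components are said to enhance and complement each other; in practice the scoring would be performed inside the frozen, pre-trained image-sentence embedding space rather than the learned projections used here.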
- Published
- 2022