
AsyNCE

Authors:
Yinghui Xu
Yanhao Zhang
Chunhong Pan
Yun Zheng
Pan Pan
Cheng Da
Source:
ACM Multimedia
Publication Year:
2021
Publisher:
ACM, 2021.

Abstract

Weakly-supervised video grounding aims to ground textual phrases in video content when only video-sentence pairs are available during training, since bounding-box annotations are prohibitively costly. Existing methods cast this task as a frame-level multiple instance learning (MIL) problem trained with a ranking loss. However, an object may appear only sparsely across frames, producing uncertain false-positive frames, so directly averaging the loss over all frames is inadequate in the video domain. Moreover, positive and negative pairs are coupled equally in the ranking loss, making it impossible to handle false-positive frames individually. Additionally, a naive inner product is a suboptimal similarity measure across modalities. To address these issues, we propose a novel AsyNCE loss that flexibly disentangles positive pairs from negative ones in frame-level MIL, which effectively mitigates the uncertainty of false-positive frames. In addition, a cross-modal transformer block is introduced to refine the text feature with frame-level visual context, producing a visually-guided text feature for a better similarity measure. Extensive experiments on the YouCook2, RoboWatch and WAB datasets demonstrate the superiority and robustness of our method over state-of-the-art methods.
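The abstract names two components without giving their formulas. As an illustration only, the following PyTorch sketch shows one way a frame-level NCE-style loss can decouple the positive term from the negatives and weight uncertain frames individually rather than averaging uniformly; the tensor shapes, the `frame_weights` input, and the weighting scheme are assumptions for the sketch, not the authors' actual AsyNCE formulation.

```python
import torch

def weighted_frame_nce_loss(pos_sim, neg_sim, frame_weights, tau=0.07):
    # pos_sim:       (T,)   similarity of each frame to its paired phrase
    # neg_sim:       (T, K) similarities of each frame to K negative phrases
    # frame_weights: (T,)   per-frame confidence in [0, 1]; low values mute
    #                       likely false-positive frames (illustrative input)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1) / tau  # (T, 1+K)
    # Standard NCE term per frame: negative log-softmax at the positive (index 0).
    per_frame = -torch.log_softmax(logits, dim=1)[:, 0]               # (T,)
    # Decoupling: each frame's positive contribution is scaled individually
    # instead of being tied to the negatives in a shared ranking margin.
    return (frame_weights * per_frame).sum() / frame_weights.sum().clamp_min(1e-6)
```

Likewise, a cross-attention block in which text tokens query frame features is one plausible reading of the visually-guided text feature described above; the layer sizes and residual structure here are illustrative assumptions rather than the paper's architecture.

```python
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Text tokens attend over frame features (queries = text,
    keys/values = frames) to yield visually-guided text embeddings."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text, frames):
        # text: (B, L, D) phrase tokens; frames: (B, T, D) frame features
        attended, _ = self.attn(query=text, key=frames, value=frames)
        x = self.norm1(text + attended)
        return self.norm2(x + self.ffn(x))
```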

Details

Database:
OpenAIRE
Journal:
Proceedings of the 29th ACM International Conference on Multimedia
Accession number:
edsair.doi...........e3ff6d3b22f99825ea2505b3ef16edcf