
G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

Authors :
Li, Hongxiang
Cao, Meng
Cheng, Xuxin
Li, Yaowei
Zhu, Zhihong
Zou, Yuexian
Publication Year :
2023

Abstract

Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of the features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Two troublesome issues arise in video grounding: (1) some visual entities co-exist in both the ground-truth moment and other moments, i.e., semantic overlapping; and (2) only a few moments in a video are annotated, i.e., the sparse annotation dilemma. As a result, vanilla contrastive learning cannot model the correlations between temporally distant moments and learns inconsistent video representations. These two characteristics make vanilla contrastive learning unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework based on geodesic distance and game theory. We quantify the correlations among moments using the geodesic distance, which guides the model to learn correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose a semantic Shapley interaction based on geodesic-distance sampling to learn fine-grained semantic alignment within similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.

Comment: ICCV 2023 oral; code released.
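The alignment and uniformity properties mentioned above follow the standard contrastive-learning formulation of Wang & Isola (2020), and the abstract further proposes weighting moment correlations by geodesic distance. The PyTorch sketch below is a rough illustration only, not the paper's actual G2L implementation: the function names, the arc-length distance on the unit hypersphere, and the geodesic-weighted InfoNCE variant are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def alignment_loss(x, y, alpha=2):
    # Alignment: matched (query, moment) pairs should map to nearby points
    # on the unit hypersphere. x, y: (N, D) L2-normalized features.
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity_loss(x, t=2):
    # Uniformity: normalized features should spread over the hypersphere,
    # measured by the log mean Gaussian potential of pairwise distances.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

def geodesic_distance(x, y):
    # Arc length between L2-normalized features on the unit hypersphere;
    # one plausible way to quantify how semantically close two moments are.
    cos = (x * y).sum(dim=1).clamp(-1 + 1e-7, 1 - 1e-7)
    return torch.acos(cos)

def geodesic_weighted_nce(query, moments, gt_idx, tau=0.07):
    # Hypothetical InfoNCE variant: each negative moment is down-weighted by
    # its geodesic closeness to the ground-truth moment, so semantically
    # overlapping moments are not pushed away as hard as distant ones.
    sims = (query @ moments.t()).squeeze(0) / tau            # (M,) similarities
    gt = moments[gt_idx].unsqueeze(0).expand_as(moments)
    w = geodesic_distance(moments, gt) / math.pi             # in [0, 1]; 0 at the GT moment
    w[gt_idx] = 1.0                                          # keep the positive at full weight
    exp_sims = torch.exp(sims) * w
    return -torch.log(exp_sims[gt_idx] / exp_sims.sum())

# Toy usage with random features (D = 256, M = 8 candidate moments).
q = F.normalize(torch.randn(1, 256), dim=1)
m = F.normalize(torch.randn(8, 256), dim=1)
loss = alignment_loss(q.expand(8, -1), m) + uniformity_loss(m) \
       + geodesic_weighted_nce(q, m, gt_idx=3)
```

The geodesic weighting here only sketches the intuition of treating moments near the annotated segment as partially positive; the paper's actual formulation, and its Shapley-interaction term for fine-grained alignment, are described in the full text rather than in this abstract.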

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2307.14277
Document Type :
Working Paper