First- and Third-Person Video Co-Analysis by Learning Spatial-Temporal Joint Attention
- Author
Huangyue Yu, Yunfei Liu, Feng Lu, and Minjie Cai
- Subjects
Joint attention, Computer science, Applied Mathematics, Computational Theory and Mathematics, Artificial Intelligence, Human–computer interaction, Computer Vision and Pattern Recognition, Feature learning, Software, Wearable technology
- Abstract
Recent years have witnessed a tremendous increase in first-person videos captured by wearable devices. Such videos record information from a different perspective than the traditional third-person view, and thus offer a wide range of potential uses. However, techniques for analyzing videos from the two views can be fundamentally different, let alone techniques for co-analyzing both views to exploit their shared information. In this paper, we take on the challenge of cross-view video co-analysis and deliver a novel learning-based method. At the core of our method is the notion of "joint attention": the shared attention regions that link the corresponding views and ultimately guide shared representation learning across views. To this end, we propose a multi-branch deep network that extracts cross-view joint attention and a shared representation from static frames with spatial constraints, in a self-supervised and simultaneous manner. In addition, by incorporating a temporal transition model of the joint attention, we obtain spatial-temporal joint attention that robustly captures the essential information extending through time. Our method outperforms the state of the art on standard cross-view video matching tasks on public datasets. Furthermore, we demonstrate how the learnt joint information benefits various applications through a set of qualitative and quantitative experiments.
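
The abstract describes a multi-branch network that predicts a per-view attention map and a shared embedding, trained so that corresponding first- and third-person frames match, plus a temporal model of the attention. The sketch below is a minimal PyTorch illustration of that general idea only, not the authors' architecture: the module names (ViewBranch, CrossViewNet), the contrastive-style matching loss, and the exponential-moving-average stand-in for the learned temporal transition are all assumptions made for illustration.

```python
# Minimal sketch of two-branch cross-view joint attention (assumed design,
# not the paper's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewBranch(nn.Module):
    """One view's branch: a small conv encoder plus a 1-channel attention head."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.attn_head = nn.Conv2d(feat_ch, 1, 1)  # predicts a spatial attention map

    def forward(self, x):
        feat = self.encoder(x)                      # (B, C, H, W)
        attn = torch.sigmoid(self.attn_head(feat))  # (B, 1, H, W)
        # Attention-weighted pooling gives the view's shared-representation vector.
        pooled = (feat * attn).flatten(2).sum(-1) / (attn.flatten(2).sum(-1) + 1e-6)
        return pooled, attn

class CrossViewNet(nn.Module):
    """First- and third-person branches projected into one embedding space."""
    def __init__(self, feat_ch=64, embed_dim=128):
        super().__init__()
        self.ego = ViewBranch(feat_ch=feat_ch)     # first-person branch
        self.exo = ViewBranch(feat_ch=feat_ch)     # third-person branch
        self.proj = nn.Linear(feat_ch, embed_dim)  # shared projection head

    def forward(self, ego_frame, exo_frame):
        ego_feat, ego_attn = self.ego(ego_frame)
        exo_feat, exo_attn = self.exo(exo_frame)
        z1 = F.normalize(self.proj(ego_feat), dim=-1)
        z2 = F.normalize(self.proj(exo_feat), dim=-1)
        return z1, z2, ego_attn, exo_attn

def matching_loss(z1, z2, margin=0.5):
    """Contrastive-style objective: matched frame pairs embed nearby; pairs
    mismatched by shuffling within the batch stay below `margin` similarity."""
    pos = (1 - (z1 * z2).sum(-1)).mean()  # pull matched pairs together
    neg = F.relu((z1 * z2.roll(1, dims=0)).sum(-1) - margin).mean()  # push shuffled apart
    return pos + neg

def temporal_attention(attn_seq, alpha=0.7):
    """Toy stand-in for the learned temporal transition model: propagate the
    attention map through time with an exponential moving average."""
    out, prev = [], attn_seq[0]
    for attn in attn_seq:
        prev = alpha * prev + (1 - alpha) * attn
        out.append(prev)
    return out
```

Training such a sketch self-supervisedly would only require temporally aligned first- and third-person clips: corresponding frames form positive pairs, and in-batch shuffling supplies negatives, so no attention annotations are needed.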
- Published
2023