1. Fuzzy Multimodal Graph Reasoning for Human-Centric Instructional Video Grounding
- Author
Li, Yujie; Jiang, Xun; Xu, Xing; Lu, Huimin; and Shen, Heng Tao
- Abstract
Human-centric instructional videos provide opportunities for users to learn real-world multistep tasks, such as cooking, applying makeup, and using professional tools. However, these lengthy videos often lead to a tedious learning experience, making it challenging for learners to locate specific guidance efficiently. In this article, we present a novel approach, named fuzzy multimodal graph reasoning (FMGR), to extract target events from long untrimmed human-centric instructional videos using natural language. Specifically, we devise fuzzy multimodal graph learning layers, which encompass: 1) contextual graph reasoning, which transforms individual features into contextualized features; 2) a cross-modal relation fuzzifier, which models the fine-grained matching relationships between the two modalities; and 3) fuzzy graph reasoning, which conducts message passing among cross-modal matching node pairs. In particular, we integrate fuzzy theory into the cross-modal relation fuzzifier to amplify potential matching pairs while simultaneously mitigating interference from ambiguous matches. To validate our method, we conducted evaluations on two human-centric instructional video datasets, i.e., MedVidQA and YouMakeUp. Moreover, we further analyzed the impact of interrogative versus declarative queries. Extensive experimental results and analysis demonstrate the effectiveness of our proposed FMGR method.
- Published
- 2024
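- Note
To make the three stages named in the abstract concrete, below is a minimal, illustrative PyTorch sketch of one fuzzy multimodal graph learning layer. Everything in it is an assumption for illustration: the class and parameter names, the use of intra-modal self-attention for contextual graph reasoning, and the Gaussian membership function in the fuzzifier are not taken from the paper, whose actual formulation may differ.

```python
# Illustrative sketch only -- module names, dimensions, and the Gaussian
# membership function are assumptions, not the paper's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FuzzyMultimodalGraphLayer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # (1) Contextual graph reasoning: intra-modal self-attention turns
        #     individual clip/word features into contextualized features.
        self.video_ctx = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query_ctx = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # (2) Cross-modal relation fuzzifier: learnable width of an assumed
        #     Gaussian membership function over pairwise similarities.
        self.sigma = nn.Parameter(torch.tensor(1.0))
        # (3) Fuzzy graph reasoning: projection applied to messages passed
        #     between cross-modal matching node pairs.
        self.msg_proj = nn.Linear(dim, dim)

    def forward(self, video: torch.Tensor, query: torch.Tensor):
        # video: (B, n_v, dim) clip features; query: (B, n_q, dim) word features
        v_ctx, _ = self.video_ctx(video, video, video)  # contextualized clips
        q_ctx, _ = self.query_ctx(query, query, query)  # contextualized words

        # Pairwise cosine similarity between every clip node and word node.
        sim = torch.einsum('bvd,bqd->bvq',
                           F.normalize(v_ctx, dim=-1),
                           F.normalize(q_ctx, dim=-1))

        # Fuzzy membership: a Gaussian over (1 - sim) keeps strong matches
        # near 1 and pushes ambiguous ones toward 0, amplifying likely pairs
        # while damping interference from uncertain ones.
        membership = torch.exp(-((1.0 - sim) ** 2) / (2 * self.sigma ** 2))

        # Message passing: each clip aggregates word messages weighted by
        # row-normalized memberships, and each word aggregates clip messages
        # weighted by column-normalized memberships.
        w_v2q = membership / membership.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        w_q2v = membership / membership.sum(dim=-2, keepdim=True).clamp_min(1e-6)
        v_out = v_ctx + self.msg_proj(torch.einsum('bvq,bqd->bvd', w_v2q, q_ctx))
        q_out = q_ctx + self.msg_proj(torch.einsum('bvq,bvd->bqd', w_q2v, v_ctx))
        return v_out, q_out
```

A grounding head (e.g., predicting start/end boundaries from the refined clip features) would sit on top of stacked layers like this one; the sketch stops at the cross-modal feature refinement the abstract describes.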