Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
- Publication Year :
- 2022
Abstract
- In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP along three axes: the pretraining dataset, the pretraining objective, and the development set. Based on these three designs, we develop a pretrained video-language model that transfers its egocentric video-text representation to the MIR benchmark. Furthermore, we devise an adaptive multi-instance max-margin loss to effectively fine-tune the model and adopt the dual-softmax technique for reliable inference. Our best single model obtains strong performance on the challenge test set, with 47.39% mAP and 61.44% nDCG. The code is available at https://github.com/showlab/EgoVLP.
- Comment: To appear in CVPRW22. 5 pages, 2 figures, 2 tables. Code: https://github.com/showlab/EgoVLP. The EPIC challenge technical report of EgoVLP arXiv:2206.01670. See the Ego4D challenge technical report arXiv:2207.01622.
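- Illustration: the abstract mentions a dual-softmax technique for inference. The sketch below is a minimal, hedged illustration of dual-softmax re-ranking of a text-to-video similarity matrix; the function name `dual_softmax_rerank`, the temperature value, and the exact weighting are assumptions for illustration and may differ from the actual EgoVLP implementation.

```python
import torch
import torch.nn.functional as F

def dual_softmax_rerank(sim: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    """Re-weight a [num_texts, num_videos] similarity matrix with a dual-softmax prior.

    A softmax over the text axis acts as a prior on how strongly each video
    "claims" a given text query; multiplying it back into the raw similarities
    sharpens the ranking. Temperature and weighting here are illustrative
    assumptions, not the paper's confirmed settings.
    """
    prior = F.softmax(sim * temperature, dim=0)  # normalize over text queries
    return sim * prior

# Toy usage: rank 4 candidate videos for 3 text queries.
sim = torch.randn(3, 4)                      # raw cosine similarities
reranked = dual_softmax_rerank(sim)
top_video_per_text = reranked.argmax(dim=1)  # best video index per query
```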
- Subjects :
- Computer Science - Computer Vision and Pattern Recognition
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2207.01334
- Document Type :
- Working Paper