Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
- Source :
- ACM Multimedia
- Publication Year :
- 2021
- Publisher :
- ACM, 2021.
Abstract
- This paper describes our bronze-medal solution for the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We start from the Bottom-Up-Top-Down model and make technical improvements to both video content encoding and caption decoding. For encoding, we propose to extract multi-level video features that describe holistic scenes and fine-grained key objects, respectively. The scene-level and object-level features are enhanced separately by multi-head self-attention mechanisms before being fed into the decoding module. To generate content-relevant and human-like captions, we train our network end-to-end by semantic-reinforced learning. Finally, to select the best caption from those produced by distinct models, we perform caption reranking by cross-modal matching between a given video and each candidate caption. Both internal experiments on the MSR-VTT test set and external evaluations by the challenge organizers support the viability of the proposed solution.
- Subjects :
- Closed captioning
Computer science
business.industry
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
computer.software_genre
Task (project management)
Test set
Encoding (memory)
Key (cryptography)
Reinforcement learning
Artificial intelligence
Representation (mathematics)
business
computer
Decoding methods
Natural language processing
Details
- Database :
- OpenAIRE
- Journal :
- Proceedings of the 29th ACM International Conference on Multimedia
- Accession number :
- edsair.doi...........957a20eed898b370e65025023d81b754
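
The caption reranking step mentioned in the abstract (choosing one caption among candidates by cross-modal matching against the video) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the video and each candidate caption have already been mapped into a shared embedding space by some matching model, and it uses cosine similarity as a stand-in matching score. The names `cosine`, `rerank`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(video_vec, candidates):
    """Sort (caption, caption_vec) pairs by similarity to the video, best first.

    Assumption: video_vec and every caption_vec come from a shared
    cross-modal embedding space; how they are produced is out of scope here.
    """
    return sorted(candidates, key=lambda c: cosine(video_vec, c[1]), reverse=True)


# Toy, made-up embeddings purely for illustration.
video_vec = [1.0, 0.0, 1.0]
candidates = [
    ("a dog runs", [0.0, 1.0, 0.1]),
    ("a man is cooking", [0.9, 0.1, 0.8]),
]
best_caption = rerank(video_vec, candidates)[0][0]
```

In practice the candidates would be the captions produced by the distinct captioning models, and the top-ranked one would be kept as the final output.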