An encoder-decoder model for video captioning using RESNET and GRU.

Authors :: Preethi, A.; Dhanalakshmi, P.
Source :: AIP Conference Proceedings. 2023, Vol. 2917 Issue 1, p1-10. 10p.
Publication Year :: 2023
Abstract: Video captioning generates sentences that describe the visual content of a video. It is essential for video retrieval and analysis. Unlike still images, the frames of a video are temporally connected, so visual, temporal and grammatical information must all be considered when generating captions. This is done with an encoder-decoder architecture. In the encoder module, ResNet-152 is used as a feature extractor to obtain features from the video frames. In the decoder module, LSTM and GRU networks are employed to generate the sentences. The architecture is trained and tested on the benchmark Microsoft Video Description Corpus (MSVD) dataset, and performance is evaluated using BLEU, METEOR and CIDEr.
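
The encoder-decoder flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the per-frame ResNet-152 features have already been extracted (random vectors stand in for them here, at a toy size of 8 rather than the 2048 dimensions ResNet-152 normally produces), that frames are combined by mean pooling, and that captions are decoded greedily. The abstract confirms none of these choices. All names (`mean_pool`, `GRUCell`, `greedy_caption`, `VOCAB`) are hypothetical, the weights are untrained, and bias terms are omitted for brevity, so the printed caption is meaningless; only the data flow is the point.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.5):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def mean_pool(frame_features):
    """Assumption: collapse per-frame feature vectors into one video vector."""
    return [sum(col) / len(frame_features) for col in zip(*frame_features)]

class GRUCell:
    """Standard GRU update (biases omitted), using the convention
    h_t = (1 - z) * h_{t-1} + z * h_candidate."""

    def __init__(self, input_size, hidden_size, rng):
        self.Wz, self.Wr, self.Wh = (rand_matrix(hidden_size, input_size, rng) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (rand_matrix(hidden_size, hidden_size, rng) for _ in range(3))

    def step(self, x, h):
        # Update gate z and reset gate r, each in (0, 1).
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state uses the reset-gated previous hidden state.
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, rh))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def greedy_caption(frame_features, cell, W_init, embed, W_out, vocab, max_len=10):
    """Initialise the decoder from the pooled video vector, then emit the
    highest-scoring word at each step until <eos> or max_len."""
    h = [math.tanh(v) for v in matvec(W_init, mean_pool(frame_features))]
    token = vocab.index("<bos>")
    words = []
    for _ in range(max_len):
        h = cell.step(embed[token], h)
        logits = matvec(W_out, h)
        token = max(range(len(vocab)), key=logits.__getitem__)
        if vocab[token] == "<eos>":
            break
        words.append(vocab[token])
    return words

rng = random.Random(0)
FEAT, EMB, HID = 8, 4, 6  # toy sizes chosen for illustration only
VOCAB = ["<bos>", "<eos>", "a", "man", "is", "playing", "guitar"]  # hypothetical vocabulary

frames = [[rng.random() for _ in range(FEAT)] for _ in range(5)]  # stand-in for frame features
cell = GRUCell(EMB, HID, rng)
W_init = rand_matrix(HID, FEAT, rng)
embed = rand_matrix(len(VOCAB), EMB, rng)
W_out = rand_matrix(len(VOCAB), HID, rng)

caption = greedy_caption(frames, cell, W_init, embed, W_out, VOCAB)
print(" ".join(caption))
```

In practice the random stand-ins would be replaced by features from a pretrained ResNet-152 and the decoder weights would be trained on caption data such as MSVD. Swapping the GRU cell for an LSTM cell changes only the `step` function and the state it carries.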