Title-based video summarization using attention networks
- Author
- Li, Changwei
- Subjects
- Electrical Engineering, Supervised video summarization, Key-frame extraction, Text-visual cross-attention, Key-shot extraction, Query-based summarization, Self-attention
- Abstract
The rapid advances in video storage, processing, and streaming services, improvements in cellular communication speed, enhancements in mobile phone cameras, and increased social media engagement have led to explosive growth in the number of videos generated every minute. Content-based video searching, browsing, and information retrieval technologies have therefore received significant attention in recent years as they adapt to this massive volume of video. Video summarization techniques are among the methodologies that help users browse videos quickly and retrieve information more efficiently, either by extracting key-frames/segments alone or by assembling the important segments into video skims, highlights, or summaries. In this research, the current video summarization pipeline, the collected datasets, and the related evaluation metrics are reviewed. Furthermore, several video summarization models that fuse video title and visual features using attention networks are proposed and evaluated on publicly available datasets:
1. A baseline video summarization model that exploits the correlation among the visual features of video frames via an attention network is studied. Its training procedure and evaluation metrics are compared against similar recent studies.
2. After extracting video title embeddings with pre-trained language models, various methodologies for integrating the title information into the baseline model are studied and evaluated. By reshaping self-attention into cross-attention, a model that exploits the correlation between the video title and the frame visual features is proposed. Since the correlation among visual frames in long sequences does not necessarily capture the video storyline, fusing the title information into the proposed model improved summarization performance as expected.
3. Finally, to further improve the performance of the proposed model, the loss function is modified to combine the accuracy of frame-level and segment-level score predictions. By optimizing this loss, the model predicts frame scores as accurately as possible while not deviating from the desired segment importance scores. The proposed model increased summarization performance (F1-score) by 1.1% on the TVSum dataset and 2.2% on the SumMe dataset.
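The text–visual cross-attention idea in item 2 can be illustrated with a minimal, self-contained sketch. This is not the thesis's actual architecture (which presumably uses learned projections, multiple heads, and deep networks); it is a single-head, projection-free NumPy illustration in which the title embedding acts as the query and frame features as keys/values, so that the attention weights over frames behave like title-conditioned frame importance scores. All names, dimensions, and data here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def title_frame_cross_attention(title_emb, frame_feats):
    """Illustrative cross-attention: the title embedding is the query,
    the per-frame visual features are the keys and values. The resulting
    attention distribution over frames serves as a proxy for frame
    importance scores; the context vector is a title-conditioned
    summary of the video's visual content."""
    d = title_emb.shape[-1]
    scores = frame_feats @ title_emb / np.sqrt(d)  # (n_frames,)
    weights = softmax(scores)                      # attention over frames
    context = weights @ frame_feats                # (d,) weighted sum
    return weights, context

rng = np.random.default_rng(0)
title = rng.normal(size=128)           # e.g. a sentence embedding of the title
frames = rng.normal(size=(300, 128))   # visual features for 300 frames
w, ctx = title_frame_cross_attention(title, frames)
```

In the self-attention baseline, each frame attends to every other frame; swapping the query for the title embedding, as above, is what ties the importance estimate to the video's stated topic rather than to visual redundancy alone.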
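The combined loss in item 3 can be sketched as a weighted sum of a frame-level term and a segment-level term. The exact formulation in the thesis is not given here, so the following is an assumption-laden illustration: squared error at both levels, segment scores taken as the mean of frame scores within each segment, and a hypothetical mixing weight `alpha`.

```python
import numpy as np

def combined_loss(pred, target, seg_bounds, alpha=0.5):
    """Hypothetical combined loss: frame-level MSE plus segment-level MSE.
    `seg_bounds` is a list of (start, end) frame index pairs; a segment's
    score is the mean of its frames' scores. `alpha` balances the two terms."""
    frame_loss = np.mean((pred - target) ** 2)
    seg_pred = np.array([pred[s:e].mean() for s, e in seg_bounds])
    seg_tgt = np.array([target[s:e].mean() for s, e in seg_bounds])
    seg_loss = np.mean((seg_pred - seg_tgt) ** 2)
    return alpha * frame_loss + (1 - alpha) * seg_loss

pred = np.array([0.1, 0.9, 0.2, 0.8])
target = np.array([0.0, 1.0, 0.0, 1.0])
loss = combined_loss(pred, target, seg_bounds=[(0, 2), (2, 4)])
```

The segment term penalizes predictions whose per-segment averages drift from the desired segment importance, which matters because key-shot selection (and hence the F1 evaluation) operates on segments rather than individual frames.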
- Published
- 2022