1. RaSTFormer: region-aware spatiotemporal transformer for visual homogenization recognition in short videos.
- Author
Zhang, Shuying, Zhang, Jing, Zhang, Hui, and Zhuo, Li
- Subjects
*TRANSFORMER models, *COMPUTER network traffic, *RECOGNITION (Psychology), *VIDEOS
- Abstract
With the surge in network traffic, the homogenization of short video content is becoming increasingly prominent, resulting in low-quality entertainment due to proliferation and infringement. Recognizing visual homogeneity in short videos is therefore of great significance. Because homogeneous videos show extremely similar dynamic evolution in specific regions, both within and across frames, region-aware attention is an attractive basis for homogenization recognition. We propose a region-aware spatiotemporal Transformer (RaSTFormer) for visual homogenization recognition, in which: (1) a region-aware Transformer encoder extracts multi-region frame-level features from short videos; (2) a multi-layer spatiotemporal Transformer decoder aggregates the multi-region frame-level features into shot-level spatiotemporal features; and (3) a symmetric shot-level chamfer similarity is measured to recognize visually homogeneous content. Specifically, we established a real-world video homogenization dataset, BJUT-HCD, and conducted extensive experiments. The proposed RaSTFormer achieved the highest mean average precision (mAP) of 98.13% and a top-1 accuracy of 99.37%, outperforming SOTA methods. The results show that our method achieves competitive performance in visual homogenization recognition in short videos. [ABSTRACT FROM AUTHOR]
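The abstract does not spell out how the symmetric chamfer similarity is computed. As a rough illustration only, the sketch below uses the definition common in video retrieval: for each feature vector in one set, take its best cosine match in the other set, average those maxima, and then average the two directions. The function names and the plain-list feature representation are assumptions made for this example and are not taken from the paper; all vectors are assumed to be non-zero.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def _directed_chamfer(a, b):
    """Mean, over vectors in a, of the best cosine match found in b."""
    best_matches = [
        max(sum(p * q for p, q in zip(u, w)) for w in b)
        for u in a
    ]
    return sum(best_matches) / len(a)


def symmetric_chamfer_similarity(x, y):
    """Average of the two directed chamfer similarities (hypothetical helper).

    x and y are lists of equal-dimension feature vectors, e.g. one vector
    per shot; the two lists may have different lengths.
    """
    x = [_normalize(v) for v in x]
    y = [_normalize(v) for v in y]
    return 0.5 * (_directed_chamfer(x, y) + _directed_chamfer(y, x))
```

Averaging both directions keeps the score independent of which video is treated as the query, which is why a one-sided chamfer score is usually symmetrized for pairwise comparison.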
- Published
- 2024