1. Video Object Detection Using Object's Motion Context and Spatio-Temporal Feature Aggregation
- Author
Junho Koh, Byeongwon Lee, Jun Won Choi, Jaekyum Kim, and Seungji Yang
- Subjects
business.industry ,Computer science ,Deep learning ,Frame (networking) ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Context (language use) ,02 engineering and technology ,010501 environmental sciences ,Object (computer science) ,01 natural sciences ,Object detection ,Feature (computer vision) ,Video tracking ,Encoding (memory) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Computer vision ,Artificial intelligence ,business ,0105 earth and related environmental sciences
- Abstract
Deep learning has recently led to significant improvements in object detection accuracy. In many applications, object detection is performed on video data consisting of a sequence of two-dimensional (2D) image frames. Numerous object detection schemes have been designed to detect objects independently in each video frame. Although temporal information within adjacent image frames can be exploited in a subsequent object tracking stage, it has been shown that object detection accuracy can be significantly improved by exploiting the temporal structure of the image sequence in the object detection stage itself. In this paper, we propose a novel video object detection method that exploits both the motion context inferred from adjacent frames and the spatio-temporal features aggregated over the image sequence. First, the correlation between the spatial feature maps of each pair of adjacent frames is computed, and an embedding vector representing the motion context is obtained by encoding the $N$ correlation maps using a long short-term memory (LSTM) network. In addition to utilizing the motion context, the spatial feature maps for $(N+1)$ consecutive frames are aggregated to boost the quality of the feature map. A gated attention network is employed to selectively combine the temporal feature maps based on their relevance to the feature map of the present image frame. While most video object detectors have been developed for two-stage object detectors, the proposed idea applies to one-stage detectors, with the advantage of low computational complexity in practical real-time applications. Our numerical evaluation on the ImageNet object detection from video (VID) dataset demonstrates that the proposed network achieves a significant performance gain over the baseline algorithms and outperforms existing one-stage video object detectors.
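The relevance-weighted aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' gated attention network, which is learned end to end. As a stand-in, it is assumed here that relevance is the cosine similarity between each frame's feature vector and the present frame's, normalized with a softmax. The function names, and the use of flat feature vectors instead of 2D feature maps, are hypothetical simplifications.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_features(frames):
    """Fuse N+1 frame features, weighting each by relevance to the last frame.

    Assumption: relevance is cosine similarity to the present (last) frame.
    The paper instead learns the gating weights with an attention network.
    """
    current = frames[-1]
    scores = [cosine_similarity(f, current) for f in frames]
    weights = softmax(scores)
    fused = [
        sum(w * f[i] for w, f in zip(weights, frames))
        for i in range(len(current))
    ]
    return fused, weights


# Three toy "frames"; the last one is the present frame.
frames = [
    [1.0, 0.0, 0.0],  # similar to the present frame
    [0.0, 1.0, 0.0],  # dissimilar to the present frame
    [0.9, 0.1, 0.0],  # present frame
]
fused, weights = aggregate_features(frames)
```

In this toy input the present frame receives the largest weight and the dissimilar middle frame the smallest, which mirrors the idea of combining temporal features selectively by their relevance to the present frame.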
- Published
- 2021