
A joint model for action localization and classification in untrimmed video with visual attention

Authors:
Ge Li
Wenmin Wang
Xiongtao Chen
Jinzhuo Wang
Weimian Li
Source:
ICME
Publication Year:
2017
Publisher:
IEEE, 2017.

Abstract

In this paper, we introduce a joint model that learns to directly localize the temporal bounds of actions in untrimmed videos and to precisely classify which actions occur. Most existing approaches scan the whole video to generate action instances, which is highly inefficient. Instead, inspired by human perception, our model is built on a recurrent neural network that observes different locations within a video over time. It produces temporal localizations after observing only a fixed number of fragments, so the amount of computation it performs is independent of the input video's length. The decision policy that determines where to look next is learned with REINFORCE, which is effective in non-differentiable settings. In addition, unlike related approaches, our model runs localization and classification serially and includes a strategy for extracting appropriate features for classification. We evaluate our model on the ActivityNet dataset, where it greatly outperforms the baseline. Moreover, compared with a recent approach, our serial design yields about a 9% increase in detection performance.
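The recurrent-attention loop the abstract describes can be made concrete with a small sketch. The snippet below is a minimal, hypothetical PyTorch implementation, not the authors' code: the module names, feature dimensions, Gaussian location policy, and IoU-style reward are all assumptions introduced for illustration. It shows the two ideas the abstract relies on: a fixed number of glimpses keeps the per-video cost independent of video length, and REINFORCE (the score-function estimator) trains the non-differentiable "where to look next" decision.

```python
import torch
import torch.nn as nn

class RecurrentGlimpseLocalizer(nn.Module):
    """Hypothetical sketch of a recurrent attention model that observes a
    fixed number of video fragments, then emits temporal bounds and a class.
    All names and sizes here are illustrative assumptions, not the paper's."""

    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=200,
                 num_glimpses=6, loc_std=0.1):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.loc_std = loc_std
        self.rnn = nn.GRUCell(feat_dim + 1, hidden_dim)  # fragment feature + its location
        self.loc_head = nn.Linear(hidden_dim, 1)         # mean of the Gaussian "look next" policy
        self.bounds_head = nn.Linear(hidden_dim, 2)      # (start, end), normalized to [0, 1]
        self.cls_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim) pre-extracted per-frame features
        b, n, _ = frames.shape
        h = frames.new_zeros(b, self.rnn.hidden_size)
        loc = frames.new_full((b, 1), 0.5)               # start by looking at the middle
        log_probs = []
        for _ in range(self.num_glimpses):               # fixed glimpse budget: cost does
            idx = (loc.squeeze(1).clamp(0, 1) * (n - 1)).long()  # not grow with video length
            frag = frames[torch.arange(b, device=frames.device), idx]  # observe one fragment
            h = self.rnn(torch.cat([frag, loc], dim=1), h)
            # Sample the next location from a Gaussian policy; sampling is
            # non-differentiable, so its log-prob is kept for REINFORCE.
            dist = torch.distributions.Normal(torch.sigmoid(self.loc_head(h)), self.loc_std)
            loc = dist.sample()
            log_probs.append(dist.log_prob(loc).squeeze(1))
        bounds = torch.sigmoid(self.bounds_head(h))      # predicted (start, end)
        logits = self.cls_head(h)                        # classification over the localized span
        return bounds, logits, torch.stack(log_probs, dim=1)

def reinforce_loss(log_probs, reward):
    # reward: (batch,), e.g. temporal IoU of predicted bounds with ground truth
    # (an assumed reward choice). The score-function estimator handles the
    # non-differentiable location sampling.
    return -(log_probs.sum(dim=1) * reward).mean()
```

Because only `num_glimpses` fragments are ever read, the forward cost is constant in the number of input frames, which matches the abstract's claim that computation is independent of video size; the localization/classification heads here share one hidden state, whereas the paper's serial design would run classification on features extracted from the predicted span.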

Details

Database:
OpenAIRE
Journal:
2017 IEEE International Conference on Multimedia and Expo (ICME)
Accession number:
edsair.doi...........ac1a24f8a5220ad1b1942ba76535bb33