Tube ConvNets: Better exploiting motion for action recognition

Authors :: Wenmin Wang
Jinzhuo Wang
Zhihao Li
Nannan Li
Source :: ICIP
Publication Year :: 2016
Publisher :: IEEE, 2016.
Abstract: Motion information is a key factor for action recognition and has been pursued for decades. How to effectively learn motion features in Convolutional Networks (ConvNets) remains an open issue. Prevalent ConvNets often take several full video frames as input at a time, which can be a heavy burden for network training. In this paper, we introduce a novel framework called Tube ConvNets, which substitutes action tubes for full frames to reduce this burden. Tube ConvNets focus on the regions of interest (ROIs) where key motions occur, and thus eliminate the distraction of irrelevant objects. Each action tube is a portion of the spatiotemporal volume, generated using object detection and a clustering algorithm. We demonstrate the effectiveness of Tube ConvNets for action classification on the UCF-101 dataset, and illustrate their potential to support fine-grained localization on the UCF-Sports dataset. Source code is available at https://github.com/wangjinzhuo/tubecnn.
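
To make the idea of an action tube concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-frame bounding boxes are already available from some object detector, links them across frames with a simple greedy IoU rule (standing in for the clustering step, which the abstract does not specify), and crops the video to the resulting region. The function names `iou`, `link_tube`, and `crop_tube` and the `(x1, y1, x2, y2)` box format are hypothetical.

```python
import numpy as np


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tube(detections, start_box):
    """Greedily pick one box per frame: the one overlapping most with the previous box.

    detections: list over frames, each a list of candidate boxes (assumed detector output).
    start_box: the box chosen in the first frame.
    """
    tube = [start_box]
    for boxes in detections[1:]:
        if not boxes:
            # No detection in this frame: carry the previous box forward.
            tube.append(tube[-1])
            continue
        tube.append(max(boxes, key=lambda b: iou(tube[-1], b)))
    return tube


def crop_tube(video, tube):
    """Crop a video of shape (T, H, W, C) to the union of all boxes in the tube."""
    x1 = min(b[0] for b in tube)
    y1 = min(b[1] for b in tube)
    x2 = max(b[2] for b in tube)
    y2 = max(b[3] for b in tube)
    return video[:, y1:y2, x1:x2]


# Toy example: 3 frames of 100x100 RGB video with hand-written detections.
video = np.zeros((3, 100, 100, 3), dtype=np.uint8)
detections = [
    [(10, 10, 50, 50)],
    [(12, 12, 52, 52), (60, 60, 90, 90)],  # second box is a distractor
    [(14, 14, 54, 54)],
]
tube = link_tube(detections, detections[0][0])
clip = crop_tube(video, tube)
```

The cropped `clip` is much smaller than the full frames, which illustrates the reduced input burden the abstract describes; a real pipeline would also need to resize tubes to a fixed input size for the network.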