
Key frame extraction from first-person video with multi-sensor integration

Authors:
Hideki Asoh
Motoaki Kawanabe
Yujie Li
Taiki Miyanishi
Atsunori Kanemura
Source:
ICME
Publication Year:
2017
Publisher:
IEEE, 2017.

Abstract

First-person videos (FPVs) of daily living help us memorize our life experiences and help information systems process our daily activities. Summarizing FPVs into key frames that represent the entire data would allow us to recall past experiences and allow computers to process the data efficiently. However, most video summarization approaches use only visual information, even though our daily activities involve multiple modalities such as movement and sound. Moreover, FPVs are not as stable as movies or sports footage, since a head-mounted camera shakes frequently, and key frame extraction methods that rely only on video frames do not always produce satisfactory results. In this paper, we introduce a novel key frame extraction method for FPVs that uses multiple wearable sensors. To integrate multimodal sensor signals efficiently, our formulation uses sparse dictionary selection, which minimizes the reconstruction error achieved by a subset (the key frames) of the original data. We present experimental results on multimodal datasets captured by wearable sensors in a natural environment. The results suggest that multi-sensor information improves both the precision of the extracted key frames and their coverage of the entire video sequence.
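The core idea named in the abstract, sparse dictionary selection, chooses a small subset of frames whose feature vectors can linearly reconstruct the whole sequence with low error. The Python sketch below is a minimal greedy variant under assumptions of mine, not the paper's exact formulation: the function name select_key_frames, the least-squares selection criterion, and the per-modality z-scoring used to fuse sensor streams are all illustrative.

import numpy as np

def select_key_frames(X, k):
    """Greedy sparse dictionary selection (illustrative sketch).

    X : (d, n) array, one column of fused features per video frame.
    k : number of key frames to extract.
    Returns the indices of the selected columns (key frames).
    """
    n = X.shape[1]
    selected = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(n):
            if j in selected:
                continue
            D = X[:, selected + [j]]                 # candidate dictionary
            Z, *_ = np.linalg.lstsq(D, X, rcond=None)
            err = np.linalg.norm(X - D @ Z)          # reconstruction error
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return sorted(selected)

# Usage: fuse modalities by z-scoring each block and stacking them
# (an assumed fusion scheme; feature dimensions below are hypothetical).
visual = np.random.randn(128, 300)   # per-frame visual features
motion = np.random.randn(6, 300)     # accelerometer/gyroscope features
zscore = lambda A: (A - A.mean(1, keepdims=True)) / (A.std(1, keepdims=True) + 1e-8)
X = np.vstack([zscore(visual), zscore(motion)])
print(select_key_frames(X, k=5))

Greedy selection is a common surrogate for the row-sparse convex formulations used in representative selection; it keeps the example short while still minimizing the same subset reconstruction error the abstract describes.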

Details

Database:
OpenAIRE
Journal:
2017 IEEE International Conference on Multimedia and Expo (ICME)
Accession number:
edsair.doi...........c6261f432ac7f1cb1e21715694a14728