Violence recognition on videos using two-stream 3D CNN with custom spatiotemporal crop.

Authors :: Pratama, Raka Aditya
Yudistira, Novanto
Bachtiar, Fitra Abdurrachman
Source :: Multimedia Tools & Applications; Jul2024, Vol. 83 Issue 22, p61995-62017, 23p
Publication Year :: 2024
Abstract: Violence may happen anywhere. One way to monitor violence in a given location is to install closed-circuit television (CCTV). Video recorded by CCTV can be used as evidence in a court of law. Violence video classification is also an active topic in deep learning. The latest violence video dataset is RWF-2000, which contains 2,000 violent and non-violent videos, each 5 seconds long at 30 frames per second. The publication that introduced it reports a best accuracy of 87.25% with its proposed method. In this study, we use a Residual Network, which is known to mitigate the vanishing gradient problem. In addition, we apply transfer learning from models pre-trained on Kinetics and on Kinetics + Moments in Time. We also vary the number of sampled frames and the location range from which the frames are sampled. RGB and optical flow inputs are trained separately with different configurations. The best accuracy for the RGB input is 89.25%, obtained with Kinetics + Moments in Time pre-training and a frame location range of 49-149. The best accuracy for the optical flow input is 88.5%, obtained with Kinetics pre-training and 74 frames. Summing the outputs of the two streams yields an accuracy of 90.5%. [ABSTRACT FROM AUTHOR]
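
The two configurable steps named in the abstract, sampling frames from a location range and summing the outputs of the RGB and optical flow streams, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how frames are spaced within the range or whether the summed outputs are logits or probabilities, so evenly spaced sampling and plain element-wise addition are assumptions here. The function names `sample_frame_indices` and `fuse_by_sum`, and the choice of class index 0 for "violent", are hypothetical.

```python
def sample_frame_indices(start, end, num_frames):
    """Pick num_frames evenly spaced indices from the inclusive range [start, end].

    Even spacing is an assumption; the abstract only gives the range (e.g. 49-149).
    """
    if num_frames < 1 or end < start:
        raise ValueError("need num_frames >= 1 and end >= start")
    if num_frames == 1:
        return [(start + end) // 2]
    step = (end - start) / (num_frames - 1)
    return [round(start + i * step) for i in range(num_frames)]


def fuse_by_sum(rgb_scores, flow_scores):
    """Late fusion: add the per-class outputs of the two streams element-wise.

    Returns the fused scores and the index of the highest-scoring class.
    Whether the inputs are logits or probabilities is left to the caller.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same classes")
    fused = [r + f for r, f in zip(rgb_scores, flow_scores)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label


# Hypothetical per-class scores, index 0 = violent, index 1 = non-violent.
indices = sample_frame_indices(49, 149, 5)
fused, label = fuse_by_sum([0.4, 0.6], [0.9, 0.1])
```

In this made-up example the RGB stream alone would pick class 1, while the summed scores pick class 0, which shows how fusion lets a confident stream outweigh an uncertain one.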

Subjects :: OPTICAL flow
CLOSED-circuit television
VIOLENCE
DEEP learning
VIDEOS
