575 results for "video classification"
Search Results
2. 3D-CNN Method for Drowsy Driving Detection Based on Driving Pattern Recognition.
- Author
-
Lee, Jimin, Woo, Soomin, and Moon, Changjoo
- Subjects
CONVOLUTIONAL neural networks, TRAFFIC accidents, SUNGLASSES, HABIT, CLASSIFICATION
- Abstract
Drowsiness impairs drivers' concentration and reaction time, doubling the risk of car accidents. Various methods for detecting drowsy driving have been proposed that rely on facial changes. However, they have poor detection for drivers wearing a mask or sunglasses, and they do not reflect the driver's drowsiness habits. Therefore, this paper proposes a novel method to detect drowsy driving even with facial detection obstructions, such as masks or sunglasses, and regardless of the driver's different drowsiness habits, by recognizing behavioral patterns. We achieve this by constructing both normal driving and drowsy driving datasets and developing a 3D-CNN (3D Convolutional Neural Network) model reflecting the Inception structure of GoogleNet. This binary classification model classifies normal driving and drowsy driving videos. Using actual videos captured inside real vehicles, this model achieved a classification accuracy of 85% for detecting drowsy driving without facial obstructions and 75% for detecting drowsy driving when masks and sunglasses are worn. Our results demonstrate that the behavioral pattern recognition method is effective in detecting drowsy driving. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
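The entry above builds a binary normal-vs-drowsy classifier from a 3D-CNN with GoogleNet-style Inception branches. The paper's exact layer configuration is not given in the abstract, so the following PyTorch sketch only illustrates the general pattern of an Inception-style 3D block feeding a two-class head; all channel counts and clip dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Inception3DBlock(nn.Module):
    """Parallel 3D conv branches concatenated on the channel axis (illustrative sizes)."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv3d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv3d(in_ch, 16, kernel_size=1),
                                nn.Conv3d(16, 24, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv3d(in_ch, 8, kernel_size=1),
                                nn.Conv3d(8, 8, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool3d(kernel_size=3, stride=1, padding=1),
                                nn.Conv3d(in_ch, 8, kernel_size=1))

    def forward(self, x):                       # x: (batch, channels, frames, H, W)
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

class Drowsiness3DCNN(nn.Module):
    """Binary normal-vs-drowsy video classifier (a sketch, not the published model)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv3d(3, 32, kernel_size=3, padding=1),
                                  nn.ReLU(), nn.MaxPool3d(2))
        self.inception = Inception3DBlock(32)    # outputs 16 + 24 + 8 + 8 = 56 channels
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(56, 2))

    def forward(self, x):
        return self.head(self.inception(self.stem(x)))

# Example: a batch of 2 clips, each 16 RGB frames of 112x112 pixels.
logits = Drowsiness3DCNN()(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 2])
```

The parallel 1x1x1, 3x3x3, and 5x5x5 branches let the block pick up behavioral cues at several spatio-temporal scales before global pooling.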
3. Effectiveness of deep learning techniques in TV programs classification: A comparative analysis.
- Author
-
Candela, Federico, Giordano, Angelo, Zagaria, Carmen Francesca, and Morabito, Francesco Carlo
- Subjects
*CONVOLUTIONAL neural networks, *PATTERN recognition systems, *DEEP learning, *OPTICAL flow, *TRANSFORMER models
- Abstract
In the application areas of streaming, social networks, and video-sharing platforms such as YouTube and Facebook, along with traditional television systems, program classification stands as a pivotal task in multimedia content management. Despite recent advancements, it remains a scientific challenge for researchers. This paper proposes a novel approach for television monitoring systems and the classification of extended video content. In particular, it presents two distinct techniques for program classification. The first one leverages a framework integrating Structural Similarity Index Measurement and a Convolutional Neural Network, applied to stacked frames to classify program initiation, conclusion, and contents. Notably, this versatile method can be seamlessly adapted across various systems. The second framework processes optical flow directly. Building upon a shot-boundary detection technique, it incorporates background subtraction to adaptively discern frame alterations. These alterations are subsequently categorized through the integration of a Transformer network, showcasing a potential advancement in program classification methodology. A comprehensive overview of the promising experimental results yielded by the two techniques is reported. The first technique achieved an accuracy of 95%, while the second achieved 87% on multiclass classification. These results underscore the effectiveness and reliability of the proposed frameworks and pave the way for more efficient and precise content management in the ever-evolving landscape of multimedia platforms and streaming services. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
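The first framework above pipelines the Structural Similarity Index Measurement with a CNN over stacked frames. As a rough sketch of how SSIM can flag frame-to-frame changes such as program boundaries (the authors' actual pipeline and thresholds are not given in the abstract, so the 0.6 cutoff and the file name below are assumptions), consecutive grayscale frames can be compared with scikit-image:

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def candidate_boundaries(video_path, threshold=0.6):
    """Return frame indices where SSIM against the previous frame drops below a threshold.
    The threshold is an illustrative assumption, not a value from the paper."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, boundaries, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            score = ssim(prev_gray, gray)           # 1.0 means identical frames
            if score < threshold:
                boundaries.append(idx)              # likely shot change / program transition
        prev_gray, idx = gray, idx + 1
    cap.release()
    return boundaries

# boundaries = candidate_boundaries("broadcast_recording.mp4")  # placeholder path
```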
4. Machine vs. deep learning comparison for developing an international sign language translator.
- Author
-
Eryilmaz, Meltem, Balkaya, Ecem, Uçan, Eylül, Turan, Gizem, and Oral, Seden Gülay
- Subjects
*SIGN language, *DEEP learning, *UNIVERSAL language, *MACHINE learning, *DEAF people, *COMMUNICATION barriers
- Abstract
This study aims to enable deaf and hard-of-hearing people to communicate with other individuals who know and do not know sign language. In the study, a mobile application was developed for video classification using the MediaPipe library. Considering the problems that deaf and hearing-impaired individuals face in Turkey and abroad, the modelling and training stages were carried out with the English language option. With the real-time translation feature added to the study, individuals were provided with instant communication. In this way, communication problems experienced by hearing-impaired individuals will be greatly reduced. Machine learning and deep learning concepts were investigated in the study. Model creation and training stages were carried out using the VGG16, OpenCV, Pandas, Keras, and os libraries. Due to the low success rate of the model created using VGG16, the MediaPipe library was used in the formation and training stages of the model. The reason for this is that, thanks to the solutions available in the MediaPipe library, it can normalise the coordinates in 3D by marking the regions to be detected in the human body. Being able to extract the coordinates independently of the background and body type in the videos in the dataset increases the success rate of the model in the formation and training stages. As a result of the experiments, the accuracy rate of the deep learning model is 85% and the application can be easily integrated with different languages. It is concluded that the deep learning model is more accurate than the machine learning one and that the communication problem faced by hearing-impaired individuals in many countries can be greatly reduced. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
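The study above attributes its accuracy gain to MediaPipe's ability to output body coordinates that are normalized and largely independent of background and body type. The dataset and downstream classifier are not described in detail, so the sketch below covers only that landmark-extraction step with the MediaPipe Pose solution; the file name is a placeholder.

```python
import cv2
import numpy as np
import mediapipe as mp

def pose_sequence(video_path):
    """Extract one (x, y, z, visibility) vector per frame using MediaPipe Pose.
    Frames with no detected person are skipped."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks:
                frames.append(np.array(
                    [[lm.x, lm.y, lm.z, lm.visibility]
                     for lm in result.pose_landmarks.landmark]).flatten())
    cap.release()
    # MediaPipe Pose returns 33 landmarks, so each frame vector has 33 * 4 = 132 values.
    return np.stack(frames) if frames else np.empty((0, 33 * 4))

# seq = pose_sequence("sign_clip.mp4")   # placeholder path; shape: (num_frames, 132)
```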
5. Multi-Directional Long-Term Recurrent Convolutional Network for Road Situation Recognition.
- Author
-
Dofitas Jr., Cyreneo, Gil, Joon-Min, and Byun, Yung-Cheol
- Subjects
*ARTIFICIAL neural networks, *RECURRENT neural networks, *PEDESTRIANS, *ROAD safety measures, *CONVOLUTIONAL neural networks, *DEEP learning
- Abstract
Understanding road conditions is essential for implementing effective road safety measures and driving solutions. Road situations encompass the day-to-day conditions of roads, including the presence of vehicles and pedestrians. Surveillance cameras strategically placed along streets have been instrumental in monitoring road situations and providing valuable information on pedestrians, moving vehicles, and objects within road environments. However, these video data and information are stored in large volumes, making analysis tedious and time-consuming. Deep learning models are increasingly utilized to monitor vehicles and identify and evaluate road and driving comfort situations. However, the current neural network model requires the recognition of situations using time-series video data. In this paper, we introduced a multi-directional detection model for road situations to uphold high accuracy. Deep learning methods often integrate long short-term memory (LSTM) into long-term recurrent network architectures. This approach effectively combines recurrent neural networks to capture temporal dependencies and convolutional neural networks (CNNs) to extract features from extensive video data. In our proposed method, we form a multi-directional long-term recurrent convolutional network approach with two groups equipped with CNN and two layers of LSTM. Additionally, we compare road situation recognition using convolutional neural networks, long short-term networks, and long-term recurrent convolutional networks. The paper presents a method for detecting and recognizing multi-directional road contexts using a modified LRCN. After balancing the dataset through data augmentation, the number of video files increased, resulting in our model achieving 91% accuracy, a significant improvement from the original dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
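The model above follows the long-term recurrent convolutional network (LRCN) pattern: per-frame CNN features feeding two LSTM layers. The paper's backbone and layer widths are not stated in the abstract, so this PyTorch sketch assumes a ResNet-18 feature extractor and illustrative sizes just to show the generic CNN-then-stacked-LSTM structure.

```python
import torch
import torch.nn as nn
from torchvision import models

class LRCN(nn.Module):
    """Per-frame CNN features -> 2-layer LSTM -> class scores (illustrative sizes)."""
    def __init__(self, num_classes=4, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)      # assumed backbone, not the paper's
        backbone.fc = nn.Identity()                   # keep the 512-d pooled feature
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clips):                         # clips: (batch, frames, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))         # (batch*frames, 512)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                    # classify from the last time step

scores = LRCN()(torch.randn(2, 8, 3, 224, 224))
print(scores.shape)  # torch.Size([2, 4])
```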
6. Deep Learning Innovations in Video Classification: A Survey on Techniques and Dataset Evaluations.
- Author
-
Mao, Makara, Lee, Ahyoung, and Hong, Min
- Subjects
CONVOLUTIONAL neural networks, DATA augmentation, VIDEO processing, PARALLEL processing, DEEP learning, IMAGE processing
- Abstract
Video classification has achieved remarkable success in recent years, driven by advanced deep learning models that automatically categorize video content. This paper provides a comprehensive review of video classification techniques and the datasets used in this field. We summarize key findings from recent research, focusing on network architectures, model evaluation metrics, and parallel processing methods that enhance training speed. Our review includes an in-depth analysis of state-of-the-art deep learning models and hybrid architectures, comparing models to traditional approaches and highlighting their advantages and limitations. Critical challenges such as handling large-scale datasets, improving model robustness, and addressing computational constraints are explored. By evaluating performance metrics, we identify areas where current models excel and where improvements are needed. Additionally, we discuss data augmentation techniques designed to enhance dataset accuracy and address specific challenges in video classification tasks. This survey also examines the evolution of convolutional neural networks (CNNs) in image processing and their adaptation to video classification tasks. We propose future research directions and provide a detailed comparison of existing approaches using the UCF-101 dataset, highlighting progress and ongoing challenges in achieving robust video classification. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. Improvement of Tradition Dance Classification Process Using Video Vision Transformer based on Tubelet Embedding.
- Author
-
Mulyanto, Edy, Yuniarno, Eko Mulyanto, Putra, Oddy Virgantara, Hafidz, Isa, Priyadi, Ardyono, and Purnomo, Mauridhi H.
- Subjects
TRANSFORMER models, HISTORY of dance, ARTIFICIAL neural networks, OBJECT recognition (Computer vision), VIDEO processing, THRESHOLDING algorithms
- Abstract
Image processing has extensively addressed object detection, classification, clustering, and segmentation challenges. At the same time, the use of computers on complex video datasets has spurred various strategies to classify videos automatically, particularly for detecting traditional dances. This research proposes an advancement in classifying traditional dances by implementing a Video Vision Transformer (ViViT) that relies on tubelet embedding. The authors utilized IDEEH-10, a dataset of videos showcasing traditional dances. In addition, the ViViT artificial neural network model was used for video classification. The video representation is generated by projecting spatiotemporal tokens onto the transformer layer. Next, an embedding strategy is used to improve the classification accuracy of traditional dance videos. The proposed concept treats video as a sequence of tubelets mapped into tubelet embeddings. Tubelet management adds a tubelet attention (TA) layer, a cross attention (CA) layer, and tubelet duration and scale management. From the test results, the proposed approach classifies traditional dance videos better than the LSTM, GRU, and RNN methods, with or without balancing data. Experimental results with 5 folds showed losses between 0.003 and 0.011, with an average loss of 0.0058. Experiments also produced accuracy rates between 98.68 and 100 percent, resulting in an average accuracy of 99.216 percent. This result is the best among several comparison methods. ViViT with tubelet embedding achieves a good level of accuracy with low losses, so it can be used for dance video classification processes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
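Tubelet embedding, the core of the ViViT variant above, splits a clip into small spatio-temporal patches and linearly projects each one into a token. The paper's tubelet sizes and its TA/CA layers are not reproduced here; the sketch below shows only the standard tubelet-embedding step, implemented as a strided 3D convolution with assumed patch dimensions.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Map a video of shape (B, C, T, H, W) to a sequence of tubelet tokens.
    The tubelet size and embedding width are illustrative, not the paper's values."""
    def __init__(self, embed_dim=192, tubelet=(2, 16, 16), in_ch=3):
        super().__init__()
        # A Conv3d with kernel == stride == tubelet size performs the linear projection
        # of every non-overlapping tubelet in a single pass.
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        tokens = self.proj(video)                  # (B, embed_dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, T'*H'*W', embed_dim)

video = torch.randn(1, 3, 16, 128, 128)            # one 16-frame clip
tokens = TubeletEmbedding()(video)
print(tokens.shape)                                # torch.Size([1, 512, 192]) -> 8*8*8 tokens
```

The resulting token sequence is what a ViViT-style transformer encoder would consume.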
8. Violence recognition on videos using two-stream 3D CNN with custom spatiotemporal crop.
- Author
-
Pratama, Raka Aditya, Yudistira, Novanto, and Bachtiar, Fitra Abdurrachman
- Subjects
OPTICAL flow, CLOSED-circuit television, VIOLENCE, DEEP learning, VIDEOS
- Abstract
Violence may happen anywhere. One way to monitor violence in public places is by installing closed-circuit television (CCTV). The recorded video captured by CCTV can be used as proof in a law court. Violence video classification is also an active topic in deep learning. The latest violence video dataset is RWF-2000, which contains 2000 violent and non-violent videos, each 5 seconds long at 30 frames per second. That publication reports a best accuracy of 87.25% with its proposed method. In this study, we use a Residual Network, known for mitigating the vanishing gradient problem. Besides that, we also apply transfer learning from Kinetics and Kinetics + Moments in Time pre-trained data. We also test the number of frames and the location range of the sampled frames. RGB and optical flow inputs are trained separately with different configurations. The best RGB accuracy is 89.25% with Kinetics + Moments in Time pre-training, using frame locations 49-149. The best optical flow accuracy is 88.5% with Kinetics pre-training, using 74 frames. We also sum the outputs of both streams, yielding an accuracy of 90.5%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
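The 90.5% figure above comes from summing the outputs of the separately trained RGB and optical-flow streams. The trained 3D ResNet checkpoints themselves are not available from the abstract, so the snippet below only illustrates that score-level (late) fusion step, assuming two models that each return class logits for the same clip.

```python
import torch

@torch.no_grad()
def late_fusion_predict(rgb_model, flow_model, rgb_clip, flow_clip):
    """Sum per-class scores from the RGB and optical-flow streams, then take the argmax.
    Both models are assumed to output logits of shape (batch, num_classes)."""
    rgb_scores = torch.softmax(rgb_model(rgb_clip), dim=1)
    flow_scores = torch.softmax(flow_model(flow_clip), dim=1)
    fused = rgb_scores + flow_scores            # simple score-level (late) fusion
    return fused.argmax(dim=1)                  # predicted class per video

# preds = late_fusion_predict(rgb_net, flow_net, rgb_batch, flow_batch)  # hypothetical names
```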
9. Utilizing Age‐Adaptive Deep Learning Approaches for Detecting Inappropriate Video Content.
- Author
-
Alam, Iftikhar, Basit, Abdul, Ziar, Riaz Ahmad, and Chakraborty, Pinaki
- Subjects
*STREAMING video & television, *EVIDENCE gaps, *INTERNET safety, *DEEP learning, *SAFETY standards, *VIDEO surveillance
- Abstract
The exponential growth of video‐sharing platforms, exemplified by platforms like YouTube and Netflix, has made videos available to everyone with minimal restrictions. This proliferation, while offering a variety of content, at the same time introduces challenges, such as the increased vulnerability of children and adolescents to potentially harmful material, notably explicit content. Despite the efforts in developing content moderation tools, a research gap still exists in creating comprehensive solutions capable of reliably estimating users' ages and accurately classifying numerous forms of inappropriate video content. This study is aimed at bridging this gap by introducing VideoTransformer, which combines the power of two existing models: AgeNet and MobileNetV2. To evaluate the effectiveness of the proposed approach, this study utilized a manually annotated video dataset collected from YouTube, covering multiple categories, including safe, real violence, drugs, nudity, simulated violence, kissing, pornography, and terrorism. In contrast to existing models, the proposed VideoTransformer model demonstrates significant performance improvements, as evidenced by two distinct accuracy evaluations. It achieves an impressive accuracy rate of (96.89%) in a 5‐fold cross‐validation setup, outperforming NasNet (92.6%), EfficientNet‐B7 (87.87%), GoogLeNet (85.1%), and VGG‐19 (92.83%). Furthermore, in a single run, it maintains a consistent accuracy rate of 90%. Additionally, the proposed model attains an F1‐score of 90.34%, indicating a well‐balanced trade‐off between precision and recall. These findings highlight the potential of the proposed approach in advancing content moderation and enhancing user safety on video‐sharing platforms. We envision deploying the proposed methodology in real‐time video streaming to effectively mitigate the spread of inappropriate content, thereby raising online safety standards. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
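The abstract above describes VideoTransformer as a combination of AgeNet and MobileNetV2 but gives no architectural details, so no attempt is made to reproduce it here. As a small, hedged illustration of one plausible ingredient, the sketch below turns a stack of frames into a single video descriptor by averaging frozen MobileNetV2 features; the pooling choice and input sizes are assumptions.

```python
import torch
from torchvision import models

# MobileNetV2 as a frozen per-frame feature extractor; in practice pretrained
# ImageNet weights would be loaded instead of random ones.
backbone = models.mobilenet_v2(weights=None)
backbone.classifier = torch.nn.Identity()       # expose the 1280-d pooled feature
backbone.eval()

@torch.no_grad()
def video_descriptor(frames):
    """frames: (num_frames, 3, 224, 224) tensor, ImageNet-normalized.
    Returns a single 1280-d descriptor by averaging per-frame features."""
    return backbone(frames).mean(dim=0)

desc = video_descriptor(torch.randn(16, 3, 224, 224))
print(desc.shape)  # torch.Size([1280])
```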
10. Variable Temporal Length Training for Action Recognition CNNs.
- Author
-
Li, Tan-Kun, Chan, Kwok-Leung, and Tjahjadi, Tardi
- Subjects
*CONVOLUTIONAL neural networks, *COMPUTER vision, *DEEP learning, *VIDEO processing, *RECOGNITION (Psychology)
- Abstract
Most current deep learning models are suboptimal in terms of the flexibility of their input shape. Usually, computer vision models only work on one fixed shape used during training, otherwise their performance degrades significantly. For video-related tasks, the length of each video (i.e., number of video frames) can vary widely; therefore, sampling of video frames is employed to ensure that every video has the same temporal length. This training method brings about drawbacks in both the training and testing phases. For instance, a universal temporal length can damage the features in longer videos, preventing the model from flexibly adapting to variable lengths for the purposes of on-demand inference. To address this, we propose a simple yet effective training paradigm for 3D convolutional neural networks (3D-CNN) which enables them to process videos with inputs having variable temporal length, i.e., variable length training (VLT). Compared with the standard video training paradigm, our method introduces three extra operations during training: sampling twice, temporal packing, and subvideo-independent 3D convolution. These operations are efficient and can be integrated into any 3D-CNN. In addition, we introduce a consistency loss to regularize the representation space. After training, the model can successfully process video with varying temporal length without any modification in the inference phase. Our experiments on various popular action recognition datasets demonstrate the superior performance of the proposed method compared to conventional training paradigm and other state-of-the-art training paradigms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
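Variable length training (VLT) as summarized above samples clips of different temporal lengths from the same video and adds a consistency loss over their predictions. The authors' temporal packing and subvideo-independent 3D convolution are not detailed in the abstract, so this sketch only illustrates the "sample twice plus consistency" idea, with an MSE term between the two softmax outputs standing in for the paper's consistency loss.

```python
import torch
import torch.nn.functional as F

def sample_clip(video, length):
    """video: (C, T, H, W). Uniformly sample `length` frame indices along T."""
    idx = torch.linspace(0, video.shape[1] - 1, steps=length).long()
    return video[:, idx]

def vlt_step(model, video, label, short_len=8, long_len=16, lam=0.5):
    """One illustrative training step: classify two clips of different temporal lengths
    taken from the same video and penalize disagreement between their predictions.
    `model` must accept variable clip lengths (e.g. a 3D-CNN ending in adaptive pooling);
    `label` is a LongTensor of shape (1,). The loss terms and weights are assumptions."""
    short = sample_clip(video, short_len).unsqueeze(0)   # (1, C, short_len, H, W)
    long = sample_clip(video, long_len).unsqueeze(0)     # (1, C, long_len, H, W)
    p_short, p_long = model(short), model(long)
    cls_loss = F.cross_entropy(p_short, label) + F.cross_entropy(p_long, label)
    consistency = F.mse_loss(p_short.softmax(dim=1), p_long.softmax(dim=1))
    return cls_loss + lam * consistency

# loss = vlt_step(video_model, video_tensor, torch.tensor([2]))  # hypothetical inputs
```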
11. Automatic excavator action recognition and localisation for untrimmed video using hybrid LSTM-Transformer networks.
- Author
-
Martin, Abbey, Hill, Andrew J., Seiler, Konstantin M., and Balamurali, Mehala
- Subjects
*EXCAVATING machinery, *COMPUTER vision, *SHORT-term memory, *TRANSFORMER models, *RECOGNITION (Psychology), *VIDEOS
- Abstract
In mining and construction, excavators are integral to earth-moving operations. Accurate knowledge of excavator activities may be used in productivity analysis to streamline delivery. This paper presents a computer vision-based method for excavator action detection which can automatically infer the occurrence and duration of excavator actions from untrimmed video captured from the excavator cab. The model uses a three-stage architecture consisting of a VGG16 feature extractor, a four-stage Transformer Encoder-Long Short-Term Memory (LSTM) module, and a post-processing component. The model's predictive performance has been validated on the largest dataset among similar studies, comprising 567,000 frames filmed on-site at day and night. When tested on night and daytime videos, the model achieves accuracies of 90% and 70%, respectively, highlighting strong potential for practical implementation of the Transformer-LSTM network in excavator action detection. This study presents the first application of the combined Transformer-LSTM network for action detection in computer vision. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
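The three-stage pipeline above runs per-frame VGG16 features through a four-stage Transformer encoder followed by an LSTM before post-processing. The exact widths and post-processing rules are not given, so the PyTorch sketch below shows only an assumed Transformer-encoder-plus-LSTM head producing per-frame action logits for an untrimmed video.

```python
import torch
import torch.nn as nn

class TransformerLSTMHead(nn.Module):
    """Per-frame features -> Transformer encoder -> LSTM -> per-frame action logits.
    Depth, widths, and the VGG16 front end are assumptions for illustration."""
    def __init__(self, feat_dim=512, num_actions=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lstm = nn.LSTM(feat_dim, 256, batch_first=True)
        self.fc = nn.Linear(256, num_actions)

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        out, _ = self.lstm(self.encoder(feats))
        return self.fc(out)                     # (batch, frames, num_actions)

# e.g. 512-d per-frame features for 300 frames of one untrimmed video:
logits = TransformerLSTMHead()(torch.randn(1, 300, 512))
print(logits.shape)  # torch.Size([1, 300, 4])
```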
12. Driver Safety System for Agricultural Machinery Operations Using Deep Learning Algorithm
- Author
-
Hengyi, Zhang, Ahamed, Tofael, and Ahamed, Tofael, editor
- Published
- 2024
- Full Text
- View/download PDF
13. Ensembles of Bidirectional LSTM and GRU Neural Nets for Predicting Mother-Infant Synchrony in Videos
- Author
-
Stamate, Daniel, Davuloori, Pradyumna, Logofatu, Doina, Mercure, Evelyne, Addyman, Caspar, Tomlinson, Mark, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Iliadis, Lazaros, editor, Maglogiannis, Ilias, editor, Papaleonidas, Antonios, editor, Pimenidis, Elias, editor, and Jayne, Chrisina, editor
- Published
- 2024
- Full Text
- View/download PDF
14. Classification Algorithm of Sports Teaching Video Based on Wireless Sensor Network
- Author
-
Chen, Zhipeng, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin, Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Yun, Lin, editor, Han, Jiang, editor, and Han, Yu, editor
- Published
- 2024
- Full Text
- View/download PDF
15. MMT: Transformer for Multi-modal Multi-label Self-supervised Learning
- Author
-
Wang, Jiahe, Li, Jia, Liu, Xingrui, Gao, Xizhan, Niu, Sijie, Dong, Jiwen, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, You, Peng, editor, Liu, Shuaiqi, editor, and Wang, Jun, editor
- Published
- 2024
- Full Text
- View/download PDF
16. Exploring the Impact of Convolutions on LSTM Networks for Video Classification
- Author
-
Benzyane, Manal, Azrour, Mourade, Zeroual, Imad, Agoujil, Said, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Farhaoui, Yousef, editor, Hussain, Amir, editor, Saba, Tanzila, editor, Taherdoost, Hamed, editor, and Verma, Anshul, editor
- Published
- 2024
- Full Text
- View/download PDF
17. A Novel Approach for Deep Learning Based Video Classification and Captioning using Keyframe
- Author
-
Ghadekar, Premanand, Pungliya, Vithika, Purohit, Atharva, Bhonsle, Roshita, Raut, Ankur, Pate, Samruddhi, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Zhang, Junjie James, Series Editor, Tan, Kay Chen, Series Editor, Mehta, Gayatri, editor, Wickramasinghe, Nilmini, editor, and Kakkar, Deepti, editor
- Published
- 2024
- Full Text
- View/download PDF
18. YoloP-Based Pre-processing for Driving Scenario Detection
- Author
-
Cossu, Marianna, Berta, Riccardo, Forneris, Luca, Fresta, Matteo, Lazzaroni, Luca, Sauvaget, Jean-Louis, Bellotti, Francesco, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Zhang, Junjie James, Series Editor, Tan, Kay Chen, Series Editor, Bellotti, Francesco, editor, Grammatikakis, Miltos D., editor, Mansour, Ali, editor, Ruo Roch, Massimo, editor, Seepold, Ralf, editor, Solanas, Agusti, editor, and Berta, Riccardo, editor
- Published
- 2024
- Full Text
- View/download PDF
19. Activity Identification and Recognition in Real-Time Video Data Using Deep Learning Techniques
- Author
-
Grover, Anant, Arora, Deepak, Grover, Anuj, Bansal, Jagdish Chand, Series Editor, Deep, Kusum, Series Editor, Nagar, Atulya K., Series Editor, Jacob, I. Jeena, editor, Piramuthu, Selwyn, editor, and Falkowski-Gilski, Przemyslaw, editor
- Published
- 2024
- Full Text
- View/download PDF
20. Movement in Video Classification Using Structured Data: Workout Videos Application
- Author
-
Múnera, Jonathan, Tabares, Marta Silvia, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Tabares, Marta, editor, Vallejo, Paola, editor, Suarez, Biviana, editor, Suarez, Marco, editor, Ruiz, Oscar, editor, and Aguilar, Jose, editor
- Published
- 2024
- Full Text
- View/download PDF
21. A Multi-scale Multi-modal Multi-dimension Joint Transformer for Two-Stream Action Classification
- Author
-
Wang, Lin, Hawbani, Ammar, Xiong, Yan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Fenrong, editor, Sadanandan, Arun Anand, editor, Pham, Duc Nghia, editor, Mursanto, Petrus, editor, and Lukose, Dickson, editor
- Published
- 2024
- Full Text
- View/download PDF
22. Deep learning-based vehicle event identification
- Author
-
Chen, Yen-Yu, Chen, Jui-Chi, Lian, Zhen-You, Chiang, Hsin-You, Huang, Chung-Lin, and Chuang, Cheng-Hung
- Published
- 2024
- Full Text
- View/download PDF
23. Schlieren imaging and video classification of alphabet pronunciations: exploiting phonetic flows for speech recognition and speech therapy
- Author
-
Mohamed Talaat, Kian Barari, Xiuhua April Si, and Jinxiang Xi
- Subjects
Alphabet pronunciation, Speech flows, Articulatory phonetics, Video classification, Schlieren, Long short-term memory, Drawing. Design. Illustration, NC1-1940, Computer applications to medicine. Medical informatics, R858-859.7, Computer software, QA76.75-76.765
- Abstract
Abstract Speech is a highly coordinated process that requires precise control over vocal tract morphology/motion to produce intelligible sounds while simultaneously generating unique exhaled flow patterns. The schlieren imaging technique visualizes airflows with subtle density variations. It is hypothesized that speech flows captured by schlieren, when analyzed using a hybrid of convolutional neural network (CNN) and long short-term memory (LSTM) network, can recognize alphabet pronunciations, thus facilitating automatic speech recognition and speech disorder therapy. This study evaluates the feasibility of using a CNN-based video classification network to differentiate speech flows corresponding to the first four alphabets: /A/, /B/, /C/, and /D/. A schlieren optical system was developed, and the speech flows of alphabet pronunciations were recorded for two participants at an acquisition rate of 60 frames per second. A total of 640 video clips, each lasting 1 s, were utilized to train and test a hybrid CNN-LSTM network. Acoustic analyses of the recorded sounds were conducted to understand the phonetic differences among the four alphabets. The hybrid CNN-LSTM network was trained separately on four datasets of varying sizes (i.e., 20, 30, 40, 50 videos per alphabet), all achieving over 95% accuracy in classifying videos of the same participant. However, the network’s performance declined when tested on speech flows from a different participant, with accuracy dropping to around 44%, indicating significant inter-participant variability in alphabet pronunciation. Retraining the network with videos from both participants improved accuracy to 93% on the second participant. Analysis of misclassified videos indicated that factors such as low video quality and disproportional head size affected accuracy. These results highlight the potential of CNN-assisted speech recognition and speech therapy using articulation flows, although challenges remain in expanding the alphabet set and participant cohort.
- Published
- 2024
- Full Text
- View/download PDF
24. Enhancing multimedia management: cloud-based movie type recognition with hybrid deep learning architecture
- Author
-
Fangru Lin, Jie Yuan, Zhiwei Chen, and Maryam Abiri
- Subjects
Video classification, Deep learning, Service management, Cloud computing, Movie genres, Bidirectional LSTM, Computer engineering. Computer hardware, TK7885-7895, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Abstract Film and movie genres play a pivotal role in captivating relevant audiences across interactive multimedia platforms. With a focus on entertainment, streaming providers are increasingly prioritizing the automatic generation of movie genres within cloud-based media services. In service management, the integration of a hybrid convolutional network proves to be instrumental in effectively distinguishing between a diverse array of video genres. This classification process not only facilitates more refined recommendations and content filtering but also enables targeted advertising. Furthermore, given the frequent amalgamation of components from various genres in cinema, there arises a need for social media networks to incorporate real-time video classification mechanisms for accurate genre identification. In this study, we propose a novel architecture leveraging deep learning techniques for the detection and classification of genres in video films. Our approach entails the utilization of a bidirectional long- and short-term memory (BiLSTM) network, augmented with video descriptors extracted from EfficientNet-B7, an ImageNet pre-trained convolutional neural network (CNN) model. By employing BiLSTM, the network acquires robust video representations and proficiently categorizes movies into multiple genres. Evaluation on the LMTD dataset demonstrates the substantial improvement in the performance of the movie genre classifier system achieved by our proposed architecture. Notably, our approach achieves both computational efficiency and precision, outperforming even the most sophisticated models. Experimental results reveal that EfficientNet-BiLSTM achieves a precision rate of 93.5%. Furthermore, our proposed architecture attains state-of-the-art performance, as evidenced by its F1 score of 0.9012.
- Published
- 2024
- Full Text
- View/download PDF
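The architecture above feeds per-frame EfficientNet-B7 descriptors into a BiLSTM for multi-genre classification. A minimal sketch of that head is shown below, assuming 2560-dimensional EfficientNet-B7 features, mean pooling over time, and an illustrative genre count; none of these settings are taken from the paper.

```python
import torch
import torch.nn as nn

class GenreBiLSTM(nn.Module):
    """BiLSTM over per-frame EfficientNet-B7 descriptors with a multi-genre sigmoid head.
    The hidden size and genre count are illustrative assumptions."""
    def __init__(self, feat_dim=2560, hidden=256, num_genres=9):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_genres)

    def forward(self, frame_feats):             # frame_feats: (batch, frames, 2560)
        out, _ = self.bilstm(frame_feats)
        pooled = out.mean(dim=1)                # average the BiLSTM states over time
        return torch.sigmoid(self.head(pooled)) # independent per-genre probabilities

probs = GenreBiLSTM()(torch.randn(2, 32, 2560))
print(probs.shape)  # torch.Size([2, 9])
```

The sigmoid head reflects that a film can belong to several genres at once, which is why a softmax over genres is not used here.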
25. Recognizing online video genres using ensemble deep convolutional learning for digital media service management
- Author
-
Yuwen Shao and Na Guo
- Subjects
Video classification, Movie genres, Service management, Ensemble deep learning, Gated recurrent unit, Computer engineering. Computer hardware, TK7885-7895, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Abstract It's evident that streaming services increasingly seek to automate the generation of film genres, a factor profoundly shaping a film's structure and target audience. Integrating a hybrid convolutional network into service management emerges as a valuable technique for discerning various video formats. This innovative approach not only categorizes video content but also facilitates personalized recommendations, content filtering, and targeted advertising. Given the tendency of films to blend elements from multiple genres, there is a growing demand for a real-time video classification system integrated with social media networks. Leveraging deep learning, we introduce a novel architecture for identifying and categorizing video film genres. Our approach utilizes an ensemble gated recurrent unit (ensGRU) neural network, effectively analyzing motion, spatial information, and temporal relationships. Additionally, we present a sophisticated deep neural network incorporating the recommended GRU for video genre classification. The adoption of a dual-model strategy allows the network to capture robust video representations, leading to exceptional performance in multi-class movie classification. Evaluations conducted on well-known datasets, such as the LMTD dataset, consistently demonstrate the high performance of the proposed GRU model. This integrated model effectively extracts and learns features related to motion, spatial location, and temporal dynamics. Furthermore, the effectiveness of the proposed technique is validated using an engine block assembly dataset. Following the implementation of the enhanced architecture, the movie genre categorization system exhibits substantial improvements on the LMTD dataset, outperforming advanced models while requiring less computing power. With an impressive F1 score of 0.9102 and an accuracy rate of 94.4%, the recommended model consistently delivers outstanding results. Comparative evaluations underscore the accuracy and effectiveness of our proposed model in accurately identifying and classifying video genres, effectively extracting contextual information from video descriptors. Additionally, by integrating edge processing capabilities, our system achieves optimal real-time video processing and analysis, further enhancing its performance and relevance in dynamic media environments.
- Published
- 2024
- Full Text
- View/download PDF
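The ensGRU model above combines several GRU branches that analyze motion, spatial, and temporal descriptors. Its exact branch design is not given in the abstract, so the sketch below shows only the generic pattern of independent GRU classifiers whose softmax outputs are averaged; feature sizes and the genre count are assumptions.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """A single GRU branch over a sequence of video descriptors (illustrative sizes)."""
    def __init__(self, feat_dim=1024, hidden=128, num_genres=9):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_genres)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        out, _ = self.gru(x)
        return self.fc(out[:, -1])               # classify from the final GRU state

def ensemble_predict(branches, inputs):
    """Average the softmax outputs of several GRU branches, e.g. one per descriptor type."""
    probs = [branch(x).softmax(dim=1) for branch, x in zip(branches, inputs)]
    return torch.stack(probs).mean(dim=0)

spatial_branch, motion_branch = GRUClassifier(), GRUClassifier()
x_spatial, x_motion = torch.randn(2, 32, 1024), torch.randn(2, 32, 1024)
print(ensemble_predict([spatial_branch, motion_branch], [x_spatial, x_motion]).shape)  # (2, 9)
```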
26. Shots segmentation-based optimized dual-stream framework for robust human activity recognition in surveillance video
- Author
-
Altaf Hussain, Samee Ullah Khan, Noman Khan, Waseem Ullah, Ahmed Alkhayyat, Meshal Alharbi, and Sung Wook Baik
- Subjects
Activity Recognition, Video Classification, Surveillance System, Lowlight Image Enhancement, Dual Stream Network, Transformer Network, Engineering (General). Civil engineering (General), TA1-2040
- Abstract
Nowadays, for controlling crime, surveillance cameras are typically installed in all public places to ensure urban safety and security. However, automating Human Activity Recognition (HAR) using computer vision techniques faces several challenges such as lowlighting, complex spatiotemporal features, clutter backgrounds, and inefficient utilization of surveillance system resources. Existing attempts in HAR designed straightforward networks by analyzing either spatial or motion patterns resulting in limited performance while the dual streams methods are entirely based on Convolutional Neural Networks (CNN) that are inadequate to learning the long-range temporal information for HAR. To overcome the above-mentioned challenges, this paper proposes an optimized dual stream framework for HAR which mainly consists of three steps. First, a shots segmentation module is introduced in the proposed framework to efficiently utilize the surveillance system resources by enhancing the lowlight video stream and then it detects salient video frames that consist of human. This module is trained on our own challenging Lowlight Human Surveillance Dataset (LHSD) which consists of both normal and different levels of lowlighting data to recognize humans in complex uncertain environments. Next, to learn HAR from both contextual and motion information, a dual stream approach is used in the feature extraction. In the first stream, it freezes the learned weights of the backbone Vision Transformer (ViT) B-16 model to select the discriminative contextual information. In the second stream, ViT features are then fused with the intermediate encoder layers of FlowNet2 model for optical flow to extract a robust motion feature vector. Finally, a two stream Parallel Bidirectional Long Short-Term Memory (PBiLSTM) is proposed for sequence learning to capture the global semantics of activities, followed by Dual Stream Multi-Head Attention (DSMHA) with a late fusion strategy to optimize the huge features vector for accurate HAR. To assess the strength of the proposed framework, extensive empirical results are conducted on real-world surveillance scenarios and various benchmark HAR datasets that achieve 78.6285%, 96.0151%, and 98.875% accuracies on HMDB51, UCF101, and YouTube Action, respectively. Our results show that the proposed strategy outperforms State-of-the-Art (SOTA) methods. The proposed framework gives superior performance in HAR, providing accurate and reliable recognition of human activities in surveillance systems.
- Published
- 2024
- Full Text
- View/download PDF
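One concrete step in the dual-stream framework above is freezing the ViT-B/16 backbone and using it as a contextual feature extractor; the FlowNet2 motion stream, the PBiLSTM, and the attention head are not reproduced here. The sketch below shows only that frozen-backbone step with torchvision's ViT-B/16, exposing the 768-dimensional class-token embedding per frame.

```python
import torch
from torchvision import models

# Frozen ViT-B/16 as a per-frame contextual feature extractor; in practice the
# pretrained ImageNet weights would be loaded instead of random ones.
vit = models.vit_b_16(weights=None)
vit.heads = torch.nn.Identity()                 # expose the 768-d class-token embedding
for p in vit.parameters():
    p.requires_grad = False                     # "freeze the learned weights"
vit.eval()

@torch.no_grad()
def contextual_features(frames):
    """frames: (num_frames, 3, 224, 224) -> (num_frames, 768) frozen ViT embeddings."""
    return vit(frames)

feats = contextual_features(torch.randn(8, 3, 224, 224))
print(feats.shape)  # torch.Size([8, 768])
```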
27. GMDCSA-24: A dataset for human fall detection in videos
- Author
-
Ekram Alam, Abu Sufian, Paramartha Dutta, Marco Leo, and Ibrahim A. Hameed
- Subjects
Indoor fall detection, Remote elderly care, Video classification, Video dataset, Computer applications to medicine. Medical informatics, R858-859.7, Science (General), Q1-390
- Abstract
The population of older adults (elders) is increasing at a breakneck pace worldwide. This surge presents a significant challenge in providing adequate care for elders due to the scarcity of human caregivers. Unintentional falls of humans are critical health issues, especially for elders. Detecting falls and providing assistance as early as possible is of utmost importance. Researchers worldwide have shown interest in designing a system to detect falls promptly especially by remote monitoring, enabling the timely provision of medical help. The dataset ‘GMDCSA-24′ has been created to support the researchers on this topic to develop models to detect falls and other activities. This dataset was generated in three different natural home setups, where Falls and Activities of Daily Living were performed by four subjects (actors). To bring the versatility, the recordings were done at different times and lighting conditions: during the day when there is ample light and at night when there is low light in addition, the subjects wear different sets of clothes in the dataset. The actions were captured using the low-cost 0.92 Megapixel webcam. The low-resolution video clips make it suitable for use in real-time systems with fewer resources without any compression or processing of the clips. Users can also use this dataset to check the robustness and generalizability of a system for false positives since many ADL clips involve complex activities that may be falsely detected as falls. These complex activities include sleeping, picking up an object from the ground, doing push-ups, etc. The dataset contains 81 falls and 79 ADL video clips performed by four subjects.
- Published
- 2024
- Full Text
- View/download PDF
28. Schlieren imaging and video classification of alphabet pronunciations: exploiting phonetic flows for speech recognition and speech therapy.
- Author
-
Talaat, Mohamed, Barari, Kian, Si, Xiuhua April, and Xi, Jinxiang
- Subjects
AUTOMATIC speech recognition, IMAGE recognition (Computer vision), SPEECH therapy, CONVOLUTIONAL neural networks, SPEECH perception, PRONUNCIATION
- Abstract
Speech is a highly coordinated process that requires precise control over vocal tract morphology/motion to produce intelligible sounds while simultaneously generating unique exhaled flow patterns. The schlieren imaging technique visualizes airflows with subtle density variations. It is hypothesized that speech flows captured by schlieren, when analyzed using a hybrid of convolutional neural network (CNN) and long short-term memory (LSTM) network, can recognize alphabet pronunciations, thus facilitating automatic speech recognition and speech disorder therapy. This study evaluates the feasibility of using a CNN-based video classification network to differentiate speech flows corresponding to the first four alphabets: /A/, /B/, /C/, and /D/. A schlieren optical system was developed, and the speech flows of alphabet pronunciations were recorded for two participants at an acquisition rate of 60 frames per second. A total of 640 video clips, each lasting 1 s, were utilized to train and test a hybrid CNN-LSTM network. Acoustic analyses of the recorded sounds were conducted to understand the phonetic differences among the four alphabets. The hybrid CNN-LSTM network was trained separately on four datasets of varying sizes (i.e., 20, 30, 40, 50 videos per alphabet), all achieving over 95% accuracy in classifying videos of the same participant. However, the network's performance declined when tested on speech flows from a different participant, with accuracy dropping to around 44%, indicating significant inter-participant variability in alphabet pronunciation. Retraining the network with videos from both participants improved accuracy to 93% on the second participant. Analysis of misclassified videos indicated that factors such as low video quality and disproportional head size affected accuracy. These results highlight the potential of CNN-assisted speech recognition and speech therapy using articulation flows, although challenges remain in expanding the alphabet set and participant cohort. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Enhancing multimedia management: cloud-based movie type recognition with hybrid deep learning architecture.
- Author
-
Lin, Fangru, Yuan, Jie, Chen, Zhiwei, and Abiri, Maryam
- Subjects
CONVOLUTIONAL neural networks, DEEP learning, FILM genres, INTERACTIVE multimedia
- Abstract
Film and movie genres play a pivotal role in captivating relevant audiences across interactive multimedia platforms. With a focus on entertainment, streaming providers are increasingly prioritizing the automatic generation of movie genres within cloud-based media services. In service management, the integration of a hybrid convolutional network proves to be instrumental in effectively distinguishing between a diverse array of video genres. This classification process not only facilitates more refined recommendations and content filtering but also enables targeted advertising. Furthermore, given the frequent amalgamation of components from various genres in cinema, there arises a need for social media networks to incorporate real-time video classification mechanisms for accurate genre identification. In this study, we propose a novel architecture leveraging deep learning techniques for the detection and classification of genres in video films. Our approach entails the utilization of a bidirectional long- and short-term memory (BiLSTM) network, augmented with video descriptors extracted from EfficientNet-B7, an ImageNet pre-trained convolutional neural network (CNN) model. By employing BiLSTM, the network acquires robust video representations and proficiently categorizes movies into multiple genres. Evaluation on the LMTD dataset demonstrates the substantial improvement in the performance of the movie genre classifier system achieved by our proposed architecture. Notably, our approach achieves both computational efficiency and precision, outperforming even the most sophisticated models. Experimental results reveal that EfficientNet-BiLSTM achieves a precision rate of 93.5%. Furthermore, our proposed architecture attains state-of-the-art performance, as evidenced by its F1 score of 0.9012. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. Multi-Modal Low-Data-Based Learning for Video Classification.
- Author
-
Citak, Erol and Karsligil, Mine Elif
- Subjects
COMPUTER vision, CLASSIFICATION, VIDEOS
- Abstract
Video classification is a challenging task in computer vision that requires analyzing the content of a video to assign it to one or more predefined categories. However, due to the vast amount of visual data contained in videos, the classification process is often computationally expensive and requires a significant amount of annotated data. Because of these reasons, the low-data-based video classification area, which consists of few-shot and zero-shot tasks, is proposed as a potential solution to overcome traditional video classification-oriented challenges. However, existing low-data area datasets, which are either not diverse or have no additional modality context, which is a mandatory requirement for the zero-shot task, do not fulfill the requirements for few-shot and zero-shot tasks completely. To address this gap, in this paper, we propose a large-scale, general-purpose dataset for the problem of multi-modal low-data-based video classification. The dataset contains pairs of videos and attributes that capture multiple facets of the video content. Thus, the new proposed dataset will both enable the study of low-data-based video classification tasks and provide consistency in terms of comparing the evaluations of future studies in this field. Furthermore, to evaluate and provide a baseline for future works on our new proposed dataset, we present a variational autoencoder-based model that leverages the inherent correlation among different modalities to learn more informative representations. In addition, we introduce a regularization technique to improve the baseline model's generalization performance in low-data scenarios. Our experimental results reveal that our proposed baseline model, with the aid of this regularization technique, achieves over 12% improvement in classification accuracy compared to the pure baseline model with only a single labeled sample. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
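The baseline above is a variational autoencoder that learns representations from multi-modal video features with an added regularizer. The paper's exact encoder/decoder and regularization are not specified in the abstract, so the sketch below is only a minimal single-modality VAE over pre-extracted feature vectors, with the standard KL term as the regularizer and assumed dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureVAE(nn.Module):
    """Minimal VAE over pre-extracted video feature vectors (sizes are assumptions)."""
    def __init__(self, feat_dim=1024, latent_dim=64):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 256)
        self.mu, self.logvar = nn.Linear(256, latent_dim), nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    """Reconstruction error plus KL divergence; the KL term acts as the regularizer."""
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return F.mse_loss(recon, x) + kl

x = torch.randn(4, 1024)
recon, mu, logvar = FeatureVAE()(x)
print(vae_loss(x, recon, mu, logvar).item())
```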
31. Recognizing online video genres using ensemble deep convolutional learning for digital media service management.
- Author
-
Shao, Yuwen and Guo, Na
- Subjects
STREAMING video & television, ARTIFICIAL neural networks, DEEP learning, DIGITAL media, DIGITAL learning, PROCESS capability, VIDEOS
- Abstract
It's evident that streaming services increasingly seek to automate the generation of film genres, a factor profoundly shaping a film's structure and target audience. Integrating a hybrid convolutional network into service management emerges as a valuable technique for discerning various video formats. This innovative approach not only categorizes video content but also facilitates personalized recommendations, content filtering, and targeted advertising. Given the tendency of films to blend elements from multiple genres, there is a growing demand for a real-time video classification system integrated with social media networks. Leveraging deep learning, we introduce a novel architecture for identifying and categorizing video film genres. Our approach utilizes an ensemble gated recurrent unit (ensGRU) neural network, effectively analyzing motion, spatial information, and temporal relationships. Additionally, we present a sophisticated deep neural network incorporating the recommended GRU for video genre classification. The adoption of a dual-model strategy allows the network to capture robust video representations, leading to exceptional performance in multi-class movie classification. Evaluations conducted on well-known datasets, such as the LMTD dataset, consistently demonstrate the high performance of the proposed GRU model. This integrated model effectively extracts and learns features related to motion, spatial location, and temporal dynamics. Furthermore, the effectiveness of the proposed technique is validated using an engine block assembly dataset. Following the implementation of the enhanced architecture, the movie genre categorization system exhibits substantial improvements on the LMTD dataset, outperforming advanced models while requiring less computing power. With an impressive F1 score of 0.9102 and an accuracy rate of 94.4%, the recommended model consistently delivers outstanding results. Comparative evaluations underscore the accuracy and effectiveness of our proposed model in accurately identifying and classifying video genres, effectively extracting contextual information from video descriptors. Additionally, by integrating edge processing capabilities, our system achieves optimal real-time video processing and analysis, further enhancing its performance and relevance in dynamic media environments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Keyframe-guided Video Swin Transformer with Multi-path Excitation for Violence Detection.
- Author
-
Li, Chenghao, Yang, Xinyan, and Liang, Gang
- Subjects
*FEATURE extraction, *COMPUTER vision, *DEEP learning, *CONVOLUTIONAL neural networks, *COMPUTER network architectures
- Abstract
Violence detection is a critical task aimed at identifying violent behavior in video by extracting frames and applying classification models. However, the complexity of video data and the suddenness of violent events present significant hurdles in accurately pinpointing instances of violence, making the extraction of frames that indicate violence a challenging endeavor. Furthermore, designing and applying high-performance models for violence detection remains an open problem. Traditional models embed extracted spatial features from sampled frames directly into a temporal sequence, which ignores the spatio-temporal characteristics of video and limits the ability to express continuous changes between adjacent frames. To address the existing challenges, this paper proposes a novel framework called ACTION-VST. First, a keyframe extraction algorithm is developed to select frames that are most likely to represent violent scenes in videos. To transform visual sequences into spatio-temporal feature maps, a multi-path excitation module is proposed to activate spatio-temporal, channel and motion features. Next, an advanced Video Swin Transformer-based network is employed for both global and local spatio-temporal modeling, which enables comprehensive feature extraction and representation of violence. The proposed method was validated on two large-scale datasets, RLVS and RWF-2000, achieving accuracies of over 98 and 93%, respectively, surpassing the state of the art. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
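ACTION-VST above begins with a keyframe extraction algorithm that selects the frames most likely to depict violence; the algorithm itself is not described in the abstract. The sketch below is therefore only a simple stand-in heuristic that ranks frames by motion magnitude (mean absolute difference from the previous frame) and keeps the top-k indices.

```python
import cv2
import numpy as np

def top_motion_keyframes(video_path, k=16):
    """Rank frames by mean absolute difference from the previous frame and keep the
    top-k indices. A simple stand-in heuristic, not the paper's extraction algorithm."""
    cap = cv2.VideoCapture(video_path)
    prev, scores = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        scores.append(0.0 if prev is None else float(np.abs(gray - prev).mean()))
        prev = gray
    cap.release()
    return sorted(np.argsort(scores)[-k:].tolist())   # keyframe indices in time order

# keyframes = top_motion_keyframes("surveillance_clip.mp4")  # placeholder path
```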
33. Using two-stream EfficientNet-BiLSTM network for multiclass classification of disturbing YouTube videos.
- Author
-
Yousaf, Kanwal, Nawaz, Tabassam, and Habib, Adnan
- Abstract
YouTube video recommendation algorithm plays an important role in enhancing user engagement and profitability (or monetization), yet it struggles to deal with disturbing visual content. Real-time analysis using deep learning techniques can play a vital role to identify and filter the disturbing content in online videos. In this paper, we propose an end-to-end trainable two-stream deep learning framework that analyzes and classifies the different categories of disturbing content embedded in child-friendly cartoon videos on YouTube and YouTube Kids platforms. At first, the model extracts the static and motion features from videos through an individual pretrained convolutional neural network (CNN) i.e., EfficientNet-B7. In the next phase, the extracted features are processed by spatio-temporal bidirectional long short-term memory (BiLSTM) network to capture the long-term global temporal dependencies of videos. The learned video representations are forwarded to the classifier for multiclass classification into six categories. Three different types of fusion strategies are investigated to combine the spatial and temporal streams. The evaluation of these methods is performed through extensive experiments on a customized large-scale dataset–YouTube cartoon content filtering (YT-C2F) dataset. The spatio-temporal EfficientNet-BiLSTM network with feature-level fusion displays the best results (f1-score = 0.9316) and shows the efficiency of using a two-stream network compared to single-stream baseline methods. This paper makes two-fold contributions. First, using only static information from videos in a deep learning framework is less effective than using both static and motion information. Second, the feature-level fusion of two streams of networks generates better classification results than early and late fusion techniques. The performance comparison of the proposed model with existing state-of-the-art techniques confirmed the competitive or even better results of our framework. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
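The best result above comes from feature-level fusion of the spatial and temporal EfficientNet-BiLSTM streams, i.e., joining the two stream embeddings before a shared classifier rather than fusing raw inputs (early) or predictions (late). The sketch below illustrates that fusion step; the 512-dimensional stream embeddings and hidden width are assumptions, while the six-way output follows the abstract.

```python
import torch
import torch.nn as nn

class FeatureLevelFusionHead(nn.Module):
    """Concatenate the spatial-stream and temporal-stream embeddings, then classify.
    Stream embedding size and hidden width are assumed; six classes follow the abstract."""
    def __init__(self, stream_dim=512, num_classes=6):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * stream_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, spatial_feat, temporal_feat):
        fused = torch.cat([spatial_feat, temporal_feat], dim=1)   # feature-level fusion
        return self.classifier(fused)

logits = FeatureLevelFusionHead()(torch.randn(2, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 6])
```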
34. An attention mechanism-based CNN-BiLSTM classification model for detection of inappropriate content in cartoon videos.
- Author
-
Yousaf, Kanwal and Nawaz, Tabassam
- Abstract
This paper proposes a novel method that combines an ImageNet pretrained convolutional neural network (CNN) with attention-based bidirectional long short-term memory (BiLSTM) network for accurate detection of inappropriate content in animated cartoon videos. The EfficientNet-B7 architecture is used as a pretrained CNN model for extracting features from videos, whilst the attention-based BiLSTM is implemented to dynamically focus on different parts of video feature sequences that are most relevant for classification. The whole architecture is trained end-to-end with input being the video frames and performed multiclass classification by classifying videos into three different categories namely safe, violent, and sexually explicit videos. This model is validated on a cartoon video dataset retrieved from YouTube by performing a search through YouTube Data API. The experimental results demonstrated that our model performs relatively better than other models by achieving an accuracy of 95.30%. Furthermore, the performance comparison with state-of-the-art algorithms showed that the proposed attention mechanism-based CNN-BiLSTM model achieved competitive results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
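The attention mechanism above lets the BiLSTM weight the frames that matter most for the decision. A minimal sketch of attention pooling over BiLSTM outputs is shown below; the three output classes (safe, violent, sexually explicit) follow the abstract, while the feature and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionBiLSTM(nn.Module):
    """BiLSTM over per-frame CNN features with learned softmax attention pooling.
    Three output classes follow the abstract; feature/hidden sizes are assumptions."""
    def __init__(self, feat_dim=2560, hidden=128, num_classes=3):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one relevance score per time step
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, frame_feats):                 # frame_feats: (batch, frames, feat_dim)
        out, _ = self.bilstm(frame_feats)           # (batch, frames, 2*hidden)
        weights = torch.softmax(self.attn(out), dim=1)
        context = (weights * out).sum(dim=1)        # attention-weighted summary over frames
        return self.fc(context)

logits = AttentionBiLSTM()(torch.randn(2, 32, 2560))
print(logits.shape)  # torch.Size([2, 3])
```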
35. Shots segmentation-based optimized dual-stream framework for robust human activity recognition in surveillance video.
- Author
-
Hussain, Altaf, Khan, Samee Ullah, Khan, Noman, Ullah, Waseem, Alkhayyat, Ahmed, Alharbi, Meshal, and Baik, Sung Wook
- Subjects
HUMAN activity recognition, VIDEO surveillance, TRANSFORMER models, COMPUTER vision, CONVOLUTIONAL neural networks, FEATURE extraction
- Abstract
Nowadays, for controlling crime, surveillance cameras are typically installed in all public places to ensure urban safety and security. However, automating Human Activity Recognition (HAR) using computer vision techniques faces several challenges such as lowlighting, complex spatiotemporal features, clutter backgrounds, and inefficient utilization of surveillance system resources. Existing attempts in HAR designed straightforward networks by analyzing either spatial or motion patterns resulting in limited performance while the dual streams methods are entirely based on Convolutional Neural Networks (CNN) that are inadequate to learning the long-range temporal information for HAR. To overcome the above-mentioned challenges, this paper proposes an optimized dual stream framework for HAR which mainly consists of three steps. First, a shots segmentation module is introduced in the proposed framework to efficiently utilize the surveillance system resources by enhancing the lowlight video stream and then it detects salient video frames that consist of human. This module is trained on our own challenging Lowlight Human Surveillance Dataset (LHSD) which consists of both normal and different levels of lowlighting data to recognize humans in complex uncertain environments. Next, to learn HAR from both contextual and motion information, a dual stream approach is used in the feature extraction. In the first stream, it freezes the learned weights of the backbone Vision Transformer (ViT) B-16 model to select the discriminative contextual information. In the second stream, ViT features are then fused with the intermediate encoder layers of FlowNet2 model for optical flow to extract a robust motion feature vector. Finally, a two stream Parallel Bidirectional Long Short-Term Memory (PBiLSTM) is proposed for sequence learning to capture the global semantics of activities, followed by Dual Stream Multi-Head Attention (DSMHA) with a late fusion strategy to optimize the huge features vector for accurate HAR. To assess the strength of the proposed framework, extensive empirical results are conducted on real-world surveillance scenarios and various benchmark HAR datasets that achieve 78.6285%, 96.0151%, and 98.875% accuracies on HMDB51, UCF101, and YouTube Action, respectively. Our results show that the proposed strategy outperforms State-of-the-Art (SOTA) methods. The proposed framework gives superior performance in HAR, providing accurate and reliable recognition of human activities in surveillance systems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
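The dual-stream design described in entry 35 (a frozen ViT-B/16 contextual stream plus a motion stream, per-stream sequence learning, and multi-head attention fusion) can be approximated with off-the-shelf components. The sketch below is an illustration under assumptions: the motion features are taken as a generic per-frame vector rather than FlowNet2 encoder activations, a single BiLSTM per stream stands in for the paper's PBiLSTM, and cross-attention replaces the DSMHA and late-fusion details.

```python
# Illustrative sketch under assumptions: a frozen ViT-B/16 contextual stream and
# a generic motion-feature stream, each with its own BiLSTM, fused by multi-head
# cross-attention. FlowNet2 fusion, PBiLSTM, and DSMHA details are not reproduced.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class DualStreamHAR(nn.Module):
    def __init__(self, motion_dim=512, hidden=256, num_classes=51):
        super().__init__()
        vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        vit.heads = nn.Identity()                # 768-d embedding per 224x224 frame
        for p in vit.parameters():
            p.requires_grad = False              # frozen contextual backbone
        self.vit = vit
        self.ctx_lstm = nn.LSTM(768, hidden, batch_first=True, bidirectional=True)
        self.mot_lstm = nn.LSTM(motion_dim, hidden, batch_first=True, bidirectional=True)
        self.fuse_attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, motion_feats):     # (B, T, 3, 224, 224), (B, T, motion_dim)
        b, t = frames.shape[:2]
        ctx = self.vit(frames.flatten(0, 1)).view(b, t, -1)
        ctx_seq, _ = self.ctx_lstm(ctx)
        mot_seq, _ = self.mot_lstm(motion_feats)
        fused, _ = self.fuse_attn(ctx_seq, mot_seq, mot_seq)  # context queries attend to motion
        return self.head(fused.mean(dim=1))
```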
36. Enhancing robustness in video recognition models: Sparse adversarial attacks and beyond.
- Author
-
Mu, Ronghui, Marcolino, Leandro, Ni, Qiang, and Ruan, Wenjie
- Subjects
- *
ARTIFICIAL neural networks , *VIDEOS , *STEREO vision (Computer science) , *PRESSURE gages - Abstract
Recent years have witnessed increasing interest in adversarial attacks on images, while adversarial video attacks have seldom been explored. In this paper, we propose a sparse adversarial attack strategy on videos (DeepSAVA). Our model aims to add a small human-imperceptible perturbation to the key frame of the input video to fool the classifiers. To carry out an effective attack that mirrors real-world scenarios, our algorithm integrates spatial transformation perturbations into the frame. Instead of using the $l_p$ norm to gauge the disparity between the perturbed frame and the original frame, we employ the structural similarity index (SSIM), which has been established as a more suitable metric for quantifying image alterations resulting from spatial perturbations. We employ a unified optimisation framework to combine spatial transformation with additive perturbation, thereby attaining a more potent attack. We design an effective and novel optimisation scheme that alternately utilises Bayesian Optimisation (BO) to identify the most critical frame in a video and stochastic gradient descent (SGD) based optimisation to produce both additive and spatial-transformed perturbations. Doing so enables DeepSAVA to perform a very sparse attack on videos for maintaining human imperceptibility while still achieving state-of-the-art performance in terms of both attack success rate and adversarial transferability. Furthermore, built upon the strong perturbations produced by DeepSAVA, we design a novel adversarial training framework to improve the robustness of video classification models. Our intensive experiments on various types of deep neural networks and video datasets confirm the superiority of DeepSAVA in terms of attacking performance and efficiency. When compared to the baseline techniques, DeepSAVA exhibits the highest level of performance in generating adversarial videos for three distinct video classifiers. Remarkably, it achieves an impressive fooling rate ranging from 99.5% to 100% for the I3D model, with the perturbation of just a single frame. Additionally, DeepSAVA demonstrates favourable transferability across various time series models. The proposed adversarial training strategy is also empirically demonstrated with better performance on training robust video classifiers compared with the state-of-the-art adversarial training with projected gradient descent (PGD) adversary. • Sparse attacks on video models: perturb fewer frames to gain high fooling rate. • Combining additive and spatial perturbations to enhance attacking performance. • Using SSIM instead of the $l_p$-norm to maintain the human perception. • Applying Bayesian Optimisation to identify the most critical frame to perturb. • A new adversarial training method based on a combination of diverse perturbations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
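The core optimisation loop of an SSIM-constrained single-frame attack like the one described in entry 36 can be sketched compactly. This is not DeepSAVA itself: it covers only the additive perturbation and the SSIM term (here computed with torchmetrics), omits the spatial-transformation component and the Bayesian optimisation used to pick the key frame, and the step count, learning rate, and weight `alpha` are placeholder assumptions.

```python
# Rough sketch of the additive half of an SSIM-constrained single-frame attack.
# The spatial-transformation term and the Bayesian optimisation that selects the
# key frame are omitted; steps, lr, and alpha are placeholder assumptions.
import torch
import torch.nn.functional as F
from torchmetrics.functional import structural_similarity_index_measure as ssim

def attack_single_frame(model, video, label, frame_idx, steps=50, lr=0.01, alpha=1.0):
    # video: (1, T, 3, H, W) in [0, 1]; only frames[:, frame_idx] is perturbed
    delta = torch.zeros_like(video[:, frame_idx], requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        adv = video.clone()
        adv[:, frame_idx] = (video[:, frame_idx] + delta).clamp(0, 1)
        loss = (-F.cross_entropy(model(adv), label)                            # maximise misclassification
                + alpha * (1 - ssim(adv[:, frame_idx], video[:, frame_idx])))  # keep SSIM high
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adv.detach()
```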
37. Evaluation of Taekwondo Poomsae movements using skeleton points.
- Author
-
Fernando, M., Sandaruwan, K. D., and Athapaththu, A. M. K. B.
- Subjects
TAE kwon do ,MARTIAL arts ,CAMERA phones ,VIDEO excerpts ,CELL phones ,MACHINE learning ,COMPUTER vision - Abstract
Taekwondo is a widely practised martial art and an Olympic sport. In Taekwondo, Poomsae movements are essential, as they form the foundation of the sport and are fundamental for success in competitions. The evaluation of Poomsae movements has traditionally been a subjective process that relies heavily on human judgment. This study addresses this issue by developing a systematic approach to evaluating Poomsae movements using computer vision. A long short-term memory-based (LSTM-based) machine learning (ML) model was developed and evaluated for its effectiveness in Poomsae movement evaluation. The study also aimed to develop this model as an assistant for self-evaluation that enables Taekwondo players to enhance their skills at their own pace. A dataset was created specifically for this study by recording the Poomsae movements of Taekwondo players from the University of Colombo. The technical infrastructure used to capture skeleton point data was cost-effective and easily replicable in other settings. Short video clips containing Taekwondo movements were recorded using a mobile phone camera, and the skeleton point data were extracted using the MediaPipe Python library. The model achieved an accuracy of 61% when compared with the domain experts' results. Overall, the study achieved its objective of defining a self-paced approach to evaluating Poomsae while overcoming the human subjectivity that is otherwise unavoidable in manual evaluation. The feedback of domain experts was also used to fine-tune the model for better performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
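The capture step described in entry 37, extracting skeleton points from phone-camera clips with the MediaPipe Python library, is inexpensive to replicate. A rough sketch is given below; the file name is a placeholder, and feeding the resulting per-frame landmark array into an LSTM scorer is left out.

```python
# Sketch of the capture step: 33 MediaPipe Pose landmarks per frame from a phone
# video, giving an (n_frames, 99) array for a downstream LSTM scorer. The file
# name is a placeholder.
import cv2
import numpy as np
import mediapipe as mp

def extract_skeleton_sequence(video_path):
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    rows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            rows.append([(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark])
    cap.release()
    pose.close()
    return np.asarray(rows, dtype=np.float32).reshape(-1, 33 * 3)

sequence = extract_skeleton_sequence("poomsae_clip.mp4")   # placeholder file name
```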
38. Edge-Enhanced TempoFuseNet: A Two-Stream Framework for Intelligent Multiclass Video Anomaly Recognition in 5G and IoT Environments.
- Author
-
Saleem, Gulshan, Bajwa, Usama Ijaz, Raza, Rana Hammad, and Zhang, Fan
- Subjects
VIDEO surveillance ,CONVOLUTIONAL neural networks ,INTERNET of things ,5G networks ,OPTICAL flow ,FEATURE extraction - Abstract
Surveillance video analytics encounters unprecedented challenges in 5G and IoT environments, including complex intra-class variations, short-term and long-term temporal dynamics, and variable video quality. This study introduces Edge-Enhanced TempoFuseNet, a cutting-edge framework that strategically reduces spatial resolution to allow the processing of low-resolution images. A dual upscaling methodology based on bicubic interpolation and an encoder–bank–decoder configuration is used for anomaly classification. The two-stream architecture combines the power of a pre-trained Convolutional Neural Network (CNN) for spatial feature extraction from RGB imagery in the spatial stream, while the temporal stream focuses on learning short-term temporal characteristics, reducing the computational burden of optical flow. To analyze long-term temporal patterns, the extracted features from both streams are combined and routed through a Gated Recurrent Unit (GRU) layer. The proposed framework (TempoFuseNet) outperforms the encoder–bank–decoder model in terms of performance metrics, achieving a multiclass macro average accuracy of 92.28%, an F1-score of 69.29%, and a false positive rate of 4.41%. This study presents a significant advancement in the field of video anomaly recognition and provides a comprehensive solution to the complex challenges posed by real-world surveillance scenarios in the context of 5G and IoT. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. Movie Visual and Speech Analysis Through Multi-Modal LLM for Recommendation Systems
- Author
-
Peixuan Qi
- Subjects
Deep learning ,large language model ,multimodality ,movie analysis ,transformer ,video classification ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Understanding speech as a component of broader video comprehension within audio-visual large language models remains a critical yet underexplored area. Previous research has predominantly tackled this challenge by adapting models developed for conventional video classification tasks, such as action recognition or event detection. However, these models often overlook the linguistic elements present in videos, such as narrations or dialogues, which can implicitly convey high-level semantic information related to movie understanding, including narrative structure or contextual background. Moreover, existing methods are generally configured to encode the entire video content, which can lead to inefficiencies in genre classification tasks. In this paper, we propose a multi-modal Large Language Model (LLM) framework, termed Visual-Speech Multimodal LLM (VSM-LLM), for analyzing movie visual and speech data to predict movie genre. The model incorporates an advanced MGC Q-Former architecture, enabling fine-grained, temporal alignment of audio-visual features across various time scales. On the MovieNet dataset, VSM-LLM attains 40.3% and 55.3% in macro and micro recall@0.5, respectively, outperforming existing baselines. On the Condensed Movies dataset, VSM-LLM achieves 43.5% in macro recall@0.5 and 53.5% in micro recall@0.5, further confirming its superior genre classification performance.
- Published
- 2024
- Full Text
- View/download PDF
40. Automated Detection of Acute Respiratory Distress Using Temporal Visual Information
- Author
-
Wajahat Nawaz, Philippe Jouvet, and Rita Noumeir
- Subjects
Acute respiratory distress ,deep convolutional neural networks ,retraction signs ,Silverman scoring ,transfer learning ,video classification ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The Pediatric Intensive Care Unit (PICU) receives critically ill patients with shortness of breath and poor body oxygenation. Various respiratory parameters, such as respiratory rate, oxygen saturation level, and heart rate, are continuously monitored so that their management can be adapted in a timely manner. With advances in technology, most of these parameters are measured by medical instruments. However, some crucial parameters are still assessed by visual examination, particularly chest deformation, which is vital in assessing acute respiratory distress (ARD). Visual examination is subjective and intermittent, prone to human error, and difficult to sustain around the clock. This subjectivity becomes especially problematic in areas with a shortage of specialists, such as remote locations and developing countries, or during pandemics. In this paper, we propose an automated ARD detection system to address the challenges associated with visual examination. The proposed approach uses a high-definition camera to capture temporal visual information of the patient and employs advanced deep-learning models to detect the ARD condition. To test feasibility, we collected video data from 153 patients in the PICU, both with and without ARD. Deep learning models require substantial amounts of data, and collecting data in the medical domain, particularly in the PICU, is challenging. To overcome this limited-data problem, we utilized problem-specific information and adopted transfer learning and data augmentation techniques. Additionally, we computed baseline results for various video analysis algorithms on the ARD detection task. Experimental results show that deep learning-based video analysis algorithms have the potential to automate the visual examination process for ARD detection, achieving an accuracy of 0.82, a precision of 0.80, a recall of 0.89, and an $F_{1}$ score of 0.84.
- Published
- 2024
- Full Text
- View/download PDF
41. Metric-Based Frame Selection and Deep Learning Model With Multi-Head Self Attention for Classification of Ultrasound Lung Video Images
- Author
-
Ebrahim A. Nehary, Sreeraman Rajan, and Carlos Rossa
- Subjects
Ultrasound ,COVID-19 ,frame selection ,deep learning ,frame classification ,video classification ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Detection of COVID-19 manifestations in lung ultrasound (US) images has gained attention in recent times. The current state-of-the-art technique for distinguishing a healthy lung from a COVID-19-infected or bacterial pneumonia-infected lung uses non-adjacent or equally spaced frames from the video; however, frame content and the correlation between the selected frames have not been taken into consideration during frame selection. In this paper, a metric-based frame selection approach is proposed for three-way classification of lung US videos, and the influence of the frame selection method on classification accuracy is studied. A deep learning model is proposed comprising a pre-trained model (VGG16) for feature extraction, multi-head attention for feature calibration, global averaging for feature reduction, and a dense layer for classification. The pre-trained model is re-trained using cross-entropy loss with balanced class weights to handle class imbalance. Two classification approaches are considered: (i) a few frames in a video are selected using the proposed metrics; and (ii) all frames in a video are used. With VGG16 as the pre-trained model, mean balanced sensitivities of 0.82, 0.89, and 0.87 were achieved for the COVID-19, bacterial pneumonia, and healthy classes, respectively, using 5-fold cross-validation. The results show that even random selection of frames performs better than fixed frame selection, and that the proposed frame selection method outperforms state-of-the-art fixed frame selection irrespective of the backbone model used for lung US classification.
- Published
- 2024
- Full Text
- View/download PDF
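Two elements highlighted in entry 41, class-balanced cross-entropy weights and multi-head attention used to recalibrate per-frame VGG16 features before global averaging, can be sketched as follows. The feature dimension, head count, and weight formula are assumptions for illustration rather than the authors' exact configuration.

```python
# Hedged sketch of two elements from the abstract: class-balanced cross-entropy
# weights and multi-head self-attention recalibrating per-frame VGG16 features
# before global averaging. Shapes and the weight formula are assumptions.
import numpy as np
import torch
import torch.nn as nn

def balanced_class_weights(labels, num_classes=3):
    counts = np.bincount(labels, minlength=num_classes)
    return torch.tensor(counts.sum() / (num_classes * np.maximum(counts, 1)),
                        dtype=torch.float32)

class FrameAttentionHead(nn.Module):
    def __init__(self, feat_dim=512, num_classes=3, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):              # (B, n_selected_frames, feat_dim)
        calibrated, _ = self.attn(frame_feats, frame_feats, frame_feats)
        return self.fc(calibrated.mean(dim=1))   # global average over frames

# COVID-19 / bacterial pneumonia / healthy, weighted to counter class imbalance
criterion = nn.CrossEntropyLoss(weight=balanced_class_weights(np.array([0, 0, 1, 2, 2, 2])))
```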
42. Breathe out the Secret of the Lung: Video Classification of Exhaled Flows from Normal and Asthmatic Lung Models Using CNN-Long Short-Term Memory Networks
- Author
-
Mohamed Talaat, Xiuhua Si, and Jinxiang Xi
- Subjects
video classification ,CNN-LSTM network ,lung diagnosis ,exhaled flows ,vortex dynamics ,heat map ,Internal medicine ,RC31-1245 ,Medicine (General) ,R5-920 - Abstract
In this study, we present a novel approach to differentiate normal and diseased lungs based on exhaled flows from 3D-printed lung models simulating normal and asthmatic conditions. By leveraging the sequential learning capacity of the Long Short-Term Memory (LSTM) network and the automatic feature extraction of convolutional neural networks (CNN), we evaluated the feasibility of the automatic detection and staging of asthmatic airway constrictions. Two asthmatic lung models (D1, D2) with increasing levels of severity were generated by decreasing the bronchiolar calibers in the right upper lobe of a normal lung (D0). Expiratory flows were recorded in the mid-sagittal plane using a high-speed camera at 1500 fps. In addition to the baseline flow rate (20 L/min) with which the networks were trained and verified, two additional flow rates (15 L/min and 10 L/min) were considered to evaluate the network’s robustness to flow deviations. Distinct flow patterns and vortex dynamics were observed among the three disease states (D0, D1, D2) and across the three flow rates. The AlexNet-LSTM network proved to be robust, maintaining perfect performance in the three-class classification when the flow deviated from the recommendation by 25%, and still performed reasonably (72.8% accuracy) despite a 50% flow deviation. The GoogleNet-LSTM network also showed satisfactory performance (91.5% accuracy) at a 25% flow deviation but exhibited low performance (57.7% accuracy) when the deviation was 50%. Considering the sequential learning effects in this classification task, video classifications only slightly outperformed those using still images (i.e., 3–6%). The occlusion sensitivity analyses showed distinct heat maps specific to the disease state.
- Published
- 2023
- Full Text
- View/download PDF
43. Weighted voting ensemble of hybrid CNN-LSTM Models for vision-based human activity recognition
- Author
-
Aggarwal, Sajal, Bhola, Geetanjali, and Vishwakarma, Dinesh Kumar
- Published
- 2024
- Full Text
- View/download PDF
44. TEAM: Transformer Encoder Attention Module for Video Classification.
- Author
-
Hae Sung Park and Yong Suk Choi
- Subjects
VIDEOS ,DEEP learning ,TRANSFORMER models ,FEATURE extraction ,SOCIAL context - Abstract
Much like humans focus solely on object movement to understand actions, directing a deep learning model's attention to the core contexts within videos is crucial for improving video comprehension. In a recent study, the Video Masked Auto-Encoder (VideoMAE) employs a pre-training approach with a high ratio of tube masking and reconstruction, effectively mitigating the spatial bias caused by temporal redundancy in full video frames. This steers the model's focus toward detailed temporal contexts. However, because VideoMAE still relies on full video frames during the action recognition stage, it may exhibit a progressive shift in attention towards spatial contexts, deteriorating its ability to capture the main spatio-temporal contexts. To address this issue, we propose an attention-directing module named the Transformer Encoder Attention Module (TEAM). The proposed module effectively directs the model's attention to the core characteristics within each video, inherently mitigating spatial bias. The TEAM first identifies the core features among the overall features extracted from each video. It then discerns the specific parts of the video where those features are located, encouraging the model to focus more on these informative parts. Consequently, during the action recognition stage, the proposed TEAM effectively shifts VideoMAE's attention from spatial contexts towards the core spatio-temporal contexts. This attention shift alleviates the spatial bias in the model and simultaneously enhances its ability to capture precise video contexts. We conduct extensive experiments to explore the optimal configuration that enables the TEAM to fulfill its intended design purpose and facilitates its seamless integration with the VideoMAE framework. The integrated model, i.e., VideoMAE+TEAM, outperforms the existing VideoMAE by a significant margin on Something-Something-V2 (71.3% vs. 70.3%). Moreover, qualitative comparisons demonstrate that the TEAM encourages the model to disregard insignificant features and focus more on the essential video features, capturing more detailed spatio-temporal contexts within the video. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. A Multi-Scale Video Longformer Network for Action Recognition.
- Author
-
Chen, Congping, Zhang, Chunsheng, and Dong, Xin
- Subjects
CONVOLUTIONAL neural networks ,VIDEO monitors ,VIDEO surveillance ,SECURITY classification (Government documents) ,RECOGNITION (Psychology) ,COMPUTATIONAL complexity ,ELECTRIC transformers - Abstract
Action recognition has found extensive applications in fields such as video classification and security monitoring. However, existing action recognition methods, such as those based on 3D convolutional neural networks, often struggle to capture comprehensive global information. Meanwhile, transformer-based approaches face challenges associated with excessively high computational complexity. We introduce a Multi-Scale Video Longformer network (MSVL), built upon the 3D Longformer architecture featuring a "local attention + global features" attention mechanism, enabling us to reduce computational complexity while preserving global modeling capabilities. Specifically, MSVL gradually reduces the video feature resolution and increases the feature dimensions across four stages. In the lower layers of the network (stage 1, stage 2), we leverage local window attention to alleviate local redundancy and computational demands. Concurrently, global tokens are employed to retain global features. In the higher layers of the network (stage 3, stage 4), this local window attention evolves into a dense computation mechanism, enhancing overall performance. Finally, extensive experiments are conducted on UCF101 (97.6%), HMDB51 (72.9%), and the assembly action dataset (100.0%), demonstrating the effectiveness and efficiency of the MSVL. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Video Classification of Cloth Simulations: Deep Learning and Position-Based Dynamics for Stiffness Prediction.
- Author
-
Mao, Makara, Va, Hongly, and Hong, Min
- Subjects
- *
DEEP learning , *TRANSFORMER models , *COMPUTER graphics , *TEXTILES , *VIRTUAL reality , *VIDEOS - Abstract
In virtual reality, augmented reality, or animation, the goal is to represent the movement of deformable objects in the virtual world as faithfully as possible to their movement in the real world. Therefore, this paper proposes a method to automatically extract cloth stiffness values from video scenes, which are then applied as material properties for virtual cloth simulation. We propose the use of deep learning (DL) models to tackle this issue. The Transformer model, in combination with pre-trained architectures such as DenseNet121, ResNet50, VGG16, and VGG19, stands as a leading choice for video classification tasks. Position-Based Dynamics (PBD) is a computational framework widely used in computer graphics and physics-based simulations for deformable entities, notably cloth. It provides an inherently stable and efficient way to replicate complex dynamic behaviors, such as folding, stretching, and collision interactions. Our proposed model characterizes virtual cloth using softness-to-stiffness labels and accurately categorizes videos according to this labeling. The cloth movement dataset used in this research is derived from a carefully designed stiffness-oriented cloth simulation. Our experimental assessment encompasses an extensive dataset of 3840 videos, contributing to a multi-label video classification dataset. Our results demonstrate that the proposed model achieves an impressive average accuracy of 99.50%, significantly outperforming alternative models such as RNN, GRU, LSTM, and Transformer. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
47. Motion magnification-inspired feature manipulation for deepfake detection.
- Author
-
MIRZAYEV, Aydamir and DİBEKLİOĞLU, Hamdi
- Subjects
- *
GRAPHICS processing units , *MOTION , *COMPUTER vision , *DEEP learning - Abstract
Recent advances in deep learning, the increased availability of large-scale datasets, and improvements in accelerated graphics processing units have facilitated the creation of an unprecedented amount of synthetically generated media content with impressive visual quality. Although such technology is used predominantly for entertainment, there is a widespread practice of using deepfake technology for malevolent ends. This potential for malicious use necessitates detection methods capable of reliably distinguishing manipulated video content. In this work, we aim to create a learning-based detection method for synthetically generated videos. To this end, we attempt to detect spatiotemporal inconsistencies by leveraging a learning-based, magnification-inspired feature manipulation unit. Although there is existing literature on the use of motion magnification as a preprocessing step for deepfake detection, in our work we aim to utilize learning-based magnification elements to develop an end-to-end deepfake detection model. In this research, we investigate different variations of feature manipulation networks, with both spatially constant and spatially varying amplification. To clarify, although the proposed model draws from existing literature on motion magnification, we do not perform motion magnification in our experiments but instead use the underlying architecture of such networks for feature enhancement. Our objective with this work is to take a step towards applying learnable motion manipulation to improve the accuracy of the task at hand. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
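A magnification-inspired feature manipulation unit of the kind discussed in entry 47 can be illustrated, under assumptions, as a learnable block that re-amplifies the difference between two frames' feature maps before classification. The sketch below is not the authors' architecture; the channel count and the amplification factor are placeholders.

```python
# Loose sketch of a magnification-inspired feature manipulation unit: the
# temporal difference of two frame encodings is passed through a small learnable
# block and re-amplified. Channel count and the factor `amp` are placeholders.
import torch
import torch.nn as nn

class FeatureManipulator(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.diff_encoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feat_prev, feat_next, amp=5.0):
        delta = self.diff_encoder(feat_next - feat_prev)   # learned inter-frame change
        return feat_prev + amp * delta                     # amplified feature map

out = FeatureManipulator()(torch.randn(1, 128, 28, 28), torch.randn(1, 128, 28, 28))
```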
48. MultiFusedNet: A Multi-Feature Fused Network of Pretrained Vision Models via Keyframes for Student Behavior Classification.
- Author
-
Nindam, Somsawut, Na, Seung-Hoon, and Lee, Hyo Jong
- Subjects
DEEP learning ,PSYCHOLOGY of students ,CONVOLUTIONAL neural networks ,ABILITY grouping (Education) ,COLOR space ,DATA augmentation ,PHYSIOLOGY education - Abstract
This research proposes a deep learning method for classifying student behavior in classrooms that follow the professional learning community teaching approach. We collected data on five student activities: hand-raising, interacting, sitting, turning around, and writing. We used the sum of absolute differences (SAD) in the LUV color space to detect scene changes. The K-means algorithm was then applied to the computed SAD values to select keyframes. Next, we extracted features using multiple pretrained deep learning models from the convolutional neural network family; the pretrained models considered were InceptionV3, ResNet50V2, VGG16, and EfficientNetB7. We leveraged feature fusion, incorporating optical flow features and data augmentation techniques, to enrich the spatial features of the selected keyframes. Finally, we classified the students' behavior using a deep sequence model based on a bidirectional long short-term memory network with an attention mechanism (BiLSTM-AT). The proposed method with the BiLSTM-AT model recognizes behaviors in our dataset with high precision, recall, and F1-score of 0.97, 0.97, and 0.97, respectively, and an overall accuracy of 96.67%. This high efficiency demonstrates the potential of the proposed method for classifying student behavior in classrooms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
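The keyframe selection step described in entry 48 (sum of absolute differences in the LUV colour space followed by K-means) can be prototyped with OpenCV and scikit-learn. This is a simplified sketch: the number of clusters, the file name, and the choice of the frame nearest each cluster centre are assumptions, not values from the paper.

```python
# Simplified sketch: per-frame SAD in LUV colour space, then K-means over the
# SAD values to pick one representative frame per cluster. k and the frame
# selection rule within each cluster are assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(video_path, k=8):
    cap = cv2.VideoCapture(video_path)
    frames, sads, prev = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        luv = cv2.cvtColor(frame, cv2.COLOR_BGR2LUV).astype(np.float32)
        sads.append(0.0 if prev is None else float(np.abs(luv - prev).sum()))
        frames.append(frame)
        prev = luv
    cap.release()
    sad_arr = np.asarray(sads).reshape(-1, 1)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(sad_arr)   # assumes >= k frames
    keyframes = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        nearest = idx[np.abs(sad_arr[idx, 0] - sad_arr[idx, 0].mean()).argmin()]
        keyframes.append(frames[nearest])        # frame closest to the cluster's mean SAD
    return keyframes

keys = select_keyframes("classroom_clip.mp4")    # placeholder file name
```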
49. Learning Multi-Expert Distribution Calibration for Long-Tailed Video Classification.
- Author
-
Hu, Yufan, Gao, Junyu, and Xu, Changsheng
- Published
- 2024
- Full Text
- View/download PDF
50. A novel keyframe extraction method for video classification using deep neural networks.
- Author
-
Savran Kızıltepe, Rukiye, Gan, John Q., and Escobar, Juan José
- Subjects
- *
ARTIFICIAL neural networks , *RECURRENT neural networks , *CONVOLUTIONAL neural networks , *ONE-way analysis of variance - Abstract
Combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs) produces a powerful architecture for video classification problems as spatial–temporal information can be processed simultaneously and effectively. Using transfer learning, this paper presents a comparative study to investigate how temporal information can be utilized to improve the performance of video classification when CNNs and RNNs are combined in various architectures. To enhance the performance of the identified architecture for effective combination of CNN and RNN, a novel action template-based keyframe extraction method is proposed by identifying the informative region of each frame and selecting keyframes based on the similarity between those regions. Extensive experiments on KTH and UCF-101 datasets with ConvLSTM-based video classifiers have been conducted. Experimental results are evaluated using one-way analysis of variance, which reveals the effectiveness of the proposed keyframe extraction method in the sense that it can significantly improve video classification accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF