16,067 results for "VIDEO processing"
Search Results
2. Artificial Intelligence Applications in Optical Sensor Technology
- Author
Gupta, Soni, Bhatt, Pramod Kumar, Mishra, Sumita, Kumar, Shivam, Das, Swagatam, Series Editor, Bansal, Jagdish Chand, Series Editor, Jaiswal, Ajay, editor, Anand, Sameer, editor, Hassanien, Aboul Ella, editor, and Azar, Ahmad Taher, editor
- Published
- 2025
- Full Text
- View/download PDF
3. Effect of a confidence-based weighted 3D point reconstruction for markerless motion capture with a reduced number of cameras.
- Author
Chaumeil, A., Muller, A., Dumas, R., and Robert, T.
- Subjects
MOTION capture (Human mechanics), CAMCORDERS, LIFTING & carrying (Human mechanics), MATERIALS handling, VIDEO processing, POSE estimation (Computer vision), CAMERAS
- Abstract
Markerless motion capture has been made available by the development of pose estimation algorithms that provide both an estimate of the location of body keypoints in two-dimensional images and its associated confidence. It seems relevant to use this additional information for three-dimensional (3D) point reconstruction. Yet this use has been little described, and its influence on 3D point reconstruction has received little attention. Eight participants performed a manual material handling task, which was recorded by 10 video cameras. Each video was processed using OpenPose. Different 3D point reconstruction methods were compared: direct linear transform (DLT) and weighted DLT (wDLT), with 10 cameras and with a subset of 4 cameras. For each keypoint, the confidence and the position deviation from the 3D point reconstructed with 10 cameras, projected in the image reference frame, were assessed. Results suggest that using confidence information reduces both the average and the maximum 3D distance between 3D points reconstructed with 4 and 10 cameras. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
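To make the confidence-weighted reconstruction in entry 3 concrete, the sketch below shows a minimal weighted direct linear transform (wDLT) triangulation in Python, in which each camera's two projection equations are scaled by the keypoint confidence reported by the 2D pose estimator. This is an illustrative reading of the abstract, not the authors' code; the projection matrices and confidence values are assumed inputs.

```python
import numpy as np

def weighted_dlt(projections, points_2d, confidences):
    """projections: list of 3x4 camera matrices; points_2d: list of (u, v);
    confidences: list of scalars in [0, 1]. Returns a 3D point (x, y, z)."""
    rows = []
    for P, (u, v), w in zip(projections, points_2d, confidences):
        rows.append(w * (u * P[2] - P[0]))   # weight each equation by confidence
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                               # null-space vector of the stacked system
    return X[:3] / X[3]                      # de-homogenize

# Plain DLT is recovered by passing a confidence of 1.0 for every camera.
```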
4. Advances in Video Analytics.
- Author
Şahin, Muhittin
- Subjects
DIGITAL learning, DIGITAL technology, STUDENT engagement, VIDEO processing, DATA analytics
- Abstract
Learners interact with content, assessments, peers, and instructors in digital learning environments. Videos, which are popular due to internet technologies, capture learners' attention, boost motivation, and enhance learning. Learning analytics broadly optimize educational environments by analyzing data, with video analytics focusing specifically on video interactions to enhance learning outcomes. Video-player interactions (e.g., play, pause) and video content interactions (e.g., true-false questions) provide insights into learner behaviors. Lack of interaction is a major reason for high dropout rates in video platforms and MOOCs. Video analytics can help address this issue by analyzing and improving engagement with video content. This special issue has a specific focus on video analytics and the impact of this field on the learning experience. Four articles were included in this special issue. The findings reveal that i) the type, length, and purpose of the video are important for student engagement, ii) important tips on video-based learning design are presented, iii) when interacting with the video player, pause, play, rewind, and fast forward are the most commonly used interaction types, iv) providing more information about video interaction processes with dashboards would provide much more insight, and v) dividing the videos into more than one section both creates the perception that the process is better structured and, through segmentation, contributes more to learning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. Hybrid attentive prototypical network for few-shot action recognition.
- Author
Ruan, Zanxi, Wei, Yingmei, Guo, Yanming, and Xie, Yuxiang
- Subjects
RECOGNITION (Psychology), FEATURE extraction, CONCEPT learning, VIDEO processing, SPINE
- Abstract
Most previous few-shot action recognition works tend to process video temporal and spatial features separately, resulting in insufficient extraction of comprehensive features. In this paper, a novel hybrid attentive prototypical network (HAPN) framework for few-shot action recognition is proposed. Distinguished by its joint processing of temporal and spatial information, the HAPN framework strategically manipulates these dimensions from feature extraction to the attention module, consequently enhancing its ability to perform action recognition tasks. Our framework utilizes the R(2+1)D backbone network, coupling the extraction of integrated temporal and spatial features to ensure a comprehensive understanding of video content. Additionally, our framework introduces the novel Residual Tri-dimensional Attention (ResTriDA) mechanism, specifically designed to augment feature information across the temporal, spatial, and channel dimensions. ResTriDA dynamically enhances crucial aspects of video features by amplifying significant channel-wise features for action distinction, accentuating spatial details vital for capturing the essence of actions within frames, and emphasizing temporal dynamics to capture movement over time. We further propose a prototypical attentive matching module (PAM) built on the concept of metric learning to resolve the overfitting issue common in few-shot tasks. We evaluate our HAPN framework on three classical few-shot action recognition datasets: Kinetics-100, UCF101, and HMDB51. The results indicate that our framework significantly outperforms state-of-the-art methods. Notably, on the 1-shot task, it demonstrated an increase of 9.8% in accuracy on UCF101 and improvements of 3.9% on HMDB51 and 12.4% on Kinetics-100. These gains confirm the robustness and effectiveness of our approach in leveraging limited data for precise action recognition. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Ultrahigh-fidelity full-color holographic display via color-aware optimization.
- Author
Chen, Chun, Nam, Seung-Woo, Kim, Dongyeon, Lee, Juhyun, Jeong, Yoonchan, and Lee, Byoungho
- Subjects
HOLOGRAPHIC displays, HOLOGRAPHY, DIGITAL holographic microscopy, VIDEO processing
- Abstract
Holographic display offers the capability to generate high-quality images with a wide color gamut since it is laser-driven. However, many existing holographic display techniques fail to fully exploit this potential, primarily due to the system's imperfections. Such flaws often result in inaccurate color representation, and there is a lack of an efficient way to address this color accuracy issue. In this study, we develop a color-aware hologram optimization approach for color-accurate holographic displays. Our approach integrates both laser and camera into the hologram optimization loop, enabling dynamic optimization of the laser's output color and the acquisition of physically captured feedback. Moreover, we improve the efficiency of the color-aware optimization process for holographic video displays. We introduce a cascade optimization strategy, which leverages redundant neighbor hologram information to accelerate the iterative process. We evaluate our method through both simulation and optical experiments, demonstrating its superiority over previous algorithms in terms of image quality, color accuracy, and hologram optimization speed. Our approach demonstrates a promising way to realize high-fidelity images in holographic displays, providing a new direction toward practical holographic display. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. Basketball-SORT: an association method for complex multi-object occlusion problems in basketball multi-object tracking.
- Author
Hu, Qingrui, Scott, Atom, Yeung, Calvin, and Fujii, Keisuke
- Subjects
OBJECT recognition (Computer vision), BASKETBALL games, SPORTS films, COMPUTER vision, VIDEO processing
- Abstract
Recent deep learning-based object detection approaches have led to significant progress in multi-object tracking (MOT) algorithms. The current MOT methods mainly focus on pedestrian or vehicle scenes, but basketball sports scenes are usually accompanied by three or more object occlusion problems with similar appearances and high-intensity complex motions, which we call complex multi-object occlusion (CMOO). Here, we propose an online and robust MOT approach, named Basketball-SORT, which focuses on the CMOO problems in basketball videos. To overcome the CMOO problem, instead of using the intersection-over-union-based (IoU-based) approach, we use the trajectories of neighboring frames based on the projected positions of the players. Our method introduces the basketball game restriction (BGR) and reacquiring long-lost IDs (RLLI) modules based on the characteristics of basketball scenes, and we also solve the occlusion problem using the player trajectories and appearance features. Experimental results show that our method achieves a Higher Order Tracking Accuracy (HOTA) score of 63.48% on the basketball fixed video dataset (5v5) and 66.45% on the 3v3 basketball dataset. These results outperform other recent popular approaches on both datasets, demonstrating the robustness of our method across different basketball game formats. Overall, our approach solves the CMOO problem more effectively than recent MOT algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. Runner re-identification from single-view running video in the open-world setting.
- Author
Suzuki, Tomohiro, Tsutsui, Kazushi, Takeda, Kazuya, and Fujii, Keisuke
- Subjects
VIDEO processing, SPORTS films, RUNNERS (Sports), ATHLETES, COMPUTER vision
- Abstract
In many sports, player re-identification is crucial for automatic video processing and analysis. However, most of the current studies on player re-identification in multi- or single-view sports videos focus on re-identification in the closed-world setting using labeled image datasets, and player re-identification in the open-world setting for automatic video analysis is not well developed. In this paper, we propose a runner re-identification system that directly processes single-view video to address the open-world setting, in which labeled datasets are not available and raw video must be processed directly. The proposed system automatically processes raw video as input to identify runners, and it can identify runners even when they move out of the frame multiple times. For the automatic processing, we first detect the runners in the video using the pre-trained YOLOv8 and the fine-tuned EfficientNet. We then track the runners using ByteTrack and detect their shoes with the fine-tuned YOLOv8. Finally, we extract the image features of the runners using an unsupervised method based on a gated recurrent unit autoencoder and the mixing of global and local features. To improve the accuracy of runner re-identification, we use shoe images as local image features and dynamic features of running sequence images. We evaluated the system on a running practice video dataset and showed that the proposed method identified runners with higher accuracy than some state-of-the-art models in unsupervised re-identification. We also showed that our proposed local image feature and running dynamic feature were effective for runner re-identification. Our runner re-identification system can be useful for the automatic analysis of running videos. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
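The detection stage described in entry 8 can be sketched with the off-the-shelf Ultralytics YOLOv8 API, as below: people are detected in each frame and cropped for the downstream re-identification features. This is a simplified stand-in for the authors' pipeline; the ByteTrack association, fine-tuned EfficientNet filtering, shoe detector, and GRU autoencoder features are omitted, and the weights file is the default pretrained checkpoint rather than the paper's model.

```python
import cv2
from ultralytics import YOLO

def detect_runner_crops(video_path, conf=0.5):
    model = YOLO("yolov8n.pt")                  # pretrained detector; class 0 = person
    cap = cv2.VideoCapture(video_path)
    crops = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame, classes=[0], conf=conf, verbose=False)[0]
        for box in result.boxes.xyxy.cpu().numpy().astype(int):
            x1, y1, x2, y2 = box
            crops.append(frame[y1:y2, x1:x2])
        # ByteTrack association and appearance-feature extraction would follow here.
    cap.release()
    return crops
```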
9. A Systematic Review of Event-Matching Methods for Complex Event Detection in Video Streams.
- Author
Honarparvar, Sepehr, Ashena, Zahra Bagheri, Saeedi, Sara, and Liang, Steve
- Abstract
Complex Event Detection (CED) in video streams involves numerous challenges such as object detection, tracking, spatio–temporal relationship identification, and event matching, which are often complicated by environmental variations, occlusions, and tracking losses. This systematic review presents an analysis of CED methods for video streams described in publications from 2012 to 2024, focusing on their effectiveness in addressing key challenges and identifying trends, research gaps, and future directions. A total of 92 studies were categorized into four main groups: training-based methods, object detection and spatio–temporal matching, multi-source solutions, and others. Each method's strengths, limitations, and applicability are discussed, providing an in-depth evaluation of their capabilities to support real-time video analysis and live camera feed applications. This review highlights the increasing demand for advanced CED techniques in sectors like security, safety, and surveillance and outlines the key opportunities for future research in this evolving field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
10. Conditional Font Generation With Content Pre‐Train and Style Filter.
- Author
Hong, Yang, Li, Yinfei, Qiao, Xiaojun, and Zhang, Junsong
- Subjects
IMAGE processing, VIDEO processing, CHINESE characters
- Abstract
Automatic font generation aims to streamline the design process by creating new fonts with minimal style references. This technology significantly reduces the manual labour and costs associated with traditional font design. Image‐to‐image translation has been the dominant approach, transforming font images from a source style to a target style using a few reference images. However, this framework struggles to fully decouple content from style, particularly when dealing with significant style shifts. Despite these limitations, image‐to‐image translation remains prevalent due to two main challenges faced by conditional generative models: (1) inability to handle unseen characters and (2) difficulty in providing precise content representations equivalent to the source font. Our approach tackles these issues by leveraging recent advancements in Chinese character representation research to pre‐train a robust content representation model. This model not only handles unseen characters but also generalizes to non‐existent ones, a capability absent in traditional image‐to‐image translation. We further propose a Transformer‐based Style Filter that not only accurately captures stylistic features from reference images but also handles any combination of them, fostering greater convenience for practical automated font generation applications. Additionally, we incorporate content loss with commonly used pixel‐ and perceptual‐level losses to refine the generated results from a comprehensive perspective. Extensive experiments validate the effectiveness of our method, particularly its ability to handle unseen characters, demonstrating significant performance gains over existing state‐of‐the‐art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
11. U-Net and Its Variants Based Automatic Tracking of Radial Artery in Ultrasonic Short-Axis Views: A Pilot Study.
- Author
Tian, Yuan, Gao, Ruiyang, Shi, Xinran, Lang, Jiaxin, Xue, Yang, Wang, Chunrong, Zhang, Yuelun, Shen, Le, Yu, Chunhua, and Zhou, Zhuhuang
- Subjects
RADIAL artery, AUTOMATIC tracking, DEEP learning, ARTERIAL catheterization, VIDEO processing
- Abstract
Background/Objectives: Radial artery tracking (RAT) in the short-axis view is a pivotal step for ultrasound-guided radial artery catheterization (RAC), which is widely employed in various clinical settings. To eliminate disparities and lay the foundations for automated procedures, a pilot study was conducted to explore the feasibility of U-Net and its variants in automatic RAT. Methods: Approved by the institutional ethics committee, patients as potential RAC candidates were enrolled, and the radial arteries were continuously scanned by B-mode ultrasonography. All acquired videos were processed into standardized images, and randomly divided into training, validation, and test sets in an 8:1:1 ratio. Deep learning models, including U-Net and its variants, such as Attention U-Net, UNet++, Res-UNet, TransUNet, and UNeXt, were utilized for automatic RAT. The performance of the deep learning architectures was assessed using loss functions, dice similarity coefficient (DSC), and Jaccard similarity coefficient (JSC). Performance differences were analyzed using the Kruskal–Wallis test. Results: The independent datasets comprised 7233 images extracted from 178 videos of 135 patients (53.3% women; mean age: 41.6 years). Consistent convergence of loss functions between the training and validation sets was achieved for all models except Attention U-Net. Res-UNet emerged as the optimal architecture in terms of DSC and JSC (93.14% and 87.93%), indicating a significant improvement compared to U-Net (91.79% vs. 86.19%, p < 0.05) and Attention U-Net (91.20% vs. 85.02%, p < 0.05). Conclusions: This pilot study validates the feasibility of U-Net and its variants in automatic RAT, highlighting the predominant performance of Res-UNet among the evaluated architectures. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
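The segmentation metrics reported in entry 11 (DSC and JSC) reduce to simple overlap ratios between a predicted mask and a ground-truth mask; a minimal NumPy version is sketched below. This is generic evaluation code consistent with the standard definitions, not the study's own scripts.

```python
import numpy as np

def dice_and_jaccard(pred, target, eps=1e-7):
    """pred, target: binary arrays of the same shape (1 = artery pixel)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dsc = 2.0 * intersection / (pred.sum() + target.sum() + eps)   # dice similarity
    jsc = intersection / (union + eps)                             # Jaccard similarity
    return dsc, jsc
```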
12. Human Visual Pathways for Action Recognition versus Deep Convolutional Neural Networks: Representation Correspondence in Late but Not Early Layers.
- Author
Peng, Yujia, Gong, Xizi, Lu, Hongjing, and Fang, Fang
- Subjects
HUMAN activity recognition, CONVOLUTIONAL neural networks, VISUAL pathways, VISUAL cortex, VIDEO processing
- Abstract
Deep convolutional neural networks (DCNNs) have attained human-level performance for object categorization and exhibited representation alignment between network layers and brain regions. Does such representation alignment naturally extend to other visual tasks beyond recognizing objects in static images? In this study, we expanded the exploration to the recognition of human actions from videos and assessed the representation capabilities and alignment of two-stream DCNNs in comparison with brain regions situated along ventral and dorsal pathways. Using decoding analysis and representational similarity analysis, we show that DCNN models do not show hierarchical representation alignment to human brain across visual regions when processing action videos. Instead, later layers of DCNN models demonstrate greater representation similarities to the human visual cortex. These findings were revealed for two display formats: photorealistic avatars with full-body information and simplified stimuli in the point-light display. The discrepancies in representation alignment suggest fundamental differences in how DCNNs and the human brain represent dynamic visual information related to actions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
13. Automatic video captioning using tree hierarchical deep convolutional neural network and ASRNN-bi-directional LSTM.
- Author
Kavitha, N., Soundar, K. Ruba, Karthick, R., and Kohila, J.
- Subjects
CONVOLUTIONAL neural networks, RECURRENT neural networks, VIDEO surveillance, VIDEO processing, PERSONALLY identifiable information
- Abstract
The development of automatic video understanding technology is highly needed due to the rise of mass video data, such as surveillance and personal videos. Several methods have been presented previously for automatic video captioning, but the existing methods have some problems, such as long processing times for a huge number of frames and overfitting, and automating the video captioning process is a difficult task, which affects the accuracy of the final caption. To overcome these issues, automatic video captioning using a tree hierarchical deep convolutional neural network and an attention segmental recurrent neural network-bi-directional long short-term memory (ASRNN-bi-directional LSTM) is proposed in this paper. The captioning part contains two phases: feature encoder and decoder. In the feature encoder phase, the tree hierarchical Deep Convolutional Neural Network (Tree CNN) encodes the vector representation of the video and extracts three kinds of features. In the decoder phase, the attention segmental recurrent neural network (ASRNN) decodes the vector into a textual description. ASRNN-based methods struggle with the long-term dependency issue. To deal with this issue, the method focuses on all words generated by the bi-directional LSTM, since the global context information presented by the concealed state of the caption generator is local and incomplete. Hence, Golden Eagle Optimization is exploited to enhance the ASRNN weight parameters. The proposed method is implemented in Python. The proposed technique achieves 34.89%, 29.06%, and 20.78% higher accuracy and 23.65%, 22.10%, and 29.68% lower Mean Squared Error compared to the existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. MNCATM: A Multi-Layer Non-Uniform Coding-Based Adaptive Transmission Method for 360° Video.
- Author
Li, Xiang, Nie, Junfeng, Zhang, Xinmiao, Li, Chengrui, Zhu, Yichen, Liu, Yang, Tian, Kun, and Guo, Jia
- Subjects
SMART devices, STREAMING video & television, USER experience, VIDEO processing, PROBLEM solving, VIDEO coding
- Abstract
With the rapid development of multimedia services and smart devices, 360-degree video has enhanced the user viewing experience, ushering in a new era of immersive human–computer interaction. These technologies are increasingly integrated into everyday life, including gaming, education, and healthcare. However, the uneven spatiotemporal distribution of wireless resources presents significant challenges for the transmission of ultra-high-definition 360-degree video streaming. To address this issue, this paper proposes a multi-layer non-uniform coding-based adaptive transmission method for 360° video (MNCATM). This method optimizes video caching and transmission by dividing non-uniform tiles and leveraging users' dynamic field of view (FoV) information and the multi-bitrate characteristics of video content. First, the video transmission process is formalized and modeled, and an adaptive transmission optimization framework for non-uniform video is proposed. Based on this, the optimization problem is formulated, and an algorithm is proposed to solve it. Simulation experiments demonstrate that the proposed method, MNCATM, outperforms existing transmission schemes in terms of bandwidth utilization and user quality of experience (QoE). MNCATM can effectively utilize network bandwidth, reduce latency, improve transmission efficiency, and maximize user experience quality. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
15. APPLICATION AND EFFECT EVALUATION OF INFORMATION TECHNOLOGY IN PHYSICAL EDUCATION.
- Author
Huijuan Wang and Min Zhou
- Subjects
KALMAN filtering, SPORTS films, VIDEO processing, PHYSICAL education, PROBLEM solving
- Abstract
The Kalman filter and extended Kalman filter algorithms are widely used in physical education video processing, but their performance still needs to be improved for complex backgrounds or highly dynamic targets. This paper presents an interactive multi-model algorithm in which a displacement-detection Kalman filter is used to track moving objects. This method solves the problem of a single model being unable to match the motion features well. Finally, a simulation experiment on football match video shows that the proposed method can significantly improve the tracking accuracy of moving objects in video. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
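Entry 15 builds on Kalman filtering for tracking moving objects in sports video. The sketch below is a minimal constant-velocity Kalman filter for a 2D object centre; an interactive multi-model extension of the kind the abstract describes would run several such filters with different motion models and mix their estimates. All matrix values are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

class KalmanTracker2D:
    def __init__(self, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.x = np.zeros(4)                       # state: [px, py, vx, vy]
        self.P = np.eye(4) * 10.0                  # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_var
        self.R = np.eye(2) * meas_var

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.x   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```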
16. Deep‐Learning‐Based Facial Retargeting Using Local Patches.
- Author
Choi, Yeonsoo, Lee, Inyup, Cha, Sihun, Kim, Seonghyeon, Jung, Sunjin, and Noh, Junyong
- Subjects
MOTION capture (Cinematography), DIGITAL technology, FACIAL expression, VIDEO processing, RANGE of motion of joints
- Abstract
In the era of digital animation, the quest to produce lifelike facial animations for virtual characters has led to the development of various retargeting methods. While the retargeting facial motion between models of similar shapes has been very successful, challenges arise when the retargeting is performed on stylized or exaggerated 3D characters that deviate significantly from human facial structures. In this scenario, it is important to consider the target character's facial structure and possible range of motion to preserve the semantics assumed by the original facial motions after the retargeting. To achieve this, we propose a local patch‐based retargeting method that transfers facial animations captured in a source performance video to a target stylized 3D character. Our method consists of three modules. The Automatic Patch Extraction Module extracts local patches from the source video frame. These patches are processed through the Reenactment Module to generate correspondingly re‐enacted target local patches. The Weight Estimation Module calculates the animation parameters for the target character at every frame for the creation of a complete facial animation sequence. Extensive experiments demonstrate that our method can successfully transfer the semantic meaning of source facial expressions to stylized characters with considerable variations in facial feature proportion. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Emotion recognition in user‐generated videos with long‐range correlation‐aware network.
- Author
Yi, Yun, Zhou, Jin, Wang, Hanli, Tang, Pengjie, and Wang, Min
- Subjects
VIDEO processing, EMOTION recognition, EMOTIONS, COMPUTER vision, VIDEO excerpts, AFFECTIVE computing
- Abstract
Emotion recognition in user‐generated videos plays an essential role in affective computing. In general, visual information directly affects human emotions, so the visual modality is significant for emotion recognition. Most classic approaches mainly focus on local temporal information of videos, which potentially restricts their capacity to encode the correlation of long‐range context. To address this issue, a novel network is proposed to recognize emotions in videos. To be specific, a spatio‐temporal correlation‐aware block is designed to depict the long‐range correlations between input tokens, where the convolutional layers are used to learn the local correlations and the inter‐image cross‐attention is designed to learn the long‐range and spatio‐temporal correlations between input tokens. To generate diverse and challenging samples, a dual‐augmentation fusion layer is devised, which fuses each frame with its corresponding frame in the temporal domain. To produce rich video clips, a long‐range sampling layer is designed, which generates clips in a wide range of spatial and temporal domains. Extensive experiments are conducted on two challenging video emotion datasets, namely VideoEmotion‐8 and Ekman‐6. The experimental results demonstrate that the proposed method obtains better performance than baseline methods. Moreover, the proposed method achieves state‐of‐the‐art results on the two datasets. The source code of the proposed network is available at: https://github.com/JinChow/LRCANet. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
18. Fast CU Partition Decision Algorithm Based on Bayesian and Texture Features.
- Author
Tian, Erlin, Yang, Yifan, and Zhang, Qiuwen
- Subjects
BLOCK codes, VIDEO coding, INTERNET speed, VIDEO processing, VIDEO compression, PARALLEL algorithms
- Abstract
As internet speeds increase and user demands for video quality grow, video coding standards continue to evolve. H.266/Versatile Video Coding (VVC), as the new-generation video coding standard, further improves compression efficiency but also brings higher computational complexity. Despite the significant advancements VVC has made in compression ratio and video quality, the introduction of new coding techniques and complex coding unit (CU) partitioning methods has also led to increased encoding complexity. This complexity not only extends encoding time but also increases hardware resource consumption, limiting the application of VVC in real-time video processing and low-power devices. To alleviate the encoding complexity of VVC, this paper puts forward a Bayesian and texture-feature-based fast splitting algorithm for VVC intraframe coding blocks, which aims to reduce unnecessary computational steps, enhance encoding efficiency, and maintain video quality as much as possible. In the rapid coding stage, the video frames are coded by the original VVC test model (VTM), and the Joint Rough Mode Decision (JRMD) evaluation cost is used to update the parameters of the Bayesian algorithm and to set two thresholds that judge whether the current coding block should continue to be split. Then, for coding blocks larger than those satisfying the above threshold conditions, the predominant direction of the texture within the coding block is ascertained by calculating the standard deviations along both the horizontal and vertical axes so as to skip some unnecessary splits in the current coding block patterns. The findings from our experiments demonstrate that our proposed approach improves the encoding rate by 1.40% on average, and the execution time of the encoder has been reduced by 49.50%. The overall algorithm optimizes VVC intraframe coding technology and reduces the coding complexity of VVC. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
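The texture test described in entry 18 can be sketched as follows: the standard deviation of the block's samples is computed along the horizontal and vertical axes, and the dominant direction is used to skip split modes orthogonal to it. The threshold and the mapping from direction to skipped splits are placeholder assumptions, not the paper's tuned values.

```python
import numpy as np

def dominant_texture_direction(block, ratio_thresh=1.5):
    """block: 2D array of luma samples for one coding block."""
    block = np.asarray(block, dtype=float)
    std_per_row = block.std(axis=1)     # variation along the horizontal direction
    std_per_col = block.std(axis=0)     # variation along the vertical direction
    h_act, v_act = std_per_row.mean(), std_per_col.mean()
    if h_act > ratio_thresh * v_act:
        return "horizontal"             # e.g. skip vertical binary/ternary splits
    if v_act > ratio_thresh * h_act:
        return "vertical"               # e.g. skip horizontal splits
    return "none"                       # no dominant direction; evaluate all splits
```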
19. A lightweight defect detection algorithm for escalator steps.
- Author
Yu, Hui, Chen, Jiayan, Yu, Ping, and Feng, Da
- Subjects
PROCESS capability, STREAMING video & television, ESCALATORS, VIDEO processing, COMPUTATIONAL complexity
- Abstract
In this paper, we propose an efficient target detection algorithm, ASF-Sim-YOLO, to address issues encountered in escalator step defect detection, such as an excessive number of parameters in the detection network model, poor adaptability, and difficulties in real-time processing of video streams. Firstly, to address the characteristics of escalator step defects, we designed the ASF-Sim-P2 structure to improve the detection accuracy of small targets, such as step defects. Additionally, we incorporated the SimAM (Similarity-based Attention Mechanism) by combining SimAM with SPPF (Spatial Pyramid Pooling-Fast) to enhance the model's ability to capture key information by assigning importance weights to each pixel. Furthermore, to address the challenge posed by the small size of step defects, we replaced the traditional CIoU (Complete-Intersection-over-Union) loss function with NWD (Normalized Wasserstein Distance), which alleviated the problem of missed defects. Finally, to meet the deployment requirements of mobile devices, we performed channel pruning on the model. The experimental results showed that the improved ASF-Sim-YOLO model achieved an average accuracy (mAP50) of 96.8% on the test data set, a 22.1% improvement in accuracy compared to the baseline model. Meanwhile, the computational complexity (in GFLOPS) of the model was reduced to a quarter of that of the baseline model, while the frame rate (FPS) was improved to 575.1. Compared with YOLOv3-tiny, YOLOv5s, YOLOv8s, Faster-RCNN, TOOD, RTMDET, and other deep learning-based target recognition algorithms, ASF-Sim-YOLO has better detection accuracy and real-time processing capability. These results demonstrate that ASF-Sim-YOLO effectively balances lightweight design and performance improvement, making it highly suitable for real-time detection of step defects, which can meet the demands of escalator inspection operations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
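Entry 19 replaces the CIoU loss with the Normalized Wasserstein Distance (NWD). A minimal sketch of the commonly used NWD formulation is shown below, in which each bounding box is modelled as a 2D Gaussian and the 2-Wasserstein distance between the Gaussians is mapped to a similarity; the normalizing constant C is dataset-dependent, and the value here is a placeholder, not the paper's setting.

```python
import math

def nwd(box_a, box_b, C=12.8):
    """Boxes are (cx, cy, w, h). Returns a similarity in (0, 1]; loss = 1 - nwd."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # squared 2-Wasserstein distance between Gaussians N((cx, cy), diag(w^2/4, h^2/4))
    w2_sq = ((ax - bx) ** 2 + (ay - by) ** 2 +
             ((aw - bw) / 2.0) ** 2 + ((ah - bh) / 2.0) ** 2)
    return math.exp(-math.sqrt(w2_sq) / C)
```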
20. Adaptive spatial down-sampling method based on object occupancy distribution for video coding for machines.
- Author
An, Eun-bin, Kim, Ayoung, Jung, Soon-heung, Kwak, Sangwoon, Lee, Jin Young, Cheong, Won-Sik, Choo, Hyon-Gon, and Seo, Kwang-deok
- Subjects
COMPUTER vision, VIDEO processing, MACHINE performance, MACHINE tools, DATA reduction, VIDEO coding
- Abstract
As the performance of machine vision continues to improve, it is being used in various industrial fields to analyze and generate massive amounts of video data. Although the demand for and consumption of video data by machines has increased significantly, video coding for machines still needs to be improved. It is therefore necessary to consider a new codec that differs from conventional codecs based on the human visual system (HVS). Spatial down-sampling plays a critical role in video coding for machines because it reduces the volume of the video data to be processed while maintaining the shape of the data's features that are important for the machine to reference when processing the video. Effective methods for determining the intensity of spatial down-sampling as an efficient coding tool for machines are still in the early stages of development. Here, we propose a method of determining an optimal scale factor for spatial down-sampling by collecting and analyzing information on the number of objects and the ratio of the area occupied by the objects within a picture. We compare the data reduction ratio to the machine accuracy error ratio (DRAER) to evaluate the performance of the proposed method. By applying the proposed method, the DRAER was found to be a maximum of 21.40 dB and a minimum of 11.94 dB. This shows that video coding gain for machines could be achieved through the proposed method while maintaining the accuracy of machine vision tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
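Entry 20 chooses a spatial down-sampling scale factor from the number of objects and the fraction of the frame they occupy. The sketch below is a hypothetical illustration of that idea; the occupancy thresholds and scale factors are invented for illustration and are not the paper's method.

```python
def choose_scale_factor(boxes, frame_w, frame_h):
    """boxes: list of (x1, y1, x2, y2) object detections for one frame."""
    occupied = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    occupancy = occupied / float(frame_w * frame_h)
    if len(boxes) == 0 or occupancy < 0.05:
        return 0.25          # few/small objects: aggressive down-sampling
    if occupancy < 0.20:
        return 0.5
    return 1.0               # large objects dominate: keep full resolution

# The down-sampled frame would then be encoded and later up-sampled before the
# machine-vision task, trading bitrate against task accuracy (the DRAER metric).
```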
21. Human activity-based anomaly detection and recognition by surveillance video using kernel local component analysis with classification by deep learning techniques.
- Author
Praveena, M. D. Anto, Udayaraju, P., Chaitanya, R. Krishna, Jayaprakash, S., Kalaiyarasi, M., and Ramesh, S.
- Subjects
ANOMALY detection (Computer security), VIDEO surveillance, BAYESIAN analysis, DEEP learning, VIDEO processing, HUMAN activity recognition
- Abstract
Abnormal behavior detection methods have attempted to reduce execution time and computational complexity while improving efficiency, robustness against pixel occlusion, and generalizability. This research proposes a novel method for human activity-based anomaly detection and recognition from surveillance video utilizing DL methods. Input is collected as video and processed for noise removal and smoothing. Then kernel local component analysis extracts video features for human activity monitoring. The extracted features are then classified using Bayesian network-based spatiotemporal neural networks. The classified output shows the anomalous activities in the selected input surveillance video dataset. The simulation results are obtained for various crowd datasets in terms of mean average error, mean square error, training accuracy, validation accuracy, specificity, and F-measure. The proposed technique attained an MAE of 58%, an MSE of 63%, a specificity of 89%, an F-measure of 68%, and training and validation accuracies of 92% and 96%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. Fast and accurate detection of kiwifruits in the natural environment using improved YOLOv4.
- Author
Jinpeng Wang, Lei Xu, Song Mei, Haoruo Hu, Jialiang Zhou, and Qing Chen
- Subjects
CONVOLUTIONAL neural networks, VIDEO processing, KIWIFRUIT, FRUIT, PYRAMIDS, ORCHARDS
- Abstract
Real-time detection of kiwifruits in natural environments is essential for automated kiwifruit harvesting. In this study, a lightweight convolutional neural network called the YOLOv4-GS algorithm was proposed for kiwifruit detection. The backbone network CSPDarknet-53 of YOLOv4 was replaced with GhostNet to improve accuracy and reduce network computation. To improve the detection accuracy of small targets, the upsampling of feature map fusion was performed for network layers 151 and 154, and the spatial pyramid pooling network was removed to reduce redundant computation. A total of 2766 kiwifruit images from different environments were used as the dataset for training and testing. The experiment results showed that the F1-score, average accuracy, and Intersection over Union (IoU) of YOLOv4-GS were 98.00%, 99.22%, and 88.92%, respectively. The average time taken to detect a 416x416 kiwifruit image was 11.95 ms, and the model's weight was 28.8 MB. The average detection time of GhostNet was 31.44 ms less than that of CSPDarknet-53. In addition, the model weight of GhostNet was 227.2 MB less than that of CSPDarknet-53. YOLOv4-GS improved the detection accuracy by 8.39% over Faster R-CNN and 8.36% over SSD-300. The detection speed of YOLOv4-GS was 11.3 times and 2.6 times higher than Faster R-CNN and SSD-300, respectively. In the indoor picking experiment and the orchard picking experiment, the average speed of the YOLOv4-GS processing video was 28.4 fps. The recognition accuracy was above 90%. The average time spent for recognition and positioning was 6.09 s, accounting for about 29.03% of the total picking time. The overall results showed that the YOLOv4-GS proposed in this study can be applied for kiwifruit detection in natural environments because it improves the detection speed without compromising detection accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. SkatingVerse: A large‐scale benchmark for comprehensive evaluation on human action understanding.
- Author
Gan, Ziliang, Jin, Lei, Cheng, Yi, Cheng, Yu, Teng, Yinglei, Li, Zun, Li, Yawen, Yang, Wenhan, Zhu, Zheng, Xing, Junliang, and Zhao, Jian
- Subjects
FIGURE skating competitions, VIDEO processing, FIGURE skating, HUMAN behavior, COMPUTER vision
- Abstract
Human action understanding (HAU) is a broad topic that involves specific tasks, such as action localisation, recognition, and assessment. However, most popular HAU datasets are bound to one task based on particular actions. Combining different but relevant HAU tasks to establish a unified action understanding system is challenging due to the disparate actions across datasets. A large-scale and comprehensive benchmark, namely SkatingVerse, is constructed for action recognition, segmentation, proposal, and assessment. SkatingVerse focuses on fine-grained sport actions; hence figure skating is chosen as the task object, which eliminates the biases of object, scene, and space that exist in most previous datasets. In addition, skating actions have inherent complexity and similarity, which poses an enormous challenge for current algorithms. A total of 1687 official figure skating competition videos were collected, totalling 184.4 h, more than four times the size of other datasets on a similar topic. SkatingVerse makes it possible to formulate a unified task that outputs fine-grained human action classification and assessment results from a raw figure skating competition video. In addition, SkatingVerse can facilitate the study of HAU foundation models due to its large scale and abundant categories. Moreover, an image modality is incorporated into SkatingVerse for the human pose estimation task. Extensive experimental results show that (1) SkatingVerse significantly helps the training and evaluation of HAU methods, (2) the performance of existing HAU methods has much room to improve, and SkatingVerse helps to reduce such gaps, and (3) unifying relevant tasks in HAU through a uniform dataset can facilitate more practical applications. SkatingVerse will be publicly available to facilitate further studies on relevant problems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. Lessons learned from naturalistic driving data processing in a secure data enclave: Preliminary discoveries from analyzing dash camera videos.
- Author
Mahmood, Kaiser, Pang, Jiajun, Shahriar Ahmed, Sheikh, Yu, Gongda, Sarwar, Md Tawfiq, Benedyk, Irina, and Ch. Anastasopoulos, Panagiotis
- Subjects
PERSONALLY identifiable information, DISTRACTED driving, VIDEO processing, CAMCORDERS, ELECTRONIC data processing
- Abstract
• SHRP2 naturalistic driving study (NDS) data contain personally identifiable information (PII), and processing those data needs special attention.
• Processing PII data can be challenging due to its potential for directly or indirectly identifying individuals.
• Naturalistic driving studies are important to identify distracted driving and its impact.
• Lessons learned from processing SHRP2 NDS data can help researchers who intend to utilize similar data for future research.
This paper provides preliminary insights on the challenges of processing Strategic Highway Research Program 2 (SHRP2) Naturalistic Driving Study (NDS) videos and data, particularly those with Personally Identifiable Information (PII). Insights and lessons learned are presented from a study designed to evaluate the effectiveness of High Visibility Crosswalks (HVCs). Over a one-month period, 15,379 videos were processed in the secure data enclave of the Virginia Tech Transportation Institute (VTTI). As these videos are not available outside of the secure data enclave due to PII restrictions, researchers visiting the secure data enclave for the first time may face several challenges: navigating the software interface; identifying the video views and frames of interest; and identifying and extracting information of interest from the video views. These challenges, the procedures followed to address them, and the process for identifying and classifying distracted driving behaviors are discussed. Lastly, hypothesis tests are conducted to investigate distracted driving behavior, with the results revealing that HVCs have the potential to make drivers more cautious in their proximity. The information presented in this paper is expected to aid researchers who intend to utilize SHRP2 NDS or similar videos for future research, to preemptively plan for the video processing phase. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. Remote photoplethysmography (rPPG) in the wild: Remote heart rate imaging via online webcams.
- Author
Di Lernia, Daniele, Finotti, Gianluca, Tsakiris, Manos, Riva, Giuseppe, and Naber, Marnix
- Subjects
HEART beat, STREAMING video & television, TECHNOLOGICAL innovations, VIDEO processing, VIDEO recording, INTEROCEPTION
- Abstract
Remote photoplethysmography (rPPG) is a low-cost technique to measure physiological parameters such as heart rate by analyzing videos of a person. There has been growing attention to this technique due to the increased possibilities and demand for running psychological experiments on online platforms. Technological advancements in commercially available cameras and video processing algorithms have led to significant progress in this field. However, despite these advancements, past research indicates that suboptimal video recording conditions can severely compromise the accuracy of rPPG. In this study, we aimed to develop an open-source rPPG methodology and test its performance on videos collected via an online platform, without control of the hardware of the participants and the contextual variables, such as illumination, distance, and motion. Across two experiments, we compared the results of the rPPG extraction methodology to a validated dataset used for rPPG testing. Furthermore, we then collected 231 online video recordings and compared the results of the rPPG extraction to finger pulse oximeter data acquired with a validated mobile heart rate application. Results indicated that the rPPG algorithm was highly accurate, showing a significant degree of convergence with both datasets thus providing an improved tool for recording and analyzing heart rate in online experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
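Entry 25 extracts heart rate from webcam video. A minimal sketch of the classic green-channel rPPG pipeline is shown below: average the green channel over a face region in every frame, band-pass the resulting trace around plausible pulse frequencies, and read the rate off the dominant FFT peak. Face detection, detrending, and the authors' specific open-source method are omitted; the filter order and band edges are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def heart_rate_from_frames(frames, fps, face_box):
    """frames: sequence of HxWx3 RGB frames; face_box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = face_box
    trace = np.array([f[y1:y2, x1:x2, 1].mean() for f in frames])   # green channel
    b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype="band")  # ~42-240 bpm
    filtered = filtfilt(b, a, trace - trace.mean())
    spectrum = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fps)
    return 60.0 * freqs[np.argmax(spectrum)]        # dominant frequency, in bpm
```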
26. Non-Intrusive Water Surface Velocity Measurement Based on Deep Learning.
- Author
An, Guocheng, Du, Tiantian, He, Jin, and Zhang, Yanwei
- Subjects
FLOOD control, OPTICAL flow, FLOW velocity, DEEP learning, HAZARD mitigation
- Abstract
Accurate assessment of water surface velocity (WSV) is essential for flood prevention, disaster mitigation, and erosion control within hydrological monitoring. Existing image-based velocimetry techniques largely depend on correlation principles, requiring users to input and adjust parameters to achieve reliable results, which poses challenges for users lacking relevant expertise. This study presents RivVideoFlow, a user-friendly, rapid, and precise method for WSV. RivVideoFlow combines two-dimensional and three-dimensional orthorectification based on Ground Control Points (GCPs) with a deep learning-based multi-frame optical flow estimation algorithm named VideoFlow, which integrates temporal cues. The orthorectification process employs a homography matrix to convert images from various angles into a top-down view, aligning the image coordinates with actual geographical coordinates. VideoFlow achieves superior accuracy and strong dataset generalization compared to two-frame RAFT models due to its more effective capture of flow velocity continuity over time, leading to enhanced stability in velocity measurements. The algorithm has been validated on a flood simulation experimental platform, in outdoor settings, and with synthetic river videos. Results demonstrate that RivVideoFlow can robustly estimate surface velocity under various camera perspectives, enabling continuous real-time dynamic measurement of the entire flow field. Moreover, RivVideoFlow has demonstrated superior performance in low, medium, and high flow velocity scenarios, especially in high-velocity conditions where it achieves high measurement precision. This method provides a more effective solution for hydrological monitoring. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
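Entry 26 combines GCP-based orthorectification with dense optical flow. The OpenCV sketch below shows the two building blocks: a homography estimated from matching image and world control points to rectify frames into a top-down view, and a dense flow field between consecutive rectified frames. The paper's learning-based VideoFlow estimator is replaced here by Farneback flow purely as a readily available stand-in; converting flow to velocity also requires the ground sampling distance and frame rate.

```python
import cv2
import numpy as np

def rectify_frame(frame, image_pts, world_pts, out_size):
    """image_pts/world_pts: Nx2 arrays of matching ground control points."""
    H, _ = cv2.findHomography(np.float32(image_pts), np.float32(world_pts))
    return cv2.warpPerspective(frame, H, out_size)

def surface_flow(prev_rect, curr_rect):
    prev_gray = cv2.cvtColor(prev_rect, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_rect, cv2.COLOR_BGR2GRAY)
    # HxWx2 field of per-pixel displacements in rectified coordinates;
    # multiplying by (metres per pixel) * fps converts it to surface velocity.
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```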
27. Unbox the Black-Box: Predict and Interpret YouTube Viewership Using Deep Learning.
- Author
Xie, Jiaheng, Chai, Yidong, and Liu, Xiao
- Subjects
DEEP learning, VIDEO production & direction, PREDICTION models, VIDEO processing, TRUST, SOCIAL media
- Abstract
As video-sharing sites emerge as a critical part of the social media landscape, video viewership prediction becomes essential for content creators and businesses to optimize influence and marketing outreach with minimum budgets. Although deep learning champions viewership prediction, it lacks interpretability, which is required by regulators and is fundamental to the prioritization of the video production process and promoting trust in algorithms. Existing interpretable predictive models face the challenges of imprecise interpretation and negligence of unstructured data. Following the design-science paradigm, we propose a novel Precise Wide-and-Deep Learning (PrecWD) to accurately predict viewership with unstructured video data and well-established features while precisely interpreting feature effects. PrecWD's prediction outperforms benchmarks in two case studies and achieves superior interpretability in two user studies. We contribute to IS knowledge base by enabling precise interpretability in video-based predictive analytics and contribute nascent design theory with generalizable model design principles. Our system is deployable to improve video-based social media presence. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
28. Signsability: Enhancing Communication through a Sign Language App
- Author
Din Ezra, Shai Mastitz, and Irina Rabaev
- Subjects
sign language recognition, deep learning, computer vision, MediaPipe, video processing, ISL dataset, Computer software, QA76.75-76.765
- Abstract
The integration of sign language recognition systems into digital platforms has the potential to bridge communication gaps between the deaf community and the broader population. This paper introduces an advanced Israeli Sign Language (ISL) recognition system designed to interpret dynamic motion gestures, addressing a critical need for more sophisticated and fluid communication tools. Unlike conventional systems that focus solely on static signs, our approach incorporates both deep learning and Computer Vision techniques to analyze and translate dynamic gestures captured in real-time video. We provide a comprehensive account of our preprocessing pipeline, detailing every stage from video collection to the extraction of landmarks using MediaPipe, including the mathematical equations used for preprocessing these landmarks and the final recognition process. The dataset utilized for training our model is unique in its comprehensiveness and is publicly accessible, enhancing the reproducibility and expansion of future research. The deployment of our model on a publicly accessible website allows users to engage with ISL interactively, facilitating both learning and practice. We discuss the development process, the challenges overcome, and the anticipated societal impact of our system in promoting greater inclusivity and understanding.
- Published
- 2024
- Full Text
- View/download PDF
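Entry 28 extracts landmarks with MediaPipe before classification. The sketch below shows a minimal version of that preprocessing stage, turning each video frame into a flat vector of hand-landmark coordinates; it is an illustrative pipeline under common MediaPipe usage, not the authors' released code, and their normalization equations are omitted.

```python
import cv2
import mediapipe as mp

def extract_hand_landmarks(video_path):
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
    cap = cv2.VideoCapture(video_path)
    sequences = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            frame_vec = []
            for hand in result.multi_hand_landmarks:
                for lm in hand.landmark:              # 21 landmarks per hand
                    frame_vec.extend([lm.x, lm.y, lm.z])
            sequences.append(frame_vec)
    cap.release()
    hands.close()
    return sequences                                  # fed to the gesture classifier
```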
29. Violent crowd flow detection from surveillance cameras using deep transfer learning-gated recurrent unit
- Author
Elly Matul Imah and Riskyana Dewi Intan Puspitasari
- Subjects
deep learning, deep transfer learning, video processing, violence detection, Telecommunication, TK5101-6720, Electronics, TK7800-8360
- Abstract
Violence can be committed anywhere, even in crowded places. It is hence necessary to monitor human activities for public safety. Surveillance cameras can monitor surrounding activities but require human assistance to continuously monitor every incident. Automatic violence detection is needed for early warning and fast response. However, such automation is still challenging because of low video resolution and blind spots. This paper uses ResNet50v2 and the gated recurrent unit (GRU) algorithm to detect violence in the Movies, Hockey, and Crowd video datasets. Spatial features were extracted from each frame sequence of the video using a pretrained model from ResNet50V2, which was then classified using the optimal trained model on the GRU architecture. The experimental results were then compared with wavelet feature extraction methods and classification models, such as the convolutional neural network and long short-term memory. The results show that the proposed combination of ResNet50V2 and GRU is robust and delivers the best performance in terms of accuracy, recall, precision, and F1-score. The use of ResNet50V2 for feature extraction can improve model performance.
- Published
- 2024
- Full Text
- View/download PDF
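Entry 29 pairs a pretrained ResNet50V2 feature extractor with a GRU classifier. Below is a minimal Keras sketch of that kind of architecture; the sequence length, GRU width, and training settings are illustrative choices rather than the paper's exact configuration.

```python
import tensorflow as tf

NUM_FRAMES, IMG_SIZE = 30, 224

backbone = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(IMG_SIZE, IMG_SIZE, 3))
backbone.trainable = False                      # use as a fixed per-frame feature extractor

inputs = tf.keras.Input(shape=(NUM_FRAMES, IMG_SIZE, IMG_SIZE, 3))
features = tf.keras.layers.TimeDistributed(backbone)(inputs)   # (batch, T, 2048)
x = tf.keras.layers.GRU(128)(features)                         # temporal aggregation
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)    # violent / non-violent

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```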
30. Deep Learning-Based Human Action Recognition in Videos.
- Author
Li, Song and Shi, Qian
- Subjects
HUMAN activity recognition, CONVOLUTIONAL neural networks, FEATURE extraction, LEARNING, VIDEO processing, DEEP learning
- Abstract
In order to solve the problem of low accuracy and efficiency in video-based human behavior recognition algorithms, a deep learning human behavior recognition algorithm based on an improved time division network is proposed. This method innovates on the classical two-stream convolutional neural network framework, and its core is to enhance the performance of the time division network by implementing a sliding-window sampling technique at multiple time scales. This sampling strategy not only effectively integrates the full time-series information of the video, but also accurately captures the long-term dependencies hidden in human behavior, which further improves the accuracy and efficiency of behavior recognition. Experimental results show that the proposed method achieves good results on multiple datasets. On HMDB51, it achieves 84% recognition accuracy, while on the more complex Kinetics and UCF101 datasets it also achieves 94% and similarly significant recognition results, respectively. In the face of complex scenes and variable human body structures, the proposed algorithm shows excellent robustness and stability. In terms of real-time performance, it can meet the high requirements of real-time video processing. Through the validation of experimental data, our method has made significant progress in extracting spatiotemporal features, capturing long-term dependencies, and focusing on key information. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
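Entry 30's core idea is sliding-window sampling at multiple time scales. The sketch below illustrates one way to generate such windows over a video's frame indices; the window sizes, strides, and frames-per-window values are assumptions for illustration, not the paper's settings.

```python
def multiscale_windows(num_frames, window_sizes=(8, 16, 32), frames_per_window=4):
    """Return lists of frame indices sampled from sliding windows of several lengths."""
    samples = []
    for w in window_sizes:
        if num_frames < w:
            continue                                # skip scales longer than the video
        stride = max(w // 2, 1)                     # 50% overlap between windows
        step = max(w // frames_per_window, 1)
        for start in range(0, num_frames - w + 1, stride):
            samples.append(list(range(start, start + w, step))[:frames_per_window])
    return samples

# Example: multiscale_windows(64) yields index lists that a two-stream network
# can read frames from at several temporal scales.
```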
31. Adaptive QP algorithm for depth range prediction and encoding output in virtual reality video encoding process.
Yang, Hui, Liu, Qiuming, and Song, Chao
- Subjects
VIRTUAL reality, ELECTRONIC data processing, VIDEO processing, VIDEO coding, ENCODING, DECISION making, VIDEO compression
- Abstract
In order to reduce encoding complexity and stream size and further improve compression performance, depth prediction partition encoding is studied in this paper. In terms of the pattern selection strategy, optimization analysis is carried out based on fast strategic decision-making methods to ensure the comprehensiveness of data processing. In the design of the adaptive strategy, different adaptive quantization parameter adjustment strategies are adopted for the equatorial and polar regions, considering the different levels of user attention in 360-degree virtual reality videos. The purpose is to achieve the optimal balance between distortion and stream size, thereby managing the output stream size while maintaining video quality. The results showed that this strategy achieved a maximum reduction of 2.92% in bit rate and an average reduction of 1.76%. The average coding time could be reduced by 39.28%, and the average reconstruction quality was 0.043, with almost no quality loss detected by the audience. At the same time, the model demonstrated excellent performance on 4K, 6K, and 8K sequences. The proposed depth partitioning adaptive strategy brings significant improvements in video encoding quality and efficiency, and can improve encoding efficiency while ensuring video quality. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Evaluating fine tuned deep learning models for real-time earthquake damage assessment with drone-based images.
Kizilay, Furkan, Narman, Mina R., Song, Hwapyeong, Narman, Husnu S., Cosgun, Cumhur, and Alzarrad, Ammar
- Subjects
OBJECT recognition (Computer vision), EMERGENCY management, VIDEO processing, DEEP learning, RESOURCE allocation, COMPUTATIONAL complexity, EARTHQUAKE damage
- Abstract
Earthquakes pose a significant threat to life and property worldwide. Rapid and accurate assessment of earthquake damage is crucial for effective disaster response efforts. This study investigates the feasibility of employing deep learning models for damage detection using drone imagery. We explore the adaptation of models like VGG16 for object detection through transfer learning and compare their performance to established object detection architectures like YOLOv8 (You Only Look Once) and Detectron2. Our evaluation, based on various metrics including mAP, mAP50, and recall, demonstrates the superior performance of YOLOv8 in detecting damaged buildings within drone imagery, particularly for cases with moderate bounding box overlap. This finding suggests its potential suitability for real-world applications due to the balance between accuracy and efficiency. Furthermore, to enhance real-world feasibility, we explore two strategies for enabling the simultaneous operation of multiple deep learning models for video processing: frame splitting and threading. In addition, we optimize model size and computational complexity to facilitate real-time processing on resource-constrained platforms, such as drones. This work contributes to the field of earthquake damage detection by (1) demonstrating the effectiveness of deep learning models, including adapted architectures, for damage detection from drone imagery, (2) highlighting the importance of evaluation metrics like mAP50 for tasks with moderate bounding box overlap requirements, and (3) proposing methods for ensemble model processing and model optimization to enhance real-world feasibility. The potential for real-time damage assessment using drone-based deep learning models offers significant advantages for disaster response by enabling rapid information gathering to support resource allocation, rescue efforts, and recovery operations in the aftermath of earthquakes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
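Entry 32 mentions frame splitting and threading so that several detection models can process video concurrently. The sketch below illustrates the threading strategy with one worker thread and one frame queue per model; the models are treated as opaque callables (e.g. wrapped YOLOv8 or Detectron2 predictors), and the queue size is an arbitrary choice.

```python
import queue
import threading

def run_detector(name, model, frame_queue, results):
    while True:
        frame = frame_queue.get()
        if frame is None:                      # sentinel: no more frames
            break
        results.setdefault(name, []).append(model(frame))

def process_video(frames, models):
    """frames: iterable of images; models: dict name -> callable(frame)."""
    results, queues, workers = {}, {}, []
    for name, model in models.items():
        q = queue.Queue(maxsize=8)
        queues[name] = q
        t = threading.Thread(target=run_detector, args=(name, model, q, results))
        t.start()
        workers.append(t)
    for frame in frames:
        for q in queues.values():
            q.put(frame)                       # fan each frame out to every model
    for q in queues.values():
        q.put(None)
    for t in workers:
        t.join()
    return results
```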
33. A comparative analysis on major key-frame extraction techniques.
- Author
Sunuwar, Jhuma and Borah, Samarjeet
- Subjects
AMERICAN Sign Language, VIDEO processing, SIGN language, EXTRACTION techniques, SAMPLING (Process)
- Abstract
Real-time hand gesture recognition involves analyzing both static and dynamic gesture videos. A video is a sequential arrangement of images, captured and eventually displayed at a given frequency. Not all video frames are useful, and including every frame makes video processing complex. Methods have been devised to remove redundant and identical frames to simplify video processing. One such approach is key-frame extraction, which identifies and retains only those frames that accurately represent the original content of the video. In this paper, we empirically analyze different methods for performing key-frame extraction. An experimental analysis is carried out on five key-frame extraction methods: simple frame extraction by uniform sampling, the Structural Similarity Index, absolute two-frame difference, motion detection, and an error-correction-based key-frame extraction technique (KCKFE) using Visual Geometry Group-16 (VGG16). Three publicly available datasets (DVS gesture, American Sign Language (ASL) gesture, and IPN gesture) and two self-constructed datasets (NSL_Consonent and NSL_Vowel) are used to evaluate the performance of the key-frame extraction methods. NSL_Consonent and NSL_Vowel comprise 37 consonants and 17 vowels of the Nepali Sign Language. The experimental results show that uniform sampling is only suitable for static gestures that do not require structural information for selecting key-frames. The Structural Similarity Index, KCKFE based on VGG16, and motion-detection-based key-frame extraction are found suitable for dynamic gestures. The absolute two-frame difference method produces poor key-frame sets because it generates as many frames as are present in the original video. [ABSTRACT FROM AUTHOR]
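As a rough illustration of one of the compared techniques, the sketch below selects key-frames by absolute frame difference with OpenCV; the threshold value and video path are assumptions, and the paper's exact variants may differ.

```python
# Hedged sketch of key-frame selection by absolute frame difference.
# The threshold and video path are illustrative assumptions.
import cv2
import numpy as np

def keyframes_by_frame_difference(video_path, threshold=15.0):
    """Keep a frame when its mean absolute difference from the previously
    kept frame exceeds `threshold` (grayscale intensity units)."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > threshold:
            keyframes.append(idx)      # this frame becomes the new reference
            prev_gray = gray
        idx += 1
    cap.release()
    return keyframes

# Example: indices = keyframes_by_frame_difference("gesture.mp4")
```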
- Published
- 2024
- Full Text
- View/download PDF
34. Hybrid time-spatial video saliency detection method to enhance human action recognition systems.
- Author
-
Gharahbagh, Abdorreza Alavi, Hajihashemi, Vahid, Ferreira, Marta Campos, Machado, J. J. M., and Tavares, João Manuel R. S.
- Subjects
HUMAN activity recognition ,MACHINE learning ,OPTICAL flow ,VIDEO processing ,GENETIC algorithms - Abstract
Since digital media has become increasingly popular, video processing has expanded in recent years. Video processing systems require high levels of processing, which is one of the challenges in this field. Various approaches, such as hardware upgrades, algorithmic optimizations, and removing unnecessary information, have been suggested to solve this problem. This study proposes a video saliency map based method that identifies the critical parts of the video and improves the system's overall performance. Using an image registration algorithm, the proposed method first removes the camera's motion. Subsequently, each video frame's color, edge, and gradient information are used to obtain a spatial saliency map. Combining spatial saliency with motion information derived from optical flow and color-based segmentation can produce a saliency map containing both motion and spatial data. A nonlinear function is suggested to properly combine the temporal and spatial saliency maps, which was optimized using a multi-objective genetic algorithm. The proposed saliency map method was added as a preprocessing step in several Human Action Recognition (HAR) systems based on deep learning, and its performance was evaluated. Furthermore, the proposed method was compared with similar methods based on saliency maps, and the superiority of the proposed method was confirmed. The results show that the proposed method can improve HAR efficiency by up to 6.5% relative to HAR methods with no preprocessing step and 3.9% compared to the HAR method containing a temporal saliency map. [ABSTRACT FROM AUTHOR]
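A minimal sketch of fusing a spatial and a temporal saliency map with a simple parametric nonlinear function; the weights and exponents here are placeholders, whereas the paper tunes its own function with a multi-objective genetic algorithm.

```python
# Hedged sketch of combining spatial and temporal saliency maps with a simple
# parametric nonlinear function. The exponents and mixing weight are
# illustrative; the paper optimizes its own function with a genetic algorithm.
import numpy as np

def fuse_saliency(spatial, temporal, alpha=0.6, gamma_s=1.0, gamma_t=2.0):
    """Combine two [0, 1] saliency maps; gamma_t > 1 emphasizes strong motion."""
    fused = alpha * spatial ** gamma_s + (1.0 - alpha) * temporal ** gamma_t
    return fused / (fused.max() + 1e-8)     # renormalize to [0, 1]

spatial = np.random.rand(120, 160)          # placeholder spatial saliency map
temporal = np.random.rand(120, 160)         # placeholder motion saliency map
mask = fuse_saliency(spatial, temporal) > 0.5   # keep only the salient region
```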
- Published
- 2024
- Full Text
- View/download PDF
35. ULSR-UV: an ultra-lightweight super-resolution networks for UAV video.
- Author
-
Yang, Xin, Wu, Lingxiao, and Wang, Xiangchen
- Subjects
- *
DRONE aircraft , *NETWORK performance , *BLOCK designs , *VIDEO processing , *GENERALIZATION , *VIDEO compression - Abstract
Existing lightweight video super-resolution network architectures are often simple in structure and lack generalization ability when dealing with the complex and varied real scenes in aerial videos from unmanned aerial vehicles (UAVs). Furthermore, these networks may cause issues such as the checkerboard effect and loss of texture information when processing drone videos. To address these challenges, we propose an ultra-lightweight video super-resolution reconstruction network based on convolutional pyramids and progressive residual blocks: ULSR-UV. The ULSR-UV network significantly reduces model redundancy and remains extremely lightweight by incorporating a 3D lightweight spatial pyramid structure and more efficient residual block designs. The network uses a dedicated optimizer to efficiently process drone videos in both multi-frame and single-frame dimensions. Additionally, ULSR-UV incorporates a multidimensional feature loss calculation module that enhances network performance and significantly improves the reconstruction quality of drone aerial videos. Extensive experimental verification has demonstrated ULSR-UV's outstanding performance in drone video super-resolution reconstruction. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. DiffRank: Enhancing efficiency in discontinuous frame rate analysis for urban surveillance systems.
- Author
-
Cheng, Ziying, Li, Zhe, Zhang, Tianfan, Zhao, Xiaochao, and Jing, Xiao
- Subjects
OBJECT recognition (Computer vision) ,SHOW windows ,VIDEO surveillance ,IMAGE processing ,VIDEO processing - Abstract
Urban public safety management relies heavily on video surveillance systems, which provide crucial visual data for resolving a wide range of incidents and controlling unlawful activities. Traditional methods for target detection predominantly employ a two-stage approach, focusing on precision in identifying objects such as pedestrians and vehicles. These objects, typically sparse in large-scale, lower-quality surveillance footage, induce considerable redundant computation during the initial processing stage. This redundancy constrains real-time detection capabilities and escalates processing costs. Furthermore, transmitting raw images and videos laden with superfluous information to centralized back-end systems significantly burdens network communications and fails to capitalize on the computational resources available at diverse surveillance nodes. This study introduces DiffRank, a novel preprocessing method for fixed-angle video imagery in urban surveillance. The method strategically generates candidate regions during preprocessing, thereby reducing redundant object detection and improving the efficiency of the detection algorithm. Drawing upon change detection principles, a background feature learning approach utilizing shallow features has been developed. This approach prioritizes learning the characteristics of fixed-area backgrounds over direct background identification. As a result, alterations in regions of interest (ROIs) are efficiently discerned using computationally efficient shallow features, markedly accelerating the generation of proposed ROIs and diminishing the computational demands for subsequent object detection and classification. Comparative analysis on various public and private datasets illustrates that DiffRank, while maintaining high accuracy, substantially outperforms existing baselines in terms of speed, particularly with larger image sizes (e.g., an improvement exceeding 300% at 1920×1080 resolution). Moreover, the method demonstrates enhanced robustness compared to baseline methods, efficiently disregarding static targets like mannequins in display windows. The advancements in candidate area preprocessing enable a balanced approach between detection accuracy and overall detection speed, making the algorithm highly applicable for real-time on-site analysis in edge computing scenarios and cloud-edge collaborative computing environments. [ABSTRACT FROM AUTHOR]
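A rough sketch of the general idea of change-detection-based ROI proposal for a fixed camera, using a simple running-average background model rather than the paper's shallow-feature learner; all thresholds and parameters are assumptions.

```python
# Hedged sketch of candidate-ROI generation for a fixed-angle camera by
# differencing against a slowly updated background model. This stands in for
# the paper's shallow background-feature learning; parameters are assumptions.
import cv2
import numpy as np

class RoiProposer:
    def __init__(self, learning_rate=0.01, diff_threshold=25, min_area=400):
        self.bg = None
        self.lr = learning_rate
        self.thr = diff_threshold
        self.min_area = min_area

    def propose(self, frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if self.bg is None:
            self.bg = gray.copy()                 # first frame seeds the background
            return []
        diff = cv2.absdiff(gray, self.bg)
        self.bg = (1 - self.lr) * self.bg + self.lr * gray   # slow background update
        mask = (diff > self.thr).astype(np.uint8) * 255
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        # Return bounding boxes of changed regions large enough to matter.
        return [cv2.boundingRect(c) for c in contours
                if cv2.contourArea(c) >= self.min_area]
```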
- Published
- 2024
- Full Text
- View/download PDF
37. Low Power Multiplier Using Approximate Adder for Error Tolerant Applications.
- Author
-
Hemanth, C., Sangeetha, R. G., Kademani, Sagar, and Shahbaz Ali, Meer
- Subjects
- *
DIGITAL signal processing , *VIDEO processing , *LOGIC - Abstract
In embedded applications and digital signal processing systems, multipliers are crucial components, and there is an increasing need for energy-efficient circuits in these applications. We use an approximate adder that tolerates errors in the computational process to improve performance and reduce power consumption. Due to human perceptual constraints, such computational errors do not significantly affect applications like image, audio, and video processing. Adiabatic logic (AL), which recycles energy, can also be used to build circuits that require less energy. In this work, we propose a carry-save array multiplier employing an approximate adder based on CMOS logic and clocked CMOS adiabatic logic (CCAL) to conserve power. Multiplier parameters such as average power and power-delay product are also calculated for a multiplier built with precise full adders and compared against the proposed approximate multiplier. Simulations were performed using 180 nm technology in Cadence Virtuoso. [ABSTRACT FROM AUTHOR]
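As a hedged behavioral illustration of an error-tolerant approximate adder (not the specific CMOS/CCAL cell proposed in the paper), the sketch below approximates the low-order bits with a carry-free OR and adds the high-order bits exactly.

```python
# Hedged behavioral model of an approximate adder: the k least-significant bits
# are approximated with a bitwise OR (no carry propagation), while the upper
# bits are added exactly. This is a generic error-tolerant scheme for
# illustration only; the paper's adder cells may use a different approximation.

def approximate_add(a: int, b: int, k: int = 4, width: int = 16) -> int:
    mask_low = (1 << k) - 1
    low = (a | b) & mask_low                  # approximate low bits, no carry
    high = ((a >> k) + (b >> k)) << k         # exact addition of the high bits
    return (high | low) & ((1 << width) - 1)  # truncate to the adder width

if __name__ == "__main__":
    a, b = 1234, 5678
    exact, approx = a + b, approximate_add(a, b)
    print(exact, approx, "error:", abs(exact - approx))
```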
- Published
- 2024
- Full Text
- View/download PDF
38. On Developing a Machine Learning-Based Approach for the Automatic Characterization of Behavioral Phenotypes for Dairy Cows Relevant to Thermotolerance.
- Author
-
Inadagbo, Oluwatosin, Makowski, Genevieve, Ahmed, Ahmed Abdelmoamen, and Daigle, Courtney
- Subjects
- *
COMPUTER vision , *DAIRY cattle , *AUTONOMIC nervous system , *ARTIFICIAL intelligence , *VIDEO processing - Abstract
The United States is predicted to experience an annual decline in milk production of 1.4 and 1.9 kg/day due to heat stress by the 2050s and 2080s, respectively, with economic losses of USD 1.7 billion and USD 2.2 billion, despite current cooling efforts implemented by the dairy industry. The ability of cattle to withstand heat (i.e., thermotolerance) can be influenced by physiological and behavioral factors; the factors contributing to thermoregulation are heritable, and cows vary in their behavioral repertoire. Current methods to gauge cow behaviors lack precision and scalability. This paper presents an approach leveraging machine learning (ML) (e.g., CNN and YOLOv8) and computer vision (e.g., video processing and annotation) techniques to quantify key behavioral indicators, specifically drinking frequency and brush use behaviors. These behaviors, while challenging to quantify using traditional methods, offer profound insights into autonomic nervous system function and an individual cow's coping mechanisms under heat stress. The developed approach makes it possible to quantify these difficult-to-measure drinking and brush use behaviors of dairy cows milked in a robotic milking system. It will allow ranchers to make informed decisions that could mitigate the adverse effects of heat stress and will expedite data collection regarding dairy cow behavioral phenotypes. Finally, the developed system is evaluated using different performance metrics, including classification accuracy. The YOLOv8 and CNN models achieved classification accuracies of 93% and 96% for object detection and classification, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. Directional Texture Editing for 3D Models.
- Author
-
Liu, Shengqi, Chen, Zhuo, Gao, Jingnan, Yan, Yichao, Zhu, Wenhan, Lyu, Jiangjing, and Yang, Xiaokang
- Subjects
- *
VIDEO editing , *VIDEO processing , *TEXTURE mapping , *SURFACES (Technology) , *PROBLEM solving - Abstract
Texture editing is a crucial task in 3D modelling that allows users to automatically manipulate the surface materials of 3D models. However, the inherent complexity of 3D models and the ambiguous text description lead to the challenge of this task. To tackle this challenge, we propose ITEM3D, a Texture Editing Model designed for automatic 3D object editing according to the text Instructions. Leveraging the diffusion models and the differentiable rendering, ITEM3D takes the rendered images as the bridge between text and 3D representation and further optimizes the disentangled texture and environment map. Previous methods adopted the absolute editing direction, namely score distillation sampling (SDS) as the optimization objective, which unfortunately results in noisy appearances and text inconsistencies. To solve the problem caused by the ambiguous text, we introduce a relative editing direction, an optimization objective defined by the noise difference between the source and target texts, to release the semantic ambiguity between the texts and images. Additionally, we gradually adjust the direction during optimization to further address the unexpected deviation in the texture domain. Qualitative and quantitative experiments show that our ITEM3D outperforms the state‐of‐the‐art methods on various 3D objects. We also perform text‐guided relighting to show explicit control over lighting. Our project page: https://shengqiliu1.github.io/ITEM3D/. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Mix‐Max: A Content‐Aware Operator for Real‐Time Texture Transitions.
- Author
-
Fournier, Romain and Sauvage, Basile
- Subjects
- *
DISTRIBUTION (Probability theory) , *VIDEO processing , *ALGORITHMS - Abstract
Mixing textures is a basic and ubiquitous operation in data‐driven algorithms for real‐time texture generation and rendering. It is usually performed either by linear blending, or by cutting. We propose a new mixing operator which encompasses and extends both, creating more complex transitions that adapt to the texture's contents. Our mixing operator takes as input two or more textures along with two or more priority maps, which encode how the texture patterns should interact. The resulting mixed texture is defined pixel‐wise by selecting the maximum of both priorities. We show that it integrates smoothly into two widespread applications: transition between two different textures, and texture synthesis that mixes pieces of the same texture. We provide constant‐time and parallel evaluation of the resulting mix over square footprints of MIP‐maps, making our operator suitable for real‐time rendering. We also develop a micro‐priority model, inspired by micro‐geometry models in rendering, which represents sub‐pixel priorities by a statistical distribution, and which allows for tuning between sharp cuts and smooth blend. [ABSTRACT FROM AUTHOR]
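A minimal NumPy sketch of the pixel-wise rule described above, selecting at each pixel the texture whose priority is largest; the MIP-map footprint filtering and micro-priority model of the paper are not modeled here.

```python
# Hedged sketch of the pixel-wise mixing rule: for each pixel, output the
# texture whose priority map is largest. Inputs are illustrative NumPy arrays;
# the paper additionally evaluates this over MIP-map footprints.
import numpy as np

def mix_max(textures, priorities):
    """textures: list of (H, W, C) arrays; priorities: list of (H, W) arrays."""
    tex = np.stack(textures)                   # (N, H, W, C)
    pri = np.stack(priorities)                 # (N, H, W)
    winner = np.argmax(pri, axis=0)            # (H, W) index of the max priority
    h, w = winner.shape
    # Gather, per pixel, the color from the winning texture.
    return tex[winner, np.arange(h)[:, None], np.arange(w)[None, :]]

t1, t2 = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
p1, p2 = np.random.rand(64, 64), np.random.rand(64, 64)
mixed = mix_max([t1, t2], [p1, p2])            # (64, 64, 3) mixed texture
```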
- Published
- 2024
- Full Text
- View/download PDF
41. SMFS‐GAN: Style‐Guided Multi‐class Freehand Sketch‐to‐Image Synthesis.
- Author
-
Cheng, Zhenwei, Wu, Lei, Li, Xiang, and Meng, Xiangxu
- Subjects
- *
VIDEO processing , *CLASS differences - Abstract
Freehand sketch‐to‐image (S2I) is a challenging task due to the individualized lines and the random shape of freehand sketches. The multi‐class freehand sketch‐to‐image synthesis task, in turn, presents new challenges for this research area. This task requires not only the consideration of the problems posed by freehand sketches but also the analysis of multi‐class domain differences in the conditions of a single model. However, existing methods often have difficulty learning domain differences between multiple classes, and cannot generate controllable and appropriate textures while maintaining shape stability. In this paper, we propose a style‐guided multi‐class freehand sketch‐to‐image synthesis model, SMFS‐GAN, which can be trained using only unpaired data. To this end, we introduce a contrast‐based style encoder that optimizes the network's perception of domain disparities by explicitly modelling the differences between classes and thus extracting style information across domains. Further, to optimize the fine‐grained texture of the generated results and the shape consistency with freehand sketches, we propose a local texture refinement discriminator and a Shape Constraint Module, respectively. In addition, to address the imbalance of data classes in the QMUL‐Sketch dataset, we add 6K images by drawing manually and obtain QMUL‐Sketch+ dataset. Extensive experiments on SketchyCOCO Object dataset, QMUL‐Sketch+ dataset and Pseudosketches dataset demonstrate the effectiveness as well as the superiority of our proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. Evaluation in Neural Style Transfer: A Review.
- Author
-
Ioannou, Eleftherios and Maddock, Steve
- Subjects
- *
LANDSCAPE assessment , *VIDEO processing , *EVALUATION methodology , *HUMAN experimentation , *ALGORITHMS - Abstract
The field of neural style transfer (NST) has witnessed remarkable progress in the past few years, with approaches being able to synthesize artistic and photorealistic images and videos of exceptional quality. To evaluate such results, a diverse landscape of evaluation methods and metrics is used, including authors' opinions based on side‐by‐side comparisons, human evaluation studies that quantify the subjective judgements of participants, and a multitude of quantitative computational metrics which objectively assess the different aspects of an algorithm's performance. However, there is no consensus regarding the most suitable and effective evaluation procedure that can guarantee the reliability of the results. In this review, we provide an in‐depth analysis of existing evaluation techniques, identify the inconsistencies and limitations of current evaluation methods, and give recommendations for standardized evaluation practices. We believe that the development of a robust evaluation framework will not only enable more meaningful and fairer comparisons among NST methods but will also enhance the comprehension and interpretation of research findings in the field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. Infinite 3D Landmarks: Improving Continuous 2D Facial Landmark Detection.
- Author
-
Chandran, P., Zoss, G., Gotardo, P., and Bradley, D.
- Subjects
- *
VIDEO processing , *TRANSFORMER models , *DETECTORS , *WEARABLE video devices , *ANNOTATIONS - Abstract
In this paper, we examine three important issues in the practical use of state‐of‐the‐art facial landmark detectors and show how a combination of specific architectural modifications can directly improve their accuracy and temporal stability. First, many facial landmark detectors require a face normalization step as a pre‐process, often accomplished by a separately trained neural network that crops and resizes the face in the input image. There is no guarantee that this pre‐trained network performs optimal face normalization for the task of landmark detection. Thus, we instead analyse the use of a spatial transformer network that is trained alongside the landmark detector in an unsupervised manner, jointly learning an optimal face normalization and landmark detection by a single neural network. Second, we show that modifying the output head of the landmark predictor to infer landmarks in a canonical 3D space rather than directly in 2D can further improve accuracy. To convert the predicted 3D landmarks into screen‐space, we additionally predict the camera intrinsics and head pose from the input image. As a side benefit, this allows to predict the 3D face shape from a given image only using 2D landmarks as supervision, which is useful in determining landmark visibility among other things. Third, when training a landmark detector on multiple datasets at the same time, annotation inconsistencies across datasets forces the network to produce a sub‐optimal average. We propose to add a semantic correction network to address this issue. This additional lightweight neural network is trained alongside the landmark detector, without requiring any additional supervision. While the insights of this paper can be applied to most common landmark detectors, we specifically target a recently proposed continuous 2D landmark detector to demonstrate how each of our additions leads to meaningful improvements over the state‐of‐the‐art on standard benchmarks. [ABSTRACT FROM AUTHOR]
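A minimal sketch of the screen-space conversion step described above, projecting canonical 3D landmarks with a predicted head pose and camera intrinsics via a standard pinhole model; the matrices below are placeholders, not trained network outputs.

```python
# Hedged sketch of converting canonical 3D landmarks to 2D screen space using
# a head pose (R, t) and camera intrinsics K, as the abstract describes.
# The values are placeholders standing in for the network's predictions.
import numpy as np

def project_landmarks(landmarks_3d, R, t, K):
    """landmarks_3d: (N, 3) canonical points; R: (3, 3); t: (3,); K: (3, 3)."""
    cam = landmarks_3d @ R.T + t               # canonical space -> camera space
    uvw = cam @ K.T                            # apply camera intrinsics
    return uvw[:, :2] / uvw[:, 2:3]            # perspective divide -> pixels

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 60.0])   # head 60 units in front of camera
pts = np.random.randn(68, 3) * 5.0             # 68 canonical 3D landmarks
uv = project_landmarks(pts, R, t, K)           # (68, 2) screen-space landmarks
```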
- Published
- 2024
- Full Text
- View/download PDF
44. A lightweight sow farrowing recognition model for edge computing.
- Author
-
尹令, 蒋圣政, 叶诚至, 吴珍芳, 杨杰, 张素敏, and 蔡更元
- Subjects
- *
ANIMAL culture , *ANIMAL breeding , *ANIMAL breeds , *IMAGE processing , *VIDEO processing - Abstract
The reproductive performance of sows plays a critical role in animal breeding, particularly in the efficiency and effectiveness of selection. However, manual recording of piglet births and their survival can no longer meet the needs of large-scale production, and high precision is required to capture more nuanced data such as the intervals between births; advanced technologies are therefore expected to enhance both the accuracy and efficiency of breeding programs. In this study, a lightweight network was developed to rapidly and accurately monitor sow farrowing in real time and to extract essential birthing metrics, such as the number of piglets born and the precise interval between each birth. Initially, the effect of different monitoring views, single-column versus double-column, on model accuracy was evaluated; the single-column view yielded clearly better recognition of birthing events, supporting real-time decision-making with direct implications for breeding outcomes. Video augmentation techniques, including horizontal and vertical flipping, were incorporated to cope with dynamic changes in sow posture and varying camera perspectives during monitoring, while color jittering and Gaussian blur were employed to handle different lighting conditions and the motion blur of active piglets during birth, significantly enhancing the robustness of the model under diverse operational conditions. A comparative analysis of classification networks showed that ResNet50 achieved the highest recognition accuracy, while MobileNetV3-S offered the most compact model size and the fastest processing speed of 505.14 frames per second. MobileNetV3-S was further refined using masked generative distillation, a technique that strengthens the network's ability to capture essential birthing features, with ResNet50 serving as the teacher model and MobileNetV3-S as the student model, followed by dependency-graph pruning. Tests carried out on a DELL OptiPlex microcomputer achieved a detection speed of 83.10 frames per second and a test accuracy of 91.48% in the single-column field of view; although accuracy decreased slightly by 0.98 percentage points, the detection speed improved by 67.13 frames per second. The improved model was then deployed at the edge for testing, where it measured farrowing intervals with a detection error of just 0.31 seconds and the duration of piglet birth events with an error of only 0.02 seconds. This highly efficient and precise real-time monitoring, combining image processing and machine learning, can improve the management of breeding activities in complex farm environments, provide standards for accuracy and efficiency in livestock management, and contribute to sustainable and scientifically informed animal husbandry. [ABSTRACT FROM AUTHOR]
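A hedged sketch of a teacher-student training step with a ResNet50 teacher and a MobileNetV3-Small student, using generic logit distillation for illustration; the paper's actual method is masked generative (feature-level) distillation followed by dependency-graph pruning, which this sketch does not implement.

```python
# Hedged sketch: a generic logit-distillation step with a ResNet50 teacher and
# a MobileNetV3-Small student for a two-class "farrowing / not farrowing" task.
# This is NOT the paper's masked generative distillation or pruning pipeline.
import torch
import torch.nn.functional as F
from torchvision import models

teacher = models.resnet50(num_classes=2)
student = models.mobilenet_v3_small(num_classes=2)
teacher.eval()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
T, alpha = 4.0, 0.5                          # temperature and loss mixing weight

def distill_step(images, labels):
    with torch.no_grad():
        t_logits = teacher(images)           # soft targets from the teacher
    s_logits = student(images)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(s_logits, labels)   # supervision from hard labels
    loss = alpha * kd + (1 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: loss = distill_step(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))
```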
- Published
- 2024
- Full Text
- View/download PDF
45. The (null) effects of video questions on applicant reactions in asynchronous video interviews: Evidence from an actual hiring context.
- Author
-
Niemitz, Nelli, Rietheimer, Lucas, König, Cornelius, and Langer, Markus
- Subjects
- *
EMPLOYEE selection , *IMPRESSION management , *SOCIAL influence , *VIDEO processing , *EMPLOYMENT interviewing , *POPULARITY - Abstract
Asynchronous video interviews (AVIs) are growing in popularity, but tend to suffer from negative applicant reactions, possibly due to lower social presence compared to other interview formats. Research has suggested that specific design features may influence applicant reactions by increasing perceived social presence. In this study, we manipulated the question format (video vs. text) during an actual hiring process (N = 76), testing whether video questions influence social presence, applicant reactions, impression management, and interview performance. There was no evidence that video (vs. text) questions affected any of these variables. We discuss how specific AVI design choices may have affected our results and suggest that future research could investigate the additive and interactive effects of different AVI design features. Practitioner points: Using video questions did not increase social presence when also using an introduction video. Using video questions did not affect applicant reactions, impression management, or interview performance. Our results suggest that for organizations using an introduction video and offering flexibility in the asynchronous video interview process (e.g., the opportunity to rerecord responses), it may not be worth the effort and cost to produce video questions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Survey on vision-based dynamic hand gesture recognition.
- Author
-
Tripathi, Reena and Verma, Bindu
- Subjects
- *
GESTURE , *DEEP learning , *RECOGNITION (Psychology) , *SOCIOCULTURAL factors , *HUMAN body , *HAND - Abstract
Hand gestures are very important for communicating with one another, and using hand gestures in technology is motivated by the very natural way humans communicate with their environment. Recognizing the hand and estimating its pose fall under the area of hand gesture analysis. Locating the gesturing hand is more difficult than locating other parts of the human body because the hand is smaller. Hand gesture recognition also faces greater complexity and more challenges due to cultural and individual differences between users and gestures invented ad hoc, and these complications and variations deeply affect the recognition rate and accuracy. This paper summarizes hand gesture techniques, recognition methods, their merits and demerits, various applications, available datasets, achieved accuracy rates, classifiers, algorithms, and gesture types. It also scrutinizes the performance of traditional and deep learning methods on dynamic hand gesture recognition. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
47. Signsability: Enhancing Communication through a Sign Language App.
- Author
-
Ezra, Din, Mastitz, Shai, and Rabaev, Irina
- Subjects
SIGN language ,MOBILE apps ,FLUID dynamics ,COMPUTER vision ,DEEP learning - Abstract
The integration of sign language recognition systems into digital platforms has the potential to bridge communication gaps between the deaf community and the broader population. This paper introduces an advanced Israeli Sign Language (ISL) recognition system designed to interpret dynamic motion gestures, addressing a critical need for more sophisticated and fluid communication tools. Unlike conventional systems that focus solely on static signs, our approach incorporates both deep learning and Computer Vision techniques to analyze and translate dynamic gestures captured in real-time video. We provide a comprehensive account of our preprocessing pipeline, detailing every stage from video collection to the extraction of landmarks using MediaPipe, including the mathematical equations used for preprocessing these landmarks and the final recognition process. The dataset utilized for training our model is unique in its comprehensiveness and is publicly accessible, enhancing the reproducibility and expansion of future research. The deployment of our model on a publicly accessible website allows users to engage with ISL interactively, facilitating both learning and practice. We discuss the development process, the challenges overcome, and the anticipated societal impact of our system in promoting greater inclusivity and understanding. [ABSTRACT FROM AUTHOR]
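A rough sketch of per-frame hand-landmark extraction with MediaPipe and a simple wrist-relative normalization; the paper defines its own preprocessing equations, so the normalization here is only an assumed illustration.

```python
# Hedged sketch of per-frame hand-landmark extraction with MediaPipe and a
# simple wrist-relative, scale-invariant normalization. The paper's actual
# preprocessing equations differ; this only illustrates the pipeline shape.
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2,
                                 min_detection_confidence=0.5)

def landmarks_from_video(video_path):
    cap = cv2.VideoCapture(video_path)
    sequence = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            lm = result.multi_hand_landmarks[0].landmark        # 21 hand points
            pts = np.array([[p.x, p.y, p.z] for p in lm])
            pts -= pts[0]                                       # wrist-relative
            scale = np.linalg.norm(pts, axis=1).max() + 1e-8
            sequence.append(pts / scale)                        # scale-invariant
    cap.release()
    return np.array(sequence)        # (num_frames_with_hand, 21, 3)
```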
- Published
- 2024
- Full Text
- View/download PDF
48. Enhancing Urban Road Safety: Pothole Detection Using YOLO.
- Author
-
Patil, Avila and Japtap, Vandana
- Subjects
OBJECT recognition (Computer vision) ,CONVOLUTIONAL neural networks ,COMPUTER vision ,DEEP learning ,VIDEO processing ,ROAD safety measures - Abstract
Potholes are a major safety concern on roads, as they often lead to accidents, so identifying them promptly is vital to preventing accidents. This research focuses on potholes, which become especially evident during the rainy season and pose great difficulties for drivers. The study presents the creation of an automatic pothole segmentation model for real-time road damage assessment. The severe safety implications and infrastructure problems caused by potholes indicate a need for effective monitoring and maintenance strategies. A YOLOv8-based segmentation model was trained using computer vision and machine learning techniques with a curated dataset of road images. We then fine-tuned this model through transfer learning and evaluated its performance using various metrics to detect and segment potholes accurately. After that, we integrated the model into a real-time video processing pipeline combined with road monitoring systems to continuously assess the state of roads. Finally, we discuss the deployment architecture, real-time performance evaluation, use cases, and future research directions on the potential of automated pothole segmentation for enhancing road safety and infrastructure management. [ABSTRACT FROM AUTHOR]
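A minimal sketch of a frame-by-frame video pipeline around a fine-tuned YOLOv8 segmentation model using the ultralytics API; the weights file name and confidence threshold are placeholders, not the paper's artifacts.

```python
# Hedged sketch of the real-time video pipeline: run a fine-tuned YOLOv8
# segmentation model frame-by-frame with the ultralytics API. The weights file
# and confidence threshold below are placeholders.
import cv2
from ultralytics import YOLO

model = YOLO("pothole_yolov8n-seg.pt")        # hypothetical fine-tuned weights

def process_stream(video_path, conf=0.4):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = model(frame, conf=conf, verbose=False)[0]
        annotated = results.plot()            # draw predicted masks and boxes
        cv2.imshow("potholes", annotated)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()

# Example: process_stream("dashcam.mp4")
```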
- Published
- 2024
- Full Text
- View/download PDF
49. SUSTAINABLE ADVANCEMENTS IN IMAGE AND VIDEO PROCESSING FOR MODERN APPLICATIONS.
- Author
-
Mahde Alahmar, Haeder Talib
- Subjects
GENERATIVE adversarial networks ,TRANSFORMER models ,CONVOLUTIONAL neural networks ,ARTIFICIAL intelligence ,VIDEO processing - Abstract
The rapid development of artificial intelligence (AI) has fundamentally changed the field of image and video processing and opened up new possibilities for a range of applications. This paper examines the state-of-the-art techniques shaping the field, focusing on innovation and sustainability. We review current technologies, including deep learning frameworks, generative models, and edge AI, and study their impact on real-time processing, resource efficiency, and scalability. For example, the Vision Transformer (ViT) has attracted interest for its ability to capture global dependencies in visual data, outperforming common convolutional neural networks (CNNs) on a range of tasks. In addition, we discuss the use of generative adversarial networks (GANs) to improve medical image quality, thereby significantly enhancing diagnostic accuracy. The paper also addresses pressing sustainability issues by exploring methods to reduce the environmental impact of AI-driven image and video processing. Techniques such as model pruning, quantization, and the integration of renewable energy into data centers are explored, and practical options are presented to balance overall performance and energy consumption. This research offers practical insights that can be directly applied to industries such as healthcare, autonomous vehicles, and security systems, and provides a roadmap for adopting energy-efficient and ethical AI practices. Through these analyses, this paper presents valuable insights into current trends, practical applications, and future directions of AI-driven image and video processing. By combining empirical evidence and case studies, we aim to contribute to the ongoing dialogue on the role of AI in creating a more sustainable and innovative future. [ABSTRACT FROM AUTHOR]
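As a small illustration of one resource-efficiency technique mentioned above, the sketch below applies post-training dynamic quantization to the linear layers of an arbitrary PyTorch model; the model itself is a placeholder, not one from the paper.

```python
# Hedged sketch of one efficiency technique named above: post-training dynamic
# quantization of a model's linear layers in PyTorch. The model is an arbitrary
# placeholder used only to demonstrate the API.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(),
                      nn.Linear(224 * 224 * 3, 512), nn.ReLU(),
                      nn.Linear(512, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)    # weights stored as int8

x = torch.randn(1, 3, 224, 224)
print(quantized(x).shape)                     # same interface, smaller weights
```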
- Published
- 2024
- Full Text
- View/download PDF
50. Detection and localization of anomalous objects in video sequences using vision transformers and U-Net model.
- Author
-
Berroukham, Abdelhafid, Housni, Khalid, and Lahraichi, Mohammed
- Abstract
The detection and localization of anomalous objects in video sequences remain a challenging task in video analysis. Recent years have witnessed a surge in deep learning approaches, especially with recurrent neural networks (RNNs). However, RNNs have limitations that vision transformers (ViTs) can address. We propose a novel solution that leverages ViTs, which have recently achieved remarkable success in various computer vision tasks. Our approach involves a two-step process. First, we utilize a pre-trained ViT model to generate an intermediate representation containing an attention map, highlighting areas critical for anomaly detection. In the second step, this attention map is concatenated with the original video frame, creating a richer representation that guides the U-Net model towards anomaly-prone regions. This enriched data is then fed into a U-Net model for precise localization of the anomalous objects. The model achieved a mean Intersection over Union (IoU) of 0.70, indicating a strong overlap between the predicted bounding boxes and the ground truth annotations. In the field of anomaly detection, a higher IoU score signifies better performance. Moreover, the pixel accuracy of 0.99 demonstrates a high level of precision in classifying individual pixels. Concerning localization accuracy, we conducted a comparison of our method with other approaches. The results obtained show that our method outperforms most of the previous methods and achieves a very competitive performance in terms of localization accuracy. [ABSTRACT FROM AUTHOR]
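A minimal sketch of the two-step idea: an attention map (here a placeholder tensor standing in for the ViT output) is concatenated with the RGB frame as a fourth channel and passed to a 4-channel segmentation network that stands in for the paper's U-Net.

```python
# Hedged sketch of the two-step approach described above: a per-pixel attention
# map (placeholder for the pre-trained ViT output) is concatenated with the RGB
# frame and fed to a 4-channel segmentation network (stand-in for the U-Net).
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal 4-channel-in, 1-channel-out stand-in for a U-Net."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

frame = torch.rand(1, 3, 256, 256)                 # RGB video frame
attention = torch.rand(1, 1, 256, 256)             # placeholder ViT attention map
enriched = torch.cat([frame, attention], dim=1)    # (1, 4, 256, 256) input

model = TinySegNet()
anomaly_mask = model(enriched)                     # per-pixel anomaly probabilities
print(anomaly_mask.shape)                          # torch.Size([1, 1, 256, 256])
```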
- Published
- 2024
- Full Text
- View/download PDF