1,114 results for "video sequences"
Search Results
2. CoFFEE: a codec-based forensic feature extraction and evaluation software for H.264 videos.
- Author: Bertazzini, Giulia, Baracchi, Daniele, Shullani, Dasara, Iuliani, Massimo, and Piva, Alessandro
- Subjects: VIDEO codecs, FEATURE extraction, DIGITAL video, SOCIAL networks, SCIENTIFIC community, VIDEO compression
- Abstract
The forensic analysis of digital videos is becoming increasingly relevant to deal with forensic cases, propaganda, and fake news. The research community has developed numerous forensic tools to address various challenges, such as integrity verification, manipulation detection, and source characterization. Each tool exploits characteristic traces to reconstruct the video life-cycle. Among these traces, a significant source of information is provided by the specific way in which the video has been encoded. While several tools are available to analyze codec-related information for images, a similar approach has been overlooked for videos, since video codecs are extremely complex and involve the analysis of a huge amount of data. In this paper, we present a new tool designed for extracting and parsing a plethora of video compression information from H.264 encoded files, including macroblocks structure, prediction residuals, and motion vectors. We demonstrate how the extracted features can be effectively exploited to address various forensic tasks, such as social network identification, source characterization, and double compression detection. We provide a detailed description of the developed software, which is released free of charge to enable its use by the research community to create new tools for forensic analysis of video files. [ABSTRACT FROM AUTHOR]
- Published: 2024
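The abstract above centers on codec-level feature extraction from H.264 streams. As a loose, hedged illustration only (not the CoFFEE software), the sketch below pulls coarse frame-level metadata, frame types and compressed frame sizes, with ffprobe; the input file name is a placeholder and ffmpeg/ffprobe is assumed to be installed.

```python
# Hedged sketch, not the CoFFEE tool: coarse frame-level H.264 metadata via ffprobe.
# Assumes ffmpeg/ffprobe is installed; "input.mp4" is a placeholder path.
import json
import subprocess

def frame_level_features(path: str):
    """Return (frame types, compressed frame sizes) for the first video stream."""
    cmd = [
        "ffprobe", "-v", "quiet", "-print_format", "json",
        "-select_streams", "v:0", "-show_frames", path,
    ]
    probe = json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)
    frames = probe.get("frames", [])
    pict_types = [f.get("pict_type", "?") for f in frames]     # 'I', 'P', 'B'
    sizes = [int(f.get("pkt_size", 0)) for f in frames]        # bytes per coded frame
    return pict_types, sizes

if __name__ == "__main__":
    types, sizes = frame_level_features("input.mp4")           # placeholder file
    i_positions = [i for i, t in enumerate(types) if t == "I"] # coarse GOP structure
    print("frame types:", "".join(types[:60]))
    print("I-frame positions:", i_positions[:10])
```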
4. Umpire's Signal Recognition in Cricket Using an Attention based DC-GRU Network.
- Author: Dey, A., Biswas, S., and Abualigah, L.
- Subjects: COMPUTER vision, CRICKET umpires, VIDEOS, EVALUATION
- Abstract
Computer vision has extensive applications in various sports domains, and cricket, a complex game with different event types, is no exception. Recognizing umpire signals during cricket matches is essential for fair and accurate decision-making in gameplay. This paper presents the Cricket Umpire Action Video dataset (CUAVd), a novel dataset designed for detecting umpire postures in cricket matches. As the umpire possesses the power to make crucial judgments concerning incidents that occur on the field, this dataset aims to contribute to the advancement of automated systems for umpire recognition in cricket. The proposed Attention-based Deep Convolutional GRU Network accurately detects and classifies various umpire signal actions in video sequences. The method achieved remarkable results on our prepared CUAVd dataset and publicly available datasets, namely HMDB51, Youtube Actions, and UCF101. The DC-GRU Attention model demonstrated its effectiveness in capturing temporal dependencies and accurately recognizing umpire signal actions. Compared to other advanced models like traditional CNN architectures, CNN-LSTM with Attention, and the 3DCNN+GRU model, the proposed model consistently outperformed them in recognizing umpire signal actions. It achieved a high validation accuracy of 94.38% in classifying umpire signal videos correctly. The paper also evaluated the models using performance metrics like F1-Measure and Confusion Matrix, confirming their effectiveness in recognizing umpire signal actions. The suggested model has practical applications in real-life situations such as sports analysis, referee training, and automated referee assistance systems where precise identification of umpire signals in videos is vital. [ABSTRACT FROM AUTHOR]
- Published: 2024
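For readers unfamiliar with the model family named in the abstract above, the following is a minimal, generic CNN-to-GRU-with-attention classifier sketch in PyTorch. Layer sizes, the attention pooling, and the six-class head are illustrative assumptions, not the authors' DC-GRU network or the CUAVd dataset.

```python
# Hedged sketch of a generic CNN -> GRU -> attention video classifier; all sizes are
# illustrative and this is not the authors' DC-GRU architecture.
import torch
import torch.nn as nn

class ConvGRUAttention(nn.Module):
    def __init__(self, n_classes: int = 6, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),      # -> 16 * 4 * 4 = 256 features
        )
        self.gru = nn.GRU(256, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)                # temporal attention scores
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, video):                           # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        h, _ = self.gru(feats)                          # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)          # (B, T, 1) attention weights
        return self.head((w * h).sum(dim=1))            # attention-pooled logits

logits = ConvGRUAttention()(torch.randn(2, 12, 3, 64, 64))
print(logits.shape)                                     # torch.Size([2, 6])
```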
5. Single Shot Detector CNN and Deep Dilated Masks for Vision-Based Hand Gesture Recognition From Video Sequences
- Author: Fahmid Al Farid, Noramiza Hashim, Junaidi Bin Abdullah, Md. Roman Bhuiyan, Magzhan Kairanbay, Zulfadzli Yusoff, Hezerul Abdul Karim, Sarina Mansor, MD. Tanjil Sarker, and Gobbi Ramasamy
- Subjects: Gesture recognition, video sequences, SVM, SSD-CNN, deep dilated mask, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
With an increasing number of people on the planet today, innovative human-computer interaction technologies and approaches may be employed to assist individuals in leading more fulfilling lives. Gesture-based technology has the potential to improve the safety and well-being of impaired people, as well as the general population. Recognizing gestures from video streams is a difficult problem because of the large degree of variation in the characteristics of each motion across individuals. In this article, we propose applying deep learning methods to recognize automated hand gestures using RGB and depth data. To train neural networks to detect hand gestures, any of these forms of data may be utilized. Gesture-based interfaces are more natural, intuitive, and straightforward. Earlier study attempted to characterize hand motions in a number of contexts. Our technique is evaluated using a vision-based gesture recognition system. In our suggested technique, image collection starts with RGB video and depth information captured with the Kinect sensor and is followed by tracking the hand using a single shot detector Convolutional Neural Network (SSD-CNN). When the kernel is applied, it creates an output value at each of the m $\times $ n locations. Using a collection of convolutional filters, each new feature layer generates a defined set of gesture detection predictions. After that, we perform deep dilation to make the gesture in the image masks more visible. Finally, hand gestures have been detected using the well-known classification technique SVM. Using deep learning we recognize hand gestures with higher accuracy of 93.68% in RGB passage, 83.45% in the depth passage, and 90.61% in RGB-D conjunction on the SKIG dataset compared to the state-of-the-art. In the context of our own created Different Camera Orientation Gesture (DCOG) dataset we got higher accuracy of 92.78% in RGB passage, 79.55% in the depth passage, and 88.56% in RGB-D conjunction for the gestures collected in 0-degree angle. Moreover, the framework intends to use unique methodologies to construct a superior vision-based hand gesture recognition system.
- Published: 2024
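Two stages of the pipeline described above, dilating the detected hand mask and classifying with an SVM, can be sketched as follows. This is a toy illustration with random placeholder masks and labels; the SSD-CNN detector and the SKIG/DCOG data are not reproduced here.

```python
# Hedged sketch of mask dilation + SVM classification; masks/labels are random
# placeholders standing in for SSD-CNN detections, not the authors' pipeline.
import cv2
import numpy as np
from sklearn.svm import SVC

def dilate_mask(mask: np.ndarray, ksize: int = 5, iterations: int = 2) -> np.ndarray:
    """Make the gesture region more visible by morphological dilation."""
    kernel = np.ones((ksize, ksize), np.uint8)
    return cv2.dilate(mask, kernel, iterations=iterations)

rng = np.random.default_rng(0)
masks = (rng.random((100, 64, 64)) > 0.9).astype(np.uint8) * 255   # toy binary masks
labels = rng.integers(0, 10, size=100)                             # 10 gesture classes

features = np.stack([dilate_mask(m).flatten() for m in masks])
clf = SVC(kernel="rbf").fit(features, labels)                      # final classifier
print("train accuracy on toy data:", clf.score(features, labels))
```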
6. An Open CV-Based Social Distance Violation Identification Using a Deep Learning Technique
- Author: Pillalamarri, Syam Sundar; Saikumar, K.; Al-Ameedee, Sarah A.; Abdul Kadeem, Sahar R.; Hussein, Mohammed J.
- Editors: Sharma, Devendra Kumar; Peng, Sheng-Lung; Sharma, Rohit; Jeon, Gwanggil
- Series/Advisory Editors: Kacprzyk, Janusz; Gomide, Fernando; Kaynak, Okyay; Liu, Derong; Pedrycz, Witold; Polycarpou, Marios M.; Rudas, Imre J.; Wang, Jun
- Published: 2023
7. Generative Models for Low-Dimensional Video Representation and Reconstruction
- Author: Hyder, Rakib and Asif, M Salman
- Subjects: Generators, Optimization, Image reconstruction, Video sequences, Compressed sensing, Electronics packaging, Measurement uncertainty, Compressive sensing, generative model, video reconstruction, manifold embedding, cs.CV, cs.LG, stat.ML, Networking & Telecommunications
- Abstract
Generative models have received considerable attention in signal processing and compressive sensing for their ability to generate high-dimensional natural image using low-dimensional codes. In the context of compressive sensing, if the unknown image belongs to the range of a pretrained generative network, then we can recover the image by estimating the underlying compact latent code from the available measurements. In practice, however, a given pretrained generator can only reliably generate images that are similar to the training data. To overcome this challenge, a number of methods have been proposed recently to use untrained generator structure as prior while solving the signal recovery problem. In this paper, we propose a similar method for jointly updating the weights of the generator and latent codes while recovering a video sequence from compressive measurements. We use a single generator to generate the entire video. To exploit the temporal redundancy in a video sequence, we use a low-rank constraint on the latent codes that imposes a low-dimensional manifold model on the generated video sequence. We evaluate the performance of our proposed methods on different video compressive sensing problems under different settings and compared them against some state-of-the-art methods. Our results demonstrate that our proposed methods provide better or comparable accuracy and low computational and memory complexity compared to the existing methods.
- Published: 2020
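A minimal sketch of the recovery idea summarized above: jointly optimize an untrained generator's weights and per-frame latent codes against compressive measurements, with a nuclear-norm penalty standing in for the low-rank manifold constraint on the codes. All shapes, the toy generator, and the penalty weight are assumptions, not the authors' implementation.

```python
# Hedged sketch: video recovery from compressive measurements y_t = A @ x_t by
# jointly updating generator weights and latent codes, with a low-rank (nuclear
# norm) penalty on the code matrix. Purely illustrative shapes and data.
import torch

T, k, n, m = 16, 8, 256, 64                    # frames, latent dim, pixels, measurements
A = torch.randn(m, n) / m ** 0.5               # known measurement matrix
x_true = torch.randn(T, n)                     # stand-in "video" (T flattened frames)
y = x_true @ A.T                               # compressive measurements

gen = torch.nn.Sequential(                     # untrained generator used as a prior
    torch.nn.Linear(k, 128), torch.nn.ReLU(), torch.nn.Linear(128, n)
)
Z = torch.randn(T, k, requires_grad=True)      # one latent code per frame
opt = torch.optim.Adam(list(gen.parameters()) + [Z], lr=1e-3)

for step in range(500):
    opt.zero_grad()
    x_hat = gen(Z)                             # generate all frames from the codes
    data_fit = ((x_hat @ A.T - y) ** 2).mean() # measurement consistency
    low_rank = torch.linalg.svdvals(Z).sum()   # nuclear norm ~ low-dimensional manifold
    loss = data_fit + 1e-3 * low_rank
    loss.backward()
    opt.step()

print("final measurement loss:", float(data_fit))
```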
8. IMPLEMENTATION AND EVALUATION OF THE EFFECTIVENESS OF DYNAMIC OBJECT DETECTION TOOLS FOR MULTITHREADED AND DISTRIBUTED PROCESSING OF GRAPHICAL DATA.
- Author: Kravets, O. Ja., Aksenov, I. A., Atlasov, I. V., Rahman, P. A., Redkin, Yu. V., Frantsisko, O. Yu., and Bozhko, L. M.
- Subjects: ELECTRONIC data processing, IMAGING systems, STRUCTURAL components
- Abstract
Modern monitoring and operational decision-making systems based on image series have become widely used for a wide variety of applications. At the stage of identification of graphic objects, it is established that the found dynamic object belongs to the class of objects of interest based on a comparative analysis of its contours with a given template. The article is the final in a series of articles devoted to theoretical, algorithmic and software components of the video analytics system. The description of the structural components of the video analytics system, the features of the implementation of the dynamic object detection module on video sequences, and the evaluation of the efficiency of the system is presented. [ABSTRACT FROM AUTHOR]
- Published: 2023
9. Multi-geometry embedded transformer for facial expression recognition in videos.
- Author: Chen, Dongliang, Wen, Guihua, Li, Huihui, Yang, Pei, Chen, Chuyun, and Wang, Bao
- Subjects: FACIAL expression, HYPERBOLIC spaces, MULTILEVEL models, VIDEOS, EMOTIONAL state
- Abstract
Dynamic facial expressions in videos express more realistic emotional states, and recognizing emotions from in-the-wild facial expression videos is a challenging task due to the changeable posture, partial occlusion and various light conditions. Although current methods have designed transformer-based models to learn spatial–temporal features, they cannot explore useful local geometry structures from both spatial and temporal views to capture subtle emotional features for the videos with varied poses and facial occlusion. To this end, we propose a novel multi-geometry embedded transformer (MGET), which adapts multi-geometry knowledge into transformers and excavates spatial–temporal geometry information as complementary to learn effective emotional features. Specifically, from a new perspective, we first design a multi-geometry distance learning (MGDL) to capture emotion-related geometry structure knowledge under Euclidean and Hyperbolic spaces. Especially based on the advantages of hyperbolic geometry, it finds the more subtle emotional changes among local spatial and temporal features. Secondly, we combine MGDL with transformer to design spatial–temporal MGETs, which capture important spatial and temporal multi-geometry features to embed them into their corresponding original features, and then perform cross-regions and cross-frame interaction on these multi-level features. Finally, MGET gains superior performance on DFEW, FERV39k and AFEW datasets, where the unweighted average recall (UAR) and weighted average recall (WAR) are 58.65%/69.91%, 41.91%/50.76% and 53.23%/55.40%, respectively, and the gained improvements are 2.55%/0.66%, 3.69%/2.63% and 3.66%/1.14% compared to M3DFEL, Logo-Forme and EST methods. • A multi-geometry embedded transformer is proposed for in-the-wild FER in videos. • MGDL captures multi-geometry structures under Euclidean and Hyperbolic spaces. • MGET combines MGDL with transformer to model multi-level spatial-temporal features. • MGET shows superior performance on in-the-wild video-based FER databases. [ABSTRACT FROM AUTHOR]
- Published: 2024
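For context on the hyperbolic component mentioned above, the snippet below computes the standard Poincaré-ball distance, one common way to measure distances in hyperbolic space. It is not the authors' MGDL module; the input vectors are arbitrary points inside the unit ball.

```python
# Hedged sketch: standard Poincare-ball (hyperbolic) distance, shown for context.
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between points u, v inside the unit Poincare ball."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * diff / denom))

a = np.array([0.1, 0.2, 0.0])
b = np.array([0.4, -0.3, 0.1])
print("Euclidean:", np.linalg.norm(a - b), "hyperbolic:", poincare_distance(a, b))
```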
10. Multitask Multigranularity Aggregation With Global-Guided Attention for Video Person Re-Identification.
- Author: Sun, Dengdi, Huang, Jiale, Hu, Lei, Tang, Jin, and Ding, Zhuanlian
- Subjects: VIDEOS, CONVOLUTIONAL neural networks
- Abstract
The goal of video-based person re-identification (Re-ID) is to identify the same person across multiple non-overlapping cameras. The key to accomplishing this challenging task is to sufficiently exploit both spatial and temporal cues in video sequences. However, most current methods are incapable of accurately locating semantic regions or efficiently filtering discriminative spatio-temporal features; so it is difficult to handle issues such as spatial misalignment and occlusion. Thus, we propose a novel feature aggregation framework, multi-task and multi-granularity aggregation with global-guided attention (MMA-GGA), which aims to adaptively generate more representative spatio-temporal aggregation features. Specifically, we develop a multi-task multi-granularity aggregation (MMA) module to extract features at different locations and scales to identify key semantic-aware regions that are robust to spatial misalignment. Then, to determine the importance of the multi-granular semantic information, we propose a global-guided attention (GGA) mechanism to learn weights based on the global features of the video sequence, allowing our framework to identify stable local features while ignoring occlusions. Therefore, the MMA-GGA framework can efficiently and effectively capture more robust and representative features. Extensive experiments on four benchmark datasets demonstrate that our MMA-GGA framework outperforms current state-of-the-art methods. In particular, our method achieves a rank-1 accuracy of 91.0% on the MARS dataset, the most widely used database, significantly outperforming existing methods. [ABSTRACT FROM AUTHOR]
- Published: 2022
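A rough sketch of a global-guided weighting in the spirit of the abstract above: local region features are re-weighted by their similarity to a global feature of the whole sequence before aggregation. The shapes and the cosine/softmax scoring are illustrative assumptions, not the MMA-GGA modules.

```python
# Hedged sketch: weight local spatio-temporal features by similarity to a global
# video-level feature before aggregation. Illustrative only.
import torch
import torch.nn.functional as F

T, R, C = 8, 4, 256                           # frames, regions per frame, channels
local_feats = torch.randn(T, R, C)            # multi-granularity local features
global_feat = local_feats.mean(dim=(0, 1))    # global feature of the video sequence

scores = F.cosine_similarity(local_feats, global_feat.expand(T, R, C), dim=-1)
weights = torch.softmax(scores.reshape(-1), dim=0).reshape(T, R, 1)
aggregated = (weights * local_feats).sum(dim=(0, 1))   # occlusion-robust aggregation
print(aggregated.shape)                       # torch.Size([256])
```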
11. HFF6D: Hierarchical Feature Fusion Network for Robust 6D Object Pose Tracking.
- Author: Liu, Jian, Sun, Wei, Liu, Chongpei, Zhang, Xing, Fan, Shimeng, and Wu, Wei
- Subjects: DATA augmentation, ARTIFICIAL satellite tracking, FEATURE extraction
- Abstract
Tracking the 6-degree-of-freedom (6D) object pose in video sequences is gaining attention because it has a wide application in multimedia and robotic manipulation. However, current methods often perform poorly in challenging scenes, such as incorrect initial pose, sudden re-orientation, and severe occlusion. In contrast, we present a robust 6D object pose tracking method with a novel hierarchical feature fusion network, refer it as HFF6D, which aims to predict the object’s relative pose between adjacent frames. Instead of extracting features from adjacent frames separately, HFF6D establishes sufficient spatial-temporal information interaction between adjacent frames. In addition, we propose a novel subtraction feature fusion (SFF) module with attention mechanism to leverage feature subtraction during feature fusion. It explicitly highlights the feature differences between adjacent frames, thus improving the robustness of relative pose estimation in challenging scenes. Besides, we leverage data augmentation technology to make HFF6D be used more effectively in the real world by training only with synthetic data, thereby reducing manual effort in data annotation. We evaluate HFF6D on the well-known YCB-Video and YCBInEOAT datasets. Quantitative and qualitative results demonstrate that HFF6D outperforms state-of-the-art (SOTA) methods in both accuracy and efficiency. Moreover, it is also proved to achieve high-robustness tracking under the above-mentioned challenging scenes. [ABSTRACT FROM AUTHOR]
- Published: 2022
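The subtraction-based fusion idea mentioned above can be sketched loosely as follows: the difference between adjacent frames' feature maps drives a simple attention map that highlights what changed between frames. This is an interpretation for illustration, not the HFF6D architecture.

```python
# Hedged sketch of subtraction-driven feature fusion between adjacent frames.
import torch

feat_prev = torch.randn(1, 64, 32, 32)        # features of frame t-1
feat_curr = torch.randn(1, 64, 32, 32)        # features of frame t

diff = feat_curr - feat_prev                  # explicit feature subtraction
attn = torch.sigmoid(diff.mean(dim=1, keepdim=True))   # change-driven attention map
fused = torch.cat([feat_curr, attn * diff], dim=1)     # fuse current features with
print(fused.shape)                                      # emphasized differences
```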
12. Counting People by Estimating People Flows.
- Author: Liu, Weizhe, Salzmann, Mathieu, and Fua, Pascal
- Subjects: OPTICAL flow, ACTIVE learning, COUNTING, DEEP learning, VIDEO compression
- Abstract
Modern methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. As such, only very few take advantage of temporal consistency in video sequences, and those that do only impose weak smoothness constraints across consecutive frames. In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model in an active learning setting with much fewer annotations. This significantly reduces the annotation cost while still leading to similar performance to the full supervision case. [ABSTRACT FROM AUTHOR]
- Published: 2022
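The conservation idea in the abstract above can be made concrete with a toy grid: predict flows between cells of consecutive frames and read off each cell's density as the sum of flows into it, so people are neither created nor destroyed. The 3x3 grid and flow values below are made up.

```python
# Hedged sketch of people-conservation counting: densities derived from flows
# between grid cells of consecutive frames. Toy values, not the paper's network.
import numpy as np

H = W = 3
# flow[i, j, k, l] = people moving from cell (i, j) at t-1 to cell (k, l) at t
flow = np.zeros((H, W, H, W))
flow[0, 0, 0, 1] = 2.0      # two people move right
flow[1, 1, 1, 1] = 3.0      # three people stay put
flow[2, 2, 1, 2] = 1.0      # one person moves up

density_t = flow.sum(axis=(0, 1))        # people arriving in each cell at time t
outgoing_prev = flow.sum(axis=(2, 3))    # people leaving each cell at time t-1
print("density at t:\n", density_t)
print("total people conserved:", density_t.sum() == outgoing_prev.sum())
```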
13. Implicit Motion-Compensated Network for Unsupervised Video Object Segmentation.
- Author: Xi, Lin, Chen, Weihai, Wu, Xingming, Liu, Zhong, and Li, Zhengguo
- Subjects: RUNNING speed, OPTICAL flow, MOTION, VIDEOS
- Abstract
Unsupervised video object segmentation (UVOS) aims at automatically separating the primary foreground object(s) from the background in a video sequence. Existing UVOS methods either lack robustness when there are visually similar surroundings (appearance-based) or suffer from deterioration in the quality of their predictions because of dynamic background and inaccurate flow (flow-based). To overcome the limitations, we propose an implicit motion-compensated network (IMCNet) combining complementary cues (i.e., appearance and motion) with aligned motion information from the adjacent frames to the current frame at the feature level without estimating optical flows. The proposed IMCNet consists of an affinity computing module (ACM), an attention propagation module (APM), and a motion compensation module (MCM). The light-weight ACM extracts commonality between neighboring input frames based on appearance features. The APM then transmits global correlation in a top-down manner. Through coarse-to-fine iterative inspiring, the APM will refine object regions from multiple resolutions so as to efficiently avoid losing details. Finally, the MCM aligns motion information from temporally adjacent frames to the current frame which achieves implicit motion compensation at the feature level. We perform extensive experiments on $\textit {DAVIS}_{\textit {16}}$ and $\textit {YouTube-Objects}$. Our network achieves favorable performance while running at a faster speed compared to the state-of-the-art methods. Our code is available at https://github.com/xilin1991/IMCNet. [ABSTRACT FROM AUTHOR]
- Published: 2022
14. Saliency and Granularity: Discovering Temporal Coherence for Video-Based Person Re-Identification.
- Author: Chen, Cuiqun, Ye, Mang, Qi, Meibin, Wu, Jingjing, Liu, Yimin, and Jiang, Jianguo
- Subjects: FEATURE extraction, NOISE measurement
- Abstract
Video-based person re-identification (ReID) matches the same people across the video sequences with rich spatial and temporal information in complex scenes. It is highly challenging to capture discriminative information when occlusions and pose variations exist between frames. A key solution to this problem rests on extracting the temporal invariant features of video sequences. In this paper, we propose a novel method for discovering temporal coherence by designing a region-level saliency and granularity mining network (SGMN). Firstly, to address the varying noisy frame problem, we design a temporal spatial-relation module (TSRM) to locate frame-level salient regions, adaptively modeling the temporal relations on spatial dimension through a probe-buffer mechanism. It avoids the information redundancy between frames and captures the informative cues of each frame. Secondly, a temporal channel-relation module (TCRM) is proposed to further mine the small granularity information of each frame, which is complementary to TSRM by concentrating on discriminative small-scale regions. TCRM exploits a one-and-rest difference relation on channel dimension to enhance the granularity features, leading to stronger robustness against misalignments. Finally, we evaluate our SGMN with four representative video-based datasets, including iLIDS-VID, MARS, DukeMTMC-VideoReID, and LS-VID, and the results indicate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
- Published: 2022
15. Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration.
- Author: Yang, Zongxin, Wei, Yunchao, and Yang, Yi
- Subjects: CONVOLUTIONAL neural networks
- Abstract
This paper investigates the principles of embedding learning to tackle the challenging semi-supervised video object segmentation. Unlike previous practices that focus on exploring the embedding learning of foreground object(s), we consider background should be equally treated. Thus, we propose a Collaborative video object segmentation by Foreground-Background Integration (CFBI) approach. CFBI separates the feature embedding into the foreground object region and its corresponding background region, implicitly promoting them to be more contrastive and improving the segmentation results accordingly. Moreover, CFBI performs both pixel-level matching processes and instance-level attention mechanisms between the reference and the predicted sequence, making CFBI robust to various object scales. Based on CFBI, we introduce a multi-scale matching structure and propose an Atrous Matching strategy, resulting in a more robust and efficient framework, CFBI+. We conduct extensive experiments on two popular benchmarks, i.e., DAVIS and YouTube-VOS. Without applying any simulated data for pre-training, our CFBI+ achieves the performance ($\mathcal{J}$ & $\mathcal{F}$) of 82.9 and 82.8 percent, outperforming all the other state-of-the-art methods. Code: https://github.com/z-x-yang/CFBI. [ABSTRACT FROM AUTHOR]
- Published: 2022
16. Spatio-Temporal Video Segmentation
- Author: Mashtalir, Sergii; Mashtalir, Volodymyr
- Editors: Mashtalir, Vladimir; Ruban, Igor; Levashenko, Vitaly
- Series Editor: Kacprzyk, Janusz
- Published: 2020
17. Video Hand Gestures Recognition Using Depth Camera and Lightweight CNN.
- Author: Leon, David Gonzalez, Groli, Jade, Yeduri, Sreenivasa Reddy, Rossier, Daniel, Mosqueron, Romuald, Pandey, Om Jee, and Cenkeramaddi, Linga Reddy
- Abstract
Hand gestures are a well-known and intuitive method of human-computer interaction. The majority of the research has concentrated on hand gesture recognition from the RGB images, however, little work has been done on recognition from videos. In addition, RGB cameras are not robust in varying lighting conditions. Motivated by this, we present the video based hand gestures recognition using the depth camera and a light weight convolutional neural network (CNN) model. We constructed a dataset and then used a light weight CNN model to detect and classify hand movements efficiently. We also examined the classification accuracy with a limited number of frames in a video gesture. We compare the depth camera’s video gesture recognition performance to that of the RGB camera. We evaluate the proposed model’s performance on edge computing devices and compare to benchmark models in terms of accuracy and inference time. The proposed model results in an accuracy of 99.48% on the RGB version of test dataset and 99.18% on the depth version of test dataset. Finally, we compare the accuracy of the proposed light weight CNN model with the state-of-the hand gesture classification models. [ABSTRACT FROM AUTHOR]
- Published: 2022
18. Adaptive Online Mutual Learning Bi-Decoders for Video Object Segmentation.
- Author: Guo, Pinxue, Zhang, Wei, Li, Xiaoqiang, and Zhang, Wenqiang
- Subjects: ONLINE education, VIDEOS, OBJECT tracking (Computer vision), VIDEO coding, PIXELS
- Abstract
One of the major challenges facing video object segmentation (VOS) is the gap between the training and test datasets due to unseen category in test set, as well as object appearance change over time in the video sequence. To overcome such challenges, an adaptive online framework for VOS is developed with bi-decoders mutual learning. We learn object representation per pixel with bi-level attention features in addition to CNN features, and then feed them into mutual learning bi-decoders whose outputs are further fused to obtain the final segmentation result. We design an adaptive online learning mechanism via a deviation correcting trigger such that bi-decoders online mutual learning will be activated when the previous frame is segmented well meanwhile the current frame is segmented relatively worse. Knowledge distillation from the well segmented previous frames, along with mutual learning between bi-decoders, improves generalization ability and robustness of VOS model. Thus, the proposed model adapts to the challenging scenarios including unseen categories, object deformation, and appearance variation during inference. We extensively evaluate our model on widely-used VOS benchmarks including DAVIS-2016, DAVIS-2017, YouTubeVOS-2018, YouTubeVOS-2019, and UVO. Experimental results demonstrate the superiority of the proposed model over state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published: 2022
19. Delving Deeper Into Mask Utilization in Video Object Segmentation.
- Author: Wang, Mengmeng, Mei, Jianbiao, Liu, Lina, Tian, Guanzhong, Liu, Yong, and Pan, Zaisheng
- Subjects: VIDEOS, TASK analysis
- Abstract
This paper focuses on the mask utilization of video object segmentation (VOS). The mask here means the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS. [ABSTRACT FROM AUTHOR]
- Published: 2022
20. Relational Reasoning Over Spatial-Temporal Graphs for Video Summarization.
- Author: Zhu, Wencheng, Han, Yucheng, Lu, Jiwen, and Zhou, Jie
- Subjects: VIDEO summarization, REPRESENTATIONS of graphs, MACHINE learning
- Abstract
In this paper, we propose a dynamic graph modeling approach to learn spatial-temporal representations for video summarization. Most existing video summarization methods extract image-level features with ImageNet pre-trained deep models. Differently, our method exploits object-level and relation-level information to capture spatial-temporal dependencies. Specifically, our method builds spatial graphs on the detected object proposals. Then, we construct a temporal graph by using the aggregated representations of spatial graphs. Afterward, we perform relational reasoning over spatial and temporal graphs with graph convolutional networks and extract spatial-temporal representations for importance score prediction and key shot selection. To eliminate relation clutters caused by densely connected nodes, we further design a self-attention edge pooling module, which disregards meaningless relations of graphs. We conduct extensive experiments on two popular benchmarks, including the SumMe and TVSum datasets. Experimental results demonstrate that the proposed method achieves superior performance against state-of-the-art video summarization methods. [ABSTRACT FROM AUTHOR]
- Published: 2022
21. VidSfM: Robust and Accurate Structure-From-Motion for Monocular Videos.
- Author: Cui, Hainan, Tu, Diantao, Tang, Fulin, Xu, Pengfei, Liu, Hongmin, and Shen, Shuhan
- Subjects: MONOCULARS, POSE estimation (Computer vision), VIDEOS, IMAGE reconstruction, VIDEO compression, PROBLEM solving
- Abstract
With the popularization of smartphones, larger collection of videos with high quality is available, which makes the scale of scene reconstruction increase dramatically. However, high-resolution video produces more match outliers, and high frame rate video brings more redundant images. To solve these problems, a tailor-made framework is proposed to realize an accurate and robust structure-from-motion based on monocular videos. The key ideas include two points: one is to use the spatial and temporal continuity of video sequences to improve the accuracy and robustness of reconstruction; the other is to use the redundancy of video sequences to improve the efficiency and scalability of system. Our technical contributions include an adaptive way to identify accurate loop matching pairs, a cluster-based camera registration algorithm, a local rotation averaging scheme to verify the pose estimate and a local images extension strategy to reboot the incremental reconstruction. In addition, our system can integrate data from different video sequences, allowing multiple videos to be simultaneously reconstructed. Extensive experiments on both indoor and outdoor monocular videos demonstrate that our method outperforms the state-of-the-art approaches in robustness, accuracy and scalability. [ABSTRACT FROM AUTHOR]
- Published: 2022
22. Variational Abnormal Behavior Detection With Motion Consistency.
- Author: Li, Jing, Huang, Qingwang, Du, Yingjun, Zhen, Xiantong, Chen, Shengyong, and Shao, Ling
- Subjects: GENERATIVE adversarial networks, COLLECTIVE behavior, MOTION, OPTICAL flow, COMPUTER vision, DATA augmentation, VIDEO coding
- Abstract
Abnormal crowd behavior detection has recently attracted increasing attention due to its wide applications in computer vision research areas. However, it is still an extremely challenging task due to the great variability of abnormal behavior coupled with huge ambiguity and uncertainty of video contents. To tackle these challenges, we propose a new probabilistic framework named variational abnormal behavior detection (VABD), which can detect abnormal crowd behavior in video sequences. We make three major contributions: (1) We develop a new probabilistic latent variable model that combines the strengths of the U-Net and conditional variational auto-encoder, which also are the backbone of our model; (2) We propose a motion loss based on an optical flow network to impose the motion consistency of generated video frames and input video frames; (3) We embed a Wasserstein generative adversarial network at the end of the backbone network to enhance the framework performance. VABD can accurately discriminate abnormal video frames from video sequences. Experimental results on UCSD, CUHK Avenue, IITB-Corridor, and ShanghaiTech datasets show that VABD outperforms the state-of-the-art algorithms on abnormal crowd behavior detection. Without data augmentation, our VABD achieves 72.24% in terms of AUC on IITB-Corridor, which surpasses the state-of-the-art methods by nearly 5%. [ABSTRACT FROM AUTHOR]
- Published: 2022
23. Motion-Adaptive Detection of HEVC Double Compression With the Same Coding Parameters.
- Author: Xu, Qiang, Jiang, Xinghao, Sun, Tanfeng, and Kot, Alex C.
- Abstract
High Efficiency Video Coding (HEVC) double compression detection is of prime significance in video forensics. However, double compression with the same parameters and video content with high motion displacement intensity have become two main factors that limit the performance of existing algorithms. To address these issues, a novel motion-adaptive algorithm is proposed in this paper. Firstly, the analysis of GOP structure in HEVC standard and the coding process of HEVC double compression are provided. Next, sub-features composed of fluctuation intensities of intra prediction modes and unstable Prediction Units (PUs) in normal Intra-Frames (I-frames) and optical flow in adaptive I-frames are exploited in our algorithm. Each sub-feature is extracted during the process of multiple decompression. We further combine these sub-features into a 27-dimensional detection feature, which is finally fed to the Support Vector Machine (SVM) classifier. By following a separation-fusion detection strategy, the experimental result shows that the proposed algorithm outperforms the existing state-of-the-art methods and demonstrates superior robustness to various motion displacement intensities and a wide variety of coding parameter settings. [ABSTRACT FROM AUTHOR]
- Published: 2022
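The final stage described above, a fixed-length fluctuation feature vector classified by an SVM as single- vs. double-compressed, can be sketched with scikit-learn. The 27-dimensional features here are random placeholders (so accuracy is near chance); producing real intra-mode, PU, and optical-flow statistics would require instrumenting an HEVC decoder.

```python
# Hedged sketch: SVM classification of 27-D detection features into single vs.
# double compression. Features are random placeholders, not real HEVC statistics.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 27))              # 27-D detection features (placeholder)
y = rng.integers(0, 2, size=200)            # 0 = single, 1 = double compression

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("toy accuracy (near chance on random data):", clf.score(X_te, y_te))
```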
24. SANet: Statistic Attention Network for Video-Based Person Re-Identification.
- Author: Bai, Shutao, Ma, Bingpeng, Chang, Hong, Huang, Rui, Shan, Shiguang, and Chen, Xilin
- Subjects: FEATURE extraction, SOURCE code, PEDESTRIANS, MACHINE learning, IDENTIFICATION, POSE estimation (Computer vision)
- Abstract
Capturing long-range dependencies during feature extraction is crucial for video-based person re-identification (re-id) since it would help to tackle many challenging problems such as occlusion and dramatic pose variation. Moreover, capturing subtle differences, such as bags and glasses, is indispensable to distinguish similar pedestrians. In this paper, we propose a novel and efficacious Statistic Attention (SA) block which can capture both the long-range dependencies and subtle differences. SA block leverages high-order statistics of feature maps, which contain both long-range and high-order information. By modeling relations with these statistics, SA block can explicitly capture long-range dependencies with less time complexity. In addition, high-order statistics usually concentrate on details of feature maps and can perceive the subtle differences between pedestrians. In this way, SA block is capable of discriminating pedestrians with subtle differences. Furthermore, this lightweight block can be conveniently inserted into existing deep neural networks at any depth to form Statistic Attention Network (SANet). To evaluate its performance, we conduct extensive experiments on two challenging video re-id datasets, showing that our SANet outperforms the state-of-the-art methods. Furthermore, to show the generalizability of SANet, we evaluate it on three image re-id datasets and two more general image classification datasets, including ImageNet. The source code is available at http://vipl.ict.ac.cn/resources/codes/code/SANet_code.zip. [ABSTRACT FROM AUTHOR]
- Published: 2022
25. Temporal Group Fusion Network for Deep Video Inpainting.
- Author: Liu, Ruixin, Li, Bairong, and Zhu, Yuesheng
- Subjects: INPAINTING, VIDEOS, DEEP learning, IMAGE reconstruction, TASK analysis
- Abstract
Video inpainting is a task of synthesizing spatio-temporal coherent content in missing regions of the given video sequence, which has recently drawn increasing attention. To utilize the temporal information across frames, most recent deep learning-based methods align reference frames to target frame firstly with explicit or implicit motion estimation and then integrate the information from the aligned frames. However, their performance relies heavily on the accuracy of frame-to-frame alignment. To alleviate the above problem, in this paper, a novel Temporal Group Fusion Network (TGF-Net) is proposed to effectively integrate temporal information through a two-stage fusion strategy. Specifically, the input frames are reorganized into different groups, where each group is followed by an intra-group fusion module to integrate information within the group. Different groups provide complementary information for the missing region. A temporal attention model is further designed to adaptively integrate the information across groups. Such a temporal information fusion way gets rid of the dependence on alignment operations, greatly improving the visual quality and temporal consistency of the inpainted results. In addition, a coarse alignment model is introduced at the beginning of the network to handle videos with large motion. Extensive experiments on DAVIS and Youtube-VOS datasets demonstrate the superiority of our proposed method in terms of PSNR/SSIM values, visual quality and temporal consistency, respectively. [ABSTRACT FROM AUTHOR]
- Published: 2022
26. 4D Epanechnikov Mixture Regression in LF Image Compression.
- Author: Liu, Boning, Zhao, Yan, Jiang, Xiaomeng, Wang, Shigang, and Wei, Jian
- Subjects: IMAGE compression, GAUSSIAN mixture models, VIDEO coding, REGRESSION analysis, JPEG (Image coding standard), IMAGE reconstruction, MIXTURES
- Abstract
With the emergence of light field imaging in recent years, the compression of its elementary image array (EIA) has become a significant problem. Our coding framework includes modeling and reconstruction. For the modeling, the covariance-matrix form of the 4D Epanechnikov kernel (4D EK) and its correlated statistics were deduced to obtain the 4-D Epanechnikov mixture models (4-D EMMs). A 4D Epanechnikov mixture regression (4D EMR) was proposed based on this 4D EK, and a 4D adaptive model selection (4D AMLS) algorithm was designed to realize the optimal modeling for a pseudo video sequence (PVS) of the extracted key-EIA. A linear function based reconstruction (LFBR) was proposed based on the correlation between adjacent elementary images (EIs). The decoded images realized a clear outline reconstruction and superior coding efficiency compared to high-efficiency video coding (HEVC) and JPEG 2000 below approximately 0.05 bpp. This work realized an unprecedented theoretical application by (1) proposing the 4D Epanechnikov kernel theory, (2) exploiting the 4D Epanechnikov mixture regression and its application in the modeling of the pseudo video sequence of light field images, (3) using 4D adaptive model selection for the optimal number of models, and (4) employing a linear function-based reconstruction according to the content similarity. [ABSTRACT FROM AUTHOR]
- Published: 2022
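For reference, one standard covariance-matrix form of the multivariate Epanechnikov kernel, which the 4D EK in the abstract above specializes to d = 4 (the exact normalization used by the authors may differ), is:

```latex
% One common covariance-matrix form of the d-variate Epanechnikov kernel
% (normalization constant omitted); the paper's 4D EK corresponds to d = 4.
K_{\boldsymbol{\Sigma}}(\mathbf{u}) \;\propto\;
  \left(1 - \mathbf{u}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{u}\right)
  \mathbf{1}\!\left[\mathbf{u}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{u} \le 1\right],
  \qquad \mathbf{u} \in \mathbb{R}^{d}.
```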
27. A Review on Deep Learning Techniques for Video Prediction.
- Author: Oprea, Sergiu, Martinez-Gonzalez, Pablo, Garcia-Garcia, Alberto, Castro-Vargas, John Alejandro, Orts-Escolano, Sergio, Garcia-Rodriguez, Jose, and Argyros, Antonis
- Subjects: DEEP learning, SUPERVISED learning, COMPUTER vision, FORECASTING, VIDEOS, PREDICTION models
- Abstract
The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences. We first define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied with experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper is summarized by drawing some general conclusions, identifying open research challenges and by pointing out future research directions. [ABSTRACT FROM AUTHOR]
- Published: 2022
28. Inferring End-to-End Latency in Live Videos.
- Author: Wang, Hengchao, Zhang, Xu, Chen, Hao, Xu, Yiling, and Ma, Zhan
- Subjects: PEARSON correlation (Statistics), VIDEOS, VIDEO compression
- Abstract
This paper provides an end-to-end latency model of real-time generated live videos composed of sub-models of capturing, encoding, network, decoding, rendering, and display refreshing. We build the model by deeply analyzing each sub part’s latency with parameters like video’s Spatial (frame scale $s$), Temporal (frame rate $t$), Amplitude (quantization step-size $q$) Resolutions (STAR), Compressed Frame Bit Numbers ($n_{F}$), and other necessary parameters. The latency of 640 combinations of STAR for eight representative video sequences is measured to fit accurate models. We primarily focus on processing procedures, whose latency is becoming the principal affection, as network latency decreases due to the its development. We then introduce a baseline video sequence set to reduce the complexity when fitting the encoding and decoding sub-models by predicting the model parameters on new hardware from parameters of a known one, and improve models’ robustness crossing different scenarios. These sub-models correlate well with the realistic latency data, with the average Pearson Correlation Coefficient (PCC) of 0.99 and the average Spearman’s Rank Correlation Coefficient (SRCC) of 0.98, according to an independent validation test. Furthermore, we validate our end-to-end latency model in a simulated end-to-end live video transmission from generating to displaying, introducing the latency as a strict condition in adaptive bit-rate selection. The simulation shows that our model could significantly increase time during which the latency can satisfy the given tolerance at a low cost. [ABSTRACT FROM AUTHOR]
- Published: 2022
29. High-Resolution Spatiotemporal Quantification of Intestinal Motility With Free-Form Deformation.
- Author: Kuruppu, Sachira, Cheng, Leo K., Nielsen, Poul M.F., Gamage, Thiranja P. Babarenda, Avci, Recep, Angeli, Timothy R., and Paskaranandavadivel, Niranchan
- Subjects: DEFORMATION of surfaces, INTESTINES, STRAIN rate, IMAGE registration, SPATIAL resolution
- Abstract
Objective: To develop a method to quantify strain fields from in vivo intestinal motility recordings that mitigate accumulation of tracking error. Methods: The deforming geometry of the intestine in video sequences was modeled by a biquadratic B-spline mesh. Green-Lagrange strain fields were computed to quantify the surface deformations. A nonlinear optimization scheme was applied to mitigate the accumulation of tracking error associated with image registration. Results: The optimization scheme maintained the RMS strain error under 1% and reduced the rate of strain error by 97% during synthetic tests. The algorithm was applied to map 64 segmental, 12 longitudinal, and 23 propagating circular contractions in the jejunum. Coordinated activity of the two muscle layers could be identified and the strain fields were able to map and quantify the anisotropic contractions of the intestine. Frequency and velocity were also quantified, from which two types of propagating circular contractions were identified: (i) $-\text{0.36}\pm \text{0.04}$ strain contractions that originated spontaneously and propagated at $\text{3} \pm \text{1}$ mm/s in two pigs, and (ii) cyclic propagating contractions of $-\text{0.17} \pm \text{0.02}$ strain occurred at $\text{11.0} \pm \text{0.6}$ cpm and propagated at $\text{16} \pm \text{4}$ mm/s in a rabbit. Conclusion: The algorithm simultaneously mapped the circular, longitudinal activity of the intestine with high spatial resolution and quantified anisotropic contractions and relaxations. Significance: The proposed algorithm can now be used to define the interactions of muscle layers during motility patterns. It can be integrated with high-resolution bioelectrical recordings to investigate the regulatory mechanisms of motility. [ABSTRACT FROM AUTHOR]
- Published: 2022
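The strain measure named above is standard; the helper below computes the Green-Lagrange strain tensor from a deformation gradient F. The 2x2 example encodes a mild stretch and shear and is purely illustrative, not data from the intestinal recordings.

```python
# Hedged sketch: Green-Lagrange strain E = 1/2 (F^T F - I) from a deformation
# gradient F; diagonal terms relate to stretch, off-diagonal terms to shear.
import numpy as np

def green_lagrange_strain(F: np.ndarray) -> np.ndarray:
    return 0.5 * (F.T @ F - np.eye(F.shape[1]))

F = np.array([[1.10, 0.05],      # ~10% stretch along the first surface direction
              [0.00, 0.95]])     # ~5% shortening (contraction) along the second
print(green_lagrange_strain(F))
```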
30. A Commonality Modeling Framework for Enhanced Video Coding Leveraging on the Cuboidal Partitioning Based Representation of Frames.
- Author: Ahmmed, Ashek, Murshed, Manzur, Paul, Manoranjan, and Taubman, David
- Abstract
Video coding algorithms attempt to minimize the significant commonality that exists within a video sequence. Each new video coding standard contains tools that can perform this task more efficiently compared to its predecessors. Modern video coding systems are block-based wherein commonality modeling is carried out only from the perspective of the block that need be coded next. In this work, we argue for a commonality modeling approach that can provide a seamless blending between global and local homogeneity information. For this purpose, at first the frame that need be coded, is recursively partitioned into rectangular regions based on the homogeneity information of the entire frame. After that each obtained rectangular region’s feature descriptor is taken to be the average value of all the pixels’ intensities encompassing the region. In this way, the proposed approach generates a coarse representation of the current frame by minimizing both global and local commonality. This coarse frame is computationally simple and has a compact representation. It attempts to preserve important structural properties of the current frame which can be viewed subjectively as well as from improved rate-distortion performance of a reference scalable HEVC coder that employs the coarse frame as a reference frame for encoding the current frame. [ABSTRACT FROM AUTHOR]
- Published: 2022
31. MRS-Net+ for Enhancing Face Quality of Compressed Videos.
- Author: Liu, Tie, Xu, Mai, Li, Shengxi, Ding, Rui, and Liu, Huaida
- Subjects: VIDEO coding, VIDEOCONFERENCING, REVUES, VIDEOS, SIGNAL-to-noise ratio, SOCIAL networks
- Abstract
During the past few years, face videos, e.g., video conference, interviews and variety shows, have grown explosively with millions of users over social media networks. Unfortunately, the existing compression algorithms are applied to these videos for reducing bandwidth, which also bring annoying artifacts to face regions. This paper addresses the problem of face quality enhancement in compressed videos by reducing the artifacts of face regions. Specifically, we establish a compressed face video (CFV) database, which includes 196,337 faces in 214 high-quality video sequences and their corresponding 1,712 compressed sequences. We find that the faces of compressed videos exhibit tremendous scale variation and quality fluctuation. Motivated by scalable video coding, we propose a multi-scale recurrent scalable network (MRS-Net+) to enhance the quality of multi-scale faces in compressed videos. The MRS-Net+ is comprised by one base and two refined enhancement levels, corresponding to the quality enhancement of small-, medium- and large-scale faces, respectively. In the multi-level architecture of our MRS-Net+, small-/medium-scale face quality enhancement serves as the basis for facilitating the quality enhancement of medium-/large-scale faces. We further develop a landmark-assisted pyramid alignment (LPA) subnet to align faces across consecutive frames, and then apply the mask-guided quality enhancement (QE) subnet for enhancing multi-scale faces. Finally, experimental results show that our MRS-Net+ method achieves averagely 1.196 dB improvement of peak signal-to-noise ratio (PSNR) and 23.54% saving of Bjøntegaard distortion-rate (BD-rate), significantly outperforming other state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published: 2022
32. Trajectory Saliency Detection Using Consistency-Oriented Latent Codes From a Recurrent Auto-Encoder.
- Author: Maczyta, Leo, Bouthemy, Patrick, and Le Meur, Olivier
- Subjects: GENERATIVE adversarial networks, RAILROAD stations
- Abstract
In this paper, we are concerned with the detection of progressive dynamic saliency from video sequences. More precisely, we are interested in saliency related to motion and likely to appear progressively over time. It can be relevant to trigger alarms, to dedicate additional processing or to detect specific events. Trajectories represent the best way to support progressive dynamic saliency detection. Accordingly, we will talk about trajectory saliency. A trajectory will be qualified as salient if it deviates from normal trajectories that share a common motion pattern related to a given context. First, we need a compact while discriminative representation of trajectories. We adopt a (nearly) unsupervised learning-based approach. The latent code estimated by a recurrent auto-encoder provides the desired representation. In addition, we enforce consistency for normal (similar) trajectories through the auto-encoder loss function. The distance of the trajectory code to a prototype code accounting for normality is the means to detect salient trajectories. We validate our trajectory saliency detection method on synthetic and real trajectory datasets, and highlight the contributions of its different components. We compare our method favourably to existing methods on several saliency configurations constructed from the publicly available large dataset of pedestrian trajectories acquired in a railway station. [ABSTRACT FROM AUTHOR]
- Published: 2022
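The detection rule summarized above, flag a trajectory whose latent code is far from a prototype of normal codes, can be sketched with random stand-in codes; in the paper the codes come from a consistency-trained recurrent auto-encoder, and the threshold choice below is an assumption.

```python
# Hedged sketch: salient trajectory = latent code far from the "normality" prototype.
# Codes are random placeholders, not outputs of the paper's recurrent auto-encoder.
import numpy as np

rng = np.random.default_rng(2)
normal_codes = rng.normal(0.0, 1.0, size=(500, 16))   # codes of normal trajectories
prototype = normal_codes.mean(axis=0)                 # prototype accounting for normality
threshold = np.percentile(np.linalg.norm(normal_codes - prototype, axis=1), 99)

def is_salient(code: np.ndarray) -> bool:
    return bool(np.linalg.norm(code - prototype) > threshold)

print(is_salient(rng.normal(0.0, 1.0, size=16)))   # typical trajectory -> likely False
print(is_salient(rng.normal(4.0, 1.0, size=16)))   # deviating trajectory -> likely True
```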
33. A Dynamic Frame Selection Framework for Fast Video Recognition.
- Author: Wu, Zuxuan, Li, Hengduo, Xiong, Caiming, Jiang, Yu-Gang, and Davis, Larry S.
- Subjects: REWARD (Psychology), FRAMES (Social sciences), DECISION making, REINFORCEMENT learning
- Abstract
We introduce AdaFrame, a conditional computation framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame, which contains a Long Short-Term Memory augmented with a global memory to provide context information, operates as an agent to interact with video sequences aiming to search over time which frames to use. Trained with policy search methods, at each time step, AdaFrame computes a prediction, decides where to observe next, and estimates a utility, i.e., expected future rewards, of viewing more frames in the future. Exploring predicted utilities at testing time, AdaFrame is able to achieve adaptive lookahead inference so as to minimize the overall computational cost without incurring a degradation in accuracy. We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet. With a vanilla ResNet-101 model, AdaFrame achieves similar performance of using all frames while only requiring, on average, 8.21 and 8.65 frames on FCVID and ActivityNet, respectively. We also demonstrate AdaFrame is compatible with modern 2D and 3D networks for video recognition. Furthermore, we show, among other things, learned frame usage can reflect the difficulty of making prediction decisions both at instance-level within the same class and at class-level among different categories. [ABSTRACT FROM AUTHOR]
- Published: 2022
34. Sequence-Level Reference Frames in Video Coding.
- Author: Jubran, Mohammad, Abbas, Alhabib, and Andreopoulos, Yiannis
- Subjects: VIDEO coding, STRUCTURAL frames
- Abstract
The proliferation of low-cost DRAM chipsets now begins to allow for the consideration of substantially-increased decoded picture buffers in advanced video coding standards such as HEVC, VVC, and Google VP9. At the same time, the increasing demand for rapid scene changes and multiple scene repetitions in entertainment or broadcast content indicates that extending the frame referencing interval to tens of minutes or even the entire video sequence may offer coding gains, as long as one is able to identify frame similarity in a computationally- and memory-efficient manner. Motivated by these observations, we propose a “stitching” method that defines a reference buffer and a reference frame selection algorithm. Our proposal extends the referencing interval of inter-frame video coding to the entire length of video sequences. Our reference frame selection algorithm uses well-established feature descriptor methods that describe frame structural elements in a compact and semantically-rich manner. We propose to combine such compact descriptors with a similarity scoring mechanism in order to select the frames to be “stitched” to reference picture buffers of advanced inter-frame encoders like HEVC, VVC, and VP9 without breaking standard compliance. Our evaluation on synthetic and real-world video sequences with the HEVC and VVC reference encoders shows that our method offers significant rate gains, with complexity and memory requirements that remain manageable for practical encoders and decoders. [ABSTRACT FROM AUTHOR]
- Published: 2022
- Full Text
- View/download PDF
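The sequence-level reference frame entry above hinges on compact frame descriptors and a similarity score used to decide which long-term frames to "stitch" into the reference buffer. The sketch below illustrates only that selection step, with a plain intensity histogram as a stand-in descriptor; a real system would use richer structural descriptors and a standards-compliant reference picture buffer.

```python
import numpy as np

def frame_descriptor(frame, bins=32):
    """Compact per-frame descriptor: a normalised intensity histogram
    (a stand-in for richer structural feature descriptors)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255), density=True)
    return hist

def select_reference(current, long_term_pool, k=2):
    """Pick the k stored frames most similar to the current frame."""
    d = frame_descriptor(current)
    descs = [frame_descriptor(f) for f in long_term_pool]
    sims = [float(np.dot(d, e) / (np.linalg.norm(d) * np.linalg.norm(e) + 1e-12))
            for e in descs]
    order = np.argsort(sims)[::-1][:k]
    return order, [sims[i] for i in order]

rng = np.random.default_rng(1)
pool = [rng.integers(0, 256, size=(64, 64)) for _ in range(10)]
scene = pool[3].copy()                       # a scene stored long ago re-appears
idx, scores = select_reference(scene, pool)
print("stitch frames", idx, "similarities", np.round(scores, 3))
```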
35. Perceptual Quality Assessment of HEVC and VVC Standards for 8K Video.
- Author
-
Bonnineau, Charles, Hamidouche, Wassim, Fournier, Jerome, Sidaty, Naty, Travers, Jean-Francois, and Deforges, Olivier
- Subjects
VIDEO compression ,VIDEO coding ,PERCEIVED quality ,VIDEOS - Abstract
With the growing data consumption of emerging video applications and users' demand for higher resolutions, up to 8K, a considerable effort has been made in video compression technology. Recently, versatile video coding (VVC) has been standardized by the Moving Picture Experts Group (MPEG), providing a significant improvement in compression performance over its predecessor, high efficiency video coding (HEVC). In this paper, we provide a comparative subjective quality evaluation between the VVC and HEVC standards for 8K resolution videos. In addition, we evaluate the perceived quality improvement offered by 8K over UHD 4K resolution. The compression performance of both VVC and HEVC has been evaluated in the random access (RA) coding configuration, using their respective reference software, the VVC test model (VTM-11) and the HEVC test model (HM-16.20). Objective measurements using the PSNR, MS-SSIM and VMAF metrics have shown that the bitrate gains offered by VVC over HEVC for 8K video content are around 31%, 26% and 35%, respectively. Subjectively, VVC offers an average of around 41% bitrate reduction over HEVC for the same visual quality, and a compression gain of 50% has been reached for some tested video sequences according to a Student's t-test analysis. In addition, for most tested scenes, a significant visual difference between uncompressed 4K and 8K has been noticed. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
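For readers who want to see how bitrate-gain figures of the kind quoted in the 8K study above are expressed, the snippet below shows only the elementary percentage computation at matched quality; the rate points are invented, and the paper's actual methodology (BD-rate-style objective analysis and subjective testing) is considerably more involved.

```python
def bitrate_saving(rate_ref, rate_test):
    """Percentage bitrate reduction of the test codec versus the reference
    at (approximately) the same quality level."""
    return 100.0 * (rate_ref - rate_test) / rate_ref

# hypothetical rate points (Mbit/s) at matched quality -- not the paper's data
hevc_rate, vvc_rate = 48.0, 33.0
print(f"VVC saves {bitrate_saving(hevc_rate, vvc_rate):.1f}% bitrate over HEVC")
```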
36. Video-Based Content Extraction Algorithm from Bank Cards for iOS Mobile Devices
- Author
-
Bohush, Rykhard, Kurilovich, Alexander, Ablameyko, Sergey, Barbosa, Simone Diniz Junqueira, Editorial Board Member, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Kotenko, Igor, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Ablameyko, Sergey V., editor, Krasnoproshin, Viktor V., editor, and Lukashevich, Maryna M., editor
- Published
- 2019
- Full Text
- View/download PDF
37. Image Data Extraction Using Image Similarities
- Author
-
Saravanan, D., Joseph, Dennis, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Ruediger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Hirche, Sandra, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Liang, Qilian, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Möller, Sebastian, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zhang, Junjie James, Series Editor, Panda, Ganapati, editor, Satapathy, Suresh Chandra, editor, Biswal, Birendra, editor, and Bansal, Ramesh, editor
- Published
- 2019
- Full Text
- View/download PDF
38. Human‐like evaluation method for object motion detection algorithms
- Author
-
Abimael Guzman‐Pando, Mario Ignacio Chacon‐Murguia, and Lucia B. Chacon‐Diaz
- Subjects
evaluation method ,object motion detection algorithms ,object detection ,video sequences ,human performance metric intervals ,ideal metric values ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Computer software ,QA76.75-76.765 - Abstract
This study proposes a new method to evaluate the performance of moving object detection algorithms (MODA) in video sequences. The proposed method is based on human performance metric intervals, instead of the ideal metric values (0 or 1) commonly used in the literature. These intervals are proposed to establish a more reliable evaluation and comparison, and to identify areas of improvement in the evaluation of MODA. The contributions of the study include the determination of human segmentation performance metric intervals, their comparison with state-of-the-art MODA, and the evaluation of the resulting segmentations in a tracking task to establish the relationship between measured performance and practical utility. Results show that human participants had difficulty achieving a perfect segmentation score. Deep learning algorithms achieved performance above the human average, while other techniques achieved performance between 88% and 92%. Furthermore, the authors demonstrate that algorithms not ranked at the top of the quantitative metrics worked satisfactorily in a tracking experiment and, therefore, should not be discarded for real applications.
- Published
- 2020
- Full Text
- View/download PDF
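The human-like evaluation entry above replaces the ideal metric value of 1 with a human performance interval. A minimal sketch of that comparison logic, with a purely hypothetical F-measure interval, could look as follows.

```python
def rate_against_human(metric_value, human_interval):
    """Rate an algorithm's score against a human performance interval
    instead of the ideal value of 1."""
    lo, hi = human_interval
    if metric_value >= hi:
        return "above human interval"
    if metric_value >= lo:
        return "within human interval"
    return "below human interval"

# hypothetical human F-measure interval for a segmentation task
human_f_interval = (0.88, 0.94)
for name, f in [("deep_net", 0.95), ("classic_bg_sub", 0.90), ("baseline", 0.81)]:
    print(name, "->", rate_against_human(f, human_f_interval))
```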
39. Spatio-Temporal Online Matrix Factorization for Multi-Scale Moving Objects Detection.
- Author
-
Wang, Jingyu, Zhao, Yue, Zhang, Ke, Wang, Qi, and Li, Xuelong
- Subjects
MATRIX decomposition ,OBJECT recognition (Computer vision) ,LOW-rank matrices ,COMPUTER vision ,MATHEMATICAL optimization - Abstract
Detecting moving objects in video sequences is a challenging computer vision task, since dynamic backgrounds, multi-scale moving objects and various kinds of noise interference affect both feasibility and efficiency. In this paper, a novel spatio-temporal online matrix factorization (STOMF) method is proposed to detect multi-scale moving objects against dynamic backgrounds. To accommodate a wide range of real noise, we apply a specific mixture of exponential power (MoEP) distributions within the framework of low-rank matrix factorization (LRMF). For the optimization algorithm, a temporal difference motion prior (TDMP) model is proposed, which estimates the motion matrix and computes the weight matrix. Moreover, a partial spatial motion information (PSMI) post-processing method, which utilizes partial background and motion information, is designed to extract multi-scale objects in a variety of complex dynamic scenes. The superiority of the STOMF method is validated by extensive experiments on practical datasets, in comparison with state-of-the-art moving object detection approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
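STOMF, summarised above, builds on low-rank matrix factorization for background modelling. The sketch below shows only the underlying low-rank idea, using a plain SVD and a fixed threshold; the paper's online factorization, MoEP noise modelling, TDMP prior and PSMI post-processing are not reproduced.

```python
import numpy as np

def lowrank_foreground(frames, rank=1, thresh=30.0):
    """Very simplified low-rank background subtraction: stack vectorised
    frames into a matrix, keep a rank-`rank` approximation as the background,
    and threshold the residual as moving foreground."""
    h, w = frames[0].shape
    D = np.stack([f.ravel() for f in frames], axis=1).astype(float)   # pixels x frames
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    B = (U[:, :rank] * s[:rank]) @ Vt[:rank]                          # low-rank background
    mask = np.abs(D - B) > thresh                                     # residual = foreground
    return mask.reshape(h, w, -1)

rng = np.random.default_rng(2)
bg = rng.integers(90, 110, size=(32, 32)).astype(float)
seq = []
for t in range(20):
    f = bg + rng.normal(0, 2, size=bg.shape)
    f[10:14, 5 + t:9 + t] += 120.0          # a small bright object moving right
    seq.append(f)
masks = lowrank_foreground(seq)
print("foreground pixels in last frame:", int(masks[..., -1].sum()))
```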
40. Multi-Connection Based Scalable Video Streaming in UDNs: A Multi-Agent Multi-Armed Bandit Approach.
- Author
-
Zhu, Kun, Li, Lujiu, Xu, Yuanyuan, Zhang, Tong, and Zhou, Lu
- Abstract
Scalable video coding (SVC) has received much attention for video transmission over wireless networks due to its flexibility. However, most previous work has only considered SVC video streaming from a single base station (BS). At present, the densification of BSs enables a user equipment (UE) to connect to multiple BSs in ultra-dense networks (UDNs). In this paper, we consider the problem of SVC video streaming in a UDN, which allows different layers of a video block to be downloaded from different BSs. An optimization problem is formulated that aims to maximize the quality of experience (QoE) of users by selecting the optimal connection strategy and the optimal number of video layers. Given its complexity, and to solve it efficiently in a distributed manner, the choice of connection strategy is formulated as a multi-agent multi-armed bandit (MA-MAB) problem requiring only limited information exchange; each user can adapt its connection strategy in a distributed, self-learning fashion. To obtain the optimal arm of the MA-MAB problem, we propose a multi-user arm decision algorithm. To avoid large computation and handover costs, we adopt the same connection strategy for the entire video sequence; then, for each video block, with the given connection strategy, the number of video layers is adjusted adaptively according to dynamic network conditions. Finally, based on the above designs, we provide an SVC-based video downloading scheme that obtains an approximately optimal solution to the original optimization problem. Extensive simulations and comparisons show the feasibility and superiority of the proposed scheme. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
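The multi-connection streaming entry above casts connection-strategy selection as a multi-armed bandit. The following single-agent UCB1 sketch, with invented QoE reward distributions, illustrates only the bandit core; the paper's formulation is multi-agent with limited information exchange and joint video-layer adaptation.

```python
import math, random

def ucb1(qoe_of_strategy, n_rounds=2000, seed=0):
    """Single-agent UCB1 over candidate connection strategies.
    `qoe_of_strategy[a]()` returns a noisy QoE sample for strategy a."""
    random.seed(seed)
    k = len(qoe_of_strategy)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, n_rounds + 1):
        if t <= k:
            a = t - 1                                     # play each arm once
        else:
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        counts[a] += 1
        sums[a] += qoe_of_strategy[a]()
    return max(range(k), key=lambda i: sums[i] / counts[i]), counts

# three hypothetical BS connection strategies with different mean QoE
strategies = [lambda: random.gauss(0.55, 0.1),   # single macro BS
              lambda: random.gauss(0.70, 0.1),   # two nearby small BSs
              lambda: random.gauss(0.62, 0.1)]   # macro plus one small BS
best, counts = ucb1(strategies)
print("best strategy:", best, "pull counts:", counts)
```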
41. Comprehensive Underwater Object Tracking Benchmark Dataset and Underwater Image Enhancement With GAN.
- Author
-
Panetta, Karen, Kezebou, Landry, Oludare, Victor, and Agaian, Sos
- Subjects
OBJECT tracking (Computer vision) ,IMAGE intensifiers ,GENERATIVE adversarial networks ,TRACKING algorithms ,ATTENUATION of light ,SUBMERSIBLES ,ARTIFICIAL satellite tracking - Abstract
Current state-of-the-art object tracking methods have largely benefited from the public availability of numerous benchmark datasets. However, the focus has been on open-air imagery and much less on underwater visual data. Inherent underwater distortions, such as color loss, poor contrast, and underexposure, caused by attenuation of light, refraction, and scattering, greatly affect the visual quality of underwater data, and as a result existing open-air trackers perform less effectively on such data. To help bridge this gap, this article proposes the first comprehensive underwater object tracking (UOT100) benchmark dataset to facilitate the development of tracking algorithms well-suited for underwater environments. The proposed dataset consists of 104 underwater video sequences and more than 74 000 annotated frames derived from both natural and artificial underwater videos, with a wide variety of distortions. We benchmark the performance of 20 state-of-the-art object tracking algorithms and further introduce a cascaded residual network based underwater image enhancement model to improve the tracking accuracy and success rate of trackers. Our experimental results demonstrate the shortcomings of existing tracking algorithms on underwater data and show how our generative adversarial network (GAN)-based enhancement model can be used to improve tracking performance. We also evaluate the visual quality of our model's output against existing GAN-based methods using well-accepted quality metrics and demonstrate that our model yields better visual data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
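Tracker comparisons such as the UOT100 benchmark above are usually reported with overlap-based success curves. The sketch below computes an OTB-style success AUC from per-frame IoU values on invented boxes; it is a generic evaluation helper, not the benchmark's official toolkit.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose IoU exceeds each overlap threshold,
    summarised by the area under that success curve."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    curve = np.array([(ious > t).mean() for t in thresholds])
    return curve.mean(), curve

# hypothetical tracker output on a short clip
gt = [(10, 10, 40, 40)] * 5
pred = [(12, 9, 40, 40), (15, 14, 38, 40), (30, 30, 40, 40),
        (11, 11, 40, 40), (13, 10, 39, 41)]
auc, _ = success_auc(pred, gt)
print(f"success AUC: {auc:.3f}")
```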
42. A Pose-Based Feature Fusion and Classification Framework for the Early Prediction of Cerebral Palsy in Infants.
- Author
-
McCay, Kevin D., Hu, Pengpeng, Shum, Hubert P. H., Woo, Wai Lok, Marcroft, Claire, Embleton, Nicholas D., Munteanu, Adrian, and Ho, Edmond S. L.
- Subjects
CEREBRAL palsy ,CHILDREN with cerebral palsy ,INFANT development ,CLASSIFICATION ,INFANTS ,RETINAL blood vessels ,MOTION analysis ,EARLY diagnosis - Abstract
The early diagnosis of cerebral palsy is an area which has recently seen significant multi-disciplinary research. Diagnostic tools such as the General Movements Assessment (GMA), have produced some very promising results. However, the prospect of automating these processes may improve accessibility of the assessment and also enhance the understanding of movement development of infants. Previous works have established the viability of using pose-based features extracted from RGB video sequences to undertake classification of infant body movements based upon the GMA. In this paper, we propose a series of new and improved features, and a feature fusion pipeline for this classification task. We also introduce the RVI-38 dataset, a series of videos captured as part of routine clinical care. By utilising this challenging dataset we establish the robustness of several motion features for classification, subsequently informing the design of our proposed feature fusion framework based upon the GMA. We evaluate our proposed framework’s classification performance using both the RVI-38 dataset and the publicly available MINI-RGBD dataset. We also implement several other methods from the literature for direct comparison using these two independent datasets. Our experimental results and feature analysis show that our proposed pose-based method performs well across both datasets. The proposed features afford us the opportunity to include finer detail than previous methods, and further model GMA specific body movements. These new features also allow us to take advantage of additional body-part specific information as a means of improving the overall classification performance, whilst retaining GMA relevant, interpretable, and shareable features. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
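The cerebral palsy prediction entry above relies on pose-based motion features extracted from RGB video. As a simplified stand-in for those features, the sketch below derives per-joint speed and range-of-motion statistics from a (frames x joints x 2) keypoint array; the paper's actual feature set and fusion pipeline are richer and GMA-informed.

```python
import numpy as np

def pose_motion_features(keypoints):
    """Simple per-joint motion statistics from a (T, J, 2) array of 2D joint
    positions: mean speed and positional range per joint, concatenated into
    one feature vector."""
    disp = np.diff(keypoints, axis=0)                  # (T-1, J, 2) displacements
    speed = np.linalg.norm(disp, axis=2)               # (T-1, J) per-frame speeds
    mean_speed = speed.mean(axis=0)                    # (J,)
    motion_range = keypoints.max(axis=0) - keypoints.min(axis=0)   # (J, 2)
    return np.concatenate([mean_speed, motion_range.ravel()])

rng = np.random.default_rng(3)
T, J = 120, 17                                         # 120 frames, 17 joints
poses = np.cumsum(rng.normal(0, 0.5, size=(T, J, 2)), axis=0) + 100
feats = pose_motion_features(poses)
print("feature vector length:", feats.shape[0])        # J + 2*J = 51
```

Such a fixed-length vector could then be fed to any standard classifier, which is the general shape of the pipeline the entry describes.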
43. On-the-Fly Facial Expression Prediction Using LSTM Encoded Appearance-Suppressed Dynamics.
- Author
-
Baddar, Wissam J., Lee, Sangmin, and Ro, Yong Man
- Abstract
Encoding facial expression dynamics is effective for classifying and recognizing facial expressions. Most facial dynamics-based methods assume that a sequence is temporally segmented before prediction, which requires the prediction to wait until a full sequence is available, resulting in prediction delay. To reduce this delay and enable prediction “on-the-fly” (as frames are fed to the system), we propose a new dynamics feature learning method that allows prediction with partial (incomplete) sequences. The proposed method exploits the suitability of recurrent neural networks (RNNs) for on-the-fly prediction and introduces novel learning constraints that induce early prediction from partial sequences. We further show that a delay in accurate prediction using RNNs can originate from the effect that the subject's appearance has on the spatio-temporal features encoded by the RNN; we refer to this effect as “appearance bias”. We propose the appearance-suppressed dynamics feature, which utilizes a static sequence to suppress the appearance bias. Experimental results show that the proposed method achieved higher recognition rates than state-of-the-art methods on publicly available datasets. The results also verify that the proposed method improves on-the-fly prediction at subtle expression frames early in the sequence, using partial sequence inputs. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
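The on-the-fly expression prediction entry above is about emitting predictions from partial sequences. The toy sketch below accumulates per-frame evidence and stops once a confidence threshold is reached; it only mimics that behaviour, not the paper's RNN with appearance-suppressed dynamics features and early-prediction constraints.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def on_the_fly_predict(frame_logits, confidence=0.9):
    """Emit a class prediction at every frame from the evidence seen so far
    (a running mean of per-frame logits), stopping early once the posterior
    is confident enough."""
    running = np.zeros(frame_logits.shape[1])
    p = softmax(running)
    for t, z in enumerate(frame_logits, start=1):
        running += (z - running) / t                   # running mean of logits
        p = softmax(running)
        if p.max() >= confidence:
            return t, int(p.argmax()), float(p.max())
    return len(frame_logits), int(p.argmax()), float(p.max())

rng = np.random.default_rng(4)
true_class = 2
logits = rng.normal(0, 1, size=(30, 6))
logits[:, true_class] += np.linspace(0.2, 3.0, 30)     # evidence grows over time
t, cls, conf = on_the_fly_predict(logits)
print(f"predicted class {cls} at frame {t} with confidence {conf:.2f}")
```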
44. RecapNet: Action Proposal Generation Mimicking Human Cognitive Process.
- Author
-
Wang, Tian, Chen, Yang, Lin, Zhiwei, Zhu, Aichun, Li, Yong, Snoussi, Hichem, and Wang, Hui
- Abstract
Generating action proposals in untrimmed videos is a challenging task, since video sequences usually contain a large amount of irrelevant content and the duration of an action instance is arbitrary. The quality of action proposals is key to action detection performance. Previous methods mainly rely on sliding windows or anchor boxes to cover all ground-truth actions, but this is infeasible and computationally inefficient. To this end, this article proposes RecapNet, a novel framework for generating action proposals that mimics the human cognitive process of understanding video content. Specifically, RecapNet includes a residual causal convolution module that builds a short memory of past events, on top of which a joint-probability actionness density ranking mechanism is designed to retrieve the action proposals. RecapNet can handle videos of arbitrary length and, more importantly, a video sequence needs to be processed only in a single pass to generate all action proposals. The experiments show that the proposed RecapNet outperforms the state of the art under all metrics on the benchmark THUMOS14 and ActivityNet-1.3 datasets. The code is publicly available at https://github.com/tianwangbuaa/RecapNet. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
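RecapNet's short-memory module, as described above, is built from residual causal convolutions. The sketch below implements a one-dimensional causal convolution with a residual connection in NumPy on an invented actionness signal; the kernel weights and the signal are arbitrary, and the full network and ranking mechanism are not reproduced.

```python
import numpy as np

def causal_conv1d(x, w, b=0.0):
    """1D causal convolution: the output at time t depends only on inputs
    at times <= t (left zero-padding, no lookahead)."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(xp[t:t + k], w) for t in range(len(x))]) + b

def residual_causal_block(x, w):
    """Residual causal block: causal convolution + ReLU, added back to the
    input, as a toy stand-in for a short-memory module."""
    h = np.maximum(causal_conv1d(x, w), 0.0)
    return x + h

rng = np.random.default_rng(5)
actionness = np.zeros(50)
actionness[20:30] = 1.0                                 # a ground-truth action span
signal = actionness + rng.normal(0, 0.1, 50)
out = residual_causal_block(signal, w=np.array([0.2, 0.3, 0.5]))
print("peak response frame:", int(out.argmax()))
```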
45. Robust Video Authentication Mechanism Based On Statistical Local Feature.
- Author
-
Wadhe, Avinash and Ambhaikar, Asha
- Subjects
DIGITAL watermarking ,DIGITAL signatures ,VIDEO monitors ,DIGITAL video ,BROADCAST channels ,VIDEO editing ,VIDEOS ,VIDEO surveillance - Abstract
Advances in technology have made it easier to edit videos, making it harder to maintain the reliability of video information. Videos can be created with many digital devices and broadcast on any channel over the high-speed internet. Video captured by CCTV can be used as evidence but must first be protected from tampering. Video authentication can be performed using watermarking and digital signatures, but these techniques are mainly intended for copyright protection, and it is difficult to monitor and authenticate videos captured by users' own cameras. In this study, we develop a robust video authentication technique based on local video information that can be used in real applications. The proposed technique processes a dataset of videos captured by users' cameras using the statistical local information of each video. The results show that the proposed video authentication technique is reliable, faster and less expensive than other techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2021
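The authentication entry above describes using statistical local features of a video without giving the exact features. Purely as an assumed illustration of that general idea, the sketch below derives a fragile signature from per-block means and standard deviations and flags a video whose signature no longer matches; it is not the paper's method.

```python
import numpy as np

def block_statistics_signature(frames, block=16):
    """Fragile per-frame signature from local block statistics (mean and
    standard deviation of each block)."""
    sig = []
    for f in frames:
        h, w = f.shape
        stats = []
        for y in range(0, h - h % block, block):
            for x in range(0, w - w % block, block):
                blk = f[y:y + block, x:x + block]
                stats.append((blk.mean(), blk.std()))
        sig.append(np.array(stats))
    return sig

def is_authentic(sig_ref, sig_test, tol=1.0):
    """The video passes if every block statistic stays within tolerance."""
    return all(np.abs(a - b).max() <= tol for a, b in zip(sig_ref, sig_test))

rng = np.random.default_rng(6)
video = [rng.integers(0, 256, size=(64, 64)).astype(float) for _ in range(5)]
tampered = [f.copy() for f in video]
tampered[2][0:16, 0:16] = 255.0                         # overwrite one block
ref_sig = block_statistics_signature(video)
print("original authentic:", is_authentic(ref_sig, block_statistics_signature(video)))
print("tampered authentic:", is_authentic(ref_sig, block_statistics_signature(tampered)))
```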
46. A comparison on visual prediction models for MAMO (multi activity-multi object) recognition using deep learning
- Author
-
Budi Padmaja, Madhu Bala Myneni, and Epili Krishna Rao Patro
- Subjects
Multi-activity ,Human activity recognition ,Computer vision ,YOLO ,Video sequences ,Computer engineering. Computer hardware ,TK7885-7895 ,Information technology ,T58.5-58.64 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Abstract Multi activity-multi object recognition (MAMO) is a challenging task in visual systems for monitoring, recognizing and alerting in various public places, such as universities, hospitals and airports. Both academic and commercial researchers are working towards automatic tracking of human activities in intelligent video surveillance using deep learning frameworks, which is required in many real-time applications to detect unusual or suspicious activities, such as tracking suspicious behaviour in crime events. The primary purpose of this paper is to provide multi-class activity prediction for individuals as well as groups from video sequences by using the state-of-the-art object detector You Only Look Once (YOLOv3). By optimally utilizing the geographical information of the cameras and the YOLO object detection framework, a Deep Landmark model recognizes simple to complex human actions on grayscale and RGB image frames of video sequences. This model is tested and compared with various benchmark datasets and found to be the most precise model for detecting human activities in video streams. Analysis of the experimental results shows that the proposed method achieves superior performance as well as high accuracy.
- Published
- 2020
- Full Text
- View/download PDF
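The MAMO entry above runs a YOLOv3 detector frame by frame and aggregates detections into activity predictions. To keep the sketch below self-contained, the detector is replaced by a simulated stand-in function (detect_objects is invented, not a real YOLO interface); only the per-frame loop and the aggregation into an activity summary are illustrated.

```python
import numpy as np

def detect_objects(frame):
    """Stand-in for a YOLOv3-style detector: returns (label, confidence, box)
    tuples. In a real pipeline a trained network would run on the frame;
    here detections are simulated so the sketch stays self-contained."""
    rng = np.random.default_rng(int(frame.sum()) % (2**32))
    labels = ["person_walking", "person_sitting", "car"]
    n = rng.integers(0, 4)
    return [(labels[rng.integers(0, len(labels))],
             float(rng.uniform(0.5, 1.0)),
             tuple(rng.integers(0, 100, size=4))) for _ in range(n)]

def summarise_activities(frames, min_conf=0.6):
    """Aggregate per-frame detections into a multi-activity summary."""
    counts = {}
    for f in frames:
        for label, conf, _ in detect_objects(f):
            if conf >= min_conf:
                counts[label] = counts.get(label, 0) + 1
    return counts

frames = [np.full((64, 64), i, dtype=np.uint8) for i in range(30)]
print(summarise_activities(frames))
```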
47. Rate Control for Predictive Transform Screen Content Video Coding Based on RANSAC.
- Subjects
VIDEO coding ,BIT rate ,STATISTICAL sampling ,POINT set theory - Abstract
In predictive transform video coding, optimal bit allocation and quantization parameter (QP) estimation are important to control the bit rate of blocks, frames and the whole sequence. Common solutions to this problem rely on trained models to approximate the rate-distortion (R-D) characteristics of the video content during coding. Moreover, these solutions are mainly targeted at natural content sequences, whose characteristics differ greatly from those of screen content (SC) sequences. In this article, we depart from such trained R-D models and propose a low-complexity rate control (RC) method for SC sequences that leverages the available information about the R-D characteristics of previously coded blocks within a frame. Namely, our method first allocates bits at the frame and block levels based on their motion and texture characteristics. It then approximates the R-D and R-QP curves of each block by a set of control points and random sample consensus (RANSAC). Finally, it computes the appropriate block-level QP values to attain a target bit rate with the minimum distortion possible. The proposed RC method is embedded into a standard High-Efficiency Video Coding (H.265/HEVC) encoder and evaluated on several SC sequences. Our results show that our method not only attains better R-D performance than H.265/HEVC and other methods designed for SC sequences but also achieves a more constant and higher reconstruction quality across all frames. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
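The rate control entry above fits R-QP behaviour from control points of previously coded blocks using RANSAC. The sketch below runs a minimal hand-written RANSAC line fit on an assumed log-rate versus QP model with invented control points and outliers, then inverts the fit to pick a QP for a target rate; the paper's actual curve model and bit allocation scheme are not reproduced.

```python
import numpy as np

def ransac_line(x, y, n_iter=200, tol=0.1, seed=0):
    """Minimal RANSAC for a line y = a + b*x: sample point pairs, keep the
    model with the most inliers, then refit on the inliers by least squares."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        inliers = np.abs(y - (a + b * x)) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    A = np.vstack([np.ones(best_inliers.sum()), x[best_inliers]]).T
    a, b = np.linalg.lstsq(A, y[best_inliers], rcond=None)[0]
    return a, b

# control points from previously coded blocks: (QP, log bitrate), plus outliers
qp = np.array([22, 25, 28, 31, 34, 37, 27, 33], dtype=float)
log_rate = -0.12 * qp + 6.0
log_rate[-2:] += np.array([1.5, -1.2])                  # two outlier blocks
a, b = ransac_line(qp, log_rate)
target_bits = np.exp(4.0)                               # arbitrary target block bits
qp_for_target = (np.log(target_bits) - a) / b
print(f"model: logR = {a:.2f} + {b:.3f}*QP -> QP for target rate: {qp_for_target:.1f}")
```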
48. Learning Expression Features via Deep Residual Attention Networks for Facial Expression Recognition From Video Sequences.
- Author
-
Zhao, Xiaoming, Chen, Gang, Chuang, Yuelong, Tao, Xin, and Zhang, Shiqing
- Subjects
FACIAL expression ,COMPUTER vision ,ARTIFICIAL intelligence ,MACHINE learning ,VIDEOS - Abstract
Facial expression recognition from video sequences is currently an active research topic in computer vision, pattern recognition and artificial intelligence. Owing to the semantic gap between hand-designed features extracted from affective videos and subjective emotions, recognizing facial expressions from video sequences is a challenging problem. To tackle it, this paper proposes a new method for facial expression recognition from video sequences based on deep residual attention networks. Firstly, since the intensity of emotional expression differs across local areas of a facial image, deep residual attention networks, which integrate deep residual networks with a spatial attention mechanism, are employed to learn high-level affective expression features for each frame in a video sequence. Then, average-pooling is performed to produce fixed-length global video-level feature representations. Finally, these video-level representations are fed into a multi-layer perceptron to perform facial expression classification. Experimental results on two public video emotion datasets, BAUM-1s and RML, demonstrate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
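The facial expression entry above pools frame-level features into a video-level representation before a multi-layer perceptron classifier. The sketch below shows that pooling-plus-MLP stage only, with random feature vectors and untrained random weights standing in for the deep residual attention network and the trained classifier.

```python
import numpy as np

def video_level_representation(frame_features):
    """Average-pool per-frame features into one fixed-length video descriptor."""
    return frame_features.mean(axis=0)

def mlp_forward(x, W1, b1, W2, b2):
    """Tiny two-layer perceptron forward pass with a softmax output."""
    h = np.maximum(x @ W1 + b1, 0.0)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(7)
frame_feats = rng.normal(size=(90, 512))                # 90 frames x 512-d features
video_feat = video_level_representation(frame_feats)
W1, b1 = rng.normal(scale=0.05, size=(512, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.05, size=(64, 6)), np.zeros(6)   # 6 expression classes
probs = mlp_forward(video_feat, W1, b1, W2, b2)
print("predicted expression:", int(probs.argmax()), "p =", round(float(probs.max()), 3))
```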
49. Supervised-learning-Based QoE Prediction of Video Streaming in Future Networks: A Tutorial with Comparative Study.
- Author
-
Ahmad, Arslan, Mansoor, Atif Bin, Barakabitze, Alcardo Alex, Hines, Andrew, Atzori, Luigi, and Walshe, Ray
- Subjects
SUPERVISED learning ,SOFTWARE-defined networking ,NEXT generation networks ,COMPARATIVE studies ,PREDICTION models ,FORECASTING - Abstract
Quality of experience (QoE)-based service management remains key for the successful provisioning of multimedia services in next-generation networks such as 5G/6G, which requires proper tools for quality monitoring, prediction, and resource management, where machine learning (ML) can play a crucial role. In this article, we provide a tutorial on the development and deployment of QoE measurement and prediction solutions for video streaming services based on supervised learning ML models. First, we provide a detailed pipeline for developing and deploying supervised-learning-based video streaming QoE prediction models that covers several stages, including data collection, feature engineering, model optimization and training, testing and prediction, and evaluation. Second, we discuss the deployment of the ML model for QoE prediction/measurement in 5G/6G networks using network-enabling technologies such as software-defined networking, network function virtualization, and multi-access edge computing, by proposing a reference architecture. Third, we present a comparative study of state-of-the-art supervised learning ML models for QoE prediction of video streaming applications based on multiple performance metrics. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
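The QoE tutorial above walks through a supervised-learning pipeline from streaming features to a trained prediction model. A compact scikit-learn sketch of such a pipeline, on entirely synthetic session features and MOS labels (the feature choices and coefficients below are invented for illustration), might look as follows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic session-level features: [stall count, avg bitrate (Mbps), quality switches]
rng = np.random.default_rng(8)
n = 500
X = np.column_stack([rng.poisson(1.0, n),
                     rng.uniform(0.5, 8.0, n),
                     rng.poisson(3.0, n)])
# Synthetic MOS in [1, 5]: higher bitrate helps, stalls and switches hurt (+ noise)
mos = np.clip(1.0 + 0.45 * X[:, 1] - 0.8 * X[:, 0] - 0.1 * X[:, 2]
              + rng.normal(0, 0.3, n), 1.0, 5.0)

X_tr, X_te, y_tr, y_te = train_test_split(X, mos, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("MAE on held-out sessions:",
      round(mean_absolute_error(y_te, model.predict(X_te)), 3))
```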
50. Video Summarization Using Deep Neural Networks: A Survey.
- Author
-
Apostolidis, Evlampios, Adamantidou, Eleni, Metsai, Alexandros I., Mezaris, Vasileios, and Patras, Ioannis
- Subjects
RECURRENT neural networks ,VIDEO summarization ,VIDEOS - Abstract
Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms, and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF