Author: "He, Zhenyu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"He, Zhenyu"' showing total 1,012 results


1. LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation

Author: Ding, Henghui, Hong, Lingyi, Liu, Chang, Xu, Ning, Yang, Linjie, Fan, Yuchen, Miao, Deshui, Gu, Yameng, Li, Xin, He, Zhenyu, Wang, Yaowei, Yang, Ming-Hsuan, Chai, Jinming, Ma, Qin, Zhang, Junpei, Jiao, Licheng, Liu, Fang, Liu, Xinyu, Zhang, Jing, Zhang, Kexin, Liu, Xu, Li, LingLing, Fang, Hao, Pan, Feiyu, Lu, Xiankai, Zhang, Wei, Cong, Runmin, Tran, Tuyen, Cao, Bin, Zhang, Yisi, Wang, Hanyi, He, Xingjian, and Liu, Jing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes. In this paper, we introduce the 6th Large-scale Video Object Segmentation (LSVOS) challenge, held in conjunction with the ECCV 2024 workshop. This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). This year, we replace the classic YouTube-VOS and YouTube-RVOS benchmarks with the latest datasets, MOSE, LVOS, and MeViS, to assess VOS in more challenging, complex environments. This year's challenge attracted 129 registered teams from more than 20 institutes across over 8 countries. This report includes an introduction to the challenge and datasets, and the methods used by the top 7 teams in the two tracks. More details can be found on our homepage: https://lsvos.github.io/
Comment: ECCV 2024 LSVOS Challenge Report: https://lsvos.github.io/
Published: 2024

2. Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS

Author: Miao, Deshui, Gu, Yameng, Li, Xin, He, Zhenyu, Wang, Yaowei, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video object segmentation (VOS) is a crucial task in computer vision, but current VOS methods struggle with complex scenes and prolonged object motions. To address these challenges, the MOSE dataset aims to enhance object recognition and differentiation in complex environments, while the LVOS dataset focuses on segmenting objects exhibiting long-term, intricate movements. This report introduces a discriminative spatial-temporal VOS model that utilizes discriminative object features as query representations. The semantic understanding of spatial-semantic modules enables it to recognize object parts, while salient features highlight more distinctive object characteristics. Our model, trained on extensive VOS datasets, achieved first place (80.90% J&F) on the test set of the 6th LSVOS challenge in the VOS Track, demonstrating its effectiveness in tackling the aforementioned challenges. The code will be available at https://github.com/yahooo-m/VOS-Solution
Comment: 1st Place Solution for 6th LSVOS VOS Track. arXiv admin note: substantial text overlap with arXiv:2406.04600
Published: 2024

3. Data Generation Scheme for Thermal Modality with Edge-Guided Adversarial Conditional Diffusion Model

Author: Zhu, Guoqing, Pan, Honghu, Wang, Qiang, Tian, Chao, Yang, Chao, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In challenging low-light and adverse weather conditions, thermal vision algorithms, especially object detection, have exhibited remarkable potential, contrasting with the frequent struggles encountered by visible vision algorithms. Nevertheless, the efficacy of thermal vision algorithms driven by deep learning models remains constrained by the paucity of available training data samples. To this end, this paper introduces a novel approach termed the edge-guided conditional diffusion model (ECDM). This framework aims to produce meticulously aligned pseudo thermal images at the pixel level, leveraging edge information extracted from visible images. By utilizing edges as contextual cues from the visible domain, the diffusion model achieves meticulous control over the delineation of objects within the generated images. To alleviate the impact of visible-specific edge information that should not appear in the thermal domain, a two-stage modality adversarial training strategy is proposed to filter it out of the generated images by differentiating the visible and thermal modalities. Extensive experiments on LLVIP demonstrate ECDM's superiority over existing state-of-the-art approaches in terms of image generation quality.
Comment: accepted by ACM MM 2024/ACM MM24
Published: 2024

4. Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors

Author: Pei, Qizhi, Wu, Lijun, He, Zhenyu, Zhu, Jinhua, Xia, Yingce, Xie, Shufang, and Yan, Rui
Subjects: Quantitative Biology - Biomolecules, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Drug-Target binding Affinity (DTA) prediction is essential for drug discovery. Despite the application of deep learning methods to DTA prediction, the achieved accuracy remains suboptimal. In this work, inspired by the recent success of retrieval methods, we propose kNN-DTA, a non-parametric embedding-based retrieval method applied to a pre-trained DTA prediction model, which can extend the power of the DTA model at no or negligible cost. Unlike existing methods, we introduce two ways of aggregating neighbors, one in embedding space and one in label space, which are integrated into a unified framework. Specifically, we propose a label aggregation with pair-wise retrieval and a representation aggregation with point-wise retrieval of the nearest neighbors. This method executes in the inference phase and can efficiently boost DTA prediction performance with no training cost. In addition, we propose an extension, Ada-kNN-DTA, an instance-wise and adaptive aggregation with lightweight learning. Results on four benchmark datasets show that kNN-DTA brings significant improvements, outperforming previous state-of-the-art (SOTA) results; e.g., on the BindingDB IC50 and Ki testbeds, kNN-DTA obtains new records of RMSE 0.684 and 0.750. The extended Ada-kNN-DTA further improves the performance to 0.675 and 0.735 RMSE. These results strongly demonstrate the effectiveness of our method. Results in other settings and comprehensive studies/analyses also show the great potential of our kNN-DTA approach.
Comment: Accepted by 33rd ACM International Conference on Information and Knowledge Management 2024 (CIKM 2024)
Published: 2024
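
The label-aggregation idea in the abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name knn_label_aggregate, Euclidean distance, exponential distance weighting, and the interpolation weight lam.

```python
import math

def knn_label_aggregate(query, memory, model_pred, k=2, lam=0.5, temperature=1.0):
    """Blend a model prediction with labels of the k nearest stored embeddings.

    memory is a list of (embedding, label) pairs. The weighting scheme and
    the interpolation weight `lam` are illustrative assumptions.
    """
    # Distance from the query embedding to every stored embedding, nearest first.
    nearest = sorted((math.dist(query, emb), label) for emb, label in memory)[:k]
    # Closer neighbors receive larger weights.
    weights = [math.exp(-dist / temperature) for dist, _ in nearest]
    knn_pred = sum(w * label for w, (_, label) in zip(weights, nearest)) / sum(weights)
    # Interpolate between the pre-trained model's output and the neighbor estimate.
    return lam * model_pred + (1 - lam) * knn_pred

memory = [([0.0, 0.0], 1.0), ([0.0, 2.0], 3.0), ([10.0, 10.0], 100.0)]
blended = knn_label_aggregate([0.0, 1.0], memory, model_pred=4.0)
```

No training step is involved in this sketch: the stored pairs are only read at inference time, which matches the abstract's description of a method that runs in the inference phase.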

5. GRAPE: Generalizable and Robust Multi-view Facial Capture

Author: Li, Jing, Kang, Di, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep learning-based multi-view facial capture methods have shown impressive accuracy while being several orders of magnitude faster than a traditional mesh registration pipeline. However, the existing systems (e.g., TEMPEH) are strictly restricted to inference on data captured by the same camera array used to capture their training data. In this study, we aim to improve the generalization ability so that a trained model can be readily used for inference (i.e., capturing new data) on a different camera array. To this end, we propose a more generalizable initialization module to extract camera array-agnostic 3D features, including a visual hull-based head localization and a visibility-aware 3D feature aggregation module enabled by the visual hull. In addition, we propose an "update-by-disagreement" learning strategy to better handle data noise (e.g., inaccurate registration, scan noise) by discarding potentially inaccurate supervision signals during training. The resultant generalizable and robust topologically consistent multi-view facial capture system (GRAPE) can be readily used to capture data on a different camera array, greatly reducing the effort of data collection and processing. Experiments on the FaMoS and FaceScape datasets demonstrate the effectiveness of the proposed method.
Published: 2024

6. Learning Spatial-Semantic Features for Robust Video Object Segmentation

Author: Li, Xin, Miao, Deshui, He, Zhenyu, Wang, Yaowei, Lu, Huchuan, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and identity confusion caused by occlusion, background clutter, and long-term variations. In this paper, we propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic network comprising a semantic embedding block and a spatial dependency modeling block to associate the pretrained ViT features with global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation and ensuring effective long-term query propagation. The experimental results show that the proposed method sets new state-of-the-art performance on multiple datasets, including the DAVIS2017 test (89.1%), YoutubeVOS 2019 (88.5%), MOSE (75.1%), LVOS test (73.0%), and LVOS val (75.1%), demonstrating the effectiveness and generalization capacity of the proposed method. We will make all source code and trained models publicly available.
Comment: Winner solution of the VOTS2024 Challenge
Published: 2024

7. Let the Code LLM Edit Itself When You Edit the Code

Author: He, Zhenyu, Zhang, Jun, Luo, Shengjie, Xu, Jingjing, Zhang, Zhi, and He, Di
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Software Engineering
Abstract: In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it into the original KV cache runs into the temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing Positional Integrity Encoding (PIE). Building upon the rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks while well approximating the model performance.
Comment: Preprint. Work in Progress
Published: 2024
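
The rotary-matrix step described in the abstract above rests on a simple identity: undoing the rotation for an old position and applying the rotation for a new one composes into a single rotation by the position difference. The sketch below shows only that algebra on one key vector. The function names, the adjacent-pair layout of rotated dimensions, and the frequency base of 10000 are assumptions made for illustration, not details confirmed by the abstract.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by the rotary angles for position `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # angle for this pair of dimensions
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def reposition_key(cached_key, old_pos, new_pos):
    """Move an already-rotated key from old_pos to new_pos with one rotation."""
    # Undoing R(old_pos) and applying R(new_pos) composes to R(new_pos - old_pos).
    return rotate(cached_key, new_pos - old_pos)

raw_key = [0.3, -1.2, 0.7, 2.0]
cached = rotate(raw_key, 10)            # key as stored in the cache at position 10
moved = reposition_key(cached, 10, 7)   # e.g. three earlier tokens were deleted
recomputed = rotate(raw_key, 7)         # rotating the raw key at position 7 directly
```

Here moved and recomputed agree up to floating-point error. The sketch covers only the positional rotation of a single key vector and says nothing about how a real model's cached values change after an edit.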

8. PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

Author: Ding, Henghui, Liu, Chang, Wei, Yunchao, Ravi, Nikhila, He, Shuting, Bai, Song, Torr, Philip, Miao, Deshui, Li, Xin, He, Zhenyu, Wang, Yaowei, Yang, Ming-Hsuan, Xu, Zhensong, Yao, Jiangtao, Wu, Chengjing, Liu, Ting, Liu, Luoqi, Liu, Xinyu, Zhang, Jing, Zhang, Kexin, Yang, Yuting, Jiao, Licheng, Yang, Shuyuan, Gao, Mingqi, Luo, Jingnan, Yang, Jinyu, Han, Jungong, Zheng, Feng, Cao, Bin, Zhang, Yisi, Lin, Xuanxu, He, Xingjian, Zhao, Bo, Liu, Jing, Pan, Feiyu, Fang, Hao, and Lu, Xiankai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The Pixel-level Video Understanding in the Wild (PVUW) Challenge focuses on complex video understanding. In this CVPR 2024 workshop, we add two new tracks: the Complex Video Object Segmentation Track, based on the MOSE dataset, and the Motion Expression guided Video Segmentation track, based on the MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, and crowded environments in MOSE. Moreover, we provide a new motion expression guided video segmentation dataset, MeViS, to study natural language-guided video understanding in complex environments. These new videos, sentences, and annotations enable us to foster the development of a more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios. The MOSE challenge had 140 registered teams in total; 65 teams participated in the validation phase and 12 teams made valid submissions in the final challenge phase. The MeViS challenge had 225 registered teams in total; 50 teams participated in the validation phase and 5 teams made valid submissions in the final challenge phase.
Comment: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024
Published: 2024

9. 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation

Author: Miao, Deshui, Li, Xin, He, Zhenyu, Wang, Yaowei, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Tracking and segmenting multiple objects in complex scenes has always been a challenge in the field of video object segmentation, especially in scenarios where objects are occluded and split into parts. In such cases, the definition of objects becomes very ambiguous. The motivation behind the MOSE dataset is to clearly recognize and distinguish objects in complex scenes. In this challenge, we propose a semantic embedding video object segmentation model and use the salient features of objects as query representations. The semantic understanding helps the model recognize parts of the objects, and the salient features capture the more discriminative features of the objects. Trained on a large-scale video object segmentation dataset, our model achieves first place (84.45%) on the test set of the PVUW Challenge 2024: Complex Video Object Segmentation Track.
Published: 2024

10. Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models

Author: Zhou, Zikun, Xiong, Wentao, Zhou, Li, Li, Xin, He, Zhenyu, and Wang, Yaowei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual contents at the pixel level. Current RVOS methods typically use vision and language models pre-trained independently as backbones. As images and texts are mapped to uncoupled feature spaces, they face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pre-trained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pre-training task (image/region-level prediction) and the RVOS task (pixel-level prediction in videos). In this work, we introduce a framework named VLP-RVOS to address this transfer challenge. We first propose a temporal-aware prompt-tuning method, which not only adapts pre-trained representations for pixel-level prediction but also empowers the vision encoder to model temporal clues. We further propose to perform multi-stage VL relation modeling during and after feature extraction for comprehensive VL understanding. In addition, we customize a cube-frame attention mechanism for spatial-temporal reasoning. Extensive experiments demonstrate that our method outperforms state-of-the-art algorithms and exhibits strong generalization abilities.
Published: 2024

11. Spatial-Temporal Multi-level Association for Video Object Segmentation

Author: Miao, Deshui, Li, Xin, He, Zhenyu, Lu, Huchuan, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Existing semi-supervised video object segmentation methods either focus on temporal feature matching or spatial-temporal feature modeling. However, they do not address the issues of sufficient target interaction and efficient parallel processing simultaneously, thereby constraining the learning of dynamic, target-aware features. To tackle these limitations, this paper proposes a spatial-temporal multi-level association framework, which jointly associates reference frame, test frame, and object features to achieve sufficient interaction and parallel target ID association with a spatial-temporal memory bank for efficient video object segmentation. Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features, which formulates feature extraction and interaction as the efficient operations of object self-attention, reference object enhancement, and test reference correlation. In addition, we propose a spatial-temporal memory to assist feature association and temporal ID assignment and correlation. We evaluate the proposed method by conducting extensive experiments on numerous video object segmentation datasets, including DAVIS 2016/2017 val, DAVIS 2017 test-dev, and YouTube-VOS 2018/2019 val. The favorable performance against the state-of-the-art methods demonstrates the effectiveness of our approach. All source code and trained models will be made publicly available.
Published: 2024

12. RTracker: Recoverable Tracking via PN Tree Structured Memory

Author: Huang, Yuqing, Li, Xin, Zhou, Zikun, Wang, Yaowei, He, Zhenyu, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing tracking methods mainly focus on learning better target representation or developing more robust prediction models to improve tracking performance. While tracking performance has significantly improved, the target loss issue occurs frequently due to tracking failures, complete occlusion, or out-of-view situations. However, considerably less attention is paid to the self-recovery issue of tracking methods, which is crucial for practical applications. To this end, we propose a recoverable tracking framework, RTracker, that uses a tree-structured memory to dynamically associate a tracker and a detector to enable self-recovery. Specifically, we propose a Positive-Negative Tree-structured memory to chronologically store and maintain positive and negative target samples. Building on the PN tree memory, we develop corresponding walking rules for determining the state of the target and define a set of control flows to unite the tracker and the detector in different tracking scenarios. Our core idea is to use the support samples of positive and negative target categories to establish a relative distance-based criterion for a reliable assessment of target loss. The favorable performance in comparison against the state-of-the-art methods on numerous challenging benchmarks demonstrates the effectiveness of the proposed algorithm.
Comment: accepted by CVPR 2024
Published: 2024

13. Do Efficient Transformers Really Save Computation?

Author: Yang, Kai, Ackermann, Jan, He, Zhenyu, Feng, Guhao, Zhang, Bohang, Feng, Yunzhen, Ye, Qiwei, He, Di, and Wang, Liwei
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.
Published: 2024

14. Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation

Author: He, Zhenyu, Feng, Guhao, Luo, Shengjie, Yang, Kai, Wang, Liwei, Xu, Jingjing, Zhang, Zhi, Yang, Hongxia, and He, Di
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: In this work, we leverage the intrinsic segmentation of language sequences and design a new positional encoding method called Bilevel Positional Encoding (BiPE). For each position, our BiPE blends an intra-segment encoding and an inter-segment encoding. The intra-segment encoding identifies the locations within a segment and helps the model capture the semantic information therein via absolute positional encoding. The inter-segment encoding specifies the segment index, models the relationships between segments, and aims to improve extrapolation capabilities via relative positional encoding. Theoretical analysis shows this disentanglement of positional information makes learning more effective. The empirical results also show that our BiPE has superior length extrapolation capabilities across a wide range of tasks in diverse text modalities.
Comment: 17 pages, 7 figures, 8 tables; ICML 2024 Camera Ready version; Code: https://github.com/zhenyuhe00/BiPE
Published: 2024
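
The two position indices that the abstract above describes (a location within a segment and a segment index) can be sketched as follows. The function name, the choice of delimiter tokens, and the decision to count a delimiter as the last token of its segment are assumptions made for illustration. How the two indices are turned into encodings and blended is not shown.

```python
def bilevel_positions(tokens, delimiters=(".", "\n")):
    """Return (intra_segment_position, segment_index) for every token.

    A delimiter token closes the current segment; treating "." and a newline
    as delimiters is an illustrative assumption.
    """
    positions = []
    segment, offset = 0, 0
    for tok in tokens:
        positions.append((offset, segment))
        if tok in delimiters:
            segment += 1   # the next token starts a new segment
            offset = 0     # intra-segment position restarts from zero
        else:
            offset += 1
    return positions

pairs = bilevel_positions(["a", "b", ".", "c", "d"])
# pairs == [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1)]
```

Because the intra-segment position restarts in every segment, it stays within the range seen in training as long as segments do not grow, while only the segment index grows with sequence length.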

15. REST: Retrieval-Based Speculative Decoding

Author: He, Zhenyu, Zhong, Zexuan, Cai, Tianle, Lee, Jason D., and He, Di
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration with and acceleration of any language model, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at https://github.com/FasterDecoding/REST
Comment: NAACL 2024, camera ready
Published: 2023
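
The retrieval step that replaces a draft model in the abstract above can be sketched as a suffix lookup: find the longest suffix of the current context that occurs in a token datastore and propose the tokens that followed it as a draft. The function name, the brute-force scan, and the parameters max_suffix and draft_len are assumptions made for illustration. The verification of draft tokens by the language model is omitted.

```python
def retrieve_draft(context, datastore, max_suffix=4, draft_len=3):
    """Propose draft tokens by matching the longest context suffix in a datastore.

    datastore is a list of token sequences. A linear scan is used here only
    to keep the sketch short.
    """
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-n:]
        for seq in datastore:
            for i in range(len(seq) - n + 1):
                if seq[i:i + n] == suffix:
                    continuation = seq[i + n:i + n + draft_len]
                    if continuation:
                        return continuation  # first non-empty match at the longest suffix
    return []  # nothing retrieved: fall back to ordinary decoding

datastore = [["for", "i", "in", "range", "(", "n", ")", ":"]]
draft = retrieve_draft(["x", "for", "i", "in"], datastore)
# draft == ["range", "(", "n"]
```

In speculative decoding, a draft like this is only a proposal: the language model checks it and keeps the prefix it agrees with, so the retrieval quality affects speed rather than the final output.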

16. Self-supervised discriminative model prediction for visual tracking

Author: Yuan, Di, Geng, Gu, Shu, Xiu, Liu, Qiao, Chang, Xiaojun, He, Zhenyu, and Shi, Guangming
Published: 2024

17. Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models

Author: Gao, Kaiyuan, He, Sunan, He, Zhenyu, Lin, Jiacheng, Pei, QiZhi, Shao, Jie, and Zhang, Wei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Generative pre-trained transformer (GPT) models have revolutionized the field of natural language processing (NLP) with remarkable performance in various tasks and have also extended their power to multimodal domains. Despite their success, large GPT models like GPT-4 face inherent limitations such as considerable size, high computational requirements, complex deployment processes, and closed development loops. These constraints restrict their widespread adoption and raise concerns regarding their responsible development and usage. The need for user-friendly, relatively small, and open-sourced alternative GPT models arises from the desire to overcome these limitations while retaining high performance. In this survey paper, we provide an examination of alternative open-sourced models of large GPTs, focusing on user-friendly and relatively small models that facilitate easier deployment and accessibility. Through this extensive survey, we aim to equip researchers, practitioners, and enthusiasts with a thorough understanding of user-friendly and relatively small open-sourced models of large GPTs, their current state, challenges, and future research directions, inspiring the development of more efficient, accessible, and versatile GPT models that cater to the broader scientific community and advance the field of general artificial intelligence. The source content is continuously updated at https://github.com/GPT-Alternatives/gpt_alternatives.
Published: 2023

18. Channel and Spatial Relation-Propagation Network for RGB-Thermal Semantic Segmentation

Author: Zhou, Zikun, Wu, Shukun, Zhu, Guoqing, Wang, Hongpeng, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: RGB-Thermal (RGB-T) semantic segmentation has shown great potential in handling low-light conditions where RGB-based segmentation is hindered by poor RGB imaging quality. The key to RGB-T semantic segmentation is to effectively leverage the complementary nature of RGB and thermal images. Most existing algorithms fuse RGB and thermal information in feature space via concatenation, element-wise summation, or attention operations in either unidirectional enhancement or bidirectional aggregation manners. However, they usually overlook the modality gap between RGB and thermal images during feature fusion, resulting in modality-specific information from one modality contaminating the other. In this paper, we propose a Channel and Spatial Relation-Propagation Network (CSRPNet) for RGB-T semantic segmentation, which propagates only modality-shared information across different modalities and alleviates the modality-specific information contamination issue. Our CSRPNet first performs relation-propagation in channel and spatial dimensions to capture the modality-shared features from the RGB and thermal features. CSRPNet then aggregates the modality-shared features captured from one modality with the input feature from the other modality to enhance the input feature without the contamination issue. While being fused together, the enhanced RGB and thermal features are also fed into the subsequent RGB or thermal feature extraction layers, respectively, for interactive feature fusion. We also introduce a dual-path cascaded feature refinement module that aggregates multi-layer features to produce two refined features for semantic and boundary prediction. Extensive experimental results demonstrate that CSRPNet performs favorably against state-of-the-art algorithms.
Published: 2023

19. Cross-Modality Proposal-guided Feature Mining for Unregistered RGB-Thermal Pedestrian Detection

Author: Tian, Chao, Zhou, Zikun, Huang, Yuqing, Li, Gaojun, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: RGB-Thermal (RGB-T) pedestrian detection aims to locate the pedestrians in RGB-T image pairs to exploit the complementation between the two modalities for improving detection robustness in extreme conditions. Most existing algorithms assume that the RGB-T image pairs are well registered, while in the real world they are not aligned ideally due to parallax or the different fields of view of the cameras. The pedestrians in misaligned image pairs may be located at different positions in the two images, which results in two challenges: 1) how to achieve inter-modality complementation using spatially misaligned RGB-T pedestrian patches, and 2) how to recognize the unpaired pedestrians at the boundary. To deal with these issues, we propose a new paradigm for unregistered RGB-T pedestrian detection, which predicts two separate pedestrian locations in the RGB and thermal images, respectively. Specifically, we propose a cross-modality proposal-guided feature mining (CPFM) mechanism to extract the two precise fusion features for representing the pedestrian in the two modalities, even if the RGB-T image pair is unaligned. It enables us to effectively exploit the complementation between the two modalities. With the CPFM mechanism, we build a two-stream dense detector; it predicts the two pedestrian locations in the two modalities based on the corresponding fusion feature mined by the CPFM mechanism. In addition, we design a data augmentation method, named Homography, to simulate the discrepancy in scales and views between images. We also investigate two non-maximum suppression (NMS) methods for post-processing. Favorable experimental results demonstrate the effectiveness and robustness of our method in dealing with unregistered pedestrians with different shifts.
Published: 2023

20. CiteTracker: Correlating Image and Text for Visual Tracking

Author: Li, Xin, Huang, Yuqing, He, Zhenyu, Wang, Yaowei, Lu, Huchuan, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing visual tracking methods typically take an image patch as the reference of the target to perform tracking. However, a single image patch cannot provide a complete and precise concept of the target object as images are limited in their ability to abstract and can be ambiguous, which makes it difficult to track targets with drastic variations. In this paper, we propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text. Specifically, we develop a text generation module to convert the target image patch into a descriptive text containing its class and attribute information, providing a comprehensive reference point for the target. In addition, a dynamic description module is designed to adapt to target variations for more effective target representation. We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference. Extensive experiments on five diverse datasets are conducted to evaluate the proposed algorithm, and the favorable performance against the state-of-the-art methods demonstrates the effectiveness of the proposed tracking method.
Comment: accepted by ICCV 2023
Published: 2023

21. Transferable Decoding with Visual Entities for Zero-Shot Image Captioning

Author: Fei, Junjie, Wang, Teng, Zhang, Jinrui, He, Zhenyu, Wang, Chengjie, and Zheng, Feng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we have observed and empirically demonstrated that these methods are susceptible to modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. With entity-aware hard prompts, ViECap is capable of maintaining performance when transferring from in-domain to out-of-domain scenarios. Extensive experiments demonstrate that ViECap sets a new state-of-the-art cross-domain (transferable) captioning and performs competitively in-domain captioning compared to previous VLMs-based zero-shot methods. Our code is available at: https://github.com/FeiElysia/ViECap, Comment: Accepted by ICCV 2023
Published: 2023

22. ZeroPose: CAD-Model-based Zero-Shot Pose Estimation

Author: Chen, Jianqiu, Sun, Mingshan, Bao, Tianpeng, Zhao, Rui, Wu, Liwei, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we present a CAD model-based zero-shot pose estimation pipeline called ZeroPose. Existing pose estimation methods still require expensive training when applied to an unseen object, which greatly hinders their scalability in practical industrial applications. In contrast, the proposed method enables the accurate estimation of pose parameters for previously unseen objects without the need for training. Specifically, we design a two-step pipeline consisting of CAD model-based zero-shot instance segmentation and a zero-shot pose estimator. For the first step, there is a simple but effective way to leverage CAD models and the visual foundation models SAM and Imagebind to segment the unseen object of interest at the instance level. For the second step, based on the intensive geometric information in the CAD model of the rigid object, we propose a lightweight hierarchical geometric structure matching mechanism that achieves zero-shot pose estimation. Extensive experimental results on the seven core datasets of the BOP challenge show that the proposed zero-shot instance segmentation methods achieve comparable performance with supervised MaskRCNN and the zero-shot pose estimation results outperform the SOTA pose estimators with better efficiency.
Published: 2023

23. Development of highly adaptable RT-PCR methods for identifying Delta and BA.1 variants in inactivated COVID-19 vaccines

Author: Wang, Zhanhui, He, Yao, He, Zhenyu, Guo, Yancen, Zhao, Yuxiu, and Zhang, Yuntao
Published: 2024
Full Text: View/download PDF

24. Reliability-Hierarchical Memory Network for Scribble-Supervised Video Object Segmentation

Author: Zhou, Zikun, Mao, Kaige, Pei, Wenjie, Wang, Hongpeng, Wang, Yaowei, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper aims to solve the video object segmentation (VOS) task in a scribble-supervised manner, in which VOS models are not only trained by the sparse scribble annotations but also initialized with the sparse target scribbles for inference. Thus, the annotation burdens for both training and initialization can be substantially lightened. The difficulties of scribble-supervised VOS lie in two aspects. On the one hand, it requires the powerful ability to learn from the sparse scribble annotations during training. On the other hand, it demands strong reasoning capability during inference given only a sparse initial target scribble. In this work, we propose a Reliability-Hierarchical Memory Network (RHMNet) to predict the target mask in a step-wise expanding strategy w.r.t. the memory reliability level. To be specific, RHMNet first only uses the memory in the high-reliability level to locate the region with high reliability belonging to the target, which is highly similar to the initial target scribble. Then it expands the located high-reliability region to the entire target conditioned on the region itself and the memories in all reliability levels. Besides, we propose a scribble-supervised learning mechanism to facilitate the learning of our model to predict dense results. It mines the pixel-level relation within the single frame and the frame-level relation within the sequence to take full advantage of the scribble annotations in sequence training samples. The favorable performance on two popular benchmarks demonstrates that our method is promising., Comment: This project is available at https://github.com/mkg1204/RHMNet-for-SSVOS
Published: 2023

25. Joint Visual Grounding and Tracking with Natural Language Specification

Author: Zhou, Li, Zhou, Zikun, Mao, Kaige, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description. Existing algorithms solve this issue in two steps, visual grounding and tracking, and accordingly deploy the separated grounding model and tracking model to implement these two steps, respectively. Such a separated framework overlooks the link between visual grounding and tracking, which is that the natural language descriptions provide global semantic cues for localizing the target for both two steps. Besides, the separated framework can hardly be trained end-to-end. To handle these issues, we propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we propose a multi-source relation modeling module to effectively build the relation between the visual-language references and the test image. In addition, we design a temporal modeling module to provide a temporal clue with the guidance of the global semantic information for our model, which effectively improves the adaptability to the appearance variations of the target. Extensive experimental results on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our method performs favorably against state-of-the-art algorithms for both tracking and grounding. Code is available at https://github.com/lizhou-cs/JointNLT., Comment: Accepted by CVPR 2023
Published: 2023

26. Audio2Gestures: Generating Diverse Gestures from Audio

Author: Li, Jing, Kang, Di, Pei, Wenjie, Zhe, Xuefei, Zhang, Ying, Bao, Linchao, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. So we propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated to the audio while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including relaxed motion loss, bicycle constraint, and diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, quantitatively and qualitatively. Besides, our formulation is compatible with discrete cosine transformation (DCT) modeling and other popular backbones (i.e. RNN, Transformer). As for motion losses and quantitative motion evaluation, we find structured losses/metrics (e.g. STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g. PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline., Comment: arXiv admin note: substantial text overlap with arXiv:2108.06720
Published: 2023

27. Geo6D: Geometric Constraints Learning for 6D Pose Estimation

Author: Chen, Jianqiu, Sun, Mingshan, Zheng, Ye, Bao, Tianpeng, He, Zhenyu, Li, Donghai, Jin, Guoqiang, Zhao, Rui, Wu, Liwei, and Jiang, Xiaoke
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Numerous 6D pose estimation methods have been proposed that employ end-to-end regression to directly estimate the target pose parameters. Since the visible features of objects are implicitly influenced by their poses, the network can infer the pose by analyzing the differences in features in the visible region. However, due to the unpredictable and unrestricted range of pose variations, the implicitly learned visible feature-pose constraints are insufficiently covered by the training samples, making the network vulnerable to unseen object poses. To tackle these challenges, we propose a novel geometric constraints learning approach called Geo6D for direct regression 6D pose estimation methods. It introduces a pose transformation formula expressed in relative offset representation, which is leveraged as geometric constraints to reconstruct the input and output targets of the network. These reconstructed data enable the network to estimate the pose based on explicit geometric constraints, and the relative offset representation mitigates the issue of the pose distribution gap. Extensive experimental results show that when equipped with Geo6D, the direct 6D methods achieve state-of-the-art performance on multiple datasets and demonstrate significant effectiveness, even with only 10% of the data.
Published: 2022

28. How Image Generation Helps Visible-to-Infrared Person Re-Identification?

Author: Pan, Honghu, Chen, Yongyong, He, Yunqi, Li, Xin, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Compared to visible-to-visible (V2V) person re-identification (ReID), the visible-to-infrared (V2I) person ReID task is more challenging due to the lack of sufficient training samples and the large cross-modality discrepancy. To this end, we propose Flow2Flow, a unified framework that could jointly achieve training sample expansion and cross-modality image generation for V2I person ReID. Specifically, Flow2Flow learns bijective transformations from both the visible image domain and the infrared domain to a shared isotropic Gaussian domain with an invertible visible flow-based generator and an infrared one, respectively. With Flow2Flow, we are able to generate pseudo training samples by the transformation from latent Gaussian noises to visible or infrared images, and generate cross-modality images by transformations from existing-modality images to latent Gaussian noises to missing-modality images. For the purpose of identity alignment and modality alignment of generated images, we develop adversarial training strategies to train Flow2Flow. Specifically, we design an image encoder and a modality discriminator for each modality. The image encoder encourages the generated images to be similar to real images of the same identity via identity adversarial training, and the modality discriminator makes the generated images modal-indistinguishable from real images via modality adversarial training. Experimental results on SYSU-MM01 and RegDB demonstrate that both training sample expansion and cross-modality image generation can significantly improve V2I ReID accuracy., Comment: Submitted to IEEE Transactions on Image Processing
Published: 2022

29. An Interpretable Model With Forgetting Matrix For Deep Knowledge Tracing

Author: Guan, Quanlong, Bian, Kaiquan, Fang, Liangda, Li, Sheng, He, Zhenyu, Zheng, Hua, Wu, Lusheng, and Luo, Weiqi
Subjects: Cognitive Neuroscience, Computer Science, Education, Machine learning, cognitive neuropsychology, Computational neuroscience, Neural Networks
Abstract: Current knowledge tracing models still have problems such as insufficient interpretability of the knowledge forgetting process and ignoring the relationship between different topics with the same knowledge concept. We propose an interpretable knowledge tracing model (KVFKT) based on key-value and forgetting. This approach keeps track of the students' knowledge state matrices in each memory unit and also keeps track of the forgetting matrices to monitor when the students' knowledge concepts were last reviewed. In addition, this model introduces the IRT two-parameter model from the perspective of educational psychology to further enhance the interpretability of the model. Finally, four real-world datasets are used to test the model; experimental results show that KVFKT can better trace students' knowledge state and outperforms existing models in terms of ACC and AUC.
Published: 2023

30. Multi-Granularity Graph Pooling for Video-based Person Re-Identification

Author: Pan, Honghu, Chen, Yongyong, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The video-based person re-identification (ReID) aims to identify the given pedestrian video sequence across multiple non-overlapping cameras. To aggregate the temporal and spatial features of the video samples, the graph neural networks (GNNs) are introduced. However, existing graph-based models, like STGCN, perform the mean/max pooling on node features to obtain the graph representation, which neglects the graph topology and node importance. In this paper, we propose the graph pooling network (GPNet) to learn the multi-granularity graph representation for the video retrieval, where the graph pooling layer is implemented to downsample the graph. We first construct a multi-granular graph, whose node features denote image embedding learned by backbone, and edges are established between the temporal and Euclidean neighborhood nodes. We then implement multiple graph convolutional layers to perform the neighborhood aggregation on the graphs. To downsample the graph, we propose a multi-head full attention graph pooling (MHFAPool) layer, which integrates the advantages of existing node clustering and node selection pooling methods. Specifically, MHFAPool takes the main eigenvector of full attention matrix as the aggregation coefficients to involve the global graph information in each pooled node. Extensive experiments demonstrate that our GPNet achieves the competitive results on four widely-used datasets, i.e., MARS, DukeMTMC-VideoReID, iLIDS-VID and PRID-2011.
Published: 2022

31. Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network

Author: Pan, Honghu, Liu, Qiao, Chen, Yongyong, He, Yunqi, Zheng, Yuan, Zheng, Feng, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing methods for video-based person re-identification (ReID) mainly learn the appearance feature of a given pedestrian via a feature extractor and a feature aggregator. However, the appearance models would fail when different pedestrians have similar appearances. Considering that different pedestrians have different walking postures and body proportions, we propose to learn the discriminative pose feature beyond the appearance feature for video retrieval. Specifically, we implement a two-branch architecture to separately learn the appearance feature and pose feature, and then concatenate them together for inference. To learn the pose feature, we first detect the pedestrian pose in each frame through an off-the-shelf pose detector, and construct a temporal graph using the pose sequence. We then exploit a recurrent graph convolutional network (RGCN) to learn the node embeddings of the temporal pose graph, which devises a global information propagation mechanism to simultaneously achieve the neighborhood aggregation of intra-frame nodes and message passing among inter-frame graphs. Finally, we propose a dual-attention method consisting of node-attention and time-attention to obtain the temporal graph representation from the node embeddings, where the self-attention mechanism is employed to learn the importance of each node and each frame. We verify the proposed method on three video-based ReID datasets, i.e., Mars, DukeMTMC and iLIDS-VID, whose experimental results demonstrate that the learned pose feature can effectively improve the performance of existing appearance models.
Published: 2022

32. Towards Complete-View and High-Level Pose-based Gait Recognition

Author: Pan, Honghu, Chen, Yongyong, Xu, Tingyang, He, Yunqi, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The model-based gait recognition methods usually adopt the pedestrian walking postures to identify human beings. However, existing methods did not explicitly resolve the large intra-class variance of human pose due to camera views changing. In this paper, we propose to generate multi-view pose sequences for each single-view pose sample by learning full-rank transformation matrices via lower-upper generative adversarial network (LUGAN). By the prior of camera imaging, we derive that the spatial coordinates between cross-view poses satisfy a linear transformation of a full-rank matrix, thereby, this paper employs the adversarial training to learn transformation matrices from the source pose and target views to obtain the target pose sequences. To this end, we implement a generator composed of graph convolutional (GCN) layers, fully connected (FC) layers and two-branch convolutional (CNN) layers: GCN layers and FC layers encode the source pose sequence and target view, then CNN branches learn a lower triangular matrix and an upper triangular matrix, respectively, finally they are multiplied to formulate the full-rank transformation matrix. For the purpose of adversarial training, we further devise a condition discriminator that distinguishes whether the pose sequence is true or generated. To enable the high-level correlation learning, we propose a plug-and-play module, named multi-scale hypergraph convolution (HGC), to replace the spatial graph convolutional layer in baseline, which could simultaneously model the joint-level, part-level and body-level correlations. Extensive experiments on two large gait recognition datasets, i.e., CASIA-B and OUMVLP-Pose, demonstrate that our method outperforms the baseline model and existing pose-based methods by a large margin.
Published: 2022

33. SSORN: Self-Supervised Outlier Removal Network for Robust Homography Estimation

Author: Li, Yi, Pei, Wenjie, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The traditional homography estimation pipeline consists of four main steps: feature detection, feature matching, outlier removal and transformation estimation. Recent deep learning models intend to address the homography estimation problem using a single convolutional network. While these models are trained in an end-to-end fashion to simplify the homography estimation problem, they lack the feature matching step and/or the outlier removal step, which are important steps in the traditional homography estimation pipeline. In this paper, we attempt to build a deep learning model that mimics all four steps in the traditional homography estimation pipeline. In particular, the feature matching step is implemented using the cost volume technique. To remove outliers in the cost volume, we treat this outlier removal problem as a denoising problem and propose a novel self-supervised loss to solve the problem. Extensive experiments on synthetic and real datasets demonstrate that the proposed model outperforms existing deep learning models.
Published: 2022

34. Two-Stage Neural Contextual Bandits for Personalised News Recommendation

Author: Zhang, Mengyan, Nguyen-Tang, Thanh, Wu, Fangzhao, He, Zhenyu, Xie, Xing, and Ong, Cheng Soon
Subjects: Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: We consider the problem of personalised news recommendation where each user consumes news in a sequential fashion. Existing personalised news recommendation methods focus on exploiting user interests and ignore exploration in recommendation, which leads to biased feedback loops and hurts recommendation quality in the long term. We build on contextual bandits recommendation strategies which naturally address the exploitation-exploration trade-off. The main challenges are the computational efficiency for exploring the large-scale item space and utilising the deep representations with uncertainty. We propose a two-stage hierarchical topic-news deep contextual bandits framework to efficiently learn user preferences when there are many news items. We use deep learning representations for users and news, and generalise the neural upper confidence bound (UCB) policies to generalised additive UCB and bilinear UCB. Empirical results on a large-scale news recommendation dataset show that our proposed policies are efficient and outperform the baseline bandit policies.
Published: 2022

35. Global Tracking via Ensemble of Local Trackers

Author: Zhou, Zikun, Chen, Jianqiu, Pei, Wenjie, Mao, Kaige, Wang, Hongpeng, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The crux of long-term tracking lies in the difficulty of tracking the target with discontinuous moving caused by out-of-view or occlusion. Existing long-term tracking methods follow two typical strategies. The first strategy employs a local tracker to perform smooth tracking and uses another re-detector to detect the target when the target is lost. While it can exploit the temporal context like historical appearances and locations of the target, a potential limitation of such strategy is that the local tracker tends to misidentify a nearby distractor as the target instead of activating the re-detector when the real target is out of view. The other long-term tracking strategy tracks the target in the entire image globally instead of local tracking based on the previous tracking results. Unfortunately, such global tracking strategy cannot leverage the temporal context effectively. In this work, we combine the advantages of both strategies: tracking the target in a global view while exploiting the temporal context. Specifically, we perform global tracking via ensemble of local trackers spreading the full image. The smooth moving of the target can be handled steadily by one local tracker. When the local tracker accidentally loses the target due to suddenly discontinuous moving, another local tracker close to the target is then activated and can readily take over the tracking to locate the target. While the activated local tracker performs tracking locally by leveraging the temporal context, the ensemble of local trackers renders our model the global view for tracking. Extensive experiments on six datasets demonstrate that our method performs favorably against state-of-the-art algorithms., Comment: 10 pages; 6 figures; accepted to CVPR2022
Published: 2022

36. Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs

Author: Xia, Jingfei, Zhuge, Mingchen, Geng, Tiantian, Fan, Shun, Wei, Yuantai, He, Zhenyu, and Zheng, Feng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Figure skating scoring is challenging because it requires judging the technical moves of the players as well as their coordination with the background music. Most learning-based methods cannot solve it well for two reasons: 1) each move in figure skating changes quickly, hence simply applying traditional frame sampling will lose a lot of valuable information, especially in 3 to 5 minutes long videos; 2) prior methods rarely considered the critical audio-visual relationship in their models. Due to these reasons, we introduce a novel architecture, named Skating-Mixer. It extends the MLP framework into a multimodal fashion and effectively learns long-term representations through our designed memory recurrent unit (MRU). Aside from the model, we collected a high-quality audio-visual FS1000 dataset, which contains over 1000 videos on 8 types of programs with 7 different rating metrics, overtaking other datasets in both quantity and diversity. Experiments show the proposed method achieves SOTAs over all major metrics on the public Fis-V and our FS1000 dataset. In addition, we include an analysis applying our method to the recent competitions in Beijing 2022 Winter Olympic Games, proving our method has strong applicability., Comment: Our code is available at https://github.com/AndyFrancesco29/Audio-Visual-Figure-Skating
Published: 2022

37. Enhanced catalytic efficiency for chiral alcohol pharmaceutical intermediate production using polyacrylic acid (PAA)-based nano-bienzyme conjugates at the organic-buffer biphasic interface

Author: Cheng, Pengpeng, He, Zhenyu, Liu, Bo, Wang, Jinmei, Zhang, Chuyue, Tang, Lan, Du, Lihua, Lu, Yuan, and Ou, Zhimin
Published: 2024
Full Text: View/download PDF

38. Study on the thermal characteristics and heat-insulation ability of gel-stabilized foam used for preventing the spontaneous combustion of coal

Author: Shi, Quanlin, Sun, Yongjiang, He, Zhenyu, Yan, Hang, Nie, Xiaoyang, and Xia, Cuiping
Published: 2024
Full Text: View/download PDF

39. GuidedMix-Net: Semi-supervised Semantic Segmentation by Using Labeled Images as Reference

Author: Tu, Peng, Huang, Yawen, Zheng, Feng, He, Zhenyu, Cao, Liujun, and Shao, Ling
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Semi-supervised learning is a challenging problem which aims to construct a model by learning from limited labeled examples. Numerous methods for this task focus on utilizing the prediction consistency of unlabeled instances alone to regularize networks. However, treating labeled and unlabeled data separately often leads to the discarding of mass prior knowledge learned from the labeled examples. In this paper, we propose a novel method for semi-supervised semantic segmentation named GuidedMix-Net, by leveraging labeled information to guide the learning of unlabeled instances. Specifically, GuidedMix-Net employs three operations: 1) interpolation of similar labeled-unlabeled image pairs; 2) transfer of mutual information; 3) generalization of pseudo masks. It enables segmentation models to learn higher-quality pseudo masks of unlabeled data by transferring the knowledge from labeled samples to unlabeled data. Along with supervised learning for labeled data, the prediction of unlabeled data is jointly learned with the generated pseudo masks from the mixed data. Extensive experiments on PASCAL VOC 2012 and Cityscapes demonstrate the effectiveness of our GuidedMix-Net, which achieves competitive segmentation accuracy and significantly improves the mIoU by +7% compared to previous approaches., Comment: Accepted by AAAI'22. arXiv admin note: substantial text overlap with arXiv:2106.15064
Published: 2021

40. Active Learning for Deep Visual Tracking

Author: Yuan, Di, Chang, Xiaojun, Yang, Yi, Liu, Qiao, Wang, Dehua, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Convolutional neural networks (CNNs) have been successfully applied to the single target tracking task in recent years. Generally, training a deep CNN model requires numerous labeled training samples, and the number and quality of these samples directly affect the representational capability of the trained model. However, this approach is restrictive in practice, because manually labeling such a large number of training samples is time-consuming and prohibitively expensive. In this paper, we propose an active learning method for deep visual tracking, which selects and annotates the unlabeled samples to train the deep CNNs model. Under the guidance of active learning, the tracker based on the trained deep CNNs model can achieve competitive tracking performance while reducing the labeling cost. More specifically, to ensure the diversity of selected samples, we propose an active learning method based on multi-frame collaboration to select those training samples that should be and need to be annotated. Meanwhile, considering the representativeness of these selected samples, we adopt a nearest neighbor discrimination method based on the average nearest neighbor distance to screen isolated samples and low-quality samples. Therefore, the training samples subset selected based on our method requires only a given budget to maintain the diversity and representativeness of the entire sample set. Furthermore, we adopt a Tversky loss to improve the bounding box estimation of our tracker, which can ensure that the tracker achieves more accurate target states. Extensive experimental results confirm that our active learning-based tracker (ALT) achieves competitive tracking accuracy and speed compared with state-of-the-art trackers on the seven most challenging evaluation benchmarks., Comment: 12 pages
Published: 2021

41. Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Author: Li, Jing, Kang, Di, Pei, Wenjie, Zhe, Xuefei, Zhang, Ying, He, Zhenyu, and Bao, Linchao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating random sampling along with other techniques including relaxed motion loss, bicycle constraint, and diversity loss are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.
Published: 2021

42. Saliency-Associated Object Tracking

Author: Zhou, Zikun, Pei, Wenjie, Li, Xin, Wang, Hongpeng, Zheng, Feng, and He, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Most existing trackers based on deep learning perform tracking with a holistic strategy, which aims to learn deep representations of the whole target for localizing the target. It is arduous for such methods to track targets with various appearance variations. To address this limitation, another type of method adopts a part-based tracking strategy which divides the target into equal patches and tracks all these patches in parallel. The target state is inferred by summarizing the tracking results of these patches. A potential limitation of such trackers is that not all patches are equally informative for tracking. Some patches that are not discriminative may have adverse effects. In this paper, we propose to track the salient local parts of the target that are discriminative for tracking. In particular, we propose a fine-grained saliency mining module to capture the local saliencies. Further, we design a saliency-association modeling module to associate the captured saliencies together to learn effective correlation representations between the exemplar and the search image for state estimation. Extensive experiments on five diverse datasets demonstrate that the proposed method performs favorably against state-of-the-art trackers., Comment: Accepted by ICCV 2021
Published: 2021

43. Carbon nanotube incorporated quaternized lignin-based loose nanofiltration membrane toward dye/salt separation

Author: He, Zhenyu, Zhang, Na, Zhao, Li, Li, Zhenghua, Cui, Wanling, Zhu, Yuangang, Wang, Yi, Deng, Huining, Si, Chengrun, Bo, Wen, Zhou, Jie, Xu, Shicai, and Li, Qiang
Published: 2024
Full Text: View/download PDF

44. Extreme heat and firms' robot adoption: Evidence from China

Author: Tang, Yuwei and He, Zhenyu
Published: 2024
Full Text: View/download PDF

45. Sea target detection using the GNSS reflection signals

Author: He, Zhenyu, Chen, Wu, Yang, Yang, and Shen, Mingwei
Published: 2023
Full Text: View/download PDF

46. Self-Supervised Tracking via Target-Aware Data Synthesis

Author: Li, Xin, Pei, Wenjie, Wang, Yaowei, He, Zhenyu, Lu, Huchuan, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While deep-learning based tracking methods have achieved substantial progress, they require large-scale, high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised learning for visual tracking. In this work, we develop the Crop-Transform-Paste operation, which is able to synthesize sufficient training data by simulating various appearance variations during tracking, including appearance variations of objects and background interference. Since the target state is known in all synthesized data, existing deep trackers can be trained in routine ways using the synthesized data without human annotation. The proposed target-aware data-synthesis method adapts existing tracking approaches within a self-supervised learning framework without algorithmic changes. Thus, the proposed self-supervised learning mechanism can be seamlessly integrated into existing tracking frameworks to perform training. Extensive experiments show that our method 1) achieves favorable performance against supervised learning schemes in cases with limited annotations; 2) helps deal with various tracking challenges such as object deformation, occlusion, or background clutter due to its manipulability; 3) performs favorably against state-of-the-art unsupervised tracking methods; 4) boosts the performance of various state-of-the-art supervised learning frameworks, including SiamRPN++, DiMP, and TransT., Comment: 11 pages, 7 figures, Accepted by IEEE Transactions on Neural Networks and Learning Systems
Published: 2021

47. SiamCorners: Siamese Corner Networks for Visual Tracking

Author: Yang, Kai, He, Zhenyu, Pei, Wenjie, Zhou, Zikun, Li, Xin, Yuan, Di, and Zhang, Haijun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The current Siamese network based on region proposal network (RPN) has attracted great attention in visual tracking due to its excellent accuracy and high efficiency. However, the design of the RPN involves the selection of the number, scale, and aspect ratios of anchor boxes, which will affect the applicability and convenience of the model. Furthermore, these anchor boxes require complicated calculations, such as calculating their intersection-over-union (IoU) with ground truth bounding boxes. Due to the problems related to anchor boxes, we propose a simple yet effective anchor-free tracker (named Siamese corner networks, SiamCorners), which is end-to-end trained offline on large-scale image pairs. Specifically, we introduce a modified corner pooling layer to convert the bounding box estimate of the target into a pair of corner predictions (the bottom-right and the top-left corners). By tracking a target as a pair of corners, we avoid the need to design the anchor boxes. This will make the entire tracking algorithm more flexible and simple than anchor-based trackers. In our network design, we further introduce a layer-wise feature aggregation strategy that enables the corner pooling module to predict multiple corners for a tracking target in deep networks. We then introduce a new penalty term that is used to select an optimal tracking box in these candidate corners. Finally, SiamCorners achieves experimental results that are comparable to the state-of-the-art trackers while maintaining a high running speed. In particular, SiamCorners achieves a 53.7% AUC on NFS30 and a 61.4% AUC on UAV123, while still running at 42 frames per second (FPS).
Published: 2021
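
Note: The abstract above refers to the intersection-over-union (IoU) between anchor boxes and ground truth boxes, and to describing a box by its top-left and bottom-right corners. The following is a minimal sketch of the standard IoU computation for two axis-aligned boxes given in that corner form. It is a generic illustration, not code from SiamCorners; the function name `box_iou` and the `(x1, y1, x2, y2)` tuple layout are assumptions made for this example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right
    corner; this tuple layout is an assumption for the example.
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An anchor-based tracker evaluates this quantity for every anchor against the ground truth, which is the per-anchor cost the abstract describes an anchor-free design as avoiding.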

48. Teaching Reform and Practice of University Computer Foundation Course Based on Python

Author: He, Zhenyu, Zhao, Haihu, He, Zhenfeng, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Hong, Wenxing, editor, and Weng, Yang, editor
Published: 2023
Full Text: View/download PDF

49. VEDesc: vertex-edge constraint on local learned descriptors

Author: Yin, Jianhua, Zhu, Longzhen, Bai, Yang, and He, Zhenyu
Published: 2023
Full Text: View/download PDF

50. Learning diverse fine-grained features for thermal infrared tracking

Author: Yang, Chao, Liu, Qiao, Li, Gaojun, Pan, Honghu, and He, Zhenyu
Published: 2024
Full Text: View/download PDF
