Author: "Liu, Wenyu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Liu, Wenyu"' showing total 2,042 results

1. Partial Scene Text Retrieval

Author: Wang, Hao, Liao, Minghui, Xie, Zhouyi, Liu, Wenyu, and Bai, Xiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The task of partial scene text retrieval involves localizing and searching for text instances that are the same or similar to a given query text from an image gallery. However, existing methods can only handle text-line instances, leaving the problem of searching for partial patches within these text-line instances unsolved due to a lack of patch annotations in the training data. To address this issue, we propose a network that can simultaneously retrieve both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, our proposed approach adopts a Multiple Instance Learning (MIL) approach to learn their similarities with query text, without requiring extra annotations. However, constructing bags, which is a standard step of conventional MIL approaches, can introduce numerous noisy samples for training, and lower inference speed. To address this issue, we propose a Ranking MIL (RankMIL) approach to adaptively filter those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that can directly search for the target partial patch from a text-line instance during the inference stage, without requiring bags. This greatly improves the search efficiency and the performance of retrieving partial patches. The source code and dataset are available at https://github.com/lanfeng4659/PSTR., Comment: Accepted on TPAMI
Published: 2024

2. Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Author: Jiang, Bo, Chen, Shaoyu, Liao, Bencheng, Zhang, Xingyu, Yin, Wei, Zhang, Qian, Huang, Chang, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods using LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene understanding. Besides, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on a large-scale dataset DriveX and fine-tuning on nuScenes, Senna significantly reduces average planning error by 27.12% and collision rate by 33.33% over the model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna., Comment: Project Page: https://github.com/hustvl/Senna
Published: 2024

3. FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification

Author: Yao, Jingfeng, Cheng, Wang, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion Transformers (DiT) have attracted significant attention in research. However, they suffer from a slow convergence rate. In this paper, we aim to accelerate DiT training without any architectural modification. We identify the following issues in the training process: firstly, certain training strategies do not consistently perform well across different data. Secondly, the effectiveness of supervision at specific timesteps is limited. In response, we propose the following contributions: (1) We introduce a new perspective for interpreting the failure of the strategies. Specifically, we slightly extend the definition of Signal-to-Noise Ratio (SNR) and suggest observing the Probability Density Function (PDF) of SNR to understand the essence of the data robustness of the strategy. (2) We conduct numerous experiments and report over one hundred experimental results to empirically summarize a unified accelerating strategy from the perspective of PDF. (3) We develop a new supervision method that further accelerates the training process of DiT. Based on them, we propose FasterDiT, an exceedingly simple and practicable design strategy. With few lines of code modifications, it achieves 2.30 FID on ImageNet 256 resolution at 1000k iterations, which is comparable to DiT (2.27 FID) but 7 times faster in training., Comment: NeurIPS 2024 (poster); update to camera-ready version
Published: 2024

4. LCD-Net: A Lightweight Remote Sensing Change Detection Network Combining Feature Fusion and Gating Mechanism

Author: Liu, Wenyu, Li, Jindong, Wang, Haoji, Tan, Run, Fu, Yali, and Tian, Qichuan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Remote sensing image change detection (RSCD) is crucial for monitoring dynamic surface changes, with applications ranging from environmental monitoring to disaster assessment. While traditional CNN-based methods have improved detection accuracy, they often suffer from high computational complexity and large parameter counts, limiting their use in resource-constrained environments. To address these challenges, we propose a Lightweight remote sensing Change Detection Network (LCD-Net in short) that reduces model size and computational cost while maintaining high detection performance. LCD-Net employs MobileNetV2 as the encoder to efficiently extract features from bitemporal images. A Temporal Interaction and Fusion Module (TIF) enhances the interaction between bitemporal features, improving temporal context awareness. Additionally, the Feature Fusion Module (FFM) aggregates multiscale features to better capture subtle changes while suppressing background noise. The Gated Mechanism Module (GMM) in the decoder further enhances feature learning by dynamically adjusting channel weights, emphasizing key change regions. Experiments on LEVIR-CD+, SYSU, and S2Looking datasets show that LCD-Net achieves competitive performance with just 2.56M parameters and 4.45G FLOPs, making it well-suited for real-time applications in resource-limited settings. The code is available at https://github.com/WenyuLiu6/LCD-Net.
Published: 2024

5. ControlAR: Controllable Image Generation with Autoregressive Models

Author: Li, Zongming, Cheng, Tianheng, Chen, Shoufa, Sun, Peize, Shen, Haocheng, Ran, Longjin, Chen, Xiaoxin, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens, it still falls short in generation quality compared to ControlNet and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. Then ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, using conditional decoding not only significantly strengthens the control capability of AR models but also maintains the model's efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at https://github.com/hustvl/ControlAR., Comment: Preprint. Work in progress
Published: 2024

6. Dynamic 2D Gaussians: Geometrically accurate radiance fields for dynamic objects

Author: Zhang, Shuai, Wu, Guanjun, Wang, Xinggang, Feng, Bin, and Liu, Wenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Reconstructing objects and extracting high-quality surfaces play a vital role in the real world. Current 4D representations show the ability to render high-quality novel views for dynamic objects but cannot reconstruct high-quality meshes due to their implicit or geometrically inaccurate representations. In this paper, we propose a novel representation that can reconstruct accurate meshes from sparse image input, named Dynamic 2D Gaussians (D-2DGS). We adopt 2D Gaussians for basic geometry representation and use sparse-controlled points to capture 2D Gaussian's deformation. By extracting the object mask from the rendered high-quality image and masking the rendered depth map, a high-quality dynamic mesh sequence of the object can be extracted. Experiments demonstrate that our D-2DGS is outstanding in reconstructing high-quality meshes from sparse input. More demos and code are available at https://github.com/hustvl/Dynamic-2DGS.
Published: 2024

7. TrackSSM: A General Motion Predictor by State-Space Model

Author: Hu, Bin, Luo, Run, Liu, Zelin, Wang, Cheng, and Liu, Wenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal motion modeling has always been a key component in multiple object tracking (MOT) which can ensure smooth trajectory movement and provide accurate positional information to enhance association precision. However, current motion models struggle to be both efficient and effective across different application scenarios. To this end, we propose TrackSSM inspired by the recently popular state space models (SSM), a unified encoder-decoder motion framework that uses a data-dependent state space model to perform temporal motion of trajectories. Specifically, we propose Flow-SSM, a module that utilizes the position and motion information from historical trajectories to guide the temporal state transition of object bounding boxes. Based on Flow-SSM, we design a flow decoder. It is composed of a cascaded motion decoding module employing Flow-SSM, which can use the encoded flow information to complete the temporal position prediction of trajectories. Additionally, we propose a Step-by-Step Linear (S²L) training strategy. By performing linear interpolation between the positions of the object in the previous frame and the current frame, we construct the pseudo labels of step-by-step linear training, ensuring that the trajectory flow information can better guide the object bounding box in completing temporal transitions. TrackSSM utilizes a simple Mamba-Block to build a motion encoder for historical trajectories, forming a temporal motion model with an encoder-decoder structure in conjunction with the flow decoder. TrackSSM is applicable to various tracking scenarios and achieves excellent tracking performance across multiple benchmarks, further extending the potential of SSM-like temporal motion models in multi-object tracking tasks. Code and models are publicly available at https://github.com/Xavier-Lin/TrackSSM.
Published: 2024
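
The linear-interpolation step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `linear_pseudo_labels`, the `(cx, cy, w, h)` box format, and the `num_steps` parameter are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed box format (hypothetical): center x, center y, width, height.
Box = Tuple[float, float, float, float]


def linear_pseudo_labels(prev_box: Box, curr_box: Box, num_steps: int) -> List[Box]:
    """Interpolate linearly from prev_box to curr_box in num_steps steps.

    Returns one intermediate target per step; the last one equals curr_box.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    labels: List[Box] = []
    for step in range(1, num_steps + 1):
        t = step / num_steps  # fraction of the way from previous to current frame
        labels.append(tuple(p + t * (c - p) for p, c in zip(prev_box, curr_box)))
    return labels


if __name__ == "__main__":
    for box in linear_pseudo_labels((0.0, 0.0, 10.0, 10.0), (4.0, 8.0, 10.0, 10.0), 4):
        print(box)
```

Each returned box could serve as the target for one decoding step, so that a cascaded decoder is supervised along a straight path between the two frame positions. How the targets are actually paired with decoder stages is not stated in the abstract.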

8. PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification

Author: Hu, Bin, Wang, Xinggang, and Liu, Wenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Person Re-Identification (ReID) aims to retrieve relevant individuals in non-overlapping camera images and has a wide range of applications in the field of public safety. In recent years, with the development of Vision Transformer (ViT) and self-supervised learning techniques, the performance of person ReID based on self-supervised pre-training has been greatly improved. Person ReID requires extracting highly discriminative local fine-grained features of the human body, while traditional ViT is good at extracting context-related global features, making it difficult to focus on local human body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID, and effectively extracts high-quality global and local features through large-scale unsupervised pre-training by combining masked image modeling and discriminative contrastive learning, and then conducts supervised fine-tuning training in the person ReID task. This person feature extraction method based on ViT with masked image modeling (PersonViT) has the good characteristics of unsupervised, scalable, and strong generalization capabilities, overcoming the problem of difficult annotation in supervised person ReID, and achieves state-of-the-art results on publicly available benchmark datasets, including MSMT17, Market1501, DukeMTMC-reID, and Occluded-Duke. The code and pre-trained models of the PersonViT method are released at https://github.com/hustvl/PersonViT to promote further research in the person ReID field.
Published: 2024

9. XS-VID: An Extremely Small Video Object Detection Dataset

Author: Guo, Jiahao, Xu, Ziyang, Wu, Lianjun, Gao, Fei, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Small Video Object Detection (SVOD) is a crucial subfield in modern computer vision, essential for early object discovery and detection. However, existing SVOD datasets are scarce and suffer from issues such as insufficiently small objects, limited object categories, and lack of scene diversity, leading to unitary application scenarios for corresponding methods. To address this gap, we develop the XS-VID dataset, which comprises aerial data from various periods and scenes, and annotates eight major object categories. To further evaluate existing methods for detecting extremely small objects, XS-VID extensively collects three types of objects with smaller pixel areas: extremely small (es, 0~12²), relatively small (rs, 12²~20²), and generally small (gs, 20²~32²). XS-VID offers unprecedented breadth and depth in covering and quantifying minuscule objects, significantly enriching the scene and object diversity in the dataset. Extensive validations on XS-VID and the publicly available VisDrone2019VID dataset show that existing methods struggle with small object detection and significantly underperform compared to general object detectors. Leveraging the strengths of previous methods and addressing their weaknesses, we propose YOLOFT, which enhances local feature associations and integrates temporal motion features, significantly improving the accuracy and stability of SVOD. Our datasets and benchmarks are available at https://gjhhust.github.io/XS-VID/.
Published: 2024
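
The three pixel-area ranges quoted in the abstract above can be read as a simple bucketing rule. The sketch below is only an illustration: the function name `size_category`, the treatment of boundary values (each range taken as lower-inclusive, upper-exclusive), and the `"other"` label for larger objects are assumptions, since the abstract does not specify them.

```python
def size_category(width: float, height: float) -> str:
    """Bucket a bounding box by pixel area using the ranges quoted in the abstract.

    Assumed convention (not stated in the source): each range includes its
    lower bound and excludes its upper bound.
    """
    area = width * height
    if area <= 0:
        raise ValueError("width and height must be positive")
    if area < 12 ** 2:
        return "es"  # extremely small
    if area < 20 ** 2:
        return "rs"  # relatively small
    if area < 32 ** 2:
        return "gs"  # generally small
    return "other"   # hypothetical label for objects above the listed ranges


if __name__ == "__main__":
    for w, h in [(10, 10), (12, 12), (25, 25), (40, 40)]:
        print((w, h), size_category(w, h))
```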

10. LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels

Author: Cui, Ziwei, Yao, Jingfeng, Zeng, Lunbin, Yang, Juan, Liu, Wenyu, and Wang, Xinggang
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: The segmentation of cell nuclei in tissue images stained with the blood dye hematoxylin and eosin (H&E) is essential for various clinical applications and analyses. Due to the complex characteristics of cellular morphology, a large receptive field is considered crucial for generating high-quality segmentation. However, previous methods face challenges in achieving a balance between the receptive field and computational burden. To address this issue, we propose LKCell, a high-accuracy and efficient cell segmentation method. Its core insight lies in unleashing the potential of large convolution kernels to achieve computationally efficient large receptive fields. Specifically, (1) We transfer pre-trained large convolution kernel models to the medical domain for the first time, demonstrating their effectiveness in cell segmentation. (2) We analyze the redundancy of previous methods and design a new segmentation decoder based on large convolution kernels. It achieves higher performance while significantly reducing the number of parameters. We evaluate our method on the most challenging benchmark and achieve state-of-the-art results (0.5080 mPQ) in cell nuclei instance segmentation with only 21.6% FLOPs compared with the previous leading method. Our source code and models are available at https://github.com/hustvl/LKCell.
Published: 2024

11. Visual Text Generation in the Wild

Author: Zhu, Yuanzhi, Liu, Jiawei, Gao, Feiyu, Liu, Wenyu, Wang, Xinggang, Wang, Peng, Huang, Fei, Yao, Cong, and Yang, Zhibo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery., Comment: Accepted to ECCV 2024
Published: 2024

12. Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition

Author: Xiong, Haijun, Feng, Bin, Wang, Xinggang, and Liu, Wenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Gait recognition is a biometric technology that distinguishes individuals by their walking patterns. However, previous methods face challenges when accurately extracting identity features because they often become entangled with non-identity clues. To address this challenge, we propose CLTD, a causality-inspired discriminative feature learning module designed to effectively eliminate the influence of confounders in triple domains, i.e., spatial, temporal, and spectral. Specifically, we utilize the Cross Pixel-wise Attention Generator (CPAG) to generate attention distributions for factual and counterfactual features in spatial and temporal domains. Then, we introduce the Fourier Projection Head (FPH) to project spatial features into the spectral space, which preserves essential information while reducing computational costs. Additionally, we employ an optimization method with contrastive learning to enforce semantic consistency constraints across sequences from the same subject. Our approach has demonstrated significant performance improvements on challenging datasets, proving its effectiveness. Moreover, it can be seamlessly integrated into existing gait recognition methods., Comment: Accepted by ECCV 2024
Published: 2024

13. MoSt-DSA: Modeling Motion and Structural Interactions for Direct Multi-Frame Interpolation in DSA Images

Author: Xu, Ziyang, Zhao, Huangxuan, Cui, Ziwei, Liu, Wenyu, Zheng, Chuansheng, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Artificial intelligence has become a crucial tool for medical image analysis. As an advanced cerebral angiography technique, Digital Subtraction Angiography (DSA) poses a challenge where the radiation dose to humans is proportional to the image count. By reducing images and using AI interpolation instead, the radiation can be cut significantly. However, DSA images present more complex motion and structural features than natural scenes, making interpolation more challenging. We propose MoSt-DSA, the first work that uses deep learning for DSA frame interpolation. Unlike natural scene Video Frame Interpolation (VFI) methods that extract unclear or coarse-grained features, we devise a general module that models motion and structural context interactions between frames in an efficient full convolution manner by adjusting optimal context range and transforming contexts into linear functions. Benefiting from this, MoSt-DSA is also the first method that directly achieves any number of interpolations at any time steps with just one forward pass during both training and testing. We conduct extensive comparisons with 7 representative VFI models for interpolating 1 to 3 frames. MoSt-DSA demonstrates robust results across 470 DSA image sequences (each typically 152 images), with average SSIM over 0.93 and average PSNR over 38 (standard deviations of less than 0.030 and 3.6, respectively), comprehensively achieving state-of-the-art performance in accuracy, speed, visual effect, and memory usage. Our code is available at https://github.com/ZyoungXu/MoSt-DSA., Comment: Accepted to ECAI2024
Published: 2024

14. Segment Any 4D Gaussians

Author: Ji, Shengxiang, Wu, Guanjun, Fang, Jiemin, Cen, Jiazhong, Yi, Taoran, Liu, Wenyu, Tian, Qi, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Modeling, understanding, and reconstructing the real world are crucial in XR/VR. Recently, 3D Gaussian Splatting (3D-GS) methods have shown remarkable success in modeling and understanding 3D scenes. Similarly, various 4D representations have demonstrated the ability to capture the dynamics of the 4D world. However, there is a dearth of research focusing on segmentation within 4D representations. In this paper, we propose Segment Any 4D Gaussians (SA4D), one of the first frameworks to segment anything in the 4D digital world based on 4D Gaussians. In SA4D, an efficient temporal identity feature field is introduced to handle Gaussian drifting, with the potential to learn precise identity features from noisy and sparse input. Additionally, a 4D segmentation refinement process is proposed to remove artifacts. Our SA4D achieves precise, high-quality segmentation within seconds in 4D Gaussians and shows the ability to remove, recolor, compose, and render high-quality anything masks. More demos are available at: https://jsxzs.github.io/sa4d/., Comment: 22 pages
Published: 2024

15. Occupancy as Set of Points

Author: Shi, Yiang, Cheng, Tianheng, Zhang, Qian, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: In this paper, we explore a novel point representation for 3D occupancy prediction from multi-view images, which is named Occupancy as Set of Points. Existing camera-based methods tend to exploit dense volume-based representation to predict the occupancy of the whole scene, making it hard to focus on the special areas or areas out of the perception range. In comparison, we present the Points of Interest (PoIs) to represent the scene and propose OSP, a novel framework for point-based 3D occupancy prediction. Owing to the inherent flexibility of the point-based representation, OSP achieves strong performance compared with existing methods and excels in terms of training and inference adaptability. It extends beyond traditional perception boundaries and can be seamlessly integrated with volume-based methods to significantly enhance their effectiveness. Experiments on the Occ3D nuScenes occupancy benchmark show that OSP has strong performance and flexibility. Code and models are available at https://github.com/hustvl/osp., Comment: Accepted by ECCV 2024. Code and models: https://github.com/hustvl/osp
Published: 2024

16. EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Author: Zhang, Yuxuan, Cheng, Tianheng, Hu, Rui, Liu, Lei, Liu, Heng, Ran, Longjin, Chen, Xiaoxin, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models., Comment: Preprint. Update: (1) better performance and (2) versatile segmentation. Code and models are available at: https://github.com/hustvl/EVF-SAM
Published: 2024

17. GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality

Author: Yi, Taoran, Fang, Jiemin, Zhou, Zanwei, Wang, Junjie, Wu, Guanjun, Xie, Lingxi, Zhang, Xiaopeng, Liu, Wenyu, Wang, Xinggang, and Tian, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Recently, 3D Gaussian splatting (3D-GS) has achieved great success in reconstructing and rendering real-world scenes. To transfer the high rendering quality to generation tasks, a series of research works attempt to generate 3D-Gaussian assets from text. However, the generated assets have not achieved the same quality as those in reconstruction tasks. We observe that Gaussians tend to grow without control as the generation process may cause indeterminacy. Aiming at highly enhancing the generation quality, we propose a novel framework named GaussianDreamerPro. The main idea is to bind Gaussians to reasonable geometry, which evolves over the whole generation process. Along different stages of our framework, both the geometry and appearance can be enriched progressively. The final output asset is constructed with 3D Gaussians bound to mesh, which shows significantly enhanced details and quality compared with previous methods. Notably, the generated asset can also be seamlessly integrated into downstream manipulation pipelines, e.g. animation, composition, and simulation etc., greatly promoting its potential in wide applications. Demos are available at https://taoranyi.com/gaussiandreamerpro/., Comment: Project page: https://taoranyi.com/gaussiandreamerpro/
Published: 2024

18. Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

Author: Zhao, Zhengqi, Huang, Xiaohu, Zhou, Hao, Yao, Kun, Ding, Errui, Wang, Jingdong, Wang, Xinggang, Liu, Wenyu, and Feng, Bin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The key to action counting is accurately locating each video's repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from empirical observations indicating that humans typically engage in coarse skimming of entire sequences to grasp the general action pattern initially, followed by a finer, frame-by-frame focus to determine if it aligns with the target action. Specifically, SkimFocusNet incorporates a skim branch and a focus branch. The skim branch scans the global contextual information throughout the sequence to identify potential target action for guidance. Subsequently, the focus branch utilizes the guidance to diligently identify repetitive actions using a long-short adaptive guidance (LSAG) block. Additionally, we have observed that videos in existing datasets often feature only one type of repetitive action, which inadequately represents real-world scenarios. To more accurately describe real-life situations, we establish the Multi-RepCount dataset, which includes videos containing multiple repetitive motions. On Multi-RepCount, our SkimFocusNet can perform specified action counting, that is, to enable counting a particular action type by referencing an exemplary video. This capability substantially exhibits the robustness of our method. Extensive experiments demonstrate that SkimFocusNet achieves state-of-the-art performances with significant improvements. We also conduct a thorough ablation study to evaluate the network components. The source code will be published upon acceptance., Comment: 13 pages, 9 figures
Published: 2024

19. EVA-X: A Foundation Model for General Chest X-ray Analysis with Self-supervised Learning

Author: Yao, Jingfeng, Wang, Xinggang, Song, Yuehao, Zhao, Huangxuan, Ma, Jun, Chen, Yajie, Liu, Wenyu, and Wang, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The diagnosis and treatment of chest diseases play a crucial role in maintaining human health. X-ray examination has become the most common clinical examination means due to its efficiency and cost-effectiveness. Artificial intelligence analysis methods for chest X-ray images are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical dissemination. Here we present EVA-X, an innovative foundational model based on X-ray images with broad applicability to various chest disease detection tasks. EVA-X is the first X-ray image based self-supervised learning method capable of capturing both semantic and geometric information from unlabeled images for universal X-ray image representation. Through extensive experimentation, EVA-X has demonstrated exceptional performance in chest disease analysis and localization, becoming the first model capable of spanning over 20 different chest diseases and achieving leading results in over 11 different detection tasks in the medical field. Additionally, EVA-X significantly reduces the burden of data annotation in the medical AI field, showcasing strong potential in the domain of few-shot learning. The emergence of EVA-X will greatly propel the development and application of foundational medical models, bringing about revolutionary changes in future medical research and clinical practice. Our codes and models are available at: https://github.com/hustvl/EVA-X., Comment: codes available at: https://github.com/hustvl/EVA-X
Published: 2024

20. Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation

Author: Wang, Song, Yu, Jiawei, Li, Wentong, Liu, Wenyu, Liu, Xiaolu, Chen, Junbo, and Zhu, Jianke
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles, which attracts the increasing attention of both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel equally in 3D space during training. As the hard voxels have not been paid enough attention, the performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels, which are easy to learn but require large amounts of computation because existing models handle all voxels uniformly. Furthermore, the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper, we propose the HASSC approach to train the semantic scene completion model with hardness-aware design. The global hardness from the network optimization process is defined for dynamical hard voxel selection. Then, the local hardness with geometric anisotropy is adopted for voxel-wise refinement. Besides, a self-distillation strategy is introduced to make the training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively promote the accuracy of the baseline model without incurring the extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC., Comment: Accepted by CVPR2024
Published: 2024

21. TOGS: Gaussian Splatting with Temporal Opacity Offset for Real-Time 4D DSA Rendering

Author: Zhang, Shuai, Zhao, Huangxuan, Zhou, Zhenghong, Wu, Guanjun, Zheng, Chuansheng, Wang, Xinggang, and Liu, Wenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Four-dimensional Digital Subtraction Angiography (4D DSA) is a medical imaging technique that provides a series of 2D images captured at different stages and angles during the process of contrast agent filling blood vessels. It plays a significant role in the diagnosis of cerebrovascular diseases. Improving the rendering quality and speed under sparse sampling is important for observing the status and location of lesions. The current methods exhibit inadequate rendering quality in sparse views and suffer from slow rendering speed. To overcome these limitations, we propose TOGS, a Gaussian splatting method with opacity offset over time, which can effectively improve the rendering quality and speed of 4D DSA. We introduce an opacity offset table for each Gaussian to model the opacity offsets of the Gaussian, using these opacity-varying Gaussians to model the temporal variations in the radiance of the contrast agent. By interpolating the opacity offset table, the opacity variation of the Gaussian at different time points can be determined. This enables us to render the 2D DSA image at that specific moment. Additionally, we introduced a Smooth loss term in the loss function to mitigate overfitting issues that may arise in the model when dealing with sparse view scenarios. During the training phase, we randomly prune Gaussians, thereby reducing the storage overhead of the model. The experimental results demonstrate that compared to previous methods, this model achieves state-of-the-art render quality under the same number of training views. Additionally, it enables real-time rendering while maintaining low storage overhead. The code is available at https://github.com/hustvl/TOGS.
Published: 2024

22. ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Author: Song, Yuehao, Wang, Xinggang, Yao, Jingfeng, Liu, Wenyu, Zhang, Jinglin, and Xu, Xiangmin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, whereby multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods highly depends on the precision of the preceding modality extraction. Others use a single-modality approach with complex decoders, increasing network computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it creates a novel gaze following framework based mainly on powerful encoders (relative decoder parameters less than 1%). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this presumption, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViT with self-supervised pre-training has an enhanced ability to extract correlation information. Many experiments have been conducted to demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in the area under curve (AUC) score, 5.1% improvement in the average precision (AP)) and very comparable performance against multi-modality methods with 59% fewer parameters., Comment: 15 pages; Accepted by Visual Intelligence
Published: 2024

23. MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning

Author: Zou, Jialv, Liao, Bencheng, Zhang, Qian, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, limiting the scalability, or focus on single-frame or monocular inputs, neglecting the temporal information. We propose MIM4D, a novel pre-training paradigm based on dual masked image modeling (MIM). MIM4D leverages both spatial and temporal relations by training on masked multi-view video inputs. It constructs pseudo-3D features using continuous scene flow and projects them onto a 2D plane for supervision. To address the lack of dense 3D supervision, MIM4D reconstructs pixels by employing 3D volumetric differentiable rendering to learn geometric representations. We demonstrate that MIM4D achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving. It significantly improves existing methods on multiple downstream tasks, including BEV segmentation (8.7% IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP). Our work offers a new choice for learning representation at scale in autonomous driving. Code and models are released at https://github.com/hustvl/MIM4D
Published: 2024

24. MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

Author: Liao, Bencheng, Chen, Shaoyu, Zhang, Yunchi, Jiang, Bo, Zhang, Qian, Liu, Wenyu, Huang, Chang, and Wang, Xinggang
Published: 2024
Full Text: View/download PDF

25. WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition

Author: Zhu, Lianghui, Zhou, Junwei, Liu, Yan, Hao, Xin, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Weakly supervised visual recognition using inexact supervision is a critical yet challenging learning problem. It significantly reduces human labeling costs and traditionally relies on multi-instance learning and pseudo-labeling. This paper introduces WeakSAM and solves the weakly-supervised object detection (WSOD) and segmentation by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations in traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses the SAM's problems of requiring prompts and category unawareness for automatic object detection and segmentation. Our results indicate that WeakSAM significantly surpasses previous state-of-the-art methods in WSOD and WSIS benchmarks with large margins, i.e. average improvements of 7.4% and 8.5%, respectively. The code is available at https://github.com/hustvl/WeakSAM., Comment: Accepted by ACM MM 2024. Code is available at https://github.com/hustvl/WeakSAM
Published: 2024

26. VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Author: Chen, Shaoyu, Jiang, Bo, Gao, Hao, Liao, Bencheng, Xu, Qing, Zhang, Qian, Huang, Chang, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Learning a human-like driving policy from large-scale driving demonstrations is promising, but the uncertainty and non-deterministic nature of planning make it challenging. In this work, to cope with the uncertainty problem, we propose VADv2, an end-to-end driving model based on probabilistic planning. VADv2 takes multi-view image sequences as input in a streaming manner, transforms sensor data into environmental token embeddings, outputs the probabilistic distribution of action, and samples one action to control the vehicle. Only with camera sensors, VADv2 achieves state-of-the-art closed-loop performance on the CARLA Town05 benchmark, significantly outperforming all existing methods. It runs stably in a fully end-to-end manner, even without the rule-based wrapper. Closed-loop demos are presented at https://hgao-cv.github.io/VADv2., Comment: Project Page: https://hgao-cv.github.io/VADv2
Published: 2024

27. YOLO-World: Real-Time Open-Vocabulary Object Detection

Author: Cheng, Tianheng, Song, Lin, Ge, Yixiao, Liu, Wenyu, Wang, Xinggang, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation., Comment: Work still in progress. Code & models are available at: https://github.com/AILab-CVC/YOLO-World
Published: 2024

28. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Author: Zhu, Lianghui, Liao, Bencheng, Zhang, Qian, Wang, Xinlong, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim., Comment: Vision Mamba (Vim) is accepted by ICML 2024. Code is available at https://github.com/hustvl/Vim
Published: 2024

29. Fast High Dynamic Range Radiance Fields for Dynamic Scenes

Author: Wu, Guanjun, Yi, Taoran, Fang, Jiemin, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Neural Radiance Fields (NeRF) and their extensions have shown great success in representing 3D scenes and synthesizing novel-view images. However, most NeRF methods take in low-dynamic-range (LDR) images, which may lose details, especially with nonuniform illumination. Some previous NeRF methods attempt to introduce high-dynamic-range (HDR) techniques but mainly target static scenes. To extend HDR NeRF methods to wider applications, we propose a dynamic HDR NeRF framework, named HDR-HexPlane, which can learn 3D scenes from dynamic 2D images captured with various exposures. A learnable exposure mapping function is constructed to obtain adaptive exposure values for each image. Based on the monotonically increasing prior, a camera response function is designed for stable learning. With the proposed model, high-quality novel-view images at any time point can be rendered with any desired exposure. We further construct a dataset containing multiple dynamic scenes captured with diverse exposures for evaluation. All the datasets and code are available at https://guanjunwu.github.io/HDR-HexPlane/., Comment: 3DV 2024. Project page: https://guanjunwu.github.io/HDR-HexPlane
Published: 2024

30. Lane Graph as Path: Continuity-Preserving Path-Wise Modeling for Online Lane Graph Construction

Author: Liao, Bencheng, Chen, Shaoyu, Jiang, Bo, Cheng, Tianheng, Zhang, Qian, Liu, Wenyu, Huang, Chang, and Wang, Xinggang (volume editors: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül; series editor: Goos, Gerhard; founding editor: Hartmanis, Juris; editorial board members: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti)
Published: 2025
Full Text: View/download PDF

31. Circuit as Set of Points

Author: Zou, Jialv, Wang, Xinggang, Guo, Jiahao, Liu, Wenyu, Zhang, Qian, and Huang, Chang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: As the size of circuit designs continues to grow rapidly, artificial intelligence technologies are being extensively used in Electronic Design Automation (EDA) to assist with circuit design. Placement and routing are the most time-consuming parts of the physical design process, and how to quickly evaluate the placement has become a hot research topic. Prior works either transformed circuit designs into images using hand-crafted methods and then used Convolutional Neural Networks (CNN) to extract features, which are limited by the quality of the hand-crafted methods and could not achieve end-to-end training, or treated the circuit design as a graph structure and used Graph Neural Networks (GNN) to extract features, which require time-consuming preprocessing. In our work, we propose a novel perspective for circuit design by treating circuit components as point clouds and using Transformer-based point cloud perception methods to extract features from the circuit. This approach enables direct feature extraction from raw data without any preprocessing, allows for end-to-end training, and results in high performance. Experimental results show that our method achieves state-of-the-art performance in congestion prediction tasks on both the CircuitNet and ISPD2015 datasets, as well as in design rule check (DRC) violation prediction tasks on the CircuitNet dataset. Our method establishes a bridge between the relatively mature point cloud perception methods and the fast-developing EDA algorithms, enabling us to leverage more collective intelligence to solve this task. To facilitate the research of open EDA design, source codes and pre-trained models are released at https://github.com/hustvl/circuitformer.
Published: 2023

32. Label-efficient Segmentation via Affinity Propagation

Author: Li, Wentong, Yuan, Yuqian, Wang, Song, Liu, Wenyu, Tang, Dongqi, Liu, Jian, Zhu, Jianke, and Zhang, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Weakly-supervised segmentation with label-efficient sparse annotations has attracted increasing research attention to reduce the cost of laborious pixel-wise labeling process, while the pairwise affinity modeling techniques play an essential role in this task. Most of the existing approaches focus on using the local appearance kernel to model the neighboring pairwise potentials. However, such a local operation fails to capture the long-range dependencies and ignores the topology of objects. In this work, we formulate the affinity modeling as an affinity propagation process, and propose local and global pairwise affinity terms to generate accurate soft pseudo labels. An efficient algorithm is also developed to significantly reduce the computational cost. The proposed approach can be conveniently plugged into existing segmentation networks. Experiments on three typical label-efficient segmentation tasks, i.e. box-supervised instance segmentation, point/scribble-supervised semantic segmentation and CLIP-guided semantic segmentation, demonstrate the superior performance of the proposed approach., Comment: NeurIPS2023 Acceptance. Project Page: https://LiWentomng.github.io/apro/. Code: https://github.com/CircleRadon/APro
Published: 2023

33. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

Author: Wu, Guanjun, Yi, Taoran, Fang, Jiemin, Xie, Lingxi, Zhang, Xiaopeng, Wei, Wei, Liu, Wenyu, Tian, Qi, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Representing and rendering dynamic scenes has been an important but challenging task. Especially, to accurately model complex motions, high efficiency is usually hard to guarantee. To achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency, we propose 4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes rather than applying 3D-GS for each individual frame. In 4D-GS, a novel explicit representation containing both 3D Gaussians and 4D neural voxels is proposed. A decomposed neural voxel encoding algorithm inspired by HexPlane is proposed to efficiently build Gaussian features from 4D neural voxels and then a lightweight MLP is applied to predict Gaussian deformations at novel timestamps. Our 4D-GS method achieves real-time rendering under high resolutions, 82 FPS at an 800×800 resolution on an RTX 3090 GPU while maintaining comparable or better quality than previous state-of-the-art methods. More demos and code are available at https://guanjunwu.github.io/4dgs/., Comment: CVPR 2024. Project page: https://guanjunwu.github.io/4dgs/
Published: 2023

34. GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Author: Yi, Taoran, Fang, Jiemin, Wang, Junjie, Wu, Guanjun, Xie, Lingxi, Zhang, Xiaopeng, Liu, Wenyu, Tian, Qi, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: In recent times, the generation of 3D assets from text prompts has shown impressive results. Both 2D and 3D diffusion models can help generate decent 3D objects based on prompts. 3D diffusion models have good 3D consistency, but their quality and generalization are limited as trainable 3D data is expensive and hard to obtain. 2D diffusion models enjoy strong abilities of generalization and fine generation, but 3D consistency is hard to guarantee. This paper attempts to bridge the power from the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation. A fast 3D object generation framework, named as GaussianDreamer, is proposed, where the 3D diffusion model provides priors for initialization and the 2D diffusion model enriches the geometry and appearance. Operations of noisy point growing and color perturbation are introduced to enhance the initialized Gaussians. Our GaussianDreamer can generate a high-quality 3D instance or 3D avatar within 15 minutes on one GPU, much faster than previous methods, while the generated instances can be directly rendered in real time. Demos and code are available at https://taoranyi.com/gaussiandreamer/., Comment: CVPR 2024, Project page: https://taoranyi.com/gaussiandreamer/
Published: 2023

35. TiAVox: Time-aware Attenuation Voxels for Sparse-view 4D DSA Reconstruction

Author: Zhou, Zhenghong, Zhao, Huangxuan, Fang, Jiemin, Xiang, Dongqiao, Chen, Lei, Wu, Lingxia, Wu, Feihong, Liu, Wenyu, Zheng, Chuansheng, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical role in the diagnosis of many medical diseases, such as Arteriovenous Malformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant application value, the reconstruction of 4D DSA demands numerous views to effectively model the intricate vessels and radiocontrast flow, thereby implying a significant radiation dose. To address this high radiation issue, we propose a Time-aware Attenuation Voxel (TiAVox) approach for sparse-view 4D DSA reconstruction, which paves the way for high-quality 4D imaging. Additionally, 2D and 3D DSA imaging results can be generated from the reconstructed 4D DSA images. TiAVox introduces 4D attenuation voxel grids, which reflect attenuation properties from both spatial and temporal dimensions. It is optimized by minimizing discrepancies between the rendered images and sparse 2D DSA images. Without any neural network involved, TiAVox enjoys specific physical interpretability. The parameters of each learnable voxel represent the attenuation coefficients. We validated the TiAVox approach on both clinical and simulated datasets, achieving a 31.23 Peak Signal-to-Noise Ratio (PSNR) for novel view synthesis using only 30 views on the clinically sourced dataset, whereas traditional Feldkamp-Davis-Kress methods required 133 views. Similarly, with merely 10 views from the synthetic dataset, TiAVox yielded a PSNR of 34.32 for novel view synthesis and 41.40 for 3D reconstruction. We also executed ablation studies to corroborate the essential components of TiAVox. The code will be publicly available., Comment: 10 pages, 8 figures
Published: 2023

36. Transgaze: exploring plain vision transformers for gaze estimation

Author: Ye, Lang, Wang, Xinggang, Yao, Jingfeng, and Liu, Wenyu
Published: 2024
Full Text: View/download PDF

37. Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition

Author: Huang, Xiaohu, Wang, Xinggang, Jin, Zhidianqiu, Yang, Bo, He, Botao, Feng, Bin, and Liu, Wenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Graph convolutional networks have been widely applied in skeleton-based gait recognition. A key challenge in this task is to distinguish the individual walking styles of different subjects across various views. Existing state-of-the-art methods employ uniform convolutions to extract features from diverse sequences and ignore the effects of viewpoint changes. To overcome these limitations, we propose a condition-adaptive graph (CAG) convolution network that can dynamically adapt to the specific attributes of each skeleton sequence and the corresponding view angle. In contrast to using fixed weights for all joints and sequences, we introduce a joint-specific filter learning (JSFL) module in the CAG method, which produces sequence-adaptive filters at the joint level. The adaptive filters capture fine-grained patterns that are unique to each joint, enabling the extraction of diverse spatial-temporal information about body parts. Additionally, we design a view-adaptive topology learning (VATL) module that generates adaptive graph topologies. These graph topologies are used to correlate the joints adaptively according to the specific view conditions. Thus, CAG can simultaneously adjust to various walking styles and viewpoints. Experiments on the two most widely used datasets (i.e., CASIA-B and OU-MVLP) show that CAG surpasses all previous skeleton-based methods. Moreover, the recognition performance can be enhanced by simply combining CAG with appearance-based methods, demonstrating the ability of CAG to provide useful complementary information. The source code will be available at https://github.com/OliverHxh/CAG., Comment: Accepted by TIP journal
Published: 2023

38. MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

Author: Liao, Bencheng, Chen, Shaoyu, Zhang, Yunchi, Jiang, Bo, Zhang, Qian, Liu, Wenyu, Huang, Chang, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: High-definition (HD) map provides abundant and precise static environmental information of the driving scene, serving as a fundamental and indispensable component for planning in autonomous driving system. In this paper, we present Map TRansformer, an end-to-end framework for online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, i.e., modeling map element as a point set with a group of equivalent permutations, which accurately describes the shape of map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. To speed up convergence, we further introduce auxiliary one-to-many matching and dense supervision. The proposed method well copes with various map elements with arbitrary shapes. It runs at real-time inference speed and achieves state-of-the-art performance on both nuScenes and Argoverse2 datasets. Abundant qualitative results show stable and robust map construction quality in complex and various driving scenes. Code and more demos are available at https://github.com/hustvl/MapTR for facilitating further studies and applications., Comment: Accepted to IJCV 2024. Code available at https://github.com/hustvl/MapTR . arXiv admin note: substantial text overlap with arXiv:2208.14437
Published: 2023

39. Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

Author: Jiang, Haoyi, Cheng, Tianheng, Gao, Naiyu, Zhang, Haoyang, Lin, Tianwei, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: 3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal undertaking in autonomous driving, aiming to predict voxel occupancy within volumetric scenes. However, prevailing methodologies primarily focus on voxel-wise feature aggregation, while neglecting instance semantics and scene context. In this paper, we present a novel paradigm termed Symphonies (Scene-from-Insts), that delves into the integration of instance queries to orchestrate 2D-to-3D reconstruction and 3D scene modeling. Leveraging our proposed Serial Instance-Propagated Attentions, Symphonies dynamically encodes instance-centric semantics, facilitating intricate interactions between image-based and volumetric domains. Simultaneously, Symphonies enables holistic scene comprehension by capturing context through the efficient fusion of instance queries, alleviating geometric ambiguity such as occlusion and perspective errors through contextual scene reasoning. Experimental results demonstrate that Symphonies achieves state-of-the-art performance on challenging benchmarks SemanticKITTI and SSCBench-KITTI-360, yielding remarkable mIoU scores of 15.04 and 18.58, respectively. These results showcase the paradigm's promising advancements. The code is available at https://github.com/hustvl/Symphonies., Comment: Technical report. Code and models at: https://github.com/hustvl/Symphonies
Published: 2023

40. SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth

Author: Liu, Zelin, Wang, Xinggang, Wang, Cheng, Liu, Wenyu, and Bai, Xiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Exploring robust and efficient association methods has always been an important issue in multiple-object tracking (MOT). Although existing tracking methods have achieved impressive performance, congestion and frequent occlusions still pose challenging problems in multi-object tracking. We reveal that performing sparse decomposition on dense scenes is a crucial step to enhance the performance of associating occluded targets. To this end, we propose a pseudo-depth estimation method for obtaining the relative depth of targets from 2D images. Secondly, we design a depth cascading matching (DCM) algorithm, which can use the obtained depth information to convert a dense target set into multiple sparse target subsets and perform data association on these sparse target subsets in order from near to far. By integrating the pseudo-depth method and the DCM strategy into the data association process, we propose a new tracker, called SparseTrack. SparseTrack provides a new perspective for solving the challenging crowded scene MOT problem. Only using IoU matching, SparseTrack achieves comparable performance with the state-of-the-art (SOTA) methods on the MOT17 and MOT20 benchmarks. Code and models are publicly available at https://github.com/hustvl/SparseTrack., Comment: 12 pages, 8 figures
Published: 2023

41. Matte Anything: Interactive Natural Image Matting with Segment Anything Models

Author: Yao, Jingfeng, Wang, Xinggang, Ye, Lang, and Liu, Wenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Natural image matting algorithms aim to predict the transparency map (alpha-matte) with the trimap guidance. However, the production of trimap often requires significant labor, which limits the widespread application of matting algorithms on a large scale. To address the issue, we propose Matte Anything (MatAny), an interactive natural image matting model that could produce high-quality alpha-matte with various simple hints. The key insight of MatAny is to generate pseudo trimap automatically with contour and transparency prediction. In our work, we leverage vision foundation models to enhance the performance of natural image matting. Specifically, we use the segment anything model to predict high-quality contour with user interaction and an open-vocabulary detector to predict the transparency of any object. Subsequently, a pre-trained image matting model generates alpha mattes with pseudo trimaps. MatAny is the interactive matting algorithm with the most supported interaction methods and the best performance to date. It consists of orthogonal vision models without any additional training. We evaluate the performance of MatAny against several current image matting algorithms. MatAny has 58.3% improvement on MSE and 40.6% improvement on SAD compared to the previous image matting methods with simple guidance, achieving new state-of-the-art (SOTA) performance. The source codes and pre-trained models are available at https://github.com/hustvl/Matte-Anything., Comment: 21 pages, codes: https://github.com/hustvl/Matte-Anything
Published: 2023

42. Ultra-permeable silk-based polymeric membranes for vacuum-driven nanofiltration

Author: Gan, Bowen, Peng, Lu Elfa, Liu, Wenyu, Zhang, Lingyue, Wang, Li Ares, Long, Li, Guo, Hao, Song, Xiaoxiao, Yang, Zhe, and Tang, Chuyang Y.
Published: 2024
Full Text: View/download PDF

43. Structural brain characteristics of epilepsy patients with comorbid migraine without aura

Author: Zhang, Shujiang, Liu, Wenyu, Li, Jinmei, and Zhou, Dong
Published: 2024
Full Text: View/download PDF

44. Stereotactic body radiation therapy for the primary tumor and oligometastases versus the primary tumor alone in patients with metastatic pancreatic cancer

Author: Jiang, Lingong, Ye, Yusheng, Feng, Zhiru, Liu, Wenyu, Cao, Yangsen, Zhao, Xianzhi, Zhu, Xiaofei, and Zhang, Huojun
Published: 2024
Full Text: View/download PDF

45. RNA demethylase FTO participates in malignant progression of gastric cancer by regulating SP1-AURKB-ATM pathway

Author: Zeng, Xueliang, Lu, Yao, Zeng, Taohui, Liu, Wenyu, Huang, Weicai, Yu, Tingting, Tang, Xuerui, Huang, Panpan, Li, Bei, and Wei, Hulai
Published: 2024
Full Text: View/download PDF

46. Thermally stable Ni foam-supported inverse CeAlOx/Ni ensemble as an active structured catalyst for CO2 hydrogenation to methane

Author: Tang, Xin, Song, Chuqiao, Li, Haibo, Liu, Wenyu, Hu, Xinyu, Chen, Qiaoli, Lu, Hanfeng, Yao, Siyu, Li, Xiao-nian, and Lin, Lili
Published: 2024
Full Text: View/download PDF

47. A sequence-aware merger of genomic structural variations at population scale

Author: Zheng, Zeyu, Zhu, Mingjia, Zhang, Jin, Liu, Xinfeng, Hou, Liqiang, Liu, Wenyu, Yuan, Shuai, Luo, Changhong, Yao, Xinhao, Liu, Jianquan, and Yang, Yongzhi
Published: 2024
Full Text: View/download PDF

48. Fully neuroendoscopic resection of cerebellopontine angle tumors through a retrosigmoid approach: a retrospective single-center study

Author: Zhang, Hengrui, Wang, Jiwei, Liu, Junzhi, Cao, Zexin, Liu, Xuchen, Jin, Haoyong, Liu, Wenyu, Xue, Zhiwei, Yang, Ning, Li, Chao, and Wang, Xinyu
Published: 2024
Full Text: View/download PDF

49. USF2 activates RhoB/ROCK pathway by transcriptional inhibition of miR-206 to promote pyroptosis in septic cardiomyocytes

Author: Dong, Wei, Liao, Ruichun, Weng, Junfei, Du, Xingxiang, Chen, Jin, Fang, Xu, Liu, Wenyu, Long, Tao, You, Jiaxiang, Wang, Wensheng, and Peng, Xiaoping
Published: 2024
Full Text: View/download PDF

50. miR-206 alleviates LPS-induced inflammatory injury in cardiomyocytes via directly targeting USP33 to inhibit the JAK2/STAT3 signaling pathway

Author: Dong, Wei, Chen, Jin, Wang, Yadong, Weng, Junfei, Du, Xingxiang, Fang, Xu, Liu, Wenyu, Long, Tao, You, Jiaxiang, Wang, Wensheng, and Peng, Xiaoping
Published: 2024
Full Text: View/download PDF
