Author: "Lu, Tong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Lu, Tong"' showing total 3,987 results

1. CorrAdaptor: Adaptive Local Context Learning for Correspondence Pruning

Author: Zhu, Wei, Liu, Yicheng, He, Yuping, Liao, Tangfei, Zheng, Kang, Xu, Xiaoqiu, Wang, Tao, and Lu, Tong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In the fields of computer vision and robotics, accurate pixel-level correspondences are essential for enabling advanced tasks such as structure-from-motion and simultaneous localization and mapping. Recent correspondence pruning methods usually focus on learning local consistency through k-nearest neighbors, which makes it difficult to capture robust context for each correspondence. We propose CorrAdaptor, a novel architecture that introduces a dual-branch structure capable of adaptively adjusting local contexts through both explicit and implicit local graph learning. Specifically, the explicit branch uses KNN-based graphs tailored for initial neighborhood identification, while the implicit branch leverages a learnable matrix to softly assign neighbors and adaptively expand the local context scope, significantly enhancing the model's robustness and adaptability to complex image variations. Moreover, we design a motion injection module to integrate motion consistency into the network to suppress the impact of outliers and refine local context learning, resulting in substantial performance improvements. The experimental results on extensive correspondence-based tasks indicate that our CorrAdaptor achieves state-of-the-art performance both qualitatively and quantitatively. The code and pre-trained models are available at https://github.com/TaoWangzj/CorrAdaptor.
Comment: 8 pages, 4 figures, accepted by ECAI
Published: 2024

2. EAR: Edge-Aware Reconstruction of 3-D vertebrae structures from bi-planar X-ray images

Author: Tan, Lixing, Song, Shuang, He, Yaofeng, Zhou, Kangneng, Lu, Tong, and Xiao, Ruoxiu
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: X-ray images ease the diagnosis and treatment process due to their rapid imaging speed and high resolution. However, due to the projection process of X-ray imaging, much spatial information has been lost. To accurately provide efficient spinal morphological and structural information, reconstructing the 3-D structures of the spine from the 2-D X-ray images is essential. It is challenging for current reconstruction methods to preserve the edge information and local shapes of the asymmetrical vertebrae structures. In this study, we propose a new Edge-Aware Reconstruction network (EAR) to focus on the performance improvement of the edge information and vertebrae shapes. In our network, by using the auto-encoder architecture as the backbone, the edge attention module and frequency enhancement module are proposed to strengthen the perception of the edge reconstruction. Meanwhile, we also combine four loss terms, including reconstruction loss, edge loss, frequency loss and projection loss. The proposed method is evaluated using three publicly accessible datasets and compared with four state-of-the-art models. The proposed method is superior to other methods and achieves 25.32%, 15.32%, 86.44%, 80.13%, 23.7612 and 0.3014 with regard to MSE, MAE, Dice, SSIM, PSNR and frequency distance. Due to the end-to-end and accurate reconstruction process, EAR can provide sufficient 3-D spatial information and precise preoperative surgical planning guidance.
Comment: 13 pages, 11 figures, 3 tables
Published: 2024

3. MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Author: Liu, Yangzhou, Cao, Yue, Gao, Zhangwei, Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Tian, Hao, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-language supervised fine-tuning is effective in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: despite existing VLLMs exhibiting strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may impact the model's ability to generate diversified outputs that are closer to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation experiments, we demonstrate that MMInstruct could significantly improve the performance of VLLMs, e.g., the model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data shall be available at https://github.com/yuecao0119/MMInstruct.
Comment: 18 pages, 8 figures, technical report
Published: 2024

4. EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

Author: Pei, Baoqi, Chen, Guo, Xu, Jilan, He, Yuping, Liu, Yicheng, Pan, Kanghua, Huang, Yifei, Wang, Yali, Lu, Tong, Wang, Limin, and Qiao, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle various tasks including Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In addition, we also participate in the EPIC-Kitchens challenge, where we engage in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.
Comment: Champion solutions in the EgoVis CVPR 2024 workshop
Published: 2024

5. OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Author: Li, Qingyun, Chen, Zhe, Wang, Weiyun, Wang, Wenhai, Ye, Shenglong, Jin, Zhenjiang, Chen, Guanzhou, He, Yinan, Gao, Zhangwei, Cui, Erfei, Yu, Jiashuo, Tian, Hao, Zhou, Jiasheng, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Li, Zhenxiang, Chu, Pei, Wang, Yi, Dou, Min, Tian, Changyao, Zhu, Xizhou, Lu, Lewei, Chen, Yushi, He, Junjun, Tu, Zhongying, Lu, Tong, Wang, Yali, Wang, Limin, Lin, Dahua, Qiao, Yu, Shi, Botian, He, Conghui, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to a pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
Published: 2024

6. VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Author: Wu, Jiannan, Zhong, Muyan, Xing, Sen, Lai, Zeqiang, Liu, Zhaoyang, Wang, Wenhai, Chen, Zhe, Zhu, Xizhou, Lu, Lewei, Lu, Tong, Luo, Ping, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and combed training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.
Comment: 43 pages
Published: 2024

7. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Author: Chen, Zhe, Wang, Weiyun, Tian, Hao, Ye, Shenglong, Gao, Zhangwei, Cui, Erfei, Tong, Wenwen, Hu, Kongzhi, Luo, Jiapeng, Ma, Zheng, Ma, Ji, Wang, Jiaqi, Dong, Xiaoyi, Yan, Hang, Guo, Hewei, He, Conghui, Shi, Botian, Jin, Zhenjiang, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Dou, Min, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Lin, Dahua, Qiao, Yu, Dai, Jifeng, and Wang, Wenhai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities so that it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
Comment: Technical report
Published: 2024

8. Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Author: Chen, Guo, Huang, Yifei, Xu, Jilan, Pei, Baoqi, Chen, Zhe, Li, Zhiqi, Wang, Jiahao, Li, Kunchang, Lu, Tong, and Wang, Limin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: https://github.com/OpenGVLab/video-mamba-suite.
Comment: Technical Report
Published: 2024

9. Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Author: Duan, Yuchen, Wang, Weiyun, Chen, Zhe, Zhu, Xizhou, Lu, Lewei, Lu, Tong, Qiao, Yu, Li, Hongsheng, Dai, Jifeng, and Wang, Wenhai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage lies in its reduced spatial aggregation complexity, which renders it exceptionally adept at processing high-resolution images seamlessly, eliminating the necessity for windowing operations. Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage when processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models, maintaining comparable speeds. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks. Code is released at https://github.com/OpenGVLab/Vision-RWKV.
Published: 2024

10. PromptRR: Diffusion Models as Prompt Generators for Single Image Reflection Removal

Author: Wang, Tao, Lu, Wanglong, Zhang, Kaihao, Luo, Wenhan, Kim, Tae-Kyun, Lu, Tong, Li, Hongdong, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing single image reflection removal (SIRR) methods using deep learning tend to miss key low-frequency (LF) and high-frequency (HF) differences in images, affecting their effectiveness in removing reflections. To address this problem, this paper proposes a novel prompt-guided reflection removal (PromptRR) framework that uses frequency information as new visual prompts for better reflection performance. Specifically, the proposed framework decouples the reflection removal process into the prompt generation and subsequent prompt-guided restoration. For the prompt generation, we first propose a prompt pre-training strategy to train a frequency prompt encoder that encodes the ground-truth image into LF and HF prompts. Then, we adopt diffusion models (DMs) as prompt generators to generate the LF and HF prompts estimated by the pre-trained frequency prompt encoder. For the prompt-guided restoration, we integrate specially generated prompts into the PromptFormer network, employing a novel Transformer-based prompt block to effectively steer the model toward enhanced reflection removal. The results on commonly used benchmarks show that our method outperforms state-of-the-art approaches. The codes and models are available at https://github.com/TaoWangzj/PromptRR.
Comment: 10 pages, 10 figures
Published: 2024

11. MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

Author: Tian, Changyao, Zhu, Xizhou, Xiong, Yuwen, Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Chen, Yuntao, Lu, Lewei, Lu, Tong, Zhou, Jie, Li, Hongsheng, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at https://github.com/OpenGVLab/MM-Interleaved.
Comment: 20 pages, 9 figures, 17 tables
Published: 2024

12. Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Author: Xiong, Yuwen, Li, Zhiqi, Chen, Yuntao, Wang, Feng, Zhu, Xizhou, Luo, Jiapeng, Wang, Wenhai, Lu, Tong, Li, Hongsheng, Qiao, Yu, Lu, Lewei, Zhou, Jie, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase in processing speed, with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation. When integrated into generative models like U-Net in the latent diffusion model, DCNv4 outperforms its baseline, underscoring its possibility to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage results in up to 80% speed increase and further performance improvement without further modifications. The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.
Comment: Tech report; Code: https://github.com/OpenGVLab/DCNv4
Published: 2024

13. CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Author: Rong, Yi, Zhou, Haoran, Yuan, Lixin, Mei, Cheng, Wang, Jiahao, and Lu, Tong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc. The family of coarse-to-fine generation architectures has recently exhibited great success in point cloud completion and gradually became mainstream. In this work, we unveil one of the key ingredients behind these methods: meticulously devised feature extraction operations with explicit cross-resolution aggregation. We present Cross-Resolution Transformer that efficiently performs cross-resolution aggregation with local attention mechanisms. With the help of our recursive designs, the proposed operation can capture more scales of features than common aggregation operations, which is beneficial for capturing fine geometric characteristics. While prior methodologies have ventured into various manifestations of inter-level cross-resolution aggregation, the effectiveness of intra-level one and their combination has not been analyzed. With unified designs, Cross-Resolution Transformer can perform intra- or inter-level cross-resolution aggregation by switching inputs. We integrate two forms of Cross-Resolution Transformers into one up-sampling block for point generation, and following the coarse-to-fine manner, we construct CRA-PCN to incrementally predict complete shapes with stacked up-sampling blocks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin on several widely used benchmarks. Codes are available at https://github.com/EasyRy/CRA-PCN.
Comment: Accepted to AAAI 2024
Published: 2024

14. The Sb–N charge transfer bridge over Cs3Sb2Br9/Sb–C3N4 Z-scheme heterojunction for boosting photocatalytic CO2 reduction

Author: Wang, Hao-Kun, Zhang, Meng-Ran, Su, Ke, Liu, Zhao-Lei, Mu, Yan-Fei, Bai, Fu-Quan, Zhang, Min, and Lu, Tong-Bu
Published: 2024

15. LC3-associated phagocytosis of neutrophils triggers tumor ferroptotic cell death in glioblastoma

Author: Lu, Tong, Yee, Patricia P, Chih, Stephen Y, Tang, Miaolu, Chen, Han, Aregawi, Dawit G, Glantz, Michael J, Zacharia, Brad E, Wang, Hong-Gang, and Li, Wei
Published: 2024

16. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Author: Chen, Zhe, Wu, Jiannan, Wang, Wenhai, Su, Weijie, Chen, Guo, Xing, Sen, Zhong, Muyan, Zhang, Qinglong, Zhu, Xizhou, Lu, Lewei, Li, Bin, Luo, Ping, Lu, Tong, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
Comment: 25 pages, 5 figures, 28 tables
Published: 2023

17. Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Author: Li, Zhiqi, Yu, Zhiding, Lan, Shiyi, Li, Jiahan, Kautz, Jan, Lu, Tong, and Alvarez, Jose M.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset, we also note that current metrics do not comprehensively assess the planning quality, leading to potentially biased conclusions drawn from existing benchmarks. To address this issue, we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations on the benchmark and metrics, we suggest the community reassess relevant prevailing research and be cautious about whether the continued pursuit of state-of-the-art would yield convincing and universal conclusions. Code and models are available at https://github.com/NVlabs/BEV-Planner.
Comment: Accepted to CVPR 2024
Published: 2023

18. Evaluating the effects of high-throughput structural neuroimaging predictors on whole-brain functional connectome outcomes via network-based vector-on-matrix regression

Author: Lu, Tong, Zhang, Yuan, Lyzinski, Vince, Bi, Chuan, Kochunov, Peter, Hong, Elliot, and Chen, Shuo
Subjects: Statistics - Methodology, Quantitative Biology - Neurons and Cognition, Quantitative Biology - Quantitative Methods, Statistics - Computation
Abstract: The joint analysis of multimodal neuroimaging data is critical in the field of brain research because it reveals complex interactive relationships between neurobiological structures and functions. In this study, we focus on investigating the effects of structural imaging (SI) features, including white matter micro-structure integrity (WMMI) and cortical thickness, on the whole brain functional connectome (FC) network. To achieve this goal, we propose a network-based vector-on-matrix regression model to characterize the FC-SI association patterns. We have developed a novel multi-level dense bipartite and clique subgraph extraction method to identify which subsets of spatially specific SI features intensively influence organized FC sub-networks. The proposed method can simultaneously identify highly correlated structural-connectomic association patterns and suppress false positive findings while handling millions of potential interactions. We apply our method to a multimodal neuroimaging dataset of 4,242 participants from the UK Biobank to evaluate the effects of whole-brain WMMI and cortical thickness on the resting-state FC. The results reveal that the WMMI on corticospinal tracts and inferior cerebellar peduncle significantly affect functional connections of sensorimotor, salience, and executive sub-networks with an average correlation of 0.81 (p<0.001).
Comment: 20 pages, 5 figures, 2 tables
Published: 2023

19. Multiple Imputation Method for High-Dimensional Neuroimaging Data

Author: Lu, Tong, Chen, Chixiang, Huang, Hsin-Hsiung, Kochunov, Peter, Hong, Elliot, and Chen, Shuo
Subjects: Statistics - Methodology, Statistics - Applications, Statistics - Computation
Abstract: Missingness is a common issue for neuroimaging data, and neglecting it in downstream statistical analysis can introduce bias and lead to misguided inferential conclusions. It is therefore crucial to apply appropriate statistical methods to address this issue. While multiple imputation is a popular technique for handling missing data, its application to neuroimaging data is hindered by high dimensionality and complex dependence structures of multivariate neuroimaging variables. To tackle this challenge, we propose a novel approach, named High Dimensional Multiple Imputation (HIMA), based on Bayesian models. HIMA develops a new computational strategy for sampling large covariance matrices based on a robustly estimated posterior mode, which drastically enhances computational efficiency and numerical stability. To assess the effectiveness of HIMA, we conducted extensive simulation studies and real-data analysis using neuroimaging data from a Schizophrenia study. HIMA showcases a computational efficiency improvement of over 2000 times when compared to traditional approaches, while also producing imputed datasets with improved precision and stability.
Comment: 13 pages, 5 figures
Published: 2023

20. Cobalt-based heterogeneous catalysts for photocatalytic carbon dioxide reduction

Author: Yuan, Hong, Mei, Jian-Hua, Gong, Yun-Nan, Zhong, Di-Chang, and Lu, Tong-Bu
Published: 2024

21. GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Author: Wang, Tao, Zhang, Kaihao, Shao, Ziqian, Luo, Wenhan, Stenger, Bjorn, Lu, Tong, Kim, Tae-Kyun, Liu, Wei, and Li, Hongdong
Published: 2024

22. A Retrospective Study of Transaxillary Endoscopic Breast Augmentation Using Ultrasonic Scalpel or Conventional Electrocautery for Implant Pocket Dissection

Author: Xie, Zhiyang, Yan, Kaili, Qu, Yuming, Gao, Sheng, Lu, Tong, Hu, Chao, Wang, Shu, Shangguan, Wensong, and Wu, Guoping
Published: 2024

23. Photo-induced synthesis of heteronuclear dual-atom catalysts

Author: Zhao, Qiu-Ping, Shi, Wen-Xiong, Zhang, Jiangwei, Tian, Zhi-Yuan, Zhang, Zhi-Ming, Zhang, Peng, Wang, Ye, Qiao, Shi-Zhang, and Lu, Tong-Bu
Published: 2024

24. Surface iodine and pyrenyl-graphdiyne co-modified Bi catalysts for highly efficient CO2 electroreduction in acidic electrolyte

Author: Zhang, Min, Wang, Juan, Rong, Xin, Lu, Xiu-Li, and Lu, Tong-Bu
Published: 2024

25. Similarity search on social networks with incremental graph indexing based on probabilistic inference

Author: Qi, Zhiwei, Lu, Tong, Yue, Kun, and Duan, Liang
Published: 2024
Full Text: View/download PDF

26. Deep Video Restoration for Under-Display Camera

Author: Chen, Xuanxi, Wang, Tao, Shao, Ziqian, Zhang, Kaihao, Luo, Wenhan, Lu, Tong, Liu, Zikun, Kim, Tae-Kyun, and Li, Hongdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Images or videos captured by the Under-Display Camera (UDC) suffer from severe degradation, such as saturation degeneration and color shift. While restoration for UDC has been a critical task, existing works of UDC restoration focus only on images. UDC video restoration (UDC-VR) has not been explored in the community. In this work, we first propose a GAN-based generation pipeline to simulate the realistic UDC degradation process. With the pipeline, we build the first large-scale UDC video restoration dataset called PexelsUDC, which includes two subsets named PexelsUDC-T and PexelsUDC-P corresponding to different displays for UDC. Using the proposed dataset, we conduct extensive benchmark studies on existing video restoration methods and observe their limitations on the UDC-VR task. To this end, we propose a novel transformer-based baseline method that adaptively enhances degraded videos. The key components of the method are a spatial branch with local-aware transformers, a temporal branch with embedded temporal transformers, and a spatial-temporal fusion module. These components drive the model to fully exploit spatial and temporal information for UDC-VR. Extensive experiments show that our method achieves state-of-the-art performance on PexelsUDC. The benchmark and the baseline method, which will be made public, are expected to promote the progress of UDC-VR in the community.
Published: 2023

27. Memory-and-Anticipation Transformer for Online Action Understanding

Author: Wang, Jiahao, Chen, Guo, Huang, Yifei, Wang, Limin, and Lu, Tong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Most existing forecasting systems are memory-based methods, which attempt to mimic human forecasting ability by employing various memory mechanisms and have progressed in temporal modeling for memory dependency. Nevertheless, an obvious weakness of this paradigm is that it can only model limited historical dependence and cannot transcend the past. In this paper, we rethink the temporal dependence of event evolution and propose a novel memory-anticipation-based paradigm to model an entire temporal structure, including the past, present, and future. Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks. In addition, owing to the inherent superiority of MAT, it can process online action detection and anticipation tasks in a unified manner. The proposed MAT model is tested on four challenging benchmarks (TVSeries, THUMOS'14, HDD, and EPIC-Kitchens-100) for online action detection and anticipation tasks, and it significantly outperforms all existing methods. Code is available at https://github.com/Echo0125/Memory-and-Anticipation-Transformer., Comment: ICCV 2023 Camera Ready
Published: 2023

28. FB-BEV: BEV Representation from Forward-Backward View Transformations

Author: Li, Zhiqi, Yu, Zhiding, Wang, Wenhai, Anandkumar, Anima, Lu, Tong, and Alvarez, Jose M.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: View Transformation Module (VTM), where transformations happen between multi-view image features and Bird-Eye-View (BEV) representation, is a crucial step in camera-based BEV perception systems. Currently, the two most prominent VTM paradigms are forward projection and backward projection. Forward projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV features without post-processing. Backward projection, with BEVFormer being an example, tends to generate false-positive BEV features from incorrect projections due to the lack of utilization on depth. To address the above limitations, we propose a novel forward-backward view transformation module. Our approach compensates for the deficiencies in both existing methods, allowing them to enhance each other to obtain higher quality BEV representations mutually. We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV., Comment: Accept to ICCV 2023, camera-ready version
Published: 2023

29. The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Author: Wang, Weiyun, Shi, Min, Li, Qingyun, Wang, Wenhai, Huang, Zhenhang, Xing, Linjie, Chen, Zhe, Li, Hao, Zhu, Xizhou, Cao, Zhiguo, Chen, Yushi, Lu, Tong, Dai, Jifeng, and Qiao, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing., Comment: Technical Report
Published: 2023

30. AVSegFormer: Audio-Visual Segmentation with Transformer

Author: Gao, Shengyi, Chen, Zhe, Chen, Guo, Wang, Wenhai, and Lu, Tong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to interested visual features. Besides, we present an audio-visual mixer, which can dynamically adjust visual features by amplifying relevant and suppressing irrelevant spatial channels. Additionally, we devise an intermediate mask loss to enhance the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer., Comment: 7 pages, 6 figures
Published: 2023

31. Photocatalytic reduction of CO2 with H2O into C2H6 mediated by dual metalation strategy

Author: Wang, Mei and Lu, Tong-Bu
Published: 2024
Full Text: View/download PDF

32. GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Author: Wang, Tao, Zhang, Kaihao, Shao, Ziqian, Luo, Wenhan, Stenger, Bjorn, Lu, Tong, Kim, Tae-Kyun, Liu, Wei, and Li, Hongdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image restoration in adverse weather conditions is a difficult task in computer vision. In this paper, we propose a novel transformer-based framework called GridFormer which serves as a backbone for image restoration under adverse weather conditions. GridFormer is designed in a grid structure using a residual dense transformer block, and it introduces two core designs. First, it uses an enhanced attention mechanism in the transformer layer. The mechanism includes stages of the sampler and compact self-attention to improve efficiency, and a local enhancement stage to strengthen local information. Second, we introduce a residual dense transformer block (RDTB) as the final GridFormer layer. This design further improves the network's ability to learn effective features from both preceding and current local features. The GridFormer framework achieves state-of-the-art results on five diverse image restoration tasks in adverse weather conditions, including image deraining, dehazing, deraining & dehazing, desnowing, and multi-weather restoration. The source code and pre-trained models are available at https://github.com/TaoWangzj/GridFormer., Comment: 20 pages, 15 figures, accepted by IJCV
Published: 2023

33. VideoLLM: Modeling Video Sequence with Large Language Models

Author: Chen, Guo, Zheng, Yin-Dong, Wang, Jiahao, Xu, Jilan, Huang, Yifei, Pan, Junting, Wang, Yi, Wang, Yali, Qiao, Yu, Lu, Tong, and Wang, Limin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks. The success of large language models (LLMs) like GPT has demonstrated their impressive abilities in sequence causal reasoning. Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. Subsequently, with the aid of a simple task head, our VideoLLM yields an effective unified framework for different kinds of video understanding tasks. To evaluate the efficacy of VideoLLM, we conduct extensive experiments using multiple LLMs and fine-tuning methods. We evaluate our VideoLLM on eight tasks sourced from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks. We release the code at https://github.com/cg1177/VideoLLM., Comment: Technical Report
Published: 2023

34. Graph Propagation Transformer for Graph Representation Learning

Author: Chen, Zhe, Tan, Hao, Wang, Tao, Shen, Tianrun, Lu, Tong, Peng, Qiuying, Cheng, Cheng, and Qi, Yue
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: This paper presents a novel transformer architecture for graph representation learning. The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks. Specifically, we propose a new attention mechanism called Graph Propagation Attention (GPA). It explicitly passes the information among nodes and edges in three ways, i.e. node-to-node, node-to-edge, and edge-to-node, which is essential for learning graph-structured data. On this basis, we design an effective transformer architecture named Graph Propagation Transformer (GPTrans) to further help learn graph data. We verify the performance of GPTrans in a wide range of graph learning experiments on several benchmark datasets. These results show that our method outperforms many state-of-the-art transformer-based graph models with better performance. The code will be released at https://github.com/czczup/GPTrans., Comment: Accepted to IJCAI 2023
Published: 2023

35. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Author: Wang, Wenhai, Chen, Zhe, Chen, Xiaokang, Wu, Jiannan, Zhu, Xizhou, Zeng, Gang, Luo, Ping, Lu, Tong, Zhou, Jie, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on https://github.com/OpenGVLab/InternGPT. The code shall be released at https://github.com/OpenGVLab/VisionLLM., Comment: Technical Report
Published: 2023

36. Network method for voxel-pair-level brain connectivity analysis under spatial-contiguity constraints

Author: Lu, Tong, Zhang, Yuan, Kochunov, Peter, Hong, Elliot, and Chen, Shuo
Subjects: Statistics - Methodology, Statistics - Applications
Abstract: Brain connectome analysis commonly compresses high-resolution brain scans (typically composed of millions of voxels) down to only hundreds of regions of interest (ROIs) by averaging within-ROI signals. This huge dimension reduction improves computational speed and the morphological properties of anatomical structures; however, it also comes at the cost of substantial losses in spatial specificity and sensitivity, especially when the signals exhibit high within-ROI heterogeneity. Oftentimes, abnormally expressed functional connectivity (FC) between a pair of ROIs caused by a brain disease is primarily driven by only small subsets of voxel pairs within the ROI pair. This article proposes a new network method for detection of voxel-pair-level neural dysconnectivity with spatial constraints. Specifically, focusing on an ROI pair, our model aims to extract dense sub-areas that contain aberrant voxel-pair connections while ensuring that the involved voxels are spatially contiguous. In addition, we develop sub-community-detection algorithms to realize the model, and the consistency of these algorithms is justified. Comprehensive simulation studies demonstrate our method's effectiveness in reducing the false-positive rate while increasing statistical power, detection replicability, and spatial specificity. We apply our approach to reveal: (i) voxel-wise schizophrenia-altered FC patterns within the salience and temporal-thalamic network from 330 participants in a schizophrenia study; (ii) disrupted voxel-wise FC patterns related to nicotine addiction between the basal ganglia, hippocampus, and insular gyrus from 3269 participants using UK Biobank data. The detected results align with previous medical findings but include improved localized information., Comment: 25 pages, 6 figures
Published: 2023

37. MRSN: Multi-Relation Support Network for Video Action Detection

Author: Zheng, Yin-Dong, Chen, Guo, Yuan, Minglei, and Lu, Tong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Action detection is a challenging video understanding task, requiring modeling spatio-temporal and interaction relations. Current methods usually model actor-actor and actor-context relations separately, ignoring their complementarity and mutual support. To solve this problem, we propose a novel network called Multi-Relation Support Network (MRSN). In MRSN, Actor-Context Relation Encoder (ACRE) and Actor-Actor Relation Encoder (AARE) model the actor-context and actor-actor relation separately. Then Relation Support Encoder (RSE) computes the supports between the two relations and performs relation-level interactions. Finally, Relation Consensus Module (RCM) enhances two relations with the long-term relations from the Long-term Relation Bank (LRB) and yields a consensus. Our experiments demonstrate that modeling relations separately and performing relation-level interactions can match and outperform state-of-the-art results on two challenging video datasets: AVA and UCF101-24., Comment: 6 pages
Published: 2023

38. Probabilistic Inference Based Incremental Graph Index for Similarity Search on Social Networks

Author: Lu, Tong, Qi, Zhiwei, Yue, Kun, Duan, Liang, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin, Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Gao, Honghao, editor, Wang, Xinheng, editor, and Voros, Nikolaos, editor
Published: 2024
Full Text: View/download PDF

39. Structure and Functioning of China’s Dryland Ecosystems in a Changing Environment

Author: Li, Changjia, Fu, Bojie, Wang, Shuai, Stringer, Lindsay C., Zhou, Wenxin, Lu, Tong, Wu, Xutong, Hu, Rina, Ren, Zhuobing, Fu, Bojie, editor, and Stafford-Smith, Mark, editor
Published: 2024
Full Text: View/download PDF

40. Intranasal administration of insulin on the incidence of postoperative delirium in middle-aged patients undergoing elective on-pump cardiac surgery (INIPOD-MOPS): a prospective double-blinded randomized control study protocol

Author: Yang, Ming, Yang, Guiying, Lu, Tong, Cao, Lei, Xiao, Cheng, Liang, Yan, Ding, Jinping, Jiang, Xuetao, Wang, Wei, Chen, Fang, Du, Zhiyong, and Li, Hong
Published: 2024
Full Text: View/download PDF

41. Short-term microbial community dynamics induced by 13C-labeled maize root, its derived biochar and NPK in long-term amended soil

Author: Lu, Zonglin, Lu, Tong, Shi, Junmei, Chen, Kun, Guo, Hangming, Li, Na, and Han, Xiaori
Published: 2024
Full Text: View/download PDF

42. Development and validation of an automated Tomotherapy planning method for cervical cancer

Author: Han, Feiru, Xue, Yi, Huang, Sheng, Lu, Tong, Yang, Yining, Cao, Yuanjie, Chen, Jie, Hou, Hailing, Sun, Yao, Wang, Wei, Yuan, Zhiyong, Tao, Zhen, and Jiang, Shengpeng
Published: 2024
Full Text: View/download PDF

43. Intranasal PAMAM-G3 scavenges cell-free DNA attenuating the allergic airway inflammation

Author: Chen, Xiumin, Chen, Changhui, Tu, Zhaoxu, Guo, Zeling, Lu, Tong, Li, Jian, Wen, Yihui, Chen, Dehua, Lei, Wenbin, Wen, Weiping, and Li, Hang
Published: 2024
Full Text: View/download PDF

44. Effects of extreme temperatures on public sentiment in 49 Chinese cities

Author: Wang, Chan, Bai, Yi-Xiang, Li, Xin-Wu, and Lin, Lu-tong
Published: 2024
Full Text: View/download PDF

45. The influence of placenta microbiota of normal term pregnant women on immune regulation during pregnancy

Author: Yang, Ping, Lu, Tong, Liang, Xinyuan, Huang, Ting, Wu, Lulu, He, Zonglin, Xiao, Xiaomin, and Fan, Shangrong
Published: 2024
Full Text: View/download PDF

46. Three-dimensional analysis of hard and soft tissue changes in skeletal class II patients with high mandibular plane angle undergoing surgery

Author: Zhang, Caixia, Lu, Tong, Wang, Lichan, Wen, Juan, Huang, Ziwei, Lin, Shuang, Zhou, Yiwen, Li, Guifeng, and Li, Huang
Published: 2024
Full Text: View/download PDF

47. Ultrasound-guided erector spinae plane block for perioperative analgesia in patients undergoing laparoscopic nephrectomies surgery: a randomized controlled trial

Author: Yang, Ming, Cao, Lei, Lu, Tong, Xiao, Cheng, Wu, Zhuoxi, Jiang, Xuetao, Wang, Wei, and Li, Hong
Published: 2024
Full Text: View/download PDF

48. Preparation of PAN/GO composite nanofiber membrane for oil-containing wastewater treatment

Author: Lu, Tong, Zhang, Qingxia, Yang, Jingjing, Xin, Yue, Zhang, Zhilei, Hu, Lingye, Hu, Jing, Qin, Qin, and Yang, Hao
Published: 2024
Full Text: View/download PDF

49. DDP: Diffusion Model for Dense Visual Prediction

Author: Ji, Yuanfeng, Chen, Zhe, Xie, Enze, Hong, Lanqing, Liu, Xihui, Liu, Zhaoqiang, Lu, Tong, Li, Zhenguo, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design and architecture customization, DDP is easy to generalize to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We show top results on three representative tasks across six diverse benchmarks. Without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts, for example in semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research., Comment: Added controlnet exp
Published: 2023

50. Champion Solution for the WSDM2023 Toloka VQA Challenge

Author: Gao, Shengyi, Chen, Zhe, Chen, Guo, Wang, Wenhai, and Lu, Tong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge. Different from the common VQA and visual grounding (VG) tasks, this challenge involves a more complex scenario, i.e. inferring and locating the object implicitly specified by the given interrogative question. For this task, we leverage ViT-Adapter, a pre-training-free adapter network, to adapt multi-modal pre-trained Uni-Perceiver for better cross-modal localization. Our method ranks first on the leaderboard, achieving 77.5 and 76.347 IoU on public and private test sets, respectively. It shows that ViT-Adapter is also an effective paradigm for adapting the unified perception model to vision-language downstream tasks. Code and models will be released at https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023., Comment: Technical report in WSDM Cup 2023
Published: 2023
