Author: "Wang, Xinlong" - Searchworks@Jio Institute Digital Library Search Results

Your search for author "Wang, Xinlong" returned 1,914 results

1. A Simple Image Segmentation Framework via In-Context Examples

Author: Liu, Yang, Jing, Chenchen, Li, Hengtao, Zhu, Muzhi, Chen, Hao, Wang, Xinlong, and Shen, Chunhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, there have been explorations of generalist segmentation models that can effectively tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, as not all in-context examples can accurately convey the task information. In order to address this issue, we present SINE, a simple image Segmentation framework utilizing in-context examples. Our approach leverages a Transformer encoder-decoder structure, where the encoder provides high-quality image representations, and the decoder is designed to yield multiple task-specific output masks to effectively eliminate task ambiguity. Specifically, we introduce an In-context Interaction module to complement in-context information and produce correlations between the target image and the in-context example and a Matching Transformer that uses fixed matching and a Hungarian algorithm to eliminate differences between different tasks. In addition, we have further perfected the current evaluation system for in-context image segmentation, aiming to facilitate a holistic appraisal of these models. Experiments on various segmentation tasks show the effectiveness of the proposed method.
Comment: Accepted to Proc. Conference on Neural Information Processing Systems (NeurIPS) 2024. Webpage: https://github.com/aim-uofa/SINE
Published: 2024

2. Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

Author: Zhu, Muzhi, Liu, Yang, Luo, Zekai, Jing, Chenchen, Chen, Hao, Xu, Guangkai, Wang, Xinlong, and Shen, Chunhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.
Comment: Accepted to Proc. Annual Conference on Neural Information Processing Systems (NeurIPS) 2024
Published: 2024

3. Emu3: Next-Token Prediction is All You Need

Author: Wang, Xinlong, Zhang, Xiaosong, Luo, Zhengxiong, Sun, Quan, Cui, Yufeng, Wang, Jinsheng, Zhang, Fan, Wang, Yueze, Li, Zhen, Yu, Qiying, Zhao, Yingli, Ao, Yulong, Min, Xuebin, Li, Tao, Wu, Boya, Zhao, Bo, Zhang, Bowen, Wang, Liangdong, Liu, Guang, He, Zheqi, Yang, Xi, Liu, Jingjing, Lin, Yonghua, Huang, Tiejun, and Wang, Zhongyuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
Comment: Project Page: https://emu.baai.ac.cn
Published: 2024

4. Diffusion Feedback Helps CLIP See Better

Author: Wang, Wenxuan, Sun, Quan, Zhang, Fan, Tang, Yepeng, Liu, Jing, and Wang, Xinlong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, owing to the lack of distinctiveness in the text and diversity in the images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance (e.g., by 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code is available at https://github.com/baaivision/DIVA.
Published: 2024

5. DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Author: Li, Xiaotong, Zhang, Fan, Diao, Haiwen, Wang, Yueze, Wang, Xinlong, and Duan, Ling-Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hinges on the availability of high-quality image-text datasets that offer diverse visual elements and thorough image descriptions. However, the scarcity of such hyper-detailed datasets currently hinders progress within the MLLM community. The bottleneck stems from the limited perceptual capabilities of current caption engines, which fall short in providing complete and accurate annotations. To facilitate the cutting-edge research of MLLMs on comprehensive vision perception, we thereby propose Perceptual Fusion, using a low-budget but highly effective caption engine for complete and accurate image descriptions. Specifically, Perceptual Fusion integrates diverse perception experts as image priors to provide explicit information on visual elements and adopts an efficient MLLM as a centric pivot to mimic advanced MLLMs' perception abilities. We carefully select 1M highly representative images from the uncurated LAION dataset and generate dense descriptions using our engine, dubbed DenseFusion-1M. Extensive experiments validate that our engine outperforms its counterparts, where the resulting dataset significantly improves the perception and cognition abilities of existing MLLMs across diverse vision-language benchmarks, especially with high-resolution images as inputs. The dataset and code are publicly available at https://github.com/baaivision/DenseFusion.
Published: 2024

6. Unveiling Encoder-Free Vision-Language Models

Author: Diao, Haiwen, Cui, Yufeng, Li, Xiaotong, Wang, Yueze, Lu, Huchuan, and Wang, Xinlong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
Comment: 16 pages, 7 figures
Published: 2024

7. Effects of dry-wet cycles on the mechanical properties of sandstone with unloading-induced damage

Author: Nan, Gan, Zhang, Jiaming, Luo, Yi, Wang, Xinlong, and Hu, Zhongyi
Published: 2024
Full Text: View/download PDF

8. Incorporating nano-ZnCo-ZIF particles in the electrospinning polylactide membranes to improve their filtration and antibacterial performances

Author: Deng, Qingchen, Li, Jiangen, Li, Xiang, Du, Xuye, Wu, Lanlan, Wang, Junrui, and Wang, Xinlong
Published: 2024
Full Text: View/download PDF

9. Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

Author: Wang, Wenxuan, Zhang, Yisi, He, Xingjian, Yan, Yichen, Zhao, Zijia, Wang, Xinlong, and Liu, Jing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual grounding (VG) aims at locating the foreground entities that match the given natural language expressions. Previous datasets and methods for classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios. Since users usually prefer to provide intention-based expression for the desired object instead of covering all the details, it is necessary for the agents to interpret the intention-driven instructions. Thus, in this work, we take a step further to the intention-driven visual-language (V-L) understanding. To promote classic VG towards human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build a large-scale IVG dataset termed IntentionVG with free-form intention expressions. Considering that practical agents need to move and find specific targets among various scenarios to realize the grounding task, our IVG task and IntentionVG dataset have taken the crucial properties of both multi-scenario perception and egocentric view into consideration. Besides, various types of models are set up as the baselines to realize our IVG task. Extensive experiments on our IntentionVG dataset and baselines demonstrate the necessity and efficacy of our method for the V-L field. To foster future research in this direction, our newly built dataset and baselines will be publicly available at https://github.com/Rubics-Xuan/IVG.
Comment: This work has been accepted by ACL 2024 (Findings)
Published: 2024

10. EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Author: Sun, Quan, Wang, Jinsheng, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, and Wang, Xinlong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18-billion parameters. With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5-billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe a consistent performance improvement with the model size scaling of EVA-CLIP, despite maintaining a constant training dataset of 2-billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models.
Published: 2024

11. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Author: Zhu, Lianghui, Liao, Bencheng, Zhang, Qian, Wang, Xinlong, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.
Comment: Work in progress. Code is available at https://github.com/hustvl/Vim
Published: 2024

12. Generative Multimodal Models are In-Context Learners

Author: Sun, Quan, Cui, Yufeng, Zhang, Xiaosong, Zhang, Fan, Yu, Qiying, Luo, Zhengxiong, Wang, Yueze, Rao, Yongming, Liu, Jingjing, Huang, Tiejun, and Wang, Xinlong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.
Comment: Accepted to CVPR 2024. Project page: https://baaivision.github.io/emu2
Published: 2023

13. Tokenize Anything via Prompting

Author: Pan, Ting, Tang, Lulu, Wang, Xinlong, and Shan, Shiguang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 164.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of visual perception tasks. Code and models are available at https://github.com/baaivision/tokenize-anything.
Comment: code, model, and demo: https://github.com/baaivision/tokenize-anything
Published: 2023

14. Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

Author: Wang, Wenxuan, Yue, Tongtian, Zhang, Yisi, Guo, Longteng, He, Xingjian, Wang, Xinlong, and Liu, Jing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and methods for the classic RES task heavily rely on the prior assumption that one expression must refer to object-level targets. In this paper, we take a step further to the finer-grained part-level RES task. To promote the object-level RES task towards finer-grained vision-language understanding, we put forward a new multi-granularity referring expression segmentation (MRES) task and construct an evaluation benchmark called RefCOCOm by manual annotations. By employing our automatic model-assisted data engine, we build the largest visual grounding dataset, namely MRES-32M, which comprises over 32.2M high-quality masks and captions on the provided 1M images. Besides, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task. Extensive experiments on our RefCOCOm for MRES and three datasets (i.e., RefCOCO(+/g)) for the classic RES task demonstrate the superiority of our method over previous state-of-the-art methods. To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset and model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES
Comment: This work is accepted by CVPR 2024
Published: 2023

15. Masked Channel Modeling for Bootstrapping Visual Pre-training

Author: Liu, Yang, Wang, Xinlong, Zhu, Muzhi, Cao, Yue, Huang, Tiejun, and Shen, Chunhua
Published: 2024
Full Text: View/download PDF

16. Effects of silvopastoral systems on soil nutrient properties in the low hilly area of western Henan province, China

Author: Liu, Peisong, Cheng, Fan, Hu, Jun, Li, Meng, Wang, Xinlong, You, Shirong, Tong, Weishuang, Cheng, Liping, Zhang, Jinping, and Kou, Lixuan
Published: 2024
Full Text: View/download PDF

17. Generation of Autologous Vascular Endothelial Cells for Patients with Peripheral Artery Disease

Author: Jiang, Bin, Wang, Xinlong, Rivera-Bolanos, Nancy, and Ameer, Guillermo A.
Published: 2024
Full Text: View/download PDF

18. GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation

Author: Ma, Baorui, Deng, Haoge, Zhou, Junsheng, Liu, Yu-Shen, Huang, Tiejun, and Wang, Xinlong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models has shown great promise but still suffers from inconsistent 3D geometric structures (Janus problems) and severe artifacts. The aforementioned problems mainly stem from 2D diffusion models lacking 3D awareness during the lifting. In this work, we present GeoDream, a novel method that incorporates explicit generalized 3D priors with 2D diffusion priors to enhance the capability of obtaining unambiguous 3D consistent geometric structures without sacrificing diversity or fidelity. Specifically, we first utilize a multi-view diffusion model to generate posed images and then construct cost volume from the predicted image, which serves as native 3D geometric priors, ensuring spatial consistency in 3D space. Subsequently, we further propose to harness 3D geometric priors to unlock the great potential of 3D awareness in 2D diffusion priors via a disentangled design. Notably, disentangling 2D and 3D priors allows us to refine 3D geometric priors further. We justify that the refined 3D geometric priors aid in the 3D-aware capability of 2D diffusion priors, which in turn provides superior guidance for the refinement of 3D geometric priors. Our numerical and visual comparisons demonstrate that GeoDream generates more 3D consistent textured meshes with high-resolution realistic renderings (i.e., 1024 × 1024) and adheres more closely to semantic coherence.
Comment: Code and Demo: https://github.com/baaivision/GeoDream
Published: 2023

19. CapsFusion: Rethinking Image-Text Data at Scale

Author: Yu, Qiying, Sun, Quan, Zhang, Xiaosong, Cui, Yufeng, Zhang, Fan, Cao, Yue, Wang, Xinlong, and Liu, Jingjing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample efficiency (requiring 11-16 times less computation than baselines), world knowledge depth, and scalability. These effectiveness, efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training.
Comment: CVPR 2024. Code & Dataset: https://github.com/baaivision/CapsFusion
Published: 2023

20. JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Author: Zhu, Lianghui, Wang, Xinggang, and Wang, Xinlong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
Comment: 30 pages, 23 figures
Published: 2023

21. 3D-GPT: Procedural 3D Modeling with Large Language Models

Author: Sun, Chunyi, Han, Junlin, Deng, Weijian, Wang, Xinlong, Qin, Zishan, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: In the pursuit of efficient automated content creation, procedural generation, leveraging modifiable parameters and rule-based systems, emerges as a promising approach. Nonetheless, it could be a demanding endeavor, given its intricate nature necessitating a deep understanding of rules, algorithms, and parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task. 3D-GPT integrates three core agents: the task dispatch agent, the conceptualization agent, and the modeling agent. They collaboratively achieve two objectives. First, it enhances concise initial scene descriptions, evolving them into detailed forms while dynamically adapting the text based on subsequent instructions. Second, it integrates procedural generation, extracting parameter values from enriched text to effortlessly interface with 3D software for asset creation. Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results but also collaborates effectively with human designers. Furthermore, it seamlessly integrates with Blender, unlocking expanded manipulation possibilities. Our work highlights the potential of LLMs in 3D modeling, offering a basic framework for future advancements in scene generation and animation.
Comment: Project page: https://chuny1.github.io/3DGPT/3dgpt.html
Published: 2023

22. Uni3D: Exploring Unified 3D Representation at Scale

Author: Zhou, Junsheng, Wang, Jinsheng, Ma, Baorui, Liu, Yu-Shen, Huang, Tiejun, and Wang, Xinlong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Scaling up representations for images or text has been extensively investigated in the past few years and has led to revolutions in learning vision and language. However, scalable representation for 3D objects and scenes is relatively unexplored. In this work, we present Uni3D, a 3D foundation model to explore the unified 3D representation at scale. Uni3D uses a 2D initialized ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features. Via the simple architecture and pretext task, Uni3D can leverage abundant 2D pretrained models as initialization and image-text aligned models as the target, unlocking the great potential of 2D models and scaling-up strategies to the 3D world. We efficiently scale up Uni3D to one billion parameters, and set new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding and part segmentation. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild. We believe that Uni3D provides a new direction for exploring both scaling up and efficiency of the representation in 3D domain.
Comment: Code and Demo: https://github.com/baaivision/Uni3D
Published: 2023

23. Multipotent bone marrow cell-seeded polymeric composites drive long-term, definitive urinary bladder tissue regeneration.

Author: Bury, Matthew, Fuller, Natalie, Wang, Xinlong, Chan, Yvonne, Oh, Sang, Sofer, Laurel, Arora, Hans, Sharma, Tiffany, Nolan, Bonnie, Feng, Wei, Rabizadeh, Rebecca, Barac, Milica, Edassery, Sonia, Goedegebuure, Madeleine, Wang, Larry, Ganesh, Balaji, Halliday, Lisa, Seniw, Mark, Edassery, Seby, Mahmud, Nadim, Hofer, Matthias, McKenna, Kevin, Cheng, Earl, Ameer, Guillermo, Sharma, Arun, and Sturm, Renea
Subjects: autologous stem cells, biomechanocompatible scaffold, regenerative engineering, urinary bladder
Abstract: To date, there are no efficacious translational solutions for end-stage urinary bladder dysfunction. Current surgical strategies, including urinary diversion and bladder augmentation enterocystoplasty (BAE), utilize autologous intestinal segments (e.g. ileum) to increase bladder capacity to protect renal function. Considered the standard of care, BAE is fraught with numerous short- and long-term clinical complications. Previous clinical trials employing tissue engineering approaches for bladder tissue regeneration have also been unable to translate bench-top findings into clinical practice. Major obstacles still persist that need to be overcome in order to advance tissue-engineered products into the clinical arena. These include scaffold/bladder incongruencies, the acquisition and utility of appropriate cells for anatomic and physiologic tissue recapitulation, and the choice of an appropriate animal model for testing. In this study, we demonstrate that the elastomeric, bladder biomechanocompatible poly(1,8-octamethylene-citrate-co-octanol) (PRS; synthetic) scaffold coseeded with autologous bone marrow-derived mesenchymal stem cells and CD34+ hematopoietic stem/progenitor cells support robust long-term, functional bladder tissue regeneration within the context of a clinically relevant baboon bladder augmentation model simulating bladder trauma. Partially cystectomized baboons were independently augmented with either autologous ileum or stem-cell-seeded small-intestinal submucosa (SIS; a commercially available biological scaffold) or PRS grafts. Stem-cell synergism promoted functional trilayer bladder tissue regeneration, including whole-graft neurovascularization, in both cell-seeded grafts. However, PRS-augmented animals demonstrated fewer clinical complications and more advantageous tissue characterization metrics compared to ileum and SIS-augmented animals. Two-year study data demonstrate that PRS/stem-cell-seeded grafts drive bladder tissue regeneration and are a suitable alternative to BAE.
Published: 2024

24. Emu: Generative Pretraining in Multimodality

Author: Sun, Quan, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Yueze, Gao, Hongcheng, Liu, Jingjing, Huang, Tiejun, and Wang, Xinlong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance., Comment: Accepted to ICLR 2024. Code and Models: https://github.com/baaivision/Emu
Published: 2023

25. Fine-Grained Visual Prompting

Author: Yang, Lingfeng, Wang, Yueze, Li, Xiang, Wang, Xinlong, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP.
Published: 2023

26. Transcranial Infrared Laser Stimulation

Author: Wang, Xinlong, Gonzalez-Lima, Francisco, Liu, Hanli, Wassermann, Eric M., book editor, Peterchev, Angel V., book editor, Ziemann, Ulf, book editor, Lisanby, Sarah H., book editor, Siebner, Hartwig R., book editor, and Walsh, Vincent, book editor
Published: 2024
Full Text: View/download PDF

27. Biomechanical study of the stability of posterior cervical expansive open-door laminoplasty combined with bilateral C4/5 foraminotomy and short-segment lateral mass screw fixation: a finite element analysis

Author: Li, Kunpeng, Yu, Qun, Wang, Chongyi, Zhang, Runtong, Fu, Qingyang, Feng, Yunze, Liu, Chen, Wang, Xinlong, Zhang, Ronghan, Li, Le, and Si, Haipeng
Published: 2024
Full Text: View/download PDF

28. Repetitive transcranial magnetic stimulation promotes motor function recovery in mice after spinal cord injury via regulation of the Cx43-autophagy loop

Author: Zhang, Lechi, Xiao, Zhihang, Su, Zelin, Wang, Xinlong, Tian, Huifang, and Su, Min
Published: 2024
Full Text: View/download PDF

29. Directed physiological networks in the human prefrontal cortex at rest and post transcranial photobiomodulation

Author: Shahdadian, Sadra, Wang, Xinlong, and Liu, Hanli
Published: 2024
Full Text: View/download PDF

30. Towards Better Entity Linking with Multi-View Enhanced Distillation

Author: Liu, Yi, Tian, Yuan, Lian, Jianxun, Wang, Xinlong, Cao, Yanan, Fang, Fang, Zhang, Wen, Huang, Haizhen, Deng, Denvy, and Zhang, Qi
Subjects: Computer Science - Computation and Language
Abstract: Dense retrieval is widely used for entity linking to retrieve entities from large-scale knowledge bases. Mainstream techniques are based on a dual-encoder framework, which encodes mentions and entities independently and calculates their relevances via rough interaction metrics, resulting in difficulty in explicitly modeling multiple mention-relevant parts within entities to match divergent mentions. Aiming at learning entity representations that can match divergent mentions, this paper proposes a Multi-View Enhanced Distillation (MVD) framework, which can effectively transfer knowledge of multiple fine-grained and mention-relevant parts within entities from cross-encoders to dual-encoders. Each entity is split into multiple views to avoid irrelevant information being over-squashed into the mention-relevant view. We further design cross-alignment and self-alignment mechanisms for this framework to facilitate fine-grained knowledge distillation from the teacher model to the student model. Meanwhile, we reserve a global-view that embeds the entity as a whole to prevent dispersal of uniform information. Experiments show our method achieves state-of-the-art performance on several entity linking benchmarks., Comment: Accepted by ACL 2023 Main Conference
Published: 2023

31. Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

Author: Liu, Yang, Zhu, Muzhi, Li, Hengtao, Chen, Hao, Wang, Xinlong, and Shen, Chunhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild. Our code can be found at https://github.com/aim-uofa/Matcher., Comment: Accepted to ICLR2024
Published: 2023

32. SegGPT: Segmenting Everything In Context

Author: Wang, Xinlong, Zhang, Xiaosong, Cao, Yue, Wang, Wen, Shen, Chunhua, and Huang, Tiejun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. Our results show strong capabilities in segmenting in-domain and out-of-domain targets, either qualitatively or quantitatively., Comment: Code and Demo: https://github.com/baaivision/Painter
Published: 2023

33. Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

Author: Wang, Wen, Jiang, Yan, Xie, Kangyang, Liu, Zide, Chen, Hao, Cao, Yue, Wang, Xinlong, and Shen, Chunhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Large-scale text-to-image diffusion models achieve unprecedented success in image generation and editing. However, how to extend such success to video editing is unclear. Recent initial attempts at video editing require significant text-to-video data and computation resources for training, which is often not accessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video. At the core of our method is a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code is made available at https://github.com/baaivision/vid2vid-zero., Comment: Add customized video editing. Under Review
Published: 2023

34. EVA-CLIP: Improved Training Techniques for CLIP at Scale

Author: Sun, Quan, Fang, Yuxin, Wu, Ledell, Wang, Xinlong, and Cao, Yue
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP., Comment: To Rei and the moon. Code & Models: https://github.com/baaivision/EVA/tree/master/EVA-CLIP
Published: 2023

35. EVA-02: A Visual Representation for Neon Genesis

Author: Fang, Yuxin, Sun, Quan, Wang, Xinggang, Huang, Tiejun, Wang, Xinlong, and Cao, Yue
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community at https://github.com/baaivision/EVA/tree/master/EVA-02., Comment: v2: Fix some known issues & typos. v1: To Asuka. Code & Models: https://github.com/baaivision/EVA/tree/master/EVA-02
Published: 2023
Full Text: View/download PDF

36. High-Precision Collaborative Relative Navigation Method for Multiple Aircraft Based on Geometric Topology Constraints

Author: Lu Kewen, Wang Xinlong, Wang Bin
Subjects: multiple aircraft formation, aviation field, geometric topology constraints, collaborative relative navigation, high precision, Motor vehicles. Aeronautics. Astronautics, TL1-4050
Abstract: Multiple aircraft formation has strong multi operation capability, high reliability and high overall efficiency, which is an important direction for the future development of aviation field. High-precision relative navigation is crucial for multiple aircraft to achieve formation flight. In order to achieve high-precision relative navigation, the existing methods usually increase the relative navigation accuracy by adding the types of navigation sensors or improving the measurement accuracy of navigation sensors. However, this not only leads to an increase in the cost and volume of the relative navigation system, but also leads to an increase in the system complexity. Therefore, based on the overall geometry topology structure of the formation, a formation geometric topology constraint model is established and introduced into the estimation of the relative inertial navigation errors in this paper. A collaborative relative navigation method based on geometric topology constraints is proposed. By introducing the geometric topology information of the whole formation, the proposed method can improve the relative navigation accuracy without increasing the configuration of navigation sensors. On this basis, a high-precision collaborative relative navigation scheme for multiple aircraft is designed. The simulation results show that the proposed method has higher relative navigation accuracy than the existing method, and the mean squared errors of relative attitude determination, relative positioning and relative velocity measurement are reduced by more than 66.48%, 48.73% and 69.53%, respectively. The proposed method can achieve high-precision relative navigation for multiple aircraft.
Published: 2024
Full Text: View/download PDF

37. FOXM1 mediates methotrexate resistance in osteosarcoma cells by promoting autophagy

Author: Wang Luoyang, Zhai Dongchang, Tang Lei, Zhang Hui, Wang Xinlong, Ma Ning, Zhang Xiaoyue, Cheng Mingguo, and Shen Ruowu
Subjects: osteosarcoma, drug resistance, FOXM1, methotrexate, Biochemistry, QD415-436, Genetics, QH426-470
Abstract: Osteosarcoma (OS) is a primary bone cancer mostly found in adolescents and elderly individuals. The treatment of OS is still largely dependent on traditional chemotherapy. However, the high incidence of drug resistance remains one of the greatest impediments to limiting improvements in OS treatment. Recent findings have indicated that the transcription factor FOXM1 plays an important role in various cancer-related events, especially drug resistance. However, the possible role of FOXM1 in the resistance of OS to methotrexate (MTX) remains to be explored. Here, we find that FOXM1, which confers resistance to MTX, is highly expressed in OS tissues and MTX-resistant cells. FOXM1 overexpression promotes MTX resistance by enhancing autophagy in an HMMR/ATG7-dependent manner. Importantly, silencing of FOXM1 or inhibiting autophagy reverses drug resistance. These findings demonstrate a new mechanism for FOXM1-induced MTX resistance and provide a promising target for improving OS chemotherapy outcomes.
Published: 2024
Full Text: View/download PDF

38. Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Author: Wang, Xinlong, Wang, Wen, Cao, Yue, Shen, Chunhua, and Huang, Tiejun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulties for in-context learning lie in that tasks vary significantly in the output representations, thus it is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and specify task prompts as also images. With this idea, our training process is extremely simple, which performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter can achieve competitive performance compared to well-established task-specific models, on seven representative vision tasks ranging from high-level visual understanding to low-level image processing. In addition, Painter significantly outperforms recent generalist models on several challenging tasks., Comment: Accepted to CVPR 2023. Code and model is available at: https://github.com/baaivision/Painter
Published: 2022

39. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Author: Fang, Yuxin, Wang, Wen, Xie, Binhui, Sun, Quan, Wu, Ledell, Wang, Xinggang, Huang, Tiejun, Wang, Xinlong, and Cao, Yue
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA., Comment: v2: (i) fix / update EVA IN-1K variants results. (ii) add / update EVA-CLIP results. (iii) add Appendix. (iv) release all the code and models at https://github.com/baaivision/EVA
Published: 2022

40. Polygonal patterns of Faraday water waves analogous to collective excitations in Bose–Einstein condensates

Author: Liu, Xinyun and Wang, Xinlong
Published: 2024
Full Text: View/download PDF

41. Recent progress of manganese-based Prussian blue analogue cathode materials for sodium-ion batteries

Author: Liu, Yuao, Liu, Hongquan, Zhang, Ruizhong, Zhong, Yanjun, Wu, Zhenguo, Wang, Xinlong, and Zhang, Zhiye
Published: 2024
Full Text: View/download PDF

42. Face-directed Strategy for the Construction of Polyoxovanadate-based Metal-Organic Tetrahedra

Author: Chen, Huiping, Gong, Yaru, Chu, Qiangqiang, Pang, Xiao, Huang, Xiaojing, Tian, Xudong, Yang, Weiting, Pan, Qinhe, Su, Zhongmin, and Wang, Xinlong
Published: 2023
Full Text: View/download PDF

43. Point-Teaching: Weakly Semi-Supervised Object Detection with Point Annotations

Author: Ge, Yongtao, Zhou, Qiang, Wang, Xinlong, Wang, Zhibin, Li, Hao, and Shen, Chunhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Point annotations are considerably more time-efficient than bounding box annotations. However, how to use cheap point annotations to boost the performance of semi-supervised object detection remains largely unsolved. In this work, we present Point-Teaching, a weakly semi-supervised object detection framework to fully exploit the point annotations. Specifically, we propose a Hungarian-based point matching method to generate pseudo labels for point annotated images. We further propose multiple instance learning (MIL) approaches at the level of images and points to supervise the object detector with point annotations. Finally, we propose a simple-yet-effective data augmentation, termed point-guided copy-paste, to reduce the impact of the unmatched points. Experiments demonstrate the effectiveness of our method on a few datasets and various data regimes.
Published: 2022

44. Hydroxylated SiO2-modified {0 0 1}-TiO2 nanosheets as a surface multifunctional photocatalyst for enhanced degradation of gaseous toluene

Author: Tang, Juntao, Wang, Xinlong, Huang, Yusheng, Du, Xiaopeng, He, Zhiqiao, Wang, Da, and Song, Shuang
Published: 2024
Full Text: View/download PDF

45. A biodegradable microgrooved and tissue mechanocompatible citrate-based scaffold improves bladder tissue regeneration

Author: Goedegebuure, Madeleine, Bury, Matthew I., Wang, Xinlong, Sanfelice, Pasquale, Cammarata, Federico, Wang, Larry, Sharma, Tiffany T., Rajinikanth, Nachiket, Karra, Vikram, Siddha, Vidhika, Sharma, Arun K., and Ameer, Guillermo A.
Published: 2024
Full Text: View/download PDF

46. Personalized composite scaffolds for accelerated cell- and growth factor-free craniofacial bone regeneration

Author: Kim, Mirae, Wang, Xinlong, Li, Yiming, Lin, Zitong, Collins, Caralyn P., Liu, Yugang, Ahn, Yujin, Tsal, Hsiu-Ming, Song, Joseph W., Duan, Chongwen, Zhu, Yi, Sun, Cheng, He, Tong-Chuan, Luo, Yuan, Reid, Russell R., and Ameer, Guillermo A.
Published: 2024
Full Text: View/download PDF

47. Improving the particulate matter filtration, antibacterial, and degradation properties of electrospinning poly(lactic acid) membranes with ZIF-8@chitosan

Author: Deng, Qingchen, Huang, Zhen, Zhu, Mengyu, Zong, Xinyue, Yue, Zhenqing, and Wang, Xinlong
Published: 2024
Full Text: View/download PDF

48. EVA-02: A visual representation for neon genesis

Author: Fang, Yuxin, Sun, Quan, Wang, Xinggang, Huang, Tiejun, Wang, Xinlong, and Cao, Yue
Published: 2024
Full Text: View/download PDF

49. Effects of the quorum sensing related luxS gene and lsr operon on Klebsiella michiganensis resisting copper stress

Author: Gao, Ya, Peng, Dongyu, Wang, Xinlong, and Lin, Shanshan
Published: 2024
Full Text: View/download PDF

50. Glucose-activated self-cascade antibacterial and pro-angiogenesis nanozyme-functionalized chitosan-arginine thermosensitive hydrogel for chronic diabetic wounds healing

Author: Chen, Shuhui, Chen, Jiali, Wang, Xinlong, Yang, Zhaofei, Lan, Jinxi, Wang, Liudi, Ji, Bingjie, and Yuan, Yue
Published: 2025
Full Text: View/download PDF
