Author: "Xu, Haiyang" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xu, Haiyang"' showing total 1,311 results

1. SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing

Author: Qian, Qi, Xu, Haiyang, Yan, Ming, and Hu, Juhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models demonstrate impressive image generation performance with text guidance. Inspired by the learning process of diffusion, existing images can be edited according to text by DDIM inversion. However, the vanilla DDIM inversion is not optimized for classifier-free guidance and the accumulated error results in undesired performance. While many algorithms have been developed to improve the framework of DDIM inversion for editing, in this work, we investigate the approximation error in DDIM inversion and propose to disentangle the guidance scale for the source and target branches to reduce the error while keeping the original framework. Moreover, a better guidance scale (i.e., 0.5) than default settings can be derived theoretically. Experiments on PIE-Bench show that our proposal can improve the performance of DDIM inversion dramatically without sacrificing efficiency.
Published: 2024

2. mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Author: Hu, Anwen, Xu, Haiyang, Zhang, Liang, Ye, Jiabo, Yan, Ming, Zhang, Ji, Jin, Qin, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory usage and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop the DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
Comment: 15 pages, 7 figures
Published: 2024

3. MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

Author: Jiang, Chaoya, Jia, Hongrui, Xu, Haiyang, Ye, Wei, Dong, Mengfan, Yan, Ming, Zhang, Ji, Huang, Fei, and Zhang, Shikun
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism for long-sequence continuous features to enhance multi-image processing efficiency. Experimental results demonstrate that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
Published: 2024

4. mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Author: Ye, Jiabo, Xu, Haiyang, Liu, Haowei, Hu, Anwen, Yan, Ming, Qian, Qi, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
Published: 2024

5. MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Author: Liu, Haowei, Zhang, Xi, Xu, Haiyang, Shi, Yaya, Jiang, Chaoya, Yan, Ming, Zhang, Ji, Huang, Fei, Yuan, Chunfeng, Li, Bing, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark, MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source MLLMs and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
Comment: 10 pages, 4 figures
Published: 2024

6. OmniControlNet: Dual-stage Integration for Conditional Image Generation

Author: Wang, Yilin, Xu, Haiyang, Zhang, Xiang, Chen, Zeyuan, Sha, Zhizhou, Wang, Zirui, and Tu, Zhuowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the two-stage ControlNet pipeline is limited in that it is not self-contained (e.g., it calls external condition generation algorithms) and has large model redundancy (separately trained models for different types of conditioning inputs). Our proposed OmniControlNet consolidates 1) the condition generation (e.g., HED edges, depth maps, user scribble, and animal pose) by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance. OmniControlNet achieves significantly reduced model complexity and redundancy while remaining capable of producing images of comparable quality for conditioned text-to-image generation.
Comment: Accepted to CVPR 2024 Workshop: Generative Models for Computer Vision
Published: 2024

7. Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Author: Wang, Junyang, Xu, Haiyang, Jia, Haitao, Zhang, Xi, Yan, Ming, Shen, Weizhou, Zhang, Ji, Huang, Fei, and Sang, Jitao
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
Comment: 22 pages, 11 figures, 10 tables
Published: 2024

8. TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Author: Zhang, Liang, Hu, Anwen, Xu, Haiyang, Yan, Ming, Xu, Yichen, Jin, Qin, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reducing the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reducing lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, and the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates superior efficiency, with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart.
Comment: 13 pages, 11 figures
Published: 2024

9. mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Author: Hu, Anwen, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Zhang, Liang, Zhang, Bo, Li, Chen, Zhang, Ji, Jin, Qin, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontally adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
Comment: 21 pages, 15 figures
Published: 2024

10. Bayesian Diffusion Models for 3D Shape Reconstruction

Author: Xu, Haiyang, Lei, Yu, Chen, Zeyuan, Zhang, Xiang, Zhao, Yue, Wang, Yilin, and Tu, Zhuowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We present Bayesian Diffusion Models (BDM), a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We show the effectiveness of BDM on the 3D shape reconstruction task. Compared to prototypical deep learning data-driven approaches trained on paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM brings in rich prior information from standalone labels (e.g. point clouds) to improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian frameworks where explicit prior and likelihood are required for the inference, BDM performs seamless information fusion via coupled diffusion processes with learned gradient computation networks. The specialty of our BDM lies in its capability to engage the active and effective information exchange and fusion of the top-down and bottom-up processes where each itself is a diffusion process. We demonstrate state-of-the-art results on both synthetic and real-world benchmarks for 3D shape reconstruction.
Comment: Accepted to CVPR 2024; Project Page: https://mlpc-ucsd.github.io/BDM/
Published: 2024

11. Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

Author: Liu, Haowei, Shi, Yaya, Xu, Haiyang, Yuan, Chunfeng, Ye, Qinghao, Li, Chenliang, Yan, Ming, Zhang, Ji, Huang, Fei, Li, Bing, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and text is not sufficiently involved in masked modeling. These two drawbacks limit the effect of MIM in facilitating cross-modal semantic alignment. In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning. Specifically, to provide more semantically meaningful supervision for MIM, we propose a local semantics enhancing approach, which harvests high-level semantics from global image features via self-supervised agreement learning and transfers them to local patch encodings by sharing the encoding space. Moreover, to achieve deep involvement of text during the entire MIM process, we propose a text-guided masking strategy and devise an efficient way of injecting textual information in both masked modeling and reconstruction target acquisition. Experimental results validate that our method improves the effectiveness of the MIM task in facilitating cross-modal semantic alignment. Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks.
Comment: Accepted to LREC-COLING 2024
Published: 2024

12. Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

Author: Liu, Haowei, Shi, Yaya, Xu, Haiyang, Yuan, Chunfeng, Ye, Qinghao, Li, Chenliang, Yan, Ming, Zhang, Ji, Huang, Fei, Li, Bing, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval. Specifically, we map videos and texts into a pre-defined lexicon space, where each dimension corresponds to a semantic concept. A two-stage semantics grounding approach is proposed to activate semantically relevant dimensions and suppress irrelevant dimensions. The learned lexicon representations can thus reflect fine-grained semantics of videos and texts. Furthermore, to leverage the complementarity between latent and lexicon representations, we propose a unified learning scheme to facilitate mutual learning via structure sharing and self-distillation. Experimental results show our UNIFY framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
Comment: Accepted to LREC-COLING 2024
Published: 2024

13. Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Author: Jiang, Chaoya, Ye, Wei, Dong, Mengfan, Jia, Hongrui, Xu, Haiyang, Yan, Ming, Zhang, Ji, and Zhang, Shikun
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large Vision Language Models (LVLMs) exhibit remarkable capabilities but struggle with hallucinations: inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data consisting of various types of hallucinations, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations. We will release our code and data.
Published: 2024

14. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Author: Wang, Junyang, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Shen, Weizhou, Zhang, Ji, Huang, Fei, and Sang, Jitao
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Mobile device agents based on Multimodal Large Language Models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates the mobile apps through operations step by step. Different from previous solutions that rely on XML files of apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduce Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conduct a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieves remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.
Comment: Accepted by ICLR 2024 Workshop in Large Language Model (LLM) Agents
Published: 2024

15. Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Author: Ye, Wei, Jiang, Chaoya, Xu, Haiyang, Ye, Chenhao, Li, Chenliang, Yan, Ming, Zhang, Shikun, Huang, Songhang, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational inefficiencies caused by lengthy visual sequences. To address this challenge, we introduce an efficient VLP approach called TRIPS, which stands for Text-Relevant Image Patch Selection. TRIPS progressively reduces the visual sequence using a text-guided patch-selection layer in the visual backbone, thereby accelerating both training and inference processes. This patch-selection layer dynamically computes text-dependent visual attention, enabling it to identify attentive image tokens with text guidance and fuse inattentive ones in an end-to-end fashion. Importantly, TRIPS does not add any extra parameters and generalizes to most ViT-based VLP models. We incorporate TRIPS into three representative VLP models covering single-stream, dual-stream, and generative paradigms, and conduct extensive experiments on five widely-used multi-modal benchmark datasets. Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
Published: 2024

16. Estimation of value-at-risk by Lp quantile regression

Author: Sun, Peng, Lin, Fuming, Xu, Haiyang, and Yu, Kaizhi
Published: 2024

17. TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

Author: Jiang, Chaoya, Ye, Wei, Xu, Haiyang, Ye, Qinghao, Yan, Ming, Zhang, Ji, and Zhang, Shikun
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noise in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix exhibits comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios.
Comment: Accepted at AAAI 2024
Published: 2023

18. Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Author: Jiang, Chaoya, Xu, Haiyang, Dong, Mengfan, Chen, Jiaxing, Ye, Wei, Yan, Ming, Ye, Qinghao, Zhang, Ji, Huang, Fei, and Zhang, Shikun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucination as hard negative examples, naturally bringing representations of non-hallucinative text and visual samples closer while pushing away representations of non-hallucinative and hallucinative text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the baseline MiniGPT-4/LLaVA. Our code is available at https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
Published: 2023

19. mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Author: Hu, Anwen, Shi, Yaya, Xu, Haiyang, Ye, Jiabo, Ye, Qinghao, Yan, Ming, Li, Chenliang, Qian, Qi, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Multimedia, Computer Science - Computation and Language
Abstract: Recently, the strong text creation ability of Large Language Models (LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX code. Besides, to better align the copilot with the user's intention, we introduce the 'outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset yields stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.
Comment: 20 pages, 12 figures
Published: 2023

20. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Author: Wang, Junyang, Wang, Yuhang, Xu, Guohai, Zhang, Jing, Gu, Yukai, Jia, Haitao, Wang, Jiaqi, Xu, Haiyang, Yan, Ming, Zhang, Ji, and Sang, Jitao
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite making significant progress in multi-modal tasks, current Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations, which may lead to harmful consequences. Therefore, evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. Previous works are limited in high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of tasks and hallucinations). In this paper, we propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task including existence, attribute and relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs including GPT-4V(ision), and also give guideline suggestions for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER.
Comment: 14 pages, 9 figures
Published: 2023

21. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Author: Ye, Qinghao, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Hu, Anwen, Liu, Haowei, Qian, Qi, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
Published: 2023

22. UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Author: Ye, Jiabo, Hu, Anwen, Xu, Haiyang, Ye, Qinghao, Yan, Ming, Xu, Guohai, Li, Chenliang, Tian, Junfeng, Qian, Qi, Zhang, Ji, Jin, Qin, He, Liang, Lin, Xin Alex, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than that of previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder architecture of MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Codes and instruction-tuning datasets will be released.
Published: 2023

23. ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models

Author: Li, Chenliang, Chen, Hehong, Yan, Ming, Shen, Weizhou, Xu, Haiyang, Wu, Zhikai, Zhang, Zhicheng, Zhou, Wenmeng, Chen, Yingda, Cheng, Chen, Shi, Hongzhu, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent frameworks that equip LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs. In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, a comprehensive framework has been proposed spanning tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant of ModelScope Community based on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent library (https://github.com/modelscope/modelscope-agent) and online demo (https://modelscope.cn/studios/damo/ModelScopeGPT/summary) are now publicly available.
Published: 2023

24. Evaluation and Analysis of Hallucination in Large Vision-Language Models

Author: Wang, Junyang, Zhou, Yiyang, Xu, Guohai, Shi, Pengcheng, Zhao, Chenlin, Xu, Haiyang, Ye, Qinghao, Yan, Ming, Zhang, Ji, Zhu, Jihua, Sang, Jitao, and Tang, Haoyu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Large Vision-Language Models (LVLMs) have recently achieved remarkable success. However, LVLMs are still plagued by the hallucination problem, which limits the practicality in many scenarios. Hallucination refers to the information of LVLMs' responses that does not exist in the visual input, which poses potential risks of substantial consequences. There has been limited work studying hallucination evaluation in LVLMs. In this paper, we propose Hallucination Evaluation based on Large Language Models (HaELM), an LLM-based hallucination evaluation framework. HaELM achieves an approximate 95% performance comparable to ChatGPT and has additional advantages including low cost, reproducibility, privacy preservation and local deployment. Leveraging the HaELM, we evaluate the hallucination in current LVLMs. Furthermore, we analyze the factors contributing to hallucination in LVLMs and offer helpful suggestions to mitigate the hallucination problem. Our training data and human annotation hallucination data will be made public soon., Comment: 11 pages, 5 figures
Published: 2023

25. COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

Author: Jiang, Chaoya, Xu, Haiyang, Ye, Wei, Ye, Qinghao, Li, Chenliang, Yan, Ming, Bi, Bin, Zhang, Shikun, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Multimedia
Abstract: Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment but at the cost of computationally expensive inference. Recent Visual-Transformer (ViT)-based approaches circumvent this issue while struggling with long visual sequences without detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. By using off-the-shelf delicate object annotations in 5% training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection and yielding an effective patch detector that accurately detects text-relevant patches, thus considerably reducing patch sequences and accelerating computation within the ViT backbone. Our experiments on a variety of widely-used benchmarks reveal that our method achieves a speedup of nearly 88% compared to prior VLP models while maintaining competitive or superior performance on downstream tasks with similar model size and data scale., Comment: Accepted on ACM MM2023
Published: 2023

26. BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

Author: Jiang, Chaoya, Xu, Haiyang, Ye, Wei, Ye, Qinghao, Li, Chenliang, Yan, Ming, Bi, Bin, Zhang, Shikun, Huang, Fei, and Huang, Songfang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and ineffectiveness. Existing efforts address the challenge by either bottom-level patch extraction in the ViT backbone or top-level patch abstraction outside, not balancing training efficiency and effectiveness well. Inspired by text summarization in natural language processing, we propose a Bottom-Up Patch Summarization approach named BUS, coordinating bottom-level extraction and top-level abstraction to learn a concise summary of lengthy visual token sequences efficiently. Specifically, we incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) upon the backbone for top-level visual abstraction. This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness. We evaluate our approach on various visual-language understanding and generation tasks and show competitive downstream task performance while boosting the training efficiency by 50%. Additionally, our model achieves state-of-the-art performance on many downstream tasks by increasing input image resolution without increasing computational costs over baselines., Comment: Accepted on ICCV2023
Published: 2023

27. mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Author: Ye, Jiabo, Hu, Anwen, Xu, Haiyang, Ye, Qinghao, Yan, Ming, Dan, Yuhao, Zhao, Chenlin, Xu, Guohai, Li, Chenliang, Tian, Junfeng, Qi, Qian, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Document understanding refers to automatically extracting, analyzing and comprehending information from various types of digital documents, such as a web page. Existing Multi-modal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl based on mPLUG-Owl for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set LLMDoc to better compare models' capabilities on instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong ability of document understanding. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well on various downstream tasks. Our code, models, training data and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl., Comment: 10 pages, 8 figures
Published: 2023

28. Vision Transformer with Attention Map Hallucination and FFN Compaction

Author: Xu, Haiyang, Zhou, Zhichao, He, Dongliang, Li, Fu, and Wang, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision Transformer (ViT) is now dominating many vision tasks. The drawback of quadratic complexity of its token-wise multi-head self-attention (MHSA) is extensively addressed via either token sparsification or dimension reduction (in spatial or channel). However, the redundancy within MHSA is usually overlooked, and so is the feed-forward network (FFN). To this end, we propose attention map hallucination and FFN compaction to fill in the blank. Specifically, we observe similar attention maps exist in vanilla ViT and propose to hallucinate half of the attention maps from the rest with much cheaper operations, which is called hallucinated-MHSA (hMHSA). As for FFN, we factorize its hidden-to-output projection matrix and leverage the re-parameterization technique to strengthen its capability, making it compact-FFN (cFFN). With our proposed modules, a 10%-20% reduction of floating point operations (FLOPs) and parameters (Params) is achieved for various ViT-based backbones, including straight (DeiT), hybrid (NextViT) and hierarchical (PVT) structures, while the performance remains quite competitive.
Published: 2023

29. Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

Author: Xu, Haiyang, Ye, Qinghao, Wu, Xuan, Yan, Ming, Miao, Yuan, Ye, Jiabo, Xu, Guohai, Hu, Anwen, Shi, Yaya, Xu, Guangwei, Li, Chenliang, Qian, Qi, Que, Maofei, Zhang, Ji, Zeng, Xiao, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Model (LLM) in the Chinese community, we firstly release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training. In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release popular video-language pre-training models, ALPRO and mPLUG-2, and our proposed modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. Besides, mPLUG-video achieves a new state-of-the-art result on these benchmarks with 80.5% top-1 accuracy in video category classification and 68.9 CIDEr score in video captioning, respectively. Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7% trainable parameters as Chinese multimodal LLM, and demonstrate impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pretraining with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge., Comment: Working in progress
Published: 2023

30. Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning

Author: Zhang, Zhen-Ru, Tan, Chuanqi, Xu, Haiyang, Wang, Chengyu, Huang, Jun, and Huang, Songfang
Subjects: Computer Science - Computation and Language
Abstract: Fine-tuning large pre-trained language models on various downstream tasks with whole parameters is prohibitively expensive. Hence, Parameter-efficient fine-tuning has attracted attention that only optimizes a few task-specific parameters with the frozen pre-trained model. In this work, we focus on prefix tuning, which only optimizes continuous prefix vectors (i.e. pseudo tokens) inserted into Transformer layers. Based on the observation that the learned syntax and semantics representation varies a lot at different layers, we argue that the adaptive prefix will be further tailored to each layer than the fixed one, enabling the fine-tuning more effective and efficient. Thus, we propose Adaptive Prefix Tuning (APT) to adjust the prefix in terms of both fine-grained token level and coarse-grained layer level with a gate mechanism. Experiments on the SuperGLUE and NER datasets show the effectiveness of APT. In addition, taking the gate as a probing, we validate the efficiency and effectiveness of the variable prefix., Comment: Accepted to ACL 2023 (Main conference)
Published: 2023

31. Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

Author: Jiang, Chaoya, Ye, Wei, Xu, Haiyang, yan, Miang, Zhang, Shikun, Zhang, Jie, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is common sense that InfoNCE loss used in contrastive learning will maximize the lower bound of MI between anchors and their positives, while we theoretically prove that MI involving negatives also matters when noises commonly exist. Guided by a more general lower bound form for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance., Comment: Accepted by ACL2023
Published: 2023

32. Transforming Visual Scene Graphs to Image Captions

Author: Yang, Xu, Peng, Jiawei, Wang, Zihua, Xu, Haiyang, Ye, Qinghao, Li, Chenliang, Huang, Songfang, Huang, Fei, Li, Zhangzikang, and Zhang, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose to Transform Scene Graphs (TSG) into more descriptive captions. In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs. After embedding, different graph embeddings contain diverse specific knowledge for generating the words with different part-of-speech, e.g., object/attribute embedding is good for generating nouns/adjectives. Motivated by this, we design a Mixture-of-Expert (MOE)-based decoder, where each expert is built on MHA, for discriminating the graph embeddings to generate different kinds of words. Since both the encoder and decoder are built based on the MHA, as a result, we construct a homogeneous encoder-decoder unlike the previous heterogeneous ones which usually apply Fully-Connected-based GNN and LSTM-based decoder. The homogeneous architecture enables us to unify the training configuration of the whole model instead of specifying different training strategies for diverse sub-networks as in the heterogeneous pipeline, which releases the training difficulty. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of our TSG. The code is in: https://github.com/GaryJiajia/TSG., Comment: 12 pages, 4 figures, has been accepted by ACL 2023 main conference
Published: 2023

33. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Author: Ye, Qinghao, Xu, Haiyang, Xu, Guohai, Ye, Jiabo, Yan, Ming, Zhou, Yiyang, Wang, Junyang, Hu, Anwen, Shi, Pengcheng, Shi, Yaya, Li, Chenliang, Xu, Yuanhong, Chen, Hehong, Tian, Junfeng, Qian, Qi, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl., Comment: Working in Process
Published: 2023

34. ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human

Author: Tian, Junfeng, Chen, Hehong, Xu, Guohai, Yan, Ming, Gao, Xing, Zhang, Jianhai, Li, Chenliang, Liu, Jiayi, Xu, Wenshen, Xu, Haiyang, Qian, Qi, Wang, Wei, Ye, Qinghao, Zhang, Jiejing, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we present ChatPLUG, a Chinese open-domain dialogue system for digital human applications that instruction finetunes on a wide range of dialogue tasks in a unified internet-augmented format. Different from other open-domain dialogue models that focus on large-scale pre-training and scaling up model size or dialogue corpus, we aim to build a powerful and practical dialogue system for digital human with diverse skills and good multi-task generalization by internet-augmented instruction tuning. To this end, we first conduct large-scale pre-training on both common document corpus and dialogue data with curriculum learning, so as to inject various world knowledge and dialogue abilities into ChatPLUG. Then, we collect a wide range of dialogue tasks spanning diverse features of knowledge, personality, multi-turn memory, and empathy, on which we further instruction tune ChatPLUG via unified natural language instruction templates. External knowledge from an internet search is also used during instruction finetuning for alleviating the problem of knowledge hallucinations. We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems on both automatic and human evaluation, and demonstrates strong multi-task generalization on a variety of text understanding and generation tasks. In addition, we deploy ChatPLUG to real-world applications such as Smart Speaker and Instant Message applications with fast inference. Our models and code will be made publicly available on ModelScope: https://modelscope.cn/models/damo/ChatPLUG-3.7B and Github: https://github.com/X-PLUG/ChatPLUG., Comment: 36 pages
Published: 2023

35. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Author: Xu, Haiyang, Ye, Qinghao, Yan, Ming, Shi, Yaya, Ye, Jiabo, Xu, Yuanhong, Li, Chenliang, Bi, Bin, Qian, Qi, Wang, Wei, Xu, Guohai, Zhang, Ji, Huang, Songfang, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released in https://github.com/alibaba/AliceMind.
Published: 2023

36. Adaptively Clustering Neighbor Elements for Image-Text Generation

Author: Wang, Zihua, Yang, Xu, Zhang, Hanwang, Xu, Haiyang, Yan, Ming, Huang, Fei, and Zhang, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a novel Transformer-based image-to-text generation model termed ACF that adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments for better visual-text coherence. To achieve this, we design a novel self-attention layer that applies self-attention over the elements in a local cluster window instead of the whole sequence. The window size is softly decided by a clustering matrix that is calculated by the current input data and thus this process is adaptive. By stacking these revised self-attention layers to construct ACF, the small clusters in the lower layers can be grouped into a bigger cluster, e.g., vision/language. ACF clusters small objects/phrases into bigger ones. In this gradual clustering process, a parsing tree is generated which embeds the hierarchical knowledge of the input sequence. As a result, by using ACF to build the vision encoder and language decoder, the hierarchical object-phrase alignments are embedded and then transferred from vision to language domains in two popular image-to-text tasks: Image captioning and Visual Question Answering. The experiment results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models and achieves comparable scores compared with some large-scale pre-trained models. Our code is available at https://github.com/ZihuaEvan/ACFModel/., Comment: Compared to v1 and v2, we expanded this method to VQA. And it proved that our method can be applied on more general image-text generation tasks
Published: 2023

37. Learning Trajectory-Word Alignments for Video-Language Tasks

Author: Yang, Xu, Li, Zhangzikang, Xu, Haiyang, Zhang, Hanwang, Ye, Qinghao, Li, Chenliang, Yan, Ming, Zhang, Yu, Huang, Fei, and Huang, Songfang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In a video, an object usually appears as the trajectory, i.e., it spans over a few spatial but longer temporal patches, that contains abundant spatiotemporal contexts. However, modern Video-Language BERTs (VDL-BERTs) neglect this trajectory characteristic that they usually follow image-language BERTs (IL-BERTs) to deploy the patch-to-word (P2W) attention that may over-exploit trivial spatial contexts and neglect significant temporal contexts. To amend this, we propose a novel TW-BERT to learn Trajectory-Word alignment by a newly designed trajectory-to-word (T2W) attention for solving video-language tasks. Moreover, previous VDL-BERTs usually uniformly sample a few frames into the model while different trajectories have diverse graininess, i.e., some trajectories span longer frames and some span shorter, and using a few frames will lose certain useful temporal contexts. However, simply sampling more frames will also make pre-training infeasible due to the largely increased training burdens. To alleviate the problem, during the fine-tuning stage, we insert a novel Hierarchical Frame-Selector (HFS) module into the video encoder. HFS gradually selects the suitable frames conditioned on the text context for the later cross-modal encoder to learn better trajectory-word alignments. By the proposed T2W attention and HFS, our TW-BERT achieves SOTA performances on text-to-video retrieval tasks, and comparable performances on video question-answering tasks with some VDL-BERTs trained on much more data. The code will be available in the supplementary material.
Published: 2023

38. Quantitative analysis of cervical vertebral maturation in Chinese adolescents based on three-dimensional morphology of cervical vertebrae

Author: WU Yue, TANG Wen, ZHANG Yuyanran, YUAN Weiyu, PAN Yifei, CHEN Xinyu, XU Haiyang, LYU Yunfan, IZADIKHAH Iman, CAO Dan, XIE Lizhe, YAN Bin
Subjects: skeletal maturation, cervical vertebrae maturation index, three-dimensional morphology, cone-beam computed tomography, Dentistry, RK1-715, Other systems of medicine, RZ201-999
Abstract: Objective: To investigate associations between three-dimensional (3D) morphology of cervical vertebrae and skeletal maturation by cone-beam computed tomography (CBCT) and establish corresponding regression models for quantitatively evaluating cervical vertebral maturation (CVM). Methods: The analyzed sample consisted of 358 CBCT images (175 male, 183 female), of which 277 images were randomly selected as the model development group and 81 as the performance test group. Twenty-one 3D morphological parameters were defined and measured, incorporating all parts of the cervical vertebrae, including the cervical vertebral bodies, transverse processes, spinous processes, pedicles, lamina, and articular processes. The cervical vertebral maturation index (CVMI) was determined by experienced orthodontists as reference standard. Spearman’s rank correlation coefficient and multivariable stepwise regression analysis were used to identify the associations and build regression models. The performance test group was employed to examine each model’s reliability. Paired-samples Wilcoxon signed-rank test compared the CVMI of the model prediction with the reference standard. Results: Three-dimensional morphological changes in various parts of the cervical vertebrae correlated with CVMI (P0.05). Conclusion: New associations were found between 3D morphology of cervical vertebrae and skeletal maturation. The 3D-driven morphometric CVM assessment method and corresponding regression models exhibited good credibility and high consistency with experts.
Published: 2024
Full Text: View/download PDF

39. HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Author: Ye, Qinghao, Xu, Guohai, Yan, Ming, Xu, Haiyang, Qian, Qi, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., its temporal nature. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representation. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole in different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
Published: 2022

40. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Author: Li, Chenliang, Xu, Haiyang, Tian, Junfeng, Wang, Wei, Yan, Ming, Bi, Bin, Ye, Jiabo, Chen, Hehong, Xu, Guohai, Cao, Zheng, Zhang, Ji, Huang, Songfang, Huang, Fei, Zhou, Jingren, and Si, Luo
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which creates inter-layer shortcuts that skip a certain number of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.
Published: 2022

41. Image Captioning In the Transformer Age

Author: Xu, Yang, Li, Li, Xu, Haiyang, Huang, Songfang, Huang, Fei, and Cai, Jianfei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNN and RNN do not share the basic network component, such a heterogeneous pipeline is hard to train end-to-end where the visual encoder will not learn anything from the caption supervision. This drawback inspires the researchers to develop a homogeneous architecture that facilitates end-to-end training, for which Transformer is the perfect one that has proven its huge potential in both vision and language domains and thus can be used as the basic component of the visual encoder and language decoder in an IC pipeline. Meanwhile, self-supervised learning releases the power of the Transformer architecture that a pre-trained large-scale one can be generalized to various tasks including IC. The success of these large-scale models seems to weaken the importance of the single IC task. However, we demonstrate that IC still has its specific significance in this age by analyzing the connections between IC and some popular self-supervised learning paradigms. Due to the page limit, we only refer to highly important papers in this short survey and more related works can be found at https://github.com/SjokerLily/awesome-image-captioning., Comment: 8 pages, 2 figures
Published: 2022

42. In-sensor Computing Based on Two-terminal Optoelectronic Memristors

Author: Lin, Ya, Wang, Zhongqiang, Zhao, Xiaoning, Xu, Haiyang, and Liu, Yichun
Published: 2023
Full Text: View/download PDF

43. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Author: Shi, Yaya, Yang, Xu, Xu, Haiyang, Yuan, Chunfeng, Li, Bing, Hu, Weiming, and Zha, Zheng-Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions. However, they have some insuperable drawbacks, e.g., they cannot handle videos without references, and they may result in biased evaluation due to the one-to-many nature of video-to-text and the neglect of visual relevance. From the human evaluator's viewpoint, a high-quality caption should be consistent with the provided video, but not necessarily be similar to the reference literally or semantically. Inspired by human evaluation, we propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning, which directly measures similarity between video and candidate captions. Benefiting from the recent development of large-scale pre-training models, we exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore. Specifically, EMScore combines matching scores of both coarse-grained (video and caption) and fine-grained (frames and words) levels, which takes the overall understanding and detailed characteristics of the video into account. Furthermore, considering the potential information gain, EMScore can be flexibly extended to the conditions where human-labeled references are available. Last but not least, we collect VATEX-EVAL and ActivityNet-FOIL datasets to systematically evaluate the existing metrics. VATEX-EVAL experiments demonstrate that EMScore has higher human correlation and lower reference dependency. The ActivityNet-FOIL experiment verifies that EMScore can effectively identify "hallucinating" captions. The datasets will be released to facilitate the development of video captioning metrics. The code is available at: https://github.com/ShiYaya/emscore., Comment: CVPR 2022
Published: 2021

44. Achieving Human Parity on Visual Question Answering

Author: Yan, Ming, Xu, Haiyang, Li, Chenliang, Tian, Junfeng, Bi, Bin, Wang, Wei, Chen, Weihua, Xu, Xianzhe, Wang, Fan, Cao, Zheng, Zhang, Zhicheng, Zhang, Qiyu, Zhang, Ji, Huang, Songfang, Huang, Fei, Si, Luo, and Jin, Rong
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image. It has been a popular research topic with an increasing number of real-world applications in the last decade. This paper describes our recent research on AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding) that obtains similar or even slightly better results than humans do on VQA. This is achieved by systematically improving the VQA pipeline including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with corresponding expertise needed plays an important role in boosting the performance of our VQA architecture up to the human level. An extensive set of experiments and analyses is conducted to demonstrate the effectiveness of the new research work.
Published: 2021

45. Interfacial engineering of the nickel/zinc oxides p-n heterojunction for promoting photo-assisted oxygen evolution reaction

Author: Wei, Shengjie, Xu, Haiyang, Sun, Dingcheng, Lin, Shan, Ji, Xu, Yang, Yue, and Zhang, Le
Published: 2024
Full Text: View/download PDF

46. Development and CO2 capture of nitrogen-enriched microporous carbon by coupling waste polyamides with lignocellulosic biomass

Author: Zhou, Shaojie, Ding, Shaoqiu, Xu, Haiyang, Zhu, Lingjun, and Wang, Shurong
Published: 2024
Full Text: View/download PDF

47. Tri-signal CdS@SiO2 nanoprobes for accurate and sensitive detection of human immunoglobulin G with enhanced flexibility and internal validation

Author: Wen, Xiaokun, Li, Hongyi, Chen, Hong, Wang, Kexin, Ding, Yadan, Wang, Guorui, Xu, Haiyang, and Hong, Xia
Published: 2024
Full Text: View/download PDF

48. Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training

Author: Yan, Ming, Xu, Haiyang, Li, Chenliang, Bi, Bin, Tian, Junfeng, Gui, Min, and Wang, Wei
Subjects: Computer Science - Multimedia, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing approaches to vision-language pre-training (VLP) heavily rely on an object detector based on bounding boxes (regions), where salient objects are first detected from images and then a Transformer-based model is used for cross-modal fusion. Despite their superior performance, these approaches are bounded by the capability of the object detector in terms of both effectiveness and efficiency. Besides, the presence of object detection imposes unnecessary constraints on model designs and makes it difficult to support end-to-end training. In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps. We propose a simple yet effective grid-based VLP method that works surprisingly well with the grid features. By pre-training only with in-domain datasets, the proposed Grid-VLP method can outperform most competitive region-based VLP methods on three examined vision-language understanding tasks. We hope that our findings help to further advance the state of the art of vision-language pre-training, and provide a new direction towards effective and efficient VLP.
Published: 2021

49. Solar blind avalanche photodetector based on a n-β-Ga2O3/n-Si heterojunction via an introduction of AlN buffer layer for interface lattice and band Engineering

Author: Gao, Chong, Wang, Yuefei, Fu, Shihao, Song, Youheng, Han, Yurui, Fu, Rongpeng, Wu, Zhe, Cui, Weizhe, Ma, Jiangang, Li, Bingsheng, Xu, Haiyang, Shen, Aidong, and Liu, Yichun
Published: 2024
Full Text: View/download PDF

50. Ferromagnetism with above-room-temperature Curie temperature in Fe-doped β-Ga2O3 studied by first-principles calculations

Author: Xia, Danyang, Fu, Rongpeng, Wang, Yuefei, Li, Bingsheng, Ma, Jiangang, Xu, Haiyang, Shen, Aidong, and Liu, Yichun
Published: 2024
Full Text: View/download PDF
