Author: "Yan, Ming" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yan, Ming"' showing total 189 results

Start Over Author "Yan, Ming" Publication Type Reports

189 results on '"Yan, Ming"'

1. Decoupling Layout from Glyph in Online Chinese Handwriting Generation

Author: Ren, Min-Si, Zhang, Yan-Ming, and Chen, Yi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text plays a crucial role in the transmission of human civilization, and teaching machines to generate online handwritten text in various styles presents an interesting and significant challenge. However, most prior work has concentrated on generating individual Chinese fonts, leaving {complete text line generation largely unexplored}. In this paper, we identify that text lines can naturally be divided into two components: layout and glyphs. Based on this division, we designed a text line layout generator coupled with a diffusion-based stylized font synthesizer to address this challenge hierarchically. More concretely, the layout generator performs in-context-like learning based on the text content and the provided style references to generate positions for each glyph autoregressively. Meanwhile, the font synthesizer which consists of a character embedding dictionary, a multi-scale calligraphy style encoder, and a 1D U-Net based diffusion denoiser will generate each font on its position while imitating the calligraphy style extracted from the given style references. Qualitative and quantitative experiments on the CASIA-OLHWDB demonstrate that our method is capable of generating structurally correct and indistinguishable imitation samples.
Published: 2024

2. Trustworthy Hate Speech Detection Through Visual Augmentation

Author: Yang, Ziyuan, Yan, Ming, Chen, Yingyu, Wang, Hui, Lu, Zexin, and Zhang, Yi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The surge of hate speech on social media platforms poses a significant challenge, with hate speech detection~(HSD) becoming increasingly critical. Current HSD methods focus on enriching contextual information to enhance detection performance, but they overlook the inherent uncertainty of hate speech. We propose a novel HSD method, named trustworthy hate speech detection method through visual augmentation (TrusV-HSD), which enhances semantic information through integration with diffused visual images and mitigates uncertainty with trustworthy loss. TrusV-HSD learns semantic representations by effectively extracting trustworthy information through multi-modal connections without paired data. Our experiments on public HSD datasets demonstrate the effectiveness of TrusV-HSD, showing remarkable improvements over conventional methods.
Published: 2024

3. SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing

Author: Qian, Qi, Xu, Haiyang, Yan, Ming, and Hu, Juhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models demonstrate impressive image generation performance with text guidance. Inspired by the learning process of diffusion, existing images can be edited according to text by DDIM inversion. However, the vanilla DDIM inversion is not optimized for classifier-free guidance and the accumulated error will result in the undesired performance. While many algorithms are developed to improve the framework of DDIM inversion for editing, in this work, we investigate the approximation error in DDIM inversion and propose to disentangle the guidance scale for the source and target branches to reduce the error while keeping the original framework. Moreover, a better guidance scale (i.e., 0.5) than default settings can be derived theoretically. Experiments on PIE-Bench show that our proposal can improve the performance of DDIM inversion dramatically without sacrificing efficiency.
Published: 2024

4. mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Author: Hu, Anwen, Xu, Haiyang, Zhang, Liang, Ye, Jiabo, Yan, Ming, Zhang, Ji, Jin, Qin, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodel Large Language Models(MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop the DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%, demonstrating advanced capabilities in multi-page questioning answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2., Comment: 15 pages, 7 figures
Published: 2024

5. MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

Author: Jiang, Chaoya, Hongrui, Jia, Xu, Haiyang, Ye, Wei, Dong, Mengfan, Yan, Ming, Zhang, Ji, Huang, Fei, and Zhang, Shikun
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism by for long-sequence continuous features to enhance multi-image processing efficiency. Experimental results demonstrate that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
Published: 2024

6. ProFuser: Progressive Fusion of Large Language Models

Author: Shi, Tianyuan, Wan, Fanqi, Huang, Canbin, Quan, Xiaojun, Li, Chenliang, Yan, Ming, and Zhang, Ji
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: While fusing the capacities and advantages of various large language models (LLMs) offers a pathway to construct more powerful and versatile models, a fundamental challenge is to properly select advantageous model during the training. Existing fusion methods primarily focus on the training mode that uses cross entropy on ground truth in a teacher-forcing setup to measure a model's advantage, which may provide limited insight towards model advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser to progressively transition from inference mode to training mode. To validate ProFuser's effectiveness, we fused three models, including vicuna-7b-v1.5, Llama-2-7b-chat, and mpt-7b-8k-chat, and demonstrated the improved performance in knowledge, reasoning, and safety compared to baseline methods.
Published: 2024

7. mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Author: Ye, Jiabo, Xu, Haiyang, Liu, Haowei, Hu, Anwen, Yan, Ming, Qian, Qi, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
Published: 2024

8. MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Author: Liu, Haowei, Zhang, Xi, Xu, Haiyang, Shi, Yaya, Jiang, Chaoya, Yan, Ming, Zhang, Ji, Huang, Fei, Yuan, Chunfeng, Li, Bing, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images remain underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source MLLMs and close-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench., Comment: 10 pages, 4 figures
Published: 2024

9. Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models

Author: Xu, Xuenan, Zhang, Pingyue, Yan, Ming, Zhang, Ji, and Wu, Mengyue
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Zero-shot audio classification aims to recognize and classify a sound class that the model has never seen during training. This paper presents a novel approach for zero-shot audio classification using automatically generated sound attribute descriptions. We propose a list of sound attributes and leverage large language model's domain knowledge to generate detailed attribute descriptions for each class. In contrast to previous works that primarily relied on class labels or simple descriptions, our method focuses on multi-dimensional innate auditory attributes, capturing different characteristics of sound classes. Additionally, we incorporate a contrastive learning approach to enhance zero-shot learning from textual labels. We validate the effectiveness of our method on VGGSound and AudioSet\footnote{The code is available at \url{https://www.github.com/wsntxxn/AttrEnhZsAc}.}. Our results demonstrate a substantial improvement in zero-shot classification accuracy. Ablation results show robust performance enhancement, regardless of the model architecture., Comment: Interspeech 2024
Published: 2024

10. DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Author: Li, Baihan, Xie, Zeyu, Xu, Xuenan, Guo, Yiwei, Yan, Ming, Zhang, Ji, Yu, Kai, and Wu, Mengyue
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be utilized to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a textaudio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase of diversity with the help of the guidance of visual information.
Published: 2024

11. Modeling Comparative Logical Relation with Contrastive Learning for Text Generation

Author: Dan, Yuhao, Tian, Junfeng, Zhou, Jie, Yan, Ming, Zhang, Ji, Chen, Qin, and He, Liang
Subjects: Computer Science - Computation and Language
Abstract: Data-to-Text Generation (D2T), a classic natural language generation problem, aims at producing fluent descriptions for structured input data, such as a table. Existing D2T works mainly focus on describing the superficial associative relations among entities, while ignoring the deep comparative logical relations, such as A is better than B in a certain aspect with a corresponding opinion, which is quite common in our daily life. In this paper, we introduce a new D2T task named comparative logical relation generation (CLRG). Additionally, we propose a Comparative Logic (CoLo) based text generation method, which generates texts following specific comparative logical relations with contrastive learning. Specifically, we first construct various positive and negative samples by fine-grained perturbations in entities, aspects and opinions. Then, we perform contrastive learning in the encoder layer to have a better understanding of the comparative logical relations, and integrate it in the decoder layer to guide the model to correctly generate the relations. Noting the data scarcity problem, we construct a Chinese Comparative Logical Relation Dataset (CLRD), which is a high-quality human-annotated dataset and challenging for text generation with descriptions of multiple entities and annotations on their comparative logical relations. Extensive experiments show that our method achieves impressive performance in both automatic and human evaluations., Comment: NLPCC 2024
Published: 2024

12. Text-like Encoding of Collaborative Information in Large Language Models for Recommendation

Author: Zhang, Yang, Bao, Keqin, Yan, Ming, Wang, Wenjie, Feng, Fuli, and He, Xiangnan
Subjects: Computer Science - Information Retrieval, H.3.3
Abstract: When adapting Large Language Models for Recommendation (LLMRec), it is crucial to integrate collaborative information. Existing methods achieve this by learning collaborative embeddings in LLMs' latent space from scratch or by mapping from external models. However, they fail to represent the information in a text-like format, which may not align optimally with LLMs. To bridge this gap, we introduce BinLLM, a novel LLMRec method that seamlessly integrates collaborative information through text-like encoding. BinLLM converts collaborative embeddings from external models into binary sequences -- a specific text format that LLMs can understand and operate on directly, facilitating the direct usage of collaborative information in text-like format by LLMs. Additionally, BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths. Extensive experiments validate that BinLLM introduces collaborative information in a manner better aligned with LLMs, resulting in enhanced performance. We release our code at https://github.com/zyang1580/BinLLM., Comment: Accepted by ACL 2024
Published: 2024

13. Initialization-enhanced Physics-Informed Neural Network with Domain Decomposition (IDPINN)

Author: Si, Chenhao and Yan, Ming
Subjects: Computer Science - Machine Learning
Abstract: We propose a new physics-informed neural network framework, IDPINN, based on the enhancement of initialization and domain decomposition to improve prediction accuracy. We train a PINN using a small dataset to obtain an initial network structure, including the weighted matrix and bias, which initializes the PINN for each subdomain. Moreover, we leverage the smoothness condition on the interface to enhance the prediction performance. We numerically evaluated it on several forward problems and demonstrated the benefits of IDPINN in terms of accuracy., Comment: 20 pages, 14 figures
Published: 2024

14. Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Author: Wang, Junyang, Xu, Haiyang, Jia, Haitao, Zhang, Xi, Yan, Ming, Shen, Weizhou, Zhang, Ji, Huang, Fei, and Sang, Jitao
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent., Comment: 22 pages, 11 figures, 10 Tables
Published: 2024

15. Time-Varying Graph Signal Recovery Using High-Order Smoothness and Adaptive Low-rankness

Author: Guo, Weihong, Lou, Yifei, Qin, Jing, and Yan, Ming
Subjects: Electrical Engineering and Systems Science - Signal Processing, Mathematics - Numerical Analysis, Mathematics - Optimization and Control
Abstract: Time-varying graph signal recovery has been widely used in many applications, including climate change, environmental hazard monitoring, and epidemic studies. It is crucial to choose appropriate regularizations to describe the characteristics of the underlying signals, such as the smoothness of the signal over the graph domain and the low-rank structure of the spatial-temporal signal modeled in a matrix form. As one of the most popular options, the graph Laplacian is commonly adopted in designing graph regularizations for reconstructing signals defined on a graph from partially observed data. In this work, we propose a time-varying graph signal recovery method based on the high-order Sobolev smoothness and an error-function weighted nuclear norm regularization to enforce the low-rankness. Two efficient algorithms based on the alternating direction method of multipliers and iterative reweighting are proposed, and convergence of one algorithm is shown in detail. We conduct various numerical experiments on synthetic and real-world data sets to demonstrate the proposed method's effectiveness compared to the state-of-the-art in graph signal recovery.
Published: 2024

16. TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Author: Zhang, Liang, Hu, Anwen, Xu, Haiyang, Yan, Ming, Xu, Yichen, Jin, Qin, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLM with up to 13B parameters such as ChartLlama and ChartAst, and close-sourced general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart., Comment: 13 pages, 11 figures
Published: 2024

17. Understanding Emotional Hijacking in Metaverse

Author: Asif, Syed Ali, Gable, Philip, Shen, Chien-Chung, and Chiou, Yan-Ming
Subjects: Computer Science - Human-Computer Interaction
Abstract: Emotions are an integral part of being human, and experiencing a range of emotions is what makes life rich and vibrant. From basic emotions like anger, fear, happiness, and sadness to more complex ones like excitement and grief, emotions help us express ourselves and connect with the world around us. In recent years, researchers have begun adopting virtual reality (VR) technology to evoke emotions as realistically as possible and quantify the strength of emotions from the electroencephalogram (EEG) signals measured from the brain to understand human emotions in realistic situations better. This is achieved by creating a sense of presence in the virtual environment, the feeling that the user is there. For instance, [6] studied the excitement of a rollercoaster ride in VR, and [5] studied the fear of navigating in a VR cave., Comment: This is an accepted position statement of CHI 2024 Workshop (Novel Approaches for Understanding and Mitigating Emerging New Harms in Immersive and Embodied Virtual Spaces: A Workshop at CHI 2024)
Published: 2024

18. Protecting Human Users Against Cognitive Attacks in Immersive Environments

Author: Chiou, Yan-Ming, Price, Bob, Shen, Chien-Chung, and Asif, Syed Ali
Subjects: Computer Science - Human-Computer Interaction
Abstract: Integrating mixed reality (MR) with artificial intelligence (AI) technologies, including vision, language, audio, reasoning, and planning, enables the AI-powered MR assistant [1] to substantially elevate human efficiency. This enhancement comes from situational awareness, quick access to essential information, and support in learning new skills in the right context throughout everyday tasks. This blend transforms interactions with both the virtual and physical environments, catering to a range of skill levels and personal preferences. For instance, computer vision enables the understanding of the user's environment, allowing for the provision of timely and relevant digital overlays in MR systems. At the same time, language models enhance comprehension of contextual information and support voice-activated dialogue to answer user questions. However, as AI-driven MR systems advance, they also unveil new vulnerabilities, posing a threat to user safety by potentially exposing them to grave dangers [5, 6]., Comment: This is an accepted position statement of CHI 2024 Workshop (Novel Approaches for Understanding and Mitigating Emerging New Harms in Immersive and Embodied Virtual Spaces: A Workshop at CHI 2024)
Published: 2024

19. Safeguarding People's Financial Health in Metaverse with Emotionally Intelligent Virtual Buddy

Author: Asif, Syed Ali, Cao, Emma, Chen, Hang, Shen, Chien-Chung, and Chiou, Yan-Ming
Subjects: Computer Science - Human-Computer Interaction
Abstract: The Metaverse, an immersive virtual world, has emerged as a shared space where people engage in various activities ranging from social interactions to commerce. Cryptocurrencies [3] and Non-Fungible Tokens (NFTs) [6] play pivotal roles within this virtual realm, reshaping interactions and transactions. Cryptocurrencies, utilizing cryptographic techniques for security, enable decentralized and secure transactions, and NFTs represent ownership or proof of authenticity of unique digital assets through the blockchain technology. While NFTs and cryptocurrencies offer innovative opportunities for ownership, trading, and monetization within the metaverse, their use also introduces potential risks and negative consequences, such as financial scams and fraud, highlighting the need for users to exercise caution and diligence in their virtual transactions., Comment: This is an accepted position statement of CHI 2024 Workshop (Novel Approaches for Understanding and Mitigating Emerging New Harms in Immersive and Embodied Virtual Spaces: A Workshop at CHI 2024)
Published: 2024

20. Adaptive Feature Fusion Neural Network for Glaucoma Segmentation on Unseen Fundus Images

Author: Zhong, Jiyuan, Ke, Hu, and Yan, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Fundus image segmentation on unseen domains is challenging, especially for the over-parameterized deep models trained on the small medical datasets. To address this challenge, we propose a method named Adaptive Feature-fusion Neural Network (AFNN) for glaucoma segmentation on unseen domains, which mainly consists of three modules: domain adaptor, feature-fusion network, and self-supervised multi-task learning. Specifically, the domain adaptor helps the pretrained-model fast adapt from other image domains to the medical fundus image domain. Feature-fusion network and self-supervised multi-task learning for the encoder and decoder are introduced to improve the domain generalization ability. In addition, we also design the weighted-dice-loss to improve model performance on complex optic-cup segmentation tasks. Our proposed method achieves a competitive performance over existing fundus segmentation methods on four public glaucoma datasets., Comment: 17 pages, 11 figures
Published: 2024

21. Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based Learning

Author: Xie, Xiaopeng, Yan, Ming, Zhou, Xiwen, Zhao, Chenlong, Wang, Suli, Zhang, Yong, and Zhou, Joey Tianyi
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Cryptography and Security, 68T50, I.2.7
Abstract: Prompt-based learning paradigm has demonstrated remarkable efficacy in enhancing the adaptability of pretrained language models (PLMs), particularly in few-shot scenarios. However, this learning paradigm has been shown to be vulnerable to backdoor attacks. The current clean-label attack, employing a specific prompt as a trigger, can achieve success without the need for external triggers and ensure correct labeling of poisoned samples, which is more stealthy compared to the poisoned-label attack, but on the other hand, it faces significant issues with false activations and poses greater challenges, necessitating a higher rate of poisoning. Using conventional negative data augmentation methods, we discovered that it is challenging to trade off between effectiveness and stealthiness in a clean-label setting. In addressing this issue, we are inspired by the notion that a backdoor acts as a shortcut and posit that this shortcut stems from the contrast between the trigger and the data utilized for poisoning. In this study, we propose a method named Contrastive Shortcut Injection (CSI), by leveraging activation values, integrates trigger design and data selection strategies to craft stronger shortcut features. With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates. Notably, we found that the two approaches play leading roles in full-shot and few-shot settings, respectively., Comment: 10 pages, 6 figures, conference
Published: 2024

22. Collaborative Knowledge Infusion for Low-resource Stance Detection

Author: Yan, Ming, Zhou, Joey Tianyi, and Tsang, Ivor W.
Subjects: Computer Science - Computation and Language
Abstract: Stance detection is the view towards a specific target by a given context (\textit{e.g.} tweets, commercial reviews). Target-related knowledge is often needed to assist stance detection models in understanding the target well and making detection correctly. However, prevailing works for knowledge-infused stance detection predominantly incorporate target knowledge from a singular source that lacks knowledge verification in limited domain knowledge. The low-resource training data further increases the challenge for the data-driven large models in this task. To address those challenges, we propose a collaborative knowledge infusion approach for low-resource stance detection tasks, employing a combination of aligned knowledge enhancement and efficient parameter learning techniques. Specifically, our stance detection approach leverages target background knowledge collaboratively from different knowledge sources with the help of knowledge alignment. Additionally, we also introduce the parameter-efficient collaborative adaptor with a staged optimization algorithm, which collaboratively addresses the challenges associated with low-resource stance detection tasks from both network structure and learning perspectives. To assess the effectiveness of our method, we conduct extensive experiments on three public stance detection datasets, including low-resource and cross-target settings. The results demonstrate significant performance improvements compared to the existing stance detection approaches., Comment: 13 pages, 3 figures, Big Data Mining and Analysis
Published: 2024
Full Text: View/download PDF

23. RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method

Author: Yan, Ming, Zhang, Yan, Cai, Shuqiang, Fan, Shuqi, Lin, Xincheng, Dai, Yudi, Shen, Siqi, Wen, Chenglu, Xu, Lan, Ma, Yuexin, and Wang, Cheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Comprehensive capturing of human motions requires both accurate captures of complex poses and precise localization of the human within scenes. Most of the HPE datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, solely using these modalities or a combination of them may not be adequate for HPE, particularly for complex and fast movements. For holistic human motion understanding, we present RELI11D, a high-quality multimodal human motion dataset involves LiDAR, IMU system, RGB camera, and Event camera. It records the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours of synchronized LiDAR point clouds, IMU measurement data, RGB videos and Event steams. Through extensive experiments, we demonstrate that the RELI11D presents considerable challenges and opportunities as it contains many rapid and complex motions that require precise location. To address the challenge of integrating different modalities, we propose LEIR, a multimodal baseline that effectively utilizes LiDAR Point Cloud, Event stream, and RGB through our cross-attention fusion strategy. We show that LEIR exhibits promising results for rapid motions and daily motions and that utilizing the characteristics of multiple modalities can indeed improve HPE performance. Both the dataset and source code will be released publicly to the research community, fostering collaboration and enabling further exploration in this field., Comment: CVPR2024, Project website: http://www.lidarhumanmotion.net/reli11d/
Published: 2024

24. ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy

Author: Yang, Zonghan, Li, Peng, Yan, Ming, Zhang, Ji, Huang, Fei, and Liu, Yang
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recently, efforts have been made to train language agents for performance improvement, with multi-step reasoning and action trajectories as the training data. However, collecting such trajectories still requires considerable human effort, by either artificial annotation or implementations of diverse prompting frameworks. In this work, we propose A$^3$T, a framework that enables the Autonomous Annotation of Agent Trajectories in the style of ReAct. The central role is an ActRe prompting agent, which explains the reason for an arbitrary action. When randomly sampling an external action, the ReAct-style agent could query the ActRe agent with the action to obtain its textual rationales. Novel trajectories are then synthesized by prepending the posterior reasoning from ActRe to the sampled action. In this way, the ReAct-style agent executes multiple trajectories for the failed tasks, and selects the successful ones to supplement its failed trajectory for contrastive self-training. Realized by policy gradient methods with binarized rewards, the contrastive self-training with accumulated trajectories facilitates a closed loop for multiple rounds of language agent self-improvement. We conduct experiments using QLoRA fine-tuning with the open-sourced Mistral-7B-Instruct-v0.2. In AlfWorld, the agent trained with A$^3$T obtains a 1-shot success rate of 96%, and 100% success with 4 iterative rounds. In WebShop, the 1-shot performance of the A$^3$T agent matches human average, and 4 rounds of iterative refinement lead to the performance approaching human experts. A$^3$T agents significantly outperform existing techniques, including prompting with GPT-4, advanced agent frameworks, and fully fine-tuned LLMs.
Published: 2024

25. SocialBench: Sociality Evaluation of Role-Playing Conversational Agents

Author: Chen, Hongzhan, Chen, Hehong, Yan, Ming, Xu, Wenshen, Gao, Xing, Shen, Weizhou, Quan, Xiaojun, Li, Chenliang, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce SocialBench, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both individual and group levels of social interactions. The benchmark is constructed from a variety of sources and covers a wide range of 500 characters and over 6,000 question prompts and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents excelling in individual level does not imply their proficiency in group level. Moreover, the behavior of individuals may drift as a result of the influence exerted by other agents within the group. Experimental results on SocialBench confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/SocialBench., Comment: ACL 2024 Findings
Published: 2024

26. mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Author: Hu, Anwen, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Zhang, Liang, Zhang, Bo, Li, Chen, Zhang, Ji, Jin, Qin, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5., Comment: 21 pages, 15 figures
Published: 2024

27. Efficient sparse probability measures recovery via Bregman gradient

Author: Pan, Jianting and Yan, Ming
Subjects: Mathematics - Optimization and Control
Abstract: This paper presents an algorithm tailored for the efficient recovery of sparse probability measures incorporating $\ell_0$-sparse regularization within the probability simplex constraint. Employing the Bregman proximal gradient method, our algorithm achieves sparsity by explicitly solving underlying subproblems. We rigorously establish the convergence properties of the algorithm, showcasing its capacity to converge to a local minimum with a convergence rate of $O(1/k)$ under mild assumptions. To substantiate the efficacy of our algorithm, we conduct numerical experiments, offering a compelling demonstration of its efficiency in recovering sparse probability measures.
Published: 2024

28. Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

Author: Liu, Haowei, Shi, Yaya, Xu, Haiyang, Yuan, Chunfeng, Ye, Qinghao, Li, Chenliang, Yan, Ming, Zhang, Ji, Huang, Fei, Li, Bing, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and text is not sufficiently involved in masked modeling. These two drawbacks limit the effect of MIM in facilitating cross-modal semantic alignment. In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning. Specifically, to provide more semantically meaningful supervision for MIM, we propose a local semantics enhancing approach, which harvest high-level semantics from global image features via self-supervised agreement learning and transfer them to local patch encodings by sharing the encoding space. Moreover, to achieve deep involvement of text during the entire MIM process, we propose a text-guided masking strategy and devise an efficient way of injecting textual information in both masked modeling and reconstruction target acquisition. Experimental results validate that our method improves the effectiveness of the MIM task in facilitating cross-modal semantic alignment. Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks., Comment: Accepted to LREC-COLING 2024
Published: 2024

29. Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

Author: Liu, Haowei, Shi, Yaya, Xu, Haiyang, Yuan, Chunfeng, Ye, Qinghao, Li, Chenliang, Yan, Ming, Zhang, Ji, Huang, Fei, Li, Bing, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval. Specifically, we map videos and texts into a pre-defined lexicon space, where each dimension corresponds to a semantic concept. A two-stage semantics grounding approach is proposed to activate semantically relevant dimensions and suppress irrelevant dimensions. The learned lexicon representations can thus reflect fine-grained semantics of videos and texts. Furthermore, to leverage the complementarity between latent and lexicon representations, we propose a unified learning scheme to facilitate mutual learning via structure sharing and self-distillation. Experimental results show our UNIFY framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively., Comment: Accepted to LREC-COLING 2024
Published: 2024

30. Budget-Constrained Tool Learning with Planning

Author: Zheng, Yuanhang, Li, Peng, Yan, Ming, Zhang, Ji, Huang, Fei, and Liu, Yang
Subjects: Computer Science - Artificial Intelligence
Abstract: Despite intensive efforts devoted to tool learning, the problem of budget-constrained tool learning, which focuses on resolving user queries within a specific budget constraint, has been widely overlooked. This paper proposes a novel method for budget-constrained tool learning. Our approach involves creating a preferable plan under the budget constraint before utilizing the tools. This plan outlines the feasible tools and the maximum number of times they can be employed, offering a comprehensive overview of the tool learning process for large language models. This allows them to allocate the budget from a broader perspective. To devise the plan without incurring significant extra costs, we suggest initially estimating the usefulness of the candidate tools based on past experience. Subsequently, we employ dynamic programming to formulate the plan. Experimental results demonstrate that our method can be integrated with various tool learning methods, significantly enhancing their effectiveness under strict budget constraints., Comment: Accepted for Findings of ACL 2024
Published: 2024

31. Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Author: Jiang, Chaoya, Ye, Wei, Dong, Mengfan, Jia, Hongrui, Xu, Haiyang, Yan, Ming, Zhang, Ji, and Zhang, Shikun
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large Vision Language Models exhibit remarkable capabilities but struggle with hallucinations inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine grained hallucinatory data consisting of various types of hallucinations, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs efficacy in handling hallucinations. We will release our code and data.
Published: 2024

32. PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs

Author: Liu, An, Yang, Zonghan, Zhang, Zhenhe, Hu, Qingyuan, Li, Peng, Yan, Ming, Zhang, Ji, Huang, Fei, and Liu, Yang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: While Large language models (LLMs) have demonstrated considerable capabilities across various natural language tasks, they often fall short of the performance achieved by domain-specific state-of-the-art models. One potential approach to enhance domain-specific capabilities of LLMs involves fine-tuning them using corresponding datasets. However, this method can be both resource and time-intensive, and not applicable to closed-source commercial LLMs. In this paper, we propose Preference Adaptation for Enhancing Domain-specific Abilities of LLMs (PANDA), a method designed to augment the domain-specific capabilities of LLMs by leveraging insights from the response preference of expert models without requiring fine-tuning. Our experimental results reveal that PANDA significantly enhances the domain-specific ability of LLMs on text classification and interactive decision tasks. Moreover, LLM with PANDA even outperforms the expert model that being learned on 4 tasks of ScienceWorld. This finding highlights the potential of exploring tuning-free approaches to achieve weak-to-strong generalization., Comment: Accepted as Findings of ACL 2024
Published: 2024

33. Model Composition for Multimodal Large Language Models

Author: Chen, Chi, Du, Yiyang, Fang, Zheng, Wang, Ziyue, Luo, Fuwen, Li, Peng, Yan, Ming, Zhang, Ji, Huang, Fei, Sun, Maosong, and Liu, Yang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities. However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities. In this paper, we propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model. Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters. Furthermore, we introduce DAMC to address parameter interference and mismatch issues during the merging process, thereby enhancing the model performance. To facilitate research in this area, we propose MCUB, a benchmark for assessing ability of MLLMs to understand inputs from diverse modalities. Experiments on this benchmark and four other multimodal understanding tasks show significant improvements over baselines, proving that model composition can create a versatile model capable of processing inputs from multiple modalities., Comment: ACL2024 Main Conference; Code is available at https://github.com/THUNLP-MT/ModelCompose
Published: 2024

34. Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

Author: Wang, Ziyue, Chen, Chi, Zhu, Yiqi, Luo, Fuwen, Li, Peng, Yan, Ming, Zhang, Ji, Huang, Fei, Sun, Maosong, and Liu, Yang
Subjects: Computer Science - Computation and Language
Abstract: With the bloom of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short to comprehend context involving multiple images. A primary reason for this shortcoming is that the visual features for each images are encoded individually by frozen encoders before feeding into the LLM backbone, lacking awareness of other images and the multimodal instructions. We term this issue as prior-LLM modality isolation and propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion prior to feeding the features into LLMs. This paradigm initially "browses" through the inputs for essential insights, and then revisits the inputs to "concentrate" on crucial details, guided by these insights, to achieve a more comprehensive understanding of the multimodal inputs. Additionally, we develop training strategies specifically to enhance the understanding of multi-image inputs. Our method markedly boosts the performance on 7 multi-image scenarios, contributing to increments on average accuracy by 2.13% and 7.60% against strong MLLMs baselines with 3B and 11B LLMs, respectively., Comment: 17 pages, 5 figures
Published: 2024

35. Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

Author: Liu, Zijun, Kou, Boqun, Li, Peng, Yan, Ming, Zhang, Ji, Huang, Fei, and Liu, Yang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Despite the strong performance of large language models (LLMs) across a wide range of tasks, they still have reliability issues. Previous studies indicate that strong LLMs like GPT-4-turbo excel in evaluating the reliability of responses from LLMs, but face efficiency and local deployment issues. Thus, to enable weak LLMs to effectively assess the reliability of LLM responses, we propose a novel cross-query-comparison-based method called $\textit{Meta Ranking}$ (MR). Unlike previous few-shot methods that solely based on in-context learning capabilities in LLMs, MR assesses reliability by pairwisely ranking the target query-response pair with multiple reference query-response pairs. We found that MR is highly effective in error detection for LLM responses, where weak LLMs, such as Phi-2, could surpass strong baselines like GPT-3.5-turbo, requiring only five reference samples and significantly improving efficiency. We further demonstrate that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning. In model cascading, we combine open- and closed-source LLMs to achieve performance comparable to GPT-4-turbo with lower costs. In instruction tuning, we use MR for iterative training data filtering, significantly reducing data processing time and enabling LLaMA-7B and Phi-2 to surpass Alpaca-13B with fewer training tokens. These results underscore the high potential of MR in both efficiency and effectiveness., Comment: Preprint, under review. 28 pages
Published: 2024

36. Are Generative AI systems Capable of Supporting Information Needs of Patients?

Author: Rajagopal, Shreya, Hazarika, Subhashis, Kim, Sookyung, Chiou, Yan-ming, Sohn, Jae Ho, Subramonyam, Hari, and Mohan, Shiwali
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Patients managing a complex illness such as cancer face a complex information challenge where they not only must learn about their illness but also how to manage it. Close interaction with healthcare experts (radiologists, oncologists) can improve patient learning and thereby, their disease outcome. However, this approach is resource intensive and takes expert time away from other critical tasks. Given the recent advancements in Generative AI models aimed at improving the healthcare system, our work investigates whether and how generative visual question answering systems can responsibly support patient information needs in the context of radiology imaging data. We conducted a formative need-finding study in which participants discussed chest computed tomography (CT) scans and associated radiology reports of a fictitious close relative with a cardiothoracic radiologist. Using thematic analysis of the conversation between participants and medical experts, we identified commonly occurring themes across interactions, including clarifying medical terminology, locating the problems mentioned in the report in the scanned image, understanding disease prognosis, discussing the next diagnostic steps, and comparing treatment options. Based on these themes, we evaluated two state-of-the-art generative visual language models against the radiologist's responses. Our results reveal variability in the quality of responses generated by the models across various themes. We highlight the importance of patient-facing generative AI systems to accommodate a diverse range of conversational themes, catering to the real-world informational needs of patients.
Published: 2024

37. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Author: Wang, Junyang, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Shen, Weizhou, Zhang, Ji, Huang, Fei, and Sang, Jitao
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Mobile device agent based on Multimodal Large Language Models (MLLM) is becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step. Different from previous solutions that rely on XML files of Apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduced Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent., Comment: Accepted by ICLR 2024 Workshop in Large Language Model (LLM) Agents
Published: 2024

38. Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

Author: Shen, Weizhou, Li, Chenliang, Chen, Hongzhan, Yan, Ming, Quan, Xiaojun, Chen, Hehong, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers accurately but also excel in task planning, tool invocation, and result summarization. While traditional works focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. To overcome these challenges, we propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with others to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning., Comment: On progress, github repo: https://github.com/X-PLUG/Multi-LLM-Agent
Published: 2024

39. Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Author: Ye, Wei, Jiang, Chaoya, Xu, Haiyang, Ye, Chenhao, Li, Chenliang, Yan, Ming, Zhang, Shikun, Huang, Songhang, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational inefficiencies caused by lengthy visual sequences. To address this challenge, we introduce an efficient VLP approach called TRIPS, which stands for Text-Relevant Image Patch Selection. TRIPS progressively reduces the visual sequence using a text-guided patch-selection layer in the visual backbone, thereby accelerating both training and inference processes. This patch-selection layer dynamically computes text-dependent visual attention, enabling it to identify attentive image tokens with text guidance and fuse inattentive ones in an end-to-end fashion. Importantly, TRIPS does not add any extra parameters and generalizes to most ViT-based VLP models. We incorporate TRIPS into three representative VLP models covering single-stream, dual-stream, and generative paradigms, and conduct extensive experiments on five widely-used multi-modal benchmark datasets. Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
Published: 2024

40. LARP: Language-Agent Role Play for Open-World Games

Author: Yan, Ming, Li, Ruihao, Zhang, Hao, Wang, Hao, Yang, Zhilan, and Yan, Ji
Subjects: Computer Science - Artificial Intelligence
Abstract: Language agents have shown impressive problem-solving skills within defined settings and brief timelines. Yet, with the ever-evolving complexities of open-world simulations, there's a pressing need for agents that can flexibly adapt to complex environments and consistently maintain a long-term memory to ensure coherent actions. To bridge the gap between language agents and open-world games, we introduce Language Agent for Role-Playing (LARP), which includes a cognitive architecture that encompasses memory processing and a decision-making assistant, an environment interaction module with a feedback-driven learnable action space, and a postprocessing method that promotes the alignment of various personalities. The LARP framework refines interactions between users and agents, predefined with unique backgrounds and personalities, ultimately enhancing the gaming experience in open-world contexts. Furthermore, it highlights the diverse uses of language models in a range of areas such as entertainment, education, and various simulation scenarios. The project page is released at https://miao-ai-lab.github.io/LARP/., Comment: 12 pages, 4 figures
Published: 2023

41. TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

Author: Jiang, Chaoya, ye, Wei, Xu, Haiyang, Ye, Qinghao, Yan, Ming, Zhang, Ji, and Zhang, Shikun
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noises in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMixfrom a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios., Comment: Accepted on AAAI2024
Published: 2023

42. Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Author: Jiang, Chaoya, Xu, Haiyang, Dong, Mengfan, Chen, Jiaxing, Ye, Wei, Yan, Ming, Ye, Qinghao, Zhang, Ji, Huang, Fei, and Zhang, Shikun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyzed the representation distribution of textual and visual tokens in MLLM, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire us with a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucination as hard negative examples, naturally bringing representations of non-hallucinative text and visual samples closer while pushing way representations of non-hallucinating and hallucinative text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains a 34.66% /29.5% improvement over the baseline MiniGPT-4/LLaVA. Our code is available on https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
Published: 2023

43. mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Author: Hu, Anwen, Shi, Yaya, Xu, Haiyang, Ye, Jiabo, Ye, Qinghao, Yan, Ming, Li, Chenliang, Qian, Qi, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Multimedia, Computer Science - Computation and Language
Abstract: Recently, the strong text creation ability of Large Language Models(LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing Latex source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or Latex codes. Besides, to better align the copilot with the user's intention, we introduce the `outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Mumtimodal LLM demonstrate that training on our dataset shows stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl., Comment: 20 pages, 12 figures
Published: 2023

44. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Author: Wang, Junyang, Wang, Yuhang, Xu, Guohai, Zhang, Jing, Gu, Yukai, Jia, Haitao, Wang, Jiaqi, Xu, Haiyang, Yan, Ming, Zhang, Ji, and Sang, Jitao
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite making significant progress in multi-modal tasks, current Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations, which may lead to harmful consequences. Therefore, evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. Previous works are limited in high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of tasks and hallucinations). In this paper, we propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task including existence, attribute and relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs including GPT-4V(ision), and also give guideline suggestions for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER., Comment: 14 pages, 9 figures
Published: 2023

45. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Author: Ye, Qinghao, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Hu, Anwen, Liu, Haowei, Qian, Qi, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
Published: 2023

46. MCC-KD: Multi-CoT Consistent Knowledge Distillation

Author: Chen, Hongzhan, Wu, Siyue, Quan, Xiaojun, Wang, Rui, Yan, Ming, and Zhang, Ji
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have showcased remarkable capabilities in complex reasoning through chain of thought (CoT) prompting. Recently, there has been a growing interest in transferring these reasoning abilities from LLMs to smaller models. However, achieving both the diversity and consistency in rationales presents a challenge. In this paper, we focus on enhancing these two aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple rationales for each question and enforce consistency among the corresponding predictions by minimizing the bidirectional KL-divergence between the answer distributions. We investigate the effectiveness of MCC-KD with different model architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both mathematical reasoning and commonsense reasoning benchmarks. The empirical results not only confirm MCC-KD's superior performance on in-distribution datasets but also highlight its robust generalization ability on out-of-distribution datasets., Comment: Accepted to ENMLP 2023
Published: 2023

47. Physical Information Neural Networks for Solving High-index Differential-algebraic Equation Systems Based on Radau Methods

Author: Chen, Jiasheng, Tang, Juan, Yan, Ming, Lai, Shuai, Liang, Kun, Lu, Jianguang, and Yang, Wenqiang
Subjects: Mathematics - Numerical Analysis, Computer Science - Artificial Intelligence
Abstract: As is well known, differential algebraic equations (DAEs), which are able to describe dynamic changes and underlying constraints, have been widely applied in engineering fields such as fluid dynamics, multi-body dynamics, mechanical systems and control theory. In practical physical modeling within these domains, the systems often generate high-index DAEs. Classical implicit numerical methods typically result in varying order reduction of numerical accuracy when solving high-index systems.~Recently, the physics-informed neural network (PINN) has gained attention for solving DAE systems. However, it faces challenges like the inability to directly solve high-index systems, lower predictive accuracy, and weaker generalization capabilities. In this paper, we propose a PINN computational framework, combined Radau IIA numerical method with a neural network structure via the attention mechanisms, to directly solve high-index DAEs. Furthermore, we employ a domain decomposition strategy to enhance solution accuracy. We conduct numerical experiments with two classical high-index systems as illustrative examples, investigating how different orders of the Radau IIA method affect the accuracy of neural network solutions. The experimental results demonstrate that the PINN based on a 5th-order Radau IIA method achieves the highest level of system accuracy. Specifically, the absolute errors for all differential variables remains as low as $10^{-6}$, and the absolute errors for algebraic variables is maintained at $10^{-5}$, surpassing the results found in existing literature. Therefore, our method exhibits excellent computational accuracy and strong generalization capabilities, providing a feasible approach for the high-precision solution of larger-scale DAEs with higher indices or challenging high-dimensional partial differential algebraic equation systems.
Published: 2023

48. A numerical investigation of quasi-static magnetoconvection with an imposed horizontal magnetic field

Author: Calkins, Michael A., AlRefae, Talal, Hernandez, Angel, Yan, Ming, and Maffei, Stefano
Subjects: Physics - Fluid Dynamics, Physics - Geophysics
Abstract: Quasi-static Rayleigh-B\'enard convection with an imposed horizontal magnetic field is investigated numerically for Chandrasekhar numbers up to $Q=10^6$ with stress free boundary conditions. Both $Q$ and the Rayleigh number ($Ra$) are varied to identify the various dynamical regimes that are present in this system. We find three primary regimes: (I) a two-dimensional (2D) regime in which the axes of the convection rolls are oriented parallel to the imposed magnetic field; (II) an anisotropic three-dimensional (3D) regime; and (III) a mean flow regime characterized by a large scale horizontal flow directed transverse to the imposed magnetic field. The transition to 3D dynamics is preceded by a series of 2D transitions in which the number of convective rolls decreases as $Ra$ is increased. For sufficiently large $Q$, there is an eventual transition to two rolls just prior to the 2D/3D transition. The 2D/3D transition occurs when inertial forces become comparable to the Lorentz force, i.e. when $\sqrt{Q}/Re = O(1)$; 2D, magnetically constrained states persist when $\sqrt{Q}/Re \gtrsim O(1)$. Within the 2D regime we find heat and momentum transport scalings that are consistent with the hydrodynamic asymptotic predictions of Chini and Cox [Phys. Fluids \textbf{21}, 083603 (2009)]: the Nusselt number ($Nu$) and Reynolds number ($Re$) scale as $Nu \sim Ra^{1/3}$ and $Re \sim Ra^{2/3}$, respectively. For $Q=10^6$, we find that the scaling behavior of $Nu$ and $Re$ breaks down at large values of $Ra$ due to a sequence of bifurcations and the eventual manifestation of mean flows., Comment: 31 pages, 9 figures
Published: 2023

49. UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Author: Ye, Jiabo, Hu, Anwen, Xu, Haiyang, Ye, Qinghao, Yan, Ming, Xu, Guohai, Li, Chenliang, Tian, Junfeng, Qian, Qi, Zhang, Ji, Jin, Qin, He, Liang, Lin, Xin Alex, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters and the training cost is much lower than previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder architecture of MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art ocr-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Codes and instruction-tuning datasets will be released.
Published: 2023

50. ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models

Author: Li, Chenliang, Chen, Hehong, Yan, Ming, Shen, Weizhou, Xu, Haiyang, Wu, Zhikai, Zhang, Zhicheng, Zhou, Wenmeng, Chen, Yingda, Cheng, Chen, Shi, Hongzhu, Zhang, Ji, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent framework that equips LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs. In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, a comprehensive framework has been proposed spanning over tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant of ModelScope Community based on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent library\footnote{https://github.com/modelscope/modelscope-agent} and online demo\footnote{https://modelscope.cn/studios/damo/ModelScopeGPT/summary} are now publicly available.
Published: 2023

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

189 results on '"Yan, Ming"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources