Author: "Shi, Shuming" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shi, Shuming"' showing total 617 results

Start Over Author "Shi, Shuming"

617 results on '"Shi, Shuming"'

1. Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability

Author: Chung, Tsz Ting, Cui, Leyang, Liu, Lemao, Huang, Xinting, Shi, Shuming, and Yeung, Dit-Yan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of natural language processing tasks when leveraging in-context learning. To mitigate the additional computational and financial costs associated with in-context learning, several prompt compression methods have been proposed to compress the in-context learning prompts. Despite their success, these methods face challenges with transferability due to model-specific compression, or rely on external training data, such as GPT-4. In this paper, we investigate the ability of LLMs to develop a unified compression method that discretizes uninformative tokens, utilizing a self-supervised pre-training technique. By introducing a small number of parameters during the continual pre-training, the proposed Selection-p produces a probability for each input token, indicating whether to preserve or discard it. Experiments show Selection-p achieves state-of-the-art performance across numerous classification tasks, achieving compression rates of up to 10 times while experiencing only a marginal 0.8% decrease in performance. Moreover, it exhibits superior transferability to different models compared to prior work. Additionally, we further analyze how Selection-p helps maintain performance on in-context learning with long contexts., Comment: 14 pages, 5 figures, 10 tables, EMNLP 2024 Findings
Published: 2024

2. An Energy-based Model for Word-level AutoCompletion in Computer-aided Translation

Author: Yang, Cheng, Huang, Guoping, Yu, Mo, Zhang, Zhirui, Li, Siheng, Yang, Mingming, Shi, Shuming, Yang, Yujiu, and Liu, Lemao
Subjects: Computer Science - Computation and Language
Abstract: Word-level AutoCompletion(WLAC) is a rewarding yet challenging task in Computer-aided Translation. Existing work addresses this task through a classification model based on a neural network that maps the hidden vector of the input context into its corresponding label (i.e., the candidate target word is treated as a label). Since the context hidden vector itself does not take the label into account and it is projected to the label through a linear classifier, the model can not sufficiently leverage valuable information from the source sentence as verified in our experiments, which eventually hinders its overall performance. To alleviate this issue, this work proposes an energy-based model for WLAC, which enables the context hidden vector to capture crucial information from the source sentence. Unfortunately, training and inference suffer from efficiency and effectiveness challenges, thereby we employ three simple yet effective strategies to put our model into practice. Experiments on four standard benchmarks demonstrate that our reranking-based approach achieves substantial improvements (about 6.07%) over the previous state-of-the-art model. Further analyses show that each strategy of our approach contributes to the final performance., Comment: Accepted to TACL 2024
Published: 2024
Full Text: View/download PDF

3. Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning

Author: Yang, Sen, Cui, Leyang, Cai, Deng, Huang, Xinting, Shi, Shuming, and Lam, Wai
Subjects: Computer Science - Computation and Language
Abstract: Iterative preference learning, though yielding superior performances, requires online annotated preference labels. In this work, we study strategies to select worth-annotating response pairs for cost-efficient annotation while achieving competitive or even better performances compared with the random selection baseline for iterative preference learning. Built on assumptions regarding uncertainty and distribution shifts, we propose a comparative view to rank the implicit reward margins as predicted by DPO to select the response pairs that yield more benefits. Through extensive experiments, we show that annotating those response pairs with small margins is generally better than large or random, under both single- and multi-iteration scenarios. Besides, our empirical results suggest allocating more annotation budgets in the earlier iterations rather than later across multiple iterations., Comment: Findings of EMNLP 2024
Published: 2024

4. On the Transformations across Reward Model, Parameter Update, and In-Context Prompt

Author: Cai, Deng, Li, Huayang, Fu, Tingchen, Li, Siheng, Xu, Weiwen, Li, Shuaiyi, Cao, Bowen, Zhang, Zhisong, Huang, Xinting, Cui, Leyang, Wang, Yan, Liu, Lemao, Watanabe, Taro, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Despite the general capabilities of pre-trained large language models (LLMs), they still need further adaptation to better serve practical applications. In this paper, we demonstrate the interchangeability of three popular and distinct adaptation tools: parameter updating, reward modeling, and in-context prompting. This interchangeability establishes a triangular framework with six transformation directions, each of which facilitates a variety of applications. Our work offers a holistic view that unifies numerous existing studies and suggests potential research directions. We envision our work as a useful roadmap for future research on LLMs.
Published: 2024

5. Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction

Author: Fu, Tingchen, Cai, Deng, Liu, Lemao, Shi, Shuming, and Yan, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Supervised fine-tuning (SFT) on instruction-following corpus is a crucial approach toward the alignment of large language models (LLMs). However, the performance of LLMs on standard knowledge and reasoning benchmarks tends to suffer from deterioration at the latter stage of the SFT process, echoing the phenomenon of alignment tax. Through our pilot study, we put a hypothesis that the data biases are probably one cause behind the phenomenon. To address the issue, we introduce a simple disperse-then-merge framework. To be concrete, we disperse the instruction-following data into portions and train multiple sub-models using different data portions. Then we merge multiple models into a single one via model merging techniques. Despite its simplicity, our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks., Comment: Accepted to the findings of ACL2024
Published: 2024

6. Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text

Author: Li, Yafu, Wang, Zhilin, Cui, Leyang, Bi, Wei, Shi, Shuming, and Zhang, Yue
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: AI-generated text detection has attracted increasing attention as powerful language models approach human-level generation. Limited work is devoted to detecting (partially) AI-paraphrased texts. However, AI paraphrasing is commonly employed in various application scenarios for text refinement and diversity. To this end, we propose a novel detection framework, paraphrased text span detection (PTD), aiming to identify paraphrased text spans within a text. Different from text-level detection, PTD takes in the full text and assigns each of the sentences with a score indicating the paraphrasing degree. We construct a dedicated dataset, PASTED, for paraphrased text span detection. Both in-distribution and out-of-distribution results demonstrate the effectiveness of PTD models in identifying AI-paraphrased text spans. Statistical and model analysis explains the crucial role of the surrounding context of the paraphrased text spans. Extensive experiments show that PTD models can generalize to versatile paraphrasing prompts and multiple paraphrased text spans. We release our resources at https://github.com/Linzwcs/PASTED., Comment: ACL 2024 Findings
Published: 2024

7. CORM: Cache Optimization with Recent Message for Large Language Model Inference

Author: Dai, Jincheng, Huang, Zhuowei, Jiang, Haiyun, Chen, Chen, Cai, Deng, Bi, Wei, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, necessitate substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache rises linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70\% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with GQA for further compression rate.
Published: 2024

8. Pretraining without wordpieces: learning over a vocabulary of millions of words

Author: Feng, Zhangyin, Tang, Duyu, Feng, Xiaocheng, Zhou, Cong, Liao, Junwei, Wu, Shuangzhi, Qin, Bing, Cao, Yunbo, and Shi, Shuming
Published: 2024
Full Text: View/download PDF

9. Retrieval is Accurate Generation

Author: Cao, Bowen, Cai, Deng, Cui, Leyang, Cheng, Xuxin, Bi, Wei, Zou, Yuexian, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: Standard language models generate text by selecting tokens from a fixed, finite, and standalone vocabulary. We introduce a novel method that selects context-aware phrases from a collection of supporting documents. One of the most significant challenges for this paradigm shift is determining the training oracles, because a string of text can be segmented in various ways and each segment can be retrieved from numerous possible documents. To address this, we propose to initialize the training oracles using linguistic heuristics and, more importantly, bootstrap the oracles through iterative self-reinforcement. Extensive experiments show that our model not only outperforms standard language models on a variety of knowledge-intensive tasks but also demonstrates improved generation quality in open-ended text generation. For instance, compared to the standard language model counterpart, our model raises the accuracy from 23.47% to 36.27% on OpenbookQA, and improves the MAUVE score from 42.61% to 81.58% in open-ended text generation. Remarkably, our model also achieves the best performance and the lowest latency among several retrieval-augmented baselines. In conclusion, we assert that retrieval is more accurate generation and hope that our work will encourage further research on this new paradigm shift., Comment: ICLR 2024
Published: 2024

10. Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model

Author: He, Zhiwei, Wang, Xing, Jiao, Wenxiang, Zhang, Zhuosheng, Wang, Rui, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Insufficient modeling of human preferences within the reward model is a major obstacle for leveraging human feedback to improve translation quality. Fortunately, quality estimation (QE), which predicts the quality of a given translation without reference, has achieved impressive alignment with human evaluations in the last two years. In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training. We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines. We examine the problem and argue that the vulnerability of the QE model might lead to high rewards for incorrect translations, resulting in overoptimization and error propagation. To address the problem, we adopt a simple yet effective method that uses heuristic rules to detect the incorrect translations and assigns a penalty term to the reward scores of them. Experimental results show that the proposed QE-based feedback training achieves consistent and significant improvements across various settings, further verified through human preference studies. Our subsequent analysis demonstrates the high data efficiency of the proposed QE-based feedback training: it outperforms systems using larger parallel corpora by a small amount of monolingual data. Our code is available at: https://github.com/zwhe99/FeedbackMT, Comment: NAACL 2024
Published: 2024

11. Benchmarking LLMs via Uncertainty Quantification

Author: Ye, Fanghua, Yang, Mingming, Pang, Jianhui, Wang, Longyue, Wong, Derek F., Yilmaz, Emine, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language
Abstract: The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves nine LLMs (LLM series) spanning five representative natural language processing tasks. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs., Comment: 30 pages, accepted to NeurIPS 2024
Published: 2024

12. Knowledge Verification to Nip Hallucination in the Bud

Author: Wan, Fanqi, Huang, Xinting, Cui, Leyang, Quan, Xiaojun, Bi, Wei, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: While large language models (LLMs) have demonstrated exceptional performance across various tasks following human alignment, they may still generate responses that sound plausible but contradict factual knowledge, a phenomenon known as hallucination. In this paper, we demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs. Specifically, we propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge to evaluate the knowledge boundaries of foundation LLMs. To address knowledge inconsistencies in the alignment data, KCA implements several specific strategies to deal with these data instances. We demonstrate the superior efficacy of KCA in reducing hallucinations across six benchmarks, utilizing foundation LLMs of varying backbones and scales. This confirms the effectiveness of mitigating hallucinations by reducing knowledge inconsistency. Our code, model weights, and data are openly accessible at \url{https://github.com/fanqiwan/KCA}., Comment: Accepted to EMNLP 2024 (Main Conference)
Published: 2024

13. Knowledge Fusion of Large Language Models

Author: Wan, Fanqi, Huang, Xinting, Cai, Deng, Quan, Xiaojun, Bi, Wei, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/FuseLLM}., Comment: Accepted to ICLR 2024
Published: 2024

14. Alleviating Hallucinations of Large Language Models through Induced Hallucinations

Author: Zhang, Yue, Cui, Leyang, Bi, Wei, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Despite their impressive capabilities, large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon commonly known as ``hallucination''. In this work, we propose a simple \textit{Induce-then-Contrast} Decoding (ICD) strategy to alleviate hallucinations. We first construct a factually weak LLM by inducing hallucinations from the original LLMs. Then, we penalize these induced hallucinations during decoding to enhance the factuality of the generated content. Concretely, we determine the final next-token predictions by amplifying the predictions from the original model and downplaying the induced untruthful predictions via contrastive decoding. Experimental results on both discrimination-based and generation-based hallucination evaluation benchmarks, such as TruthfulQA and \textsc{FActScore}, demonstrate that our proposed ICD methods can effectively enhance the factuality of LLMs across various model sizes and families. For example, when equipped with ICD, Llama2-7B-Chat and Mistral-7B-Instruct achieve performance comparable to ChatGPT and GPT4 on TruthfulQA, respectively., Comment: Work in progress
Published: 2023

15. Emage: Non-Autoregressive Text-to-Image Generation

Author: Feng, Zhangyin, Hu, Runyi, Liu, Liangxin, Zhang, Fan, Tang, Duyu, Dai, Yong, Feng, Xiaocheng, Li, Jiwei, Qin, Bing, and Shi, Shuming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Autoregressive and diffusion models drive the recent breakthroughs on text-to-image generation. Despite their huge success of generating high-realistic images, a common shortcoming of these models is their high inference latency - autoregressive models run more than a thousand times successively to produce image tokens and diffusion models convert Gaussian noise into images with many hundreds of denoising steps. In this work, we explore non-autoregressive text-to-image models that efficiently generate hundreds of image tokens in parallel. We develop many model variations with different learning and inference strategies, initialized text encoders, etc. Compared with autoregressive baselines that needs to run one thousand times, our model only runs 16 times to generate images of competitive quality with an order of magnitude lower inference latency. Our non-autoregressive model with 346M parameters generates an image of 256$\times$256 with about one second on one V100 GPU.
Published: 2023

16. Reasons to Reject? Aligning Language Models with Judgments

Author: Xu, Weiwen, Cai, Deng, Zhang, Zhisong, Lam, Wai, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: As humans, we consistently interact with our peers and receive feedback in the form of natural language. This language feedback allows us to maintain appropriate behavior, and rectify potential errors. The question arises naturally: can we use language feedback to align large language models (LLMs)? In contrast to previous research that aligns LLMs with scalar rewards, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). We start with an in-depth investigation of potential methods that can be adapted for aligning LLMs with judgments, revealing that these methods cannot fully capitalize on judgments. To facilitate more effective utilization of judgments, we propose a novel framework, Contrastive Unlikelihood Training (CUT), that allows for fine-grained inappropriate content detection and correction based on judgments. Our results show that, with merely 1317 off-the-shelf judgment data, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 50.84 points on AlpacaEval. CUT (LLaMA2-chat-13b) can also align LLMs in an iterative fashion using up-to-date model-specific judgments, improving performance from 81.09 to 91.68 points on AlpacaEval. Further analysis suggests that judgments hold greater potential than rewards in LLM alignment., Comment: Accepted at ACL 2024 Findings. Our source codes and models are publicly available at https://github.com/wwxu21/CUT
Published: 2023

17. When Graph Data Meets Multimodal: A New Paradigm for Graph Understanding and Reasoning

Author: Ai, Qihang, Zhou, Jianwu, Jiang, Haiyun, Liu, Lemao, and Shi, Shuming
Subjects: Computer Science - Artificial Intelligence
Abstract: Graph data is ubiquitous in the physical world, and it has always been a challenge to efficiently model graph structures using a unified paradigm for the understanding and reasoning on various graphs. Moreover, in the era of large language models, integrating complex graph information into text sequences has become exceptionally difficult, which hinders the ability to interact with graph data through natural language instructions.The paper presents a new paradigm for understanding and reasoning about graph data by integrating image encoding and multimodal technologies. This approach enables the comprehension of graph data through an instruction-response format, utilizing GPT-4V's advanced capabilities. The study evaluates this paradigm on various graph types, highlighting the model's strengths and weaknesses, particularly in Chinese OCR performance and complex reasoning tasks. The findings suggest new direction for enhancing graph data processing and natural language interaction., Comment: 15 pages, 10 figures, 9 tables
Published: 2023

18. GPT4Video: A Unified Multimodal Large Language Model for lnstruction-Followed Understanding and Safety-Aware Generation

Author: Wang, Zhanyu, Wang, Longyue, Zhao, Zhen, Wu, Minghao, Lyu, Chenyang, Li, Huayang, Cai, Deng, Zhou, Luping, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multi-model framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has demonstrated to effectively and securely handle video generation scenarios. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8\% on the Video Question Answering task, and surpasses NExt-GPT by 2.3\% on the Text to Video generation task. 2) it endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) it maintains a safe and healthy conversation not only in output-side but also the input side in an end-to-end manner. Qualitative and qualitative experiments demonstrate that GPT4Video holds the potential to function as a effective, safe and Humanoid-like video assistant that can handle both video understanding and generation scenarios., Comment: ACM MM 2024, Oral
Published: 2023

19. StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving

Author: Gao, Chang, Jiang, Haiyun, Cai, Deng, Shi, Shuming, and Lam, Wai
Subjects: Computer Science - Computation and Language
Abstract: Most existing prompting methods suffer from the issues of generalizability and consistency, as they often rely on instance-specific solutions that may not be applicable to other instances and lack task-level consistency across the selected few-shot examples. To address these limitations, we propose a comprehensive framework, StrategyLLM, allowing LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts. It employs four LLM-based agents: strategy generator, executor, optimizer, and evaluator, working together to generate, evaluate, and select promising strategies for a given task. Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC that requires human-annotated solutions on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2\% $\rightarrow$ 38.8\%), commonsense reasoning (70.3\% $\rightarrow$ 72.5\%), algorithmic reasoning (73.7\% $\rightarrow$ 85.0\%), and symbolic reasoning (30.0\% $\rightarrow$ 79.2\%). Further analysis reveals that StrategyLLM is applicable to various LLMs and demonstrates advantages across numerous scenarios.
Published: 2023

20. Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs

Author: Wang, Longyue, Tu, Zhaopeng, Gu, Yan, Liu, Siyou, Yu, Dian, Ma, Qingsong, Lyu, Chenyang, Zhou, Liting, Liu, Chao-Hong, Ma, Yufeng, Chen, Weiyu, Graham, Yvette, Webber, Bonnie, Koehn, Philipp, Way, Andy, Yuan, Yulin, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2023, the first edition of the Discourse-Level Literary Translation. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted and document-level Chinese-English web novel corpus. Furthermore, we put forth an industry-endorsed criteria to guide human evaluation process. This year, we totally received 14 submissions from 7 academia and industry teams. We employ both automatic and human evaluations to measure the performance of the submitted systems. The official ranking of the systems is based on the overall human judgments. In addition, our extensive analysis reveals a series of interesting findings on literary and discourse-aware MT. We release data, system outputs, and leaderboard at http://www2.statmt.org/wmt23/literary-translation-task.html., Comment: WMT2023 Discourse-Level Literary Translation Shared Task Overview Paper
Published: 2023

21. Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

Author: Liang, Tian, He, Zhiwei, Huang, Jen-tse, Wang, Wenxuan, Jiao, Wenxiang, Wang, Rui, Yang, Yujiu, Tu, Zhaopeng, Shi, Shuming, and Wang, Xing
Subjects: Computer Science - Computation and Language
Abstract: The automatic evaluation of LLM-based agent intelligence is critical in developing advanced LLM-based agents. Although considerable effort has been devoted to developing human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game ``Who is Spy'', we propose to use the word guessing game to assess the intelligence performance of LLMs. Given a word, the LLM is asked to describe the word and determine its identity (spy or not) based on its and other players' descriptions. Ideally, an advanced agent should possess the ability to accurately describe a given word using an aggressive description while concurrently maximizing confusion in the conservative description, enhancing its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires LLM to describe a word in aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. Incorporating multi-agent interaction, SpyGame requires the target LLM to possess linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is very easy to implement. We collected words from multiple sources, domains, and languages and used the proposed evaluation framework to conduct experiments. Extensive experiments demonstrate that the proposed DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication., Comment: Work in progress
Published: 2023

22. Rethinking Word-Level Auto-Completion in Computer-Aided Translation

Author: Chen, Xingyu, Liu, Lemao, Huang, Guoping, Zhang, Zhirui, Yang, Mingming, Shi, Shuming, and Wang, Rui
Subjects: Computer Science - Computation and Language
Abstract: Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted Translation. It aims at providing word-level auto-completion suggestions for human translators. While previous studies have primarily focused on designing complex model architectures, this paper takes a different perspective by rethinking the fundamental question: what kind of words are good auto-completions? We introduce a measurable criterion to answer this question and discover that existing WLAC models often fail to meet this criterion. Building upon this observation, we propose an effective approach to enhance WLAC performance by promoting adherence to the criterion. Notably, the proposed approach is general and can be applied to various encoder-based architectures. Through extensive experiments, we demonstrate that our approach outperforms the top-performing system submitted to the WLAC shared tasks in WMT2022, while utilizing significantly smaller model sizes., Comment: EMNLP2023
Published: 2023

23. On Synthetic Data for Back Translation

Author: Xu, Jiahao, Ruan, Yubin, Bi, Wei, Huang, Guoping, Shi, Shuming, Chen, Lihui, and Liu, Lemao
Subjects: Computer Science - Computation and Language
Abstract: Back translation (BT) is one of the most significant technologies in NMT research fields. Existing attempts on BT share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model but seldom work studies the role of synthetic data in the performance of BT. This motivates us to ask a fundamental question: {\em what kind of synthetic data contributes to BT performance?} Through both theoretical and empirical studies, we identify two key factors on synthetic data controlling the back-translation NMT performance, which are quality and importance. Furthermore, based on our findings, we propose a simple yet effective method to generate synthetic data to better trade off both factors so as to yield a better performance for BT. We run extensive experiments on WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. By employing our proposed method to generate synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam and sampling based methods for data generation), which proves the effectiveness of our proposed methods.
Published: 2023
Full Text: View/download PDF

24. IMTLab: An Open-Source Platform for Building, Evaluating, and Diagnosing Interactive Machine Translation Systems

Author: Huang, Xu, Zhang, Zhirui, Gao, Ruize, Du, Yichao, Liu, Lemao, Huang, Gouping, Shi, Shuming, Chen, Jiajun, and Huang, Shujian
Subjects: Computer Science - Computation and Language
Abstract: We present IMTLab, an open-source end-to-end interactive machine translation (IMT) system platform that enables researchers to quickly build IMT systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. IMTLab treats the whole interactive translation process as a task-oriented dialogue with a human-in-the-loop setting, in which human interventions can be explicitly incorporated to produce high-quality, error-free translations. To this end, a general communication interface is designed to support the flexible IMT architectures and user policies. Based on the proposed design, we construct a simulated and real interactive environment to achieve end-to-end evaluation and leverage the framework to systematically evaluate previous IMT systems. Our simulated and manual experiments show that the prefix-constrained decoding approach still gains the lowest editing cost in the end-to-end evaluation, while BiTIIMT achieves comparable editing cost with a better interactive experience., Comment: Accepted by EMNLP2023
Published: 2023

25. Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration

Author: Wan, Fanqi, Huang, Xinting, Yang, Tao, Quan, Xiaojun, Bi, Wei, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: Instruction-tuning can be substantially optimized through enhanced diversity, resulting in models capable of handling a broader spectrum of tasks. However, existing data employed for such tuning often exhibit an inadequate coverage of individual domains, limiting the scope for nuanced comprehension and interactions within these areas. To address this deficiency, we propose Explore-Instruct, a novel approach to enhance the data coverage to be used in domain-specific instruction-tuning through active exploration via Large Language Models (LLMs). Built upon representative domain use cases, Explore-Instruct explores a multitude of variations or possibilities by implementing a search algorithm to obtain diversified and domain-focused instruction-tuning data. Our data-centric analysis validates the effectiveness of this proposed approach in improving domain-specific instruction coverage. Moreover, our model's performance demonstrates considerable advancements over multiple baselines, including those utilizing domain-specific data enhancement. Our findings offer a promising opportunity to improve instruction coverage, especially in domain-specific contexts, thereby advancing the development of adaptable language models. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/Explore-Instruct}., Comment: Accepted to EMNLP 2023 (Main Conference)
Published: 2023

26. RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation

Author: Zhang, Yue, Cui, Leyang, Zhao, Enbo, Bi, Wei, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Grammatical Error Correction (GEC) systems play a vital role in assisting people with their daily writing tasks. However, users may sometimes come across a GEC system that initially performs well but fails to correct errors when the inputs are slightly modified. To ensure an ideal user experience, a reliable GEC system should have the ability to provide consistent and accurate suggestions when encountering irrelevant context perturbations, which we refer to as context robustness. In this paper, we introduce RobustGEC, a benchmark designed to evaluate the context robustness of GEC systems. RobustGEC comprises 5,000 GEC cases, each with one original error-correct sentence pair and five variants carefully devised by human annotators. Utilizing RobustGEC, we reveal that state-of-the-art GEC systems still lack sufficient robustness against context perturbations. In addition, we propose a simple yet effective method for remitting this issue., Comment: Accepted to EMNLP 2023 (main conference, long paper)
Published: 2023

27. A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

Author: Chen, Yi, Jiang, Haiyun, Bi, Wei, Wang, Rui, Wang, Longyue, Shi, Shuming, and Xu, Ruifeng
Subjects: Computer Science - Computation and Language
Abstract: This work presents a new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations of the plain text to concretize or vivify human writings. Different from existing insertion-based writing assistance tasks, TE requires the model to be more flexible in both locating and generation, and also more cautious in keeping basic semantics. We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references for both English and Chinese. To facilitate automatic evaluation, we design various metrics from multiple perspectives. In particular, we propose Info-Gain to effectively measure the informativeness of expansions, which is an important quality dimension in TE. On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which demonstrate the superiority over the Text2Text baselines, especially in expansion informativeness. Experiments verify the feasibility of the TE task and point out potential directions for future research toward better automatic text expansion.
Published: 2023

28. TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild

Author: Li, Huayang, Li, Siheng, Cai, Deng, Wang, Longyue, Liu, Lemao, Watanabe, Taro, Yang, Yujiu, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following., Comment: Findings of ACL 2024
Published: 2023

29. DoG-Instruct: Towards Premium Instruction-Tuning Data via Text-Grounded Instruction Wrapping

Author: Chen, Yongrui, Jiang, Haiyun, Huang, Xinting, Shi, Shuming, and Qi, Guilin
Subjects: Computer Science - Computation and Language
Abstract: The improvement of LLMs' instruction-following capabilities relies heavily on the availability of high-quality instruction-response pairs. Unfortunately, the current methods used to collect the pairs suffer from either unaffordable labor costs or severe hallucinations in the self-generation of LLM. To tackle these challenges, this paper proposes a scalable solution. It involves training LLMs to generate instruction-response pairs based on human-written documents, rather than relying solely on self-generation without context. Our proposed method not only exploits the advantages of human-written documents in reducing hallucinations but also utilizes an LLM to wrap the expression of documents, which enables us to bridge the gap between various document styles and the standard AI response. Experiments demonstrate that our method outperforms existing typical methods on multiple benchmarks. In particular, compared to the best-performing baseline, the LLM trained using our generated dataset exhibits a 10\% relative improvement in performance on AlpacaEval, despite utilizing only 1/5 of its training data. Furthermore, a comprehensive manual evaluation validates the quality of the data we generated. Our trained wrapper is publicly available at https://github.com/Bahuia/Dog-Instruct., Comment: Accepted in NAACL 2024
Published: 2023

30. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Author: Zhang, Yue, Li, Yafu, Cui, Leyang, Cai, Deng, Liu, Lemao, Fu, Tingchen, Huang, Xinting, Zhao, Enbo, Zhang, Yu, Chen, Yulong, Wang, Longyue, Luu, Anh Tuan, Bi, Wei, Shi, Freda, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Machine Learning
Abstract: While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research., Comment: work in progress; 32 pages
Published: 2023

31. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Author: Yuan, Youliang, Jiao, Wenxiang, Wang, Wenxuan, Huang, Jen-tse, He, Pinjia, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language
Abstract: Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat., Comment: Accepted by ICLR 2024. 21 pages, 3 figures, 13 tables
Published: 2023

32. Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Author: Wang, Longyue, Du, Zefeng, Liu, Donghuai, Cai, Deng, Yu, Dian, Jiang, Haiyun, Wang, Yan, Cui, Leyang, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Modeling discourse -- the linguistic phenomena that go beyond individual sentences, is a fundamental yet challenging aspect of natural language processing (NLP). However, existing evaluation benchmarks primarily focus on the evaluation of inter-sentence properties and overlook critical discourse phenomena that cross sentences. To bridge the gap, we propose Disco-Bench, a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks, covering understanding, translation, and generation. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena (e.g. cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge. We totally evaluate 20 general-, in-domain and commercial models based on Transformer, advanced pretraining architectures and large language models (LLMs). Our results show (1) the challenge and necessity of our evaluation benchmark; (2) fine-grained pretraining based on literary document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope can significantly facilitate research in this field: https://github.com/longyuewangdcu/Disco-Bench., Comment: Zhaopeng Tu is the corresponding author
Published: 2023

33. On the Cultural Gap in Text-to-Image Generation

Author: Liu, Bingshuai, Wang, Longyue, Lyu, Chenyang, Zhang, Yong, Su, Jinsong, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data, which signifies the disparity in generated image quality when the cultural elements of the input text are rarely collected in the training set. Although various T2I models have shown impressive but arbitrary examples, there is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images. To bridge the gap, we propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture. By analyzing the flawed images generated by the Stable Diffusion model on the C3 benchmark, we find that the model often fails to generate certain cultural objects. Accordingly, we propose a novel multi-modal metric that considers object-text alignment to filter the fine-tuning data in the target culture, which is used to fine-tune a T2I model to improve cross-cultural generation. Experimental results show that our multi-modal metric provides stronger data selection performance on the C3 benchmark than existing metrics, in which the object-text alignment is crucial. We release the benchmark, data, code, and generated images to facilitate future research on culturally diverse T2I generation (https://github.com/longyuewangdcu/C3-Bench)., Comment: Equal contribution: Bingshuai Liu and Longyue Wang. Work done while Bingshuai Liu and Chengyang Lyu were interning at Tencent AI Lab. Zhaopeng Tu is the corresponding author
Published: 2023

34. SkillNet-X: A Multilingual Multitask Model with Sparsely Activated Skills

Author: Feng, Zhangyin, Dai, Yong, Zhang, Fan, Tang, Duyu, Feng, Xiaocheng, Wu, Shuangzhi, Qin, Bing, Cao, Yunbo, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: Traditional multitask learning methods basically can only exploit common knowledge in task- or language-wise, which lose either cross-language or cross-task knowledge. This paper proposes a general multilingual multitask model, named SkillNet-X, which enables a single model to tackle many different tasks from different languages. To this end, we define several language-specific skills and task-specific skills, each of which corresponds to a skill module. SkillNet-X sparsely activates parts of the skill modules which are relevant either to the target task or the target language. Acting as knowledge transit hubs, skill modules are capable of absorbing task-related knowledge and language-related knowledge consecutively. Based on Transformer, we modify the multi-head attention layer and the feed forward network layer to accommodate skill modules. We evaluate SkillNet-X on eleven natural language understanding datasets in four languages. Results show that SkillNet-X performs better than task-specific baselines and two multitask learning baselines (i.e., dense joint model and Mixture-of-Experts model). Furthermore, skill pre-training further improves the performance of SkillNet-X on almost all datasets. To investigate the generalization of our model, we conduct experiments on two new tasks and find that SkillNet-X significantly outperforms baselines.
Published: 2023

35. Explicit Syntactic Guidance for Neural Text Generation

Author: Li, Yafu, Cui, Leyang, Yan, Jianhao, Yin, Yongjing, Bi, Wei, Shi, Shuming, and Zhang, Yue
Subjects: Computer Science - Computation and Language
Abstract: Most existing text generation models follow the sequence-to-sequence paradigm. Generative Grammar suggests that humans generate natural language texts by learning language grammar. We propose a syntax-guided generation schema, which generates the sequence guided by a constituency parse tree in a top-down direction. The decoding process can be decomposed into two parts: (1) predicting the infilling texts for each constituent in the lexicalized syntax context given the source sentence; (2) mapping and expanding each constituent to construct the next-level syntax context. Accordingly, we propose a structural beam search method to find possible syntax structures hierarchically. Experiments on paraphrase generation and machine translation show that the proposed method outperforms autoregressive baselines, while also demonstrating effectiveness in terms of interpretability, controllability, and diversity., Comment: ACL 2023
Published: 2023

36. Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

Author: Lyu, Chenyang, Wu, Minghao, Wang, Longyue, Huang, Xinting, Liu, Bingshuai, Du, Zefeng, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data modalities beyond text has not been fully studied. In this work, we propose Macaw-LLM, a novel multi-modal LLM that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations. Our novel alignment module seamlessly bridges multi-modal features to textual features, simplifying the adaptation process from the modality modules to the cognitive module. In addition, we construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances. We have made our data, code and model publicly available, which we hope can pave the way for future research in multi-modal LLMs and expand the capabilities of LLMs to handle diverse data modalities and address complex real-world scenarios., Comment: Longyue Wang is the corresponding author. Our project page is at https://github.com/lyuchenyang/Macaw-LLM
Published: 2023

37. Rethinking Translation Memory Augmented Neural Machine Translation

Author: Hao, Hongkun, Huang, Guoping, Liu, Lemao, Zhang, Zhirui, Shi, Shuming, and Wang, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: This paper rethinks translation memory augmented neural machine translation (TM-augmented NMT) from two perspectives, i.e., a probabilistic view of retrieval and the variance-bias decomposition principle. The finding demonstrates that TM-augmented NMT is good at the ability of fitting data (i.e., lower bias) but is more sensitive to the fluctuations in the training data (i.e., higher variance), which provides an explanation to a recently reported contradictory phenomenon on the same translation task: TM-augmented NMT substantially advances vanilla NMT under the high-resource scenario whereas it fails under the low-resource scenario. Then we propose a simple yet effective TM-augmented NMT model to promote the variance and address the contradictory phenomenon. Extensive experiments show that the proposed TM-augmented NMT achieves consistent gains over both conventional NMT and existing TM-augmented NMT under two variance-preferable (low-resource and plug-and-play) scenarios as well as the high-resource scenario., Comment: 15 pages, 2 figures, accepted by ACL2023 findings
Published: 2023

38. Sen2Pro: A Probabilistic Perspective to Sentence Embedding from Pre-trained Language Model

Author: Shen, Lingfeng, Jiang, Haiyun, Liu, Lemao, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Sentence embedding is one of the most fundamental tasks in Natural Language Processing and plays an important role in various tasks. The recent breakthrough in sentence embedding is achieved by pre-trained language models (PLMs). Despite its success, an embedded vector (Sen2Vec) representing a point estimate does not naturally express uncertainty in a taskagnostic way. This paper thereby proposes an efficient framework on probabilistic sentence embedding (Sen2Pro) from PLMs, and it represents a sentence as a probability density distribution in an embedding space to reflect both model uncertainty and data uncertainty (i.e., many-to-one nature) in the sentence representation. The proposed framework performs in a plug-and-play way without retraining PLMs anymore, and it is easy to implement and generally applied on top of any PLM. The superiority of Sen2Pro over Sen2Vec has been theoretically verified and practically illustrated on different NLP tasks., Comment: Accepted to ACL2023 workshop Rep4NLP
Published: 2023

39. Comparison of neoadjuvant chemoradiotherapy versus chemoradiotherapy plus immunotherapy for esophageal squamous cell carcinoma in a real-world multicenter cohort: a propensity score matching study

Author: Shi, Shuming, Zhou, Hao, Li, Li, Xu, Fuhao, Liu, Ning, Zhang, Dexian, Xu, Xiaohui, Sun, Yawen, and Yuan, Shuanghu
Published: 2024
Full Text: View/download PDF

40. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Author: Liang, Tian, He, Zhiwei, Jiao, Wenxiang, Wang, Xing, Wang, Yan, Wang, Rui, Yang, Yujiu, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language
Abstract: Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate., Comment: EMNLP 2024 (main conference)
Published: 2023

41. Improved Visual Story Generation with Adaptive Context Modeling

Author: Feng, Zhangyin, Ren, Yuchen, Yu, Xinmiao, Feng, Xiaocheng, Tang, Duyu, Shi, Shuming, and Qin, Bing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Diffusion models developed on top of powerful text-to-image generation models like Stable Diffusion achieve remarkable success in visual story generation. However, the best-performing approach considers historically generated results as flattened memory cells, ignoring the fact that not all preceding images contribute equally to the generation of the characters and scenes at the current stage. To address this, we present a simple method that improves the leading system with adaptive context modeling, which is not only incorporated in the encoder but also adopted as additional guidance in the sampling stage to boost the global consistency of the generated story. We evaluate our model on PororoSV and FlintstonesSV datasets and show that our approach achieves state-of-the-art FID scores on both story visualization and continuation scenarios. We conduct detailed model analysis and show that our model excels at generating semantically consistent images for stories.
Published: 2023

42. Enhancing Grammatical Error Correction Systems with Explanations

Author: Fei, Yuejiao, Cui, Leyang, Yang, Sen, Lam, Wai, Lan, Zhenzhong, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: Grammatical error correction systems improve written communication by detecting and correcting language mistakes. To help language learners better understand why the GEC system makes a certain correction, the causes of errors (evidence words) and the corresponding error types are two key factors. To enhance GEC systems with explanations, we introduce EXPECT, a large dataset annotated with evidence words and grammatical error types. We propose several baselines and analysis to understand this task. Furthermore, human evaluation verifies our explainable GEC system's explanations can assist second-language learners in determining whether to accept a correction suggestion and in understanding the associated grammar rule., Comment: 9 pages, 7 figures, accepted to the main conference of ACL 2023
Published: 2023

43. MAGE: Machine-generated Text Detection in the Wild

Author: Li, Yafu, Li, Qintong, Cui, Leyang, Bi, Wei, Wang, Zhilin, Wang, Longyue, Yang, Linyi, Shi, Shuming, and Zhang, Yue
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection to mitigate risks like the spread of fake news and plagiarism. Existing research has been constrained by evaluating detection methods on specific domains or particular language models. In practical scenarios, however, the detector faces texts from various domains or LLMs without knowing their sources. To this end, we build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs. Empirical results show challenges in distinguishing machine-generated texts from human-authored ones across various scenarios, especially out-of-distribution. These challenges are due to the decreasing linguistic distinctions between the two sources. Despite challenges, the top-performing detector can identify 86.54% out-of-domain texts generated by a new LLM, indicating the feasibility for application scenarios. We release our resources at https://github.com/yafuly/MAGE., Comment: ACL 2024
Published: 2023

44. A Frustratingly Simple Decoding Method for Neural Text Generation

Author: Yang, Haoran, Cai, Deng, Li, Huayang, Bi, Wei, Lam, Wai, and Shi, Shuming
Subjects: Computer Science - Computation and Language
Abstract: We introduce a frustratingly simple, super efficient and surprisingly effective decoding method, which we call Frustratingly Simple Decoding (FSD), for neural text generation. The idea behind FSD is straightforward: we build an anti-LM based on previously generated text and use this anti-LM to penalize future generation of what has been generated. The anti-LM can be implemented as simple as an n-gram language model or a vectorized variant. In this way, FSD introduces no extra model parameters and negligible computational overhead (FSD can be as fast as greedy search). Despite the simplicity, FSD is surprisingly effective; Experiments show that FSD can outperform the canonical methods to date (i.e., nucleus sampling) as well as several strong baselines that were proposed recently., Comment: LREC-Coling 2024
Published: 2023

45. A Survey on Zero Pronoun Translation

Author: Wang, Longyue, Liu, Siyou, Xu, Mingzhou, Song, Linfeng, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Zero pronouns (ZPs) are frequently omitted in pro-drop languages (e.g. Chinese, Hungarian, and Hindi), but should be recalled in non-pro-drop languages (e.g. English). This phenomenon has been studied extensively in machine translation (MT), as it poses a significant challenge for MT systems due to the difficulty in determining the correct antecedent for the pronoun. This survey paper highlights the major works that have been undertaken in zero pronoun translation (ZPT) after the neural revolution, so that researchers can recognise the current state and future directions of this field. We provide an organisation of the literature based on evolution, dataset, method and evaluation. In addition, we compare and analyze competing models and evaluation metrics on different benchmarks. We uncover a number of insightful findings such as: 1) ZPT is in line with the development trend of large language model; 2) data limitation causes learning bias in languages and domains; 3) performance improvements are often reported on single benchmarks, but advanced methods are still far from real-world use; 4) general-purpose metrics are not reliable on nuances and complexities of ZPT, emphasizing the necessity of targeted metrics; 5) apart from commonly-cited errors, ZPs will cause risks of gender bias., Comment: ACL2023 Main Conference Long Paper. Longyue Wang and Siyou Liu contributed equally to this work
Published: 2023

46. A Simple and Plug-and-play Method for Unsupervised Sentence Representation Enhancement

Author: Shen, Lingfeng, Jiang, Haiyun, Liu, Lemao, and Shi, Shuming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Generating proper embedding of sentences through an unsupervised way is beneficial to semantic matching and retrieval problems in real-world scenarios. This paper presents Representation ALchemy (RepAL), an extremely simple post-processing method that enhances sentence representations. The basic idea in RepAL is to de-emphasize redundant information of sentence embedding generated by pre-trained models. Through comprehensive experiments, we show that RepAL is free of training and is a plug-and-play method that can be combined with most existing unsupervised sentence learning models. We also conducted in-depth analysis to understand RepAL.
Published: 2023

47. Exploring Human-Like Translation Strategy with Large Language Models

Author: He, Zhiwei, Liang, Tian, Jiao, Wenxiang, Zhang, Zhuosheng, Yang, Yujiu, Wang, Rui, Tu, Zhaopeng, Shi, Shuming, and Wang, Xing
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in general scenarios, exhibiting a level of aptitude that approaches, in some aspects even surpasses, human-level intelligence. Among their numerous skills, the translation abilities of LLMs have received considerable attention. In contrast to traditional machine translation that focuses solely on source-target mapping, LLM-based translation can potentially mimic the human translation process that takes many preparatory steps to ensure high-quality translation. This work aims to explore this possibility by proposing the MAPS framework, which stands for Multi-Aspect Prompting and Selection. Specifically, we enable LLMs to first analyze the given source text and extract three aspects of translation-related knowledge: keywords, topics and relevant demonstrations to guide the translation process. To filter out the noisy and unhelpful knowledge, we employ a selection mechanism based on quality estimation. Experiments suggest that MAPS brings significant and consistent improvements over text-davinci-003 and Alpaca on eight translation directions from the latest WMT22 test sets. Our further analysis shows that the extracted knowledge is critical in resolving up to 59% of hallucination mistakes in translation. Code is available at https://github.com/zwhe99/MAPS-mt., Comment: V2: add more experiments and case studies; polish writing
Published: 2023

48. ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback

Author: Jiao, Wenxiang, Huang, Jen-tse, Wang, Wenxuan, He, Zhiwei, Liang, Tian, Wang, Xing, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) like ChatGPT have exhibited remarkable abilities on a wide range of natural language processing~(NLP) tasks, including various machine translation abilities accomplished during chat. However, these models are only accessible through restricted APIs, which creates barriers to new research and advancements in the field. Therefore, we propose ParroT, a framework to enhance and regulate the translation abilities during chat based on open-source LLMs (e.g., LLaMA), human-written translation and feedback data. Specifically, ParroT reformulates translation data into the instruction-following style, and introduces a "$\mathbf{Hint}$" field for incorporating extra requirements to regulate the translation process. Accordingly, we propose three instruction types for finetuning ParroT models, including translation instruction, contrastive instruction, and error-guided instruction. Experiments on Flores subsets and WMT22 test sets suggest that translation instruction improves the translation performance of vanilla LLMs significantly while error-guided instruction can lead to further improvement, which demonstrates the importance of learning from low-quality translations annotated by humans. We also demonstrate the potential of automatic evaluation tools in providing quality information of translations, when constructing error-guided instructions for directions that lack human annotation data. Please refer to our Github project for more implementation details: https://github.com/wxjiao/ParroT, Comment: 12 pages; EMNLP 2023 (Findings)
Published: 2023

49. Document-Level Machine Translation with Large Language Models

Author: Wang, Longyue, Lyu, Chenyang, Ji, Tianbo, Zhang, Zhirui, Yu, Dian, Shi, Shuming, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) such as ChatGPT can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks. Taking document-level machine translation (MT) as a testbed, this paper provides an in-depth evaluation of LLMs' ability on discourse modeling. The study focuses on three aspects: 1) Effects of Context-Aware Prompts, where we investigate the impact of different prompts on document-level translation quality and discourse phenomena; 2) Comparison of Translation Models, where we compare the translation performance of ChatGPT with commercial MT systems and advanced document-level MT methods; 3) Analysis of Discourse Modelling Abilities, where we further probe discourse knowledge encoded in LLMs and shed light on impacts of training techniques on discourse modeling. By evaluating on a number of benchmarks, we surprisingly find that LLMs have demonstrated superior performance and show potential to become a new paradigm for document-level translation: 1) leveraging their powerful long-text modeling capabilities, GPT-3.5 and GPT-4 outperform commercial MT systems in terms of human evaluation; 2) GPT-4 demonstrates a stronger ability for probing linguistic knowledge than GPT-3.5. This work highlights the challenges and opportunities of LLMs for MT, which we hope can inspire the future design and evaluation of LLMs.We release our data and annotations at https://github.com/longyuewangdcu/Document-MT-LLM., Comment: Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang are equal contributors
Published: 2023

50. Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study

Author: Chen, Yi, Wang, Rui, Jiang, Haiyun, Shi, Shuming, and Xu, Ruifeng
Subjects: Computer Science - Computation and Language
Abstract: Evaluating the quality of generated text is a challenging task in NLP, due to the inherent complexity and diversity of text. Recently, large language models (LLMs) have garnered significant attention due to their impressive performance in various tasks. Therefore, we present this paper to investigate the effectiveness of LLMs, especially ChatGPT, and explore ways to optimize their use in assessing text quality. We compared three kinds of reference-free evaluation methods. The experimental results prove that ChatGPT is capable of evaluating text quality effectively from various perspectives without reference and demonstrates superior performance than most existing automatic metrics. In particular, the Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable method among the three exploited approaches. However, directly comparing the quality of two texts may lead to suboptimal results. We believe this paper will provide valuable insights for evaluating text quality with LLMs and have released the used data.
Published: 2023

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

617 results on '"Shi, Shuming"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources