Author: "Wang Jingang" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

Your search for "Wang Jingang" returned 58 results.

1. Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern

Author: Tang, Hongyin, Xiu, Di, Wang, Lanrui, Geng, Xiurui, Wang, Jingang, and Cai, Xunliang
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The quadratic computational complexity of the attention mechanism in current Large Language Models (LLMs) renders inference with long contexts prohibitively expensive. To address this challenge, various approaches aim to retain critical portions of the context to optimally approximate Full Attention (FA) through Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of virtually unlimited text lengths in a streaming manner. However, these methods struggle to achieve performance levels comparable to FA, particularly in retrieval tasks. In this paper, our analysis of attention head patterns reveals that LLMs' attention distributions show strong local correlations, naturally reflecting a chunking mechanism for input context. We propose the Ltri-LLM framework, which divides KVs into spans, stores them in an offline index, and retrieves the relevant KVs into memory for various queries. Experimental results on popular long text benchmarks show that Ltri-LLM can achieve performance close to FA while maintaining efficient, streaming-based inference.
Published: 2024

2. Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Author: Xi, Zhiheng, Yang, Dingwen, Huang, Jixuan, Tang, Jiafu, Li, Guanyu, Ding, Yiwen, He, Wei, Hong, Boyang, Do, Shihan, Zhan, Wenyu, Wang, Xiao, Zheng, Rui, Ji, Tao, Shi, Xiaowei, Zhai, Yitao, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang, Gui, Tao, Wu, Zuxuan, Zhang, Qi, Qiu, Xipeng, Huang, Xuanjing, and Jiang, Yu-Gang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and train-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of 76,321 responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce the critique-based supervision to the actor's self-training process, and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take the preliminary step to explore training self-talk reasoning models via critique supervision and showcase its potential. Our code and datasets are at https://mathcritique.github.io/.
Comment: Preprint
Published: 2024

3. Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

Author: Li, Bei, Zheng, Tong, Wang, Rui, Liu, Jiahao, Guo, Qingyan, Guo, Junliang, Tan, Xu, Xiao, Tong, Zhu, Jingbo, Wang, Jingang, and Cai, Xunliang
Subjects: Computer Science - Computation and Language
Abstract: Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error compared to the true "solution." First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT'14 English-German and English-French tasks, our model achieved BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters. Notably, it also beats LLaMA models by 5.7 accuracy points on the LM Harness Evaluation.
Comment: Accepted by NeurIPS 2024
Published: 2024
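
Illustrative note on entry 3: the abstract treats residual layers as discrete ODE steps and builds on predictor-corrector schemes. The following is a minimal sketch of the classical textbook idea only (Heun's method, with an explicit Euler predictor and a trapezoidal corrector) applied to a scalar test equation. It is not the paper's Transformer architecture, and it omits the exponential moving average coefficient learning. The function names and the test equation y' = -y are assumptions chosen for illustration.

```python
import math

def euler_step(f, t, y, h):
    """One explicit Euler step: the analogue of a plain residual update."""
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    """One predictor-corrector step (Heun's method)."""
    y_pred = y + h * f(t, y)                            # predictor: explicit Euler
    return y + 0.5 * h * (f(t, y) + f(t + h, y_pred))   # corrector: trapezoidal rule

def integrate(step, f, y0, t0, t1, n):
    """Apply `step` n times with a fixed step size from t0 to t1."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = step(f, t, y, h)
        t += h
    return y

def decay(t, y):
    """Hypothetical test equation y' = -y, whose exact solution is exp(-t)."""
    return -y

exact = math.exp(-1.0)
euler_err = abs(integrate(euler_step, decay, 1.0, 0.0, 1.0, 10) - exact)
heun_err = abs(integrate(heun_step, decay, 1.0, 0.0, 1.0, 10) - exact)
print(f"Euler error: {euler_err:.5f}, Heun error: {heun_err:.5f}")
```

With the same number of steps, the corrected update has a smaller truncation error than the plain Euler update on this test equation, which is the kind of error reduction the abstract refers to.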

4. Multi-Programming Language Sandbox for LLMs

Author: Dou, Shihan, Zhang, Jiazheng, Zang, Jianxiang, Tao, Yunbo, Zhou, Weikang, Jia, Haoxiang, Liu, Shichun, Yang, Yuming, Xi, Zhiheng, Wu, Shenxi, Zhang, Shaoqing, Wu, Muling, Lv, Changze, Xiong, Limao, Zhan, Wenyu, Zhang, Lin, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang, Wu, Yueming, Wen, Ming, Zheng, Rui, Ji, Tao, Cao, Yixin, Gui, Tao, Qiu, Xipeng, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of the code, compiling and executing it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox also integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. MPLSandbox can be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of their generated code. It also helps researchers streamline their workflows for various LLM-based code-related tasks, reducing the development cost. To validate the effectiveness of MPLSandbox, we integrate it into training and deployment approaches, and also employ it to optimize workflows for a wide range of real-world code-related tasks. Our goal is to enhance researcher productivity on LLM-based code-related tasks by simplifying and automating workflows through delegation to MPLSandbox.
Comment: 25 pages, 14 figures
Published: 2024

5. FIRP: Faster LLM inference via future intermediate representation prediction

Author: Wu, Pengfei, Liu, Jiahao, Gong, Zhuocheng, Wang, Qifan, Li, Jinpeng, Wang, Jingang, Cai, Xunliang, and Zhao, Dongyan
Subjects: Computer Science - Computation and Language
Abstract: Recent advancements in Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. Despite this, the auto-regressive nature of LLM decoding, which generates only a single token per forward propagation, fails to fully exploit the parallel computational power of GPUs, leading to considerable latency. To address this, we introduce a novel speculative decoding method named FIRP, which generates multiple tokens instead of one at each decoding step. We achieve this by predicting the intermediate hidden states of future tokens (tokens that have not been decoded yet) and then using these pseudo hidden states to decode future tokens. Specifically, these pseudo hidden states are predicted with a simple linear transformation in intermediate layers of LLMs. Once predicted, they participate in the computation of all the following layers, thereby assimilating richer semantic information. As the layers go deeper, the semantic gap between pseudo and real hidden states is narrowed and it becomes feasible to decode future tokens with high accuracy. To validate the effectiveness of FIRP, we conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets. Analytical experiments also support our motivations.
Published: 2024

6. Let's Ask GNN: Empowering Large Language Model for Graph In-Context Learning

Author: Hu, Zhengyu, Li, Yichuan, Chen, Zhengyu, Wang, Jingang, Liu, Han, Lee, Kyumin, and Ding, Kaize
Subjects: Computer Science - Machine Learning
Abstract: Textual Attributed Graphs (TAGs) are crucial for modeling complex real-world systems, yet leveraging large language models (LLMs) for TAGs presents unique challenges due to the gap between sequential text processing and graph-structured data. We introduce AskGNN, a novel approach that bridges this gap by leveraging In-Context Learning (ICL) to integrate graph data and task-specific information into LLMs. AskGNN employs a Graph Neural Network (GNN)-powered structure-enhanced retriever to select labeled nodes across graphs, incorporating complex graph structures and their supervision signals. Our learning-to-retrieve algorithm optimizes the retriever to select example nodes that maximize LLM performance on graphs. Experiments across three tasks and seven LLMs demonstrate AskGNN's superior effectiveness in graph task performance, opening new avenues for applying LLMs to graph-structured data without extensive fine-tuning.
Published: 2024

7. Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models

Author: Wang, Siqi, Chen, Zhengyu, Li, Bei, He, Keqing, Zhang, Min, and Wang, Jingang
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The scaling of large language models (LLMs) is a critical research area for the efficiency and effectiveness of model training and deployment. Our work investigates the transferability and discrepancies of scaling laws between Dense Models and Mixture of Experts (MoE) models. Through a combination of theoretical analysis and extensive experiments, including consistent loss scaling, optimal batch size and learning rate scaling, and resource allocation strategies scaling, our findings reveal that the power-law scaling framework also applies to MoE Models, indicating that the fundamental principles governing the scaling behavior of these models are preserved, even though the architecture differs. Additionally, MoE Models demonstrate superior generalization, resulting in lower testing losses with the same training compute budget compared to Dense Models. These findings indicate the scaling consistency and transfer generalization capabilities of MoE Models, providing new insights for optimizing MoE Model training and deployment strategies.
Published: 2024

8. Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models

Author: Liu, Xinyu, Zhao, Runsong, Huang, Pengcheng, Xiao, Chunyang, Li, Bei, Wang, Jingang, Xiao, Tong, and Zhu, Jingbo
Subjects: Computer Science - Computation and Language
Abstract: Numerous recent works aim to extend the effective context length of language models, and various methods, tasks and benchmarks exist to measure a model's effective memorization length. However, through thorough investigations, we find limitations in existing evaluations of models' memorization capability. We provide an extensive survey of these limitations in this work and propose a new method called forgetting curve to measure the memorization capability of long-context models. We show that forgetting curve has the advantage of being robust to the tested corpus and the experimental settings, of not relying on prompts, and of being applicable to any model size. We apply our forgetting curve to a large variety of models involving both transformer and RNN/SSM based architectures. Our measurement provides empirical evidence for the effectiveness of transformer extension techniques while raising questions about the effective length of RNN/SSM based models. We also examine the difference between our measurement and existing benchmarks as well as popular metrics for various models. Our code and results can be found at https://github.com/1azybug/ForgettingCurve.
Published: 2024

9. Length Desensitization in Direct Preference Optimization

Author: Liu, Wei, Bai, Yang, Han, Chengcheng, Weng, Rongxiang, Xu, Jun, Cao, Xuezhi, Wang, Jingang, and Cai, Xunliang
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO's optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during the DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We utilized two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conducted in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with human-like preferences.
Comment: 21 pages, 9 figures
Published: 2024

10. How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Author: Wang, Yejie, He, Keqing, Fu, Dayuan, Gongque, Zhuoma, Xu, Heyang, Chen, Yanxu, Wang, Zhexu, Fu, Yujia, Dong, Guanting, Diao, Muxi, Wang, Jingang, Zhang, Mengdi, Cai, Xunliang, and Xu, Weiran
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder.
Comment: Work in progress
Published: 2024

11. ReMamba: Equip Mamba with Effective Long-Sequence Modeling

Author: Yuan, Danlong, Liu, Jiahao, Li, Bei, Zhang, Huishuai, Wang, Jingang, Cai, Xunliang, and Zhao, Dongyan
Subjects: Computer Science - Computation and Language
Abstract: While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models. In this study, we investigate the long-context efficiency issues of the Mamba models and propose ReMamba, which enhances Mamba's ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process, incurring minimal additional inference overhead. Experimental results on the LongBench and L-Eval benchmarks demonstrate ReMamba's efficacy, improving over the baselines by 3.2 and 1.6 points, respectively, and attaining performance almost on par with same-size transformer models.
Published: 2024

12. SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Author: Diao, Muxi, Li, Rumei, Liu, Shiyang, Liao, Guogang, Wang, Jingang, Cai, Xunliang, and Xu, Weiran
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to specifically target and explore the weaknesses of these models. To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which enhances security by leveraging data generated by the model itself. SEAS operates through three iterative stages: Initialization, Attack, and Adversarial Optimization, refining both the Red Team and Target models to improve robustness and safety. This framework reduces reliance on manual testing and significantly enhances the security capabilities of LLMs. Our contributions include a novel adversarial framework and a comprehensive safety dataset. After three iterations, the Target model achieves a security level comparable to GPT-4, while the Red Team model shows a marked increase in attack success rate (ASR) against advanced models.
Published: 2024

13. Graph-Structured Speculative Decoding

Author: Gong, Zhuocheng, Liu, Jiahao, Wang, Ziyue, Wu, Pengfei, Wang, Jingang, Cai, Xunliang, Zhao, Dongyan, and Yan, Rui
Subjects: Computer Science - Computation and Language
Abstract: Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness of this approach heavily relies on the balance between performance and efficiency of the draft model. In our research, we focus on enhancing the proportion of draft tokens that are accepted to the final output by generating multiple hypotheses instead of just one. This allows the LLM more options to choose from and select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. This structure enables us to efficiently predict and merge recurring token sequences, vastly reducing the computational demands of the draft model. We term this approach Graph-structured Speculative Decoding (GSD). We apply GSD across a range of LLMs, including a 70-billion parameter LLaMA-2 model, and observe a remarkable speedup of 1.73× to 1.96×, significantly surpassing standard speculative decoding.
Published: 2024
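
Illustrative note on entry 13: the abstract describes drafting several hypotheses and letting the LLM keep the longest sequence it accepts. The following is a minimal sketch of that selection step under greedy decoding, assuming a toy stand-in for the target model. It does not implement the paper's DAG merging of shared token sequences, and it checks tokens one at a time, whereas real systems verify a draft in a single batched forward pass. All names (`toy_target_next`, `verify_hypotheses`) are hypothetical.

```python
def accepted_prefix_len(target_next, context, draft):
    """Count the leading draft tokens that match the target model's greedy choice."""
    ctx = list(context)
    n = 0
    for tok in draft:
        if target_next(ctx) != tok:
            break
        ctx.append(tok)
        n += 1
    return n

def verify_hypotheses(target_next, context, hypotheses):
    """Keep the hypothesis with the longest accepted prefix, then add one target token."""
    best = max(hypotheses, key=lambda h: accepted_prefix_len(target_next, context, h))
    n = accepted_prefix_len(target_next, context, best)
    accepted = list(best[:n])
    # The target model always contributes its own next token after the accepted prefix,
    # so every round makes progress even if no draft token is accepted.
    accepted.append(target_next(list(context) + accepted))
    return accepted

def toy_target_next(ctx):
    """Hypothetical target model: the next token is the last token plus one, modulo 10."""
    return (ctx[-1] + 1) % 10

context = [3]
hypotheses = [[4, 5, 9], [4, 5, 6, 0], [7, 8]]
result = verify_hypotheses(toy_target_next, context, hypotheses)
print(result)  # [4, 5, 6, 7]
```

Because every emitted token equals the target model's own greedy choice, the output matches plain greedy decoding with the target model; the drafts only change how many tokens are produced per verification round.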

14. What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Author: Dou, Shihan, Jia, Haoxiang, Wu, Shenxi, Zheng, Huiyuan, Zhou, Weikang, Wu, Muling, Chai, Mingxu, Fan, Jessica, Huang, Caishuang, Tao, Yunbo, Liu, Yan, Zhou, Enyu, Zhang, Ming, Zhou, Yuhao, Wu, Yueming, Zheng, Rui, Wen, Ming, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang, Gui, Tao, Qiu, Xipeng, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of these existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyzed the root cause for common bug types. Furthermore, to better understand the performance of LLMs in real-world projects, we manually created a real-world benchmark comprising 140 code generation tasks. Our analysis highlights distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Experimental results demonstrate that our approach can significantly mitigate bugs and increase the passing rate by 29.2% after two iterations, indicating substantial potential for LLMs to handle more complex problems.
Comment: 17 pages, 7 figures
Published: 2024

15. Rethinking LLM-based Preference Evaluation

Author: Hu, Zhengyu, Song, Linxin, Zhang, Jieyu, Xiao, Zheyuan, Wang, Jingang, Chen, Zhenyu, and Xiong, Hui
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: The use of large language model (LLM)-based preference evaluations has become widespread for comparing model responses, but it has revealed a notable bias towards longer responses, questioning the reliability of such evaluations. This paper explores the length bias in LLM evaluations from a data-centric perspective, analyzing 14 commonly used preference datasets and 10 reward models. Our findings indicate that human preference labeling favors longer responses and this spurious correlation is learned by the reward model and subsequently propagated to the aligned model during training. We decompose the preference evaluation metric, i.e., win rate, from a human perspective to identify the deeper factors and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. Controlled experiments demonstrate that response length impacts evaluations by influencing information mass. To ensure reliable evaluation metrics that assess content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model's answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation. Furthermore, we investigate length bias in DPO using AlpacaEval and AdapAlpaca. By testing Tulu2 and Tulu2-dpo at 7B, 13B, and 70B scales, we found that DPO leads to higher human preference, but this gain is amplified by response length, with AlpacaEval showing higher win-rate gains than AdapAlpaca.
Published: 2024

16. EAVE: Efficient Product Attribute Value Extraction via Lightweight Sparse-layer Interaction

Author: Yang, Li, Wang, Qifan, Chi, Jianfeng, Liu, Jiahao, Wang, Jingang, Feng, Fuli, Xu, Zenglin, Fang, Yi, Huang, Lifu, and Liu, Dongfang
Subjects: Computer Science - Computation and Language
Abstract: Product attribute value extraction involves identifying the specific values associated with various attributes from a product profile. While existing methods often prioritize the development of effective models to improve extraction performance, there has been limited emphasis on extraction efficiency. However, in real-world scenarios, products are typically associated with multiple attributes, necessitating multiple extractions to obtain all corresponding values. In this work, we propose an Efficient product Attribute Value Extraction (EAVE) approach via lightweight sparse-layer interaction. Specifically, we employ a heavy encoder to separately encode the product context and attribute. The resulting non-interacting heavy representations of the context can be cached and reused for all attributes. Additionally, we introduce a light encoder to jointly encode the context and the attribute, facilitating lightweight interactions between them. To enrich the interaction within the lightweight encoder, we design a sparse-layer interaction module to fuse the non-interacting heavy representation into the lightweight encoder. Comprehensive evaluation on two benchmarks demonstrates that our method achieves significant efficiency gains with neutral or marginal loss in performance when the context is long and the number of attributes is large. Our code is available at https://anonymous.4open.science/r/EAVE-EA18.
Published: 2024

17. Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism

Author: Liu, Jiahao, Wang, Qifan, Wang, Jingang, and Cai, Xunliang
Subjects: Computer Science - Computation and Language
Abstract: The recent advancements in large language models (LLMs) have been extraordinary, yet the escalating inference costs associated with them present challenges in real-world applications. To address these challenges, we propose a novel approach called Early-exiting Speculative Decoding (EESD) with lossless acceleration. Specifically, EESD utilizes a segment of the LLM to generate draft tokens, incorporating Early-exiting structures after the first N layers. To enhance the quality of draft tokens, a self-distillation method is integrated. This early-exiting design not only reduces deployment and training costs but also significantly accelerates the token generation speed. Moreover, we introduce a novel sampling mechanism that leverages Thompson Sampling to regulate the generation processes, automatically determining the quantity of draft tokens in each round. The original LLM is then employed to validate these draft tokens through a single forward pass, and thus guarantees that the final output text maintains a distribution consistent with vanilla auto-regressive decoding. The experimental results on both 13B and 70B models demonstrate that our approach decodes tokens at a markedly accelerated rate compared to prior methods, showing the effectiveness of our approach.
Comment: Accepted by ACL 2024 (Findings)
Published: 2024

18. Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Author: Wu, Pengfei, Liu, Jiahao, Gong, Zhuocheng, Wang, Qifan, Li, Jinpeng, Wang, Jingang, Cai, Xunliang, and Zhao, Dongyan
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly evident when utilizing autoregressive decoding methods, which generate one token in a single forward process, thereby not fully capitalizing on the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the pseudo hidden states of the future tokens to be generated. The pseudo hidden states then pass through the following transformer layers, thereby assimilating more semantic information and achieving superior predictive accuracy for the future tokens. In addition, we use a novel tree attention mechanism to simultaneously generate and verify multiple candidates of output sequences, which ensures lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method, and we conduct extensive analytic experiments to support our motivation. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
Published: 2024

19. What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation

Author: Gong, Zhuocheng, Liu, Jiahao, Wang, Jingang, Cai, Xunliang, Zhao, Dongyan, and Yan, Rui
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach "the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance.
Published: 2024
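
The "lens of perturbation" described in the abstract above can be illustrated with a small sketch. It assumes plain symmetric uniform quantization over a list of weights; the function names `quantize_uniform` and `perturbation` are hypothetical and are not taken from the paper, whose actual experiments and non-uniform scheme are not reproduced here.

```python
import random


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization: snap each weight to the nearest grid point.

    Illustrative assumption: one scale shared by the whole weight list.
    """
    levels = 2 ** (bits - 1) - 1              # e.g. 7 positive levels for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights]


def perturbation(weights, bits):
    """View quantization as additive noise: delta = quantized - original."""
    quantized = quantize_uniform(weights, bits)
    return [q - w for q, w in zip(quantized, weights)]


if __name__ == "__main__":
    random.seed(0)
    w = [random.gauss(0.0, 1.0) for _ in range(1000)]
    for bits in (8, 4):
        delta = perturbation(w, bits)
        print(bits, "bits: largest perturbation =", max(abs(d) for d in delta))
```

Under this scheme each perturbation is bounded by half the grid step, so halving the bit width from 8 to 4 widens the bound by roughly a factor of 18 (127 levels versus 7).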

20. Beyond the Known: Investigating LLMs Performance on Out-of-Domain Intent Detection

Author: Wang, Pei, He, Keqing, Wang, Yejie, Song, Xiaoshuai, Mou, Yutao, Wang, Jingang, Xian, Yunsen, Cai, Xunliang, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Out-of-domain (OOD) intent detection aims to examine whether the user's query falls outside the predefined domain of the system, which is crucial for the proper functioning of task-oriented dialogue (TOD) systems. Previous methods address it by fine-tuning discriminative models. Recently, some studies have been exploring the application of large language models (LLMs) represented by ChatGPT to various downstream tasks, but their ability on the OOD detection task remains unclear. This paper conducts a comprehensive evaluation of LLMs under various experimental settings, and then outlines the strengths and weaknesses of LLMs. We find that LLMs exhibit strong zero-shot and few-shot capabilities, but are still at a disadvantage compared to models fine-tuned with full resource. More deeply, through a series of additional analysis experiments, we discuss and summarize the challenges faced by LLMs and provide guidance for future work including injecting domain knowledge, strengthening knowledge transfer from IND (in-domain) to OOD, and understanding long instructions.
Published: 2024

21. C-ICL: Contrastive In-context Learning for Information Extraction

Author: Mo, Ying, Liu, Jiahao, Yang, Jian, Wang, Qifan, Zhang, Shun, Wang, Jingang, and Li, Zhoujun
Subjects: Computer Science - Computation and Language
Abstract: There has been increasing interest in exploring the capabilities of advanced large language models (LLMs) in the field of information extraction (IE), specifically focusing on tasks related to named entity recognition (NER) and relation extraction (RE). Although researchers are exploring the use of few-shot information extraction through in-context learning with LLMs, they tend to focus only on using correct or positive examples for demonstration, neglecting the potential value of incorporating incorrect or negative examples into the learning process. In this paper, we present c-ICL, a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations. This approach enhances the ability of LLMs to extract entities and relations by utilizing prompts that incorporate not only the positive samples but also the reasoning behind them. This method allows for the identification and correction of potential interface errors. Specifically, our proposed method taps into the inherent contextual information and valuable information in hard negative samples and the nearest positive neighbors to the test and then applies the in-context learning demonstrations based on LLMs. Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods, delivering substantial enhancements in performance across a broad spectrum of related tasks. These improvements are noteworthy, showcasing the versatility of our approach in miscellaneous scenarios.
Comment: 15 pages
Published: 2024

22. DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning

Author: Wang, Yejie, He, Keqing, Dong, Guanting, Wang, Pei, Zeng, Weihao, Diao, Muxi, Mou, Yutao, Zhang, Mengdi, Wang, Jingang, Cai, Xunliang, and Xu, Weiran
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Code Large Language Models (Code LLMs) have demonstrated outstanding performance in code-related tasks. Several instruction tuning approaches have been proposed to boost the code generation performance of pre-trained Code LLMs. In this paper, we introduce a diverse instruction model (DolphCoder) with self-evaluating for code generation. It learns diverse instruction targets and combines a code evaluation objective to enhance its code generation ability. Our model achieves superior performance on the HumanEval and MBPP benchmarks, demonstrating new insights for future code instruction tuning work. Our key findings are: (1) Augmenting more diverse responses with distinct reasoning paths increases the code capability of LLMs. (2) Improving one's ability to evaluate the correctness of code solutions also enhances their ability to create it.
Comment: 14 pages, 6 figures
Published: 2024

23. Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference

Author: Wang, Lanrui, Li, Jiangnan, Yang, Chenxu, Lin, Zheng, Tang, Hongyin, Liu, Huan, Cao, Yanan, Wang, Jingang, and Wang, Weiping
Subjects: Computer Science - Computation and Language
Abstract: Recently, there has been a heightened interest in building chatbots based on Large Language Models (LLMs) to emulate human-like qualities in multi-turn conversations. Despite having access to commonsense knowledge to better understand the psychological aspects and causality of dialogue context, even these powerful LLMs struggle to achieve the goals of empathy and emotional support. Current commonsense knowledge derived from dialogue contexts is inherently limited and often fails to adequately anticipate the future course of a dialogue. This lack of foresight can mislead LLMs and hinder their ability to provide effective support. In response to this challenge, we present an innovative framework named Sensible and Visionary Commonsense Knowledge (Sibyl). Designed to concentrate on the immediately succeeding dialogue, this paradigm equips LLMs with the capability to uncover the implicit requirements of the conversation, aiming to elicit more empathetic responses. Experimental results demonstrate that incorporating our paradigm for acquiring commonsense knowledge into LLMs comprehensively enhances the quality of their responses.
Comment: Accepted by COLING 2025
Published: 2023

24. Improving Input-label Mapping with Demonstration Replay for In-context Learning

Author: Gong, Zhuocheng, Liu, Jiahao, Wang, Qifan, Wang, Jingang, Cai, Xunliang, Zhao, Dongyan, and Yan, Rui
Subjects: Computer Science - Computation and Language
Abstract: In-context learning (ICL) is an emerging capability of large autoregressive language models where a few input-label demonstrations are appended to the input to enhance the model's understanding of downstream NLP tasks, without directly adjusting the model parameters. The effectiveness of ICL can be attributed to the strong language modeling capabilities of large language models (LLMs), which enable them to learn the mapping between input and labels based on in-context demonstrations. Despite achieving promising results, the causal nature of language modeling in ICL restricts the attention to be backward only, i.e., a token only attends to its previous tokens, failing to capture the full input-label information and limiting the model's performance. In this paper, we propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention (RdSca). Specifically, we duplicate later demonstrations and concatenate them to the front, allowing the model to 'observe' the later information even under the causal restriction. Besides, we introduce sliding causal attention, which customizes causal attention to avoid information leakage. Experimental results show that our method significantly improves the input-label mapping in ICL demonstrations. We also conduct an in-depth analysis of how to customize the causal attention without training, which has been an unexplored area in previous research.
Published: 2023

25. Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression

Author: Liu, Jiduan, Liu, Jiahao, Wang, Qifan, Wang, Jingang, Cai, Xunliang, Zhao, Dongyan, Wang, Ran Lucien, and Yan, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks. However, the massive size of these models poses huge challenges for their deployment in real-world applications. While numerous model compression techniques have been proposed, most of them are not well-suited for achieving extreme model compression when there is a significant gap in model scale. In this paper, we introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT), which effectively transfers the knowledge of LLMs to extremely small-scale models (e.g., 1%). In particular, our approach extracts knowledge from LLMs to construct a knowledge store, from which the small-scale model can retrieve relevant information and leverage it for effective inference. To improve the quality of the model, soft prompt tuning and Proximal Policy Optimization (PPO) reinforcement learning techniques are employed. Extensive experiments are conducted on low-resource tasks from SuperGLUE and GLUE benchmarks. The results demonstrate that the proposed approach significantly enhances the performance of small-scale models by leveraging the knowledge from LLMs.
Comment: EMNLP 2023 Findings
Published: 2023

26. APP: Adaptive Prototypical Pseudo-Labeling for Few-shot OOD Detection

Author: Wang, Pei, He, Keqing, Mou, Yutao, Song, Xiaoshuai, Wu, Yanan, Wang, Jingang, Xian, Yunsen, Cai, Xunliang, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Detecting out-of-domain (OOD) intents from user queries is essential for a task-oriented dialogue system. Previous OOD detection studies generally work on the assumption that plenty of labeled IND intents exist. In this paper, we focus on a more practical few-shot OOD setting where there are only a few labeled IND data and massive unlabeled mixed data that may belong to IND or OOD. The new scenario carries two key challenges: learning discriminative representations using limited IND data and leveraging unlabeled mixed data. Therefore, we propose an adaptive prototypical pseudo-labeling (APP) method for few-shot OOD detection, including a prototypical OOD detection framework (ProtoOOD) to facilitate low-resource OOD detection using limited IND data, and an adaptive pseudo-labeling method to produce high-quality pseudo OOD & IND labels. Extensive experiments and analysis demonstrate the effectiveness of our method for few-shot OOD detection.
Published: 2023

27. Large Language Models Meet Open-World Intent Discovery and Recognition: An Evaluation of ChatGPT

Author: Song, Xiaoshuai, He, Keqing, Wang, Pei, Dong, Guanting, Mou, Yutao, Wang, Jingang, Xian, Yunsen, Cai, Xunliang, and Xu, Weiran
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The tasks of out-of-domain (OOD) intent discovery and generalized intent discovery (GID) aim to extend a closed intent classifier to open-world intent sets, which is crucial to task-oriented dialogue (TOD) systems. Previous methods address them by fine-tuning discriminative models. Recently, although some studies have been exploring the application of large language models (LLMs) represented by ChatGPT to various downstream tasks, the ability of ChatGPT to discover and incrementally extend OOD intents remains unclear. In this paper, we comprehensively evaluate ChatGPT on OOD intent discovery and GID, and then outline the strengths and weaknesses of ChatGPT. Overall, ChatGPT exhibits consistent advantages under zero-shot settings, but is still at a disadvantage compared to fine-tuned models. More deeply, through a series of analytical experiments, we summarize and discuss the challenges faced by LLMs including clustering, domain-specific understanding, and cross-domain in-context learning scenarios. Finally, we provide empirical guidance for future directions to address these challenges.
Comment: Accepted to EMNLP 2023 (Main Conference)
Published: 2023

28. mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning

Author: Mo, Ying, Yang, Jian, Liu, Jiahao, Wang, Qifan, Chen, Ruoyu, Wang, Jingang, and Li, Zhoujun
Subjects: Computer Science - Computation and Language
Abstract: Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both codeswitched and target sentences. This alignment extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches. It achieves a substantial increase of nearly +2.0 F1 scores across a broad spectrum and establishes itself as the new state-of-the-art performer.
Comment: 9 pages, Accepted by AAAI 2024
Published: 2023

29. Seen to Unseen: Exploring Compositional Generalization of Multi-Attribute Controllable Dialogue Generation

Author: Zeng, Weihao, Zhao, Lulu, He, Keqing, Geng, Ruotong, Wang, Jingang, Wu, Wei, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Existing controllable dialogue generation work focuses on the single-attribute control and lacks generalization capability to out-of-distribution multiple attribute combinations. In this paper, we explore the compositional generalization for multi-attribute controllable dialogue generation where a model can learn from seen attribute values and generalize to unseen combinations. We propose a prompt-based disentangled controllable dialogue generation model, DCG. It learns attribute concept composition by generating attribute-oriented prompt vectors and uses a disentanglement loss to disentangle different attributes for better generalization. Besides, we design a unified reference-free evaluation framework for multiple attributes with different levels of granularities. Experiment results on two benchmarks prove the effectiveness of our method and the evaluation metric.
Comment: ACL 2023 Main Conference
Published: 2023

30. FutureTOD: Teaching Future Knowledge to Pre-trained Language Model for Task-Oriented Dialogue

Author: Zeng, Weihao, He, Keqing, Wang, Yejie, Zeng, Chen, Wang, Jingang, Xian, Yunsen, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Pre-trained language models based on general text enable huge success in the NLP scenario. But the intrinsic difference of linguistic patterns between general text and task-oriented dialogues makes existing pre-trained language models less useful in practice. Current dialogue pre-training methods rely on a contrastive framework and face the challenges of both selecting true positives and hard negatives. In this paper, we propose a novel dialogue pre-training model, FutureTOD, which distills future knowledge to the representation of the previous dialogue context using a self-training framework. Our intuition is that a good dialogue representation both learns local context information and predicts future information. Extensive experiments on diverse downstream dialogue tasks demonstrate the effectiveness of our model, especially the generalization, robustness, and learning discriminative dialogue representations capabilities.
Comment: ACL 2023 Main Conference
Published: 2023

31. GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model

Author: Tan, Shicheng, Tam, Weng Lam, Wang, Yuanchun, Gong, Wenwen, Yang, Yang, Tang, Hongyin, He, Keqing, Liu, Jiahao, Wang, Jingang, Zhao, Shu, Zhang, Peng, and Tang, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Currently, the reduction in the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their widespread deployment on various devices. However, the deployment of knowledge distillation systems faces great challenges in real-world industrial-strength applications, which require the use of complex distillation methods on even larger-scale PLMs (over 10B), limited by memory on GPUs and the switching of methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation on larger-scale PLMs using various distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of at least 100B-scale PLMs and 25 mainstream methods on 8 NVIDIA A100 (40GB) GPUs.
Comment: accepted for ACL 2023 industry track
Published: 2023

32. PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models

Author: Gong, Zhuocheng, Liu, Jiahao, Wang, Qifan, Yang, Yang, Wang, Jingang, Wu, Wei, Xian, Yunsen, Zhao, Dongyan, and Yan, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: While transformer-based pre-trained language models (PLMs) have dominated a number of NLP applications, these models are heavy to deploy and expensive to use. Therefore, effectively compressing large-scale PLMs becomes an increasingly important problem. Quantization, which represents high-precision tensors with low-bit fixed-point format, is a viable solution. However, most existing quantization methods are task-specific, requiring customized training and quantization with a large number of trainable parameters on each individual task. Inspired by the observation that the over-parameterization nature of PLMs makes it possible to freeze most of the parameters during the fine-tuning stage, in this work, we propose a novel "quantize before fine-tuning" framework, PreQuant, that differs from both quantization-aware training and post-training quantization. PreQuant is compatible with various quantization strategies, with outlier-aware parameter-efficient fine-tuning incorporated to correct the induced quantization error. We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5. We also provide an empirical investigation into the workflow of PreQuant, which sheds light on its efficacy.
Comment: Findings of ACL 2023
Published: 2023

33. Decoupling Pseudo Label Disambiguation and Representation Learning for Generalized Intent Discovery

Author: Mou, Yutao, Song, Xiaoshuai, He, Keqing, Zeng, Chen, Wang, Pei, Wang, Jingang, Xian, Yunsen, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Generalized intent discovery aims to extend a closed-set in-domain intent classifier to an open-world intent set including in-domain and out-of-domain intents. The key challenges lie in pseudo label disambiguation and representation learning. Previous methods suffer from a coupling of pseudo label disambiguation and representation learning, that is, the reliability of pseudo labels relies on representation learning, and representation learning is restricted by pseudo labels in turn. In this paper, we propose a decoupled prototype learning framework (DPL) to decouple pseudo label disambiguation and representation learning. Specifically, we first introduce prototypical contrastive representation learning (PCL) to get discriminative representations. We then adopt a prototype-based label disambiguation method (PLD) to obtain pseudo labels. We theoretically prove that PCL and PLD work in a collaborative fashion and facilitate pseudo label disambiguation. Experiments and analysis on three benchmark datasets show the effectiveness of our method.
Comment: Accepted at ACL 2023 main conference
Published: 2023

34. RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank

Author: Liu, Jiduan, Liu, Jiahao, Wang, Qifan, Wang, Jingang, Wu, Wei, Xian, Yunsen, Zhao, Dongyan, Chen, Kai, and Yan, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Unsupervised sentence representation learning is one of the fundamental problems in natural language processing with various downstream applications. Recently, contrastive learning has been widely adopted which derives high-quality sentence representations by pulling similar semantics closer and pushing dissimilar ones away. However, these methods fail to capture the fine-grained ranking information among the sentences, where each sentence is only treated as either positive or negative. In many real-world scenarios, one needs to distinguish and rank the sentences based on their similarities to a query sentence, e.g., very relevant, moderately relevant, less relevant, irrelevant, etc. In this paper, we propose a novel approach, RankCSE, for unsupervised sentence representation learning, which incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework. In particular, we learn semantically discriminative sentence representations by simultaneously ensuring ranking consistency between two representations with different dropout masks, and distilling listwise ranking knowledge from the teacher. An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks. Experimental results demonstrate the superior performance of our approach over several state-of-the-art baselines.
Comment: ACL 2023
Published: 2023

35. Task-agnostic Distillation of Encoder-Decoder Language Models

Author: Zhang, Chen, Yang, Yang, Wang, Jingang, and Song, Dawei
Subjects: Computer Science - Computation and Language
Abstract: Finetuning pretrained language models (LMs) has enabled appealing performance on a diverse array of tasks. The intriguing task-agnostic property has driven a shifted focus from task-specific to task-agnostic distillation of LMs. While task-agnostic, compute-efficient, performance-preserved LMs can be yielded by task-agnostic distillation, previous studies mainly sit in distillation of either encoder-only LMs (e.g., BERT) or decoder-only ones (e.g., GPT) yet largely neglect that distillation of encoder-decoder LMs (e.g., T5) can posit very distinguished behaviors. Frustratingly, we discover that existing task-agnostic distillation methods can fail to handle the distillation of encoder-decoder LMs. To meet this demand, we explore a few paths and uncover a path named MiniEnD that successfully tackles the distillation of encoder-decoder LMs in a task-agnostic fashion. We examine MiniEnD on language understanding and abstractive summarization. The results showcase that MiniEnD is generally effective and is competitive compared to other alternatives. We further scale MiniEnD up to distillation of 3B encoder-decoder language models with interpolated distillation. The results imply the opportunities and challenges in distilling large language models (e.g., LLaMA).
Comment: 11 pages, 4 figures, 5 tables, work in progress. Code will be released
Published: 2023

36. Lifting the Curse of Capacity Gap in Distilling Language Models

Author: Zhang, Chen, Yang, Yang, Liu, Jiahao, Wang, Jingang, Xian, Yunsen, Wang, Benyou, and Song, Dawei
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Pretrained language models (LMs) have shown compelling performance on various downstream tasks, but unfortunately they require a tremendous amount of inference compute. Knowledge distillation finds a path to compress LMs to small ones with a teacher-student paradigm. However, when the capacity gap between the teacher and the student is large, a curse of capacity gap appears, invoking a deficiency in distilling LMs. While a few studies have been carried out to fill the gap, the curse is not yet well tackled. In this paper, we aim at lifting the curse of capacity gap via enlarging the capacity of the student without notably increasing the inference compute. Largely motivated by sparse activation regime of mixture of experts (MoE), we propose a mixture of minimal experts (MiniMoE), which imposes extra parameters to the student but introduces almost no additional inference compute. Experimental results on GLUE and CoNLL demonstrate the curse of capacity gap is lifted by the magic of MiniMoE to a large extent. MiniMoE also achieves the state-of-the-art performance at small FLOPs compared with a range of competitive baselines. With a compression rate as much as ~50×, MiniMoE preserves ~95% GLUE score of the teacher.
Comment: 17 pages, 6 figures, 13 tables, accepted to ACL 2023. Code is available at https://github.com/GeneZC/MiniMoE
Published: 2023

37. Multi-task Transformer with Relation-attention and Type-attention for Named Entity Recognition

Author: Mo, Ying, Tang, Hongyin, Liu, Jiahao, Wang, Qifan, Xu, Zenglin, Wang, Jingang, Wu, Wei, and Li, Zhoujun
Subjects: Computer Science - Computation and Language
Abstract: Named entity recognition (NER) is an important research problem in natural language processing. There are three types of NER tasks, including flat, nested and discontinuous entity recognition. Most previous sequential labeling models are task-specific, while recent years have witnessed the rise of generative models due to the advantage of unifying all NER tasks into the seq2seq model framework. Although achieving promising performance, our pilot studies demonstrate that existing generative models are ineffective at detecting entity boundaries and estimating entity types. This paper proposes a multi-task Transformer, which incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate the information into the model via the self and cross-attention mechanisms. We perform experiments on an extensive set of NER benchmarks, including two flat, three nested, and three discontinuous NER datasets. Experimental results show that our approach considerably improves the generative NER model's performance.
Comment: 5 pages, accepted at ICASSP 2023
Published: 2023

38. Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration

Author: Yan, Liqi, Wang, Qifan, Ma, Siqi, Wang, Jingang, and Yu, Changbin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos, namely STC-Seg. Concretely, STC-Seg demonstrates four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance the mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves the robustness to different object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reflects the tip of an iceberg about the innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
Published: 2022

39. UniNL: Aligning Representation Learning with Scoring Function for OOD Detection via Unified Neighborhood Learning

Author: Mou, Yutao, Wang, Pei, He, Keqing, Wu, Yanan, Wang, Jingang, Wu, Wei, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Detecting out-of-domain (OOD) intents from user queries is essential for avoiding wrong operations in task-oriented dialogue systems. The key challenge is how to distinguish in-domain (IND) and OOD intents. Previous methods ignore the alignment between representation learning and scoring function, limiting the OOD detection performance. In this paper, we propose a unified neighborhood learning framework (UniNL) to detect OOD intents. Specifically, we design a K-nearest neighbor contrastive learning (KNCL) objective for representation learning and introduce a KNN-based scoring function for OOD detection. We aim to align representation learning with scoring function. Experiments and analysis on two benchmark datasets show the effectiveness of our method.
Comment: Accepted at EMNLP 2022 main conference
Published: 2022
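
A KNN-based scoring function of the kind mentioned in the abstract above can be sketched as follows. This is a generic illustration, not the paper's formulation: it assumes the score is the mean cosine distance from a query embedding to its k nearest in-domain (IND) embeddings, and the names `cosine_distance` and `knn_ood_score` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_ood_score(query, ind_embeddings, k=3):
    """Mean distance to the k nearest IND embeddings; higher means more OOD-like."""
    distances = sorted(cosine_distance(query, e) for e in ind_embeddings)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    # Toy 2-D "embeddings" for two IND intents (assumed data, for illustration).
    ind = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
           [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
    print("in-domain-like query:", knn_ood_score([1.0, 0.05], ind))
    print("out-of-domain-like query:", knn_ood_score([-1.0, -1.0], ind))
```

In practice a query would be flagged as OOD when its score exceeds a threshold chosen on held-out data; the threshold is not specified in the abstract and is left out of the sketch.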

40. Watch the Neighbors: A Unified K-Nearest Neighbor Contrastive Learning Framework for OOD Intent Discovery

Author: Mou, Yutao, He, Keqing, Wang, Pei, Wu, Yanan, Wang, Jingang, Wu, Wei, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Discovering out-of-domain (OOD) intent is important for developing new skills in task-oriented dialogue systems. The key challenges lie in how to transfer prior in-domain (IND) knowledge to OOD clustering, as well as jointly learn OOD representations and cluster assignments. Previous methods suffer from the in-domain overfitting problem, and there is a natural gap between representation learning and clustering objectives. In this paper, we propose a unified K-nearest neighbor contrastive learning framework to discover OOD intents. Specifically, for the IND pre-training stage, we propose a KCL objective to learn inter-class discriminative features, while maintaining intra-class diversity, which alleviates the in-domain overfitting problem. For the OOD clustering stage, we propose a KCC method to form compact clusters by mining true hard negative samples, which bridges the gap between clustering and representation learning. Extensive experiments on three benchmark datasets show that our method achieves substantial improvements over the state-of-the-art methods.
Comment: Accepted at EMNLP 2022 main conference
Published: 2022

41. Semi-Supervised Knowledge-Grounded Pre-training for Task-Oriented Dialog Systems

Author: Zeng, Weihao, He, Keqing, Wang, Zechen, Fu, Dayuan, Dong, Guanting, Geng, Ruotong, Wang, Pei, Wang, Jingang, Sun, Chaobo, Wu, Wei, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Recent advances in neural approaches greatly improve task-oriented dialogue (TOD) systems which assist users to accomplish their goals. However, such systems rely on costly manually labeled dialogs which are not available in practical scenarios. In this paper, we present our models for Track 2 of the SereTOD 2022 challenge, which is the first challenge of building semi-supervised and reinforced TOD systems on a large-scale real-world Chinese TOD dataset MobileCS. We build a knowledge-grounded dialog model to formulate dialog history and local KB as input and predict the system response. And we perform semi-supervised pre-training both on the labeled and unlabeled data. Our system achieves the first place both in the automatic evaluation and human interaction, especially with higher BLEU (+7.64) and Success (+13.6%) than the second place.
Comment: Accepted at the SereTOD 2022 Workshop, EMNLP 2022
Published: 2022

42. XPrompt: Exploring the Extreme of Prompt Tuning

Author: Ma, Fang, Zhang, Chen, Ren, Lei, Wang, Jingang, Wang, Qifan, Wu, Wei, Quan, Xiaojun, and Song, Dawei
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Prompt tuning learns soft prompts to condition frozen Pre-trained Language Models (PLMs) for performing downstream tasks in a parameter-efficient manner. While prompt tuning has gradually reached the performance level of fine-tuning as the model scale increases, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate and small scales (typically less than 11B parameters). In this paper, we empirically show that the trained prompt tokens can have a negative impact on a downstream task and thus degrade its performance. To bridge the gap, we propose a novel Prompt tuning model with an eXtremely small scale (XPrompt) under the regime of lottery tickets hypothesis. Specifically, XPrompt eliminates the negative prompt tokens at different granularity levels through a hierarchical structured pruning, yielding a more parameter-efficient prompt yet with a competitive performance. Comprehensive experiments are carried out on SuperGLUE tasks, and the extensive results indicate that XPrompt is able to close the performance gap at smaller model scales.
Comment: 15 pages, accepted to EMNLP 2022 main conference
Published: 2022

43. Generalized Intent Discovery: Learning from Open World Dialogue System

Author: Mou, Yutao, He, Keqing, Wu, Yanan, Wang, Pei, Wang, Jingang, Wu, Wei, Huang, Yi, Feng, Junlan, and Xu, Weiran
Subjects: Computer Science - Computation and Language
Abstract: Traditional intent classification models are based on a pre-defined intent set and only recognize limited in-domain (IND) intent classes. But users may input out-of-domain (OOD) queries in a practical dialogue system. Such OOD queries can provide directions for future improvement. In this paper, we define a new task, Generalized Intent Discovery (GID), which aims to extend an IND intent classifier to an open-world intent set including IND and OOD intents. We hope to simultaneously classify a set of labeled IND intent classes while discovering and recognizing new unlabeled OOD types incrementally. We construct three public datasets for different application scenarios and propose two kinds of frameworks, pipeline-based and end-to-end for future work. Further, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide new guidance for future GID research.
Comment: This paper has been accepted at COLING2022
Published: 2022

44. Structural Bias for Aspect Sentiment Triplet Extraction

Author: Zhang, Chen, Ren, Lei, Ma, Fang, Wang, Jingang, Wu, Wei, and Song, Dawei
Subjects: Computer Science - Computation and Language
Abstract: Structural bias has recently been exploited for aspect sentiment triplet extraction (ASTE) and led to improved performance. On the other hand, it is recognized that explicitly incorporating structural bias would have a negative impact on efficiency, whereas pretrained language models (PLMs) can already capture implicit structures. Thus, a natural question arises: Is structural bias still a necessity in the context of PLMs? To answer the question, we propose to address the efficiency issues by using an adapter to integrate structural bias in the PLM and using a cheap-to-compute relative position structure in place of the syntactic dependency structure. Benchmarking evaluation is conducted on the SemEval datasets. The results show that our proposed structural adapter is beneficial to PLMs and achieves state-of-the-art performance over a range of strong baselines, yet with a light parameter demand and low latency. Meanwhile, we give rise to the concern that the current evaluation default with data of small scale is under-confident. Consequently, we release a large-scale dataset for ASTE. The results on the new dataset hint that the structural adapter is confidently effective and efficient to a large scale. Overall, we draw the conclusion that structural bias shall still be a necessity even with PLMs.
Comment: 10 pages, 4 figures, 5 tables, accepted to COLING 2022, code is available at https://github.com/GeneZC/StructBias
Published: 2022

45. Unified Knowledge Prompt Pre-training for Customer Service Dialogues

Author: He, Keqing, Wang, Jingang, Sun, Chaobo, and Wu, Wei
Subjects: Computer Science - Computation and Language
Abstract: Dialogue bots have been widely applied in customer service scenarios to provide timely and user-friendly experience. These bots must classify the appropriate domain of a dialogue, understand the intent of users, and generate proper responses. Existing dialogue pre-training models are designed only for several dialogue tasks and ignore weakly-supervised expert knowledge in customer service dialogues. In this paper, we propose a novel unified knowledge prompt pre-training framework, UFA (Unified Model For All Tasks), for customer service dialogues. We formulate all the tasks of customer service dialogues as a unified text-to-text generation task and introduce a knowledge-driven prompt strategy to jointly learn from a mixture of distinct dialogue tasks. We pre-train UFA on a large-scale Chinese customer service corpus collected from practical scenarios and get significant improvements on both natural language understanding (NLU) and natural language generation (NLG) benchmarks.
Comment: CIKM2022
Published: 2022

46. CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations

Author: Chen, Borun, Tang, Hongyin, Bu, Jiahao, Zhang, Kai, Wang, Jingang, Wang, Qifan, Zheng, Hai-Tao, Wu, Wei, and Yu, Liqian
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding. Various Chinese PLMs have been successively proposed for learning better Chinese language representation. However, most current models use Chinese characters as inputs and are not able to encode semantic information contained in Chinese words. While recent pre-trained models incorporate both words and characters simultaneously, they usually suffer from deficient semantic interactions and fail to capture the semantic relation between words and characters. To address the above issues, we propose a simple yet effective PLM CLOWER, which adopts the Contrastive Learning Over Word and charactER representations. In particular, CLOWER implicitly encodes the coarse-grained information (i.e., words) into the fine-grained representations (i.e., characters) through contrastive learning on multi-grained information. CLOWER is of great value in realistic scenarios since it can be easily incorporated into any existing fine-grained based PLMs without modifying the production pipelines. Extensive experiments conducted on a range of downstream tasks demonstrate the superior performance of CLOWER over several state-of-the-art baselines.
Comment: Accepted in COLING 2022
Published: 2022

47. MiniDisc: Minimal Distillation Schedule for Language Model Compression

Author: Zhang, Chen, Yang, Yang, Wang, Qifan, Liu, Jiahao, Wang, Jingang, Wu, Wei, and Song, Dawei
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent studies have uncovered that language model distillation is less effective when facing a large capacity gap between the teacher and the student, and introduced teacher assistant-based distillation to bridge the gap. As a connection, the scale and the performance of the teacher assistant is of vital importance to bring the knowledge from the teacher to the student. However, existing teacher assistant-based methods require maximally many trials before scheduling an optimal teacher assistant. To this end, we propose a minimal distillation schedule (MiniDisc) for scheduling the optimal teacher assistant in minimally one trial. In particular, motivated by the finding that the performance of the student is positively correlated to the scale-performance tradeoff of the teacher assistant, MiniDisc is designed with a $\lambda$-tradeoff to measure the optimality of the teacher assistant without trial distillation to the student. MiniDisc then can schedule the optimal teacher assistant with the best $\lambda$-tradeoff in a sandwich framework. MiniDisc is evaluated with an extensive set of experiments on GLUE. Experimental results demonstrate the improved efficiency of MiniDisc compared to several state-of-the-art baselines. We further apply MiniDisc to a language model with billions of parameters and show its scalability.
Comment: Accepted to EACL 2024. Code is available at https://github.com/GeneZC/MiniDisc
Published: 2022

48. Making Pretrained Language Models Good Long-tailed Learners

Author: Zhang, Chen, Ren, Lei, Wang, Jingang, Wu, Wei, and Song, Dawei
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Prompt-tuning has shown appealing performance in few-shot classification by virtue of its capability in effectively exploiting pre-trained knowledge. This motivates us to check the hypothesis that prompt-tuning is also a promising choice for long-tailed classification, since the tail classes are intuitively few-shot ones. To achieve this aim, we conduct empirical studies to examine the hypothesis. The results demonstrate that prompt-tuning makes pretrained language models at least good long-tailed learners. For intuitions on why prompt-tuning can achieve good performance in long-tailed classification, we carry out in-depth analyses by progressively bridging the gap between prompt-tuning and commonly used finetuning. The summary is that the classifier structure and parameterization form the key to making good long-tailed learners, in comparison with the less important input structure. Finally, we verify the applicability of our finding to few-shot classification. Good long-tailed learners can be abbreviated as Glee.
Comment: 15 pages, 4 figures, 10 tables, accepted to EMNLP 2022. Code is available at https://github.com/GeneZC/Glee
Published: 2022

49. GNN-encoder: Learning a Dual-encoder Architecture via Graph Neural Networks for Dense Passage Retrieval

Author: Liu, Jiduan, Liu, Jiahao, Yang, Yang, Wang, Jingang, Wu, Wei, Zhao, Dongyan, and Yan, Rui
Subjects: Computer Science - Information Retrieval
Abstract: Recently, retrieval models based on dense representations are dominant in passage retrieval tasks, due to their outstanding ability in terms of capturing semantics of input text compared to the traditional sparse vector space models. A common practice of dense retrieval models is to exploit a dual-encoder architecture to represent a query and a passage independently. Though efficient, such a structure loses interaction between the query-passage pair, resulting in inferior accuracy. To enhance the performance of dense retrieval models without loss of efficiency, we propose a GNN-encoder model in which query (passage) information is fused into passage (query) representations via graph neural networks that are constructed by queries and their top retrieved passages. By this means, we maintain a dual-encoder structure, and retain some interaction information between query-passage pairs in their representations, which enables us to achieve both efficiency and efficacy in passage retrieval. Evaluation results indicate that our method significantly outperforms the existing models on MSMARCO, Natural Questions and TriviaQA datasets, and achieves the new state-of-the-art on these datasets.
Comment: Findings of EMNLP2022
Published: 2022

50. Deep Partial Multiplex Network Embedding

Author: Wang, Qifan, Fang, Yi, Ravula, Anirudh, He, Ruining, Shen, Bin, Wang, Jingang, Quan, Xiaojun, and Liu, Dongfang
Subjects: Computer Science - Machine Learning, Computer Science - Social and Information Networks
Abstract: Network embedding is an effective technique to learn the low-dimensional representations of nodes in networks. Real-world networks are usually with multiplex or having multi-view representations from different relations. Recently, there has been increasing interest in network embedding on multiplex data. However, most existing multiplex approaches assume that the data is complete in all views. But in real applications, it is often the case that each view suffers from the missing of some data and therefore results in partial multiplex data. In this paper, we present a novel Deep Partial Multiplex Network Embedding approach to deal with incomplete data. In particular, the network embeddings are learned by simultaneously minimizing the deep reconstruction loss with the autoencoder neural network, enforcing the data consistency across views via common latent subspace learning, and preserving the data topological structure within the same network through graph Laplacian. We further prove the orthogonal invariant property of the learned embeddings and connect our approach with the binary embedding techniques. Experiments on four multiplex benchmarks demonstrate the superior performance of the proposed approach over several state-of-the-art methods on node classification, link prediction and clustering tasks.
Comment: Accepted to WWW 2022 GL workshop
Published: 2022
