1. TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios
- Author
- Xiaokang Zhang, Jing Zhang, Zeyao Ma, Yang Li, Bohan Zhang, Guanlin Li, Zijun Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, Jifan Yu, Shu Zhao, Juanzi Li, and Jie Tang
- Abstract
We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for handling tabular data manipulation tasks, whether the tables are embedded in documents or spreadsheets, catering to real-world office scenarios. We propose a distant supervision method for training, which comprises a reasoning process extension strategy, which helps LLMs learn reasoning patterns more effectively, and a cross-way validation strategy, which ensures the quality of the automatically generated data. To evaluate the performance of TableLLM, we have crafted a benchmark covering both document and spreadsheet formats and constructed a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM over various existing general-purpose and tabular-data-focused LLMs. We have publicly released the model checkpoint, source code, benchmarks, and a web application for user interaction. Our code and data are publicly available at https://github.com/TableLLM/TableLLM.
- Comment
- https://tablellm.github.io
- Published
- 2024