7,425 results for "Liu, Pengfei"
Search Results
2. The Influence of Projected Outcomes on Preferences over Alternative Regulations: Evidence from a Recreational Fishery
- Author
- Chen, Zhenshan, Liu, Pengfei, Schultz, Eric T., Kasper, Jacob M., and Swallow, Stephen K.
- Published
- 2022
3. Incentive Compatibility and the Consequences When It Is Missing: Experiments with Water Quality Credits Purchase
- Author
- Liu, Pengfei and Swallow, Stephen K.
- Published
- 2022
4. Heavy flavor-asymmetric pseudoscalar mesons on the light front
- Author
- Shi, Chao, Liu, Pengfei, Du, Yi-Lun, and Jia, Wenbao
- Subjects
- High Energy Physics - Phenomenology, Nuclear Theory
- Abstract
We extract the leading Fock-state light front wave functions (LF-LFWFs) of heavy flavor-asymmetric pseudoscalar mesons $D$, $B$ and $B_c$ from their Bethe-Salpeter wave functions based on the Dyson-Schwinger equations approach, and study their leading twist parton distribution amplitudes, generalized parton distribution functions and transverse momentum dependent parton distributions. The spatial distributions of the quark and antiquark on the transverse plane are given, along with their charge and energy distributions on the light front. We find that in the considered mesons, the heavier quarks carry most of the longitudinal momentum fraction and yield narrow $x$-distributions, while the lighter quarks play an active role in shaping the transverse distributions in both spatial and momentum space, exhibiting a duality embodying characteristics of both light mesons and heavy quarkonium., Comment: 10 pages, 9 figures
- Published
- 2024
5. RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
- Author
- Ru, Dongyu, Qiu, Lin, Hu, Xiangkun, Zhang, Tianhang, Shi, Peng, Chang, Shuaichen, Jiayang, Cheng, Wang, Cunxiang, Sun, Shichao, Li, Huanyu, Zhang, Zizhao, Wang, Binjie, Jiang, Jiarong, He, Tong, Wang, Zhiguo, Liu, Pengfei, Zhang, Yue, and Zhang, Zheng
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses and reliability of measurements. In this paper, we propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules. Meta evaluation verifies that RAGChecker has significantly better correlations with human judgments than other evaluation metrics. Using RAGChecker, we evaluate 8 RAG systems and conduct an in-depth analysis of their performance, revealing insightful patterns and trade-offs in the design choices of RAG architectures. The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems. This work has been open sourced at https://github.com/amazon-science/RAGChecker., Comment: Under Review. Github Repo: https://github.com/amazon-science/RAGChecker
- Published
- 2024
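As a toy illustration of the claim-level scoring style used in fine-grained RAG evaluation (the function name and exact-match comparison below are simplifying assumptions, not RAGChecker's actual metrics, which match claims by entailment):

```python
# Toy claim-level precision/recall; real checkers match claims by
# entailment rather than exact string equality (assumption for brevity).
def claim_metrics(response_claims, gold_claims):
    response_claims, gold_claims = set(response_claims), set(gold_claims)
    correct = response_claims & gold_claims
    precision = len(correct) / len(response_claims) if response_claims else 0.0
    recall = len(correct) / len(gold_claims) if gold_claims else 0.0
    return precision, recall

p, r = claim_metrics(
    ["the sky is blue", "grass is red"],
    ["the sky is blue", "grass is green"],
)
print(p, r)  # 0.5 0.5
```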
6. Topological Charge Quadrupole Protected by Spin-Orbit U(1) Quasi-Symmetry in Antiferromagnet NdBiPt
- Author
- Zhang, Ao, Chen, Xiaobing, Li, Jiayu, Liu, Pengfei, Liu, Yuntian, and Liu, Qihang
- Subjects
- Condensed Matter - Materials Science
- Abstract
The interplay of symmetry and topology in crystal solids has given rise to various elementary excitations as quasiparticles. Among these, those with significant Berry-phase-related transport responses are of particular interest. Here, we predict a new type of quasiparticle called topological charge quadrupole (TCQ), which is analogous to a charge quadrupole but consists of two closely-packed pairs of Weyl points in momentum space, specifically in a half-Heusler antiferromagnet NdBiPt. Interestingly, the TCQ is protected by the spin-orbit $U(1)$ quasi-symmetry, rather than any exact crystallographic symmetries. This quasi-symmetry restricts the energy splitting induced by symmetry-lowering perturbations to a second-order effect. Furthermore, the closely located Berry curvature sources and sinks in the TCQ lead to a large Berry curvature dipole, resulting in a significant nonlinear Hall effect. Our work opens an avenue for designing novel quasiparticles using quasi-symmetries and developing materials with enhanced nonlinear responses., Comment: 4 figures
- Published
- 2024
7. OpenResearcher: Unleashing AI for Accelerated Scientific Research
- Author
- Zheng, Yuxiang, Sun, Shichao, Qiu, Lin, Ru, Dongyu, Jiayang, Cheng, Li, Xuefeng, Lin, Jifan, Wang, Binjie, Luo, Yun, Pan, Renjie, Xu, Yang, Min, Qingkai, Zhang, Zizhao, Wang, Yiwen, Li, Wenjie, and Liu, Pengfei
- Subjects
- Computer Science - Information Retrieval
- Abstract
The rapid growth of scientific literature poses significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas. We introduce OpenResearcher, an innovative platform that leverages Artificial Intelligence (AI) techniques to accelerate the research process by answering diverse questions from researchers. OpenResearcher is built on Retrieval-Augmented Generation (RAG) to integrate Large Language Models (LLMs) with up-to-date, domain-specific knowledge. Moreover, we develop various tools for OpenResearcher to understand researchers' queries, search the scientific literature, filter retrieved information, provide accurate and comprehensive answers, and self-refine these answers. OpenResearcher can flexibly use these tools to balance efficiency and effectiveness. As a result, OpenResearcher enables researchers to save time and increase their potential to discover new insights and drive scientific breakthroughs. Demo, video, and code are available at: https://github.com/GAIR-NLP/OpenResearcher.
- Published
- 2024
8. Data Contamination Report from the 2024 CONDA Shared Task
- Author
- Sainz, Oscar, García-Ferrero, Iker, Jacovi, Alon, Campos, Jon Ander, Elazar, Yanai, Agirre, Eneko, Goldberg, Yoav, Chen, Wei-Lin, Chim, Jenny, Choshen, Leshem, D'Amico-Wong, Luca, Dell, Melissa, Fan, Run-Ze, Golchin, Shahriar, Li, Yucheng, Liu, Pengfei, Pahwa, Bhavish, Prabhu, Ameya, Sharma, Suryansh, Silcock, Emily, Solonko, Kateryna, Stap, David, Surdeanu, Mihai, Tseng, Yu-Min, Udandarao, Vishaal, Wang, Zengzhi, Xu, Ruijie, and Yang, Jinglin
- Subjects
- Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large-scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in currently available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community., Comment: https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database
- Published
- 2024
9. OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance
- Author
- Yao, Yongqiang, Tan, Jingru, Hu, Jiahao, Zhang, Feizhao, Jin, Xin, Li, Bo, Gong, Ruihao, and Liu, Pengfei
- Subjects
- Computer Science - Artificial Intelligence
- Abstract
Recently, vision-language instruct-tuning models have made significant progress due to their more comprehensive understanding of the world. In this work, we discovered that large-scale 3D parallel training on those models leads to an imbalanced computation load across different devices. The vision and language parts are inherently heterogeneous: their data distribution and model architecture differ significantly, which affects distributed training efficiency. We rebalanced the computational loads from data, model, and memory perspectives to address this issue, achieving more balanced computation across devices. These three components are not independent but are closely connected, forming an omniverse balanced training framework. Specifically, for the data, we grouped instances into new balanced mini-batches within and across devices. For the model, we employed a search-based method to achieve a more balanced partitioning. For memory optimization, we adaptively adjusted the re-computation strategy for each partition to utilize the available memory fully. We conducted extensive experiments to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, we significantly reduced GPU days, achieving about 1.8x speed-up. Our method's efficacy and generalizability were further demonstrated across various models and datasets. Codes will be released at https://github.com/ModelTC/OmniBal.
- Published
- 2024
10. SAFETY-J: Evaluating Safety with Critique
- Author
- Liu, Yixiu, Zheng, Yuxiang, Xia, Shijie, Li, Jiajun, Tu, Yi, Song, Chaoling, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language
- Abstract
The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an iterative preference learning technique to dynamically refine safety assessments based on meta-evaluations and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we open-source SAFETY-J's training protocols, datasets, and code at https://github.com/GAIR-NLP/Safety-J.
- Published
- 2024
11. Understanding Reference Policies in Direct Preference Optimization
- Author
- Liu, Yixin, Liu, Pengfei, and Cohan, Arman
- Subjects
- Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO's effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of the KL-constraint from the reference policies in DPO by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO's superiority in this controlled setting. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies., Comment: GitHub Repo: https://github.com/yale-nlp/refdpo
- Published
- 2024
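For context alongside this entry, the published DPO objective, in which the strength of the implicit KL constraint toward the reference policy $\pi_{\mathrm{ref}}$ is controlled by $\beta$, can be written as:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\sigma$ is the logistic function; a larger $\beta$ penalizes deviations from $\pi_{\mathrm{ref}}$ more strongly, which is the sensitivity the abstract investigates.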
12. Weak-to-Strong Reasoning
- Author
- Yang, Yuqing, Ma, Yan, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
When large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervision for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context. Yet, the efficacy of this approach for complex reasoning tasks is still untested. Furthermore, tackling reasoning tasks under the weak-to-strong setting currently lacks efficient methods to avoid blindly imitating the weak supervisor, including its errors. In this paper, we introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from either a more advanced model or human-annotated data. This framework begins with supervised fine-tuning on a small but high-quality, selectively curated dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that our method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. This method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers. All relevant code and resources are available in \url{https://github.com/GAIR-NLP/weak-to-strong-reasoning}.
- Published
- 2024
13. Halu-J: Critique-Based Hallucination Judge
- Author
- Wang, Binjie, Chern, Steffi, Chern, Ethan, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Large language models (LLMs) frequently generate non-factual content, known as hallucinations. Existing retrieval-augmented hallucination detection approaches typically address this by framing it as a classification task, evaluating hallucinations based on their consistency with retrieved evidence. However, this approach usually lacks detailed explanations for these evaluations and does not assess the reliability of these explanations. Furthermore, deficiencies in retrieval systems can lead to irrelevant or partially relevant evidence retrieval, impairing the detection process. Moreover, while real-world hallucination detection requires analyzing multiple pieces of evidence, current systems usually treat all evidence uniformly without considering its relevance to the content. To address these challenges, we introduce Halu-J, a critique-based hallucination judge with 7 billion parameters. Halu-J enhances hallucination detection by selecting pertinent evidence and providing detailed critiques. Our experiments indicate that Halu-J outperforms GPT-4o in multiple-evidence hallucination detection and matches its capability in critique generation and evidence selection. We also introduce ME-FEVER, a new dataset designed for multiple-evidence hallucination detection. Our code and dataset can be found at https://github.com/GAIR-NLP/factool.
- Published
- 2024
14. ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
- Author
- Chern, Ethan, Su, Jiadi, Ma, Yan, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
- Abstract
Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mitigate these limitations, we present Anole, an open, autoregressive, native large multimodal model for interleaved image-text generation. We build Anole from Meta AI's Chameleon, adopting an innovative fine-tuning strategy that is both data-efficient and parameter-efficient. Anole demonstrates high-quality, coherent multimodal generation capabilities. We have open-sourced our model, training framework, and instruction tuning data.
- Published
- 2024
15. Progress or Regress? Self-Improvement Reversal in Post-training
- Author
- Wu, Ting, Li, Xuefeng, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language
- Abstract
Self-improvement through post-training methods such as iterative preference learning has been acclaimed for enhancing the problem-solving capabilities (e.g., mathematical reasoning) of Large Language Models (LLMs) without human intervention. However, as exploration deepens, it becomes crucial to assess whether these improvements genuinely signify progress in solving more challenging problems or if they could lead to unintended regressions. To address this, we propose a comprehensive evaluative framework that goes beyond the superficial pass@1 metric to scrutinize the underlying enhancements of post-training paradigms for self-improvement. Through rigorous experimentation and analysis across diverse problem-solving tasks, the empirical results point out the phenomenon of \emph{self-improvement reversal}, where models showing improved performance across benchmarks will paradoxically exhibit declines in broader, essential capabilities, like output diversity and out-of-distribution (OOD) generalization. These findings indicate that current self-improvement practices through post-training are inadequate for equipping models to tackle more complex problems. Furthermore, they underscore the necessity of our critical evaluation metrics in discerning the \emph{progress or regress} dichotomy for self-improving LLMs.
- Published
- 2024
16. FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models
- Author
- Li, Yiyuan, Sun, Shichao, and Liu, Pengfei
- Subjects
- Computer Science - Artificial Intelligence, Computer Science - Computation and Language
- Abstract
Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark., Comment: Under review
- Published
- 2024
17. Electric-field control of the perpendicular magnetization switching in ferroelectric/ferrimagnet heterostructures
- Author
- Liu, Pengfei, Xu, Tao, Liu, Qi, Dong, Juncai, Lin, Ting, Zhang, Qinhua, Lan, Xiukai, Sheng, Yu, Wang, Chunyu, Pei, Jiajing, Yang, Hongxin, Gu, Lin, and Wang, Kaiyou
- Subjects
- Condensed Matter - Materials Science
- Abstract
Electric field control of the magnetic state in ferrimagnets holds great promise for developing spintronic devices due to low power consumption. Here, we demonstrate a non-volatile reversal of perpendicular net magnetization in a ferrimagnet by manipulating the electric-field driven polarization within the Pb(Zr$_{0.2}$Ti$_{0.8}$)O$_3$ (PZT)/CoGd heterostructure. Electron energy loss spectra and X-ray absorption spectra directly verify that the oxygen ion migration at the PZT/CoGd interface associated with reversing the polarization causes the enhanced/reduced oxidation in CoGd. Ab initio calculations further substantiate that the migrated oxygen ions can modulate the relative magnetization of Co/Gd sublattices, facilitating perpendicular net magnetization switching. Our findings offer an approach to effectively control ferrimagnetic net magnetization, holding significant implications for ferrimagnetic spintronic applications., Comment: 21 pages, 4 figures
- Published
- 2024
18. OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
- Author
- Huang, Zhen, Wang, Zengzhi, Xia, Shijie, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena)., Comment: 10 pages
- Published
- 2024
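The medal-table ordering described above can be sketched in a few lines (the medal counts and the tie-breaking rule of golds first, then silvers, then bronzes are assumptions for illustration, not the paper's reported numbers):

```python
# Hypothetical medal counts per model as (gold, silver, bronze);
# Python's lexicographic tuple comparison ranks by golds first,
# then silvers break ties, then bronzes.
medal_counts = {
    "model-a": (5, 2, 1),
    "model-b": (3, 4, 4),
    "model-c": (5, 3, 0),
}

ranking = sorted(medal_counts, key=medal_counts.get, reverse=True)
print(ranking)  # ['model-c', 'model-a', 'model-b']
```

Note how model-c outranks model-a on silver count despite equal golds, which is the defining feature of a medal table versus a single aggregate score.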
19. MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
- Author
- Liu, Mianxin, Ding, Jinru, Xu, Jie, Hu, Weiguo, Li, Xiaoyang, Zhu, Lifeng, Bai, Zhian, Shi, Xiaoming, Wang, Benyou, Song, Haitao, Liu, Pengfei, Zhang, Xiaofan, Wang, Shanshan, Li, Kang, Wang, Haofen, Ruan, Tong, Huang, Xuanjing, Sun, Xin, and Zhang, Shaoting
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Ensuring the general efficacy and benefit for human beings of medical large language models (LLMs) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, remains to be established. In this work, we introduce "MedBench", a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the currently largest evaluation dataset (300,901 questions) to cover 43 clinical specialties and performs multi-facet evaluation on medical LLMs. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separations for question and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results largely aligning with medical professionals' perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn., Comment: 25 pages, 4 figures
- Published
- 2024
20. BeHonest: Benchmarking Honesty in Large Language Models
- Author
- Chern, Steffi, Hu, Zhulin, Yang, Yuqing, Chern, Ethan, Guo, Yuan, Jin, Jiahe, Wang, Binjie, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Previous works on Large Language Models (LLMs) have mainly focused on evaluating their helpfulness or harmlessness. However, honesty, another crucial alignment criterion, has received relatively less attention. Dishonest behaviors in LLMs, such as spreading misinformation and defrauding users, present severe risks that intensify as these models approach superintelligent levels. Enhancing honesty in LLMs addresses critical limitations and helps uncover latent capabilities that are not readily expressed. This underscores the urgent need for reliable methods and benchmarks to effectively ensure and evaluate the honesty of LLMs. In this paper, we introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in LLMs comprehensively. BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses. Building on this foundation, we designed 10 scenarios to evaluate and analyze 9 popular LLMs on the market, including both closed-source and open-source models from different model families with varied model sizes. Our findings indicate that there is still significant room for improvement in the honesty of LLMs. We encourage the AI community to prioritize honesty alignment in these models, which can harness their full potential to benefit society while preventing them from causing harm through deception or inconsistency. Our benchmark and code can be found at: \url{https://github.com/GAIR-NLP/BeHonest}.
- Published
- 2024
21. OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
- Author
- Huang, Zhen, Wang, Zengzhi, Xia, Shijie, Li, Xuefeng, Zou, Haoyang, Xu, Ruijie, Fan, Run-Ze, Ye, Lyumanshan, Chern, Ethan, Ye, Yixin, Zhang, Yikai, Yang, Yuqing, Wu, Ting, Wang, Binjie, Sun, Shichao, Xiao, Yang, Li, Yiyuan, Zhou, Fan, Chern, Steffi, Qin, Yiwei, Ma, Yan, Su, Jiadi, Liu, Yixiu, Zheng, Yuxiang, Zhang, Shaoting, Lin, Dahua, Qiao, Yu, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features., Comment: 44 pages
- Published
- 2024
22. MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation
- Author
- Ma, Yan, Qiao, Yu, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language
- Abstract
A story premise succinctly defines a story's main idea, foundation, and trajectory. It serves as the initial trigger in automatic story generation. Existing sources of story premises are limited by a lack of diversity, uneven quality, and high costs that make them difficult to scale. In response, we introduce Modular Story Premise Synthesis (MoPS) which breaks down story premises into modules like background and persona for automated design and generation. MoPS consists of three phases: (1) Precollect a consistent set of candidates for each module to form a nested dictionary. (2) Extract a key path from the nested dictionary as the premise design. (3) Instruct an LLM to integrate the design into a coherent premise sentence. Thorough evaluations demonstrate that our synthesized premises excel in diversity, fascination, completeness, and originality compared to those induced from large language models and captured from public story datasets. Similarly, the extended novels and scripts generated from our premises also exhibit higher quality. In supplementary materials, we provide the MoPS code suite, along with 7.6k generated premises and 1k extended stories. Code: https://github.com/GAIR-NLP/MoPS., Comment: ACL 2024, camera-ready
- Published
- 2024
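The three MoPS phases lend themselves to a compact sketch (the module names, candidate contents, and the final sentence template below are illustrative assumptions; the paper instructs an LLM to perform the final integration step rather than filling a template):

```python
import random

# (1) Pre-collected module candidates as a nested dictionary:
#     background -> persona -> list of candidate events.
modules = {
    "medieval kingdom": {
        "exiled knight": ["reclaim the throne", "break an ancient curse"],
        "orphaned scribe": ["expose a royal conspiracy"],
    },
    "space colony": {
        "rogue engineer": ["avert a reactor meltdown"],
    },
}

def sample_premise(modules, rng):
    # (2) Extract one key path through the nested dictionary
    #     as the premise design (sorted for deterministic order).
    background = rng.choice(sorted(modules))
    persona = rng.choice(sorted(modules[background]))
    event = rng.choice(sorted(modules[background][persona]))
    # (3) Integrate the design into a coherent premise sentence
    #     (a template stands in for the LLM here).
    article = "an" if persona[0] in "aeiou" else "a"
    return f"In a {background}, {article} {persona} must {event}."

print(sample_premise(modules, random.Random(0)))
```

Because every candidate is pre-collected per module, new premises scale combinatorially with the candidate pools, which is the diversity argument the abstract makes.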
23. Prompt Chaining or Stepwise Prompt? Refinement in Text Summarization
- Author
- Sun, Shichao, Yuan, Ruifeng, Cao, Ziqiang, Li, Wenjie, and Liu, Pengfei
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Large language models (LLMs) have demonstrated the capacity to improve summary quality by mirroring a human-like iterative process of critique and refinement starting from the initial draft. Two strategies are designed to perform this iterative process: prompt chaining and stepwise prompt. Prompt chaining orchestrates the drafting, critiquing, and refining phases through a series of three discrete prompts, while stepwise prompt integrates these phases within a single prompt. However, the relative effectiveness of the two methods has not been extensively studied. This paper is dedicated to examining and comparing these two methods in the context of text summarization to ascertain which method stands out as the most effective. Experimental results show that the prompt chaining method can produce a more favorable outcome. Our various experiments suggest this is because the stepwise prompt may produce only a simulated refinement process. Since refinement is adaptable to diverse tasks, our conclusions have the potential to be extrapolated to other applications, thereby offering insights that may contribute to the broader development of LLMs., Comment: Accepted to Findings of ACL 2024
- Published
- 2024
24. RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models
- Author
-
Hu, Xiangkun, Ru, Dongyu, Qiu, Lin, Guo, Qipeng, Zhang, Tianhang, Xu, Yang, Luo, Yun, Liu, Pengfei, Zhang, Yue, and Zhang, Zheng
- Subjects
Computer Science - Computation and Language - Abstract
Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection, compared to other granularities such as response, sentence and sub-sentence level claims. RefChecker outperforms prior methods by 6.8 to 26.1 points on our benchmark and the checking results of RefChecker are strongly aligned with human judgments. This work is open sourced at https://github.com/amazon-science/RefChecker
- Published
- 2024
25. Observation of Spin Splitting in Room-Temperature Metallic Antiferromagnet CrSb
- Author
-
Zeng, Meng, Zhu, Ming-Yuan, Zhu, Yu-Peng, Liu, Xiang-Rui, Ma, Xiao-Ming, Hao, Yu-Jie, Liu, Pengfei, Qu, Gexing, Yang, Yichen, Jiang, Zhicheng, Yamagami, Kohei, Arita, Masashi, Zhang, Xiaoqian, Shao, Tian-Hao, Dai, Yue, Shimada, Kenya, Liu, Zhengtai, Ye, Mao, Huang, Yaobo, Liu, Qihang, and Liu, Chang
- Subjects
Condensed Matter - Materials Science - Abstract
Recently, unconventional antiferromagnets that enable the splitting of electronic spins have been theoretically proposed and experimentally realized, where the magnetic sublattices containing moments pointing at different directions are connected by a novel set of symmetries. Such spin splitting (SS) is substantial, $k$-dependent, and independent of the spin-orbit coupling strength, making these magnets promising materials for antiferromagnetic spintronics. Here, combining angle-resolved photoemission spectroscopy (ARPES) and density functional theory (DFT) calculations, we perform a systematic study on CrSb, a metallic spin-split antiferromagnet candidate with $T_N$ = 703 K. Our data reveal the electronic structure of CrSb along both out-of-plane and in-plane momentum directions, which renders anisotropic $k$-dependent SS and agrees well with the calculated results. The magnitude of such SS reaches at least 0.8 eV at non-high-symmetry momentum points, which is significantly higher than the largest known SOC-induced SS. This compound expands the choice of materials in the field of antiferromagnetic spintronics and is likely to stimulate subsequent investigations of high-efficiency spintronic devices that are functional at room temperature., Comment: 14 pages, 4 figures
- Published
- 2024
26. Storypark: Leveraging Large Language Models to Enhance Children Story Learning Through Child-AI collaboration Storytelling
- Author
-
Ye, Lyumanshan, Jiang, Jiandong, Chang, Danni, and Liu, Pengfei
- Subjects
Computer Science - Human-Computer Interaction - Abstract
Interactive storytelling has been widely adopted by educators in teaching activities for young children. Such a teaching method combines storytelling with active child participation, benefiting their expressive abilities, creative thinking, and understanding of stories. Interactive storytelling requires facilitators to narrate the story content while encouraging children's participation in story plot creation and interpretation of central themes through multi-sensory interactive methods such as questioning and drawing. However, providing tailored guidance based on diverse feedback from children during interactive storytelling poses challenges for most facilitators. These challenges include expanding story plot development based on children's ideas, using drawings to visualize children's thoughts, and interpreting the story's central themes based on children's thinking. This necessitates facilitators possessing strong imagination, association, domain knowledge, and drawing skills. Large language models have demonstrated their potential in facilitating responsive and participatory dialogues, offering new design possibilities to address the challenges faced by facilitators in interactive storytelling. In this study, our goal is to leverage large language models to design an interactive storytelling system that provides children with plot frameworks and interpretations of central themes during the interactive storytelling process. Through user experiments involving 20 child participants, we evaluate this interactive system's usability, learning effectiveness, and user experience. The user study shows that Storypark improves learning outcomes in understanding story key ideas, generalization, and transfer. Participants' high engagement and willingness to use the system demonstrate that Storypark provides children with a positive learning experience.
- Published
- 2024
27. Benchmarking Benchmark Leakage in Large Language Models
- Author
-
Xu, Ruijie, Wang, Zengzhi, Fan, Run-Ze, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark, to identify potential data leakages. By analyzing 31 LLMs in the context of mathematical reasoning, we reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and the healthy development of LLMs. We have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research., Comment: 30 pages; Homepage: https://gair-nlp.github.io/benbench
- Published
- 2024
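The N-gram accuracy signal described above can be illustrated with a toy sketch: repeatedly ask the model to continue a benchmark prefix and check how often it reproduces the benchmark's own next n-gram verbatim. The whitespace tokenization, the memorized-vs-unseen "models", and the exact scoring are simplifying assumptions, not the paper's pipeline.

```python
def ngram_accuracy(model_continuations, benchmark_text, n=3):
    """Toy n-gram accuracy: fraction of positions at which the model's
    predicted next n tokens exactly reproduce the benchmark's own n-gram.
    A suspiciously high score suggests the benchmark leaked into training.
    `model_continuations(prefix_tokens, n)` -> list of n predicted tokens.
    """
    tokens = benchmark_text.split()
    hits, total = 0, 0
    for i in range(1, len(tokens) - n + 1):
        predicted = model_continuations(tokens[:i], n)
        hits += predicted == tokens[i:i + n]   # exact-match n-gram
        total += 1
    return hits / total if total else 0.0

# A "model" that has memorized the benchmark scores 1.0 ...
memorized = "solve for x in 2 x plus 3 equals 7".split()
leaky = lambda prefix, n: memorized[len(prefix):len(prefix) + n]
# ... while a model that has never seen it scores near 0.
clueless = lambda prefix, n: ["the"] * n

print(ngram_accuracy(leaky, " ".join(memorized)))     # 1.0
print(ngram_accuracy(clueless, " ".join(memorized)))  # 0.0
```

The gap between the two scores is the leakage signal; a real pipeline would compare a model's score on the benchmark against its score on held-out text of similar difficulty.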
28. RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance
- Author
-
Wang, Chengrui, Liu, Pengfei, Zhou, Min, Zeng, Ming, Li, Xubin, Ge, Tiezheng, and Zheng, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. Some previous works mitigate the problem by considering hand structure yet struggle to maintain style consistency between refined malformed hands and other image regions. In this paper, we aim to solve the problem of inconsistency regarding hand structure and style. We propose a conditional diffusion-based framework, RHanDS, to refine the hand region with the help of decoupled structure and style guidance. Specifically, the structure guidance is the hand mesh reconstructed from the malformed hand, serving to correct the hand structure. The style guidance is a hand image, e.g., the malformed hand itself, and is employed to furnish the style reference for hand refining. In order to suppress the structure leakage when referencing hand style and effectively utilize hand data to improve the capability of the model, we build a multi-style hand dataset and introduce a two-stage training strategy. In the first stage, we use paired hand images for training to generate hands with the same style as the reference. In the second stage, various hand images generated based on the human mesh are used for training to enable the model to gain control over the hand structure. We evaluate our method and counterparts on the test dataset of the proposed multi-style hand dataset. The experimental results show that RHanDS can effectively refine malformed hands with correct structure and style compared with previous methods. The codes and datasets will be available soon.
- Published
- 2024
29. A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions
- Author
-
Liu, Pengfei, Tao, Jun, and Ren, Zhixiang
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Quantitative Biology - Quantitative Methods - Abstract
The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and material science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). Then, we employ adaptive prompt learning to infuse the prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion in the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
- Published
- 2024
30. Evaluating Mathematical Reasoning Beyond Accuracy
- Author
-
Xia, Shijie, Li, Xuefeng, Liu, Yixin, Wu, Tongshuang, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language - Abstract
The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs $\textit{validity}$ and $\textit{redundancy}$ to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. Instantiated by base models that possess strong mathematical knowledge and trained with high-quality labeled data, ReasonEval achieves state-of-the-art performance on human-labeled datasets and can accurately detect different types of errors generated by perturbation. When applied to evaluate LLMs specialized in math, we find that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of the reasoning steps for challenging mathematical problems. Additionally, we observe that ReasonEval can play a significant role in data selection. We release the best-performing model, meta-evaluation script, and all evaluation results at https://github.com/GAIR-NLP/ReasonEval.
- Published
- 2024
31. Target-Guided Structured Attention Network for Target-Dependent Sentiment Analysis
- Author
-
Zhang, Ji, Chen, Chengyao, Liu, Pengfei, He, Chao, and Leung, Cane Wing-Ki
- Subjects
Computational linguistics. Natural language processing ,P98-98.5 - Abstract
Target-dependent sentiment analysis (TDSA) aims to classify the sentiment of a text towards a given target. The major challenge of this task lies in modeling the semantic relatedness between a target and its context sentence. This paper proposes a novel Target-Guided Structured Attention Network (TG-SAN), which captures target-related contexts for TDSA in a fine-to-coarse manner. Given a target and its context sentence, the proposed TG-SAN first identifies multiple semantic segments from the sentence using a target-guided structured attention mechanism. It then fuses the extracted segments based on their relatedness with the target for sentiment classification. We present comprehensive comparative experiments on three benchmarks with three major findings. First, TG-SAN outperforms the state-of-the-art by up to 1.61% and 3.58% in terms of accuracy and Macro-F1, respectively. Second, it shows a strong advantage in determining the sentiment of a target when the context sentence contains multiple semantic segments. Lastly, visualization results show that the attention scores produced by TG-SAN are highly interpretable.
- Published
- 2020
32. Complex Hygroscopic Behavior of Ambient Aerosol Particles Revealed by a Piezoelectric Technique
- Author
-
Jose, Christi, Singh, Aishwarya, Kalkura, Kavyashree N, Jose, George V, Srivastava, Shailina, Ammini, Rameshchand K, Yadav, Shweta, Ravikrishna, Raghunathan, Andreae, Meinrat O, Martin, Scot T, Liu, Pengfei, and Gunthe, Sachin S
- Subjects
Earth Sciences ,Atmospheric Sciences ,Climate Action ,radiation-climate interaction ,tropical India ,preindustrial conditions ,biogenic SOA ,size-resolved aerosol deliquescence ,mass-based hygroscopicity ,QCM ,Chemical sciences ,Earth sciences ,Physical sciences - Abstract
Understanding the complex interactions between atmospheric aerosols and water vapor in subsaturated regions of the atmosphere is crucial for modeling and predicting aerosol-cloud-radiation-climate interactions. However, the microphysical mechanisms of these interactions for ambient aerosols remain poorly understood. For this study, size-resolved samples were collected from a high-altitude, relatively clean site situated in the Western Ghats of India during the monsoon season, in order to study background and preindustrial processes as a baseline for climate functioning within the context of the most polluted region of the world. Measurements of humidity-dependent mass-based growth factors, hygroscopicity, deliquescence behavior, and aerosol liquid water content (ALWC) were made by a novel approach using a quartz crystal microbalance based on a piezo-electric sensor. The climate-relevant fine-mode aerosols (≤2.5 μm) exhibited strong size-dependent variations in their interactions with water vapor and contributed a high fraction of ALWC. Deliquescence occurred for relatively large aerosols (diameter >180 nm) but was absent for smaller aerosols. The deliquescence relative humidity for ambient aerosols was significantly lower than that of pure inorganic salts, suggesting a strong influence of organic species. Our study establishes an improved approach for accurately measuring aerosol water uptake characteristics of ambient aerosols in the subsaturated regime, aiding in the assessment of radiative forcing effects and improving climate models.
- Published
- 2024
33. CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
- Author
-
Xie, Yiqing, Xie, Alex, Sheth, Divyanshu, Liu, Pengfei, Fried, Daniel, and Rose, Carolyn
- Subjects
Computer Science - Software Engineering ,Computer Science - Computation and Language - Abstract
To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks that only requires light guidance from humans. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries revised from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the complexity and solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide the code at https://github.com/Veronicium/CodeBenchGen.
- Published
- 2024
34. LLMCRIT: Teaching Large Language Models to Use Criteria
- Author
-
Yuan, Weizhe, Liu, Pengfei, and Gallé, Matthias
- Subjects
Computer Science - Computation and Language - Abstract
Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models to perform tasks better. However, existing research in this field tends to consider only a limited set of criteria or quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We choose three tasks from real-world scenarios to operationalize this idea: paper introduction writing, Python code writing, and Reddit post writing, and evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights on how to teach LLMs to use criteria more effectively., Comment: ACL 2024 findings
- Published
- 2024
35. Reformatted Alignment
- Author
-
Fan, Run-Ze, Li, Xuefeng, Zou, Haoyang, Li, Junlong, He, Shwai, Chern, Ethan, Hu, Jiewen, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
The quality of finetuning data is crucial for aligning large language models (LLMs) with human values. Current methods to improve data quality are either labor-intensive or prone to factual errors caused by LLM hallucinations. This paper explores elevating the quality of existing instruction data to better align with human values, introducing a simple and effective approach named ReAlign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence. This approach minimizes human annotation, hallucination, and the difficulty in scaling, remaining orthogonal to existing alignment techniques. Experimentally, ReAlign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the LLMs. Encouragingly, without introducing any additional data or advanced training techniques, and merely by reformatting the response, LLaMA-2-13B's mathematical reasoning ability on GSM8K can be improved from 46.77% to 56.63% in accuracy. Additionally, a mere 5% of ReAlign data yields a 67% boost in general alignment ability measured by the Alpaca dataset. This work highlights the need for further research into the science and mechanistic interpretability of LLMs. We have made the associated code and data publicly accessible to support future studies at https://github.com/GAIR-NLP/ReAlign., Comment: Homepage: https://gair-nlp.github.io/ReAlign/
- Published
- 2024
36. Dissecting Human and LLM Preferences
- Author
-
Li, Junlong, Zhou, Fan, Sun, Shichao, Zhang, Yikai, Zhao, Hai, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
As a relative quality comparison of model responses, human and Large Language Model (LLM) preferences serve as common alignment goals in model fine-tuning and criteria in evaluation. Yet, these preferences merely reflect broad tendencies, resulting in less explainable and controllable models with potential safety risks. In this work, we dissect the preferences of human and 32 different LLMs to understand their quantitative composition, using annotations from real-world user-model conversations for a fine-grained, scenario-wise analysis. We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. On the contrary, advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. Additionally, LLMs of similar sizes tend to exhibit similar preferences, regardless of their training methods, and fine-tuning for alignment does not significantly alter the preferences of pretrained-only LLMs. Finally, we show that preference-based evaluation can be intentionally manipulated. In both training-free and training-based settings, aligning a model with the preferences of judges boosts scores, while injecting the least preferred properties lowers them. This results in notable score shifts: up to 0.59 on MT-Bench (1-10 scale) and 31.94 on AlpacaEval 2.0 (0-100 scale), highlighting the significant impact of this strategic adaptation. Interactive Demo: https://huggingface.co/spaces/GAIR/Preference-Dissection-Visualization Dataset: https://huggingface.co/datasets/GAIR/preference-dissection Code: https://github.com/GAIR-NLP/Preference-Dissection
- Published
- 2024
37. Deep Rib Fracture Instance Segmentation and Classification from CT on the RibFrac Challenge
- Author
-
Yang, Jiancheng, Shi, Rui, Jin, Liang, Huang, Xiaoyang, Kuang, Kaiming, Wei, Donglai, Gu, Shixuan, Liu, Jianying, Liu, Pengfei, Chai, Zhizhong, Xiao, Yongjie, Chen, Hao, Xu, Liming, Du, Bang, Yan, Xiangyi, Tang, Hao, Alessio, Adam, Holste, Gregory, Zhang, Jiapeng, Wang, Xiaoming, He, Jianye, Che, Lixuan, Pfister, Hanspeter, Li, Ming, and Ni, Bingbing
- Subjects
Electrical Engineering and Systems Science - Image and Video Processing ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Rib fractures are a common and potentially severe injury that can be challenging and labor-intensive to detect in CT scans. While there have been efforts to address this field, the lack of large-scale annotated datasets and evaluation benchmarks has hindered the development and validation of deep learning algorithms. To address this issue, the RibFrac Challenge was introduced, providing a benchmark dataset of over 5,000 rib fractures from 660 CT scans, with voxel-level instance mask annotations and diagnosis labels for four clinical categories (buckle, nondisplaced, displaced, or segmental). The challenge includes two tracks: a detection (instance segmentation) track evaluated by an FROC-style metric and a classification track evaluated by an F1-style metric. During the MICCAI 2020 challenge period, 243 results were evaluated, and seven teams were invited to participate in the challenge summary. The analysis revealed that several top rib fracture detection solutions achieved performance comparable or even better than human experts. Nevertheless, the current rib fracture classification solutions are hardly clinically applicable, which can be an interesting area in the future. As an active benchmark and research resource, the data and online evaluation of the RibFrac Challenge are available at the challenge website. As an independent contribution, we have also extended our previous internal baseline by incorporating recent advancements in large-scale pretrained networks and point-based rib segmentation techniques. The resulting FracNet+ demonstrates competitive performance in rib fracture detection, which lays a foundation for further research and development in AI-assisted rib fracture detection and diagnosis., Comment: Challenge paper for MICCAI RibFrac Challenge (https://ribfrac.grand-challenge.org/)
- Published
- 2024
38. Impact of Domain Knowledge and Multi-Modality on Intelligent Molecular Property Prediction: A Systematic Survey
- Author
-
Kuang, Taojie, Liu, Pengfei, and Ren, Zhixiang
- Subjects
Computer Science - Machine Learning ,Computer Science - Computational Engineering, Finance, and Science ,Quantitative Biology - Biomolecules - Abstract
The precise prediction of molecular properties is essential for advancements in drug development, particularly in virtual screening and compound optimization. The recent introduction of numerous deep learning-based methods has shown remarkable potential in enhancing molecular property prediction (MPP), especially improving accuracy and insights into molecular structures. Yet, two critical questions arise: does the integration of domain knowledge augment the accuracy of molecular property prediction and does employing multi-modal data fusion yield more precise results than unique data source methods? To explore these matters, we comprehensively review and quantitatively analyze recent deep learning methods based on various benchmarks. We discover that integrating molecular information significantly improves molecular property prediction (MPP) for both regression and classification tasks. Specifically, regression improvements, measured by reductions in root mean square error (RMSE), are up to 4.0%, while classification enhancements, measured by the area under the receiver operating characteristic curve (ROC-AUC), are up to 1.7%. We also discover that enriching 2D graphs with 1D SMILES boosts multi-modal learning performance for regression tasks by up to 9.1%, and augmenting 2D graphs with 3D information increases performance for classification tasks by up to 13.2%, with both enhancements measured using ROC-AUC. The two consolidated insights offer crucial guidance for future advancements in drug discovery.
- Published
- 2024
39. Scientific Language Modeling: A Quantitative Review of Large Language Models in Molecular Science
- Author
-
Liu, Pengfei, Tao, Jun, and Ren, Zhixiang
- Subjects
Computer Science - Machine Learning ,Computer Science - Computational Engineering, Finance, and Science - Abstract
Efficient molecular modeling and design are crucial for the discovery and exploration of novel molecules, and the incorporation of deep learning methods has revolutionized this field. In particular, large language models (LLMs) offer a fresh approach to tackle scientific problems from a natural language processing (NLP) perspective, introducing a research paradigm called scientific language modeling (SLM). However, two key issues remain: how to quantify the match between model and data modalities and how to identify the knowledge-learning preferences of models. To address these challenges, we propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263 experiments to assess the model's compatibility with data modalities and knowledge acquisition. Through the modal transition probability matrix, we provide insights into the most suitable modalities for tasks. Furthermore, we introduce a statistically interpretable approach to discover context-specific knowledge mapping by localized feature filtering. Our pioneering analysis offers an exploration of the learning mechanism and paves the way for advancing SLM in molecular science.
- Published
- 2024
40. Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
- Author
-
Chern, Steffi, Chern, Ethan, Neubig, Graham, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained by the coverage of existing benchmarks or requires extensive human annotation. This underscores the urgency of methods for scalable meta-evaluation that can effectively, reliably, and efficiently evaluate the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist human annotators in discerning the most capable LLMs as evaluators, which significantly eases their workload in cases that used to require large-scale annotations during meta-evaluation. We release the code for our framework, which is publicly available at: https://github.com/GAIR-NLP/scaleeval.
- Published
- 2024
41. On the Approximate Core and Nucleon of Flow Games
- Author
-
Liu, Pengfei, Xiao, Han, and Fang, Qizhi
- Subjects
Computer Science - Computer Science and Game Theory ,05C57, 91A12, 91A43, 91A46 - Abstract
The flow game with public arcs is a cooperative revenue game derived from a flow network. In this game, each player possesses an arc, while certain arcs, known as public arcs, are not owned by any specific player and are accessible to any coalition. The aim of this game is to maximize the flow that can be routed in the network through strategic coalition formation. By exploring its connection to the maximum partially disjoint path problem, we investigate the approximate core and nucleon of the flow game with public arcs. The approximate core is an extension of the core that allows for some deviation in group rationality, while the nucleon is a multiplicative analogue of the nucleolus. In this paper, we provide two complete characterizations for the optimal approximate core and show that the nucleon can be computed in polynomial time.
- Published
- 2024
42. The Critique of Critique
- Author
-
Sun, Shichao, Li, Junlong, Yuan, Weizhe, Yuan, Ruifeng, Li, Wenjie, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Critique, as a natural language description for assessing the quality of model-generated content, has played a vital role in the training, evaluation, and refinement of LLMs. However, a systematic method to evaluate the quality of critique is lacking. In this paper, we pioneer the critique of critique, termed MetaCritique, which builds specific quantification criteria. To achieve a reliable evaluation outcome, we propose Atomic Information Units (AIUs), which describe the critique in a more fine-grained manner. MetaCritique aggregates each AIU's judgment for the overall score. Moreover, MetaCritique delivers a natural language rationale for the intricate reasoning within each judgment. Lastly, we construct a meta-evaluation dataset covering 4 tasks across 16 public datasets involving human-written and LLM-generated critiques. Experiments demonstrate that MetaCritique can achieve near-human performance. Our study can facilitate future research in LLM critiques based on our following observations and released resources: (1) superior critiques judged by MetaCritique can lead to better refinements, indicating that it can potentially enhance the alignment of existing LLMs; (2) the leaderboard of critique models reveals that open-source critique models commonly suffer from factuality issues; (3) relevant code and data are publicly available at https://github.com/GAIR-NLP/MetaCritique to support deeper exploration; (4) an API at PyPI with the usage documentation in Appendix C allows users to assess the critique conveniently., Comment: Accepted to Findings of ACL 2024
- Published
- 2024
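The AIU-based scoring described in the MetaCritique abstract above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `AIU` dataclass, the boolean `judgment` field, and the simple-mean aggregation in `metacritique_score` are all assumptions (the paper's actual aggregation over AIU judgments may be precision/recall-style and richer).

```python
from dataclasses import dataclass

@dataclass
class AIU:
    """One Atomic Information Unit extracted from a critique (hypothetical schema)."""
    text: str
    judgment: bool  # True if this unit is judged correct/appropriate

def metacritique_score(aius: list[AIU]) -> float:
    """Aggregate per-AIU judgments into an overall critique score.

    Sketched here as the fraction of AIUs judged positively; the paper's
    actual aggregation rule may differ.
    """
    if not aius:
        return 0.0
    return sum(u.judgment for u in aius) / len(aius)
```

The point of the fine-grained units is that a single holistic judgment is replaced by many small, individually checkable ones, each of which can also carry its own natural-language rationale.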
43. InFoBench: Evaluating Instruction Following Ability in Large Language Models
- Author
-
Qin, Yiwei, Song, Kaiqiang, Hu, Yebowen, Yao, Wenlin, Cho, Sangwoo, Wang, Xiaoyang, Wu, Xuansheng, Liu, Fei, Liu, Pengfei, and Yu, Dong
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
- Published
- 2024
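The DRFR metric from the InFoBench abstract above lends itself to a short sketch. This is a hedged illustration under stated assumptions: the function names and the micro-averaging in `benchmark_drfr` are mine, and the abstract does not specify whether the benchmark-level score micro- or macro-averages over instructions.

```python
def drfr(criteria_met: list[bool]) -> float:
    """Decomposed Requirements Following Ratio for one instruction:
    the fraction of decomposed yes/no criteria a response satisfies."""
    if not criteria_met:
        return 0.0
    return sum(criteria_met) / len(criteria_met)

def benchmark_drfr(per_instruction: list[list[bool]]) -> float:
    """Benchmark-level DRFR, sketched as a micro-average over all
    decomposed questions (aggregation choice is an assumption)."""
    flat = [c for instr in per_instruction for c in instr]
    return drfr(flat)
```

Decomposing each complex instruction into simple binary criteria is what makes the metric auditable: an annotator (human or GPT-4, per the abstract) answers each simple question rather than grading the whole response at once.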
44. Establishment of trypsinogen-2 Amplification Luminescent Proximity Homogeneous Assay and its Application in Acute Pancreatitis
- Author
-
Chen, Meichun, Fang, Hongming, Wu, Jialong, Huang, Yue, Cheng, Feifan, Qin, Yuan, Zhao, Xueqin, Zhou, Xiumei, Liu, Pengfei, and Huang, Biao
- Published
- 2024
45. Estimation of Property Value Changes from Nearby Carbon Capture and Utilization Projects in China
- Author
-
Mei, Yingdan, Qiu, Jixiang, Qiu, Yueming Lucy, and Liu, Pengfei
- Published
- 2024
46. Study on Combustion Characteristics and Flame Flow Behavior with Ethanol-Kerosene Mixed Fuel in HVOF Spraying
- Author
-
Li, Siyu, Li, Chang, Liu, Pengfei, and Han, Xing
- Published
- 2024
47. Study of Turbulent Behavior and Particle Flight Characteristics Based on Different Turbulence Models During HVOF Spraying
- Author
-
Li, Siyu, Li, Chang, Liu, Pengfei, and Han, Xing
- Published
- 2024
48. Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math
- Author
-
Wang, Zengzhi, Xia, Rui, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile with the scripts used for processing, to facilitate future developments in this field., Comment: 37 pages. Work in Progress. https://github.com/GAIR-NLP/MathPile/
- Published
- 2023
49. How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation
- Author
-
Xiao, Yang, Cheng, Yi, Fu, Jinlan, Wang, Jiashuo, Li, Wenjie, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language ,Computer Science - Computers and Society - Abstract
In recent years, AI has demonstrated remarkable capabilities in simulating human behaviors, particularly those implemented with large language models (LLMs). However, due to the lack of systematic evaluation of LLMs' simulated behaviors, the believability of LLMs among humans remains ambiguous, i.e., it is unclear which behaviors of LLMs are convincingly human-like and which need further improvement. In this work, we design SimulateBench to evaluate the believability of LLMs when simulating human behaviors. Specifically, we evaluate the believability of LLMs along two critical dimensions: 1) consistency: the extent to which LLMs can behave consistently with the given information of a human to simulate; and 2) robustness: the ability of LLMs' simulated behaviors to remain robust when faced with perturbations. SimulateBench includes 65 character profiles and a total of 8,400 questions to examine LLMs' simulated behaviors. Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters. The experimental results reveal that current LLMs struggle to align their behaviors with assigned characters and are vulnerable to perturbations in certain factors.
- Published
- 2023
50. Align on the Fly: Adapting Chatbot Behavior to Established Norms
- Author
-
Xu, Chunpu, Chern, Steffi, Chern, Ethan, Zhang, Ge, Wang, Zekun, Liu, Ruibo, Li, Jing, Fu, Jie, and Liu, Pengfei
- Subjects
Computer Science - Computation and Language - Abstract
In this paper, we aim to align large language models with the ever-changing, complex, and diverse human values (e.g., social norms) across time and locations. This presents a challenge to existing alignment techniques, such as supervised fine-tuning, which internalize values within model parameters. To overcome this, we propose an On-the-fly Preference Optimization (OPO) method, which is a real-time alignment that works in a streaming way. It employs an external memory to store established rules for alignment, which can constrain LLMs' behaviors without further training, allowing for convenient updates and customization of human values. We also introduce a scalable evaluation to assess the proposed method more effectively. Experimental results on both human-annotated and auto-generated questions from legal and moral domains indicate the effectiveness of the proposed OPO method. Our code and data are released at https://github.com/GAIR-NLP/OPO.
- Published
- 2023
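The OPO abstract above describes a training-free loop: retrieve established rules from an external memory and let them constrain the model's response at inference time. The sketch below is a toy under loud assumptions: the lexical-overlap retriever, the prompt template, and both function names are illustrative only (the actual OPO implementation at the linked repository presumably uses a stronger retriever and its own prompting).

```python
def retrieve_rules(memory: list[str], query: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank stored rules by word overlap with the query.
    A real system would use a proper retriever (e.g., dense embeddings)."""
    def overlap(rule: str) -> int:
        return len(set(rule.lower().split()) & set(query.lower().split()))
    return sorted(memory, key=overlap, reverse=True)[:k]

def build_prompt(memory: list[str], query: str) -> str:
    """Prepend the retrieved rules so they constrain the model's answer
    without any parameter update (hypothetical template)."""
    rules = "\n".join(f"- {r}" for r in retrieve_rules(memory, query))
    return f"Follow these established norms:\n{rules}\n\nUser: {query}"
```

The design point the abstract emphasizes is that updating values means editing the memory, not retraining: appending or removing a rule string immediately changes what every subsequent prompt is constrained by.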