Author: "Chai, Yekun" - Searchworks@Jio Institute Digital Library Search Results

Your search for author "Chai, Yekun" returned 32 results

1. Curiosity-Driven Reinforcement Learning from Human Feedback

Author: Sun, Haoran, Chai, Yekun, Wang, Shuohuan, Sun, Yu, Wu, Hua, and Wang, Haifeng
Subjects: Computer Science - Computation and Language
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
Published: 2025

2. MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Author: Chai, Yekun, Sun, Haoran, Fang, Huang, Wang, Shuohuan, Sun, Yu, and Wu, Hua
Subjects: Computer Science - Computation and Language
Abstract: Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF.
Published: 2024

3. Tokenization Falling Short: On Subword Robustness in Large Language Models

Author: Chai, Yekun, Fang, Yewei, Peng, Qiwei, and Li, Xuhong
Subjects: Computer Science - Computation and Language
Abstract: Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens--issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We release our evaluation code and data at https://github.com/FloatAI/TKEval.
Comment: EMNLP 2024 Findings
Published: 2024

4. Autoregressive Pre-Training on Pixels and Texts

Author: Chai, Yekun, Liu, Qingyi, Xiao, Jingwu, Wang, Shuohuan, Sun, Yu, and Wu, Hua
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.
Comment: EMNLP 2024
Published: 2024

5. On Training Data Influence of GPT Models

Author: Chai, Yekun, Liu, Qingyi, Wang, Shuohuan, Sun, Yu, Peng, Qiwei, and Wu, Hua
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available at https://github.com/ernie-research/gptfluence.
Comment: EMNLP 2024
Published: 2024

6. Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

Author: Nakamura, Taishi, Mishra, Mayank, Tedeschi, Simone, Chai, Yekun, Stillerman, Jason T, Friedrich, Felix, Yadav, Prateek, Laud, Tanmay, Chien, Vu Minh, Zhuo, Terry Yue, Misra, Diganta, Bogin, Ben, Vu, Xuan-Son, Karpinska, Marzena, Dantuluri, Arnav Varma, Kusa, Wojciech, Furlanello, Tommaso, Yokota, Rio, Muennighoff, Niklas, Pai, Suhas, Adewumi, Tosin, Laippala, Veronika, Yao, Xiaozhe, Junior, Adalberto, Ariyak, Alpay, Drozd, Aleksandr, Clive, Jordan, Gupta, Kshitij, Chen, Liangyu, Sun, Qi, Tsui, Ken, Persaud, Noah, Fahmy, Nour, Chen, Tianlong, Bansal, Mohit, Monti, Nicolo, Dang, Tai, Luo, Ziyang, Bui, Tien-Tung, Navigli, Roberto, Mehta, Virendra, Blumberg, Matthew, May, Victor, Nguyen, Huu, and Pyysalo, Sampo
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.
Comment: Preprint
Published: 2024

7. StarCoder 2 and The Stack v2: The Next Generation

Author: Lozhkov, Anton, Li, Raymond, Allal, Loubna Ben, Cassano, Federico, Lamy-Poirier, Joel, Tazi, Nouamane, Tang, Ao, Pykhtar, Dmytro, Liu, Jiawei, Wei, Yuxiang, Liu, Tianyang, Tian, Max, Kocetkov, Denis, Zucker, Arthur, Belkada, Younes, Wang, Zijian, Liu, Qian, Abulkhanov, Dmitry, Paul, Indraneil, Li, Zhuang, Li, Wen-Ding, Risdal, Megan, Li, Jia, Zhu, Jian, Zhuo, Terry Yue, Zheltonozhskii, Evgenii, Dade, Nii Osae Osae, Yu, Wenhao, Krauß, Lucas, Jain, Naman, Su, Yixuan, He, Xuanli, Dey, Manan, Abati, Edoardo, Chai, Yekun, Muennighoff, Niklas, Tang, Xiangru, Oblokulov, Muhtasham, Akiki, Christopher, Marone, Marc, Mou, Chenghao, Mishra, Mayank, Gu, Alex, Hui, Binyuan, Dao, Tri, Zebaze, Armel, Dehaene, Olivier, Patry, Nicolas, Xu, Canwen, McAuley, Julian, Hu, Han, Scholak, Torsten, Paquet, Sebastien, Robinson, Jennifer, Anderson, Carolyn Jane, Chapados, Nicolas, Patwary, Mostofa, Tajbakhsh, Nima, Jernite, Yacine, Ferrandis, Carlos Muñoz, Zhang, Lingming, Hughes, Sean, Wolf, Thomas, Guha, Arjun, von Werra, Leandro, and de Vries, Harm
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence
Abstract: The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
Published: 2024

8. HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Author: Peng, Qiwei, Chai, Yekun, and Li, Xuhong
Subjects: Computer Science - Computation and Language, Computer Science - Programming Languages, Computer Science - Software Engineering
Abstract: Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/humaneval-xl.
Comment: LREC-COLING 2024
Published: 2024

9. Tool-Augmented Reward Modeling

Author: Li, Lei, Chai, Yekun, Wang, Shuohuan, Sun, Yu, Tian, Hao, Zhang, Ningyu, and Wu, Hua
Subjects: Computer Science - Computation and Language
Abstract: Reward modeling (a.k.a., preference modeling) is instrumental for aligning large language models with human preferences, particularly within the context of reinforcement learning from human feedback (RLHF). While conventional reward models (RMs) have exhibited remarkable scalability, they often struggle with fundamental functionality such as arithmetic computation, code execution, and factual lookup. In this paper, we propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering RMs with access to external environments, including calculators and search engines. This approach not only fosters synergy between tool utilization and reward grading but also enhances interpretive capacity and scoring reliability. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources and construct task-specific tool engagement and reasoning traces in an autoregressive manner. We validate our approach across a wide range of domains, incorporating seven distinct external tools. Our experimental results demonstrate a noteworthy overall improvement of 17.7% across eight tasks in preference ranking. Furthermore, our approach outperforms Gopher 280B by 7.3% on TruthfulQA task in zero-shot evaluation. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines across four distinct tasks. Additionally, we provide a comprehensive collection of tool-related RM datasets, incorporating data from seven distinct tool APIs, totaling 15,000 instances. We have made the code, data, and model checkpoints publicly available to facilitate and inspire further research advancements (https://github.com/ernie-research/Tool-Augmented-Reward-Model).
Comment: ICLR 2024 Spotlight
Published: 2023

10. Improved Training of Mixture-of-Experts Language GANs

Author: Chai, Yekun, Yin, Qiyue, and Zhang, Junge
Subjects: Computer Science - Computation and Language
Abstract: Despite the dramatic success in image generation, Generative Adversarial Networks (GANs) still face great challenges in synthesizing sequences of discrete elements, in particular human language. The difficulty in generator training arises from the limited representation capacity and uninformative learning signals obtained from the discriminator. In this work, we (1) first empirically show that the mixture-of-experts approach is able to enhance the representation capacity of the generator for language GANs and (2) harness the Feature Statistics Alignment (FSA) paradigm to render fine-grained learning signals to advance the generator training. Specifically, FSA forces the mean statistics of the distribution of fake data to approach that of real samples as closely as possible in the finite-dimensional feature space. Empirical study on synthetic and real benchmarks shows the superior performance in quantitative evaluation and demonstrates the effectiveness of our approach to adversarial text generation.
Comment: Accepted at ICASSP 2023
Published: 2023

11. ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models

Author: Zhu, Pengfei, Pang, Chao, Chai, Yekun, Li, Lei, Wang, Shuohuan, Sun, Yu, Tian, Hao, and Wu, Hua
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, the burgeoning interest in diffusion models has led to significant advances in image and speech generation. Nevertheless, the direct synthesis of music waveforms from unrestricted textual prompts remains a relatively underexplored domain. In response to this lacuna, this paper introduces a pioneering contribution in the form of a text-to-waveform music generation model, underpinned by the utilization of diffusion models. Our methodology hinges on the innovative incorporation of free-form textual prompts as conditional factors to guide the waveform generation process within the diffusion model framework. Addressing the challenge of limited text-music parallel data, we undertake the creation of a dataset by harnessing web resources, a task facilitated by weak supervision techniques. Furthermore, a rigorous empirical inquiry is undertaken to contrast the efficacy of two distinct prompt formats for text conditioning, namely, music tags and unconstrained textual descriptions. The outcomes of this comparative analysis affirm the superior performance of our proposed model in terms of enhancing text-music relevance. Finally, our work culminates in a demonstrative exhibition of the excellent capabilities of our model in text-to-music generation. We further demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
Comment: Accepted by AACL demo 2023
Published: 2023

12. ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Author: Chai, Yekun, Wang, Shuohuan, Pang, Chao, Sun, Yu, Tian, Hao, and Wu, Hua
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Programming Languages, Computer Science - Software Engineering
Abstract: Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
Comment: Accepted at ACL 2023 (Findings)
Published: 2022

13. X-PuDu at SemEval-2022 Task 6: Multilingual Learning for English and Arabic Sarcasm Detection

Author: Han, Yaqian, Chai, Yekun, Wang, Shuohuan, Sun, Yu, Huang, Hongyi, Chen, Guanghao, Xu, Yitong, and Yang, Yang
Subjects: Computer Science - Computation and Language
Abstract: Detecting sarcasm and verbal irony from people's subjective statements is crucial to understanding their intended meanings and real sentiments and positions in social scenarios. This paper describes the X-PuDu system that participated in SemEval-2022 Task 6, iSarcasmEval - Intended Sarcasm Detection in English and Arabic, which aims at detecting intended sarcasm in various settings of natural language understanding. Our solution finetunes pre-trained language models, such as ERNIE-M and DeBERTa, under the multilingual settings to recognize the irony from Arabic and English texts. Our system ranked second out of 43, and ninth out of 32 in Task A: one-sentence detection in English and Arabic; fifth out of 22 in Task B: binary multi-label classification in English; first out of 16, and fifth out of 13 in Task C: sentence-pair detection in English and Arabic.
Comment: SemEval-2022 Task 6
Published: 2022

14. Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards

Author: Chai, Yekun, Wang, Shuohuan, Sun, Yu, Tian, Hao, Wu, Hua, and Wang, Haifeng
Subjects: Computer Science - Computation and Language
Abstract: Derivative-free prompt learning has emerged as a lightweight alternative to prompt tuning, which only requires model inference to optimize the prompts. However, existing work did not take full advantage of the over-parameterized characteristics of large pre-trained language models (PLMs). In this paper, we propose Clip-Tuning, a simple yet effective method that adopts diverse frozen "thinned" networks of PLMs to obtain a mixture of rewards and thus advance the derivative-free prompt learning. The thinned networks consist of all the hidden units that survive a stationary dropout strategy, whose inference predictions reflect an ensemble of partial views over prompted training samples. Our method outperforms previous gradient-free prompt learning methods and achieves parity with gradient-based counterparts on seven language understanding benchmarks under few-shot settings.
Comment: EMNLP 2022 (Findings)
Published: 2022

15. RefineCap: Concept-Aware Refinement for Image Captioning

Author: Chai, Yekun, Jin, Shuo, and Xing, Junliang
Subjects: Computer Science - Computation and Language
Abstract: Automatically translating images to texts involves image scene understanding and language modeling. In this paper, we propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided visual semantics, and implicitly learns the mapping between visual tag words and images. The proposed Visual-Concept Refinement method can allow the generator to attend to semantic details in the image, thereby generating more semantically descriptive captions. Our model achieves superior performance on the MS-COCO dataset in comparison with previous visual-concept based models.
Comment: Accepted at ViGIL @NAACL 2021
Published: 2021

16. Neural Text Classification by Jointly Learning to Cluster and Align

Author: Chai, Yekun, Zhang, Haidong, and Jin, Shuo
Subjects: Computer Science - Computation and Language
Abstract: Distributional text clustering delivers semantically informative representations and captures the relevance between each word and semantic clustering centroids. We extend the neural text clustering approach to text classification tasks by inducing cluster centers via a latent variable model and interacting with distributional word embeddings, to enrich the representation of tokens and measure the relatedness between tokens and each learnable cluster centroid. The proposed method jointly learns word clustering centroids and clustering-token alignments, achieving state-of-the-art results on multiple benchmark datasets and showing that the proposed cluster-token alignment mechanism is indeed favorable to text classification. Notably, our qualitative analysis illustrates that text representations learned by the proposed model accord well with our intuition.
Published: 2020

17. Highway Transformer: Self-Gating Enhanced Self-Attentive Networks

Author: Chai, Yekun, Jin, Shuo, and Hou, Xinwen
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Self-attention mechanisms have made striking state-of-the-art (SOTA) progress in various sequence learning tasks, standing on the multi-headed dot product attention by attending to all the global contexts at different locations. Through a pseudo information highway, we introduce a gated component, self-dependency units (SDU), that incorporates LSTM-styled gating units to replenish internal semantic importance within the multi-dimensional latent space of individual representations. The subsidiary content-based SDU gates allow for the information flow of modulated latent embeddings through skipped connections, leading to a clear margin of convergence speed with gradient descent algorithms. We may unveil the role of the gating mechanism in aiding the context-based Transformer modules, hypothesizing that SDU gates, especially on shallow layers, could push it faster to step towards suboptimal points during the optimization process.
Comment: Accepted at ACL 2020
Published: 2020

18. How to Evaluate Word Representations of Informal Domain?

Author: Chai, Yekun, Saphra, Naomi, and Lopez, Adam
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Diverse word representations have surged in most state-of-the-art natural language processing (NLP) applications. Nevertheless, how to efficiently evaluate such word embeddings in informal domains such as Twitter or forums remains an ongoing challenge due to the lack of sufficient evaluation datasets. We derived a large list of variant spelling pairs from UrbanDictionary with the automatic approaches of weakly-supervised pattern-based bootstrapping and self-training linear-chain conditional random field (CRF). With these extracted relation pairs we promote the odds of eliding the text normalization procedure of traditional NLP pipelines and directly adopting representations of non-standard words in the informal domain. Our code is available.
Published: 2019

19. Dual Modalities of Text: Visual and Textual Generative Pre-training

Author: Chai, Yekun, Liu, Qingyi, Xiao, Jingwu, Wang, Shuohuan, Sun, Yu, and Wu, Hua
Abstract: Harnessing visual texts represents a burgeoning frontier in the evolution of language modeling. In this paper, we introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-training on a corpus of over 400 million documents rendered as RGB images. Our approach is characterized by a dual-modality training regimen, engaging both visual data through next patch prediction with a regression head and textual data via next token prediction with a classification head. This study is particularly focused on investigating the synergistic interplay between visual and textual modalities of language. Our comprehensive evaluation across a diverse array of benchmarks reveals that the confluence of visual and textual data substantially augments the efficacy of pixel-based language models. Notably, our findings show that a unidirectional pixel-based model, devoid of textual data during training, can match the performance levels of advanced bidirectional pixel-based models on various language understanding benchmarks. This work highlights the considerable untapped potential of integrating visual and textual information for language modeling purposes. We will release our code, data, and checkpoints to inspire further research advancement.
Published: 2024

20. On Training Data Influence of GPT Models

Author: Liu, Qingyi, Chai, Yekun, Wang, Shuohuan, Sun, Yu, Peng, Qiwei, Wang, Keze, and Wu, Hua
Abstract: Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We will make our code and data publicly available.
Published: 2024

21. Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

Author: Nakamura, Taishi, Mishra, Mayank, Tedeschi, Simone, Chai, Yekun, Stillerman, Jason T, Friedrich, Felix, Yadav, Prateek, Laud, Tanmay, Chien, Vu Minh, Zhuo, Terry Yue, Misra, Diganta, Bogin, Ben, Vu, Xuan-Son, Karpinska, Marzena, Dantuluri, Arnav Varma, Kusa, Wojciech, Furlanello, Tommaso, Yokota, Rio, Muennighoff, Niklas, Pai, Suhas, Adewumi, Tosin, Laippala, Veronika, Yao, Xiaozhe, Junior, Adalberto, Ariyak, Alpay, Drozd, Aleksandr, Clive, Jordan, Gupta, Kshitij, Chen, Liangyu, Sun, Qi, Tsui, Ken, Persaud, Noah, Fahmy, Nour, Chen, Tianlong, Bansal, Mohit, Monti, Nicolo, Dang, Tai, Luo, Ziyang, Bui, Tien-Tung, Navigli, Roberto, Mehta, Virendra, Blumberg, Matthew, May, Victor, Nguyen, Huu, and Pyysalo, Sampo
Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, whereas pretraining from scratch is computationally expensive, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 .
Comment: Preprint
Published: 2024

22. Exponential Moving Averaged Q-Network for DDPG

Author: Shen, Xiangxiang, Yin, Chuanhuan, Chai, Yekun, Hou, Xinwen, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Lin, Zhouchen, editor, Wang, Liang, editor, Yang, Jian, editor, Shi, Guangming, editor, Tan, Tieniu, editor, Zheng, Nanning, editor, Chen, Xilin, editor, and Zhang, Yanning, editor
Published: 2019
Full Text: View/download PDF

23. Neural Text Classification by Jointly Learning to Cluster and Align

Author: Chai, Yekun, primary, Zhang, Haidong, additional, Yin, Qiyue, additional, and Zhang, Junge, additional
Published: 2023
Full Text: View/download PDF

24. Improved Training of Mixture-of-Experts Language GANs

Author: Chai, Yekun, primary, Yin, Qiyue, additional, and Zhang, Junge, additional
Published: 2023
Full Text: View/download PDF

25. Exponential Moving Averaged Q-Network for DDPG

Author: Shen, Xiangxiang, primary, Yin, Chuanhuan, additional, Chai, Yekun, additional, and Hou, Xinwen, additional
Published: 2019
Full Text: View/download PDF

26. ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Author: Chai, Yekun, primary, Wang, Shuohuan, additional, Pang, Chao, additional, Sun, Yu, additional, Tian, Hao, additional, and Wu, Hua, additional
Published: 2023
Full Text: View/download PDF

27. Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards

Author: Chai, Yekun, primary, Wang, Shuohuan, additional, Sun, Yu, additional, Tian, Hao, additional, Wu, Hua, additional, and Wang, Haifeng, additional
Published: 2022
Full Text: View/download PDF

28. Predicate-Argument Based Bi-Encoder for Paraphrase Identification

Author: Peng, Qiwei, primary, Weir, David, additional, Weeds, Julie, additional, and Chai, Yekun, additional
Published: 2022
Full Text: View/download PDF

29. X-PuDu at SemEval-2022 Task 6: Multilingual Learning for English and Arabic Sarcasm Detection

Author: Han, Ya, primary, Chai, Yekun, additional, Wang, Shuohuan, additional, Sun, Yu, additional, Huang, Hongyi, additional, Chen, Guanghao, additional, Xu, Yitong, additional, and Yang, Yang, additional
Published: 2022
Full Text: View/download PDF

30. Counter-Contrastive Learning for Language GANs

Author: Chai, Yekun, primary, Zhang, Haidong, additional, Yin, Qiyue, additional, and Zhang, Junge, additional
Published: 2021
Full Text: View/download PDF

31. COIN: Conversational Interactive Networks for Emotion Recognition in Conversation

Author: Zhang, Haidong, primary and Chai, Yekun, additional
Published: 2021
Full Text: View/download PDF

32. Highway Transformer: Self-Gating Enhanced Self-Attentive Networks

Author: Chai, Yekun, primary, Jin, Shuo, additional, and Hou, Xinwen, additional
Published: 2020
Full Text: View/download PDF
