Author: "Yang, Baosong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yang, Baosong"' showing total 167 results

Start Over Author "Yang, Baosong"

167 results on '"Yang, Baosong"'

1. P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs

Author: Zhang, Yidan, Deng, Boyi, Wan, Yu, Yang, Baosong, Wei, Haoran, Huang, Fei, Yu, Bowen, Lin, Junyang, and Zhou, Jingren
Subjects: Computer Science - Computation and Language
Abstract: Recent advancements in large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. To alleviate this drawback, we aim to present a comprehensive multilingual multitask benchmark. First, we present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks, i.e., their ability to differentiate between models being evaluated. Leveraging this pipeline, we introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. Furthermore, P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. Finally, we conduct extensive experiments on representative multilingual model series to compare performances across models, analyze dataset effectiveness, examine prompt impacts on model performances, and explore the relationship between multilingual performances and factors such as tasks, model sizes, and languages. These insights offer valuable guidance for future research. The dataset is available at https://huggingface.co/datasets/Qwen/P-MMEval.
Published: 2024

2. ZhoBLiMP: a Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese

Author: Liu, Yikang, Shen, Yeting, Zhu, Hongao, Xu, Lilong, Qian, Zhiheng, Song, Siyuan, Zhang, Kejia, Tang, Jialong, Zhang, Pei, Yang, Baosong, Wang, Rui, and Hu, Hai
Subjects: Computer Science - Computation and Language
Abstract: Whether and how language models (LMs) acquire the syntax of natural languages has been widely evaluated under the minimal pair paradigm. However, a lack of wide-coverage benchmarks in languages other than English has constrained systematic investigations into the issue. Addressing it, we first introduce ZhoBLiMP, the most comprehensive benchmark of linguistic minimal pairs for Chinese to date, with 118 paradigms, covering 15 linguistic phenomena. We then train 20 LMs of different sizes (14M to 1.4B) on Chinese corpora of various volumes (100M to 3B tokens) and evaluate them along with 14 off-the-shelf LLMs on ZhoBLiMP. The overall results indicate that Chinese grammar can be mostly learned by models with around 500M parameters, trained on 1B tokens with one epoch, showing limited benefits for further scaling. Most (N=95) linguistic paradigms are of easy or medium difficulty for LMs, while there are still 13 paradigms that remain challenging even for models with up to 32B parameters. In regard to how LMs acquire Chinese grammar, we observe a U-shaped learning pattern in several phenomena, similar to those observed in child language acquisition.
Published: 2024

3. Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation

Author: Wu, Suhang, Tang, Jialong, Yang, Baosong, Wang, Ante, Jia, Kaidi, Yu, Jiawei, Yao, Junfeng, and Su, Jinsong
Subjects: Computer Science - Computation and Language
Abstract: RALMs (Retrieval-Augmented Language Models) broaden their knowledge scope by incorporating external textual resources. However, the multilingual nature of global knowledge necessitates RALMs to handle diverse languages, a topic that has received limited research focus. In this work, we propose \textit{Futurepedia}, a carefully crafted benchmark containing parallel texts across eight representative languages. We evaluate six multilingual RALMs using our benchmark to explore the challenges of multilingual RALMs. Experimental results reveal linguistic inequalities: 1) high-resource languages stand out in Monolingual Knowledge Extraction; 2) Indo-European languages lead RALMs to provide answers directly from documents, alleviating the challenge of expressing answers across languages; 3) English benefits from RALMs' selection bias and speaks louder in multilingual knowledge selection. Based on these findings, we offer advice for improving multilingual Retrieval Augmented Generation. For monolingual knowledge extraction, careful attention must be paid to cascading errors from translating low-resource languages into high-resource ones. In cross-lingual knowledge transfer, encouraging RALMs to provide answers within documents in different languages can improve transfer performance. For multilingual knowledge selection, incorporating more non-English documents and repositioning English documents can help mitigate RALMs' selection bias. Through comprehensive experiments, we underscore the complexities inherent in multilingual RALMs and offer valuable insights for future research.
Published: 2024

4. Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

Author: Wang, Yiming, Zhang, Pei, Yang, Baosong, Wong, Derek F., and Wang, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: LLM self-evaluation relies on the LLM's own ability to estimate response correctness, which can greatly improve its deployment reliability. In this research track, we propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation. CoE consists of all progressive hidden states produced during the inference time, which can be treated as the latent thinking path of LLMs. We find that when LLMs respond correctly and incorrectly, their CoE features differ, these discrepancies assist us in estimating LLM response correctness. Experiments in four diverse domains and seven LLMs fully demonstrate the effectiveness of our method. Meanwhile, its label-free design intent without any training and millisecond-level computational cost ensure real-time feedback in large-scale scenarios. More importantly, we provide interesting insights into LLM response correctness from the perspective of hidden state changes inside LLMs., Comment: 33 pages, 18 figures, 12 tables
Published: 2024

5. Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning

Author: Hu, Tianxiang, Zhang, Pei, Yang, Baosong, Xie, Jun, Wong, Derek F., and Wang, Rui
Subjects: Computer Science - Computation and Language
Abstract: Achieving consistent high-quality machine translation (MT) across diverse domains remains a significant challenge, primarily due to the limited and imbalanced parallel training data available in various domains. While large language models (LLMs) have demonstrated impressive general understanding and generation abilities, their potential in multi-domain MT is under-explored. We establish a comprehensive benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets respectively covering 15 domains. Our evaluation of prominent LLMs reveals a discernible performance gap against traditional MT systems, highlighting domain overfitting and catastrophic forgetting issues after fine-tuning on domain-limited corpora. To mitigate this, we propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance. This method inspires the LLM to perceive domain information from the source text, which then serves as a helpful hint to guide the translation process. Despite being trained on a small dataset of four domains, our CoT fine-tune approach achieves notable enhancements in translation accuracy and domain robustness than traditional fine-tuning, as evidenced by an average 1.53 BLEU score increase in over 20 German$\rightarrow$English distinct out-of-domain tests.
Published: 2024

6. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Author: Zhang, Xin, Zhang, Yanzhao, Long, Dingkun, Xie, Wen, Dai, Ziqi, Tang, Jialong, Lin, Huan, Yang, Baosong, Xie, Pengjun, Huang, Fei, Zhang, Meishan, Li, Wenjie, and Zhang, Min
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: We present systematic efforts in building long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than 512 of previous multilingual encoders). Then we construct a hybrid TRM and a cross-encoder reranker by contrastive learning. Evaluations show that our text encoder outperforms the same-sized previous state-of-the-art XLM-R. Meanwhile, our TRM and reranker match the performance of large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrate that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit various researches and industrial applications., Comment: Camera-ready version of EMNLP 2024: Industry Track
Published: 2024

7. Qwen2 Technical Report

Author: Yang, An, Yang, Baosong, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Zhou, Chang, Li, Chengpeng, Li, Chengyuan, Liu, Dayiheng, Huang, Fei, Dong, Guanting, Wei, Haoran, Lin, Huan, Tang, Jialong, Wang, Jialin, Yang, Jian, Tu, Jianhong, Zhang, Jianwei, Ma, Jianxin, Yang, Jianxin, Xu, Jin, Zhou, Jingren, Bai, Jinze, He, Jinzheng, Lin, Junyang, Dang, Kai, Lu, Keming, Chen, Keqin, Yang, Kexin, Li, Mei, Xue, Mingfeng, Ni, Na, Zhang, Pei, Wang, Peng, Peng, Ru, Men, Rui, Gao, Ruize, Lin, Runji, Wang, Shijie, Bai, Shuai, Tan, Sinan, Zhu, Tianhang, Li, Tianhao, Liu, Tianyu, Ge, Wenbin, Deng, Xiaodong, Zhou, Xiaohuan, Ren, Xingzhang, Zhang, Xinyu, Wei, Xipin, Ren, Xuancheng, Liu, Xuejing, Fan, Yang, Yao, Yang, Zhang, Yichang, Wan, Yu, Chu, Yunfei, Liu, Yuqiong, Cui, Zeyu, Zhang, Zhenru, Guo, Zhifang, and Fan, Zhihao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors., Comment: 26 pages, 1 figure
Published: 2024

8. MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Author: Li, Tianhao, Li, Shangjie, Xie, Binbin, Xiong, Deyi, and Yang, Baosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model's original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model's learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model's original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model's integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies., Comment: 13 pages, 2 figures
Published: 2024

9. AnyTrans: Translate AnyText in the Image with Large Scale Models

Author: Qian, Zhipeng, Zhang, Pei, Yang, Baosong, Fan, Kai, Ma, Yiwei, Wong, Derek F., Sun, Xiaoshuai, and Ji, Rongrong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
Published: 2024

10. Efficient k-Nearest-Neighbor Machine Translation with Dynamic Retrieval

Author: Gao, Yan, Cao, Zhiwei, Miao, Zhongjian, Yang, Baosong, Liu, Shiyu, Zhang, Min, and Su, Jinsong
Subjects: Computer Science - Computation and Language
Abstract: To achieve non-parametric NMT domain adaptation, $k$-Nearest-Neighbor Machine Translation ($k$NN-MT) constructs an external datastore to store domain-specific translation knowledge, which derives a $k$NN distribution to interpolate the prediction distribution of the NMT model via a linear interpolation coefficient $\lambda$. Despite its success, $k$NN retrieval at each timestep leads to substantial time overhead. To address this issue, dominant studies resort to $k$NN-MT with adaptive retrieval ($k$NN-MT-AR), which dynamically estimates $\lambda$ and skips $k$NN retrieval if $\lambda$ is less than a fixed threshold. Unfortunately, $k$NN-MT-AR does not yield satisfactory results. In this paper, we first conduct a preliminary study to reveal two key limitations of $k$NN-MT-AR: 1) the optimization gap leads to inaccurate estimation of $\lambda$ for determining $k$NN retrieval skipping, and 2) using a fixed threshold fails to accommodate the dynamic demands for $k$NN retrieval at different timesteps. To mitigate these limitations, we then propose $k$NN-MT with dynamic retrieval ($k$NN-MT-DR) that significantly extends vanilla $k$NN-MT in two aspects. Firstly, we equip $k$NN-MT with a MLP-based classifier for determining whether to skip $k$NN retrieval at each timestep. Particularly, we explore several carefully-designed scalar features to fully exert the potential of the classifier. Secondly, we propose a timestep-aware threshold adjustment method to dynamically generate the threshold, which further improves the efficiency of our model. Experimental results on the widely-used datasets demonstrate the effectiveness and generality of our model.\footnote{Our code is available at \url{https://github.com/DeepLearnXMU/knn-mt-dr}., Comment: Accepted to ACL 2024 Findings
Published: 2024

11. Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning

Author: Wang, Yiming, Zhang, Pei, Yang, Baosong, Wong, Derek F., Zhang, Zhuosheng, and Wang, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Real-world data deviating from the independent and identically distributed (i.i.d.) assumption of in-distribution training data poses security threats to deep networks, thus advancing out-of-distribution (OOD) detection algorithms. Detection methods in generative language models (GLMs) mainly focus on uncertainty estimation and embedding distance measurement, with the latter proven to be most effective in traditional linguistic tasks like summarization and translation. However, another complex generative scenario mathematical reasoning poses significant challenges to embedding-based methods due to its high-density feature of output spaces, but this feature causes larger discrepancies in the embedding shift trajectory between different samples in latent spaces. Hence, we propose a trajectory-based method TV score, which uses trajectory volatility for OOD detection in mathematical reasoning. Experiments show that our method outperforms all traditional algorithms on GLMs under mathematical reasoning scenarios and can be extended to more applications with high-density features in output spaces, such as multiple-choice questions., Comment: Accepted by NeurIPS 2024
Published: 2024

12. EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning

Author: Guo, Ping, Wei, Xiangpeng, Hu, Yue, Yang, Baosong, Liu, Dayiheng, Huang, Fei, and Xie, Jun
Subjects: Computer Science - Computation and Language
Abstract: Expressing universal semantics common to all languages is helpful in understanding the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages with the usage of massive parallel corpora. However, due to the sparsity and scarcity of parallel data, there is still a big challenge in learning authentic ``universals'' for any two languages. In this paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm, to learn (X)Cross-lingual universals with the aid of excessive multilingual non-parallel data. EMMA-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and supervise each other until convergence. To evaluate EMMA-X, we conduct experiments on XRETE, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that EMMA-X achieves state-of-the-art performance. Further geometric analysis of the built representation space with three requirements demonstrates the superiority of EMMA-X over advanced models., Comment: Accepted by NeurIPS 2023
Published: 2023

13. PolyLM: An Open Source Polyglot Large Language Model

Author: Wei, Xiangpeng, Wei, Haoran, Lin, Huan, Li, Tianhao, Zhang, Pei, Ren, Xingzhang, Li, Mei, Wan, Yu, Cao, Zhiwei, Xie, Binbin, Hu, Tianxiang, Li, Shangjie, Hui, Binyuan, Yu, Bowen, Liu, Dayiheng, Yang, Baosong, Huang, Fei, and Xie, Jun
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following nature language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, alone with the instruction data and multilingual benchmark, are available at: \url{https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation}.
Published: 2023

14. Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language Models

Author: Wang, Yiming, Zhang, Zhuosheng, Zhang, Pei, Yang, Baosong, and Wang, Rui
Subjects: Computer Science - Computation and Language
Abstract: Neural-symbolic methods have demonstrated efficiency in enhancing the reasoning abilities of large language models (LLMs). However, existing methods mainly rely on syntactically mapping natural languages to complete formal languages like Python and SQL. Those methods require that reasoning tasks be convertible into programs, which cater to the computer execution mindset and deviate from human reasoning habits. To broaden symbolic methods' applicability and adaptability in the real world, we propose the Meta-Reasoning from a linguistic perspective. This method empowers LLMs to deconstruct reasoning-independent semantic information into generic symbolic representations, thereby efficiently capturing more generalized reasoning knowledge. We conduct extensive experiments on more than ten datasets encompassing conventional reasoning tasks like arithmetic, symbolic, and logical reasoning, and the more complex interactive reasoning tasks like theory-of-mind reasoning. Experimental results demonstrate that Meta-Reasoning significantly enhances in-context reasoning accuracy, learning efficiency, out-of-domain generalization, and output stability compared to the Chain-of-Thought technique. Code and data are publicly available at \url{https://github.com/Alsace08/Meta-Reasoning}., Comment: Accepted by ACL 2024 Findings
Published: 2023

15. Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation

Author: Cao, Zhiwei, Yang, Baosong, Lin, Huan, Wu, Suhang, Wei, Xiangpeng, Liu, Dayiheng, Xie, Jun, Zhang, Min, and Su, Jinsong
Subjects: Computer Science - Computation and Language
Abstract: $k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains. By using an upstream NMT model to traverse the downstream training corpus, it is equipped with a datastore containing vectorized key-value pairs, which are retrieved during inference to benefit translation. However, there often exists a significant gap between upstream and downstream domains, which hurts the retrieval accuracy and the final translation quality. To deal with this issue, we propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore. Concretely, we design a reviser to revise the key representations, making them better fit for the downstream domain. The reviser is trained using the collected semantically-related key-queries pairs, and optimized by two proposed losses: one is the key-queries semantic distance ensuring each revised key representation is semantically related to its corresponding queries, and the other is an L2-norm loss encouraging revised key representations to effectively retain the knowledge learned by the upstream NMT model. Extensive experiments on domain adaptation tasks demonstrate that our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.\footnote{Our code is available at \url{https://github.com/DeepLearnXMU/RevisedKey-knn-mt}.}, Comment: Accepted to ACL 2023
Published: 2023

16. From Statistical Methods to Deep Learning, Automatic Keyphrase Prediction: A Survey

Author: Xie, Binbin, Song, Jia, Shao, Liangying, Wu, Suhang, Wei, Xiangpeng, Yang, Baosong, Lin, Huan, Xie, Jun, and Su, Jinsong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Keyphrase prediction aims to generate phrases (keyphrases) that highly summarizes a given document. Recently, researchers have conducted in-depth studies on this task from various perspectives. In this paper, we comprehensively summarize representative studies from the perspectives of dominant models, datasets and evaluation metrics. Our work analyzes up to 167 previous works, achieving greater coverage of this task than previous surveys. Particularly, we focus highly on deep learning-based keyphrase prediction, which attracts increasing attention of this task in recent years. Afterwards, we conduct several groups of experiments to carefully compare representative models. To the best of our knowledge, our work is the first attempt to compare these models using the identical commonly-used datasets and evaluation metric, facilitating in-depth analyses of their disadvantages and advantages. Finally, we discuss the possible research directions of this task in the future., Comment: Information Processing & Management
Published: 2023
Full Text: View/download PDF

17. Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors

Author: Bao, Keqin, Wan, Yu, Liu, Dayiheng, Yang, Baosong, Lei, Wenqiang, He, Xiangnan, Wong, Derek F., and Xie, Jun
Subjects: Computer Science - Computation and Language
Abstract: Fine-grained information on translation errors is helpful for the translation evaluation community. Existing approaches can not synchronously consider error position and type, failing to integrate the error information of both. In this paper, we propose Fine-Grained Translation Error Detection (FG-TED) task, aiming at identifying both the position and the type of translation errors on given source-hypothesis sentence pairs. Besides, we build an FG-TED model to predict the \textbf{addition} and \textbf{omission} errors -- two typical translation accuracy errors. First, we use a word-level classification paradigm to form our model and use the shortcut learning reduction to relieve the influence of monolingual features. Besides, we construct synthetic datasets for model training, and relieve the disagreement of data labeling in authoritative datasets, making the experimental benchmark concordant. Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results on the restored dataset. Our model also delivers more reliable predictions on low-resource and transfer scenarios than existing baselines. The related datasets and the source code will be released in the future.
Published: 2023

18. Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality?

Author: Zhang, Pei, Yang, Baosong, Wei, Haoran, Liu, Dayiheng, Fan, Kai, Si, Luo, and Xie, Jun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Neural machine translation (NMT) is often criticized for failures that happen without awareness. The lack of competency awareness makes NMT untrustworthy. This is in sharp contrast to human translators who give feedback or conduct further investigations whenever they are in doubt about predictions. To fill this gap, we propose a novel competency-aware NMT by extending conventional NMT with a self-estimator, offering abilities to translate a source sentence and estimate its competency. The self-estimator encodes the information of the decoding procedure and then examines whether it can reconstruct the original semantics of the source sentence. Experimental results on four translation tasks demonstrate that the proposed method not only carries out translation tasks intact but also delivers outstanding performance on quality estimation. Without depending on any reference or annotated data typically required by state-of-the-art metric and quality estimation methods, our model yields an even higher correlation with human quality judgments than a variety of aforementioned methods, such as BLEURT, COMET, and BERTScore. Quantitative and qualitative analyses show better robustness of competency awareness in our model., Comment: accepted to EMNLP 2022
Published: 2022

19. WR-ONE2SET: Towards Well-Calibrated Keyphrase Generation

Author: Xie, Binbin, Wei, Xiangpeng, Yang, Baosong, Lin, Huan, Xie, Jun, Wang, Xiaoli, Zhang, Min, and Su, Jinsong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Keyphrase generation aims to automatically generate short phrases summarizing an input document. The recently emerged ONE2SET paradigm (Ye et al., 2021) generates keyphrases as a set and has achieved competitive performance. Nevertheless, we observe serious calibration errors outputted by ONE2SET, especially in the over-estimation of $\varnothing$ token (means "no corresponding keyphrase"). In this paper, we deeply analyze this limitation and identify two main reasons behind: 1) the parallel generation has to introduce excessive $\varnothing$ as padding tokens into training instances; and 2) the training mechanism assigning target to each slot is unstable and further aggravates the $\varnothing$ token over-estimation. To make the model well-calibrated, we propose WR-ONE2SET which extends ONE2SET with an adaptive instance-level cost Weighting strategy and a target Re-assignment mechanism. The former dynamically penalizes the over-estimated slots for different instances thus smoothing the uneven training distribution. The latter refines the original inappropriate assignment and reduces the supervisory signals of over-estimated slots. Experimental results on commonly-used datasets demonstrate the effectiveness and generality of our proposed paradigm., Comment: EMNLP2022
Published: 2022

20. Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task

Author: Bao, Keqin, Wan, Yu, Liu, Dayiheng, Yang, Baosong, Lei, Wenqiang, He, Xiangnan, Wong, Derek F., and Xie, Jun
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we present our submission to the sentence-level MQM benchmark at Quality Estimation Shared Task, named UniTE (Unified Translation Evaluation). Specifically, our systems employ the framework of UniTE, which combined three types of input formats during training with a pre-trained language model. First, we apply the pseudo-labeled data examples for the continuously pre-training phase. Notably, to reduce the gap between pre-training and fine-tuning, we use data pruning and a ranking-based score normalization strategy. For the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Finally, we collect the source-only evaluation results, and ensemble the predictions generated by two UniTE models, whose backbones are XLM-R and InfoXLM, respectively. Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings, showing relatively strong performances in this year's quality estimation competition., Comment: WMT 2022 QE Shared Task. arXiv admin note: text overlap with arXiv:2210.09683
Published: 2022

21. Alibaba-Translate China's Submission for WMT 2022 Metrics Shared Task

Author: Wan, Yu, Bao, Keqin, Liu, Dayiheng, Yang, Baosong, Wong, Derek F., Chao, Lidia S., Lei, Wenqiang, and Xie, Jun
Subjects: Computer Science - Computation and Language
Abstract: In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UNITE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply the pseudo-labeled data examples to continuously pre-train UNITE. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Specially, we collect the results from models with different pre-trained language model backbones, and use different ensembling strategies for involved translation directions., Comment: WMT 2022 Metrics Shared Task
Published: 2022

22. Draft, Command, and Edit: Controllable Text Editing in E-Commerce

Author: Yang, Kexin, Liu, Dayiheng, Lei, Wenqiang, Yang, Baosong, Qu, Qian, and Lv, Jiancheng
Subjects: Computer Science - Computation and Language
Abstract: Product description generation is a challenging and under-explored task. Most such work takes a set of product attributes as inputs then generates a description from scratch in a single pass. However, this widespread paradigm might be limited when facing the dynamic wishes of users on constraining the description, such as deleting or adding the content of a user-specified attribute based on the previous version. To address this challenge, we explore a new draft-command-edit manner in description generation, leading to the proposed new task-controllable text editing in E-commerce. More specifically, we allow systems to receive a command (deleting or adding) from the user and then generate a description by flexibly modifying the content based on the previous version. It is easier and more practical to meet the new needs by modifying previous versions than generating from scratch. Furthermore, we design a data augmentation method to remedy the low resource challenge in this task, which contains a model-based and a rule-based strategy to imitate the edit by humans. To accompany this new task, we present a human-written draft-command-edit dataset called E-cEdits and a new metric "Attribute Edit". Our experimental results show that using the new data augmentation method outperforms baselines to a greater extent in both automatic and human evaluations.
Published: 2022

23. Should We Rely on Entity Mentions for Relation Extraction? Debiasing Relation Extraction with Counterfactual Analysis

Author: Wang, Yiwei, Chen, Muhao, Zhou, Wenxuan, Cai, Yujun, Liang, Yuxuan, Liu, Dayiheng, Yang, Baosong, Liu, Juncheng, and Hooi, Bryan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent literature focuses on utilizing the entity information in the sentence-level relation extraction (RE), but this risks leaking superficial and spurious clues of relations. As a result, RE still suffers from unintended entity bias, i.e., the spurious correlation between entity mentions (names) and relations. Entity bias can mislead the RE models to extract the relations that do not exist in the text. To combat this issue, some previous work masks the entity mentions to prevent the RE models from overfitting entity mentions. However, this strategy degrades the RE performance because it loses the semantic information of entities. In this paper, we propose the CORE (Counterfactual Analysis based Relation Extraction) debiasing method that guides the RE models to focus on the main effects of textual context without losing the entity information. We first construct a causal graph for RE, which models the dependencies between variables in RE models. Then, we propose to conduct counterfactual analysis on our causal graph to distill and mitigate the entity bias, that captures the causal effects of specific entity mentions in each instance. Note that our CORE method is model-agnostic to debias existing RE systems during inference without changing their training processes. Extensive experimental results demonstrate that our CORE yields significant gains on both effectiveness and generalization for RE. The source code is provided at: https://github.com/vanoracai/CoRE., Comment: NAACL 2022
Published: 2022

24. Dangling-Aware Entity Alignment with Mixed High-Order Proximities

Author: Liu, Juncheng, Sun, Zequn, Hooi, Bryan, Wang, Yiwei, Liu, Dayiheng, Yang, Baosong, Xiao, Xiaokui, and Chen, Muhao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We study dangling-aware entity alignment in knowledge graphs (KGs), which is an underexplored but important problem. As different KGs are naturally constructed by different sets of entities, a KG commonly contains some dangling entities that cannot find counterparts in other KGs. Therefore, dangling-aware entity alignment is more realistic than the conventional entity alignment where prior studies simply ignore dangling entities. We propose a framework using mixed high-order proximities on dangling-aware entity alignment. Our framework utilizes both the local high-order proximity in a nearest neighbor subgraph and the global high-order proximity in an embedding space for both dangling detection and entity alignment. Extensive experiments with two evaluation settings shows that our framework more precisely detects dangling entities, and better aligns matchable entities. Further investigations demonstrate that our framework can mitigate the hubness problem on dangling-aware entity alignment., Comment: NAACL 2022 (Findings)
Published: 2022

25. Tailor: A Prompt-Based Approach to Attribute-Based Controlled Text Generation

Author: Yang, Kexin, Liu, Dayiheng, Lei, Wenqiang, Yang, Baosong, Xue, Mingfeng, Chen, Boxing, and Xie, Jun
Subjects: Computer Science - Computation and Language
Abstract: Attribute-based Controlled Text Generation (CTG) refers to generating sentences that satisfy desirable attributes (e.g., emotions and topics). Existing works often utilize fine-tuning or resort to extra attribute classifiers, yet suffer from storage and inference time increases. To address these concerns, we explore attribute-based CTG in a prompt-based manner. In short, the proposed Tailor represents each attribute as a pre-trained continuous vector (i.e., single-attribute prompt) and guides the generation of a fixed PLM switch to a pre-specified attribute. We experimentally find that these prompts can be simply concatenated as a whole to multi-attribute CTG without any re-training, yet raises problems of fluency decrease and position sensitivity. To this end, Tailor provides a multi-attribute prompt mask and a re-indexing position-ids sequence to bridge the gap between the training (one prompt for each task) and testing stage (concatenating more than one prompt). To further enhance such single-attribute prompt combinations, Tailor also introduces a trainable prompt connector, which can be concatenated with any two single-attribute prompts to multi-attribute text generation. Experiments on 11 attribute-specific generation tasks demonstrate strong performances of Tailor on both single-attribute and multi-attribute CTG, with 0.08\% training parameters of a GPT-2.
Published: 2022

26. Attention Mechanism with Energy-Friendly Operations

Author: Wan, Yu, Yang, Baosong, Liu, Dayiheng, Xiao, Rong, Wong, Derek F., Zhang, Haibo, Chen, Boxing, and Chao, Lidia S.
Subjects: Computer Science - Computation and Language
Abstract: Attention mechanism has become the dominant module in natural language processing models. It is computationally intensive and depends on massive power-hungry multiplications. In this paper, we rethink variants of attention mechanism from the energy consumption aspects. After reaching the conclusion that the energy costs of several energy-friendly operations are far less than their multiplication counterparts, we build a novel attention model by replacing multiplications with either selective operations or additions. Empirical results on three machine translation tasks demonstrate that the proposed model, against the vanilla one, achieves competitable accuracy while saving 99\% and 66\% energy during alignment calculation and the whole attention procedure. Code is available at: https://github.com/NLP2CT/E-Att., Comment: Findings@ACL2022
Published: 2022
Full Text: View/download PDF

27. RoBLEURT Submission for the WMT2021 Metrics Task

Author: Wan, Yu, Liu, Dayiheng, Yang, Baosong, Bi, Tianchi, Zhang, Haibo, Chen, Boxing, Luo, Weihua, Wong, Derek F., and Chao, Lidia S.
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we present our submission to Shared Metrics Task: RoBLEURT (Robustly Optimizing the training of BLEURT). After investigating the recent advances of trainable metrics, we conclude several aspects of vital importance to obtain a well-performed metric model by: 1) jointly leveraging the advantages of source-included model and reference-only model, 2) continuously pre-training the model with massive synthetic data pairs, and 3) fine-tuning the model with data denoising strategy. Experimental results show that our model reaching state-of-the-art correlations with the WMT2020 human annotations upon 8 out of 10 to-English language pairs., Comment: WMT2021 Metrics Shared Task
Published: 2022

28. UniTE: Unified Translation Evaluation

Author: Wan, Yu, Liu, Dayiheng, Yang, Baosong, Zhang, Haibo, Chen, Boxing, Wong, Derek F., and Chao, Lidia S.
Subjects: Computer Science - Computation and Language
Abstract: Translation quality evaluation plays a crucial role in machine translation. According to the input format, it is mainly separated into three tasks, i.e., reference-only, source-only and source-reference-combined. Recent methods, despite their promising results, are specifically designed and optimized on one of them. This limits the convenience of these methods, and overlooks the commonalities among tasks. In this paper, we propose UniTE, which is the first unified framework engaged with abilities to handle all three evaluation tasks. Concretely, we propose monotonic regional attention to control the interaction among input segments, and unified pretraining to better adapt multi-task learning. We testify our framework on WMT 2019 Metrics and WMT 2020 Quality Estimation benchmarks. Extensive analyses show that our \textit{single model} can universally surpass various state-of-the-art or winner methods across tasks. Both source code and associated models are available at https://github.com/NLP2CT/UniTE., Comment: ACL2022
Published: 2022
Full Text: View/download PDF

29. RMBR: A Regularized Minimum Bayes Risk Reranking Framework for Machine Translation

Author: Zhang, Yidan, Wan, Yu, Liu, Dayiheng, Yang, Baosong, and He, Zhenan
Subjects: Computer Science - Computation and Language
Abstract: Beam search is the most widely used decoding method for neural machine translation (NMT). In practice, the top-1 candidate with the highest log-probability among the n candidates is selected as the preferred one. However, this top-1 candidate may not be the best overall translation among the n-best list. Recently, Minimum Bayes Risk (MBR) decoding has been proposed to improve the quality for NMT, which seeks for a consensus translation that is closest on average to other candidates from the n-best list. We argue that MBR still suffers from the following problems: The utility function only considers the lexical-level similarity between candidates; The expected utility considers the entire n-best list which is time-consuming and inadequate candidates in the tail list may hurt the performance; Only the relationship between candidates is considered. To solve these issues, we design a regularized MBR reranking framework (RMBR), which considers semantic-based similarity and computes the expected utility for each candidate by truncating the list. We expect the proposed framework to further consider the translation quality and model uncertainty of each candidate. Thus the proposed quality regularizer and uncertainty regularizer are incorporated into the framework. Extensive experiments on multiple translation tasks demonstrate the effectiveness of our method., Comment: 10 pages
Published: 2022

30. Frequency-Aware Contrastive Learning for Neural Machine Translation

Author: Zhang, Tong, Ye, Wei, Yang, Baosong, Zhang, Long, Ren, Xingzhang, Liu, Dayiheng, Sun, Jinan, Zhang, Shikun, Zhang, Haibo, and Zhao, Wen
Subjects: Computer Science - Computation and Language
Abstract: Low-frequency word prediction remains a challenge in modern neural machine translation (NMT) systems. Recent adaptive training methods promote the output of infrequent words by emphasizing their weights in the overall training objectives. Despite the improved recall of low-frequency words, their prediction precision is unexpectedly hindered by the adaptive objectives. Inspired by the observation that low-frequency words form a more compact embedding space, we tackle this challenge from a representation learning perspective. Specifically, we propose a frequency-aware token-level contrastive learning method, in which the hidden state of each decoding step is pushed away from the counterparts of other target words, in a soft contrastive way based on the corresponding word frequencies. We conduct experiments on widely used NIST Chinese-English and WMT14 English-German translation tasks. Empirical results show that our proposed methods can not only significantly improve the translation quality but also enhance lexical diversity and optimize word representation space. Further investigation reveals that, comparing with related adaptive training strategies, the superiority of our method on low-frequency word prediction lies in the robustness of token-level recall across different frequencies without sacrificing precision., Comment: Published at AAAI 2022
Published: 2021

31. KGR^4: Retrieval, Retrospect, Refine and Rethink for Commonsense Generation

Author: Liu, Xin, Liu, Dayiheng, Yang, Baosong, Zhang, Haibo, Ding, Junwei, Yao, Wenqing, Luo, Weihua, Zhang, Haiying, and Su, Jinsong
Subjects: Computer Science - Computation and Language
Abstract: Generative commonsense reasoning requires machines to generate sentences describing an everyday scenario given several concepts, which has attracted much attention recently. However, existing models cannot perform as well as humans, since sentences they produce are often implausible and grammatically incorrect. In this paper, inspired by the process of humans creating sentences, we propose a novel Knowledge-enhanced Commonsense Generation framework, termed KGR^4, consisting of four stages: Retrieval, Retrospect, Refine, Rethink. Under this framework, we first perform retrieval to search for relevant sentences from external corpus as the prototypes. Then, we train the generator that either edits or copies these prototypes to generate candidate sentences, of which potential errors will be fixed by an autoencoder-based refiner. Finally, we select the output sentence from candidate sentences produced by generators with different hyper-parameters. Experimental results and in-depth analysis on the CommonGen benchmark strongly demonstrate the effectiveness of our framework. Particularly, KGR^4 obtains 33.56 SPICE points in the official leaderboard, outperforming the previously-reported best result by 2.49 SPICE points and achieving state-of-the-art performance.
Published: 2021

32. Leveraging Advantages of Interactive and Non-Interactive Models for Vector-Based Cross-Lingual Information Retrieval

Author: Xu, Linlong, Yang, Baosong, Lv, Xiaoyu, Bi, Tianchi, Liu, Dayiheng, and Zhang, Haibo
Subjects: Computer Science - Computation and Language
Abstract: Interactive and non-interactive model are the two de-facto standard frameworks in vector-based cross-lingual information retrieval (V-CLIR), which embed queries and documents in synchronous and asynchronous fashions, respectively. From the retrieval accuracy and computational efficiency perspectives, each model has its own superiority and shortcoming. In this paper, we propose a novel framework to leverage the advantages of these two paradigms. Concretely, we introduce semi-interactive mechanism, which builds our model upon non-interactive architecture but encodes each document together with its associated multilingual queries. Accordingly, cross-lingual features can be better learned like an interactive model. Besides, we further transfer knowledge from a well-trained interactive model to ours by reusing its word embeddings and adopting knowledge distillation. Our model is initialized from a multilingual pre-trained language model M-BERT, and evaluated on two open-resource CLIR datasets derived from Wikipedia and an in-house dataset collected from a real-world search engine. Extensive analyses reveal that our methods significantly boost the retrieval accuracy while maintaining the computational efficiency.
Published: 2021

33. Towards User-Driven Neural Machine Translation

Author: Lin, Huan, Yao, Liang, Yang, Baosong, Liu, Dayiheng, Zhang, Haibo, Luo, Weihua, Huang, Degen, and Su, Jinsong
Subjects: Computer Science - Computation and Language
Abstract: A good translation should not only translate the original content semantically, but also incarnate personal traits of the original text. For a real-world neural machine translation (NMT) system, these user traits (e.g., topic preference, stylistic characteristics and expression habits) can be preserved in user behavior (e.g., historical inputs). However, current NMT systems marginally consider the user behavior due to: 1) the difficulty of modeling user portraits in zero-shot scenarios, and 2) the lack of user-behavior annotated parallel dataset. To fill this gap, we introduce a novel framework called user-driven NMT. Specifically, a cache-based module and a user-driven contrastive learning method are proposed to offer NMT the ability to capture potential user traits from their historical inputs under a zero-shot learning fashion. Furthermore, we contribute the first Chinese-English parallel corpus annotated with user behavior called UDT-Corpus. Experimental results confirm that the proposed user-driven NMT can generate user-specific translations.
Published: 2021

34. Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation

Author: Liu, Xin, Yang, Baosong, Liu, Dayiheng, Zhang, Haibo, Luo, Weihua, Zhang, Min, Zhang, Haiying, and Su, Jinsong
Subjects: Computer Science - Computation and Language
Abstract: A well-known limitation in pretrain-finetune paradigm lies in its inflexibility caused by the one-size-fits-all vocabulary. This potentially weakens the effect when applying pretrained models into natural language generation (NLG) tasks, especially for the subword distributions between upstream and downstream tasks with significant discrepancy. Towards approaching this problem, we extend the vanilla pretrain-finetune pipeline with an extra embedding transfer step. Specifically, a plug-and-play embedding generator is introduced to produce the representation of any input token, according to pre-trained embeddings of its morphologically similar ones. Thus, embeddings of mismatch tokens in downstream tasks can also be efficiently initialized. We conduct experiments on a variety of NLG tasks under the pretrain-finetune fashion. Experimental results and extensive analyses show that the proposed strategy offers us opportunities to feel free to transfer the vocabulary, leading to more efficient and better performed downstream NLG models., Comment: Accepted by ACL2021
Published: 2021

35. Imitation Attacks Can Steal More Than You Think from Machine Translation Systems

Author: Hu, Tianxiang, Zhang, Pei, Yang, Baosong, Xie, Jun, Wang, Rui, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Fei, editor, Duan, Nan, editor, Xu, Qingting, editor, and Hong, Yu, editor
Published: 2023
Full Text: View/download PDF

36. Constraint Translation Candidates: A Bridge between Neural Query Translation and Cross-lingual Information Retrieval

Author: Bi, Tianchi, Yao, Liang, Yang, Baosong, Zhang, Haibo, Luo, Weihua, and Chen, Boxing
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Query translation (QT) is a key component in cross-lingual information retrieval system (CLIR). With the help of deep learning, neural machine translation (NMT) has shown promising results on various tasks. However, NMT is generally trained with large-scale out-of-domain data rather than in-domain query translation pairs. Besides, the translation model lacks a mechanism at the inference time to guarantee the generated words to match the search index. The two shortages of QT result in readable texts for human but inadequate candidates for the downstream retrieval task. In this paper, we propose a novel approach to alleviate these problems by limiting the open target vocabulary search space of QT to a set of important words mined from search index database. The constraint translation candidates are employed at both of training and inference time, thus guiding the translation model to learn and generate well performing target queries. The proposed methods are exploited and examined in a real-word CLIR system--Aliexpress e-Commerce search engine. Experimental results demonstrate that our approach yields better performance on both translation quality and retrieval accuracy than the strong NMT baseline., Comment: SIGIR eCom 2020
Published: 2020

37. Exploiting Neural Query Translation into Cross Lingual Information Retrieval

Author: Yao, Liang, Yang, Baosong, Zhang, Haibo, Luo, Weihua, and Chen, Boxing
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: As a crucial role in cross-language information retrieval (CLIR), query translation has three main challenges: 1) the adequacy of translation; 2) the lack of in-domain parallel training data; and 3) the requisite of low latency. To this end, existing CLIR systems mainly exploit statistical-based machine translation (SMT) rather than the advanced neural machine translation (NMT), limiting the further improvements on both translation and retrieval quality. In this paper, we investigate how to exploit neural query translation model into CLIR system. Specifically, we propose a novel data augmentation method that extracts query translation pairs according to user clickthrough data, thus to alleviate the problem of domain-adaptation in NMT. Then, we introduce an asynchronous strategy which is able to leverage the advantages of the real-time in SMT and the veracity in NMT. Experimental results reveal that the proposed approach yields better retrieval quality than strong baselines and can be well applied into a real-world CLIR system, i.e. Aliexpress e-Commerce search engine. Readers can examine and test their cases on our website: https://aliexpress.com ., Comment: SIGIR eCom 2020
Published: 2020

38. Self-Paced Learning for Neural Machine Translation

Author: Wan, Yu, Yang, Baosong, Wong, Derek F., Zhou, Yikai, Chao, Lidia S., Zhang, Haibo, and Chen, Boxing
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent studies have proven that the training of neural machine translation (NMT) can be facilitated by mimicking the learning process of humans. Nevertheless, achievements of such kind of curriculum learning rely on the quality of artificial schedule drawn up with the handcrafted features, e.g. sentence length or word rarity. We ameliorate this procedure with a more flexible manner by proposing self-paced learning, where NMT model is allowed to 1) automatically quantify the learning confidence over training examples; and 2) flexibly govern its learning via regulating the loss in each iteration step. Experimental results over multiple translation tasks demonstrate that the proposed model yields better performance than strong baselines and those models trained with human-designed curricula on both translation quality and convergence speed., Comment: Accepted by EMNLP2020
Published: 2020
Full Text: View/download PDF

39. Unsupervised Neural Dialect Translation with Commonality and Diversity Modeling

Author: Wan, Yu, Yang, Baosong, Wong, Derek F., Chao, Lidia S., Du, Haihua, and Ao, Ben C. H.
Subjects: Computer Science - Computation and Language
Abstract: As a special machine translation task, dialect translation has two main characteristics: 1) lack of parallel training corpus; and 2) possessing similar grammar between two sides of the translation. In this paper, we investigate how to exploit the commonality and diversity between dialects thus to build unsupervised translation models merely accessing to monolingual data. Specifically, we leverage pivot-private embedding, layer coordination, as well as parameter sharing to sufficiently model commonality and diversity among source and target, ranging from lexical, through syntactic, to semantic levels. In order to examine the effectiveness of the proposed models, we collect 20 million monolingual corpus for each of Mandarin and Cantonese, which are official language and the most widely used dialect in China. Experimental results reveal that our methods outperform rule-based simplified and traditional Chinese conversion and conventional unsupervised translation models over 12 BLEU scores., Comment: AAAI 2020
Published: 2019
Full Text: View/download PDF

40. Neuron Interaction Based Representation Composition for Neural Machine Translation

Author: Li, Jian, Wang, Xing, Yang, Baosong, Shi, Shuming, Lyu, Michael R., and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent NLP studies reveal that substantial linguistic information can be attributed to single neurons, i.e., individual dimensions of the representation vectors. We hypothesize that modeling strong interactions among neurons helps to better capture complex information by composing the linguistic properties embedded in individual neurons. Starting from this intuition, we propose a novel approach to compose representations learned by different components in neural machine translation (e.g., multi-layer networks or multi-head attention), based on modeling strong interactions among neurons in the representation vectors. Specifically, we leverage bilinear pooling to model pairwise multiplicative interactions among individual neurons, and a low-rank approximation to make the model computationally feasible. We further propose extended bilinear pooling to incorporate first-order representations. Experiments on WMT14 English-German and English-French translation tasks show that our model consistently improves performances over the SOTA Transformer baseline. Further analyses demonstrate that our approach indeed captures more syntactic and semantic information as expected., Comment: AAAI 2020
Published: 2019

41. Assessing the Ability of Self-Attention Networks to Learn Word Order

Author: Yang, Baosong, Wang, Longyue, Wong, Derek F., Chao, Lidia S., and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Self-attention networks (SAN) have attracted a lot of interests due to their high parallelization and strong performance on a variety of NLP tasks, e.g. machine translation. Due to the lack of recurrence structure such as recurrent neural networks (RNN), SAN is ascribed to be weak at learning positional information of words for sequence modeling. However, neither this speculation has been empirically confirmed, nor explanations for their strong performances on machine translation tasks when "lacking positional information" have been explored. To this end, we propose a novel word reordering detection task to quantify how well the word order information learned by SAN and RNN. Specifically, we randomly move one word to another position, and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning the positional information even with the position embedding; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which position embedding plays a critical role. Although recurrence structure make the model more universally-effective on learning word order, learning objectives matter more in the downstream tasks such as machine translation., Comment: ACL 2019
Published: 2019

42. Convolutional Self-Attention Networks

Author: Yang, Baosong, Wang, Longyue, Wong, Derek, Chao, Lidia S., and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Self-attention networks (SANs) have drawn increasing interest due to their high parallelization in computation and flexibility in modeling dependencies. SANs can be further enhanced with multi-head attention by allowing the model to attend to information from different representation subspaces. In this work, we propose novel convolutional self-attention networks, which offer SANs the abilities to 1) strengthen dependencies among neighboring elements, and 2) model the interaction between features extracted by multiple attention heads. Experimental results of machine translation on different language pairs and model settings show that our approach outperforms both the strong Transformer baseline and other existing models on enhancing the locality of SANs. Comparing with prior studies, the proposed model is parameter free in terms of introducing no more parameters., Comment: NAACL 2019
Published: 2019

43. Information Aggregation for Multi-Head Attention with Routing-by-Agreement

Author: Li, Jian, Yang, Baosong, Dou, Zi-Yi, Wang, Xing, Lyu, Michael R., and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Multi-head attention is appealing for its ability to jointly extract different types of information from multiple representation subspaces. Concerning the information aggregation, a common practice is to use a concatenation followed by a linear transformation, which may not fully exploit the expressiveness of multi-head attention. In this work, we propose to improve the information aggregation for multi-head attention with a more powerful routing-by-agreement algorithm. Specifically, the routing algorithm iteratively updates the proportion of how much a part (i.e. the distinct information learned from a specific subspace) should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Experimental results on linguistic probing tasks and machine translation tasks prove the superiority of the advanced information aggregation over the standard linear transformation., Comment: NAACL 2019
Published: 2019

44. Modeling Recurrence for Transformer

Author: Hao, Jie, Wang, Xing, Yang, Baosong, Wang, Longyue, Zhang, Jinfeng, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recently, the Transformer model that is based solely on attention mechanisms, has advanced the state-of-the-art on various machine translation tasks. However, recent studies reveal that the lack of recurrence hinders its further improvement of translation capacity. In response to this problem, we propose to directly model recurrence for Transformer with an additional recurrence encoder. In addition to the standard recurrent neural network, we introduce a novel attentive recurrent network to leverage the strengths of both attention and recurrent networks. Experimental results on the widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness of the proposed approach. Our studies also reveal that the proposed model benefits from a short-cut that bridges the source and target sequences with a single recurrent layer, which outperforms its deep counterpart., Comment: NAACL 2019
Published: 2019

45. Context-Aware Self-Attention Networks

Author: Yang, Baosong, Li, Jian, Wong, Derek, Chao, Lidia S., Wang, Xing, and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language
Abstract: Self-attention model have shown its flexibility in parallel computation and the effectiveness on modeling both long- and short-term dependencies. However, it calculates the dependencies between representations without considering the contextual information, which have proven useful for modeling dependencies among neural representations in various natural language tasks. In this work, we focus on improving self-attention networks through capturing the richness of context. To maintain the simplicity and flexibility of the self-attention networks, we propose to contextualize the transformations of the query and key layers, which are used to calculates the relevance between elements. Specifically, we leverage the internal representations that embed both global and deep contexts, thus avoid relying on external resources. Experimental results on WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed methods. Furthermore, we conducted extensive analyses to quantity how the context vectors participate in the self-attention model., Comment: AAAI 2019
Published: 2019

46. Cross-Lingual Product Retrieval in E-Commerce Search

Author: Zhu, Wenya, Lv, Xiaoyu, Yang, Baosong, Zhang, Yinghua, Yong, Xu, Xu, Linlong, Feng, Yinfu, Zhang, Haibo, Da, Qing, Zeng, Anxiang, Chen, Ronghua, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Gama, João, editor, Li, Tianrui, editor, Yu, Yang, editor, Chen, Enhong, editor, Zheng, Yu, editor, and Teng, Fei, editor
Published: 2022
Full Text: View/download PDF

47. Convolutional Self-Attention Network

Author: Yang, Baosong, Wang, Longyue, Wong, Derek F., Chao, Lidia S., and Tu, Zhaopeng
Subjects: Computer Science - Computation and Language
Abstract: Self-attention network (SAN) has recently attracted increasing interest due to its fully parallelized computation and flexibility in modeling dependencies. It can be further enhanced with multi-headed attention mechanism by allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017). In this work, we propose a novel convolutional self-attention network (CSAN), which offers SAN the abilities to 1) capture neighboring dependencies, and 2) model the interaction between multiple attention heads. Experimental results on WMT14 English-to-German translation task demonstrate that the proposed approach outperforms both the strong Transformer baseline and other existing works on enhancing the locality of SAN. Comparing with previous work, our model does not introduce any new parameters., Comment: The least version of this paper has been uploaded to another link: arXiv:1904.03107
Published: 2018

48. Modeling Localness for Self-Attention Networks

Author: Yang, Baosong, Tu, Zhaopeng, Wong, Derek F., Meng, Fandong, Chao, Lidia S., and Zhang, Tong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Self-attention networks have proven to be of profound value for its strength of capturing global dependencies. In this work, we propose to model localness for self-attention networks, which enhances the ability of capturing useful local context. We cast localness modeling as a learnable Gaussian bias, which indicates the central and scope of the local region to be paid more attention. The bias is then incorporated into the original attention distribution to form a revised distribution. To maintain the strength of capturing long distance dependencies and enhance the ability of capturing short-range dependencies, we only apply localness modeling to lower layers of self-attention networks. Quantitative and qualitative analyses on Chinese-English and English-German translation tasks demonstrate the effectiveness and universality of the proposed approach., Comment: EMNLP 2018
Published: 2018

49. Multi-Head Attention with Disagreement Regularization

Author: Li, Jian, Tu, Zhaopeng, Yang, Baosong, Lyu, Michael R., and Zhang, Tong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach., Comment: EMNLP 2018
Published: 2018

50. Multi-view self-attention networks

Author: Xu, Mingzhou, Yang, Baosong, Wong, Derek F., and Chao, Lidia S.
Published: 2022
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

167 results on '"Yang, Baosong"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources