262 results for "Cherry, Colin"
Search Results
2. Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts
- Author
Briakou, Eleftheria, Luo, Jiaming, Cherry, Colin, and Freitag, Markus
- Subjects
Computer Science - Computation and Language
- Abstract
In this paper we present a step-by-step approach to long-form text translation, drawing on established processes in translation studies. Instead of viewing machine translation as a single, monolithic task, we propose a framework that engages language models in a multi-turn interaction, encompassing pre-translation research, drafting, refining, and proofreading, resulting in progressively improved translations. Extensive automatic evaluations using Gemini 1.5 Pro across ten language pairs show that translating step-by-step yields large translation quality improvements over conventional zero-shot prompting approaches and earlier human-like baseline strategies, resulting in state-of-the-art results on WMT2024.
- Published
- 2024
3. Don't Throw Away Data: Better Sequence Knowledge Distillation
- Author
Wang, Jun, Briakou, Eleftheria, Dadkhahi, Hamid, Agarwal, Rishabh, Cherry, Colin, and Cohn, Trevor
- Subjects
Computer Science - Computation and Language
- Abstract
A critical component in knowledge distillation is the means of coupling the teacher and student. The predominant sequence knowledge distillation method involves supervised learning of the student against teacher-decoded outputs, and is exemplified by the current state of the art, which incorporates minimum Bayes risk (MBR) decoding. In this paper we seek to integrate MBR more tightly in distillation training, specifically by using several high scoring MBR translations, rather than a single selected sequence, thus capturing a rich diversity of teacher outputs. Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods for both tasks and with varying model sizes. Additionally, we conduct a detailed analysis focusing on data efficiency and capacity curse aspects to elucidate MBR-n and explore its further potential.
- Published
- 2024
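To make the MBR-n idea from the entry above concrete, here is a minimal sketch of standard MBR scoring that keeps the n highest-scoring candidates instead of a single winner. The `utility` callable (e.g., a sentence-level metric such as chrF) and the commented usage are hypothetical placeholders, not the authors' implementation.

```python
from typing import Callable, List

def mbr_top_n(candidates: List[str],
              utility: Callable[[str, str], float],
              n: int = 4) -> List[str]:
    """Return the n candidates with the highest expected utility,
    treating the other candidates as pseudo-references (standard MBR)."""
    scores = []
    for hyp in candidates:
        # Expected utility of `hyp` against the other teacher samples.
        refs = [c for c in candidates if c is not hyp]
        scores.append(sum(utility(hyp, ref) for ref in refs) / max(len(refs), 1))
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:n]]

# Hypothetical usage: distill on several high-scoring outputs instead of one.
# teacher_samples = teacher.sample(src, k=64)
# targets = mbr_top_n(teacher_samples, utility=chrf_sentence, n=8)
```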
4. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
- Author
Zhang, Biao, Liu, Zhongtao, Cherry, Colin, and Firat, Orhan
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
While large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications, our understanding of the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning -- full-model tuning (FMT) and parameter-efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a power-based multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods., Comment: ICLR24
- Published
- 2024
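The "power-based multiplicative joint scaling law" mentioned in the abstract above has roughly the following shape (an illustrative restatement assuming a simple additive irreducible-loss term; the exact parameterization and fitted constants are given in the paper):

```latex
\hat{\mathcal{L}}(X, D_f) \;=\; A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} \;+\; E
```

Here D_f is the finetuning data size, X stands for one other scaling factor (LLM size, pretraining data size, or PET parameter count), and A, E, alpha, beta are fitted constants.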
5. To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation
- Author
Luo, Jiaming, Cherry, Colin, and Foster, George
- Subjects
Computer Science - Computation and Language
- Abstract
We conduct a large-scale fine-grained comparative analysis of machine translations (MT) against human translations (HT) through the lens of morphosyntactic divergence. Across three language pairs and two types of divergence defined as the structural difference between the source and the target, MT is consistently more conservative than HT, with less morphosyntactic diversity, more convergent patterns, and more one-to-one alignments. Through analysis on different decoding algorithms, we attribute this discrepancy to the use of beam search that biases MT towards more convergent patterns. This bias is most amplified when the convergent pattern appears around 50% of the time in training data. Lastly, we show that for a majority of morphosyntactic divergences, their presence in HT is correlated with decreased MT performance, presenting a greater challenge for MT systems., Comment: TACL, pre-MIT Press publication version
- Published
- 2024
6. Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model
- Author
Tomani, Christian, Vilar, David, Freitag, Markus, Cherry, Colin, Naskar, Subhajit, Finkelstein, Mara, Garcia, Xavier, and Cremers, Daniel
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, with better translations getting assigned a higher score by the model. However, research has shown that this assumption does not always hold, and generation quality can be improved by decoding to optimize a utility function backed by a metric or quality-estimation signal, as is done by Minimum Bayes Risk (MBR) or quality-aware decoding. The main disadvantage of these approaches is that they require an additional model to calculate the utility function during decoding, significantly increasing the computational cost. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. Using this approach for MBR decoding we can drastically reduce the size of the candidate list, resulting in a speed-up of two orders of magnitude. When applying our method to MAP decoding we obtain quality gains similar to or even superior to quality reranking approaches, but with the efficiency of single-pass decoding., Comment: In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
- Published
- 2023
7. XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
- Author
Ruder, Sebastian, Clark, Jonathan H., Gutkin, Alexander, Kale, Mihir, Ma, Min, Nicosia, Massimo, Rijhwani, Shruti, Riley, Parker, Sarr, Jean-Michel A., Wang, Xinyi, Wieting, John, Gupta, Nitish, Katanova, Anna, Kirov, Christo, Dickinson, Dana L., Roark, Brian, Samanta, Bidisha, Tao, Connie, Adelani, David I., Axelrod, Vera, Caswell, Isaac, Cherry, Colin, Garrette, Dan, Ingle, Reeve, Johnson, Melvin, Panteleev, Dmitry, and Talukdar, Partha
- Subjects
Computer Science - Computation and Language
- Abstract
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
- Published
- 2023
- Full Text
- View/download PDF
8. PaLM 2 Technical Report
- Author
Anil, Rohan, Dai, Andrew M., Firat, Orhan, Johnson, Melvin, Lepikhin, Dmitry, Passos, Alexandre, Shakeri, Siamak, Taropa, Emanuel, Bailey, Paige, Chen, Zhifeng, Chu, Eric, Clark, Jonathan H., Shafey, Laurent El, Huang, Yanping, Meier-Hellstern, Kathy, Mishra, Gaurav, Moreira, Erica, Omernick, Mark, Robinson, Kevin, Ruder, Sebastian, Tay, Yi, Xiao, Kefan, Xu, Yuanzhong, Zhang, Yujing, Abrego, Gustavo Hernandez, Ahn, Junwhan, Austin, Jacob, Barham, Paul, Botha, Jan, Bradbury, James, Brahma, Siddhartha, Brooks, Kevin, Catasta, Michele, Cheng, Yong, Cherry, Colin, Choquette-Choo, Christopher A., Chowdhery, Aakanksha, Crepy, Clément, Dave, Shachi, Dehghani, Mostafa, Dev, Sunipa, Devlin, Jacob, Díaz, Mark, Du, Nan, Dyer, Ethan, Feinberg, Vlad, Feng, Fangxiaoyu, Fienber, Vlad, Freitag, Markus, Garcia, Xavier, Gehrmann, Sebastian, Gonzalez, Lucas, Gur-Ari, Guy, Hand, Steven, Hashemi, Hadi, Hou, Le, Howland, Joshua, Hu, Andrea, Hui, Jeffrey, Hurwitz, Jeremy, Isard, Michael, Ittycheriah, Abe, Jagielski, Matthew, Jia, Wenhao, Kenealy, Kathleen, Krikun, Maxim, Kudugunta, Sneha, Lan, Chang, Lee, Katherine, Lee, Benjamin, Li, Eric, Li, Music, Li, Wei, Li, YaGuang, Li, Jian, Lim, Hyeontaek, Lin, Hanzhao, Liu, Zhongtao, Liu, Frederick, Maggioni, Marcello, Mahendru, Aroma, Maynez, Joshua, Misra, Vedant, Moussalem, Maysam, Nado, Zachary, Nham, John, Ni, Eric, Nystrom, Andrew, Parrish, Alicia, Pellat, Marie, Polacek, Martin, Polozov, Alex, Pope, Reiner, Qiao, Siyuan, Reif, Emily, Richter, Bryan, Riley, Parker, Ros, Alex Castro, Roy, Aurko, Saeta, Brennan, Samuel, Rajkumar, Shelby, Renee, Slone, Ambrose, Smilkov, Daniel, So, David R., Sohn, Daniel, Tokumine, Simon, Valter, Dasha, Vasudevan, Vijay, Vodrahalli, Kiran, Wang, Xuezhi, Wang, Pidong, Wang, Zirui, Wang, Tao, Wieting, John, Wu, Yuhuai, Xu, Kelvin, Xu, Yunhan, Xue, Linting, Yin, Pengcheng, Yu, Jiahui, Zhang, Qiao, Zheng, Steven, Zheng, Ce, Zhou, Weikang, Zhou, Denny, Petrov, Slav, and Wu, Yonghui
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
- Published
- 2023
9. Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability
- Author
Briakou, Eleftheria, Cherry, Colin, and Foster, George
- Subjects
Computer Science - Computation and Language
- Abstract
Large, multilingual language models exhibit surprisingly good zero- or few-shot machine translation capabilities, despite having never seen the intentionally-included translation examples provided to typical neural translation systems. We investigate the role of incidental bilingualism -- the unintentional consumption of bilingual signals, including translation examples -- in explaining the translation capabilities of large language models, taking the Pathways Language Model (PaLM) as a case study. We introduce a mixed-method approach to measure and understand incidental bilingualism at scale. We show that PaLM is exposed to over 30 million translation pairs across at least 44 languages. Furthermore, the amount of incidental bilingual content is highly correlated with the amount of monolingual in-language content for non-English languages. We relate incidental bilingual content to zero-shot prompts and show that it can be used to mine new prompts to improve PaLM's out-of-English zero-shot translation quality. Finally, in a series of small-scale ablations, we show that its presence has a substantial impact on translation capabilities, although this impact diminishes with model scale., Comment: Accepted at ACL 2023
- Published
- 2023
10. The unreasonable effectiveness of few-shot learning for machine translation
- Author
Garcia, Xavier, Bansal, Yamini, Cherry, Colin, Foster, George, Krikun, Maxim, Feng, Fangxiaoyu, Johnson, Melvin, and Firat, Orhan
- Subjects
Computer Science - Computation and Language
- Abstract
We demonstrate the potential of few-shot translation systems, trained with unpaired language data, for both high and low-resource language pairs. We show that with only 5 examples of high-quality translation data shown at inference, a transformer decoder-only model trained solely with self-supervised learning is able to match specialized supervised state-of-the-art models as well as more general commercial translation systems. In particular, we outperform the best performing system on the WMT'21 English - Chinese news translation task by only using five examples of English - Chinese parallel data at inference. Moreover, our approach in building these models does not necessitate joint multilingual training or back-translation, is conceptually simple and shows the potential to extend to the multilingual setting. Furthermore, the resulting models are two orders of magnitude smaller than state-of-the-art language models. We then analyze the factors which impact the performance of few-shot translation systems, and highlight that the quality of the few-shot demonstrations heavily determines the quality of the translations generated by our models. Finally, we show that the few-shot paradigm also provides a way to control certain attributes of the translation -- we show that we are able to control for regional varieties and formality using only five examples at inference, paving the way towards controllable machine translation systems.
- Published
- 2023
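For orientation, a k-shot translation prompt of the kind consumed by such decoder-only models typically looks like the sketch below. This is only an illustrative template with hypothetical names; the exact prompt format used in the paper may differ.

```python
def build_fewshot_prompt(examples, source_sentence,
                         src_lang="English", tgt_lang="Chinese"):
    """Assemble a simple k-shot translation prompt from (source, target) pairs."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")                      # blank line between demonstrations
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")              # the model continues from here
    return "\n".join(lines)

# Hypothetical usage with five demonstrations:
# prompt = build_fewshot_prompt(five_examples, "The weather is nice today.")
```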
11. Prompting PaLM for Translation: Assessing Strategies and Performance
- Author
Vilar, David, Freitag, Markus, Cherry, Colin, Luo, Jiaming, Ratnakar, Viresh, and Foster, George
- Subjects
Computer Science - Computation and Language
- Abstract
Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the pathways language model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM's MT output which reveals some interesting properties and prospects for future work., Comment: ACL 2023
- Published
- 2022
12. Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
- Author
Jia, Ye, Ding, Yifan, Bapna, Ankur, Cherry, Colin, Zhang, Yu, Conneau, Alexis, and Morioka, Nobuyuki
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. However, in practice, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is improved by +13.6 BLEU (or +113% relatively), as compared to the previous state-of-the-art trained without additional data. The improvements on low-resource languages are even more significant (+398% relatively on average). Our comparative studies suggest future research directions for S2ST and speech representation learning., Comment: Interspeech 2022
- Published
- 2022
13. XTREME-S: Evaluating Cross-lingual Speech Representations
- Author
Conneau, Alexis, Bapna, Ankur, Zhang, Yu, Ma, Min, von Platen, Patrick, Lozhkov, Anton, Cherry, Colin, Jia, Ye, Rivera, Clara, Kale, Mihir, Van Esch, Daan, Axelrod, Vera, Khanuja, Simran, Clark, Jonathan H., Firat, Orhan, Auli, Michael, Ruder, Sebastian, Riesa, Jason, and Johnson, Melvin
- Subjects
Computer Science - Computation and Language
- Abstract
We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines using XLS-R and mSLAM on all downstream tasks. We motivate the design choices and detail how to use the benchmark. Datasets and fine-tuning scripts are made easily accessible at https://hf.co/datasets/google/xtreme_s., Comment: Minor fix: language code for Filipino (Tagalog), "tg" -> "tl"
- Published
- 2022
14. Data Scaling Laws in NMT: The Effect of Noise and Architecture
- Author
Bansal, Yamini, Ghorbani, Behrooz, Garg, Ankush, Zhang, Biao, Krikun, Maxim, Cherry, Colin, Neyshabur, Behnam, and Firat, Orhan
- Subjects
Computer Science - Machine Learning, Computer Science - Computation and Language
- Abstract
In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. In particular, we change the following: (1) Architecture and task setup: we compare to a transformer-LSTM hybrid and a decoder-only transformer with a language modeling loss; (2) Noise level in the training distribution: we experiment with filtering and adding iid synthetic noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data can significantly degrade the scaling exponent.
- Published
- 2022
15. mSLAM: Massively multilingual joint pre-training for speech and text
- Author
Bapna, Ankur, Cherry, Colin, Zhang, Yu, Jia, Ye, Johnson, Melvin, Cheng, Yong, Khanuja, Simran, Riesa, Jason, and Conneau, Alexis
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.
- Published
- 2022
16. Can Multilinguality benefit Non-autoregressive Machine Translation?
- Author
Agrawal, Sweta, Kreutzer, Julia, and Cherry, Colin
- Subjects
Computer Science - Computation and Language
- Abstract
Non-autoregressive (NAR) machine translation has recently achieved significant improvements, and now outperforms autoregressive (AR) models on some benchmarks, providing an efficient alternative to AR inference. However, while AR translation is often implemented using multilingual models that benefit from transfer between languages and from improved serving efficiency, multilingual NAR models remain relatively unexplored. Taking Connectionist Temporal Classification (CTC) as an example NAR model and Imputer as a semi-NAR model, we present a comprehensive empirical study of multilingual NAR. We test its capabilities with respect to positive transfer between related languages and negative transfer under capacity constraints. As NAR models require distilled training sets, we carefully study the impact of bilingual versus multilingual teachers. Finally, we fit a scaling law for multilingual NAR, which quantifies its performance relative to the AR model as model scale increases.
- Published
- 2021
17. Scaling Laws for Neural Machine Translation
- Author
Ghorbani, Behrooz, Firat, Orhan, Freitag, Markus, Bapna, Ankur, Krikun, Maxim, Garcia, Xavier, Chelba, Ciprian, and Cherry, Colin
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
- Abstract
We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we show that the total number of parameters alone is not sufficient for such purposes. (ii) We observe different power law exponents when scaling the decoder vs scaling the encoder, and provide recommendations for optimal allocation of encoder/decoder capacity based on this observation. (iii) We also report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (either via machine generated or human translated text). We observe that natural text on the target side enjoys scaling, which manifests as successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets which were originally translated from target language to source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from source language to target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study., Comment: 31 pages, 23 figures
- Published
- 2021
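The bivariate scaling law referenced in the abstract above is, roughly, of the following form (an illustrative restatement; the exact formula, reference sizes and fitted exponents are in the paper):

```latex
L(N_e, N_d) \;\approx\; \alpha
  \left(\frac{\bar{N}_e}{N_e}\right)^{p_e}
  \left(\frac{\bar{N}_d}{N_d}\right)^{p_d} + L_{\infty}
```

with N_e and N_d the encoder and decoder parameter counts, which is why the total parameter count alone is not sufficient to predict the loss.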
18. Assessing Reference-Free Peer Evaluation for Machine Translation
- Author
Agrawal, Sweta, Foster, George, Freitag, Markus, and Cherry, Colin
- Subjects
Computer Science - Computation and Language
- Abstract
Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has been recently shown that the probabilities given by a large, multilingual model can achieve state of the art results when used as a reference-free metric. We experiment with various modifications to this model and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities., Comment: NAACL 2021
- Published
- 2021
19. Sentence Boundary Augmentation For Neural Machine Translation Robustness
- Author
Li, Daniel, I, Te, Arivazhagan, Naveen, Cherry, Colin, and Padfield, Dirk
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Neural Machine Translation (NMT) models have demonstrated strong state of the art performance on translation tasks where well-formed training and evaluation data are provided, but they remain sensitive to inputs that include errors of various types. Specifically, in the context of long-form speech translation systems, where the input transcripts come from Automatic Speech Recognition (ASR), the NMT models have to handle errors including phoneme substitutions, grammatical structure, and sentence boundaries, all of which pose challenges to NMT robustness. Through in-depth error analysis, we show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness., Comment: 5 pages, 4 figures
- Published
- 2020
20. Human-Paraphrased References Improve Neural Machine Translation
- Author
Freitag, Markus, Foster, George, Grangier, David, and Cherry, Colin
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
Automatic evaluation comparing candidate translations to human-generated paraphrases of reference translations has recently been proposed by Freitag et al. When used in place of original references, the paraphrased versions produce metric scores that correlate better with human judgment. This effect holds for a variety of different automatic metrics, and tends to favor natural formulations over more literal (translationese) ones. In this paper we compare the results of performing end-to-end system development using standard and paraphrased references. With state-of-the-art English-German NMT components, we show that tuning to paraphrased references produces a system that is significantly better according to human judgment, but 5 BLEU points worse when tested on standard references. Our work confirms the finding that paraphrased references yield metric scores that correlate better with human judgment, and demonstrates for the first time that using these scores for system development can lead to significant improvements., Comment: Accepted at WMT 2020
- Published
- 2020
21. Inference Strategies for Machine Translation with Conditional Masking
- Author
Kreutzer, Julia, Foster, George, and Cherry, Colin
- Subjects
Computer Science - Computation and Language
- Abstract
Conditional masked language model (CMLM) training has proven successful for non-autoregressive and semi-autoregressive sequence generation tasks, such as machine translation. Given a trained CMLM, however, it is not clear what the best inference strategy is. We formulate masked inference as a factorization of conditional probabilities of partial sequences, show that this does not harm performance, and investigate a number of simple heuristics motivated by this perspective. We identify a thresholding strategy that has advantages over the standard "mask-predict" algorithm, and provide analyses of its behavior on machine translation tasks., Comment: EMNLP 2020, updated Fig 3
- Published
- 2020
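A minimal sketch of thresholded mask-predict inference of the kind discussed in the entry above: every masked position is re-predicted each iteration, and positions whose confidence exceeds a threshold are committed. The `cmlm_predict` callable and all names are hypothetical placeholders, not the paper's implementation.

```python
MASK = "<mask>"

def threshold_mask_predict(cmlm_predict, length, threshold=0.9, max_iters=10):
    """Iteratively fill a fully masked target using a CMLM.

    `cmlm_predict(tokens)` is assumed to return, for every position, the most
    probable token and its probability given the current partial target.
    """
    tokens = [MASK] * length
    for _ in range(max_iters):
        preds, probs = cmlm_predict(tokens)   # two lists of length `length`
        for i in range(length):
            if tokens[i] == MASK and probs[i] >= threshold:
                tokens[i] = preds[i]          # commit confident positions only
        if MASK not in tokens:
            break
    # Fallback: fill any remaining masks with the current best predictions.
    preds, _ = cmlm_predict(tokens)
    return [p if t == MASK else t for t, p in zip(tokens, preds)]
```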
22. Re-translation versus Streaming for Simultaneous Translation
- Author
Arivazhagan, Naveen, Cherry, Colin, Macherey, Wolfgang, and Foster, George
- Subjects
Computer Science - Computation and Language
- Abstract
There has been great progress in improving streaming machine translation, a simultaneous paradigm where the system appends to a growing hypothesis as more source content becomes available. We study a related problem in which revisions to the hypothesis beyond strictly appending words are permitted. This is suitable for applications such as live captioning an audio feed. In this setting, we compare custom streaming approaches to re-translation, a straightforward strategy where each new source token triggers a distinct translation from scratch. We find re-translation to be as good or better than state-of-the-art streaming systems, even when operating under constraints that allow very few revisions. We attribute much of this success to a previously proposed data-augmentation technique that adds prefix-pairs to the training data, which alongside wait-k inference forms a strong baseline for streaming translation. We also highlight re-translation's ability to wrap arbitrarily powerful MT systems with an experiment showing large improvements from an upgrade to its base model., Comment: IWSLT 2020
- Published
- 2020
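The prefix-pair augmentation credited in the entry above for much of re-translation's strength can be sketched as follows: each training pair also contributes proportional (source prefix, target prefix) pairs. This is a simplified illustration; the sampling scheme in the cited work may differ.

```python
import random

def make_prefix_pairs(src_tokens, tgt_tokens, num_prefixes=2, seed=None):
    """Generate proportional (source prefix, target prefix) training pairs."""
    rng = random.Random(seed)
    pairs = [(src_tokens, tgt_tokens)]        # always keep the full sentence pair
    for _ in range(num_prefixes):
        frac = rng.uniform(0.2, 0.9)          # proportional truncation point
        s = max(1, round(frac * len(src_tokens)))
        t = max(1, round(frac * len(tgt_tokens)))
        pairs.append((src_tokens[:s], tgt_tokens[:t]))
    return pairs

# Hypothetical usage:
# make_prefix_pairs("wie geht es dir heute".split(), "how are you today".split())
```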
23. Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation
- Author
Arivazhagan, Naveen, Cherry, Colin, I, Te, Macherey, Wolfgang, Baljekar, Pallavi, and Foster, George
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them., Comment: ICASSP 2020
- Published
- 2019
24. Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
- Author
Arivazhagan, Naveen, Bapna, Ankur, Firat, Orhan, Lepikhin, Dmitry, Johnson, Melvin, Krikun, Maxim, Chen, Mia Xu, Cao, Yuan, Foster, George, Cherry, Colin, Macherey, Wolfgang, Chen, Zhifeng, and Wu, Yonghui
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to achieving quality and practicality in universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research.
- Published
- 2019
25. Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
- Author
Arivazhagan, Naveen, Cherry, Colin, Macherey, Wolfgang, Chiu, Chung-Cheng, Yavuz, Semih, Pang, Ruoming, Li, Wei, and Raffel, Colin
- Subjects
Computer Science - Computation and Language
- Abstract
Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic Infinite Lookback (MILk) attention, which maintains both a hard, monotonic attention head to schedule the reading of the source sentence, and a soft attention head that extends from the monotonic head back to the beginning of the source. We show that MILk's adaptive schedule allows it to arrive at latency-quality trade-offs that are favorable to those of a recently proposed wait-k strategy for many latency values., Comment: Accepted for publication at ACL 2019
- Published
- 2019
26. Thinking Slow about Latency Evaluation for Simultaneous Machine Translation
- Author
Cherry, Colin and Foster, George
- Subjects
Computer Science - Computation and Language
- Abstract
Simultaneous machine translation attempts to translate a source sentence before it is finished being spoken, with applications to translation of spoken language for live streaming and conversation. Since simultaneous systems trade quality to reduce latency, having an effective and interpretable latency metric is crucial. We introduce a variant of the recently proposed Average Lagging (AL) metric, which we call Differentiable Average Lagging (DAL). It distinguishes itself by being differentiable and internally consistent to its underlying mathematical model.
- Published
- 2019
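For reference, the sketch below computes Average Lagging (AL) and its differentiable variant (DAL) from a read/write schedule, following the commonly used definitions; details of the formulation in the paper may differ slightly.

```python
def average_lagging(g, src_len, tgt_len):
    """AL: mean lag of the system behind an ideal wait-0 translator.

    g[t] = number of source tokens read before writing target token t
    (g is a 0-indexed list of length tgt_len).
    """
    gamma = tgt_len / src_len                  # target-to-source length ratio
    # Sum only up to the first target token written after the full source is read.
    tau = next((t + 1 for t, gt in enumerate(g) if gt >= src_len), tgt_len)
    return sum(g[t] - t / gamma for t in range(tau)) / tau

def differentiable_average_lagging(g, src_len, tgt_len):
    """DAL: like AL, but enforces a minimum per-token delay so lags cannot
    be 'paid back', keeping the metric internally consistent."""
    gamma = tgt_len / src_len
    g_prime, prev = [], None
    for gt in g:
        cur = gt if prev is None else max(gt, prev + 1.0 / gamma)
        g_prime.append(cur)
        prev = cur
    return sum(g_prime[t] - t / gamma for t in range(tgt_len)) / tgt_len

# Example: a wait-3 schedule for a 6-token source and 6-token target.
# g = [3, 4, 5, 6, 6, 6]
# print(average_lagging(g, 6, 6), differentiable_average_lagging(g, 6, 6))
```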
27. Reinforcement Learning based Curriculum Optimization for Neural Machine Translation
- Author
Kumar, Gaurav, Foster, George, Cherry, Colin, and Krikun, Maxim
- Subjects
Computer Science - Computation and Language
- Abstract
We consider the problem of making efficient use of heterogeneous training data in neural machine translation (NMT). Specifically, given a training dataset with a sentence-level feature such as noise, we seek an optimal curriculum, or order for presenting examples to the system during training. Our curriculum framework allows examples to appear an arbitrary number of times, and thus generalizes data weighting, filtering, and fine-tuning schemes. Rather than relying on prior knowledge to design a curriculum, we use reinforcement learning to learn one automatically, jointly with the NMT system, in the course of a single training run. We show that this approach can beat uniform and filtering baselines on Paracrawl and WMT English-to-French datasets by up to +3.4 BLEU, and match the performance of a hand-designed, state-of-the-art curriculum., Comment: NAACL 2019 short paper. Reviewer comments not yet addressed
- Published
- 2019
28. Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
- Author
Shen, Jonathan, Nguyen, Patrick, Wu, Yonghui, Chen, Zhifeng, Chen, Mia X., Jia, Ye, Kannan, Anjuli, Sainath, Tara, Cao, Yuan, Chiu, Chung-Cheng, He, Yanzhang, Chorowski, Jan, Hinsu, Smit, Laurenzo, Stella, Qin, James, Firat, Orhan, Macherey, Wolfgang, Gupta, Suyog, Bapna, Ankur, Zhang, Shuyuan, Pang, Ruoming, Weiss, Ron J., Prabhavalkar, Rohit, Liang, Qiao, Jacob, Benoit, Liang, Bowen, Lee, HyoukJoong, Chelba, Ciprian, Jean, Sébastien, Li, Bo, Johnson, Melvin, Anil, Rohan, Tibrewal, Rajat, Liu, Xiaobing, Eriguchi, Akiko, Jaitly, Navdeep, Ari, Naveen, Cherry, Colin, Haghani, Parisa, Good, Otavio, Cheng, Youlong, Alvarez, Raziel, Caswell, Isaac, Hsu, Wei-Ning, Yang, Zongheng, Wang, Kuan-Chieh, Gonina, Ekaterina, Tomanek, Katrin, Vanik, Ben, Wu, Zelin, Jones, Llion, Schuster, Mike, Huang, Yanping, Chen, Dehao, Irie, Kazuki, Foster, George, Richardson, John, Macherey, Klaus, Bruguier, Antoine, Zen, Heiga, Raffel, Colin, Kumar, Shankar, Rao, Kanishka, Rybach, David, Murray, Matthew, Peddinti, Vijayaditya, Krikun, Maxim, Bacchiani, Michiel A. U., Jablin, Thomas B., Suderman, Rob, Williams, Ian, Lee, Benjamin, Bhatia, Deepti, Carlson, Justin, Yavuz, Semih, Zhang, Yu, McGraw, Ian, Galkin, Max, Ge, Qi, Pundak, Golan, Whipkey, Chad, Wang, Todd, Alon, Uri, Lepikhin, Dmitry, Tian, Ye, Sabour, Sara, Chan, William, Toshniwal, Shubham, Liao, Baohua, Nirschl, Michael, and Rondon, Pat
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
- Published
- 2019
29. Shaping the Narrative Arc: An Information-Theoretic Approach to Collaborative Dialogue
- Author
Mathewson, Kory W., Castro, Pablo Samuel, Cherry, Colin, Foster, George, and Bellemare, Marc G.
- Subjects
Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
We consider the problem of designing an artificial agent capable of interacting with humans in collaborative dialogue to produce creative, engaging narratives. In this task, the goal is to establish universe details, and to collaborate on an interesting story in that universe, through a series of natural dialogue exchanges. Our model can augment any probabilistic conversational agent by allowing it to reason about universe information established and what potential next utterances might reveal. Ideally, with each utterance, agents would reveal just enough information to add specificity and reduce ambiguity without limiting the conversation. We empirically show that our model allows control over the rate at which the agent reveals information and that doing so significantly improves accuracy in predicting the next line of dialogues from movies. We close with a case-study with four professional theatre performers, who preferred interactions with our model-augmented agent over an unaugmented agent., Comment: 20 pages, 9 figures
- Published
- 2019
30. Efficient Sequence Labeling with Actor-Critic Training
- Author
Najafi, Saeed, Cherry, Colin, and Kondrak, Grzegorz
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Statistics - Machine Learning
- Abstract
Neural approaches to sequence labeling often use a Conditional Random Field (CRF) to model their output dependencies, while Recurrent Neural Networks (RNN) are used for the same purpose in other tasks. We set out to establish RNNs as an attractive alternative to CRFs for sequence labeling. To do so, we address one of the RNN's most prominent shortcomings, the fact that it is not exposed to its own errors with the maximum-likelihood training. We frame the prediction of the output sequence as a sequential decision-making process, where we train the network with an adjusted actor-critic algorithm (AC-RNN). We comprehensively compare this strategy with maximum-likelihood training for both RNNs and CRFs on three structured-output tasks. The proposed AC-RNN efficiently matches the performance of the CRF on NER and CCG tagging, and outperforms it on Machine Transliteration. We also show that our training strategy is significantly better than other techniques for addressing RNN's exposure bias, such as Scheduled Sampling, and Self-Critical policy training.
- Published
- 2018
31. Revisiting Character-Based Neural Machine Translation with Capacity and Compression
- Author
Cherry, Colin, Foster, George, Bapna, Ankur, Firat, Orhan, and Macherey, Wolfgang
- Subjects
Computer Science - Computation and Language
- Abstract
Translating characters instead of words or word-fragments has the potential to simplify the processing pipeline for neural machine translation (NMT), and improve results by eliminating hyper-parameters and manual feature engineering. However, it results in longer sequences in which each symbol contains less information, creating both modeling and computational challenges. In this paper, we show that the modeling problem can be solved by standard sequence-to-sequence architectures of sufficient depth, and that deep models operating at the character level outperform identical models operating over word fragments. This result implies that alternative architectures for handling character input are better viewed as methods for reducing computation time than as improved ways of modeling longer sequences. From this perspective, we evaluate several techniques for character-level NMT, verify that they do not match the performance of our deep character baseline model, and evaluate the performance versus computation time tradeoffs they offer. Within this framework, we also perform the first evaluation for NMT of conditional computation over time, in which the model learns which timesteps can be skipped, rather than having them be dictated by a fixed schedule specified before training begins., Comment: To appear at EMNLP 2018
- Published
- 2018
32. A Challenge Set Approach to Evaluating Machine Translation
- Author
Isabelle, Pierre, Cherry, Colin, and Foster, George
- Subjects
Computer Science - Computation and Language
- Abstract
Neural machine translation represents an exciting leap forward in translation quality. But what longstanding weaknesses does it resolve, and which remain? We address these questions with a challenge set approach to translation evaluation and error analysis. A challenge set consists of a small set of sentences, each hand-designed to probe a system's capacity to bridge a particular structural divergence between languages. To exemplify this approach, we present an English-French challenge set, and use it to analyze phrase-based and neural systems. The resulting analysis provides not only a more fine-grained picture of the strengths of neural systems, but also insight into which linguistic phenomena remain out of reach., Comment: EMNLP 2017. 28 pages, including appendix. Machine readable data included in a separate file. This version corrects typos in the challenge set
- Published
- 2017
33. End-to-End Multi-View Networks for Text Classification
- Author
Guo, Hongyu, Cherry, Colin, and Su, Jiang
- Subjects
Computer Science - Computation and Language, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing
- Abstract
We propose a multi-view network for text classification. Our method automatically creates various views of its input text, each taking the form of soft attention weights that distribute the classifier's focus among a set of base features. For a bag-of-words representation, each view focuses on a different subset of the text's words. Aggregating many such views results in a more discriminative and robust representation. Through a novel architecture that both stacks and concatenates views, we produce a network that emphasizes both depth and width, allowing training to converge quickly. Using our multi-view architecture, we establish new state-of-the-art accuracies on two benchmark tasks., Comment: 6 pages
- Published
- 2017
34. Efficient Sequence Labeling with Actor-Critic Training
- Author
Najafi, Saeed, Cherry, Colin, Kondrak, Grzegorz, Meurs, Marie-Jean, editor, and Rudzicz, Frank, editor
- Published
- 2019
- Full Text
- View/download PDF
35. To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation
- Author
Luo, Jiaming, Cherry, Colin, and Foster, George
- Published
- 2024
- Full Text
- View/download PDF
36. Efficient Sequence Labeling with Actor-Critic Training
- Author
Najafi, Saeed, Cherry, Colin, and Kondrak, Grzegorz
- Published
- 2019
- Full Text
- View/download PDF
37. Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability
- Author
Briakou, Eleftheria, Cherry, Colin, and Foster, George
- Published
- 2023
- Full Text
- View/download PDF
38. Prompting PaLM for Translation: Assessing Strategies and Performance
- Author
Vilar, David, Freitag, Markus, Cherry, Colin, Luo, Jiaming, Ratnakar, Viresh, and Foster, George
- Published
- 2023
- Full Text
- View/download PDF
39. XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
- Author
Ruder, Sebastian, Clark, Jonathan, Gutkin, Alexander, Kale, Mihir, Ma, Min, Nicosia, Massimo, Rijhwani, Shruti, Riley, Parker, Sarr, Jean-Michel, Wang, Xinyi, Wieting, John, Gupta, Nitish, Katanova, Anna, Kirov, Christo, Dickinson, Dana, Roark, Brian, Samanta, Bidisha, Tao, Connie, Adelani, David, Axelrod, Vera, Caswell, Isaac, Cherry, Colin, Garrette, Dan, Ingle, Reeve, Johnson, Melvin, Panteleev, Dmitry, and Talukdar, Partha
- Published
- 2023
- Full Text
- View/download PDF
40. Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
- Author
Jia, Ye, Ding, Yifan, Bapna, Ankur, Cherry, Colin, Zhang, Yu, Conneau, Alexis, and Morioka, Nobu
- Published
- 2022
- Full Text
- View/download PDF
41. XTREME-S: Evaluating Cross-lingual Speech Representations
- Author
Conneau, Alexis, Bapna, Ankur, Zhang, Yu, Ma, Min, von Platen, Patrick, Lozhkov, Anton, Cherry, Colin, Jia, Ye, Rivera, Clara, Kale, Mihir, van Esch, Daan, Axelrod, Vera, Khanuja, Simran, Clark, Jonathan, Firat, Orhan, Auli, Michael, Ruder, Sebastian, Riesa, Jason, and Johnson, Melvin
- Published
- 2022
- Full Text
- View/download PDF
42. Toward Versatility of Multi-Robot Systems
- Author
Cherry, Colin, Zhang, Hong, Parker, Lynne E., editor, Schneider, Frank E., editor, and Schultz, Alan C., editor
- Published
- 2005
- Full Text
- View/download PDF
43. Detecting concept relations in clinical text: Insights from a state-of-the-art model
- Author
Zhu, Xiaodan, Cherry, Colin, Kiritchenko, Svetlana, Martin, Joel, and de Bruijn, Berry
- Published
- 2013
- Full Text
- View/download PDF
44. A Natural Diet: Towards Improving Naturalness of Machine Translation Output
- Author
Freitag, Markus, Vilar, David, Grangier, David, Cherry, Colin, and Foster, George
- Published
- 2022
- Full Text
- View/download PDF
45. Subtitle Translation as Markup Translation
- Author
Cherry, Colin, Arivazhagan, Naveen, Padfield, Dirk, and Krikun, Maxim
- Published
- 2021
- Full Text
- View/download PDF
46. Sentence Boundary Augmentation for Neural Machine Translation Robustness
- Author
Li, Daniel, I, Te, Arivazhagan, Naveen, Cherry, Colin, and Padfield, Dirk
- Published
- 2021
- Full Text
- View/download PDF
47. Assessing Reference-Free Peer Evaluation for Machine Translation
- Author
Agrawal, Sweta, Foster, George, Freitag, Markus, and Cherry, Colin
- Published
- 2021
- Full Text
- View/download PDF
48. Inverted Projection for Robust Speech Translation
- Author
Padfield, Dirk and Cherry, Colin
- Published
- 2021
- Full Text
- View/download PDF
49. À la Recherche du Temps Perdu: extracting temporal relations from medical text in the 2012 i2b2 NLP challenge
- Author
Cherry, Colin, Zhu, Xiaodan, Martin, Joel, and de Bruijn, Berry
- Published
- 2013
- Full Text
- View/download PDF
50. Biomedical named entity recognition using discriminative training
- Author
Jiampojamarn, Sittichai, Kondrak, Grzegorz, and Cherry, Colin
- Published
- 2009
- Full Text
- View/download PDF