1. Cross-language Sentence Selection via Data Augmentation and Rationale Training
- Author
Petra Galuščáková, Douglas W. Oard, Yanda Chen, Suraj Nair, Kathleen R. McKeown, Rui Zhang, and Chris Kedzie
- Subjects
FOS: Computer and information sciences, Phrase, Computer Science - Computation and Language, Machine translation, Computer science, business.industry, computer.software_genre, Variety (linguistics), Computer Science - Information Retrieval, Selection (linguistics), Embedding, Relevance (information retrieval), Artificial intelligence, business, Computation and Language (cs.CL), computer, Information Retrieval (cs.IR), Word (computer architecture), Sentence, Natural language processing
- Abstract
This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data. Moreover, when a rationale training secondary objective is applied to encourage the model to match word alignment hints from a phrase-based statistical machine translation model, consistent improvements are seen across three language pairs (English-Somali, English-Swahili and English-Tagalog) over a variety of state-of-the-art baselines. (ACL 2021 main conference)
- Published
- 2021