1. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
- Author
Élise Prieur-Gaston, Aurélie Névéol, and Antonio Jimeno Yepes
Affiliations: National Library of Medicine (NLM), National Institutes of Health [Bethesda] (NIH)-National Center for Biotechnology Information (NCBI), National ICT Australia [Sydney] (NICTA), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU), Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), and Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistical machine translation; 020205 medical informatics; Machine translation; Multilingual corpus generation; Computer science; MEDLINE; Automatic translation; 02 engineering and technology; Biochemistry; Domain (software engineering); Structural Biology; 0202 electrical engineering, electronic engineering, information engineering; Quality (business); Official language; Molecular Biology; Biomedical domain; Publishing; Information retrieval; Models, Statistical; Applied Mathematics; Statistical model; Linguistics; Translating; Computer Science Applications; Biomedical text; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 020201 artificial intelligence & image processing; Artificial intelligence; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; Natural language processing; Research Article
- Abstract
Background: Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain.
Results: We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish), we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text.
Conclusions: We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.
- Published
- 2013