1. Compilation of specialized comparable corpora in French and Japanese
- Author
- Emmanuel Morin, Lorraine Goeuriot, and Béatrice Daille
- Subjects
Typology, Shallow parsing, Information retrieval, Computer science, Comparability, Domain (software engineering), Scientific domain, Quality (business), Artificial intelligence, IBM, Popular science, Natural language processing
- Abstract
This paper presents the development of a tool for compiling specialized comparable corpora whose quality should be close to that of a manually compiled corpus. Comparability is defined at three levels: domain, topic, and type of discourse. Domain and topic can be filtered through the keywords used in web search, but detecting the type of discourse requires a broader linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in a scientific domain (scientific and popular science) in French and Japanese. First, a contrastive stylistic analysis of the two types of discourse is carried out in both languages. This analysis yields a reusable, generic, and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. The results are good, with an average precision of 80% and an average recall of 70%, which demonstrates the effectiveness of the typology. The classification tool is then integrated into a corpus compilation tool, a text-collection processing chain built on the IBM UIMA system. Starting from two collections of specialized web documents, one in French and one in Japanese, the tool creates the corresponding corpus.
- Published
- 2009