1. Analysis of Experiments on Statistical and Neural Parsing for a Morphologically Rich and Free Word Order Language Urdu
- Author
- Toqeer Ehsan and Sarmad Hussain
- Subjects
Phrase, General Computer Science, Computer science, Treebank, free word-order, morphological-richness, Rule-based machine translation, General Materials Science, Parsing, Grammar, Lemmatisation, General Engineering, Part of speech, Urdu, Syntax, statistical parsing, lcsh:Electrical engineering. Electronics. Nuclear engineering, Artificial intelligence, lcsh:TK1-9971, Word (computer architecture), Natural language processing, Natural language, Word order
- Abstract
This article presents an analysis of experiments with statistical and neural parsing techniques for Urdu, a widely spoken South Asian language. We demonstrate state-of-the-art constituency parsing results for an Urdu treebank. Urdu is morphologically rich and is characterized by free word order. Language representation (e.g., input type, lemmatization, word clusters), the part-of-speech tag set, phrase labels, and the size of the training corpus are crucial for parsing such languages. In this article, probabilistic context-free grammars, data-oriented parsing, and recursive neural network based models are evaluated with several linguistic features that improve the parsing results. The features include syntactic sub-categorization of POS tags, empirically learned horizontal and vertical markovizations, and lexical head words. These features enable dependency information for case markers and add phrasal and lexical context to the parse trees. With gold POS tags in the test set, the data-oriented parsing and recursive neural network models give an f-score of 87.1; on textual input, they achieve f-scores of 83.4 and 84.2, respectively. To overcome data sparsity caused by morphological richness, lemmatization and unsupervised word clustering have been performed. A treebank should cover the most probable word orders of the language so that models can learn the various orders accurately. To analyze the word-order coverage of the treebank and the learning capability of different parsers, a test set conditioned on different word orders has been prepared. This test set is evaluated with the best-performing parsing models: with gold POS tags, f-scores are above 90, and on textual input, the average f-score is 87.6.
- Published
- 2019
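
The abstract lists vertical markovization among the features that add phrasal context to parse trees. The following is a minimal sketch of the general idea (parent annotation, i.e. vertical order 2), not the authors' implementation. The tuple-based tree format, the function names `parent_annotate` and `to_bracket`, and the toy labels and words are all assumptions made for illustration; they are not taken from the Urdu treebank.

```python
# Hypothetical tree format: (label, children); a leaf child is a plain string (the word).

def parent_annotate(tree, parent=None):
    """Vertical markovization of order 2: append the parent label to each phrasal label."""
    label, children = tree
    # Preterminals (a POS tag over a single word) are left unannotated in this sketch.
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children)
    new_label = label if parent is None else f"{label}^{parent}"
    # Children see the ORIGINAL label of this node as their parent.
    return (new_label, [parent_annotate(child, label) for child in children])


def to_bracket(tree):
    """Render a tree in Penn-style bracket notation."""
    label, children = tree
    parts = [c if isinstance(c, str) else to_bracket(c) for c in children]
    return "(" + label + " " + " ".join(parts) + ")"


# Toy tree with placeholder labels and words (not real treebank data).
toy = ("S", [("NP", [("NN", ["w1"]), ("PSP", ["w2"])]),
             ("VP", [("VB", ["w3"])])])

annotated = parent_annotate(toy)
print(to_bracket(annotated))  # (S (NP^S (NN w1) (PSP w2)) (VP^S (VB w3)))
```

After annotation, a grammar read off the tree distinguishes, for example, an NP directly under S from an NP under another phrase, which is the kind of extra phrasal context the abstract refers to.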