Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser

Authors :
Toqeer Ehsan
Sarmad Hussain
Source :
Language Resources and Evaluation. 55:287-326
Publication Year :
2020
Publisher :
Springer Science and Business Media LLC, 2020.

Abstract

A number of natural language processing tools for Urdu have been developed in the past few years for word segmentation, part-of-speech (POS) tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB, and a statistical parser. The treebank has been annotated with phrase structure annotation. POS tagging has been performed semi-automatically: an existing tagger assigned the tags, and annotators manually corrected the incorrect ones. The syntactic annotation has been performed in the Penn Treebank style to mark phrases. The annotation scheme also adds functional labels for grammatical roles. Currently, the treebank contains 7,854 annotated sentences and 148,575 tokens. Completeness and correctness of the syntactic labels have been checked automatically after manual annotation. To ensure the annotation consistency of the resource, a grammar-based evaluation and an automatic consistency checking tool have been used to detect linguistically implausible constituents. The inter-annotator agreement is greater than 90%. We have developed a bidirectional long short-term memory (BiLSTM) based parser and a POS tagger, both trained on the final version of the treebank. We have improved our results by training the word embeddings on a large Urdu text corpus. Our parser produced an F-score of 88.1% and the POS tagger achieved an accuracy of 96.3%.

Details

ISSN :
1574-0218 and 1574-020X
Volume :
55
Database :
OpenAIRE
Journal :
Language Resources and Evaluation
Accession number :
edsair.doi...........af7d2827885c858c692779a7a90bb2c1