Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser
- Source :
- Language Resources and Evaluation. 55:287-326
- Publication Year :
- 2020
- Publisher :
- Springer Science and Business Media LLC, 2020.
Abstract
- Several natural language processing tools for Urdu have been developed in the past few years, covering word segmentation, part-of-speech (POS) tagging, chunking, named entity recognition, and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB, and a statistical parser. The treebank is annotated with phrase structure. POS tagging was performed semi-automatically: an existing tagger assigned the tags, and annotators manually corrected the incorrect ones. The syntactic annotation follows the Penn Treebank style to mark phrases, and the annotation scheme also adds functional labels for grammatical roles. Currently, the treebank contains 7,854 annotated sentences and 148,575 tokens. The completeness and correctness of the syntactic labels were checked automatically after manual annotation. To ensure annotation consistency, a grammar-based evaluation and an automatic consistency-checking tool were used to detect linguistically implausible constituents. The inter-annotator agreement is greater than 90%. We have developed a bidirectional long short-term memory (BiLSTM) based parser and a POS tagger, both trained on the final version of the treebank. We improved our results by training the word embeddings on a large Urdu text corpus. Our parser achieved an F-score of 88.1%, and the POS tagger reached an accuracy of 96.3%.
- Subjects :
- Text corpus
050101 languages & linguistics
Linguistics and Language
Computer science
Treebank
02 engineering and technology
Library and Information Sciences
Language and Linguistics
Education
Annotation
Named-entity recognition
Chunking (psychology)
0202 electrical engineering, electronic engineering, information engineering
0501 psychology and cognitive sciences
Parsing
Grammar
Part-of-speech tagging
05 social sciences
Text segmentation
Phrase structure rules
Syntax
020201 artificial intelligence & image processing
Urdu
Artificial intelligence
Computational linguistics
Natural language processing
Details
- ISSN :
- 1574-0218 and 1574-020X
- Volume :
- 55
- Database :
- OpenAIRE
- Journal :
- Language Resources and Evaluation
- Accession number :
- edsair.doi...........af7d2827885c858c692779a7a90bb2c1