1. Design and construction of an openly available Urdu web corpus.
- Author
- Jehangir, Humaira and Hardie, Andrew
- Subjects
- ONLINE chat, CORPORA, INTELLECTUAL property, METADATA
- Abstract
- Urdu corpus linguistics is in its infancy, partly because the field lacks large, openly and freely accessible corpora. General-purpose Urdu corpora created to date are unsuitable as shared reference data for the field due to barriers of cost or copyright. The novel Lancaster Urdu Web Corpus (LUWC) is designed to fill this gap. It encompasses data from three news websites and an online chat forum. The corpus contains 24 million tokens and is part-of-speech (POS) tagged. To overcome problems with distributing a corpus whose texts' intellectual property belongs to other parties, the LUWC is available through a CQPweb server, which does not allow access to the full underlying data. However, the accessibility of source URLs as text-level metadata gives users a means by which to see the full original context. In spite of issues of balance/representativeness, the LUWC can fulfil the role of a shared reference point for Urdu corpus analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2024