A comparative evaluation of modern English corpus grammatical annotation schemes
- Source :
- ICAME Journal: International Computer Archive of Modern and Medieval English Journal
- Publication Year :
- 2000
- Publisher :
- The HIT Centre - Humanities Information Technologies Research Programme, 2000.
Abstract
- Many English Corpus Linguistics projects reported in ICAME Journal and elsewhere involve grammatical analysis or tagging of English texts (e.g. Atwell 1983, Leech et al. 1983, Booth 1985, Owen 1987, Souter 1989a, O’Donoghue 1991, Belmore 1991, Kytö and Voutilainen 1995, Aarts 1996, Qiao and Huang 1998). Each new project has to review existing tagging schemes and decide which to adopt and/or adapt. The AMALGAM project can help in this decision by providing descriptions and analyses of a range of tagging schemes, and an internet-based service for researchers to try out the range of tagging schemes on their own data.

  The project AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) explored a range of Part-of-Speech tagsets and phrase structure parsing schemes used in modern English corpus-based research. The PoS-tagging schemes include: Brown (Greene and Rubin 1981), LOB (Atwell 1982, Johansson et al. 1986), Parts (man 1986), SEC (Taylor and Knowles 1988), POW (Souter 1989b), UPenn (Santorini 1990), LLC (Eeg-Olofsson 1991), ICE (Greenbaum 1993), and BNC (Garside 1996). The parsing schemes include some which have been used for hand annotation of corpora or manual post-editing of automatic parsers, and others which are unedited output of a parsing program.

  Project deliverables include:
  - a detailed description of each PoS-tagging scheme, at a comparable level of detail. This includes a list of PoS-tags with descriptions and example uses from the source Corpus. The description of the use of PoS-tags is also illustrated in a multi-tagged corpus: a set of sample texts PoS-tagged in parallel with each PoS-tagset (and proofread by experts), for comparative studies
  - an analysis of the different lexical tokenization rules used in the source Corpora, to arrive at a ‘Corpus-neutral’ tokenization scheme (and consequent adjustments to the PoS-tagsets in our study to accept modified tokenization)
  - an implementation of each PoS-tagset in conjunction with our standardised tokenizer, as a family of PoS-taggers, one for each PoS-tagset
  - a method for ‘PoS-tagset conversion’, taking a text tagged according to one PoS-tagset and outputting the text annotated with another PoS-tagset
  - a sample of texts parsed according to a range of parsing schemes: a Multi-Treebank resource for comparative studies
  - an Internet service allowing researchers worldwide free access to the above resources, including a simple email-based method for PoS-tagging any English text with any or all PoS-tagset(s).
Details
- Language :
- English
- Database :
- OpenAIRE
- Journal :
- ICAME Journal: International Computer Archive of Modern and Medieval English Journal
- Accession number :
- edsair.core.ac.uk....bfa195a30cb84a087dfb5c7b4d5582b5