Selection criteria for word trigger pairs in language modeling
- Source :
- Grammatical Inference: Learning Syntax from Sentences ISBN: 9783540617785, ICGI
- Publication Year :
- 1996
- Publisher :
- Springer Berlin Heidelberg, 1996.
Abstract
- In this paper, we study selection criteria for the use of word trigger pairs in statistical language modeling. A word trigger pair is defined as a long-distance word pair. To select the most significant trigger pairs, we need suitable criteria, which are the topic of this paper. We extend a baseline language model by a single word trigger pair and use the perplexity of this extended language model as the selection criterion. This extension is applied to all possible trigger pairs, the number of which is the square of the vocabulary size. When a unigram language model is used as the baseline model, this approach yields the mutual information criterion used in [7, 11]. The more interesting case is to use this criterion in the context of a more powerful model such as a bigram/trigram model with a cache. We study different variants of including word trigger pairs in such a language model. This approach produced better word trigger pairs than the conventional mutual information criterion. When used on the Wall Street Journal corpus, the selected trigger pairs reduced the perplexity of a trigram/cache language model from 138 to 128 for a 5-million-word training set and from 92 to 87 for a 38-million-word training set.
- Subjects :
- Vocabulary
Perplexity
Computer science
Bigram
Speech recognition
Context (language use)
Cache language model
Trigram
Language model
Artificial intelligence
Natural language processing
Word (computer architecture)
Details
- ISBN :
- 978-3-540-61778-5
- Database :
- OpenAIRE
- Journal :
- Grammatical Inference: Learning Syntax from Sentences ISBN: 9783540617785, ICGI
- Accession number :
- edsair.doi...........fa955bcf65ba310601f2c7034c96d13a
- Full Text :
- https://doi.org/10.1007/bfb0033345