25 results on '"N-gram language models"'
Search Results
2. Hybrid hidden Markov models and artificial neural networks for handwritten music recognition in mensural notation.
- Author
-
Calvo-Zaragoza, Jorge, Toselli, Alejandro H., and Vidal, Enrique
- Subjects
- *
ARTIFICIAL neural networks , *HIDDEN Markov models , *MULTILAYER perceptrons , *HANDWRITING recognition (Computer science) , *MUSIC , *STATISTICS , *ERROR rates - Abstract
In this paper, we present a hybrid approach using hidden Markov models (HMM) and artificial neural networks to deal with the task of handwritten Music Recognition in mensural notation. Previous works have shown that the task can be addressed with Gaussian density HMMs that can be trained and used in an end-to-end manner, that is, without prior segmentation of the symbols. However, the results achieved using that approach are not sufficiently accurate to be useful in practice. In this work, we hybridize HMMs with deep multilayer perceptrons (MLPs), which lead to remarkable improvements in optical symbol modeling. Moreover, this hybrid architecture maintains important advantages of HMMs such as the ability to properly model variable-length symbol sequences through segmentation-free training, and the simplicity and robustness of combining optical models with N-gram language models, which provide statistical a priori information about regularities in musical symbol concatenation observed in the training data. The results obtained with the proposed hybrid MLP-HMM approach outperform previous works by a wide margin, achieving symbol-level error rates around 26%, as compared with about 40% reported in previous works. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
3. Answering Definition Questions: Dealing with Data Sparseness in Lexicalised Dependency Trees-Based Language Models
- Author
-
Figueroa, Alejandro, Atkinson, John, van der Aalst, Will, editor, Mylopoulos, John, editor, Sadeh, Norman M., editor, Shaw, Michael J., editor, Szyperski, Clemens, editor, Cordeiro, José, editor, and Filipe, Joaquim, editor
- Published
- 2010
- Full Text
- View/download PDF
4. Performance of Czech Speech Recognition with Language Models Created from Public Resources
- Author
-
V. Prochazka, P. Pollak, J. Zdansky, and J. Nouza
- Subjects
speech recognition ,LVCSR ,n-gram language models ,public language resources ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we tested also an LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistic point of view (mainly via their perplexity rates) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates.
- Published
- 2011
5. Hybrid hidden Markov models and artificial neural networks for handwritten music recognition in mensural notation
- Author
-
Enrique Vidal, Alejandro Héctor Toselli, Jorge Calvo-Zaragoza, Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, and Reconocimiento de Formas e Inteligencia Artificial
- Subjects
Artificial neural networks ,Artificial neural network ,Computer science ,Speech recognition ,Concatenation ,020207 software engineering ,02 engineering and technology ,Perceptron ,Symbol (chemistry) ,Mensural notation ,Artificial Intelligence ,Robustness (computer science) ,Lenguajes y Sistemas Informáticos ,Pattern recognition (psychology) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Hidden Markov models ,Computer Vision and Pattern Recognition ,Language model ,Handwritten music recognition ,Hidden Markov model ,N-gram Language Models - Abstract
In this paper, we present a hybrid approach using hidden Markov models (HMM) and artificial neural networks to deal with the task of handwritten Music Recognition in mensural notation. Previous works have shown that the task can be addressed with Gaussian density HMMs that can be trained and used in an end-to-end manner, that is, without prior segmentation of the symbols. However, the results achieved using that approach are not sufficiently accurate to be useful in practice. In this work, we hybridize HMMs with deep multilayer perceptrons (MLPs), which lead to remarkable improvements in optical symbol modeling. Moreover, this hybrid architecture maintains important advantages of HMMs such as the ability to properly model variable-length symbol sequences through segmentation-free training, and the simplicity and robustness of combining optical models with N-gram language models, which provide statistical a priori information about regularities in musical symbol concatenation observed in the training data. The results obtained with the proposed hybrid MLP-HMM approach outperform previous works by a wide margin, achieving symbol-level error rates around 26%, as compared with about 40% reported in previous works.
- Published
- 2019
- Full Text
- View/download PDF
6. Multimodal city-verification on flickr videos using acoustic and textual features.
- Author
-
Lei, Howard, Choi, Jaeyoung, and Friedland, Gerald
- Abstract
We have performed city-verification of videos based on the videos' audio and metadata, using videos from the MediaEval Placing Task's video set, which contain consumer-produced videos “from-the-wild”. 18 cities were used as targets, for which acoustic and language models were trained, and against which test videos were scored. We have obtained the first known results for the city verification task, with an EER minimum of 21.8%, suggesting that ∼80% of test videos, when tested against a correct target city, were identified as belonging to that city. This result is well above-chance, even as the videos contained very few city-specific audio and metadata features. We have also demonstrated the complementarity of audio and metadata for this task. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
7. Hybrid hidden Markov models and artificial neural networks for handwritten music recognition in mensural notation
- Author
-
Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, Calvo-Zaragoza, Jorge, Toselli, Alejandro H., Vidal Ruiz, Enrique, Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, Calvo-Zaragoza, Jorge, Toselli, Alejandro H., and Vidal Ruiz, Enrique
- Abstract
In this paper, we present a hybrid approach using hidden Markov models (HMM) and artificial neural networks to deal with the task of handwritten Music Recognition in mensural notation. Previous works have shown that the task can be addressed with Gaussian density HMMs that can be trained and used in an end-to-end manner, that is, without prior segmentation of the symbols. However, the results achieved using that approach are not sufficiently accurate to be useful in practice. In this work, we hybridize HMMs with deep multilayer perceptrons (MLPs), which lead to remarkable improvements in optical symbol modeling. Moreover, this hybrid architecture maintains important advantages of HMMs such as the ability to properly model variable-length symbol sequences through segmentation-free training, and the simplicity and robustness of combining optical models with N-gram language models, which provide statistical a priori information about regularities in musical symbol concatenation observed in the training data. The results obtained with the proposed hybrid MLP-HMM approach outperform previous works by a wide margin, achieving symbol-level error rates around 26%, as compared with about 40% reported in previous works.
- Published
- 2019
8. Performance of Czech Speech Recognition with Language Models Created from Public Resources.
- Author
-
Prochazka, Vaclav, Pollak, Petr, Zdansky, Jindrich, and Nouza, Jan
- Subjects
SPEECH perception ,LANGUAGE & languages ,MATHEMATICAL models ,NEWSPAPERS ,BROADCASTING industry - Abstract
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we tested also an LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistic point of view (mainly via their perplexity rates) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates. [ABSTRACT FROM AUTHOR]
- Published
- 2011
9. Improving statistical MT by coupling reordering and decoding.
- Author
-
Crego, Josep Maria and Mariño, José B.
- Subjects
TRANSLATIONS ,TRANSLATING & interpreting ,LANGUAGE & languages ,TRANSLATING services ,TRANSLATORS ,LINGUISTICS ,BILINGUALISM ,SPANISH language ,ENGLISH language - Abstract
In this paper we describe an elegant and efficient approach to coupling reordering and decoding in statistical machine translation, where the n-gram translation model is also employed as distortion model. The reordering search problem is tackled through a set of linguistically motivated rewrite rules, which are used to extend a monotonic search graph with reordering hypotheses. The extended graph is traversed in the global search when a fully informed decision can be taken. Further experiments show that the n-gram translation model can be successfully used as reordering model when estimated with reordered source words. Experiments are reported on the Europarl task (Spanish-English and English-Spanish). Results are presented regarding translation accuracy and computational efficiency, showing significant improvements in translation quality with respect to monotonic search for both translation directions at a very low computational cost. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
10. Augmenting Naive Bayes Classifiers with Statistical Language Models.
- Author
-
Peng, Fuchun, Schuurmans, Dale, and Wang, Shaojun
- Abstract
We augment naive Bayes models with statistical n-gram language models to address short-comings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier which allows for a local Markov dependence among observations; a model we refer to as the C hain A ugmented N aive Bayes (CAN) Bayes classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes-allowing a local Markov chain dependence in the observed variables-while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language independent and task independent nature of these classifiers, we present experimental results on several text classification problems-authorship attribution, text genre classification, and topic detection-in several languages-Greek, English, Japanese and Chinese. We then systematically study the key factors in the CAN model that can influence the classification performance, and analyze the strengths and weaknesses of the model. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
11. Statistical Morphological Disambiguation for Agglutinative Languages.
- Author
-
Hakkani-Tür, Dilek Z., Oflazer, Kemal, and Tür, Gökhan
- Subjects
- *
WORD formation (Grammar) , *LANGUAGE & languages , *SEMANTICS , *COMPARATIVE linguistics , *INFORMATION theory , *STATISTICS - Abstract
We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish presents an interesting problem for statistical models as the potential tag set size is very large because of the productive derivational morphology. We propose to handle this by breaking up the morhosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morhosyntactic tag by considering statistics over the individual inflectional groups and surface roots in trigram models. Among the four models that we have developed and tested, the simplest model ignoring the local morphotactics within words performs the best. Our best trigram model performs with 93.95% accuracy on our test data getting all the morhosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy increases to 95.07%. [ABSTRACT FROM AUTHOR]
- Published
- 2002
12. Handwritten word recognition using Web resources and recurrent neural networks
- Author
-
Laurence Likforman-Sulem, Chafic Mokbel, Cristina Oprean, Adrian Popescu, Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Département Intelligence Ambiante et Systèmes Interactifs (DIASI), Laboratoire d'Intégration des Systèmes et des Technologies (LIST), Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay, University of Balamand - UOB (LIBAN), Laboratoire d'Intégration des Systèmes et des Technologies (LIST (CEA)), and University of Balamand [Liban] (UOB)
- Subjects
Handwriting recognition ,Computer science ,Speech recognition ,Character recognition ,Context (language use) ,Computational linguistics ,computer.software_genre ,[INFO]Computer Science [cs] ,Handwritten word recognition ,business.industry ,Dynamic dictionaries ,N-gram language models ,Linguistics ,Computer Science Applications ,Vocabulary control ,World Wide Web ,Linguistic resources ,Recurrent neural network ,Web resources ,Recurrent neural networks ,Word recognition ,Pattern recognition (psychology) ,Edit distance ,Computer Vision and Pattern Recognition ,Language model ,Artificial intelligence ,String metric ,Out of vocabulary words ,business ,Recurrent neural network (RNN) ,computer ,Software ,Natural language processing - Abstract
International audience; Handwriting recognition systems usually rely on static dictionaries and language models. Full coverage of these dictionaries is generally not achieved when dealing with unrestricted document corpora due to the presence of Out-Of-Vocabulary (OOV) words. We propose an approach which uses the World Wide Web as a corpus to improve dictionary coverage. We exploit the very large and freely available Wikipedia corpus in order to obtain dynamic dictionaries on the fly. We rely on recurrent neural network (RNN) recognizers, with and without linguistic resources, to detect words that are non-reliably recognized within a word sequence. Such words are labeled as non-anchor words (NAWs) and include OOVs and In-Vocabulary words recognized with low confidence. To recognize a non-anchor word, a dynamic dictionary is built by selecting words from the Web resource based on their string similarity with the NAW image, and their linguistic relevance in the NAW context. Similarity is evaluated by computing the edit distance between the sequence of characters generated by the RNN recognizer exploited as a filler model, and the Wikipedia words. Linguistic relevance is based on an N-gram language model estimated from the Wikipedia corpus. Experiments conducted on aword-segmented version of the publicly available RIMES database show that the proposed approach can improve recognition accuracy compared to systems based on static dictionaries only. The proposed approach shows even better behavior as the proportion of OOVs increases, in terms of both accuracy and dictionary coverage.
- Published
- 2015
- Full Text
- View/download PDF
13. Stepwise API usage assistance based on N-gram language models
- Author
-
Prendi, Gonçalo Queiroga, Santos, André Leal, and Ribeiro, Ricardo Daniel Santos Faro Marques
- Subjects
Perplexidade ,API usability ,Code completion ,N-gram language models ,Perplexity ,Usabilidade das APIs - Abstract
Software development requires the use of external Application Programming Interfaces (APIs) in order to reuse libraries and frameworks. Programmers often struggle with unfamiliar APIs due to their lack of resources or less common design. Such difficulties often lead to an incorrect sequences of API calls that may not produce the desired outcome. Language models have shown the ability to capture regularities in text as well as in code. In this work we explore the use of n-gram language models and their ability to capture regularities in API usage through an intrinsic and extrinsic evaluation of these models on some of the most widely used APIs for the Java programming language. To achieve this, several language models were trained over a source code corpora containing several hundreds of GitHub Java projects that use the desired APIs. In order to fully assess the performance of the language models, we have selected APIs from multiple domains and vocabulary sizes. This work allowed us to conclude that n-gram language models are able to capture the API usage patterns due to their low perplexity values and their high overall coverage, going up to 100% in some cases, which encouraged us to create a code completion tool to help programmers stay in the right path when using unknown APIs while allowing for some exploration. O desenvolvimento de software requer a utilização de Application Programming Interfaces (APIs) externas com o objectivo de reutilizar bibliotecas e frameworks. Muitas vezes, os programadores têm dificuldade em utilizar APIs desconhecidas, devido à falta de recursos ou desenho fora do comum. Essas dificuldades provocam inúmeras vezes sequências incorrectas de chamadas às APIs que poderão não produzir o resultado desejado. Os modelos de língua mostraram-se capazes de capturar regularidades em texto, bem como em código. Neste trabalho é explorada a utilização de modelos de língua de n-gramas e a sua capacidade de capturar regularidades na utilização de APIs, através de uma avaliação intrínseca e extrínseca destes modelos em algumas das APIs mais utilizadas na linguagem de programação Java. Para alcançar este objectivo, vários modelos foram treinados sobre repositórios de código do GitHub, contendo centenas de projectos Java que utilizam estas APIs. Com o objectivo de ter uma avaliação completa do desempenho dos modelos de língua, foram seleccionadas APIs de múltiplos domínios e tamanhos de vocabulário. Este trabalho permite concluir que os modelos de língua de n-gramas são capazes de capturar padrões de utilização de APIs devido aos seus baixos valores de perplexidade e a sua alta cobertura, chegando a atingir 100% em alguns casos, o que levou à criação de uma ferramenta de code completion para guiar os programadores na utilização de uma API desconhecida, mas mantendo a possibilidade de a explorar.
- Published
- 2015
14. Markov models for offline handwriting recognition: a survey
- Author
-
Plötz, Thomas and Fink, Gernot A.
- Published
- 2009
- Full Text
- View/download PDF
15. Computer-assisted revision in Spanish academic texts: peer-assessment
- Author
-
Sergi Torner, Irene Renau, Rogelio Nazar, and Carmen López Ferrero
- Subjects
Vocabulary ,Grammar ,Computer science ,business.industry ,Academic discourse ,media_common.quotation_subject ,N-gram language models ,written competence evaluation ,computer.software_genre ,Spelling ,Linguistics ,Written competence evaluation ,Peer-assessment ,Peer assessment ,Academic writing ,General Materials Science ,Artificial intelligence ,Computer-assisted revision ,business ,computer ,Natural language processing ,media_common - Abstract
This paper presents a series of experiments in automatic correction of spelling and grammar errors with a statistic and corpus-driven methodology. The language of the experiments is Spanish, but the method can be easily extrapolated to other languages since we do not use language-specific resources. Our main motivation is to develop a tool that could assist university students to write academic texts, because this kind of system is practically nonexistent in the present, especially in Spanish. Our work is based on previous descriptions, which identify the most problematic phenomena in academic writing at university level. We aim to develop a tool for automatic detection and correction of some of those problematic issues at different linguistic levels such as spelling, grammar and vocabulary. The paper received funding from project 20 PlaCQUID 2012-2013 1, from Universitat Pompeu Fabra.
- Published
- 2014
16. Automatic Transcription of Lyrics in Monophonic and Poliphonic Songs
- Author
-
Fernández Torres, Miguel Ángel, Gallardo Antolín, Ascensión, Universidad Carlos III de Madrid. Departamento de Teoría de la Señal y Comunicaciones, and UC3M. Departamento de Teoría de la Señal y Comunicaciones
- Subjects
Telecomunicaciones ,MAP ,N-gram language models ,MLLR ,Automatic lyrics transcription ,RPCA ,Singing adaptation ,Singing voice separation - Abstract
The paper proposes the implementation of a system for automatic transcription of lyrics in monophonic and polyphonic songs. The basis of the system is an automatic speech recognizer. Taking into account the differences between singing and spoken voice, acoustic models are adapted to singing voice, using several methods, and Language Models (LM) trained on songs lyrics are built. Moreover, background music is attenuated in polyphonic music using the Robust Principal Component Analysis (RPCA) algorithm, trying to facilitate the recognition task avoiding its effect. The results show that, using as adaptation data the same type of tracks that are transcribed then, both adaptation methods and specific LM for songs improve the performance of the baseline system at phonemeand word-level. However, the use of RPCA over polyphonic songs introduces distortions in singing voice, and therefore, in general, it is not useful for improving the performance of the whole system. Master in Multimedia and Communications = Master Universitario en Multimedia y Comunicaciones. Curso 2013/2014
- Published
- 2014
17. Performance of Czech Speech Recognition with Language Models Created from Public Resources
- Abstract
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we tested also an LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistic point of view (mainly via their perplexity rates) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates.
- Published
- 2011
18. Performance of Czech Speech Recognition with Language Models Created from Public Resources
- Abstract
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we tested also an LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistic point of view (mainly via their perplexity rates) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates.
- Published
- 2011
19. Novel statistical approaches to text classification, machine translation and computer-assisted translation
- Author
-
Juan Císcar, Alfonso, Casacuberta Nolla, Francisco, Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Civera Saiz, Jorge, Juan Císcar, Alfonso, Casacuberta Nolla, Francisco, Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, and Civera Saiz, Jorge
- Abstract
Esta tesis presenta diversas contribuciones en los campos de la clasificación automática de texto, traducción automática y traducción asistida por ordenador bajo el marco estadístico. En clasificación automática de texto, se propone una nueva aplicación llamada clasificación de texto bilingüe junto con una serie de modelos orientados a capturar dicha información bilingüe. Con tal fin se presentan dos aproximaciones a esta aplicación; la primera de ellas se basa en una asunción naive que contempla la independencia entre las dos lenguas involucradas, mientras que la segunda, más sofisticada, considera la existencia de una correlación entre palabras en diferentes lenguas. La primera aproximación dió lugar al desarrollo de cinco modelos basados en modelos de unigrama y modelos de n-gramas suavizados. Estos modelos fueron evaluados en tres tareas de complejidad creciente, siendo la más compleja de estas tareas analizada desde el punto de vista de un sistema de ayuda a la indexación de documentos. La segunda aproximación se caracteriza por modelos de traducción capaces de capturar correlación entre palabras en diferentes lenguas. En nuestro caso, el modelo de traducción elegido fue el modelo M1 junto con un modelo de unigramas. Este modelo fue evaluado en dos de las tareas más simples superando la aproximación naive, que asume la independencia entre palabras en differentes lenguas procedentes de textos bilingües. En traducción automática, los modelos estadísticos de traducción basados en palabras M1, M2 y HMM son extendidos bajo el marco de la modelización mediante mixturas, con el objetivo de definir modelos de traducción dependientes del contexto. Asimismo se extiende un algoritmo iterativo de búsqueda basado en programación dinámica, originalmente diseñado para el modelo M2, para el caso de mixturas de modelos M2. Este algoritmo de búsqueda n
- Published
- 2008
20. Statistical morphological disambiguation for agglutinative languages
- Author
-
Dilek Hakkani-Tur, Kemal Oflazer, and Gokhan Tur
- Subjects
FOS: Computer and information sciences ,Agglutinative language ,N-Gram Language Models ,Morphology (linguistics) ,Agglutinative Languages ,Statistical Natural Language Processing ,Turkish ,business.industry ,Computer science ,computer.software_genre ,Morphological Disambiguation ,language.human_language ,Set (abstract data type) ,Inflection ,language ,Trigram ,Artificial intelligence ,business ,80107 Natural Language Processing ,computer ,Natural language processing - Abstract
We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish presents an interesting problem for statistical models as the potential tag set size is very large because of the productive derivational morphology. We propose to handle this by breaking up the morhosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morhosyntactic tag by considering statistics over the individual inflectional groups and surface roots in trigram models. Among the four models that we have developed and tested, the simplest model ignoring the local morphotactics within words performs the best. Our best trigram model performs with 93.95% accuracy on our test data getting all the morhosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy increases to 95.07%.
- Published
- 2000
- Full Text
- View/download PDF
21. Finding and identifying text in 900+ languages.
- Author
-
Brown, Ralf D.
- Subjects
TEXT messages ,PROGRAMMING languages ,OPEN source software ,COMPUTER files ,ERROR rates ,COMPUTATIONAL linguistics ,FALSE alarms - Abstract
Abstract: This paper presents a trainable open-source utility to extract text from arbitrary data files and disk images which uses language models to automatically detect character encodings prior to extracting strings and for automatic language identification and filtering of non-textual strings after extraction. With a test set containing 923 languages, consisting of strings of at most 65 characters, an overall language identification error rate of less than 0.4% is achieved. False-alarm rates on random data are 0.34% when filtering thresholds are set for high recall and 0.012% when set for high precision, with corresponding miss rates of 0.002% and 0.009% in running text. [Copyright &y& Elsevier]
- Published
- 2012
- Full Text
- View/download PDF
22. Rozpoznávácí sítě založené na konečných stavových převodnících pro dopředné a zpětné dekódování v rozpoznávání řeči
- Author
-
Burget, Lukáš, AD, Ralf Schlüter, Novák,, Miroslav, Hannemann, Mirko, Burget, Lukáš, AD, Ralf Schlüter, Novák,, Miroslav, and Hannemann, Mirko
- Abstract
Pomocí matematického formalismu váhovaných konečných stavových převodníků (weighted finite state transducers WFST) může být formulována řada úloh včetně automatického rozpoznávání řeči (automatic speech recognition ASR). Dnešní ASR systémy široce využívají složených pravděpodobnostních modelů nazývaných dekódovací grafy nebo rozpoznávací sítě. Ty jsou z jednotlivých komponent konstruovány pomocí WFST operací, např. kompozice. Každá komponenta je zde zdrojem znalostí a omezuje vyhledávání nejlepší cesty ve složeném grafu v operaci zvané dekódování. Využití koherentního teoretického rámce garantuje, že výsledná struktura bude optimální podle definovaného kritéria. WFST mohou být v rámci daného polookruhu (semi-ring) optimalizovány pomocí determinizace a minimalizace. Aplikací těchto algoritmů získáme optimální strukturu pro prohledávání, optimální distribuce vah je pak získána aplikací "weight pushing" algoritmu. Cílem této práce je zdokonalit postupy a algoritmy pro konstrukci optimálních rozpoznávacích sítí. Zavádíme alternativní weight pushing algoritmus, který je vhodný pro důležitou třídu modelů -- převodníky jazykového modelu (language model transducers) a obecně pro všechny cyklické WFST a WFST se záložními (back-off) přechody. Představujeme také způsob konstrukce rozpoznávací sítě vhodné pro dekódování zpětně v čase, které prokazatelně produkuje ty samé pravděpodobnosti jako dopředná síť. K tomuto účelu jsme vyvinuli algoritmus pro exaktní reverzi back-off jazykových modelů a převodníků, které je reprezentují. Pomocí zpětných rozpoznávacích sítí optimalizujeme dekódování: ve statickém dekodéru je využíváme pro dvoustupňové dekódování (dopředné a zpětné vyhledávání). Tento přístup --- "sledovací" dekódování (tracked decoding) --- umožnuje zahrnout výsledky vyhledávání z prvního stupně do druhého stupně tak, že se sledují hypotézy obsažené v rozpoznávacím grafu (lattice) prvního stupně. Výsledkem je podstatné zrychlení dekódování, protože tato technika umožnuje, Many tasks can be formulated in the mathematical framework of weighted finite state transducers (WFST). This is also the case for automatic speech recognition (ASR). Nowadays, ASR makes extensive use of composed probabilistic models -- called decoding graphs or recognition networks. They are constructed from the individual components via WFST operations like composition. Each component is a probabilistic knowledge source that constrains the search for the best path through the composed graph -- called decoding. The usage of a coherent framework guarantees, that the resulting automata will be optimal in a well-defined sense. WFSTs can be optimized with the help of determinization and minimization in a given semi-ring. The application of these algorithms results in the optimal structure for search and the optimal distribution of weights is achieved by applying a weight pushing algorithm. The goal of this thesis is to further develop the recipes and algorithms for the construction of optimal recognition networks. We introduce an alternative weight pushing algorithm, that is suitable for an important class of models -- language model transducers, or more generally cyclic WFSTs and WFSTs with failure (back-off) transitions. We also present a recipe to construct recognition networks, which are suitable for decoding backwards in time, and which, at the same time, are guaranteed to give exactly the same probabilities as the forward recognition network. For that purpose, we develop an algorithm for exact reversal of back-off language models and their corresponding language model transducers. We apply these backward recognition networks in an optimization technique: In a static network decoder, we use it for a two-pass decoding setup (forward search and backward search). This approach is called tracked decoding and allows to incorporate the first pass decoding into the second pass decoding by tracking hypotheses from the first pass lattice. This technique results in significa
23. Rozpoznávácí sítě založené na konečných stavových převodnících pro dopředné a zpětné dekódování v rozpoznávání řeči
- Author
-
Burget, Lukáš, AD, Ralf Schlüter, Novák,, Miroslav, Hannemann, Mirko, Burget, Lukáš, AD, Ralf Schlüter, Novák,, Miroslav, and Hannemann, Mirko
- Abstract
Pomocí matematického formalismu váhovaných konečných stavových převodníků (weighted finite state transducers WFST) může být formulována řada úloh včetně automatického rozpoznávání řeči (automatic speech recognition ASR). Dnešní ASR systémy široce využívají složených pravděpodobnostních modelů nazývaných dekódovací grafy nebo rozpoznávací sítě. Ty jsou z jednotlivých komponent konstruovány pomocí WFST operací, např. kompozice. Každá komponenta je zde zdrojem znalostí a omezuje vyhledávání nejlepší cesty ve složeném grafu v operaci zvané dekódování. Využití koherentního teoretického rámce garantuje, že výsledná struktura bude optimální podle definovaného kritéria. WFST mohou být v rámci daného polookruhu (semi-ring) optimalizovány pomocí determinizace a minimalizace. Aplikací těchto algoritmů získáme optimální strukturu pro prohledávání, optimální distribuce vah je pak získána aplikací "weight pushing" algoritmu. Cílem této práce je zdokonalit postupy a algoritmy pro konstrukci optimálních rozpoznávacích sítí. Zavádíme alternativní weight pushing algoritmus, který je vhodný pro důležitou třídu modelů -- převodníky jazykového modelu (language model transducers) a obecně pro všechny cyklické WFST a WFST se záložními (back-off) přechody. Představujeme také způsob konstrukce rozpoznávací sítě vhodné pro dekódování zpětně v čase, které prokazatelně produkuje ty samé pravděpodobnosti jako dopředná síť. K tomuto účelu jsme vyvinuli algoritmus pro exaktní reverzi back-off jazykových modelů a převodníků, které je reprezentují. Pomocí zpětných rozpoznávacích sítí optimalizujeme dekódování: ve statickém dekodéru je využíváme pro dvoustupňové dekódování (dopředné a zpětné vyhledávání). Tento přístup --- "sledovací" dekódování (tracked decoding) --- umožnuje zahrnout výsledky vyhledávání z prvního stupně do druhého stupně tak, že se sledují hypotézy obsažené v rozpoznávacím grafu (lattice) prvního stupně. Výsledkem je podstatné zrychlení dekódování, protože tato technika umožnuje, Many tasks can be formulated in the mathematical framework of weighted finite state transducers (WFST). This is also the case for automatic speech recognition (ASR). Nowadays, ASR makes extensive use of composed probabilistic models -- called decoding graphs or recognition networks. They are constructed from the individual components via WFST operations like composition. Each component is a probabilistic knowledge source that constrains the search for the best path through the composed graph -- called decoding. The usage of a coherent framework guarantees, that the resulting automata will be optimal in a well-defined sense. WFSTs can be optimized with the help of determinization and minimization in a given semi-ring. The application of these algorithms results in the optimal structure for search and the optimal distribution of weights is achieved by applying a weight pushing algorithm. The goal of this thesis is to further develop the recipes and algorithms for the construction of optimal recognition networks. We introduce an alternative weight pushing algorithm, that is suitable for an important class of models -- language model transducers, or more generally cyclic WFSTs and WFSTs with failure (back-off) transitions. We also present a recipe to construct recognition networks, which are suitable for decoding backwards in time, and which, at the same time, are guaranteed to give exactly the same probabilities as the forward recognition network. For that purpose, we develop an algorithm for exact reversal of back-off language models and their corresponding language model transducers. We apply these backward recognition networks in an optimization technique: In a static network decoder, we use it for a two-pass decoding setup (forward search and backward search). This approach is called tracked decoding and allows to incorporate the first pass decoding into the second pass decoding by tracking hypotheses from the first pass lattice. This technique results in significa
24. Rozpoznávácí sítě založené na konečných stavových převodnících pro dopředné a zpětné dekódování v rozpoznávání řeči
- Author
-
Burget, Lukáš, AD, Ralf Schlüter, Novák,, Miroslav, Burget, Lukáš, AD, Ralf Schlüter, and Novák,, Miroslav
- Abstract
Pomocí matematického formalismu váhovaných konečných stavových převodníků (weighted finite state transducers WFST) může být formulována řada úloh včetně automatického rozpoznávání řeči (automatic speech recognition ASR). Dnešní ASR systémy široce využívají složených pravděpodobnostních modelů nazývaných dekódovací grafy nebo rozpoznávací sítě. Ty jsou z jednotlivých komponent konstruovány pomocí WFST operací, např. kompozice. Každá komponenta je zde zdrojem znalostí a omezuje vyhledávání nejlepší cesty ve složeném grafu v operaci zvané dekódování. Využití koherentního teoretického rámce garantuje, že výsledná struktura bude optimální podle definovaného kritéria. WFST mohou být v rámci daného polookruhu (semi-ring) optimalizovány pomocí determinizace a minimalizace. Aplikací těchto algoritmů získáme optimální strukturu pro prohledávání, optimální distribuce vah je pak získána aplikací "weight pushing" algoritmu. Cílem této práce je zdokonalit postupy a algoritmy pro konstrukci optimálních rozpoznávacích sítí. Zavádíme alternativní weight pushing algoritmus, který je vhodný pro důležitou třídu modelů -- převodníky jazykového modelu (language model transducers) a obecně pro všechny cyklické WFST a WFST se záložními (back-off) přechody. Představujeme také způsob konstrukce rozpoznávací sítě vhodné pro dekódování zpětně v čase, které prokazatelně produkuje ty samé pravděpodobnosti jako dopředná síť. K tomuto účelu jsme vyvinuli algoritmus pro exaktní reverzi back-off jazykových modelů a převodníků, které je reprezentují. Pomocí zpětných rozpoznávacích sítí optimalizujeme dekódování: ve statickém dekodéru je využíváme pro dvoustupňové dekódování (dopředné a zpětné vyhledávání). Tento přístup --- "sledovací" dekódování (tracked decoding) --- umožnuje zahrnout výsledky vyhledávání z prvního stupně do druhého stupně tak, že se sledují hypotézy obsažené v rozpoznávacím grafu (lattice) prvního stupně. Výsledkem je podstatné zrychlení dekódování, protože tato technika umožnuje, Many tasks can be formulated in the mathematical framework of weighted finite state transducers (WFST). This is also the case for automatic speech recognition (ASR). Nowadays, ASR makes extensive use of composed probabilistic models -- called decoding graphs or recognition networks. They are constructed from the individual components via WFST operations like composition. Each component is a probabilistic knowledge source that constrains the search for the best path through the composed graph -- called decoding. The usage of a coherent framework guarantees, that the resulting automata will be optimal in a well-defined sense. WFSTs can be optimized with the help of determinization and minimization in a given semi-ring. The application of these algorithms results in the optimal structure for search and the optimal distribution of weights is achieved by applying a weight pushing algorithm. The goal of this thesis is to further develop the recipes and algorithms for the construction of optimal recognition networks. We introduce an alternative weight pushing algorithm, that is suitable for an important class of models -- language model transducers, or more generally cyclic WFSTs and WFSTs with failure (back-off) transitions. We also present a recipe to construct recognition networks, which are suitable for decoding backwards in time, and which, at the same time, are guaranteed to give exactly the same probabilities as the forward recognition network. For that purpose, we develop an algorithm for exact reversal of back-off language models and their corresponding language model transducers. We apply these backward recognition networks in an optimization technique: In a static network decoder, we use it for a two-pass decoding setup (forward search and backward search). This approach is called tracked decoding and allows to incorporate the first pass decoding into the second pass decoding by tracking hypotheses from the first pass lattice. This technique results in significa
25. Rozpoznávácí sítě založené na konečných stavových převodnících pro dopředné a zpětné dekódování v rozpoznávání řeči
- Author
-
Burget, Lukáš, AD, Ralf Schlüter, Novák,, Miroslav, Burget, Lukáš, AD, Ralf Schlüter, and Novák,, Miroslav
- Abstract
Pomocí matematického formalismu váhovaných konečných stavových převodníků (weighted finite state transducers WFST) může být formulována řada úloh včetně automatického rozpoznávání řeči (automatic speech recognition ASR). Dnešní ASR systémy široce využívají složených pravděpodobnostních modelů nazývaných dekódovací grafy nebo rozpoznávací sítě. Ty jsou z jednotlivých komponent konstruovány pomocí WFST operací, např. kompozice. Každá komponenta je zde zdrojem znalostí a omezuje vyhledávání nejlepší cesty ve složeném grafu v operaci zvané dekódování. Využití koherentního teoretického rámce garantuje, že výsledná struktura bude optimální podle definovaného kritéria. WFST mohou být v rámci daného polookruhu (semi-ring) optimalizovány pomocí determinizace a minimalizace. Aplikací těchto algoritmů získáme optimální strukturu pro prohledávání, optimální distribuce vah je pak získána aplikací "weight pushing" algoritmu. Cílem této práce je zdokonalit postupy a algoritmy pro konstrukci optimálních rozpoznávacích sítí. Zavádíme alternativní weight pushing algoritmus, který je vhodný pro důležitou třídu modelů -- převodníky jazykového modelu (language model transducers) a obecně pro všechny cyklické WFST a WFST se záložními (back-off) přechody. Představujeme také způsob konstrukce rozpoznávací sítě vhodné pro dekódování zpětně v čase, které prokazatelně produkuje ty samé pravděpodobnosti jako dopředná síť. K tomuto účelu jsme vyvinuli algoritmus pro exaktní reverzi back-off jazykových modelů a převodníků, které je reprezentují. Pomocí zpětných rozpoznávacích sítí optimalizujeme dekódování: ve statickém dekodéru je využíváme pro dvoustupňové dekódování (dopředné a zpětné vyhledávání). Tento přístup --- "sledovací" dekódování (tracked decoding) --- umožnuje zahrnout výsledky vyhledávání z prvního stupně do druhého stupně tak, že se sledují hypotézy obsažené v rozpoznávacím grafu (lattice) prvního stupně. Výsledkem je podstatné zrychlení dekódování, protože tato technika umožnuje, Many tasks can be formulated in the mathematical framework of weighted finite state transducers (WFST). This is also the case for automatic speech recognition (ASR). Nowadays, ASR makes extensive use of composed probabilistic models -- called decoding graphs or recognition networks. They are constructed from the individual components via WFST operations like composition. Each component is a probabilistic knowledge source that constrains the search for the best path through the composed graph -- called decoding. The usage of a coherent framework guarantees, that the resulting automata will be optimal in a well-defined sense. WFSTs can be optimized with the help of determinization and minimization in a given semi-ring. The application of these algorithms results in the optimal structure for search and the optimal distribution of weights is achieved by applying a weight pushing algorithm. The goal of this thesis is to further develop the recipes and algorithms for the construction of optimal recognition networks. We introduce an alternative weight pushing algorithm, that is suitable for an important class of models -- language model transducers, or more generally cyclic WFSTs and WFSTs with failure (back-off) transitions. We also present a recipe to construct recognition networks, which are suitable for decoding backwards in time, and which, at the same time, are guaranteed to give exactly the same probabilities as the forward recognition network. For that purpose, we develop an algorithm for exact reversal of back-off language models and their corresponding language model transducers. We apply these backward recognition networks in an optimization technique: In a static network decoder, we use it for a two-pass decoding setup (forward search and backward search). This approach is called tracked decoding and allows to incorporate the first pass decoding into the second pass decoding by tracking hypotheses from the first pass lattice. This technique results in significa
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.