31 results on '"Morphosyntactic tagging"'
Search Results
2. Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields
- Author
-
Waszczuk, Jakub, Kieraś, Witold, Woliński, Marcin, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Pandu Rangan, C., Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Weikum, Gerhard, Series Editor, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Sojka, Petr, editor, Horák, Aleš, editor, Kopeček, Ivan, editor, and Pala, Karel, editor
- Published
- 2018
3. TOWARDS AN OPTIMAL SET OF INITIAL WEIGHTS FOR A DEEP NEURAL NETWORK ARCHITECTURE.
- Author
-
Saadi, A. and Belhadef, H.
- Subjects
CENTER of mass ,ARCHITECTURE ,ACOUSTIC imaging ,TAGUCHI methods ,CENTROID
- Abstract
Modern neural network architectures are powerful models. They have been proven efficient in many fields, such as imaging and acoustic. However, these neural networks involve a long-running and time-consuming process. To accelerate the training process, we propose a two-stage approach based on data analysis and focus on the gravity center concept. The neural network is first trained on reduced data represented by a set of centroids of the original data points, and then the learned weights are used to initialize a second training phase of the neural network over the full-blown data. The design of deep neural networks is extremely difficult, and the primary objective is to achieve high performance. In this study, we apply the Taguchi method to select good values for the factors required to build the proposed architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2019
4. A Tiered CRF Tagger for Polish
- Author
-
Radziszewski, Adam, Bembenik, Robert, editor, Skonieczny, Lukasz, editor, Rybinski, Henryk, editor, Kryszkiewicz, Marzena, editor, and Niezgodka, Marek, editor
- Published
- 2013
5. Taggers Gonna Tag: An Argument against Evaluating Disambiguation Capacities of Morphosyntactic Taggers
- Author
-
Radziszewski, Adam, Acedański, Szymon, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Sojka, Petr, editor, Horák, Aleš, editor, Kopeček, Ivan, editor, and Pala, Karel, editor
- Published
- 2012
6. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change
- Author
-
Rögnvaldsson, Eiríkur, Helgadóttir, Sigrún, Sporleder, Caroline, editor, van den Bosch, Antal, editor, and Zervanou, Kalliopi, editor
- Published
- 2011
7. Morphosyntactic Constraints in the Acquisition of Linguistic Knowledge for Polish
- Author
-
Piasecki, Maciej, Radziszewski, Adam, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Marciniak, Małgorzata, editor, and Mykowiecka, Agnieszka, editor
- Published
- 2009
8. MORPHOLOGICAL POS TAGGING IN ORAL LANGUAGE CORPUS: CHALLENGES FOR AELIUS
- Author
-
Gabriel de Ávila Othero and Mônica Rigo Ayres
- Subjects
Tagger ,Morphosyntactic Tagging ,Corpus Linguistics ,Technology ,Language and Literature
- Abstract
In this paper, we present the results of our work on automatic morphological annotation of excerpts from a corpus of spoken language, belonging to the VARSUL project, using the free morphosyntactic tagger Aelius. We present 20 texts containing 154,530 words, annotated automatically and corrected manually. This paper presents the tagger Aelius and our manual review of the texts, as well as our suggestions for improving the tool with respect to features of oral texts. We assess the performance of morphosyntactic tagging on a spoken language corpus, an unprecedented challenge for the tagger. Based on the tagger's errors, we try to infer annotation patterns that overcome the program's limitations, and we propose implementation changes that would allow Aelius to tag spoken language corpora more effectively, especially in cases such as interjections, apheresis, onomatopoeia and conversational markers.
- Published
- 2014
9. Automatic Tagging of Compound Verb Groups in Czech Corpora
- Author
-
Žáčková, Eva, Popelínský, Luboš, Nepil, Miloslav, Goos, G., editor, Hartmanis, J., editor, van Leeuwen, J., editor, Carbonell, Jaime G., editor, Siekmann, Jörg, editor, Sojka, Petr, editor, Kopeček, Ivan, editor, and Pala, Karel, editor
- Published
- 2000
10. Izazovi morfosintaktičkog označavanja na primjerima španjolskog i hrvatskog jezika
- Author
-
Kozolić, Klara and Mikelić Preradović, Nives
- Subjects
POS taggers ,Spanish ,SOCIAL SCIENCES. Information and Communication Sciences ,morphosyntactic tagging ,Croatian ,MSD tags ,POS tags
- Abstract
Over the last decades, language technologies have developed rapidly, driven by the aim of encoding language as a means of human communication. They speed up language acquisition and support research in fields such as linguistics and psychology. Corpora play a major role in developing accurate, high-quality language technologies. They can be annotated at several levels, one of which is assigning a POS or MSD tag to each word, depending on how inflected the language is. This BA thesis describes POS and MSD tagging and the difficulties that generally arise in the process. To make the examples concrete, Spanish was chosen as a less inflected language and Croatian as a highly inflected one. Spanish is tested with the Stanford POS Tagger and TreeTagger through the Python programming language, using sentences from different registers that pose potential difficulties for taggers. The same process is repeated for Croatian with the web interface of the ReLDIanno tagger. The results of the Spanish taggers are compared with each other, and the tagging problems of less inflected and highly inflected languages are contrasted using the Spanish and Croatian examples.
- Published
- 2021
11. Towards an optimal set of initial weights for a Deep Neural Network architecture
- Author
-
Abdelhalim Saadi, Hacene Belhadef, and Faculty of Information and Communication Technology (ICT)
- Subjects
Big Data ,Deep Neural Networks ,business.industry ,Computer science ,General Neuroscience ,Centroids ,Arabic Language Processing ,Regression ,Set (abstract data type) ,Machine Translation ,Artificial Intelligence ,Hardware and Architecture ,Neural network architecture ,Artificial intelligence ,business ,Software ,Morphosyntactic Tagging
- Abstract
Modern neural network architectures are powerful models. They have been proven efficient in many fields, such as imaging and acoustic. However, these neural networks involve a long-running and time-consuming process. To accelerate the training process, we propose a two-stage approach based on data analysis and focus on the gravity center concept. The neural network is first trained on reduced data represented by a set of centroids of the original data points, and then the learned weights are used to initialize a second training phase of the neural network over the full-blown data. The design of deep neural networks is extremely difficult, and the primary objective is to achieve high performance. In this study, we apply the Taguchi method to select good values for the factors required to build the proposed architecture. 29 06 403 426
- Published
- 2019
12. POLISH TAGGER TaKIPI: RULE BASED CONSTRUCTION AND OPTIMISATION
- Author
-
MACIEJ PIASECKI
- Subjects
morphosyntactic tagging ,Polish ,rule based tagging ,decision trees ,Information technology ,T58.5-58.64
- Abstract
The large number of distinct tags, limited corpora and free word order are the main causes of low tagging accuracy for Polish (automatic disambiguation of morphological descriptions) when commonly used techniques based on stochastic modelling are applied. This paper presents the rule-based architecture of the TaKIPI Polish tagger, which combines handwritten and automatically extracted rules. It discusses possibilities for optimising the tagger's parameters and components, including rule extraction methods other than the C4.5 decision trees applied initially. The main goal of the paper is to explore a range of promising rule-based classifiers and investigate their impact on tagging accuracy. Simple techniques for combining classifiers are also tested. The experiments show that even a simple combination of different classifiers can increase the tagger’s accuracy by almost one percent.
- Published
- 2007
13. NLP Web Services for Slovene and English: Morphosyntactic Tagging, Lemmatisation and Definition Extraction.
- Author
-
Pollak, Senja, Trdin, Nejc, Vavpetic, Anže, and Erjavec, Tomaž
- Subjects
NATURAL language processing ,WEB services ,MORPHOSYNTAX ,FEATURE extraction ,ENGLISH language ,SLOVENIAN language ,ELECTRONIC data processing
- Abstract
This paper presents a web service for automatic linguistic annotation of Slovene and English texts. The web service accepts uploaded text in a number of different input formats, then converts, tokenises, tags and lemmatises the text, and returns the annotated text. The paper presents the ToTrTaLe annotation tool and the implementation of the annotation workflow in two workflow construction environments, Orange4WS and ClowdFlows. It also proposes several improvements to the annotation tool, based on identifying various types of errors made by the existing ToTrTaLe tool, and implements these improvements as a post-processing step in the workflow. The workflows enable users to incorporate the annotation service as an elementary constituent of other natural language processing workflows, as demonstrated by the definition extraction use case. [ABSTRACT FROM AUTHOR]
- Published
- 2012
14. Utilização de informações lexicais extraídas automaticamente de corpora na análise sintática computacional do português.
- Author
-
Alencar, Leonel Figueiredo De
- Subjects
*LEXICON , *GRAMMAR , *ARTIFICIAL intelligence , *ORTHOGRAPHY & spelling , *PARSING (Computer grammar)
- Abstract
Lexicon modeling is the main difficulty to overcome when building deep syntactic parsers for unrestricted text. Traditionally, two strategies have been used for tackling lexical information in the domain of unrestricted syntactic parsing: compiling thousands of lexical entries or formulating hundreds of morphological rules. Due to productive word-formation processes, proper names, and non-standard spellings, the former strategy, resorted to by freely downloadable parsers for Brazilian Portuguese (BP), is not robust. On the other hand, deploying the latter is a time-intensive and non-trivial knowledge engineering task. At present, there is no open-source licensed wide-coverage parser for BP. Aiming at filling this gap as soon as possible, we argue in this paper that a much less expensive and much more efficient solution to the lexicon bottleneck in parsing is to simply reuse freely available morphosyntactic taggers as the system's lexical analyzer. Besides, thanks to the free and broad availability of POS-tagged corpora for BP and efficient machine learning packages, building additional highly accurate taggers has become an almost effortless task. In order to easily integrate the output of taggers constructed in different architectures into context-free grammar chart parsers compiled with the Natural Language Toolkit (NLTK), we have developed a Python module named ALEXP. To the best of our knowledge, this is the first free software specially optimized for processing Portuguese to accomplish such a task. The tool's functionality is described by means of BP grammar prototypes applied to parsing real-world sentences, with very promising results. [ABSTRACT FROM AUTHOR]
- Published
- 2011
15. Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields
- Author
-
Jakub Waszczuk, Witold Kieraś, Marcin Woliński, Abteilung für Computerlinguistik [Düsseldorf ], Philosophische Fakultät [Düsseldorf], Heinrich Heine Universität Düsseldorf = Heinrich Heine University [Düsseldorf]-Heinrich Heine Universität Düsseldorf = Heinrich Heine University [Düsseldorf]-Instituts für Sprache und Information [Düsseldorf], Heinrich Heine Universität Düsseldorf = Heinrich Heine University [Düsseldorf], Instytut Podstaw Informatyki (IPI PAN), Polska Akademia Nauk = Polish Academy of Sciences (PAN), and Waszczuk, Jakub
- Subjects
Conditional random field ,Computer science ,02 engineering and technology ,computer.software_genre ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,conditional random fields ,0202 electrical engineering, electronic engineering, information engineering ,word segmentation ,Segmentation ,CRFS ,060201 languages & linguistics ,historical Polish ,business.industry ,Graph based ,Text segmentation ,Contrast (statistics) ,06 humanities and the arts ,Directed acyclic graph ,morphosyntactic tagging ,[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] ,0602 languages and literature ,020201 artificial intelligence & image processing ,Artificial intelligence ,Heuristics ,business ,computer ,Natural language processing
- Abstract
The paper presents a system for joint morphosyntactic disambiguation and segmentation of Polish based on conditional random fields (CRFs). The system is coupled with Morfeusz, a morphosyntactic analyzer for Polish, which represents both morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG). We rely on constrained linear-chain CRFs generalized to work directly on DAGs, which allows us to perform segmentation as a by-product of morphosyntactic disambiguation. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. We evaluate our system on historical corpora of Polish, where segmentation ambiguities are more prominent than in contemporary Polish, and show that our system significantly outperforms several baseline segmentation methods.
- Published
- 2018
16. Computerising the lexicon: Modelling, development and use of morphological, syntactic and semantic lexicons
- Author
-
Sagot, Benoît, Automatic Language Modelling and ANAlysis & Computational Humanities (ALMAnaCH), Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Sorbonne Université, and Ludovic Denoyer
- Subjects
Parsing ,Lexique morphosyntaxique ,Analyse syntaxique ,Morphosyntactic tagging ,Natural language processing ,WordNet ,Morphologie computationnelle ,[SCCO.LING]Cognitive science/Linguistics ,Traitement automatique des langues ,Computational morphology ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,Lexique syntaxique ,Morphological lexicon ,Part-of-speech tagging ,Développement de ressources lexicales ,Syntactic lexicon ,Lexical resource development ,[SHS.LANGUE]Humanities and Social Sciences/Linguistics ,Analyse morphosyntaxique ,Lexicon ,Lexique
- Published
- 2018
17. Computerising the lexicon
- Author
-
Sagot, Benoît, Automatic Language Modelling and ANAlysis & Computational Humanities (ALMAnaCH), Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Sorbonne Université, and Ludovic Denoyer
- Subjects
Parsing ,Lexique morphosyntaxique ,Analyse syntaxique ,Morphosyntactic tagging ,Natural language processing ,WordNet ,Morphologie computationnelle ,[SCCO.LING]Cognitive science/Linguistics ,Traitement automatique des langues ,Computational morphology ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,Lexique syntaxique ,Morphological lexicon ,Part-of-speech tagging ,Développement de ressources lexicales ,Syntactic lexicon ,Lexical resource development ,[SHS.LANGUE]Humanities and Social Sciences/Linguistics ,Analyse morphosyntaxique ,Lexicon ,Lexique
- Published
- 2018
18. Multi-level approach for the analysis of non-standardized textual data : corpus of texts in middle french
- Author
-
Aouini, Mourad, Edition, Littératures, Langages, Informatique, Arts, Didactique, Discours - UFC (EA 4661) (ELLIADD), Université Bourgogne Franche-Comté [COMUE] (UBFC)-Université de Franche-Comté (UFC), Université Bourgogne Franche-Comté, Max Silberztein, Jean-Philippe Genet, Université de Franche-Comté (UFC), Université Bourgogne Franche-Comté [COMUE] (UBFC)-Université Bourgogne Franche-Comté [COMUE] (UBFC), and Edition, Littératures, Langages, Informatique, Arts, Didactique, Discours - UFC (UR 4661) (ELLIADD)
- Subjects
Morphosyntactic tagging ,Étiquetage morphosyntaxique ,Multi-Level approach ,Middle French ,Nlp ,Named-Entity recognition ,Tal ,Approche multi-Niveaux ,Données textuelles non-Standardisées ,Reconnaissance des entités nommées ,Moyen Français ,[SHS.LANGUE]Humanities and Social Sciences/Linguistics ,MEDITEXT ,Standardized textual data
- Abstract
This thesis presents an approach to analysing non-standardized texts. It models a processing chain for the automatic annotation of texts: grammatical annotation using a morphosyntactic tagging method, and semantic annotation using a named-entity recognition system. In this context, we present a system for analysing Middle French, a language in the midst of change whose spelling, inflectional system and syntax are not stable. Texts in Middle French are mainly distinguished by the absence of normalized orthography and by the geographical and chronological variability of medieval lexicons. The main objective is to present a system dedicated to building linguistic resources, in particular electronic dictionaries, based on morphological rules. We then present the instructions we established to build a morphosyntactic tagger that automatically produces contextual analyses using disambiguation grammars. Finally, we retrace the path that led us to set up local grammars for finding named entities. To this end, we compiled MEDITEXT, a corpus of Middle French texts dating from the end of the thirteenth century to the fifteenth century.
- Published
- 2018
19. Mise au point d'une méthode d'annotation morphosyntaxique fine du serbe
- Author
-
Miletic, Aleksandra, Fabre, Cécile, Stosic, Dejan, Cognition, Langues, Langage, Ergonomie (CLLE-ERSS), École pratique des hautes études (EPHE), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Université Toulouse - Jean Jaurès (UT2J)-Université Bordeaux Montaigne-Centre National de la Recherche Scientifique (CNRS), ATALA, and Fabre, Cécile
- Subjects
Morphosyntactic tagging ,training corpus ,Annotation morphosyntaxique ,serbe ,[SHS.LANGUE]Humanities and Social Sciences/Linguistics ,[SHS.LANGUE] Humanities and Social Sciences/Linguistics ,corpus d’entraînement ,Serbian
- Abstract
Developing a method for detailed morphosyntactic tagging of Serbian. This paper presents an experiment in detailed morphosyntactic tagging of the Serbian subcorpus of the parallel Serbian-French-English ParCoLab corpus. We enriched an existing POS annotation with finer-grained morphosyntactic properties in order to prepare the corpus for subsequent parsing stages. We compared three approaches: 1) manual annotation; 2) pre-annotation with a tagger trained on Croatian, followed by manual correction; 3) retraining the model on a small validated sample of the corpus (20K tokens), followed by automatic annotation and manual correction. The Croatian model remains globally stable when applied to Serbian texts, but because the two tagsets differ, substantial manual intervention was still required. The model retrained on the validated sample has the same accuracy as the existing model, but the observed speed-up in manual correction confirms that it is better suited to the task.
- Published
- 2016
20. Part-of-Speech Tagging for Croatian using Conditional Random Fields
- Author
-
Crnjak, Vjeran and Šnajder, Jan
- Subjects
morphosyntactic tagging ,Croatian ,conditional random fields ,obrada prirodnog jezika ,TECHNICAL SCIENCES. Computing ,TEHNIČKE ZNANOSTI. Računarstvo ,označavanje vrste riječi ,uvjetna slučajna polja ,natural language processing ,hrvatski jezik
- Abstract
Part-of-speech tagging is one of the fundamental tasks in natural language processing and a prerequisite for many others. This thesis describes the problem of POS and morphosyntactic tagging. It gives an overview of basic and advanced stochastic graphical models and their application to tagging highly inflected languages. It describes the development of a morphosyntactic tagger based on constrained conditional random fields and carefully analyses the difficulties encountered during development.
- Published
- 2014
21. Using lexical information automatically extracted from corpora in the computational parsing of portuguese
- Author
-
Araripe, Leonel Figueiredo de Alencar
- Subjects
Gramática livre de contexto ,Aquisição de conhecimento lexical ,POS tagging ,Linguística computacional ,Morphosyntactic tagging ,Processamento automático da linguagem natural ,Natural language processing ,Computational linguistics ,Aprendizado de máquina ,Syntactic parsing ,Lexical knowledge acquisition ,Etiquetagem morfossintática ,Computational processing of Portuguese ,Etiquetador morfossintático ,Part-of-speech tagging ,Machine learning ,Análise sintática automática ,Context-free grammar ,Processamento computacional do português
- Abstract
ARARIPE, Leonel Figueiredo de Alencar. Utilização de informações lexicais extraídas automaticamente de corpora na análise sintática computacional do português. Revista de Estudos da Linguagem, Belo Horizonte, v. 19, n. 1, p. 7-85, jan./jun. 2011. Lexicon modeling is the main difficulty to overcome when building deep syntactic parsers for unrestricted text. Traditionally, two strategies have been used for tackling lexical information in the domain of unrestricted syntactic parsing: compiling thousands of lexical entries or formulating hundreds of morphological rules. Due to productive word-formation processes, proper names, and non-standard spellings, the former strategy, resorted to by freely downloadable parsers for Brazilian Portuguese (BP), is not robust. On the other hand, deploying the latter is a time-intensive and non-trivial knowledge engineering task. At present, there is no open-source licensed wide-coverage parser for BP. Aiming at filling this gap as soon as possible, we argue in this paper that a much less expensive and much more efficient solution to the lexicon bottleneck in parsing is to simply reuse freely available morphosyntactic taggers as the system’s lexical analyzer. Besides, thanks to the free and broad availability of POS-tagged corpora for BP and efficient machine learning packages, building additional highly accurate taggers has become an almost effortless task. In order to easily integrate the output of taggers constructed in different architectures into context-free grammar chart parsers compiled with the Natural Language Toolkit (NLTK), we have developed a Python module named ALEXP. To the best of our knowledge, this is the first free software specially optimized for processing Portuguese to accomplish such a task. The tool’s functionality is described by means of BP grammar prototypes applied to parsing real-world sentences, with very promising results.
- Published
- 2011
22. Development and Applications of the Croatian 1984 Corpus for the MULTEXT-East Resources
- Author
-
Agić, Željko, Merkler, Danijela, Berović, Daša, and Tadić, Marko
- Subjects
1984 corpus ,Multext-East ,morphosyntactic tagging ,Croatian language - Abstract
/
- Published
- 2011
23. Improving Chunking Accuracy on Croatian Texts by Morphosyntactic Tagging
- Author
-
Vučković, K., Agić, Ž., Tadić, Marko, Calzolari, Nicoletta, Choukri, Khalid, Maegaard, Bente, Mariani, Joseph, Odijk, Jan, Piperidis, Stelios, Rosner, Mike, and Tapias, Daniel
- Subjects
chunking ,partial parsing ,morphosyntactic tagging - Abstract
In this paper, we present the results of an experiment in using a stochastic morphosyntactic tagger as a pre-processing module of a rule-based chunker and partial parser for Croatian in order to raise its overall chunking and partial parsing accuracy on Croatian texts. In order to conduct the experiment, we have manually chunked and partially parsed 459 sentences from the Croatia Weekly 100 kw newspaper sub-corpus taken from the Croatian National Corpus, which had previously also been morphosyntactically disambiguated and lemmatized. Due to the lack of resources of this type, these sentences were designated as a temporary chunking and partial parsing gold standard for Croatian. We have then evaluated the chunker and partial parser in three different scenarios: (1) chunking previously morphosyntactically untagged text, (2) chunking text that was tagged using the stochastic morphosyntactic tagger for Croatian and (3) chunking manually tagged text. The obtained F1 scores for the three scenarios were, respectively, 0.875 (P: 0.826, R: 0.930), 0.900 (P: 0.866, R: 0.937) and 0.930 (P: 0.912, R: 0.949). The paper provides the description of language resources and tools used in the experiment, its setup and discussion of results and perspectives for future work.
- Published
- 2010
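The F1 scores quoted in the abstract above can be re-derived from the reported precision and recall values. The following is a minimal sketch, assuming the standard balanced F1 (the harmonic mean of P and R); the scenario labels and the names `f1_score` and `f1_by_scenario` are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract; labels are ours.
scenarios = {
    "untagged": (0.826, 0.930),
    "stochastic tagger": (0.866, 0.937),
    "manual tags": (0.912, 0.949),
}

# Round to three decimals, the precision used in the abstract.
f1_by_scenario = {
    name: round(f1_score(p, r), 3) for name, (p, r) in scenarios.items()
}

for name, f1 in f1_by_scenario.items():
    print(f"{name}: F1 = {f1:.3f}")
```

Rounded to three decimals, the computed values agree with the reported 0.875, 0.900 and 0.930.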
24. Tagger Voting Improves Morphosyntactic Tagging Accuracy on Croatian Texts
- Author
-
Agić, Ž, Marko Tadić, Dovedan, Z., Lužar-Stiffler, Vesna, Jarec, Iva, and Bekić, Zoran
- Subjects
Tagger voting ,morphosyntactic tagging ,part-of-speech tagging ,Croatian language - Abstract
We present results of an experiment dealing with combining outputs of five part-of-speech taggers via tagger voting in order to improve the overall accuracy of morphosyntactic tagging of Croatian texts using a subset of the Multext-East v3 tagset. The increase in accuracy over the best-performing single tagger is shown to exist, but not to be statistically significant. We discuss the performance of the five single taggers, the overlaps between tagger pairs, the reduced tagset and the voting scheme, along with scores for five meaningful tagger combinations in the voting scheme and future work plans.
- Published
- 2010
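The abstract above does not spell out the voting scheme, so the following is only a sketch of one common variant: per-token plurality voting, with ties resolved in favor of a designated preferred tagger. The function name `vote`, the `preferred` parameter and the toy tag sequences are assumptions made for illustration.

```python
from collections import Counter


def vote(tag_sequences, preferred=0):
    """Combine aligned tag sequences by per-token plurality vote.

    tag_sequences: one tag list per tagger, all over the same tokens.
    preferred: index of the tagger whose tag is used when the vote is tied.
    """
    if not tag_sequences:
        raise ValueError("need at least one tagger output")
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("tagger outputs must cover the same tokens")

    voted = []
    for tags in zip(*tag_sequences):  # tags proposed for one token
        counts = Counter(tags)
        top = max(counts.values())
        winners = [tag for tag, count in counts.items() if count == top]
        # A unique winner is taken; otherwise fall back to the preferred tagger.
        voted.append(winners[0] if len(winners) == 1 else tags[preferred])
    return voted


# Three hypothetical taggers over a four-token sentence (reduced tags).
outputs = [
    ["N", "V", "A", "N"],
    ["N", "V", "N", "N"],
    ["A", "V", "A", "V"],
]
combined = vote(outputs)
print(combined)
```

Falling back to a single preferred tagger keeps the result deterministic; weighting votes by each tagger's measured accuracy would be another reasonable design.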
25. Error Analysis in Croatian Morphosyntactic Tagging
- Author
-
Zdravko Dovedan, Zeljko Agic, Marko Tadić, Lužar-Stiffler, Vesna, Jarec, Iva, and Bekić, Zoran
- Subjects
Croatian ,business.industry ,Computer science ,Stochastic process ,Speech recognition ,morphosyntactic tagging ,part-of-speech tagging ,error analysis ,error distribution ,Croatian language ,hybrid tagging ,computer.software_genre ,Part of speech ,language.human_language ,Knowledge-based systems ,Error analysis ,Noun ,language ,Artificial intelligence ,business ,Hidden Markov model ,computer ,Natural language processing ,Natural language - Abstract
In this paper, we provide detailed insight into properties of errors generated by a stochastic morphosyntactic tagger assigning Multext-East morphosyntactic descriptions to Croatian texts. Tagging the Croatia Weekly newspaper corpus by the CroTag tagger in stochastic mode revealed that approximately 85 percent of all tagging errors occur on nouns, adjectives, pronouns and verbs. Moreover, approximately 50 percent of these are shown to be incorrect assignments of case values. We provide various other distributional properties of errors in assigning morphosyntactic descriptions for these and other parts of speech. On the basis of these properties, we propose rule-based and stochastic strategies which could be integrated in the tagging module, creating a hybrid procedure in order to raise overall tagging accuracy for Croatian.
- Published
- 2009
26. Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
- Author
-
Agić, Željko, Tadić, Marko, Dovedan, Zdravko, Stančić, Hrvoje, Seljan, Sanja, Bawden, David, Lasić-Lazić, Jadranka, and Slavić, Aida
- Subjects
morphosyntactic tagging ,part-of-speech tagging ,stochastic tagger ,Multext East tagset ,tagset reductions ,Croatian language - Abstract
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus yields a rather small language model for stochastic tagging of free-domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied on Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. Obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
- Published
- 2009
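The idea of a tagset reduction can be sketched as follows. Multext-East morphosyntactic descriptions are positional strings whose first character encodes the part of speech, so one simple reduction is to keep only a prefix of each tag. This is an assumption made for illustration: the subsets actually evaluated in the paper are not listed in the abstract, and `reduce_msd` and the sample tags are hypothetical.

```python
def reduce_msd(tag: str, keep: int = 1) -> str:
    """Reduce a positional MSD tag to its first `keep` positions."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    return tag[:keep]


# Illustrative MSD-style tags (noun and verb descriptions).
full_tags = ["Ncmsn", "Ncfsg", "Ncmpa", "Vmip3s", "Vmip1p"]

pos_only = {reduce_msd(tag) for tag in full_tags}            # part of speech
pos_and_type = {reduce_msd(tag, keep=2) for tag in full_tags}  # plus subtype

print(len(set(full_tags)), "distinct full tags")
print(len(pos_only), "distinct tags after reducing to part of speech")
```

A smaller tagset leaves fewer distinct labels to estimate from the same training corpus, which is the trade-off the abstract describes.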
27. Investigating Language Independence in HMM PoS/MSD-Tagging
- Author
-
Marko Tadić, Zeljko Agic, Zdravko Dovedan, Lužar-Stiffler, Vesna, Hljuz Dobrić, Vesna, and Bekić, Zoran
- Subjects
Czech ,language independence ,part-of-speech tagging ,morphosyntactic tagging ,hidden Markov models ,Computer science ,business.industry ,Romanian ,Speech recognition ,computer.software_genre ,Estonian ,language.human_language ,Trigram tagger ,language ,Artificial intelligence ,Language model ,Serbian ,Hidden Markov model ,business ,computer ,Natural language ,Natural language processing - Abstract
The paper presents an investigation of functional dependencies in morphosyntactic tagging using hidden Markov models. Starting from the well-known fact that the HMM tagging paradigm relies on lexical knowledge acquired from training corpora and stored in the form of transition and emission matrices, also called a language model, we apply the TnT trigram tagger to create language models for seven different languages from the MULTEXT East version 3 project translations of George Orwell's novel 1984: Czech, Estonian, Hungarian, Romanian, Serbian, Slovene and the original English version. We then use these language models in the tagging procedure and obtain details on various relations between training corpora statistics, training outputs and outputs of the tagging procedure.
- Published
- 2008
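How transition and emission tables drive HMM tagging can be illustrated with a small Viterbi decoder. This is a sketch under simplifying assumptions: it uses a first-order (bigram) model rather than the trigram model of TnT, replaces smoothing with a fixed probability floor, and runs on a hand-made toy model; all names and numbers are hypothetical.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a first-order HMM."""
    if not words:
        return []

    def logp(p):
        # Floor zero probabilities so unseen events stay finite in log space.
        return math.log(max(p, floor))

    # Each column maps tag -> (best log-probability, previous tag).
    columns = [{
        t: (logp(start_p.get(t, 0.0)) + logp(emit_p[t].get(words[0], 0.0)), None)
        for t in tags
    }]
    for word in words[1:]:
        prev = columns[-1]
        column = {}
        for t in tags:
            score, back = max(
                (prev[s][0] + logp(trans_p[s].get(t, 0.0)), s) for s in tags
            )
            column[t] = (score + logp(emit_p[t].get(word, 0.0)), back)
        columns.append(column)

    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: columns[-1][t][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Toy language model: start, transition and emission probabilities.
tags = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {
    "N": {"dogs": 0.5, "cats": 0.4, "bark": 0.1},
    "V": {"bark": 0.6, "chase": 0.4},
}

best_path = viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p)
print(best_path)
```

In a trained tagger the three tables would be estimated from an annotated corpus, which is why corpus size and tagset size shape the resulting language model.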
28. A resource-based Korean morphological annotation system
- Author
-
Huh, Hyun-Gue, Laporte, Eric, Laboratoire d'Informatique Gaspard-Monge (LIGM), Centre National de la Recherche Scientifique (CNRS)-Fédération de Recherche Bézout-ESIEE Paris-École des Ponts ParisTech (ENPC)-Université Paris-Est Marne-la-Vallée (UPEM), and Université Paris-Est Marne-la-Vallée (UPEM)-École des Ponts ParisTech (ENPC)-ESIEE Paris-Fédération de Recherche Bézout-Centre National de la Recherche Scientifique (CNRS)
- Subjects
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,FOS: Computer and information sciences ,Computer Science - Computation and Language ,Preprocessing and Tokenization of data ,[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] ,Agglutinative language ,NLP dictionary ,[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing ,Korean ,Morphosyntactic lexicon ,Computation and Language (cs.CL) ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,Morphosyntactic Tagging - Abstract
We describe a resource-based method of morphological annotation of written Korean text. Korean is an agglutinative language. The output of our system is a graph of morphemes annotated with accurate linguistic information. The language resources used by the system can be easily updated, which allows users to control the evolution of the performance of the system. We show that morphological annotation of Korean text can be performed directly with a lexicon of words and without morphological rules. Comment: 6 pages
- Published
- 2007
29. Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing
- Author
-
Aduriz, Itziar, Aranzabe, Maxux, Arriola, Jose Maria, Atutxa, Atziber, Díaz de Ilarraza, Arantza, Ezeiza, Nerea, Gojenola, Koldo, Oronoz, Maite, Soroa, Aitor, Urizar, Ruben, IXA Taldea (IXA), and University of the Basque Country/Euskal Herriko Unibertsitatea (UPV/EHU)
- Subjects
Basque ,[SHS.LANGUE.TRAI.RESS]Humanities and Social Sciences/Linguistics/Automatic Processing of the Language/Linguistic Resources (Corpuses, Glossaries, Grammars...) ,[SHS.LANGUE.TRAI.RESS]Humanities and Social Sciences/Linguistics/domain_shs.langue.trai/domain_shs.langue.trai.ress ,Corpus ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,Morphosyntactic Tagging - Abstract
This article describes the different steps in the construction of EPEC (Reference Corpus for the Processing of Basque). EPEC is a corpus of standard written Basque that has been manually tagged at different levels (morphology, surface syntax, phrases) and is currently being hand tagged at deep syntax level following the Dependency Structure-based Scheme. It is aimed to be a "reference" corpus for the development and improvement of several NLP tools for Basque. This corpus has already been used for the construction of some tools such as a morphological analyser, a lemmatiser, or a shallow syntactic analyser.
- Published
- 2006
30. Adquisición de recursos básicos de lingüística computacional del gallego para aplicaciones informáticas de tecnología lingüística
- Author
-
Castro Pena, Luz, López López, Ángel, Pichel Campos, José Ramom, Aguirre Moreno, José Luis, Álvarez Lugrís, Alberto, Gómez Guinovart, Xavier, Sacau Fontenla, Elena, and Santos Suárez, Lara
- Subjects
Recursos lingüísticos ,Morphosyntactic tagging ,Traducción automática ,Grammar checking ,Lengua gallega ,Anotación morfosintáctica ,Language resources ,Machine translation ,Verificación gramatical ,Galician language - Abstract
This work presents the main features of the company-university project "Acquisition of basic resources in Galician computational linguistics for language engineering" (Xunta de Galicia, 2001-2003, ref. PGIDT01TICC06E), led by imaxin|software and the Computational Linguistics Group (SLI) of the University of Vigo. Funding body: Secretaría Xeral de Investigación e Desenvolvemento, Xunta de Galicia, 2001-2004 (ref. PGIDT01TICC06E).
- Published
- 2003
31. Etiquetario morfosintáctico del SLI para corpus de lengua gallega : aplicación al corpus
- Author
-
Aguirre Moreno, José Luis, Álvarez Lugrís, Alberto, and Gómez Guinovart, Xavier
- Subjects
Corpus linguistics ,Morphosyntactic tagging ,Lingüística de corpus ,Lengua gallega ,Anotación morfosintáctica ,Galician language - Abstract
In this article we present a complete and normalized morphosyntactic tagset for the annotation of linguistic corpora in Galician. The elaboration of this tagset, designed by the Computational Linguistics Group (SLI) of the University of Vigo, following strictly the EAGLES recommendations (Leech and Wilson, 1996), includes the creation of an intermediate tagset that allows us to establish a correspondence between the grammatical information encoded for Galician in the CLUVI (Linguistic Corpus of the University of Vigo) and the information encoded in the EAGLES standard format in corpora of other languages. This work was funded by the Xunta de Galicia under the projects "Desenvolvemento e aplicación de técnicas de análise lingüístico-computacional de corpus orais e escritos para o procesamento do CLUVI (Corpus Lingüístico da Universidade de Vigo)" (PGIDT01PXI30203PR) and "Estudio e adquisición de recursos básicos de lingüística computacional do galego para a elaboración e mellora de aplicacións informáticas de tecnoloxía lingüística" (ref. PGIDT01TICC06E).
- Published
- 2002