51 results on '"Isolating language"'
Search Results
2. A CLASSIFICATION OF ENGLISH UNCOUNTABLE NOUNS
- Author
-
Alexey B Kostromin
- Subjects
Linguistics and Language ,media_common.quotation_subject ,countable and uncountable nouns ,computer.software_genre ,lcsh:P325-325.5 ,Language and Linguistics ,Synthetic language ,Noun ,Countable set ,Proper noun ,the article and the structure of noun meanings ,Mathematics ,media_common ,Grammatical gender ,Grammar ,subgroups of uncountable nouns ,lcsh:P101-410 ,business.industry ,Isolating language ,Linguistics ,lcsh:Language. Linguistic theory. Comparative grammar ,Uncountable set ,Artificial intelligence ,business ,computer ,Natural language processing ,countability categories ,lcsh:Semantics - Abstract
The article studies the classification of uncountable English nouns. The topic is widely described in both grammar manuals and research works; however, it can be presented in more detail. From a typological perspective, English is not just an analytic language: it certainly demonstrates some features of an isolating language. Unlike such languages as French, German or Italian, it lacks grammatical gender, which gives the classification of English nouns a different basis. Nouns are divided into classes according to the way things exist: either as separate single units forming quantities, or as a continuity with no definite limits, whether material or mental. This is the guideline for a more detailed description of countable and uncountable nouns. The analytic character of English also implies that noun classifiers are not embedded in the basic form of a noun, but are either a separate root morpheme (the article a) or are not expressed at all. This differs greatly from grammatical gender, which is a permanent attribute of a noun. The initial class of a noun is not fixed and can change depending on meaning and context. The balance of those two factors determines the shift from countable to uncountable status and vice versa. That is why speakers of synthetic languages face a difficult task in mastering the English article, and a detailed classification of uncountable nouns may help them considerably. A statistical analysis of the word death, which is both countable and uncountable, was made to show the prevailing usage.
- Published
- 2017
3. Computational Modeling of Affixoid Behavior in Chinese Morphology
- Author
-
Sara Court, Yu-Hsiang Tseng, Pei-Yi Chen, and Shu-Kai Hsieh
- Subjects
Morphology (linguistics) ,business.industry ,Computer science ,media_common.quotation_subject ,Affix ,WordNet ,Morphology (biology) ,Isolating language ,computer.software_genre ,Mandarin Chinese ,language.human_language ,language ,Artificial intelligence ,Polysemy ,business ,Function (engineering) ,computer ,Natural language processing ,media_common - Abstract
The morphological status of affixes in Chinese has long been a matter of debate. How one might apply the conventional criteria of free/bound and content/function features to distinguish word-forming affixes from bound roots in Chinese is still far from clear. Issues involving polysemy and diachronic dynamics further blur the boundaries. In this paper, we propose three quantitative features in a computational model of affixoid behavior in Mandarin Chinese. The results show that, except for in a very few cases, there are no clear criteria that can be used to identify an affix’s status in an isolating language like Chinese. A diachronic check using contextualized embeddings with the WordNet Sense Inventory also demonstrates the possible role of the polysemy of lexical roots across diachronic settings.
- Published
- 2020
- Full Text
- View/download PDF
4. A morpheme-based analysis of lexical bundles in Korean: an interface between corpus-driven approach and lexicography
- Author
-
Hyun-ju Song, Jun Choi, and Kilim Nam
- Subjects
Agglutinative language ,Linguistics and Language ,Computer science ,Interface (Java) ,business.industry ,media_common.quotation_subject ,Isolating language ,computer.software_genre ,Language and Linguistics ,Linguistics ,Lexicography ,Lexical bundles ,Corpus linguistics ,Morpheme ,Artificial intelligence ,Function (engineering) ,business ,computer ,Natural language processing ,media_common - Abstract
This study proposes a new methodology for morpheme-based analysis designed to identify multi-word patterns in Korean, which is a typical example of an agglutinative language. The need for a new approach in corpus linguistics, one which takes language typological characteristics into consideration, is also a crucial point of this paper. In Korean, functional words like prepositions or conjunctions are realized as bound morphemes (emi or cosa) that function as ‘minimal grammatical units’. When formulaic expressions in Korean are analyzed according to the morpheme unit, as is the case in our study, the findings differ significantly from those of previous studies. Based on this methodology, our results provide supporting evidence for the following: (1) lexical bundles are prevalent in Korean, just as in English; (2) computer-defined formulaicity might be language-universal; (3) finally, differences in distributions or discourse functions of morphemic bundles in various genres or registers can be language-specific. The external and internal language factors that may influence these differences are discussed.
- Published
- 2016
- Full Text
- View/download PDF
5. An Algorithm for Morphological Segmentation of Esperanto Words
- Author
-
Theresa Guinard
- Subjects
060201 languages & linguistics ,Computer science ,business.industry ,Esperanto grammar ,Speech recognition ,Agglutination ,06 humanities and the arts ,02 engineering and technology ,Isolating language ,computer.software_genre ,Synthetic language ,Constructed language ,Morpheme ,Compound ,0602 languages and literature ,0202 electrical engineering, electronic engineering, information engineering ,Computational linguistics. Natural language processing ,020201 artificial intelligence & image processing ,Segmentation ,Artificial intelligence ,P98-98.5 ,business ,computer ,Natural language processing - Abstract
Morphological analysis (finding the component morphemes of a word and tagging morphemes with part-of-speech information) is a useful preprocessing step in many natural language processing applications, especially for synthetic languages. Compound words from the constructed language Esperanto are formed by straightforward agglutination, but for many words, there is more than one possible sequence of component morphemes. However, one segmentation is usually more semantically probable than the others. This paper presents a modified n-gram Markov model that finds the most probable segmentation of any Esperanto word, where the model’s states represent morpheme part-of-speech and semantic classes. The overall segmentation accuracy was over 98% for a set of presegmented dictionary words.
- Published
- 2016
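The segmentation idea in the abstract above (several candidate morpheme sequences, scored by a Markov model over morpheme classes) can be illustrated with a minimal sketch. This is not the paper's model: the toy lexicon, the class tags, and the bigram transition probabilities below are all hypothetical values chosen for illustration, and the search is a plain memoized dynamic program.

```python
import math
from functools import lru_cache

# Hypothetical toy lexicon: morpheme -> class tag (illustrative, not from the paper).
LEXICON = {
    "mal": "prefix",
    "san": "root",
    "koleg": "root",
    "kol": "root",
    "eg": "suffix",
    "ul": "suffix",
    "ej": "suffix",
    "o": "ending",
}

# Hypothetical bigram transition probabilities between class tags.
TRANSITIONS = {
    ("<s>", "prefix"): 0.3,
    ("<s>", "root"): 0.7,
    ("prefix", "root"): 1.0,
    ("root", "suffix"): 0.5,
    ("root", "ending"): 0.5,
    ("suffix", "suffix"): 0.3,
    ("suffix", "ending"): 0.7,
    ("ending", "</s>"): 1.0,
}


def best_segmentation(word):
    """Return the most probable morpheme sequence for `word`, or None."""

    @lru_cache(maxsize=None)
    def solve(start, prev_tag):
        # Best (log-probability, morphemes) for word[start:] given the previous tag.
        if start == len(word):
            p = TRANSITIONS.get((prev_tag, "</s>"), 0.0)
            return (math.log(p), ()) if p > 0 else (-math.inf, ())
        best = (-math.inf, ())
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            tag = LEXICON.get(piece)
            if tag is None:
                continue
            p = TRANSITIONS.get((prev_tag, tag), 0.0)
            if p == 0.0:
                continue
            tail_score, tail = solve(end, tag)
            score = math.log(p) + tail_score
            if score > best[0]:
                best = (score, (piece,) + tail)
        return best

    score, pieces = solve(0, "<s>")
    return list(pieces) if score > -math.inf else None


print(best_segmentation("kolego"))       # ['koleg', 'o']
print(best_segmentation("malsanulejo"))  # ['mal', 'san', 'ul', 'ej', 'o']
```

With these made-up probabilities, the ambiguous word "kolego" resolves to koleg+o rather than kol+eg+o because the shorter tag sequence has the higher product of transition probabilities. A model trained on real data could of course rank the candidates differently.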
6. One Novel Word Segmentation Method Based on N-Shortest Path in Vietnamese
- Author
-
Jinwen Lai, Ke Xiaohua, Chen Jihua, Huang Ruibin, and Haijiao Luo
- Subjects
business.industry ,Computer science ,Vietnamese ,Text segmentation ,Isolating language ,computer.software_genre ,Directed acyclic graph ,language.human_language ,Shortest path problem ,language ,Segmentation ,Artificial intelligence ,Syllable ,business ,computer ,Natural language processing ,Word (computer architecture) - Abstract
Automatic word segmentation of Vietnamese is the primary step in Vietnamese text information processing, which would be an important support for cross-language information processing tasks in China and Vietnam. Since the Vietnamese language is an isolating language with tones, each syllable can not only form a word individually, but also create a new word by combining with left and/or right syllables. Therefore, automatic word segmentation of Vietnamese cannot be simply based on spaces. This paper takes automatic word segmentation of the Vietnamese language as the research object. First, it makes a rough segmentation of Vietnamese sentences with the N-shortest path model. Then, syllables in each sentence are abstracted into a directed acyclic graph. Finally, the Vietnamese word segmentation is obtained by calculating the shortest path with the help of the BEMS marking system. The results show that the proposed algorithm achieves a satisfactory performance in Vietnamese word segmentation.
- Published
- 2019
- Full Text
- View/download PDF
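The abstract above treats the syllables of a sentence as a directed acyclic graph and picks a segmentation by shortest path. A minimal sketch of that idea follows, under several assumptions: every word costs 1 (so the shortest path is the segmentation with the fewest words), the lexicon is a hypothetical two-entry set, and the B/E/M/S tags are read as begin/end/middle/single-syllable word, which the abstract does not spell out. The paper's N-shortest-path rough segmentation step is not reproduced here.

```python
def segment(syllables, lexicon, max_len=3):
    """Shortest path over the syllable DAG; each edge is one candidate word."""
    n = len(syllables)
    inf = float("inf")
    cost = [inf] * (n + 1)   # cost[j] = fewest words covering syllables[:j]
    back = [0] * (n + 1)     # back[j] = start index of the last word ending at j
    cost[0] = 0
    for i in range(n):       # nodes are already in topological order
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = " ".join(syllables[i:j])
            # A single syllable is always a valid edge; longer spans need the lexicon.
            if j - i == 1 or word in lexicon:
                if cost[i] + 1 < cost[j]:
                    cost[j] = cost[i] + 1
                    back[j] = i
    words = []
    j = n
    while j > 0:
        i = back[j]
        words.append(" ".join(syllables[i:j]))
        j = i
    return words[::-1]


def bems_tags(words):
    """Tag each syllable as B (begin), M (middle), E (end) or S (single)."""
    tags = []
    for word in words:
        k = len(word.split())
        if k == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (k - 2) + ["E"])
    return tags


# Hypothetical mini-lexicon of multi-syllable words (lower-cased).
LEXICON = {"sinh viên", "tiếng việt"}
words = segment("sinh viên học tiếng việt".split(), LEXICON)
print(words)             # ['sinh viên', 'học', 'tiếng việt']
print(bems_tags(words))  # ['B', 'E', 'S', 'B', 'E']
```

Uniform edge costs are the simplest choice; a frequency-based cost such as a negative log probability per word would slot into the same loop without changing its structure.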
7. W stronę słowotwórstwa operacyjnego
- Author
-
Sebastian Żurowski
- Subjects
lcsh:Ethnology. Social and cultural anthropology ,Cultural Studies ,gramatyka operacyjna ,Linguistics and Language ,History ,Literature and Literary Theory ,etymology ,Computer science ,media_common.quotation_subject ,computer.software_genre ,unit of language ,Language and Linguistics ,słowotwórstwo ,Morpheme ,media_common ,Grammar ,business.industry ,etymologia ,lcsh:PG1-9665 ,Isolating language ,Word formation ,morfem ,Linguistics ,word formation ,lcsh:GN301-674 ,operational grammar ,lcsh:Slavic languages. Baltic languages. Albanian languages ,Anthropology ,morpheme ,Etymology ,Artificial intelligence ,jednostka jezyka ,business ,computer ,Natural language processing - Abstract
Towards an operational word formation. In this article, selected theoretical and practical problems associated with placing word formation in the model of operational grammar (by Andrzej Bogusławski) are discussed. The aim of the paper is to prepare the ground for a possible description of morpheme units of language in this model. The proposed way to enter word formation in the model is to divide it into operational and etymological word formation.
- Published
- 2015
- Full Text
- View/download PDF
8. A Contrastive Study on Modality of Korean and Chinese: from a Typological Perspective
- Author
-
Zhang Jingxi
- Subjects
Agglutinative language ,Communication ,business.industry ,Perspective (graphical) ,Isolating language ,Psychology ,business ,Modality (semiotics) ,Linguistics ,Linguistic typology
- Published
- 2015
- Full Text
- View/download PDF
9. A Factoid Question Answering System for Vietnamese
- Author
-
Duc-Thien Bui and Phuong Le-Hong
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,Computer science ,business.industry ,Factoid ,Vietnamese ,02 engineering and technology ,Isolating language ,computer.software_genre ,language.human_language ,Intelligent user interface ,020204 information systems ,Test set ,0202 electrical engineering, electronic engineering, information engineering ,language ,Question answering ,020201 artificial intelligence & image processing ,General knowledge ,Artificial intelligence ,business ,Computation and Language (cs.CL) ,computer ,Natural language ,Natural language processing - Abstract
In this paper, we describe the development of an end-to-end factoid question answering system for the Vietnamese language. This system combines both statistical models and ontology-based methods in a chain of processing modules to provide high-quality mappings from natural language text to entities. We present the challenges in the development of such an intelligent user interface for an isolating language like Vietnamese and show that techniques developed for inflectional languages cannot be applied "as is". Our question answering system can answer a wide range of general knowledge questions with promising accuracy on a test set. In the proceedings of the HQA'18 workshop, The Web Conference Companion, Lyon, France.
- Published
- 2018
- Full Text
- View/download PDF
10. Morpheme Level Word Embedding
- Author
-
Tatiana Kovalenko, Julia Yakovleva, Andrey Filchenkov, and Ruslan Galinsky
- Subjects
0301 basic medicine ,Word embedding ,Computer science ,business.industry ,Sentiment analysis ,Isolating language ,computer.software_genre ,Linguistics ,Synthetic language ,03 medical and health sciences ,030104 developmental biology ,0302 clinical medicine ,Analytic language ,Morpheme ,Inflection ,Language model ,Artificial intelligence ,business ,computer ,030217 neurology & neurosurgery ,Natural language processing - Abstract
Modern NLP tasks such as sentiment analysis, semantic analysis, text entity extraction and others depend on the quality of the language model. Language structure influences quality: a model that fits analytic languages well for some NLP tasks does not fit synthetic languages well enough for the same tasks. For example, the well-known Word2Vec [27] model shows good results for English, which is more analytic than synthetic, but Word2Vec has problems with synthetic languages on some NLP tasks because of their high degree of inflection. Since every morpheme in a synthetic language carries some information, we propose a morpheme-level model for solving different NLP tasks. We consider the Russian language in our experiments. First, we describe how to build a morpheme extractor from prepared vocabularies. Our extractor reached 91% accuracy on vocabularies with known morpheme segmentation. Second, we show how it can be applied to NLP tasks, and then we discuss our results, pros and cons, and our future work.
- Published
- 2017
- Full Text
- View/download PDF
11. A Study on the Cognitive Process of Isolating Language for Developing Software User Interface
- Author
-
Phenpimon Wilairatana, Tsutomu Konosu, Koichi Mizutani, and Chanongkorn Kuntonbutr
- Subjects
Space (punctuation) ,Homograph ,Kanji ,Natural language user interface ,Computer science ,business.industry ,05 social sciences ,Isolating language ,Semantics ,computer.software_genre ,050105 experimental psychology ,Human–computer interaction ,0501 psychology and cognitive sciences ,Artificial intelligence ,business ,computer ,Homophone ,Natural language processing ,Sentence - Abstract
This research focuses on the cognitive process in the Thai language because written Thai does not put spaces between the words of a sentence as English does, and does not have Kanji characters as Japanese does. Words are written continuously until the end of the sentence, which sometimes leads to ambiguity. This research studies the cognitive processing and influence of final consonants and non-final consonants, using word identification tasks to verify differences in time spent to answer and in answer correctness across 4 main variables. Types of question were divided into single words and sentences. Types of option word (word choices) were divided into words with consonants and without consonants. The level of difficulty was divided into 3 levels: easy, middle, and hard. The Type of word was divided into 4 types: Uncontrolled, Homophone, Homograph, and Semantic. Differences in the average duration for identifying target words were analyzed statistically by ANOVA and t-test. The results showed that final consonants work as the space does in English, separating words and sentences, and can make the Thai language easier. They act as an accelerator of competency in cognition of the Thai language and help learners to receive information more precisely and quickly.
- Published
- 2017
- Full Text
- View/download PDF
12. L2 Learners’ Knowledge of Derivational Morpheme
- Author
-
Haerim Ahn, mijin Jung, and Sungmook Choi
- Subjects
business.industry ,Morpheme ,L2 learners ,Artificial intelligence ,Isolating language ,computer.software_genre ,business ,Psychology ,computer ,Natural language processing ,Linguistics
- Published
- 2014
- Full Text
- View/download PDF
13. The Acquisition of Polysynthetic Languages
- Author
-
Joe Blythe, Barbara Kelly, Rachel Nordlinger, and Gillian Wigglesworth
- Subjects
Comprehension ,Linguistics and Language ,Computer science ,business.industry ,Polysynthetic language ,Isolating language ,Artificial intelligence ,computer.software_genre ,business ,computer ,Linguistics ,Natural language processing ,Task (project management) - Abstract
One of the major challenges in acquiring a language is being able to use morphology as an adult would, and thus, a considerable amount of acquisition research has focused on morphological production and comprehension. Most of this research, however, has focused on the acquisition of morphology in isolating languages, or languages (such as English) with limited inflectional morphology. The nature of the learning task is different, and potentially more challenging, when the child is learning a polysynthetic language: a language in which words are highly morphologically complex, expressing in a single word what in English takes a multi-word clause. To date, there has been no cross-linguistic survey of how children approach this puzzle and learn polysynthetic languages. This paper aims to provide such a survey, including a discussion of some of the general findings in the literature regarding the acquisition of polysynthetic systems.
- Published
- 2014
- Full Text
- View/download PDF
14. School Grammar Approach to the Fuzzy Edge on the Level of Morpheme and Word
- Author
-
Byongcheol Jeong
- Subjects
Grammar ,business.industry ,Computer science ,media_common.quotation_subject ,Isolating language ,computer.software_genre ,Fuzzy logic ,Linguistics ,Morpheme ,Regular tree grammar ,Artificial intelligence ,Enhanced Data Rates for GSM Evolution ,business ,computer ,Natural language processing ,Word (computer architecture) ,Generative grammar ,media_common
- Published
- 2012
- Full Text
- View/download PDF
15. Morpheme-Based Uyghur Speech Recognition Considering Vowel Weakening
- Author
-
Li Xiao, Xue Huajian, Osman Turghun, and Zhang Rong-hui
- Subjects
Vocabulary ,General Computer Science ,Computer science ,business.industry ,Speech recognition ,media_common.quotation_subject ,Isolating language ,computer.software_genre ,Lexicon ,Morpheme ,Vowel ,Language model ,Artificial intelligence ,business ,computer ,Natural language processing ,Word (computer architecture) ,media_common - Abstract
This paper describes the challenges in Uyghur speech recognition caused by the rich morphology of this language, and a new morpheme-based approach to overcome them. The standard morpheme-based approach is also investigated in this paper and outperforms the word-based approach. However, the standard approach pays no attention to the frequent vowel weakening of Uyghur, which potentially increases the number of morphemes and reduces the effect of the morpheme-based approach. In the new approach, vowel-weakened surface forms are replaced by their corresponding stems in lexicon building and language modeling in order to obtain a more effective vocabulary and a more robust language model. The vocabulary and language model are then used in the experiments. Experimental results show that the new approach gives the best result and performs better than the standard morpheme-based approach.
- Published
- 2012
- Full Text
- View/download PDF
16. AMRITA_CEN@FIRE-2014
- Author
-
K. P. Soman and Anand Kumar
- Subjects
Agglutinative language ,Computer science ,business.industry ,Speech recognition ,Isolating language ,computer.software_genre ,Machine learning ,Part of speech ,language.human_language ,Morpheme ,Tamil ,Noun ,language ,Proper noun ,Artificial intelligence ,Suffix ,business ,computer ,Natural language processing - Abstract
This paper presents a method of morpheme extraction and lemmatization for the Tamil language in the Morpheme Extraction Task (MET) of FIRE-2014. Tamil is a morphologically rich and agglutinative language. Such a language needs deeper analysis at the word level to capture the meaning of the word from its morphemes and its categories. In this attempt, the methodology employed to extract Tamil morphemes and lemmas is based on a supervised machine learning algorithm for nouns and verbs and simple suffix stripping for pronouns and proper nouns. Morphemes are extracted for other part-of-speech categories using a Tamil part-of-speech tagger. In supervised learning, the morphological analyzer problem is redefined as a classification problem. We decompose the problem of noun and verb morpheme extraction into two sub-problems: learning to perform morpheme identification of words in a text, and learning to perform morpheme tagging. In addition to the morpheme extraction task results of FIRE-2014, we have carried out different experiments to show the effectiveness of the proposed method.
- Published
- 2015
- Full Text
- View/download PDF
17. Morpheme list
- Author
-
Robert P. Stockwell and Donka Minkova
- Subjects
Morphology (linguistics) ,Morpheme ,business.industry ,Computer science ,Artificial intelligence ,Isolating language ,business ,computer.software_genre ,computer ,Linguistics ,Natural language processing
- Published
- 2001
- Full Text
- View/download PDF
18. Resolution Strategy of Morphological Ambiguity for Korean Parsing
- Author
-
Yong-Seok Lee, Yi-Gyu Hwang, Hyeon-Yeong Lee, and Woo-Jeong Bae
- Subjects
Agglutinative language ,Parsing ,business.industry ,LR parser ,Computer science ,media_common.quotation_subject ,Speech recognition ,Ambiguity ,Isolating language ,computer.software_genre ,Morpheme ,Morphological analysis ,Alternation (linguistics) ,Artificial intelligence ,business ,computer ,Natural language processing ,media_common - Abstract
In agglutinative languages such as Korean and Japanese, morphological analysis plays an important role in finding the morphological structure of input sentences. In these languages, a word can be divided into many morphemes, and in many cases each morpheme is lexically ambiguous because of the properties of agglutinative languages. In this paper, we propose a morphological ambiguity reduction method that can reduce the burden of syntactic analysis by 1) adopting a syntactic unit of morphemes in the Feature Generator, 2) making features with a detailed classification of the POS (Part-Of-Speech) tagset, and 3) restricting the remaining morphological ambiguity by multi-path LR parsing. For this work, we introduce a Morphological Feature Generator (FG), which transforms morpheme sequences into one morpheme if these morphemes are regarded as carrying one meaning syntactically or semantically.
- Published
- 2000
- Full Text
- View/download PDF
19. Influence of Morpheme Polysemy on Morpheme Frequency
- Author
-
Andrea Krott
- Subjects
Linguistics and Language ,business.industry ,Isolating language ,computer.software_genre ,Language and Linguistics ,language.human_language ,Linguistics ,German ,Morpheme ,Noun ,language ,Alternation (linguistics) ,Artificial intelligence ,Polysemy ,business ,computer ,Natural language processing ,Mathematics - Abstract
A hypothesis about the dependency of morpheme frequency on morpheme polysemy will be discussed and tested on a German sample. It will be shown that the hypothesis seems to be correct for derivational affixes and morphemes which are used as verbs, nouns, or adjectives. In the case of morphemes which are used as functional words the hypothesis cannot be confirmed.
- Published
- 1999
- Full Text
- View/download PDF
20. Building a Vietnamese Lexicon Ontology for Syntactic Parsing and Document Annotation
- Author
-
Tuyen Thi-Thanh Do
- Subjects
Parsing ,business.industry ,Computer science ,Vietnamese ,Isolating language ,Ontology (information science) ,computer.software_genre ,Lexicon ,language.human_language ,TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES ,Morpheme ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,language ,Upper ontology ,Artificial intelligence ,business ,computer ,Natural language processing ,Sentence - Abstract
Vietnamese is an isolating language in which words do not have morphology. This means that if a word changes its syntactic function, it still has the same morphemes. Therefore, a syntactic parser based only on the POS tag of the word still makes errors in identifying the phrases of a sentence. In order to overcome this problem, an ontology of Vietnamese lexicons should be built to provide the exact syntactic and semantic information of every word for parsing. In addition, this ontology is also built to solve the problems of synonyms in Vietnamese document annotation.
- Published
- 2013
- Full Text
- View/download PDF
21. Recursive Compounds and Linking Morpheme
- Author
-
Makiko Mukai
- Subjects
Recursion ,business.industry ,Object (grammar) ,Isolating language ,Type (model theory) ,computer.software_genre ,Genitive case ,Iterated function ,Morpheme ,Embedding ,Artificial intelligence ,business ,computer ,Natural language processing ,Mathematics - Abstract
This paper shows that the existence of a linking morpheme is not related to recursion of compounds in the given language, but that a linking morpheme does play a role in recursion in one way or another. Recursion of compounding is defined as embedding at the edge or in the center of an action or object of an instance of the same type. Iteration, on the other hand, is simply unembedded repetition of an action or object (Bisetto 2010). Based on these definitions, it is argued that there are languages with a linking morpheme overtly realized in recursive compounds. Second, there are also languages which have genitive compounds with a linking morpheme, although their recursive compounds do not have a linking morpheme. On the other hand, there are languages with genitive compounds and recursive compounds of coordinate VNN or nominal coordinate compounds. In these languages, recursive compounds are not so productive. Finally, Turkish and Greek show that the existence of a linking morpheme is not related to recursion of compounds. They have a linking morpheme in iterated compounds, but not in recursive compounds.
- Published
- 2013
- Full Text
- View/download PDF
22. Free morpheme constraint revisited
- Author
-
Shoji Azuma
- Subjects
Linguistics and Language ,Sociology and Political Science ,Computer science ,business.industry ,Isolating language ,computer.software_genre ,Language and Linguistics ,Bound morpheme ,Linguistics ,Morpheme ,Anthropology ,Theoretical linguistics ,Alternation (linguistics) ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
As one of the best-known linguistic constraints on code-switching, Poplack (1980) has proposed the ‘free morpheme constraint’ which predicts no switching between a free morpheme and a bound morpheme. Although the free morpheme constraint is attractive due to its simplicity, cross-linguistic data do not support the constraint. The present study argues that semantic content of an item should be considered rather than its morphological form. As an alternative to the free morpheme constraint, a constraint based on the notion of semantic content is proposed.
- Published
- 1996
- Full Text
- View/download PDF
23. Some remarks on the relation between word length and morpheme length
- Author
-
Andrea Krott
- Subjects
Linguistics and Language ,Relation (database) ,business.industry ,Function (mathematics) ,Isolating language ,computer.software_genre ,Lexical database ,Language and Linguistics ,language.human_language ,Linguistics ,German ,Morpheme ,language ,Artificial intelligence ,business ,computer ,Natural language processing ,Word (computer architecture) ,Statistical hypothesis testing ,Mathematics - Abstract
Against the background of the synergetic approach to language some hypotheses about the relation between word length and morpheme length are discussed. First of all, Menzerath's law on the word‐morpheme level is given. On the basis of this law, a function is derived which reflects the dependence of the number of graphemes of a word on its number of morphemes. Furthermore, the influence of morpheme length on the number of graphemes of words containing morphemes of this length is analysed. It is shown that this relation is not just proportional. The results of statistical tests which were realised with the help of the CELEX lexical database are given. They show that all of these hypotheses can for the present be accepted as valid for the languages English, German and Dutch.
- Published
- 1996
- Full Text
- View/download PDF
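The abstract above does not state the function it derives. As a hedged illustration only, the form in which the Menzerath-Altmann law is commonly cited can be written as follows, with the symbols x, y, L and the parameters a, b, c introduced here as assumptions rather than taken from the paper: x is the number of morphemes in a word, y(x) the mean morpheme length in graphemes, and L(x) the word length in graphemes.

```latex
% Commonly cited Menzerath-Altmann form (assumed here, not quoted from the paper)
y(x) = a \, x^{b} \, e^{-c x}

% Word length in graphemes = number of morphemes times mean morpheme length
L(x) = x \cdot y(x) = a \, x^{b+1} \, e^{-c x}
```

Under this reading, L(x) is strictly proportional to x only in the special case b = c = 0. If longer words tend to have shorter morphemes, as the law predicts, word length grows more slowly than proportionally, which is in line with the abstract's remark that the relation is "not just proportional".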
24. Detecting Machine Generated Domain Names Based on Morpheme Features
- Author
-
Jian Gong, Qian Liu, and Wei-wei Zhang
- Subjects
Root (linguistics) ,Computer science ,Character (computing) ,business.industry ,Speech recognition ,Affix ,Isolating language ,computer.software_genre ,Spelling ,Domain (software engineering) ,Morpheme ,Feature (machine learning) ,Artificial intelligence ,business ,computer ,Natural language processing ,Natural language ,Word (computer architecture) - Abstract
To detect machine-generated domain names, we propose a method that excludes human-generated domain names by analyzing the basic morphemes in the character strings of domain names (the basic morphemes in English are word roots and affixes, while in Chinese they are initials and finals). Experimental results show that morpheme analysis greatly improves the efficiency and accuracy of detection.
Keywords: DNS; domain name; morpheme; word root; word affix.
... Samuel have introduced word segmentation techniques from the field of natural language to extract and restructure keywords from domain names for DNS probing and proactive forecasting of blacklists (4-6). Among the detection methods mentioned above, the machine learning method based on statistical lexical features has lower computational complexity, but an attacker can easily evade detection by using prior feature statistics. Although word segmentation methods can improve detection accuracy through analysis at the semantic level, their conditions are too strict, since they require domain names to be generated entirely from a dictionary. Taking the advantages and disadvantages of both kinds of methods into account, we propose a new method. Starting from the basic lexical features and further analyzing the basic morphemes in character strings to exclude human-generated domain names, we can effectively improve the accuracy of machine-generated domain detection. Moreover, compared with the huge corpus in a natural language model, the number of morphemes is relatively small (the number of word roots commonly used in English is only 900, and the total number of initials and finals in Chinese spelling is only 47), which guarantees a low system overhead.
- Published
- 2013
- Full Text
- View/download PDF
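One way to picture a morpheme feature of the kind described above is the share of a domain label that can be covered by known roots and affixes. The sketch below is an assumption-laden illustration, not the paper's feature set: the morpheme list, the coverage measure, and the 0.5 threshold are all hypothetical.

```python
# Hypothetical mini-inventory of English roots and affixes (illustrative only).
MORPHEMES = {"book", "shop", "mail", "net", "web", "news", "online", "er", "ing", "my"}


def morpheme_coverage(label, morphemes=MORPHEMES):
    """Fraction of characters in `label` covered by non-overlapping morphemes."""
    label = label.lower()
    n = len(label)
    if n == 0:
        return 0.0
    best = [0] * (n + 1)  # best[i] = max characters covered within label[:i]
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option: leave character i-1 uncovered
        for m in morphemes:
            k = len(m)
            if k <= i and label[i - k:i] == m:
                best[i] = max(best[i], best[i - k] + k)
    return best[n] / n


def looks_machine_generated(domain, threshold=0.5):
    """Flag a domain whose first label is poorly covered by known morphemes."""
    label = domain.split(".")[0]
    return morpheme_coverage(label) < threshold


print(morpheme_coverage("bookshop"))            # 1.0
print(looks_machine_generated("bookshop.com"))  # False
print(looks_machine_generated("xkqzvtrw.com"))  # True
```

In practice such a score would presumably be one feature among several lexical features fed to a classifier, rather than a stand-alone threshold test.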
25. Graphic Language Model for Agglutinative Languages: Uyghur as Study Case
- Author
-
Wenbin Jiang, Kai Liu, Miliwan Xuehelaiti, and Tuergen Yibulayin
- Subjects
Agglutinative language ,Machine translation ,business.industry ,Computer science ,Isolating language ,Directed graph ,computer.software_genre ,Linguistics ,Graphic language ,Morpheme ,Artificial intelligence ,Language model ,business ,computer ,Sentence ,Natural language processing ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
This paper describes a novel, graphic language modeling strategy for morphologically rich agglutinative languages. Different from the linear structure in n-gram language models, graphic modeling organizes the morphemes in a sentence, including stems and affixes, as a directed graph. The graphic language model is verified in two typical application scenarios, morphological analysis and machine translation. We take Uyghur for example, and experiments show that the graphic language model achieves significant improvement in both morphological analysis and machine translation.
- Published
- 2013
- Full Text
- View/download PDF
26. Tamil English Cross Lingual Information Retrieval
- Author
-
T. Pattabhi R. K. Rao and Sobha Lalitha Devi
- Subjects
Agglutinative language ,Information retrieval ,Computer science ,business.industry ,Bilingual dictionary ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,WordNet ,Isolating language ,computer.software_genre ,Query language ,ComputingMethodologies_ARTIFICIALINTELLIGENCE ,Query expansion ,ComputingMethodologies_PATTERNRECOGNITION ,Noun ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Transliteration ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
This paper describes our work on participation in the FIRE 2010 evaluation campaign in the cross lingual information retrieval track. We describe how cross lingual information retrieval can be effectively performed between a highly agglutinative language, Tamil and English, an isolating language. Agglutination is a morphological process of adding affixes to word base. These affixations can be between noun- noun, adjective-noun, noun-case, etc. This phenomenon of the language has brought serious problems in translation, transliteration and expansion of the query into another language. To overcome these we have used a morphological analyzer which gives the root word or a word base. The word base is used in turn for translation, transliteration and query expansion. The translation of the query is done using bilingual dictionary and transliteration uses statistical method. And query expansion is performed using ontology and WordNet.
- Published
- 2013
- Full Text
- View/download PDF
27. 1. Bound morphology in common: copy or cognate?
- Author
-
Martine Robbeets and Lars Johanson
- Subjects
Communication ,Inheritance (object-oriented programming) ,Morpheme ,business.industry ,Computer science ,Cognate ,Isolating language ,Term (logic) ,Turkic languages ,business ,Linguistics ,Synthetic language ,Key (music) - Abstract
This chapter provides an overview of the key concepts discussed in this volume. Two or more languages have bound morphology in common when their affixes share certain properties, either globally, including form and function, or selectively, restricted to certain structural material, semantic, combinational or frequential properties only. The term "cognate" in the title of this chapter refers to a morpheme which is related to a morpheme in another language by virtue of inheritance from a common ancestral morpheme, whereas a "copy" is a so-called "borrowed" morpheme. When shared bound morphemes are restricted to shared roots only, this is an indication that they are copied rather than inherited. Their explanation focuses on semantic rather than on formal factors. The major objections raised against the relatedness of these languages are that they do not have enough bound morphology in common and that all similarities can be attributed to code-copying. Keywords: borrowed morpheme; bound morphology; code-copying; languages; morpheme
- Published
- 2012
- Full Text
- View/download PDF
28. Symmetry and Asymmetry Chinese Writing in Japan: The Case of Kojiki (712)
- Author
-
Aldo Tollini
- Subjects
Agglutinative language ,Literature ,Kanji ,Point (typography) ,business.industry ,media_common.quotation_subject ,Japanese writing ,Isolating language ,Chinese characters ,Asymmetry ,Subject (grammar) ,Oral tradition ,business ,Psychology ,symmetry ,media_common - Abstract
In ancient Japan both the logographic and the phonographic strategies were present and often they were used together in various ways. This chapter presents the interaction between written Japanese and Chinese from the lexical point of view, highlighting the difficulties and peculiarities born from this asymmetrical symmetry. The key point is the employment in Japan of Chinese logographic characters (symmetry), used in an isolating language such as Chinese, for an agglutinative language such as Japanese (asymmetry). In order to discuss the matter concretely, the chapter illustrates the example of Kojiki that has two important characteristics that will help to illustrate the above subject: it is the first attempt of extended writing in Japan, and as declared in the preface in Chinese language, Kojiki is the written version of an oral tradition regarding ancient events, transmitted from generation to generation. Keywords: asymmetry; Chinese language; Japan; Kojiki; symmetry
- Published
- 2012
- Full Text
- View/download PDF
29. Multilingual Verification of the Annotation Scheme ISO-Space
- Author
-
Alex Chengyu Fang, James Pustejovsky, and Ki Yong Lee
- Subjects
Agglutinative language ,Computer science ,business.industry ,International standard ,Isolating language ,computer.software_genre ,Transparency (linguistic) ,Annotation ,Analytic language ,Artificial intelligence ,Language translation ,business ,computer ,Spatial analysis ,Natural language processing - Abstract
ISO-Space ([1], [2]) is an emerging annotation scheme for spatial information in language. The purpose of this paper is to verify its descriptive adequacy and semantic transparency for multilingual application. As a starting point, the present verification task works on three languages, namely English, Korean and Chinese. These three are chosen, for they are typologically different from one another: English represents an inflectional analytic language, Korean an agglutinative language and Chinese, an isolating language. Such multilingual verification is required to justify ISO-Space as an international standard for its applicability to various languages other than English.
- Published
- 2011
- Full Text
- View/download PDF
30. Morpheme Structure Constraints
- Author
-
Geert Booij
- Subjects
Accidental gap ,business.industry ,Phonology ,Isolating language ,Nasal consonant ,computer.software_genre ,Linguistics ,Sandhi ,Morpheme ,Alternation (linguistics) ,Artificial intelligence ,business ,computer ,Natural language processing ,Mathematics - Abstract
Morpheme structure constraints are constraints on the segmental make-up of the morphemes of a language. A textbook example of such a constraint is that bnik is an impossible morpheme of English, whereas blik is a possible English morpheme that happens not to exist. Hence, bnik is a systematic gap in the morpheme inventory of English, whereas blik is an accidental gap in this inventory. This can be taken to imply that there is a morpheme structure constraint that prevents English morphemes from beginning with a /b/ followed by a nasal consonant. Keywords: phonology
- Published
- 2011
- Full Text
- View/download PDF
31. Comparison of morphemic word structure and a cartographic sign
- Author
-
Dragica Živković and Jasmina Jovanovic
- Subjects
Atmospheric Science ,sign ,Language identification ,morfema ,Computer science ,Manually coded language ,Geography, Planning and Development ,lcsh:G1-922 ,computer.software_genre ,Picture language ,Education ,Synthetic language ,affix ,afiks ,word ,Global and Planetary Change ,language ,business.industry ,Geology ,Isolating language ,jezik ,znak ,Linguistics ,Language transfer ,reč ,morpheme ,Artificial intelligence ,business ,computer ,Cartography ,lcsh:Geography (General) ,Natural language processing ,Natural language ,Gesture - Abstract
Language is a system of gestures, sounds, characters, symbols and words used to represent concepts and to communicate. Map language did not derive from natural language but arose in parallel with it, as its graphical equivalent. Both natural and cartographic language are based on a system of signs. In natural language, letters are the smallest units, and arranged meaningfully they constitute a sign: a word, i.e. a concept. In cartographic language one sign is one concept. Common to both languages, however, is the base of the sign, the morpheme, and its additions, the affixes, which in cartographic language have greater possibilities of expression.
- Published
- 2011
32. Exploiting Morphology And Local Word Reordering In English-To-Turkish Phrase-Based Statistical Machine Translation
- Author
-
Ilknur Durgar El-Kahlout and Kemal Oflazer
- Subjects
Agglutinative language ,Acoustics and Ultrasonics ,Machine translation ,Computer science ,business.industry ,Speech recognition ,Isolating language ,Content word ,computer.software_genre ,Word lists by frequency ,Morpheme ,Artificial intelligence ,Electrical and Electronic Engineering ,Language translation ,business ,computer ,Natural language processing ,Word order - Abstract
In this paper, we present the results of our work on the development of a phrase-based statistical machine translation prototype from English to Turkish, an agglutinative language with very productive inflectional and derivational morphology. We experiment with different morpheme-level representations for English-Turkish parallel texts. Additionally, to help with word alignment, we experiment with local word reordering on the English side, to bring the word order of specific English prepositional phrases and auxiliary verb complexes in line with the morpheme order of the corresponding case-marked nouns and complex verbs on the Turkish side. To alleviate the dearth of parallel data, we also augment the training data with sentences containing only content word roots obtained from the original training data, to bias root word alignment, and with highly reliable phrase pairs from an earlier corpus alignment. We use a morpheme-based language model in decoding and a word-based language model in re-ranking the n-best lists generated by the decoder. Lastly, we present a scheme for repairing the decoder output by correcting words which have incorrect morphological structure or which are out-of-vocabulary with respect to the training data and language model, to further improve the translations. We improve from 15.53 BLEU points for our word-based baseline model to 25.17 BLEU points, an improvement of 9.64 points or about 62% relative.
- Published
- 2010
33. Building a Large Syntactically-Annotated Corpus of Vietnamese
- Author
-
Xuan Luong Vu, Thi Minh Huyen Nguyen, Van Hiep Nguyen, Phuong-Thai Nguyen, and Hong Phuong Le. Affiliations: Faculté de Mathématiques, Mécanique et Informatique (MIM), Vietnam National University [Hanoï] (VNU); Vietlex, Vietnam Lexicography Centre; Knowledge Information and Web Intelligence (KIWI), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria), Université Henri Poincaré - Nancy 1 (UHP), Université Nancy 2, Institut National Polytechnique de Lorraine (INPL), Centre National de la Recherche Scientifique (CNRS)
- Subjects
Computer science ,Vietnamese ,media_common.quotation_subject ,Treebank ,[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing ,02 engineering and technology ,computer.software_genre ,Newspaper ,Annotation ,Resource (project management) ,syntactic ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,media_common ,business.industry ,Isolating language ,language.human_language ,Linguistics ,Agreement ,[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,annotation ,treebank ,Delimiter ,language ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
Held in conjunction with ACL-IJCNLP 2009; International audience; A treebank is an important resource for both research on and applications of natural language processing. For Vietnamese, such corpora are still lacking. This paper presents up-to-date results of a project for Vietnamese treebank construction. Since Vietnamese is an isolating language and has no word delimiter, there are many ambiguities in sentence analysis. We systematically applied a range of linguistic techniques to handle such ambiguities. Annotators are supported by automatic labeling tools and a tree-editor tool. Raw texts are extracted from Tuoi Tre (Youth), an online Vietnamese daily newspaper. The current annotation agreement is around 90 percent.
- Published
- 2009
34. Simple Morpheme Labelling in Unsupervised Morpheme Analysis
- Author
-
Delphine Bernhard (Ubiquitous Knowledge Processing (UKP), Technische Universität Darmstadt (TU Darmstadt))
- Subjects
Text corpus ,business.industry ,Computer science ,Speech recognition ,Text segmentation ,Search engine indexing ,[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing ,02 engineering and technology ,Isolating language ,computer.software_genre ,Weighting ,[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Morpheme ,0202 electrical engineering, electronic engineering, information engineering ,Unsupervised learning ,020201 artificial intelligence & image processing ,Artificial intelligence ,0305 other medical science ,business ,computer ,Natural language ,Natural language processing ,ComputingMilieux_MISCELLANEOUS - Abstract
This paper describes a system for unsupervised morpheme analysis and the results it obtained at Morpho Challenge 2007. The system takes a plain list of words as input and returns a list of labelled morphemic segments for each word. Morphemic segments are obtained by an unsupervised learning process which can directly be applied to different natural languages. Results obtained at competition 1 (evaluation of the morpheme analyses) are better in English, Finnish and German than in Turkish. For information retrieval (competition 2), the best results are obtained when indexing is performed using Okapi (BM25) weighting for all morphemes minus those belonging to an automatic stop list made of the most common morphemes.
- Published
- 2007
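The retrieval setting mentioned in the abstract above (Okapi BM25 weighting over morphemes, minus a stop list of very common morphemes) can be sketched roughly as below. This is an illustration under stated assumptions, not the system's actual configuration: the parameter values `k1=1.2` and `b=0.75`, the particular IDF variant, the hand-written stop list, and the toy segmented documents are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, stop_morphemes=frozenset(), k1=1.2, b=0.75):
    """Score each document (a list of morphemes) against a morpheme query."""
    # Drop stop morphemes before indexing, as the abstract describes.
    docs = [[m for m in d if m not in stop_morphemes] for d in docs]
    n_docs = len(docs)
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: in how many documents each morpheme occurs.
    df = Counter(m for d in docs for m in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for m in query:
            if m in stop_morphemes or m not in tf:
                continue
            idf = math.log((n_docs - df[m] + 0.5) / (df[m] + 0.5) + 1.0)
            norm = tf[m] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[m] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical morpheme-segmented documents and stop list.
docs = [["walk", "ing", "dog", "s"], ["cat", "s", "sleep", "ing"], ["dog", "bark", "ed"]]
scores = bm25_scores(["dog", "s"], docs, stop_morphemes=frozenset({"s", "ing"}))
```

In the described system the stop list is built automatically from the most common morphemes; here it is passed in by hand to keep the sketch short.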
35. Part-of-Speech Tagging Using Word Probability Based on Category Patterns
- Author
-
Mi-young Kang, Kyung-Soon Park, Sung-Won Jung, and Hyuk-Chul Kwon
- Subjects
Agglutinative language ,Computer science ,Turkish ,business.industry ,Part-of-speech tagging ,Speech recognition ,Isolating language ,computer.software_genre ,language.human_language ,Morpheme ,language ,Artificial intelligence ,business ,computer ,Word (computer architecture) ,Natural language processing - Abstract
This paper focuses on part-of-speech (POS, category) tagging based on word probability estimated using morpheme unigrams and category patterns within a word. The word-N-gram-based POS-tagging model is difficult to adapt to agglutinative languages such as Korean, Turkish and Hungarian, among others, due to the high productivity of words. Thus, many of the stochastic studies on Korean POS-tagging have been conducted based on morpheme N-grams. However, the morpheme-N-gram model also has difficulty coping with data sparseness when augmenting contextual information in order to assure sufficient performance. In addition, the model has difficulty conceiving the relationship of morphemes within a word. The present POS-tagging algorithm (a) resolves the data-sparseness problem thanks to a morpheme-unigram-based approach and (b) involves the relationship of morphemes within a word by estimating the weight of the category of a morpheme in a category pattern constituting a word. With the proposed model, a performance similar to that with other models that use more than just the morpheme-unigram model was observed.
- Published
- 2007
- Full Text
- View/download PDF
36. Elements of the grammar of space in Ewe
- Author
-
Felix K. Ameka and James Essegbey
- Subjects
Reduplication ,Communication ,Grammar ,Computer science ,business.industry ,media_common.quotation_subject ,Lexicalization ,Isolating language ,Linguistics ,Noun phrase ,Tone language ,business ,Cognitive linguistics ,media_common ,Word order - Published
- 2006
- Full Text
- View/download PDF
37. Morpheme-Based Language Modeling for Arabic Lvcsr
- Author
-
Geoffrey Zweig, Stanley F. Chen, Daniel Povey, and G. Choueiter
- Subjects
Vocabulary ,Arabic ,business.industry ,Computer science ,Speech recognition ,media_common.quotation_subject ,Speech coding ,Word error rate ,Isolating language ,computer.software_genre ,language.human_language ,Morpheme ,language ,Artificial intelligence ,Language model ,business ,computer ,Natural language ,Word (computer architecture) ,Natural language processing ,media_common - Abstract
In this paper, we concentrate on Arabic speech recognition. Taking advantage of the rich morphological structure of the language, we use morpheme-based language modeling to improve the word error rate. We propose a simple constraining method to rid the decoding output of illegal morpheme sequences. We report the results obtained for word and morpheme language models using medium (
- Published
- 2006
- Full Text
- View/download PDF
38. From Phoneme to Morpheme: Another Verification Using a Corpus
- Author
-
Kumiko Tanaka-Ishii and Zhihui Jin
- Subjects
Morpheme ,Computer science ,business.industry ,Text segmentation ,Phonetics ,Isolating language ,Artificial intelligence ,computer.software_genre ,business ,computer ,Natural language processing ,Word (computer architecture) - Abstract
We scientifically test Harris's hypothesis that morpheme/word boundaries can be detected from changes in the complexity of phoneme sequences. We re-formulate his hypothesis from a more information-theoretic viewpoint and use a corpus to test whether the hypothesis holds. We found that his hypothesis holds for morphemes, with an F-score of about 80%, in both English and Chinese. However, we obtained contrary results for English and Chinese with regard to word boundaries; this reflects a difference in the nature of the two languages.
- Published
- 2006
- Full Text
- View/download PDF
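One information-theoretic reading of the hypothesis in the abstract above is that the entropy of the next symbol rises at a boundary. The sketch below illustrates that idea with fixed-length contexts over a toy corpus of letter strings; the function names, the fixed context length `order`, and the use of letters instead of phonemes are assumptions, and the paper's actual formulation may differ.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(corpus, order):
    """Map each context of length `order` to the entropy (bits) of the next symbol."""
    followers = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq) - order):
            followers[seq[i:i + order]][seq[i + order]] += 1
    entropy = {}
    for ctx, counts in followers.items():
        total = sum(counts.values())
        entropy[ctx] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy


def rising_points(seq, entropy, order):
    """Positions i where the entropy after seq[:i] exceeds the entropy after seq[:i-1]."""
    h = [entropy.get(seq[i - order:i], 0.0) for i in range(order, len(seq))]
    return [order + j for j in range(1, len(h)) if h[j] > h[j - 1]]


# Toy corpus: what follows "the" is unpredictable, everything else is predictable.
toy_corpus = ["thedog", "thecat", "thepig", "adog", "acat"]
H = successor_entropy(toy_corpus, order=2)
cuts = rising_points("thedog", H, order=2)  # candidate boundary positions
```

On this toy data the only rise in entropy falls between "the" and "dog"; on real corpora a threshold on the size of the rise would normally be needed.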
39. Pseudo Context-Sensitive Models for Parsing Isolating Languages: Classical Chinese — A Case Study
- Author
-
Zhenyu Wu, Yinan Peng, Hui Liu, Zhihao Yuan, Liang Huang, and Huan Wang
- Subjects
Parsing ,Grammar ,business.industry ,Computer science ,media_common.quotation_subject ,Isolating language ,Context-free grammar ,computer.software_genre ,Top-down parsing ,TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES ,Terminal and nonterminal symbols ,S-attributed grammar ,Statistical parsing ,Artificial intelligence ,business ,computer ,Context-sensitive language ,Natural language processing ,Bottom-up parsing ,media_common - Abstract
In this paper, we compare the performance of three probabilistic pseudo context-sensitive models on parsing isolating languages. These models are all based on the conventional probabilistic context-free grammar (PCFG). The first one is well known for statistical parsing of English, while the other two are novel models conditioning on the siblings of an expanding nonterminal. We test these models on Classical Chinese, a typical isolating language. Surprisingly, with only a little more conditioning, the new models significantly outperform the first model. Our work thus shows the impact of typological distinction on parsing and provides two simple yet effective conditioning models for isolating languages.
- Published
- 2003
- Full Text
- View/download PDF
40. Cross-Morphemic Predictability and the Lexical Access of Compounds in Mandarin Chinese
- Author
-
Shuping Gong and James Myers
- Subjects
Linguistics and Language ,Phrase ,Mental lexicon ,business.industry ,Computer science ,Isolating language ,computer.software_genre ,Mandarin Chinese ,Language and Linguistics ,Linguistics ,language.human_language ,Word lists by frequency ,Morpheme ,Lexical decision task ,language ,Artificial intelligence ,business ,computer ,Natural language processing ,Word (computer architecture) - Abstract
Chinese poses a challenge for models of compound processing, since the basic notions of morpheme, word and phrase are not consistently distinguished by native speakers. It is thus proposed that the mental lexicon consists of linked and overlapping listemes (in the sense of Di Sciullo and Williams 1987) which can potentially be of any size (morpheme, word, phrase). One implication of this approach for compound processing is that cross-morphemic predictability should play an important role: the more predictable one morpheme is from the other, the easier the compound should be to access. To study this implication, cross-morphemic predictability is quantified using the measure of Mutual Information from information theory, which divides the frequency of the constituent of interest (i.e. a compound) by the frequencies of the components (i.e. morphemes). This leads to two specific predictions: compound frequency should have a positive effect on lexical access, but morpheme frequency should have a negative effect. In Experiment 1, it is demonstrated that a very simple connectionist network, built according to the overlapping listeme approach, conforms to these two predictions. This suggests that separate effects of word and morpheme frequency need not require separate processing levels for words and morphemes. Experiment 2 then compares the network's behavior with Chinese native speakers in a lexical decision task involving spoken Mandarin Chinese compounds. As predicted, there was a positive effect of word frequency on response speed and accuracy, and a negative effect of morpheme frequency. Suggestions are made for reconciling these results with the more familiar positive or neutral morpheme frequency effects found in other studies.
- Published
- 2002
- Full Text
- View/download PDF
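The mutual-information measure described in the abstract above can be illustrated with the standard pointwise form, log2 of P(compound) over P(first morpheme) times P(second morpheme). This is a minimal sketch: the function name, the shared normalizing total, and all counts are hypothetical and not taken from the study.

```python
import math


def pointwise_mi(compound_count, first_count, second_count, total):
    """Pointwise mutual information (bits) between the two morphemes of a compound."""
    p_compound = compound_count / total
    p_first = first_count / total
    p_second = second_count / total
    return math.log2(p_compound / (p_first * p_second))


# Hypothetical counts out of 10,000 tokens.
baseline = pointwise_mi(10, 100, 100, 10_000)
frequent_compound = pointwise_mi(40, 100, 100, 10_000)   # compound is more frequent
frequent_morpheme = pointwise_mi(10, 1_000, 100, 10_000)  # first morpheme is more frequent
```

Because compound frequency sits in the numerator and morpheme frequencies in the denominator, raising the former increases predictability and raising the latter decreases it, which matches the two predictions stated in the abstract.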
41. Morpheme Based Language Models for Speech Recognition of Czech
- Author
-
Josef Psutka, William Byrne, Pavel Ircing, Pavel Krbec, and Jan Hajič
- Subjects
Czech ,Vocabulary ,Computer science ,business.industry ,media_common.quotation_subject ,Agglutination ,Speech recognition ,Isolating language ,computer.software_genre ,language.human_language ,Synthetic language ,Constructed language ,Analytic language ,Morpheme ,language ,Language modelling ,Language model ,Artificial intelligence ,Slavic languages ,business ,computer ,Natural language processing ,media_common - Abstract
In this paper we propose a new technique for language modelling of highly inflectional languages such as Czech, Russian and other Slavic languages. Our aim is to alleviate the main problem encountered in these languages, which is the enormous vocabulary growth caused by the great number of different word forms derived from one word (lemma). We reduced the size of the vocabulary by decomposing words into stems and endings and storing these sub-word units (morphemes) in the vocabulary separately. We then trained a morpheme-based language model on the decomposed corpus. This paper reports perplexities, OOV rates and some speech recognition results obtained with the new language model.
- Published
- 2000
- Full Text
- View/download PDF
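The vocabulary reduction described in the abstract above (splitting words into stems and endings and storing the units separately) can be sketched as follows. The ending list, the longest-match rule, and the eight example word forms are assumptions chosen for illustration; the paper's actual decomposition procedure is not specified here.

```python
def split_word(word, endings):
    """Split `word` into (stem, ending) using the longest matching ending, or (word, '')."""
    for ending in sorted(endings, key=len, reverse=True):
        if ending and word.endswith(ending) and len(word) > len(ending):
            return word[:-len(ending)], ending
    return word, ""


# Hypothetical ending inventory and a few Czech noun forms.
ENDINGS = {"a", "y", "u", "ou"}
forms = ["žena", "ženy", "ženu", "ženou", "ryba", "ryby", "rybu", "rybou"]

word_vocab = set(forms)
unit_vocab = set()
for form in forms:
    stem, ending = split_word(form, ENDINGS)
    unit_vocab.add(stem)
    if ending:
        unit_vocab.add(ending)
```

Eight word forms collapse into two stems plus four endings, which shows on a very small scale why a morpheme vocabulary grows more slowly than a word-form vocabulary.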
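42. Błędy programu do obróbki korpusu, podczas badań korpusowych słownictwa biznesowego i prawnego w języku wietnamskim, na przykładzie programu AntConc [Errors of corpus-processing software in corpus research on Vietnamese business and legal vocabulary: the example of AntConc]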
- Author
-
Jakub Królczyk
- Subjects
Space (punctuation) ,Text corpus ,Computer science ,business.industry ,Vietnamese ,Isolating language ,computer.software_genre ,language.human_language ,Field (computer science) ,Software ,Corpus linguistics ,language ,Artificial intelligence ,business ,computer ,TypeScript ,Natural language processing - Abstract
On the one hand, corpus research and corpus linguistics are relatively new fields of science; on the other hand, according to some, they are among the fastest developing methods of linguistic research. To perform corpus research, it is necessary to have a text corpus and suitable software. The range of software is wide, and it is easy to find free or license-based tools. Nevertheless, whatever the choice, problems may arise or the software may prove inefficient. The low efficiency of AntConc can be seen when researching a corpus compiled from an isolating language. After processing a corpus of 18 texts in Vietnamese (290 pages of typescript) from the fields of management and law, the software produced incorrect results, from the count of words in the corpus through to the concordance plots. There are two ways to deal with this problem. The first method involves “teaching” AntConc how to read Vietnamese, in other words supplying it with a list of all words in the Vietnamese language. The second method is more time consuming because it involves replacing the spaces between syllables with a sign that the software will not recognize as a space. Using either of these methods could potentially raise AntConc's efficiency.
- Published
- 2014
- Full Text
- View/download PDF
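The second workaround described in the abstract above, replacing the spaces inside multi-syllable words with a character the tool does not treat as a space, could be automated along these lines. This is a sketch under assumptions: the greedy longest-match strategy, the three-entry lexicon, and the underscore joiner are illustrative choices, and whether a given concordancer treats the underscore as part of a token depends on its own token settings.

```python
def join_syllables(text, lexicon, max_syllables=4, joiner="_"):
    """Greedy longest match: merge syllable runs found in `lexicon` into single tokens."""
    syllables = text.split()
    out, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable always succeeds.
        for n in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                out.append(candidate.replace(" ", joiner))
                i += n
                break
    return " ".join(out)


# Hypothetical mini-lexicon of multi-syllable Vietnamese words.
LEXICON = {"quản lý", "pháp luật", "doanh nghiệp"}
joined = join_syllables("quản lý doanh nghiệp và pháp luật", LEXICON)
```

After this preprocessing a whitespace-based word count sees four tokens instead of seven syllables, which is the kind of correction the abstract is after.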
43. Automatic dictionary organization in NLP systems for Oriental languages
- Author
-
Yu Tovmach, E. Tioun, V. Andrezen, V. Shumovsky, R. Piotrowski, W. Kwitakowski, L. Kogan, and R. Minvaleev
- Subjects
Scheme (programming language) ,Computer science ,business.industry ,Agglutination ,Isolating language ,Romance languages ,computer.software_genre ,Lexicon ,Variety (linguistics) ,Semantic data model ,Syntax ,Text processing ,Artificial intelligence ,Source text ,business ,computer ,Natural language processing ,computer.programming_language - Abstract
This paper presents a description of automatic dictionaries (ADs) and dictionary entry (DE) schemes for NLP systems dealing with Oriental languages. The uniformity of the AD organization and of the DE pattern does not prevent the system from taking into account the structural differences of isolating (analytical), agglutinating and internal-flection languages. The "Speech Statistics" (SpSt) project team has been designing a linguistic automaton aimed at NL processing in a variety of forms. In addition to Germanic and Romance languages the system under development is to handle text processing of a number of Oriental languages. The strategy adopted by the SpSt group is characterized by a lexicalized approach: the NLP algorithms for any language are entirely AD dependent, i.e., a large lexicon database has been provided, its entries being loaded with information including not only lexical, but also morphological, syntactic and semantic data. This information concentrated in dictionary entries (DEs) is essential for both source text analysis and target (Russian) text generation. The DE structure is largely determined by the typological features of the source language. The SpSt group has hitherto had to deal with European languages and it was for these languages (inflective and inflective-analytical) that the prototype entry schemes were elaborated and adopted. No doubt, the typological characteristics of Oriental languages required certain modifications to be made to the basic scheme. Hence in the present paper each of the language types is given consideration. Agglutinating languages proved to be the most suitable to process according to the SpSt strategy. But an isolating language will be the first to be proposed for discussion.
- Published
- 1992
- Full Text
- View/download PDF
44. Comparison of performance of enhanced morpheme-based language model with different word-based language models for improving the performance of Tamil speech recognition system
- Author
-
T. V. Geetha and S. Saraswathi
- Subjects
Vocabulary ,General Computer Science ,Language identification ,Computer science ,business.industry ,Speech recognition ,media_common.quotation_subject ,Isolating language ,computer.software_genre ,language.human_language ,Morpheme ,Tamil ,Cache language model ,language ,Trigram ,Artificial intelligence ,Language model ,business ,computer ,Natural language processing ,media_common - Abstract
This paper describes a new technique of language modeling for a highly inflectional Dravidian language, Tamil. It aims to alleviate the main problems encountered in processing Tamil, such as the enormous vocabulary growth caused by the large number of different forms derived from one word. The size of the vocabulary was reduced by decomposing the words into stems and endings and storing these sub-word units (morphemes) in the vocabulary separately. An enhanced morpheme-based language model was designed for the inflectional language Tamil and trained on the decomposed corpus. The perplexity and Word Error Rate (WER) were obtained to check the efficiency of the model for a Tamil speech recognition system. The results were compared with word-based bigram and trigram language models, a distance-based language model, a dependency-based language model and a class-based language model. The results showed that the enhanced morpheme-based trigram model with Katz back-off smoothing improved the performance of the Tamil speech recognition system compared to the word-based language models.
- Published
- 2007
- Full Text
- View/download PDF
45. The neogrammarian model
- Author
-
Theodora Bynon
- Subjects
Literature ,Neogrammarian ,business.industry ,Umlaut ,media_common.quotation_subject ,Phonological change ,Phonology ,Isolating language ,Art ,Linguistics ,Semantic change ,Etymology ,Historical linguistics ,business ,media_common - Published
- 1977
- Full Text
- View/download PDF
46. Linguistic problems in multilingual morphological decomposition
- Author
-
G. Thurmair
- Subjects
Parsing ,Grammar ,Computer science ,business.industry ,media_common.quotation_subject ,Isolating language ,Term (logic) ,computer.software_genre ,Linguistics ,Morpheme ,Selection (linguistics) ,Decomposition (computer science) ,Artificial intelligence ,Allomorph ,business ,computer ,Natural language processing ,media_common - Abstract
An algorithm for the morphological decomposition of words into morphemes is presented. The application area is information retrieval, and the purpose is to find morphologically related terms to a given search term. First, the parsing framework is presented, then several linguistic decisions are discussed: morpheme selection and segmentation, morpheme classes, morpheme grammar, allomorph handling, etc. Since the system works in several languages, language-specific phenomena are mentioned.
- Published
- 1984
- Full Text
- View/download PDF
47. Morpheme Boundaries within Words: Report on a Computer Test
- Author
-
Zellig S. Harris
- Subjects
Sequence ,Syntax (programming languages) ,business.industry ,Computer science ,Isolating language ,computer.software_genre ,Linguistics ,Test (assessment) ,Behavioral test ,Morpheme ,Artificial intelligence ,business ,computer ,Sentence ,Natural language processing ,Consonant cluster - Abstract
For the science of linguistics we seek objective and formally describable operations with which to analyze language. The phonemes of a language can be determined by means of an explicit behavioral test (the pair test, involving two speakers of the language) and distributional simplifications, i.e. the defining of symbols which express the way in which the outcomes of that test occur in respect to each other in sentences of the language. The syntax, and most of the morphology, of a language is discovered by seeing how the morphemes occur in respect to each other in sentences. As a bridge between these two sets of methods we need a test for determining what are the morphemes of a language, or at least a test that would tentatively segment a phonemic sequence (as a sentence) into morphemes, leaving it for a distributional criterion to decide which of these tentative segments are to be accepted as morphemes.
- Published
- 1970
- Full Text
- View/download PDF
48. From Phoneme to Morpheme
- Author
-
Zellig S. Harris
- Subjects
Computer science ,business.industry ,Score ,Isolating language ,computer.software_genre ,Morpheme ,Segmentation ,Consonant vowel ,Alternation (linguistics) ,Artificial intelligence ,business ,computer ,Utterance ,Natural language processing - Abstract
The following investigation presents a constructional procedure segmenting an utterance in a way which correlates well with word and morpheme boundaries. The procedure requires a large set of utterances, elicited in a certain manner from an informant (or found in a very large corpus); and it requires that all the utterances be written in the same phonemic representation, determined without reference to morphemes. It then investigates a particular distributional relation among the phonemes in the utterances thus collected; and on the basis of this relation among the phonemes, it indicates particular points of segmentation within one utterance at a time. For example, in the utterance /hiyzkwikər/ He’s quicker it will indicate segmentation at the points marked by dots: /hiy.z.kwik.ər/; and it will do so purely by comparing this phonemic sequence with the phonemic sequences of other utterances.
- Published
- 1970
- Full Text
- View/download PDF
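The abstract of entry 48 does not name the distributional relation it relies on. The sketch below assumes it is successor variety, the measure usually associated with this paper: for each prefix of an utterance, count how many distinct symbols follow that prefix anywhere in the corpus, and cut where the count peaks. The toy corpus uses ordinary spelling rather than a phonemic transcription, and the names `successor_counts`, `segment`, and `END` are illustrative, not taken from the source.

```python
from collections import defaultdict

END = "#"  # hypothetical end-of-utterance marker, not from the source


def successor_counts(utterance, corpus):
    """For each prefix of `utterance`, count the distinct symbols that
    follow that same prefix somewhere in `corpus`."""
    followers = defaultdict(set)
    for u in corpus:
        for i in range(1, len(u) + 1):
            nxt = u[i] if i < len(u) else END
            followers[u[:i]].add(nxt)
    return [len(followers[utterance[:i]]) for i in range(1, len(utterance) + 1)]


def segment(utterance, corpus):
    """Cut after every strict local peak in the successor counts.
    The peak rule is an assumption made for this sketch."""
    counts = successor_counts(utterance, corpus)
    cuts = [
        i + 1
        for i in range(1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
    ]
    pieces, start = [], 0
    for c in cuts:
        pieces.append(utterance[start:c])
        start = c
    pieces.append(utterance[start:])
    return pieces


# Toy corpus in ordinary spelling (an assumption; the paper works on phonemes).
corpus = ["walk", "walks", "walked", "walking", "talk", "talks", "talked", "talking"]
counts = successor_counts("walked", corpus)
pieces = segment("walked", corpus)
```

In this toy corpus the prefix "walk" can be followed by four different continuations while every other prefix of "walked" has only one, so the single cut falls between "walk" and "ed".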
49. [Untitled]
- Subjects
Vocabulary ,Multidisciplinary ,business.industry ,media_common.quotation_subject ,Morphophonology ,Isolating language ,Biology ,computer.software_genre ,Psycholinguistics ,language.human_language ,German ,Analytic language ,Morpheme ,Mental representation ,language ,Artificial intelligence ,business ,computer ,Natural language processing ,media_common - Abstract
Morphemes are the smallest meaningful parts of words and therefore represent a natural unit to study the evolution of words. To analyze the influence of language change on morphemes, we performed a large-scale analysis of German and English vocabulary covering the last 200 years. Using a network approach from bioinformatics, we examined the historical dynamics of morphemes, the fixation of new morphemes and the emergence of words containing existing morphemes. We found that these processes are driven mainly by the number of different direct neighbors of a morpheme in words (connectivity, an equivalent to family size or type frequency) and not its frequency of usage (equivalent to token frequency). This contrasts with words, whose survival is determined by their frequency of usage. We therefore identified features of morphemes which are not dictated by the statistical properties of words. As morphemes are also relevant for the mental representation of words, this result might enable establishing a link between an individual’s perception of language and historical language change.
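The distinction entry 49 draws between connectivity and frequency of usage can be illustrated with a small sketch. It assumes words are already segmented into morphemes and that left and right neighbors are counted alike; the segmentations, the usage counts, and the function names are invented for illustration and do not come from the study.

```python
from collections import Counter, defaultdict


def connectivity(segmented_words):
    """Number of different direct neighbors each morpheme has inside words."""
    neighbors = defaultdict(set)
    for morphemes in segmented_words:
        for left, right in zip(morphemes, morphemes[1:]):
            neighbors[left].add(right)
            neighbors[right].add(left)
    return {m: len(n) for m, n in neighbors.items()}


def usage_frequency(segmented_words, word_counts):
    """Token frequency of each morpheme: summed usage counts of its words."""
    freq = Counter()
    for morphemes, count in zip(segmented_words, word_counts):
        for m in morphemes:
            freq[m] += count
    return freq


# Hypothetical segmentations and usage counts.
words = [("un", "kind"), ("kind", "ness"), ("kind", "ly"), ("un", "do")]
counts = [5, 20, 10, 50]

conn = connectivity(words)
freq = usage_frequency(words, counts)
```

With these invented numbers, "kind" has more distinct neighbors than "un" (3 versus 2) even though "un" is used more often (55 versus 35), which is the kind of divergence between the two measures that the abstract describes.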
50. [Untitled]
- Subjects
Vocabulary ,Multidisciplinary ,business.industry ,Language change ,media_common.quotation_subject ,Isolating language ,Word formation ,computer.software_genre ,Psycholinguistics ,Morpheme ,Probability distribution ,Artificial intelligence ,business ,computer ,Natural language processing ,Word (computer architecture) ,Mathematics ,media_common - Abstract
Words are built from smaller meaning-bearing parts, called morphemes. Just as one word can contain multiple morphemes, one morpheme can be present in different words. The number of distinct words a morpheme can be found in is its family size. Here we used Birth-Death-Innovation Models (BDIMs) to analyze the distribution of morpheme family sizes in English and German vocabulary over the last 200 years. Rather than just fitting to a probability distribution, these mechanistic models allow for the direct interpretation of identified parameters. Despite the complexity of language change, we indeed found that a specific variant of this purely stochastic model, the second order linear balanced BDIM, significantly fitted the observed distributions. In this model, birth and death rates are increased for smaller morpheme families. This finding indicates an influence of morpheme family sizes on vocabulary changes. This could be an effect of word formation, perception or both. On a more general level, we give an example of how mechanistic models can enable the identification of statistical trends in language change usually hidden by cultural influences.
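The family-size definition in entry 50 can be made concrete with a short sketch. It only computes family sizes and their empirical distribution from a hypothetical, already segmented vocabulary; it does not fit a BDIM, and the data and function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def family_sizes(segmented):
    """Family size of a morpheme: the number of distinct words containing it."""
    families = defaultdict(set)
    for word, morphemes in segmented.items():
        for m in morphemes:
            families[m].add(word)
    return {m: len(ws) for m, ws in families.items()}


def size_distribution(sizes):
    """How many morphemes have each family size."""
    return Counter(sizes.values())


# Hypothetical vocabulary with hand-made segmentations.
segmented = {
    "unkind": ["un", "kind"],
    "kindness": ["kind", "ness"],
    "kindly": ["kind", "ly"],
    "undo": ["un", "do"],
}

sizes = family_sizes(segmented)
dist = size_distribution(sizes)
```

The resulting counts per family size are the kind of observed distribution against which a model such as a BDIM would be compared.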
Discovery Service for Jio Institute Digital Library