845 results on '"lexical database"'
Search Results
2. FLexSign: A lexical database in French Sign Language (LSF).
- Author
-
Périn, Philomène, Herrera, Santiago, and Bogliotti, Caroline
- Subjects
- *SIGN language, *DATABASES, *FRENCH language, *INFORMATION storage & retrieval systems, *RESEARCH personnel
- Abstract
In psycholinguistics, studies are conducted to understand language processing mechanisms, whether in comprehension or in production, and independently of the language modality. To do so, researchers need accurate psycholinguistic information about the linguistic material they use. One main obstacle to this process is the lack of information available in sign language. While some lexical databases exist in multiple sign languages, to the best of our knowledge, psycholinguistic data for French Sign Language (LSF) signs are not yet available. The present study presents FLexSign, the first interactive lexical database for LSF, inspired by ASL-Lex (Caselli et al., 2017). The database includes familiarity, concreteness, and iconicity data for 546 signs of LSF. These three factors are known to influence the speed or the accuracy of lexical processing. Familiarity and concreteness are known to generate a robust facilitative effect on sign processing, while iconicity plays a complex but crucial role in the creation and organization of sign language lexicons. Therefore, having accurate information on the iconicity of LSF signs would help to better understand the role of this notion in lexical processing. To develop the database, 33 participants were recruited and asked to complete an online questionnaire. The FLexSign database will be of great use to sign language researchers, providing linguistic information that was previously unavailable and offering many opportunities at both the experimental and clinical levels. The database is also open to future contributions. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
3. The Children and Young People's Books Lexicon (CYP-LEX): A large-scale lexical database of books read by children and young people in the United Kingdom.
- Author
-
Korochkina, Maria, Marelli, Marco, Brysbaert, Marc, and Rastle, Kathleen
- Subjects
- *YOUNG adults, *CHILDREN'S books, *LANGUAGE acquisition, *WORD frequency, *DATABASES
- Abstract
This article introduces the Children and Young People's Books-Lexicon (CYP-LEX), a large-scale lexical database derived from books popular with children and young people in the United Kingdom. CYP-LEX includes 1,200 books evenly distributed across three age bands (7–9, 10–12, 13+) and comprises over 70 million tokens and over 105,000 types. For each word in each age band, we provide its raw and Zipf-transformed frequencies, all parts-of-speech in which it occurs with raw frequency and lemma for each occurrence, and measures of count-based contextual diversity. Together and individually, the three CYP-LEX age bands contain substantially more words than any other publicly available database of books for primary and secondary school children. Most of these words are very low in frequency, and a substantial proportion of the words in each age band do not occur on British television. Although the three age bands share some very frequent words, they differ substantially regarding words that occur less frequently, and this pattern also holds at the level of individual books. Initial analyses of CYP-LEX illustrate why independent reading constitutes a challenge for children and young people, and they also underscore the importance of reading widely for the development of reading expertise. Overall, CYP-LEX provides unprecedented information into the nature of vocabulary in books that British children aged 7+ read, and is a highly valuable resource for those studying reading and language development. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
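The Zipf-transformed frequencies mentioned in the CYP-LEX abstract can be illustrated with a minimal sketch. It assumes the commonly used Zipf scale, log10(occurrences per million tokens) + 3. The abstract does not state the exact formula or any smoothing that CYP-LEX applies, and the function name `zipf` and the counts below are hypothetical.

```python
import math


def zipf(count: int, corpus_tokens: int) -> float:
    """Return log10(occurrences per million tokens) + 3 (assumed Zipf scale)."""
    if count <= 0 or corpus_tokens <= 0:
        raise ValueError("count and corpus_tokens must be positive")
    per_million = count * 1_000_000 / corpus_tokens
    return math.log10(per_million) + 3.0


# Hypothetical counts in a 70-million-token corpus (illustrative only).
print(zipf(70, 70_000_000))     # a word seen once per million tokens
print(zipf(7_000, 70_000_000))  # a word seen one hundred times per million tokens
```

On this scale a word occurring once per million tokens scores 3, and each tenfold increase in frequency adds 1.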
4. JALEX: Japanese version of lexical decision database
- Author
-
Naoto Ota and Masaya Mochizuki
- Subjects
lexical decision task, lexical processing, lexical database, mental lexicon, visual word recognition, Language and Literature
- Published
- 2025
- Full Text
- View/download PDF
5. A large-scale database of Chinese characters and words collected from elementary school textbooks.
- Author
-
Zhang, Man, Liu, Zeping, Botezatu, Mona Roxana, Dang, Qinpu, Yuan, Qiming, Han, Jinzhuo, Liu, Li, and Guo, Taomei
- Abstract
Lexical databases are essential tools for studies on language processing and acquisition. Most previous Chinese lexical databases have focused on materials for adults, yet little is known about reading materials for children and how lexical properties from these materials affect children's reading comprehension. In the present study, we provided the first large database of 2999 Chinese characters and 2182 words collected from the official textbooks recently issued by the Ministry of Education (MOE) of the People's Republic of China for most elementary schools in Mainland China, as well as norms from both school-aged children and adults. The database incorporates key orthographic, phonological, and semantic factors from these lexical units. A word-naming task was used to investigate the effects of these factors in character and word processing in both adults and children. The results suggest that: (1) as the grade level increases, the visual complexity of those characters and words increases whereas semantic richness and frequency decrease; (2) the effects of lexical predictors on processing both characters and words vary across children and adults; (3) the effect of age of acquisition shows different patterns on character and word-naming performance. The database is available on Open Science Framework (OSF) (https://osf.io/ynk8c/?view%5fonly=5186bd68549340bd923e9b6531d2c820) for future studies on Chinese language development. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. The Children's Picture Books Lexicon (CPB-Lex): A large-scale lexical database from children's picture books.
- Author
-
Green, Clarence, Keogh, Kathleen, Sun, He, and O'Brien, Beth
- Abstract
This article presents cpb-lex, a large-scale database of lexical statistics derived from children's picture books (age range 0–8 years). Such a database is essential for research in psychology, education and computational modelling, where rich details on the vocabulary of early print exposure are required. Cpb-lex was built through an innovative method of computationally extracting lexical information from automatic speech-to-text captions and subtitle tracks generated from social media channels dedicated to reading picture books aloud. It consists of approximately 25,585 types (wordforms) and their frequency norms (raw and Zipf-transformed), a lexicon of bigrams (two-word sequences and their transitional probabilities) and a document-term matrix (which shows the importance of each word in the corpus in each book). Several immediate contributions of cpb-lex to behavioural science research are reported, including that the new cpb-lex frequency norms strongly predict age of acquisition and outperform comparable child-input lexical databases. The database allows researchers and practitioners to extract lexical statistics for high-frequency words which can be used to develop word lists. The paper concludes with an investigation of how cpb-lex can be used to extend recent modelling research on the lexical diversity children receive from picture books in addition to child-directed speech. Our model shows that the vocabulary input from a relatively small number of picture books can dramatically enrich vocabulary exposure from child-directed speech and potentially assist children with vocabulary input deficits. The database is freely available from the Open Science Framework repository: https://tinyurl.com/4este73c. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
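The bigram lexicon with transitional probabilities described in the cpb-lex abstract can be sketched as follows. This assumes forward transitional probability, P(w2 | w1) = count(w1 w2) / count(w1 as a first element). The abstract does not say which variant cpb-lex uses, and the function name and toy sentence are hypothetical.

```python
from collections import Counter


def transitional_probabilities(tokens):
    """Map each bigram (w1, w2) to count(w1 w2) / count(w1 as first element)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Every token except the last one starts exactly one bigram.
    first_counts = Counter(tokens[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in bigrams.items()}


# Toy input, not taken from the cpb-lex corpus.
tokens = "the cat sat on the mat".split()
probs = transitional_probabilities(tokens)
print(probs[("the", "cat")])  # "the" is followed by "cat" in one of its two occurrences
print(probs[("cat", "sat")])  # "cat" is always followed by "sat" here
```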
7. Information Retrieval Multi-agent System Established on the Metaphysics Lexical Database
- Author
-
Bystrov, Dmitriy, Spagnoletti, Paolo, Series Editor, De Marco, Marco, Series Editor, Pouloudi, Nancy, Series Editor, Te'eni, Dov, Series Editor, vom Brocke, Jan, Series Editor, Winter, Robert, Series Editor, Baskerville, Richard, Series Editor, Za, Stefano, Series Editor, Braccini, Alessio Maria, Series Editor, Ben Ahmed, Mohamed, editor, Boudhir, Anouar Abdelhakim, editor, Abd Elhamid Attia, Hany Farhat, editor, Eštoková, Adriana, editor, and Zelenáková, Martina, editor
- Published
- 2024
- Full Text
- View/download PDF
8. CINWA (database of terminology for cultivated plants in indigenous languages of northwestern South America): introducing a resource for research in ethnobiology, anthropology, historical linguistics, and interdisciplinary research on the neolithic transition in South America
- Author
-
Urban, Matthias, Panchi, Evelyn Michelle Aguilar, Lee, Saetbyul, and Brodetsky, Evgenia
- Subjects
- *HISTORICAL linguistics, *CULTIVATED plants, *DATABASES, *ETHNOBIOLOGY, *ONLINE databases
- Abstract
This article introduces CINWA, a freely accessible online database of terminology for cultivated plants in indigenous languages of South America based on FAIR principles for scientific data management and stewardship. In the pre-release version we present here, CINWA assembles more than 2700 terms from more than 60 indigenous languages of northwestern South America, and coverage will be continuously expanded. CINWA is primarily designed for use in historical linguistics to explore patterns of lexical borrowing that might be used as a proxy for tracing the pathways by which knowledge of individual cultivated plants and the associated know-how spread from speech community to speech community in pre-Columbian South America. In spite of intensifying research, this is still unclear for most cultivars as the locales of initial cultivation are heterogeneous and spatially diffuse. However, possible uses of the CINWA database are manifold and go beyond this research question. The database can be used as a resource for ethnobiological and comparative anthropological research on South American communities, South American agricultural ecosystems and practices, and for studies in lexical borrowing, language contact, and historical linguistics broadly. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
9. A Review Paper on Sentence Similarity Using Different Algorithm
- Author
-
Bhale, Yogiraj, Kumar, Lilambuj, Singh, Rajat, Nazar, Saquib, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Bansal, Jagdish Chand, editor, Sharma, Harish, editor, and Chakravorty, Antorweep, editor
- Published
- 2023
- Full Text
- View/download PDF
10. LA80: A Lexical Database of 10 Bantu A80 Languages
- Author
-
Tessa Y. Vermeir, Marc Allassonnière-Tang, and Guillaume Segerer
- Subjects
lexical database, north-western bantu languages, corpus analysis, typology, lexical reconstructions, History of scholarship and learning. The humanities, AZ20-999, Language and Literature
- Abstract
In this paper, we present LA80, a database containing lexical data of 10 Bantu A80 languages (Bekwel, Gyeli, Kol, Koonzime, Kwasio, Makaa, Mpiemo, Njyem, Shiwa and Sso). Data from existing fieldwork datasets have been compiled and formatted. We standardised French translations, corrected spelling mistakes, and merged overlapping data points, resulting in a database with 5,588 concepts. Furthermore, for a subset of 557 concepts available in at least six of the 10 languages, we did additional reformatting by separating prefixes from stems, something that is not done systematically in the source data. The LA80 database can be used for comparative linguistic analyses and diachronic reconstructions.
- Published
- 2024
- Full Text
- View/download PDF
11. Blackfoot Words: a database of Blackfoot lexical forms.
- Author
-
Weber, Natalie, Brown, Tyler, Celli, Joshua, Denham, McKenzie, Dykstra, Hailey, Hernandez-Merlin, Rodrigo, Hochstein, Evan, Hwang, Pinyu, Kidd, Nico, Kulmizev, Diana, Morrison, Hannah, Norris, Matty, and Venkatraman, Lena
- Subjects
- *DATABASES, *LEXICAL access, *ORTHOGRAPHY & spelling, *VARIATION in language, *MORPHEMICS, *RELATIONAL databases
- Abstract
This paper describes the structure and creation of Blackfoot Words, a new relational database of lexical forms (inflected words, stems, and morphemes) in Blackfoot (Algonquian; ISO 639-3: bla). To date, we have digitized 63,493 individual lexical forms from 30 sources, representing all four major dialects, and spanning the years 1743–2017. Version 1.1 of the database includes lexical forms from nine of these sources. This project has two aims. The first is to digitize and provide access to the lexical data in these sources, many of which are difficult to access and discover. The second is to organize the data so that connections can be made between instances of the "same" lexical form across all sources, despite variation across sources in the dialect recorded, orthographic conventions, and the depth of morpheme analysis. The database structure was developed in response to these aims. The database comprises five tables: Sources, Words, Stems, Morphemes, and Lemmas. The Sources table contains bibliographic information and commentary on the sources. The Words table contains inflected words in the source orthography. Each word is broken down into stems and morphemes which are entered into the Stems and Morphemes tables in the source orthography. The Lemmas table contains abstract versions of each stem or morpheme in a standardized orthography. Instances of the same stem or morpheme are linked to a common lemma. We expect that the database will support projects by the language community and other researchers. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
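The five-table structure described in the Blackfoot Words abstract (Sources, Words, Stems, Morphemes, Lemmas) can be sketched with Python's built-in `sqlite3` module. The table names follow the abstract; every column name, key, and data value below is a hypothetical placeholder, not the project's actual schema or any real Blackfoot form.

```python
import sqlite3

# Hypothetical columns; only the five table names come from the abstract.
SCHEMA = """
CREATE TABLE sources   (id INTEGER PRIMARY KEY, citation TEXT NOT NULL, commentary TEXT);
CREATE TABLE lemmas    (id INTEGER PRIMARY KEY, standard_form TEXT NOT NULL);
CREATE TABLE words     (id INTEGER PRIMARY KEY,
                        source_id INTEGER NOT NULL REFERENCES sources(id),
                        form TEXT NOT NULL);
CREATE TABLE stems     (id INTEGER PRIMARY KEY,
                        word_id INTEGER NOT NULL REFERENCES words(id),
                        lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
                        form TEXT NOT NULL);
CREATE TABLE morphemes (id INTEGER PRIMARY KEY,
                        word_id INTEGER NOT NULL REFERENCES words(id),
                        lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
                        form TEXT NOT NULL);
"""


def attestations(conn, lemma_id):
    """List (source citation, stem in source orthography) linked to one lemma."""
    rows = conn.execute(
        "SELECT sources.citation, stems.form FROM stems "
        "JOIN words ON stems.word_id = words.id "
        "JOIN sources ON words.source_id = sources.id "
        "WHERE stems.lemma_id = ? ORDER BY sources.id",
        (lemma_id,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows: two sources spell the "same" stem differently.
conn.executemany("INSERT INTO sources VALUES (?, ?, ?)",
                 [(1, "Source A", None), (2, "Source B", None)])
conn.execute("INSERT INTO lemmas VALUES (1, 'LEMMA-1')")
conn.executemany("INSERT INTO words VALUES (?, ?, ?)",
                 [(1, 1, "word-in-A"), (2, 2, "word-in-B")])
conn.executemany("INSERT INTO stems VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, "stem-spelling-A"), (2, 2, 1, "stem-spelling-B")])
print(attestations(conn, 1))
```

Linking both stem rows to one lemma row is what lets a query gather instances of the same form across sources despite differing orthographies.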
12. DILLo: an Italian lexical database for speech-language pathologists
- Author
-
Beccaria, Federica, Cristiano, Angela, Pisciotta, Flavio, Usardi, Noemi, Borgogni, Elisa, Prayer Galletti, Filippo, Corsi, Giulia, Gregori, Lorenzo, and Gagliardi, Gloria
- Published
- 2024
- Full Text
- View/download PDF
13. Units of sub-sign meaning in NGT: A toolbox for sub-sign meaning in a lexical database.
- Author
-
Zwitserlood, Inge, van der Kooij, Els, and Crasborn, Onno
- Subjects
- *DATABASES, *SIGN language, *PSYCHOLINGUISTICS
- Abstract
This paper provides an overview of all the meaningful sub-sign form units (form-meaning units; FMUs) in lexical signs in Sign Language of the Netherlands (NGT). We investigated the potential meaning of all form features that were previously established in analyses of NGT form by analyzing their distribution in lexical signs. The data set consisted of 500 NGT signs in the lexical database Global Signbank, and a set of 163 elicited newly-formed lexical signs. All features in these data sets appear to bear meaning (at least once). No completely arbitrary features were found, and some features appeared to be always associated to a specific meaning. This toolkit and the set of FMUs in NGT provides a possible basis for cross-linguistic study and for a more fine-grained approach in various research disciplines, for instance psycholinguistics and acquisition, and it may thus advance the theoretical and applied study of sign languages. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
14. CCLOWW: A grade-level Chinese children's lexicon of written words.
- Author
-
Li, Luan, Yang, Yang, Song, Ming, Fang, Siyi, Zhang, Manyan, Chen, Qingrong, and Cai, Qing
- Subjects
- *CHINESE people, *LEXICON, *CHINESE characters, *WORD recognition, *DATABASES, *LINGUISTIC analysis, *COMPARATIVE grammar
- Abstract
In this article, we present the Chinese Children's Lexicon of Written Words (CCLOWW), the first grade-level database that provides frequency statistics of simplified Chinese characters and words for children. The database is computed from a corpus of 34,671,424 character tokens and 22,427,010 word tokens (including single- and multicharacter words), extracted from 2131 books. It contains 6746 different character types and 153,079 different word types. CCLOWW provides several frequency indices of simplified Chinese for three grade levels (grade 2 and below, grades 3–4, grades 5–6) to profile children's experience with written Chinese in and outside of school. We describe in this article the distributions of frequency and contextual diversity of the characters and words, as well as word length and syntactic categories of the words in the corpus and the subcorpora. We also report results of correlation analyses with other written corpora and of several naming and lexical decision experiments. The findings suggest that CCLOWW frequency measures correlate well with other corpora. Importantly, they could reliably predict children's and adults' naming and lexical decision performances. They could also explain variance in adults' visual word recognition, in addition to frequency measures computed in an adult corpus, indicating that early print exposure might influence readers' lexical processing later on beyond an age of acquisition effect. CCLOWW will help researchers in language processing and development as well as educators with selecting language materials appropriate for children's developmental stages. The database is freely available online at https://www.learn2read.cn/database/. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
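The distinction between frequency and contextual diversity in the CCLOWW abstract can be shown in a short sketch. It assumes a count-based definition of contextual diversity (the number of books in which an item occurs at least once). The abstract does not give the exact measure, and the function name and toy data are hypothetical.

```python
from collections import Counter


def frequency_and_diversity(books):
    """Return (token frequency, number of books containing each item)."""
    frequency = Counter()
    diversity = Counter()
    for tokens in books:
        frequency.update(tokens)       # every occurrence counts
        diversity.update(set(tokens))  # each book counts at most once per item
    return frequency, diversity


# Toy "books" as token lists; placeholders, not CCLOWW data.
books = [["a", "b", "a"], ["a", "c"]]
frequency, diversity = frequency_and_diversity(books)
print(frequency["a"], diversity["a"])
print(frequency["b"], diversity["b"])
```

An item repeated many times within a single book gets a high frequency but a contextual diversity of only 1, which is why the two measures are reported separately.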
15. Application of Expectation–Maximization Algorithm to Solve Lexical Divergence in Bangla–Odia Machine Translation
- Author
-
Das, Bishwa Ranjan, Maringanti, Hima Bindu, Dash, Niladri Sekhar, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Dehuri, Satchidananda, editor, Prasad Mishra, Bhabani Shankar, editor, Mallick, Pradeep Kumar, editor, and Cho, Sung-Bae, editor
- Published
- 2022
- Full Text
- View/download PDF
16. CCLOOW: Chinese children’s lexicon of oral words
- Author
-
Li, Luan, Zhao, Wentao, Song, Ming, Wang, Jing, and Cai, Qing
- Published
- 2024
- Full Text
- View/download PDF
17. A New Corpus-Driven Lexical Database for Lithuanian as a Foreign Language
- Author
-
Kovalevskaitė Jolanta and Rimkutė Erika
- Subjects
lexical database, corpus pattern analysis, corpus, corpus linguistics, learner lexicography, lithuanian language, Special aspects of education, LC8-6691, Geography. Anthropology. Recreation
- Abstract
In this paper, we describe a new lexicographic resource for advanced learners of Lithuanian, the Lexical Database of Lithuanian Language Usage, which is the first attempt in Lithuanian lexicography to prepare a description of vocabulary based on word usage analysis in a particular corpus. The written subpart of the Lithuanian Pedagogic Corpus (approx. 620,000 tokens) was used to develop headword lists and collect word usage information in the form of corpus patterns. In the database, there are 3,700 lexical items, words and multi-word units (compounds, idioms or sayings). For the approx. 700 most frequent words from a shared vocabulary (they appear in texts assigned to A1, A2, B1 and B2 levels, and their frequency in the whole corpus is 100 occurrences and above), we prepared a full-record entry: it includes sense-related corpus patterns with grammatical, semantic and lexical information and examples illustrating all pattern components. The short-record entry (no patterns, only examples) is prepared for the less frequent words from the shared vocabulary, which are derivationally related to the most frequent headwords. The users are provided with 2,542 derivatives, which are linked to 940 headwords. In the database, 28,550 encoding examples are manually selected for all 3,000 headwords and 700 phrases. We discuss the features of the database and, in particular, the adopted semi-automated procedure of Corpus Pattern Analysis, which was used for the description of word usage. We evaluate the approach applied, discuss its advantages for users, and provide suggestions for future improvements of the resource, which can be used as an additional resource in the classroom of Lithuanian as a foreign language and, together with the available corpora, fill a gap in usage information in the existing (learner) dictionaries.
- Published
- 2022
- Full Text
- View/download PDF
18. Lemmatization of Inflected Nouns
- Author
-
Dash, Niladri Sekhar
- Published
- 2021
- Full Text
- View/download PDF
19. A novel hybrid methodology for computing semantic similarity between sentences through various word senses.
- Author
-
Ahmad, Farooq and Faisal, Mohammad
- Subjects
NATURAL language processing, SEMANTICS, LEXICAL access, DATABASES, STATISTICAL correlation
- Abstract
In the area of natural language processing, measuring sentence similarity is an essential problem. Searching for semantic meaning in natural language is a related issue. The task of measuring sentence similarity is to find semantic symmetry in two sentences, no matter how they are arranged. It is important to measure the similarity of sentences accurately. To compute the similarity between sentences, existing methods have been constructed from approaches for large texts. Since these methods work in very high-dimensional spaces, they are inefficient, require human input, and are not flexible enough for some applications. In this study, we propose a hybrid method (HydMethod) which considers not only semantic information including lexical databases, word embeddings, and corpus statistics, but also implied word order information. With lexical databases, our method models human common sense knowledge, and that knowledge can then be adapted for use in different domains with the incorporation of corpus statistics. Therefore, the methodology is applicable across several domains. As part of our experiments, we used two standard datasets (Pilot Short Text Semantic Similarity Benchmark and MS paraphrase) to demonstrate the efficacy of our proposed method. The proposed method outperforms the existing approaches when tested on these two datasets, giving the highest correlation value for both word and sentence similarity. Moreover, it achieves an improvement of up to 32% over using only word-vector or WordNet-based methodology. With Rubenstein and Goodenough word & sentence pairs, our algorithm's similarity measure shows a high Pearson correlation coefficient of 0.8953. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
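The evaluation reported in the abstract above relies on the Pearson correlation between a method's similarity scores and human ratings. A minimal sketch of that statistic follows. The function name and the numbers are hypothetical and are not taken from the paper or its datasets.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Made-up model scores and human ratings for four sentence pairs.
model_scores = [0.10, 0.40, 0.55, 0.90]
human_ratings = [0.05, 0.35, 0.60, 0.95]
print(pearson(model_scores, human_ratings))
```

A value near 1 means the method ranks and spaces the pairs much as the human raters do; the coefficient says nothing about whether the absolute scores match.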
20. The Children and Young People’s Books Lexicon (CYP-LEX): A large-scale lexical database of books read by children and young people in the United Kingdom
- Author
-
Korochkina, M, Marelli, M, Brysbaert, M, and Rastle, K
- Abstract
This article introduces the Children and Young People’s Books-Lexicon (CYP-LEX), a large-scale lexical database derived from books popular with children and young people in the United Kingdom. CYP-LEX includes 1,200 books evenly distributed across three age bands (7–9, 10–12, 13+) and comprises over 70 million tokens and over 105,000 types. For each word in each age band, we provide its raw and Zipf-transformed frequencies, all parts-of-speech in which it occurs with raw frequency and lemma for each occurrence, and measures of count-based contextual diversity. Together and individually, the three CYP-LEX age bands contain substantially more words than any other publicly available database of books for primary and secondary school children. Most of these words are very low in frequency, and a substantial proportion of the words in each age band do not occur on British television. Although the three age bands share some very frequent words, they differ substantially regarding words that occur less frequently, and this pattern also holds at the level of individual books. Initial analyses of CYP-LEX illustrate why independent reading constitutes a challenge for children and young people, and they also underscore the importance of reading widely for the development of reading expertise. Overall, CYP-LEX provides unprecedented information into the nature of vocabulary in books that British children aged 7+ read, and is a highly valuable resource for those studying reading and language development.
- Published
- 2024
21. KAHD: Katukinan-Arawan-Harakmbut Database (Pre-release)
- Author
-
Fabrício Ferraz Gerardi, Carolina Coelho Aragon, and Stanislav Reichert
- Subjects
arawan languages, amazonian languages, lexical database, historical linguistics, computational linguistics, language documentation, History of scholarship and learning. The humanities, AZ20-999, Language and Literature
- Abstract
Katukinan, Arawan, and Harakmbut are small language families spoken in south-western Amazonia. These families have received some attention, but there are no consistently transcribed and machine-readable datasets available for them. We address this lacuna by introducing the first publicly available linguistic dataset of Arawan languages as the first part of the Katukinan-Arawan-Harakmbut Database, created with the goal of providing and regularly updating a list of lexical items in a consistent transcription and with cognacy annotation. The database is being developed to be used in quantitative and genealogical investigations.
- Published
- 2022
- Full Text
- View/download PDF
22. Database of word-level statistics for Mandarin Chinese (DoWLS-MAN).
- Author
-
Neergaard, Karl David, Xu, Hongzhi, German, James S., and Huang, Chu-Ren
- Subjects
- *MANDARIN dialects, *NUMERIC databases, *TONE (Phonetics), *ORTHOGRAPHY & spelling, *CHINESE characters, *SPEECH, *MENTAL representation
- Abstract
In this article we present the Database of Word-Level Statistics for Mandarin Chinese (DoWLS-MAN). The database addresses the lack of agreement in phonological syllable segmentation specific to Mandarin by offering phonological features for each lexical item according to 16 schematic representations of the syllable (8 with tone and 8 without tone). Those lexical statistics that differ per phonological word and nonword due to changes in syllable segmentation are of the variant category and include subtitle lexical frequency, phonological neighborhood density measures, homophone density, and network science measures. The invariant characteristics consist of each item's lexical tone, phonological transcription, and syllable structure, among others. The goal of DoWLS-MAN is to provide researchers both the ability to choose stimuli that are derived from a segmentation schema that supports an existing model of Mandarin speech processing, and the ability to choose stimuli that allow for the testing of hypotheses on phonological segmentation according to multiple schemas. In an exploratory analysis we illustrate how multiple schematic representations of the phonological mental lexicon can aid in hypothesis generation, specifically in terms of phonological processing when reading Chinese orthography. Users of the database can search among over 92,000 words, over 1600 out-of-vocabulary Chinese characters, and 4300 phonological nonwords according to either Chinese orthography, pinyin, or ASCII phonetic script. Users can also generate a list of phonological words and nonwords according to user-defined ranges and categories of lexical characteristics. DoWLS-MAN is available to the public for search or download at https://dowls.site. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
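Why neighborhood density belongs to the "variant" statistics in the DoWLS-MAN abstract can be seen in a toy sketch. It assumes the common definition of a phonological neighbor (one unit substituted, added, or deleted) and compares two hypothetical segmentations of the same syllables. Neither is claimed to be one of the database's 16 schemas, and the pinyin-like strings are placeholders.

```python
def is_neighbor(a: tuple, b: tuple) -> bool:
    """True if b differs from a by one substitution, insertion, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = sorted((a, b), key=len)
    # Deleting exactly one unit from the longer form must yield the shorter one.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def density(target: tuple, lexicon: list) -> int:
    """Count the lexicon entries that are neighbors of the target."""
    return sum(is_neighbor(target, item) for item in lexicon)


# The same toy syllables ("ban" vs. "pan", "ba", "bo") under two segmentations.
by_segment = {"target": ("b", "a", "n"),
              "lexicon": [("p", "a", "n"), ("b", "a"), ("b", "o")]}
by_onset_rime = {"target": ("b", "an"),
                 "lexicon": [("p", "an"), ("b", "a"), ("b", "o")]}
print(density(by_segment["target"], by_segment["lexicon"]))
print(density(by_onset_rime["target"], by_onset_rime["lexicon"]))
```

Under the segment-level split "bo" is two edits away from "ban", but under the onset/rime split it differs in a single unit, so the same item receives a different density depending on the schema.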
23. A large and evolving cognate database.
- Author
-
Batsuren, Khuyagbaatar, Bella, Gábor, and Giunchiglia, Fausto
- Subjects
- *ETYMOLOGY, *DATABASES, *KNOWLEDGE base, *DATA analysis, *QUANTITATIVE research
- Abstract
We present CogNet, a large-scale, automatically-built database of sense-tagged cognates—words of common origin and meaning across languages. CogNet is continuously evolving: its current version contains over 8 million cognate pairs over 338 languages and 35 writing systems, with new releases already in preparation. The paper presents the algorithm and input resources used for its computation, an evaluation of the result, as well as a quantitative analysis of cognate data leading to novel insights on language diversity. Furthermore, as an example on the use of large-scale cross-lingual knowledge bases for improving the quality of multilingual applications, we present a case study on the use of CogNet for bilingual lexicon induction in the framework of cross-lingual transfer learning. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. Typologie víceslovných jednotek v češtině a frekvenční zastoupení jejich hlavních vlastností v žánrově vyváženém korpusu
- Author
-
Vladimír Petkevič, Marie Kopřivová, Milena Hnátková, Tomáš Jelínek, Pavel Kopřiva, Alexandr Rosen, Hana Skoumalová, and Pavel Vondřička
- Subjects
multiword (lexical) expressions in czech, typology of multiword expressions, frequency of types of multiword expressions, idiomaticity, lexical database, genre-balanced corpus, Philology. Linguistics, P1-1091
- Abstract
The paper consists of two main parts: (a) In the first part, a typology of multiword expressions (MWE) in Czech is described in a detailed way. This typology is part of the description of MWE database entries in the lexical database LEMUR containing more than 10,500 MWE entries as of June 2020. MWE properties reflected in this typology are accounted for by categories and their values. Each MWE is identified by a unique lemma; a group of related MWEs is assigned a “superlemma”. A MWE is described by the following properties: a MWE definition, characteristic examples, lemmas and morphological features of MWE components (words), as well as the following key categories: MWE style/register, type of usage, syntactic structure (including its representation by a dependency and a phrase-structure tree), aspects of flexibility (variants and fragments, internal modifiability of individual MWE components, possibilities of syntactic transformations of the main MWE components and morphological constraints) and types of idiomaticity on the lexical, morphological, syntactic, semantic and pragmatic level. (b) In the second part of the paper, the authors focus on the frequency of the main features of the adopted typology in the real language material represented by the genre-balanced SYN2015 corpus, containing 100 mil. word forms (excluding punctuation): a type of usage correlated with a syntactic type and frequency of various kinds of idiomaticity. Our paper seems to be the first attempt at approaching the MWE properties from the point of view of MWE frequencies as types rather than tokens (i.e. frequencies of occurrences of a given MWE).
- Published
- 2020
25. Exploring Networks of Lexical Variation in Russian Sign Language
- Author
-
Vadim Kimmelman, Anna Komarova, Lyudmila Luchkova, Valeria Vinogradova, and Oksana Alekseeva
- Subjects
lexical variation, phonological variation, Russian Sign Language, lexical database, graph theory, Psychology, BF1-990
- Abstract
When describing variation at the lexical level in sign languages, researchers often distinguish between phonological and lexical variants, using the following principle: if two signs differ in only one of the major phonological components (handshape, orientation, movement, location), then they are considered phonological variants, otherwise they are considered separate lexemes. We demonstrate that this principle leads to contradictions in some simple and more complex cases of variation. We argue that it is useful to visualize the relations between variants as graphs, and we describe possible networks of variants that can arise using this visualization tool. We further demonstrate that these scenarios in fact arise in the case of variation in color terms and kinship terms in Russian Sign Language (RSL), using a newly created database of lexical variation in RSL. We show that it is possible to develop a set of formal rules that can help distinguish phonological and lexical variation also in the problematic scenarios. However, we argue that it might be a mistake to dismiss the actual patterns of variant relations in order to arrive at the binary lexical vs. phonological variant opposition.
- Published
- 2022
- Full Text
- View/download PDF
26. Exploring Networks of Lexical Variation in Russian Sign Language.
- Author
-
Kimmelman, Vadim, Komarova, Anna, Luchkova, Lyudmila, Vinogradova, Valeria, and Alekseeva, Oksana
- Subjects
SIGN language ,RUSSIAN language ,KINSHIP ,LEXEME - Abstract
When describing variation at the lexical level in sign languages, researchers often distinguish between phonological and lexical variants, using the following principle: if two signs differ in only one of the major phonological components (handshape, orientation, movement, location), then they are considered phonological variants, otherwise they are considered separate lexemes. We demonstrate that this principle leads to contradictions in some simple and more complex cases of variation. We argue that it is useful to visualize the relations between variants as graphs, and we describe possible networks of variants that can arise using this visualization tool. We further demonstrate that these scenarios in fact arise in the case of variation in color terms and kinship terms in Russian Sign Language (RSL), using a newly created database of lexical variation in RSL. We show that it is possible to develop a set of formal rules that can help distinguish phonological and lexical variation also in the problematic scenarios. However, we argue that it might be a mistake to dismiss the actual patterns of variant relations in order to arrive at the binary lexical vs. phonological variant opposition. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
27. KAHD: Katukinan-Arawan-Harakmbut Database (Pre-release).
- Author
-
FERRAZ GERARDI, FABRÍCIO, COELHO ARAGON, CAROLINA, and REICHERT, STANISLAV
- Subjects
LANGUAGE & languages ,HISTORICAL linguistics ,DATA analysis ,SOCIAL networks ,MACHINE learning - Abstract
Katukinan, Arawan, and Harakmbut are small language families spoken in southwestern Amazonia. These families have received some attention, but there are no consistently transcribed and machine-readable datasets available for them. We address this lacuna by introducing the first publicly available linguistic dataset of Arawan languages as the first part of the Katukinan-Arawan-Harakmbut Database, created with the goal of providing and regularly updating a list of lexical items in a consistent transcription and with cognacy annotation. The database is being developed to be used in quantitative and genealogical investigations. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
28. TuLeD (Tupían lexical database): introducing a database of a South American language family.
- Author
-
Gerardi, Fabrício Ferraz, Reichert, Stanislav, and Aragon, Carolina Coelho
- Subjects
- *
AMERICAN English language , *DATABASES , *FAMILIES - Abstract
The last two decades witnessed a rapid growth of publicly accessible online language resources. This has allowed for valuable data on lesser known languages to become available. Such resources provide linguists with opportunities for advancing their research. Yet despite the proliferation of lexical and morphological databases, the ca. 456 languages spoken in South America are poorly represented, particularly the Tupían family, which is the largest on the continent. This paper therefore introduces and discusses TuLeD, a lexical database exclusively devoted to a South American language family. It provides a comprehensive list of lexical items presented in a unified transcription for all languages with cognacy assignment and relevant (cultural or linguistic) notes. One of the main goals of TuLeD is to become a full-fledged database and a benchmark for linguistic studies on South American languages in general and the Tupían family in particular. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
29. Developing a Transfer-Based System for Arabic Dialects Translation
- Author
-
Hamada, Salwa, Marzouk, Reham M., Kacprzyk, Janusz, Series editor, Shaalan, Khaled, editor, Hassanien, Aboul Ella, editor, and Tolba, Fahmy, editor
- Published
- 2018
- Full Text
- View/download PDF
30. Challenging the Boundaries of Unsupervised Learning for Semantic Similarity
- Author
-
Atish Pawar and Vijay Mago
- Subjects
Corpus ,lexical database ,natural language processing ,semantic analysis ,sentence similarity ,word similarity ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The semantic analysis field has a crucial role to play in the research related to text analytics. Calculating the semantic similarity between sentences is a long-standing problem in the area of natural language processing, and it differs significantly as the domain of operation differs. In this paper, we present a methodology that can be applied across multiple domains by incorporating corpora-based statistics into a standardized semantic similarity algorithm. To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. When tested on both benchmark standards and mean human similarity dataset, the methodology achieves a high correlation value for both word (r = 0.8753) and sentence similarity (r = 0.8793) concerning Rubenstein and Goodenough standard and the SICK dataset (r = 0.83241) outperforming other unsupervised models.
- Published
- 2019
- Full Text
- View/download PDF
31. Grammatical patterns in the corpus-driven "Lexical Database of Lithuanian".
- Author
-
Bielinskienė, Agnė, Kovalevskaitė, Jolanta, and Rimkutė, Erika
- Subjects
LITHUANIANS ,LANGUAGE & languages ,PARTS of speech ,ADJECTIVES (Grammar) ,NOUNS - Abstract
Copyright of Language: Meaning & Form / Valoda: Nozīme un Forma is the property of University of Latvia and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2021
- Full Text
- View/download PDF
32. Sources and steps of corpus lemmatization: Old English anomalous verbs.
- Author
-
García Fernández, Laura
- Subjects
- *
VERBS , *ENGLISH language usage , *SWEETNESS (Taste) , *CORPORA , *ORTHOGRAPHY & spelling - Abstract
This article describes the steps and results of the lemmatization of the derived anomalous verbs of Old English. The data have been retrieved from The Dictionary of Old English Web Corpus, searched through the lexical database from the Nerthus Project called Norna. The methodology comprises several steps combining automatic searches on the lemmatizer and manual revision. Part of the results, including the verbs starting with the letters A to H, are compared with the Dictionary of Old English, while the rest of the lemmas are checked with the standard Old English dictionaries (Clark-Hall, Sweet and Bosworth-Toller). The discussion leads to the conclusion that the lemmatization of the verbs of Old English, a language with a remarkable degree of spelling variation, requires considerable manual revision. However, the progressive improvement of automatic searches, based on the comparison of the initial results with the available lexicographical sources, minimizes the need for manual adjustment. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
33. Typologie víceslovných jednotek v češtině a frekvenční zastoupení jejich hlavních vlastností v žánrově vyváženém korpusu.
- Author
-
Petkevič, Vladimír, Kopřivová, Marie, Hnátková, Milena, Jelínek, Tomáš, Kopřiva, Pavel, Rosen, Alexandr, Skoumalová, Hana, and Vondřička, Pavel
- Subjects
IDIOMS ,DEFINITIONS ,TYPE & token (Linguistics) ,TERMS & phrases ,LEXICAL access - Abstract
Copyright of Studies in Applied Linguistics / Studie z Aplikované Lingvistiky is the property of Universita Karlova, Filozoficka Fakulta and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2020
34. NorthEuraLex: a wide-coverage lexical database of Northern Eurasia.
- Author
-
Dellert, Johannes, Daneyko, Thora, Münch, Alla, Ladygina, Alina, Buch, Armin, Clarius, Natalie, Grigorjew, Ilja, Balabel, Mohamed, Boga, Hizniye Isabella, Baysarova, Zalina, Mühlenbernd, Roland, Wahle, Johannes, and Jäger, Gerhard
- Subjects
- *
NUMBER concept , *DATABASES , *TURKIC languages , *INDO-European languages , *TEST methods - Abstract
This article describes the first release version of a new lexicostatistical database of Northern Eurasia, which includes Europe as the most well-researched linguistic area. Unlike in other areas of the world, where databases are restricted to covering a small number of concepts as far as possible based on often sparse documentation, good lexical resources providing wide coverage of the lexicon are available even for many smaller languages in our target area. This makes it possible to attain near-completeness for a substantial number of concepts. The resulting database provides a basis for rich benchmarks that can be used to test automated methods which aim to derive new knowledge about language history in underresearched areas. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
35. Automatic Creation of Ontology Using a Lexical Database: An Application for the Energy Sector
- Author
-
Moreira, Alexandra, Filho, Jugurta Lisboa, de Paiva Oliveira, Alcione, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Métais, Elisabeth, editor, Meziane, Farid, editor, Saraee, Mohamad, editor, Sugumaran, Vijayan, editor, and Vadera, Sunil, editor
- Published
- 2016
- Full Text
- View/download PDF
36. Strong verb lemmas from a corpus of Old English. Advances and issues
- Author
-
Darío Metola Rodríguez
- Subjects
lemmatisation ,old english ,lexical database ,morphology ,orthography ,Philology. Linguistics ,P1-1091 - Abstract
The aim of this article is to devise the method of lemmatisation of strong verbs from a corpus of Old English with a view to maximising the automatic search for the inflectional forms, with the corresponding minimisation of manual revision of the verbs under analysis. The search algorithm, which consists of query strings and filters, is launched on the lemmatiser Norna, a component of the lexical database of Old English Nerthus. The conclusions of the article insist on the limits of automatic lemmatisation as well as the paths of refinement of the lemmatisation method in order to accommodate less predictable forms.
- Published
- 2017
- Full Text
- View/download PDF
37. The pedagogical benefits of a lexical database (SciE-Lex) to assist the production of publishable biomedical texts by EAL writers
- Author
-
Natalia Judith Laso and Suganthi John
- Subjects
EAL writers ,biomedical discourse ,English for research publication purposes ,lexical database ,pedagogical benefits ,Language and Literature ,Philology. Linguistics ,P1-1091 - Abstract
Research has demonstrated that it is challenging for English as an Additional Language (EAL) writers to acquire phraseological competence in academic English and develop a good working knowledge of discipline-specific formulaic language. This paper aims to explore whether SciE-Lex, a powerful lexical database of biomedical research articles, can be exploited by EAL writers to enhance their command of formulaic language in published biomedical English writing. Our paper reports on the challenges associated with formulaic language (namely collocations) for EAL writers, reflects on the benefits of using a lexical database, and evaluates a pedagogical approach to helping EAL writers produce publishable texts. It specifically highlights results from two writing workshops conducted for EAL writers (medical researchers in the present study). The workshops involved medical researchers working on drafts of their writing using SciE-Lex. Our paper reports on the specific benefits of using SciE-Lex as demonstrated by revisions in the writing produced by the EAL medical researchers. This contribution aims to add to the current discussion on English for Research Publication Purposes (ERPP) for the EAL community, who now form the main contributors to research knowledge dissemination.
- Published
- 2017
38. Effects of lexical neighbourhood density and phonotactic probability studied with a new database of matched pairs of real signs and modelled pseudosigns in the Swedish Sign Language
- Author
-
Witte, Erik, Björkstrand, Thomas, Danielsson, Henrik, and Holmer, Emil
- Published
- 2023
39. Machine Learning-Based Web Documents Categorization by Semantic Graphs
- Author
-
Camastra, Francesco, Ciaramella, Angelo, Placitelli, Alessio, Staiano, Antonino, Howlett, Robert J., Series editor, Jain, Lakhmi C., Series editor, Bassis, Simone, editor, Esposito, Anna, editor, and Morabito, Francesco Carlo, editor
- Published
- 2015
- Full Text
- View/download PDF
40. Morphological Disambiguation of Classical Sanskrit
- Author
-
Hellwig, Oliver, Diniz Junqueira Barbosa, Simone, Series editor, Chen, Phoebe, Series editor, Du, Xiaoyong, Series editor, Filipe, Joaquim, Series editor, Kara, Orhun, Series editor, Kotenko, Igor, Series editor, Liu, Ting, Series editor, Sivalingam, Krishna M., Series editor, Washio, Takashi, Series editor, Mahlow, Cerstin, editor, and Piotrowski, Michael, editor
- Published
- 2015
- Full Text
- View/download PDF
41. Tibetan Linguistic Terminology on the Base of the Tibetan Traditional Grammar Treatises Corpus
- Author
-
Grokhovskiy, Pavel, Khokhlova, Maria, Smirnova, Maria, Zakharov, Victor, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Král, Pavel, editor, and Matoušek, Václav, editor
- Published
- 2015
- Full Text
- View/download PDF
42. A Modified Collaborative Filtering Approach for Collaborating Community
- Author
-
Bhagat, Pradnya, Mascarenhas, Maruska, Howlett, Robert J., Series editor, Jain, Lakhmi C., Series editor, Kumar Kundu, Malay, editor, Mohapatra, Durga Prasad, editor, Konar, Amit, editor, and Chakraborty, Aruna, editor
- Published
- 2014
- Full Text
- View/download PDF
43. When Rules Meet Bigrams
- Author
-
Wehrli, Eric, Nerima, Luka, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, and Gelbukh, Alexander, editor
- Published
- 2014
- Full Text
- View/download PDF
44. SPALEX: A Spanish Lexical Decision Database From a Massive Online Data Collection
- Author
-
Jose Armando Aguasvivas, Manuel Carreiras, Marc Brysbaert, Paweł Mandera, Emmanuel Keuleers, and Jon Andoni Duñabeitia
- Subjects
megastudies ,lexical decision ,vocabulary knowledge ,online assessments ,lexical database ,Psychology ,BF1-990 - Published
- 2018
- Full Text
- View/download PDF
45. PRALED - A New Kind of Lexicographic Workstation
- Author
-
Horák, Aleš, Rambousek, Adam, Przepiórkowski, Adam, editor, Piasecki, Maciej, editor, Jassem, Krzysztof, editor, and Fuglewicz, Piotr, editor
- Published
- 2013
- Full Text
- View/download PDF
46. Ontology-Based Knowledge Elicitation: An Architecture
- Author
-
Montedoro, Marcello, Orsi, Giorgio, Sbattella, Licia, Tedesco, Roberto, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Anastasi, Giuseppe, editor, Bellini, Emilio, editor, Di Nitto, Elisabetta, editor, Ghezzi, Carlo, editor, Tanca, Letizia, editor, and Zimeo, Eugenio, editor
- Published
- 2012
- Full Text
- View/download PDF
47. A Query Language for WordNet-Like Lexical Databases
- Author
-
Kubis, Marek, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Pan, Jeng-Shyang, editor, Chen, Shyi-Ming, editor, and Nguyen, Ngoc Thanh, editor
- Published
- 2012
- Full Text
- View/download PDF
48. Data Weeding Techniques Applied to Roget’s Thesaurus
- Author
-
Priss, Uta, Old, L. John, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Wolff, Karl Erich, editor, Palchunov, Dmitry E., editor, Zagoruiko, Nikolay G., editor, and Andelfinger, Urs, editor
- Published
- 2011
- Full Text
- View/download PDF
49. A Taxonomic Generalization Technique for Natural Language Processing
- Author
-
Ferilli, Stefano, Di Mauro, Nicola, Basile, Teresa M. A., Esposito, Floriana, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Kryszkiewicz, Marzena, editor, Rybinski, Henryk, editor, Skowron, Andrzej, editor, and Raś, Zbigniew W., editor
- Published
- 2011
- Full Text
- View/download PDF
50. An Access Layer to PolNet – Polish WordNet
- Author
-
Kubis, Marek, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, and Vetulani, Zygmunt, editor
- Published
- 2011
- Full Text
- View/download PDF