902 results on "Word segmentation"
Search Results
2. Back to Supervision: Boosting Word Boundary Detection Through Frame Classification
- Author
-
Carnemolla, Simone, Calcagno, Salvatore, Palazzo, Simone, Giordano, Daniela, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Antonacopoulos, Apostolos, editor, Chaudhuri, Subhasis, editor, Chellappa, Rama, editor, Liu, Cheng-Lin, editor, Bhattacharya, Saumik, editor, and Pal, Umapada, editor
- Published
- 2025
3. The role of semantic information in Chinese word segmentation.
- Author
-
Chen, Ruqi, Huang, Linjieqiong, Perea, Manuel, and Li, Xingshan
- Subjects
-
*READING, *MASKING (Psychology), *RESEARCH funding, *PHONOLOGICAL awareness, *DESCRIPTIVE statistics, *PSYCHOLINGUISTICS, *ATTENTION, *SEMANTICS, *VISUAL perception, *EYE movements, *COGNITION
- Abstract
Word segmentation is crucial for reading in Chinese, where the absence of explicit word boundaries poses a distinct challenge. Previous studies in Chinese have examined how lexical and sub-lexical variables affect word segmentation. The present study investigated whether higher-level semantic information affects word segmentation using a primed word segmentation task with Overlapping Ambiguous Strings (OAS). An OAS is a three-character string in Chinese (e.g. ABC [in Latin letters]) where the middle character can constitute a word with both the left (word AB) and right (word BC) characters. The OAS was preceded by a semantic or repetition prime (presented for 42, 83, or 200 ms, across participants), priming either AB or BC. The semantic priming effect occurred at the 200-ms Stimulus Onset Asynchrony (SOA), whereas the repetition priming effect occurred at both 83 and 200-ms SOAs. These findings demonstrate that semantic information can affect word segmentation in Chinese within 200 ms. [ABSTRACT FROM AUTHOR]
- Published
- 2025
4. Exploring the processing unit of L2 Chinese learners in on-line Chinese reading.
- Author
-
Yao, Panpan, Jiang, Xin, Chen, Xinwei, and Li, Xingshan
- Subjects
-
*EYE movements, *CHINESE language, *EYE tracking, *VERBS, *VOCABULARY
- Abstract
The present study explored the processing units of high-proficiency second language (L2) Chinese learners in on-line reading in an eye-tracking experiment. The critical aim was to investigate how learners segment continuous characters into words without the aid of word boundary demarcations. Based on previous studies, the embedded words of 2- and 3-character incremental words were manipulated to be either plausible or implausible with the preceding verbs, while the incremental words themselves were always plausible. The results revealed an effect of the plausibility manipulation, which suggested that L2 Chinese learners activated embedded words first and integrated embedded words with previous sentence context as soon as they read them. [ABSTRACT FROM AUTHOR]
- Published
- 2025
5. A corpus of Chinese word segmentation agreement.
- Author
-
Tsang, Yiu-Kei, Yan, Ming, Pan, Jinger, and Chan, Megan Yin Kan
- Abstract
The absence of explicit word boundaries is a distinctive characteristic of Chinese script, setting it apart from most alphabetic scripts, leading to word boundary disagreement among readers. Previous studies have examined how this feature may influence reading performance. However, further investigations are required to generate more ecologically valid and generalizable findings. In order to advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. This corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides data on word segmentation agreement at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases. Moreover, about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories, namely the morphosyntactic type (e.g., “反映–了”), modifier–head type (e.g., “科學–教育”), and others (e.g., “大力–支持”). Finally, the agreement scores also significantly influenced reading behaviors, as evidenced by analyses with published eye movement data. Specifically, a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading. [ABSTRACT FROM AUTHOR]
- Published
- 2025
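The CWSA abstract above describes segmentation agreement "at each character position". A minimal sketch of one way such a score could be computed is given below, assuming each annotator supplies a list of words for the same sentence. The function names, the example sentence, and the agreement definition (proportion of annotators placing a boundary at each gap) are illustrative assumptions, not the corpus authors' procedure.

```python
from collections import Counter


def boundary_positions(words):
    """Return the character offsets at which one annotator placed a
    word boundary (the end of the sentence is not counted)."""
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_agreement(segmentations):
    """For every gap between adjacent characters, return the proportion
    of annotators who placed a word boundary there.

    `segmentations` holds one word list per annotator, all covering the
    same sentence."""
    n_chars = sum(len(w) for w in segmentations[0])
    counts = Counter()
    for words in segmentations:
        if sum(len(w) for w in words) != n_chars:
            raise ValueError("all segmentations must cover the same sentence")
        counts.update(boundary_positions(words))
    n_annotators = len(segmentations)
    return {gap: counts[gap] / n_annotators for gap in range(1, n_chars)}


# Hypothetical annotations of one five-character sentence by four readers.
annotations = [
    ["反映", "了", "問題"],
    ["反映了", "問題"],
    ["反映", "了", "問題"],
    ["反映了", "問題"],
]
agreement = boundary_agreement(annotations)
```

In this made-up example the readers split evenly on whether a boundary follows 反映, while all of them place one before 問題.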
6. UnifiedCut: A Simple and Efficient Neural Model for Thai, Burmese and Khmer Word Segmentation.
- Author
-
Wen, Yonghua, Xian, Yantuan, Wang, Yuehan, and Yu, Zhengtao
- Subjects
NATURAL language processing, TRANSFORMER models, GENERALIZATION
- Abstract
Word segmentation is a critical task in natural language processing for Southeast Asian Abugida languages, including Thai, Burmese, and Khmer. Existing approaches demonstrate that models using fixed-length windowed context inputs can achieve high segmentation accuracy; however, they often rely on low-level character features or language-specific preprocessing. Character-based methods can limit feature learning, while language-specific features add complexity due to specialized preprocessing requirements. This paper introduces UnifiedCut, which is a neural model that leverages multiple n-grams within a windowed multi-head attention mechanism. This design captures segmentation features from local contexts and multi-perspective n-gram inputs, enhancing generalization and recall, particularly for out-of-vocabulary words. Compared to CNN- and RNN-based approaches, UnifiedCut's multi-head attention enables finer-grained feature extraction and greater parallelism, resulting in a faster, more scalable solution. Comprehensive experiments on public datasets for Thai, Burmese, and Khmer show that UnifiedCut achieves state-of-the-art performance in word segmentation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
7. Constraints on Acceleration in Bilingual Development: Evidence from Word Segmentation by Spanish Learning Infants
- Author
-
Mateu, Victoria and Sundara, Megha
- Subjects
Cognitive and Computational Psychology, Psychology, Behavioral and Social Science, Minority Health, Clinical Research, Basic Behavioral and Social Science, Pediatric, Spanish, English, bilingualism, word segmentation, acceleration, frequency, noise tolerance, regularization, Cognitive Sciences, Clinical sciences, Biological psychology, Clinical and health psychology
- Abstract
We have previously shown that bilingual Spanish and English-learning infants can segment English iambs, two-syllable words with final stress (e.g., guiTAR), earlier than their monolingual peers. This is consistent with accelerated development in bilinguals and was attributed to bilingual infants' increased exposure to iambs through Spanish; about 10% of English content words start with an unstressed syllable, compared to 40% in Spanish. Here, we evaluated whether increased exposure to a stress pattern alone is sufficient to account for acceleration in bilingual infants. In English, 90% of content words start with a stressed syllable (e.g., KINGdom), compared to 60% in Spanish. However, we found no evidence for accelerated segmentation of Spanish trochees by Spanish-English bilingual infants compared to their monolingual Spanish-learning peers. Based on this finding, we argue that merely increased exposure to a linguistic feature in one language does not result in accelerated development in the other. Instead, only the acquisition of infrequent patterns in one language may be accelerated due to the additive effects of the other language.
- Published
- 2024
8. English‐learning infants developing sensitivity to vowel phonotactic cues to word segmentation.
- Author
-
Katsuda, Hironori and Sundara, Megha
- Subjects
-
*STATISTICAL bootstrapping, *ARTIFICIAL languages, *STATISTICAL learning, *STRESS (Linguistics), *PHONOTACTICS
- Abstract
Previous research has shown that when domain‐general transitional probability (TP) cues to word segmentation are in conflict with language‐specific stress cues, English‐learning 5‐ and 7‐month‐olds rely on TP, whereas 9‐month‐olds rely on stress. In two artificial languages, we evaluated English‐learning infants' sensitivity to TP cues to word segmentation vis‐a‐vis language‐specific vowel phonotactic (VP) cues (English words do not end in lax vowels). These cues were either consistent or conflicting. When these cues were in conflict, 10‐month‐olds relied on the VP cues, whereas 5‐month‐olds relied on TP. These findings align with statistical bootstrapping accounts, where infants initially use domain‐general distributional information for word segmentation, and subsequently discover language‐specific patterns based on segmented words. Research Highlights: Research indicates that when transitional probability (TP) conflicts with stress cues for word segmentation, English‐learning 9‐month‐olds rely on stress, whereas younger infants rely on TP. In two artificial languages, we evaluated English‐learning infants' sensitivity to TP versus vowel phonotactic (VP) cues for word segmentation. When these cues conflicted, 10‐month‐olds relied on VPs, whereas 5‐month‐olds relied on TP. These findings align with statistical bootstrapping accounts, where infants first utilize domain‐general distributional information for word segmentation, and then identify language‐specific patterns from segmented words. [ABSTRACT FROM AUTHOR]
- Published
- 2024
9. The Comprehensive Analysis of the Effect of Chinese Word Segmentation on Fuzzy-Based Classification Algorithms for Agricultural Questions.
- Author
-
Zhao, Xinyue, Huang, Jianing, Zhang, Jing, and Song, Yunsheng
- Subjects
NATURAL language processing, MEMBERSHIP functions (Fuzzy logic), ENCYCLOPEDIAS & dictionaries, CHINESE language, FEATURE extraction, CLASSIFICATION algorithms, FUZZY logic
- Abstract
Fuzzy logic is the core method for handling uncertainty and vagueness of information in agricultural natural language processing, and it also plays a crucial role in word segmentation and text classification algorithms using neural networks. Word segmentation is often the primary step in Chinese text classification tasks and has a profound effect on the generalization ability of fuzzy logic-based classification algorithms. However, the high structural complexity of text classification models and the specificity of agricultural data pose a great challenge to studying the effect of word segmentation. Although there have been several attempts to resolve this issue, the main effort focuses on word segmentation precision or the generalization performance of multiple word segmentation methods for the same classification algorithm and does not involve agricultural text. To solve this problem from the perspective of rational analysis and empirical analysis, a comprehensive analysis has been made to study the effect of Chinese word segmentation on fuzzy-based classification algorithms for agricultural questions. It initially discusses the characteristics of agricultural questions for the subsequent analysis of the field adaptability of word segmentation and classification algorithms, employs fuzzy logic to convert the Chinese word segmentation task into a sequence labeling problem, and then analyzes the characteristics, techniques, and performance disparities of the seven mainstream open-source Chinese word segmentation integration tools at the current stage. Subsequently, an exploration has been conducted into the impact of Chinese word segmentation on the generalization performance of classification algorithms under the proposed unified model framework for text classification based on fuzzy logic. Finally, many experiments have been performed on the actual data crawled from typical agricultural websites to empirically study the differences and robustness of the effect of different word segmentation tools on classification performance, as well as the contribution of the external dictionary. Comparative experimental results show which word segmentation tools have a solid effect on classification performance and a strong robust effect on the typical text feature extraction layer for classification tasks, and that the external dictionary has no significant effect on classification performance. The research results have essential reference significance for how to select appropriate word segmentation tools to deal with Chinese natural language processing tasks in the future. [ABSTRACT FROM AUTHOR]
- Published
- 2024
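The abstract above mentions converting Chinese word segmentation into a sequence labeling problem. The sketch below shows the widely used B/M/E/S character-tagging scheme as one plain illustration of that idea. It does not reproduce the fuzzy-logic formulation the authors describe, and the function names and the example phrase are assumptions made here.

```python
def words_to_bmes(words):
    """Map a segmented sentence to one tag per character:
    B = begin, M = middle, E = end of a multi-character word,
    S = single-character word."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild the word list from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Hypothetical agricultural phrase, segmented into three words.
segmented = ["水稻", "病虫害", "防治"]
tags = words_to_bmes(segmented)
restored = bmes_to_words("".join(segmented), tags)
```

Once a sentence is encoded this way, any per-character classifier can be trained to predict the tags, and decoding the predicted tags yields a segmentation.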
10. More cues or more languages? word segmentation using statistical learning in multilinguals, bilinguals, and monolinguals.
- Author
-
Tachakourt, Yasmine and Rassili, Outhmane
- Subjects
LANGUAGE & languages, MONOLINGUALISM, MULTILINGUALISM, TONE (Phonetics), ACQUISITION of data
- Abstract
This study aims to extend statistical learning (SL) research to multilinguals and provide an insight into what could facilitate word segmentation. We studied how the number of cues available in the input as well as the number of languages spoken influence SL and word segmentation. We used two SL tasks: one involving the tracking of transitional probabilities (TPs) between syllables of words, and another involving the tracking of two congruent cues, syllables and tones, in an artificial tone language. Data was collected from monolinguals, bilinguals, trilinguals, and quadrilinguals. Our results indicate that all language groups demonstrated similar SL capacity when segmenting words using TPs of syllables. However, when an additional cue was added, bilinguals, trilinguals, and quadrilinguals outperformed monolinguals. Interestingly, quadrilinguals also outperformed bilinguals. Performance was best for all groups when the input afforded two cues. This study suggests that while experience with multiple languages does not affect core SL ability, it enhances the tracking of multiple cues. The study further indicates that SL is affected by the number of cues available in the input as we found that performance was facilitated by the presence of two congruent cues. [ABSTRACT FROM AUTHOR]
- Published
- 2024
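11. Segmentación en palabras de textos escritos por alumnos de primer curso de Educación Primaria: un estudio en el contexto español [Word segmentation of texts written by first-year Primary Education students: a study in the Spanish context].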
- Author
-
Gutiérrez Cáceres, Rafaela
- Subjects
PROGRAMMING languages, COMMUNICATIVE competence, PRIMARY education, SPANISH language, EDUCATION research
- Abstract
Copyright of Revista Complutense de Educación is the property of Universidad Complutense de Madrid and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
12. Unfolding Prosody Guides the Development of Word Segmentation.
- Author
-
Frota, Sónia, Severino, Cátia, and Vigário, Marina
- Subjects
SPEECH, PRODUCTION planning, PROSODIC analysis (Linguistics), INFANTS, VOCABULARY
- Abstract
Prosody is known to scaffold the learning of language, and thus understanding prosodic development is vital for language acquisition. The present study explored the unfolding prosody model of prosodic development (proposed by Frota et al. in 2016) beyond early production data, to examine whether it predicted the development of early segmentation abilities. European Portuguese-learning infants aged between 5 and 17 months were tested in a series of word segmentation experiments. Developing prosodic structure was evidenced in word segmentation as proposed by the unfolding model: (i) a simple monosyllabic word shape crucially placed at a major prosodic edge was segmented first, before more complex word shapes under similar prosodic conditions; (ii) the segmentation of more complex words was easier at a major prosodic edge than in phrase-medial position; and (iii) the segmentation of complex words with an iambic pattern preceded the segmentation of words with a trochaic pattern. These findings demonstrated that word segmentation evolved with unfolding prosody, suggesting that the prosodic units developed in the unfolding process are used both as speech production planning units and to extract word-forms from continuous speech. Therefore, our study contributes to a better understanding of the mechanisms underlying word segmentation, and to a better understanding of early prosodic development, a cornerstone of language acquisition. [ABSTRACT FROM AUTHOR]
- Published
- 2024
13. Chinese readers utilize emotion information for word segmentation.
- Author
-
Huang, Linjieqiong, Zhang, Xiangyang, and Li, Xingshan
- Subjects
-
*CHINESE language, *INFORMATION processing, *EMOTIONS, *VOCABULARY
- Abstract
We reported a large-scale Internet-based experiment to investigate the impact of emotion information on Chinese word segmentation, in which participants completed an overlapping ambiguous string (OAS) segmentation task and the Chinese version of Beck Depression Inventory-II in a counterbalanced order. OAS is a three-character string (ABC) in which the middle character can form a distinct word with both the character on its left side (word AB) and the character on its right side (word BC). Participants were presented with isolated OASs and were asked to report the word they identified first. Emotional OAS was constructed by a combination of a neutral word and an emotional word, with the neutral and emotional words sharing character B. We orthogonally manipulated the valence of the emotional words (positive vs. negative) and their position in the OAS (left-side vs. right-side). The results showed that compared with neutral words, both positive and negative words were more likely to be segmented, and this segmentation outcome was not affected by readers with different depression tendencies. These findings suggest that emotion information can influence word segmentation, and that both positive and negative words take precedence over neutral words in the word segmentation process. This study provides a new perspective and evidence to understand the impact of emotion information on word processing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
14. Ham or hamster? Eye-tracking evidence of a clear speech benefit for word segmentation in quiet and in noise.
- Author
-
Guo, Zhe-chen and Smiljanic, Rajka
- Subjects
-
*LANGUAGE & languages, *NOISE, *SOUND, *HAMSTERS, *SCIENTIFIC observation, *VERBAL behavior testing, *DESCRIPTIVE statistics, *SPEECH perception, *ARTICULATION (Speech), *EYE movements
- Abstract
This study examined whether intelligibility-enhancing hyperarticulated clear speaking styles improve word segmentation during real-time speech processing in quiet and in noise. English-speaking listeners heard clearly and conversationally spoken sentences in which the target (e.g. ham) was temporarily ambiguous with a competitor (e.g. hamster) across a word boundary (e.g. ham starting) while their eye fixations to target and competitor images were recorded. Relative to conversational speech, clear speech led listeners to fixate the target image over the competitor image to a greater degree, indicating facilitation of word segmentation. Such facilitation emerged in quiet and in noise even before disambiguating segmental information (e.g. /ɑ/ in starting) was available. A parallel clear speech benefit was not found when the disyllabic word (e.g. hamster) was the target. The findings suggest that improved word segmentation partly underlies the well-documented clear speech perceptual and cognitive benefits and may arise from the enhancements of multiple word boundary cues. [ABSTRACT FROM AUTHOR]
- Published
- 2024
15. Statistical word segmentation succeeds given the minimal amount of exposure.
- Author
-
Wang, Felix Hao, Luo, Meili, and Wang, Suiping
- Subjects
-
*LANGUAGE acquisition, *SPEECH, *STATISTICAL learning
- Abstract
One of the first tasks in language acquisition is word segmentation, a process to extract word forms from continuous speech streams. Statistical approaches to word segmentation have been shown to be a powerful mechanism, in which word boundaries are inferred from sequence statistics. This approach requires the learner to represent the frequency of units from syllable sequences, though accounts differ on how much statistical exposure is required. In this study, we examined the computational limit with which words can be extracted from continuous sequences. First, we discussed why two occurrences of a word in a continuous sequence is the computational lower limit for this word to be statistically defined. Next, we created short syllable sequences that contained certain words either two or four times. Learners were presented with these syllable sequences one at a time, immediately followed by a test of the novel words from these sequences. We found that, with the computationally minimal amount of two exposures, words were successfully segmented from continuous sequences. Moreover, longer syllable sequences providing four exposures to words generated more robust learning results. The implications of these results are discussed in terms of how learners segment and store the word candidates from continuous sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2024
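Several abstracts in this list refer to inferring word boundaries from sequence statistics such as transitional probabilities (TPs) between syllables. The sketch below is a generic illustration of that idea, not the procedure of any listed study. It estimates TP(a→b) as count(ab) / count(a) and places a boundary wherever the TP falls below a threshold. The three-syllable "words", the stream, and the threshold value are all hypothetical.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Estimate TP(a -> b) = count(a followed by b) / count(a as a first element)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def segment_by_tp(syllables, threshold):
    """Split a syllable stream into words at transitions whose TP is
    below `threshold`."""
    if not syllables:
        return []
    tps = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))  # low TP: close the current word
            current = []
        current.append(b)
    words.append("".join(current))
    return words


# Hypothetical artificial language with three words, each heard at least twice.
stream = ("tu pi ro go la bu tu pi ro bi da ku "
          "go la bu bi da ku tu pi ro").split()
words = segment_by_tp(stream, threshold=0.75)
```

In this stream every within-word transition has a TP of 1.0 and every between-word transition has a TP of 0.5, so any threshold between those values recovers the seven word tokens.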
16. A Comparative Evaluation of Thai Word Segmentation Techniques for Profanity Classification
- Author
-
Prachuabsupakij, Wanthanee, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, and Ma, Yongsheng, editor
- Published
- 2024
17. An Automatic Process of Online Handwriting Recognition and Its Challenges
- Author
-
Mamta, Singh, Gurpreet, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Pastor-Escuredo, David, editor, Brigui, Imene, editor, Kesswani, Nishtha, editor, Bordoloi, Sushanta, editor, and Ray, Ashok Kumar, editor
- Published
- 2024
18. NLP-Based Processing of Gujarati Compound Word Sandhi’s Generation and Segmentation
- Author
-
Patel, Nitesh G., Patel, Dhiren B., Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Rathore, Vijay Singh, editor, Tavares, Joao Manuel R. S., editor, Surendiran, B., editor, and Yadav, Anil, editor
- Published
- 2024
19. Cross-Language Text Search Algorithm Based on Context-Compatible Algorithms
- Author
-
Sheng, Jianqiao, Zhang, Liang, Wang, Xiyin, Xu, Run, Wu, Jiaqi, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Tan, Ying, editor, and Shi, Yuhui, editor
- Published
- 2024
20. Generative Byte-Level Models for Restoring Spaces, Punctuation, and Capitalization in Multiple Languages
- Author
-
Dyer, Laurence, Hughes, Anthony, Can, Burcu, Celebi, Emre, Series Editor, Chen, Jingdong, Series Editor, Gopi, E. S., Series Editor, Neustein, Amy, Series Editor, Liotta, Antonio, Series Editor, Di Mauro, Mario, Series Editor, and Abbas, Mourad, editor
- Published
- 2024
21. Word Segmentation of Hiragana Sentences Using Hiragana BERT
- Author
-
Izutsu, Jun, Komiya, Kanako, Shinnou, Hiroyuki, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Fenrong, editor, Sadanandan, Arun Anand, editor, Pham, Duc Nghia, editor, Mursanto, Petrus, editor, and Lukose, Dickson, editor
- Published
- 2024
22. UnifiedCut: A Simple and Efficient Neural Model for Thai, Burmese and Khmer Word Segmentation
- Author
-
Yonghua Wen, Yantuan Xian, Yuehan Wang, and Zhengtao Yu
- Subjects
word segmentation, transformer encoder, multiple n-grams, Thai, Burmese, Khmer, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
- Abstract
Word segmentation is a critical task in natural language processing for Southeast Asian Abugida languages, including Thai, Burmese, and Khmer. Existing approaches demonstrate that models using fixed-length windowed context inputs can achieve high segmentation accuracy; however, they often rely on low-level character features or language-specific preprocessing. Character-based methods can limit feature learning, while language-specific features add complexity due to specialized preprocessing requirements. This paper introduces UnifiedCut, which is a neural model that leverages multiple n-grams within a windowed multi-head attention mechanism. This design captures segmentation features from local contexts and multi-perspective n-gram inputs, enhancing generalization and recall, particularly for out-of-vocabulary words. Compared to CNN- and RNN-based approaches, UnifiedCut’s multi-head attention enables finer-grained feature extraction and greater parallelism, resulting in a faster, more scalable solution. Comprehensive experiments on public datasets for Thai, Burmese, and Khmer show that UnifiedCut achieves state-of-the-art performance in word segmentation.
- Published
- 2024
23. Vowel Harmony in Language Acquisition
- Author
-
Goad, Heather, Ozburn, Avery, van der Hulst, Harry, book editor, and Ritter, Nancy A., book editor
- Published
- 2024
24. A review on handwritten text segmentation in Indian languages
- Author
-
Moitra, Moumita and Saha, Sujan Kumar
- Published
- 2024
25. Efficient word segmentation for enhancing Chinese spelling check in pre-trained language model
- Author
-
Li, Fangfang, Jiang, Jie, Tang, Dafu, Shan, Youran, Duan, Junwen, and Zhang, Shichao
- Published
- 2024
26. Efficient Word Segmentation Is Preserved in Older Adult Readers: Evidence From Eye Movements During Chinese Reading.
- Author
-
Li, Lin, Bao, Lingshan, Li, Zhuoer, Li, Sha, Liu, Jingyi, Wang, Pin, Warrington, Kayleigh L., Gunn, Sarah, and Paterson, Kevin B.
- Abstract
College-aged readers use efficient strategies to segment and recognize words in naturally unspaced Chinese text. Whether this capability changes across the adult lifespan is unknown, although segmenting words in unspaced text may be challenging for older readers due to visual and cognitive declines in older age, including poorer parafoveal processing of upcoming characters. Accordingly, we conducted two eye movement experiments to test for age differences in word segmentation, each with 48 young (18–30 years) and 36 older (65+ years) native Chinese readers. Following Zhou and Li (2021), we focused on the processing of "incremental" three-character words, like 幼儿园 (meaning "kindergartens"), which contain an embedded two-character word (e.g., 幼儿, meaning "children"). In Experiment 1, either the three-character word or its embedded word was presented as the target word in sentence contexts where the three-character word always was plausible, and the embedded word was either plausible or implausible. Both age groups produced similar plausibility effects, suggesting age constancy in accessing the embedded word early during ambiguity processing before ultimately assigning an incremental word analysis. Experiment 2 provided further evidence that both younger and older readers access the embedded word early during ambiguity processing, but rapidly select the appropriate (incremental) word. Crucially, the findings suggest that word segmentation strategies do not differ with age. Public Significance Statement: The ability to segment unspaced text into words is fundamental for reading in writing systems that do not use spaces to indicate the boundaries between words in text. This includes character-based scripts, like Chinese and Japanese; alphabetic scripts that do not contain spacing, like Thai; and languages in which long words are composed of multiple units of meaning, like Finnish. Given typical age-related declines in perception and some aspects of cognition, our study focused on whether older readers have difficulty relative to the young in segmenting unspaced text into words. The findings suggest that the ability to segment words is retained by older readers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
27. WORD SEGMENTATION SKILL IN INFANTS AND ITS INFLUENCE ON VOCABULARY DEVELOPMENT: A REVIEW.
- Author
-
Hanbay, Orhan
- Subjects
LANGUAGE acquisition, EYE contact, INFANTS, SPEECH, INFANT development, COGNITIVE development
- Abstract
Copyright of Route Educational & Social Science Journal (Ress Journal) is the property of Ress Academy Publishing and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
28. Machine learning and data analysis for word segmentation of classical Chinese poems: illustrations with Tang and Song examples.
- Author
-
Liu, Chao-Lin, Chang, Wei-Ting, Chu, Chang-Ting, and Zheng, Ti-Yong
- Subjects
-
*CHINESE language, *MACHINE learning, *DATA analysis, *POETRY (Literary form), *CHINESE literature, *MUSIC charts
- Abstract
Words are essential parts for understanding classical Chinese poems. We report a collection of 32,399 classical Chinese poems that were annotated with word boundaries. Statistics about the annotated poems support a few heuristic experiences, including the patterns of lines and a practice for the parallel structures (對仗), that researchers of Chinese literature discuss in the literature. The annotators were affiliated with two universities, so they could annotate the poems as independently as possible. Results of an inter-rater agreement study indicate that the annotators have consensus over the identified words 93 per cent of the time and have perfect consensus for the segmentation of a poem 42 per cent of the time. We applied unsupervised classification methods to annotate the poems in several different settings, and evaluated the results with human annotations. Under favorable conditions, the classifier identified about 88 per cent of the words, and segmented poems perfectly 22 per cent of the time. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
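The two agreement figures quoted in the abstract of entry 28 (agreement over identified words, and perfect agreement on a whole poem) can be illustrated with a small sketch. The abstract does not define how agreement was computed, so the measure below (shared word spans divided by all word spans proposed by either annotator) is an assumption, and both segmentations of the sample line are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def word_agreement(seg_a, seg_b):
    """Assumed measure: spans chosen by both annotators / spans chosen by either."""
    a, b = word_spans(seg_a), word_spans(seg_b)
    return len(a & b) / len(a | b)

def perfect_agreement(seg_a, seg_b):
    """True when both annotators segmented the line identically."""
    return seg_a == seg_b

# Two hypothetical annotations of the same five-character line
annotator_a = ["床前", "明月", "光"]
annotator_b = ["床前", "明月光"]

print(word_agreement(annotator_a, annotator_b))     # 1 shared span out of 4 distinct spans
print(perfect_agreement(annotator_a, annotator_b))  # the two segmentations differ
```

Averaging `word_agreement` over all lines, and counting the share of poems for which `perfect_agreement` holds on every line, would give two corpus-level figures of the kind the abstract reports.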
29. Older Adults and Their Families' Online Reviews of Urban Nursing Homes in China.
- Author
-
Yuan, Hao, Shen, Xiumei, Kong, Guangyan, and Duan, Chenhao
- Subjects
- *
EVALUATION of medical care , *MEDICAL quality control , *CONTENT analysis , *INTERNET , *DESCRIPTIVE statistics , *NURSING care facilities , *FAMILY attitudes , *THEMATIC analysis , *METROPOLITAN areas , *RESEARCH methodology , *PATIENT satisfaction , *PATIENTS' attitudes , *REGRESSION analysis , *OLD age - Abstract
Background and Objectives Social media has made online care facility reviews popular. By analyzing online reviews of nursing homes (NHs), managers and designers can acquire insight into the perceptions of the older adults and their families. This study aims to help improve the care and environment of NHs. Research Design and Methods This study employed a mixed-methods approach to analyze online NH reviews. An inductive thematic analysis was utilized to identify and develop themes, followed by a detailed content analysis using Jieba, a Python-based program. Also, a regression analysis was conducted between the sentiment level of each subtheme and the final star ratings. Results Online reviews of NHs could be classified into 6 main themes, 18 subthemes, and 53 initial themes. Among the main themes, "service quality" received the most reviews, followed by "physical space environment." Of the 53 initial themes, "attitude and caring" received the most feedback, followed by "general impression of the space environment," and "meals and nutrition." Regression analysis using 18 subthemes revealed that, except for the "facility scale," all 17 subthemes were significantly connected with the final star rating. "Personal and property security" had the highest regression coefficients, followed by "service attitude" and "space." Discussion and Implications Online reviews provide a valuable supplement to conventional NH quality assessment criteria, enhancing person-centered care delivery. Based on the findings, recommendations for NH management and design are proposed to improve care quality, environment, and satisfaction for older adults and families. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. A Changing Role for Transitional Probabilities in Word Learning During the Transition to Toddlerhood?
- Author
-
Lany, Jill, Karaman, Ferhat, and Hay, Jessica F.
- Subjects
- *
COGNITION in children , *SPEECH perception in children , *LANGUAGE acquisition , *LEARNING strategies , *COMPARATIVE studies , *PROBABILITY theory - Abstract
Infants' sensitivity to transitional probabilities (TPs) supports language development by facilitating mapping high-TP (HTP) words to meaning, at least up to 18 months of age. Here we tested whether this HTP advantage holds as lexical development progresses, and infants become better at forming word–referent mappings. Two groups of 24-month-olds (N = 64 and all White, tested in the United States) first listened to Italian sentences containing HTP and low-TP (LTP) words. We then used HTP and LTP words, and sequences that violated these statistics, in a mapping task. Infants learned HTP and LTP words equally well. They also learned LTP violations as well as LTP words, but learned HTP words better than HTP violations. Thus, by 2 years of age sensitivity to TPs does not lead to an HTP advantage but rather to poor mapping of violations of HTP word forms. Public Significance Statement: Learning words is a fundamental aspect of early language development. This experiment sheds light on how the mechanisms that support word learning change across time, and suggests that experience with patterns in speech relevant to finding word forms play an important role in mapping word forms to meaning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
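Entries 30 and 31 both rest on transitional probabilities (TPs), that is, the probability that one syllable follows another in the speech stream. As a minimal sketch, assuming the standard forward definition TP(a → b) = count(a b) / count(a), the statistic can be estimated as follows. The syllable stream is a made-up toy example, not stimulus material from either study.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Return {(a, b): P(b | a)} estimated from adjacent pairs in the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Toy stream built from two invented words, "bida" and "kupa", in varying order
stream = "bi da ku pa bi da bi da ku pa ku pa bi da".split()
tps = transitional_probabilities(stream)

for pair, p in sorted(tps.items()):
    print(pair, round(p, 2))
```

In this toy stream the within-word transitions (`bi` → `da`, `ku` → `pa`) have a TP of 1.0, while transitions that span a word boundary are lower, which is the contrast that high-TP and low-TP items exploit.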
31. Can Infants Retain Statistically Segmented Words and Mappings Across a Delay?
- Author
-
Karaman, Ferhat, Lany, Jill, and Hay, Jessica F.
- Abstract
Infants are sensitive to statistics in spoken language that aid word‐form segmentation and immediate mapping to referents. However, it is not clear whether this sensitivity influences the formation and retention of word‐referent mappings across a delay, two real‐world challenges that learners must overcome. We tested how the timing of referent training, relative to familiarization with transitional probabilities (TPs) in speech, impacts English‐learning 23‐month‐olds' ability to form and retain word‐referent mappings. In Experiment 1, we tested infants' ability to retain TP information across a 10‐min delay and use it in the service of word learning. Infants successfully mapped high‐TP but not low‐TP words to referents. In Experiment 2, infants readily mapped the same words even when they were unfamiliar. In Experiment 3, high‐ and low‐TP word‐referent mappings were trained immediately after familiarization, and infants readily remembered these associations 10 min later. In sum, although 23‐month‐old infants do not need strong statistics to map word forms to referents immediately, or to remember those mappings across a delay, infants are nevertheless sensitive to these statistics in the speech stream, and they influence mapping after a delay. These findings suggest that, by 23 months of age, sensitivity to statistics in speech may impact infants' language development by leading word forms with low coherence to be poorly mapped following even a short period of consolidation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Does the speaker's eye gaze facilitate infants' word segmentation from continuous speech? An ERP study.
- Author
-
Çetinçelik, Melis, Rowland, Caroline F., and Snijders, Tineke M.
- Abstract
The environment in which infants learn language is multimodal and rich with social cues. Yet, the effects of such cues, such as eye contact, on early speech perception have not been closely examined. This study assessed the role of ostensive speech, signalled through the speaker's eye gaze direction, on infants' word segmentation abilities. A familiarisation‐then‐test paradigm was used while electroencephalography (EEG) was recorded. Ten‐month‐old Dutch‐learning infants were familiarised with audio‐visual stories in which a speaker recited four sentences with one repeated target word. The speaker addressed them either with direct or with averted gaze while speaking. In the test phase following each story, infants heard familiar and novel words presented via audio‐only. Infants' familiarity with the words was assessed using event‐related potentials (ERPs). As predicted, infants showed a negative‐going ERP familiarity effect to the isolated familiarised words relative to the novel words over the left‐frontal region of interest during the test phase. While the word familiarity effect did not differ as a function of the speaker's gaze over the left‐frontal region of interest, there was also a (not predicted) positive‐going early ERP familiarity effect over right fronto‐central and central electrodes in the direct gaze condition only. This study provides electrophysiological evidence that infants can segment words from audio‐visual speech, regardless of the ostensiveness of the speaker's communication. However, the speaker's gaze direction seems to influence the processing of familiar words. 
Research Highlights: We examined 10‐month‐old infants' ERP word familiarity response using audio‐visual stories, in which a speaker addressed infants with direct or averted gaze while speaking. Ten‐month‐old infants can segment and recognise familiar words from audio‐visual speech, indicated by their negative‐going ERP response to familiar, relative to novel, words. This negative‐going ERP word familiarity effect was present for isolated words over left‐frontal electrodes regardless of whether the speaker offered eye contact while speaking. An additional positivity in response to familiar words was observed for direct gaze only, over right fronto‐central and central electrodes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. Research on multi-granularity password analysis based on LLM
- Author
-
Meng HONG, Weidong QIU, Yangde WANG
- Subjects
large language model ,password analysis ,natural language processing ,word segmentation ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Password-based authentication has been widely used as the primary authentication mechanism. However, occasional large-scale password leaks have highlighted the vulnerability of passwords to risks such as guessing or theft. In recent years, research on password analysis using natural language processing techniques has progressed, treating passwords as a special form of natural language. Nevertheless, limited studies have investigated the impact of password text segmentation granularity on the effectiveness of password analysis with large language models. A multi-granularity password-analyzing framework was proposed based on a large language model, which follows the pre-training paradigm and autonomously learns prior knowledge of password distribution from large unlabelled datasets. The framework comprised three modules: the synchronization network, backbone network, and tail network. The synchronization network module implemented char-level, template-level, and chunk-level password segmentation, extracting knowledge on character distribution, structure, word chunk composition, and other password features. The backbone network module constructed a generic password model to learn the rules governing password composition. The tail network module generated candidate passwords for guessing and analyzing target databases. Experimental evaluations were conducted on eight password databases including Tianya and Twitter, analyzing and summarizing the effectiveness of the proposed framework under different language environments and word segmentation granularities. The results indicate that in Chinese user scenarios, the performance of the password-analyzing framework based on char-level and chunk-level segmentation is comparable, and significantly superior to the framework based on template-level segmentation. In English user scenarios, the framework based on chunk-level segmentation demonstrates the best password-analyzing performance.
- Published
- 2024
- Full Text
- View/download PDF
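The abstract of entry 33 names three segmentation granularities (char-level, template-level, chunk-level) without defining them. The sketch below shows one plausible reading and should be treated as an assumption throughout: "template" is taken to mean a run-length pattern over letter, digit, and symbol classes, and "chunk" is taken to mean maximal runs of one class. The paper's own definitions may differ. The function names and the sample string are hypothetical.

```python
from itertools import groupby

def char_class(ch):
    """Map a character to an assumed class label: L(etter), D(igit), S(ymbol)."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    return "S"

def segment(text):
    """Segment one string at three assumed granularities."""
    # Group consecutive characters that share a class into runs
    runs = [(cls, "".join(group)) for cls, group in groupby(text, key=char_class)]
    return {
        "char": list(text),
        "template": [f"{cls}{len(run)}" for cls, run in runs],
        "chunk": [run for _, run in runs],
    }

print(segment("love1314!"))
```

Each granularity yields a different token sequence for the same string, which is the kind of input variation whose effect on a language model the abstract describes comparing.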
34. Enhancing Sindhi Word Segmentation Using Subword Representation Learning and Position-Aware Self-Attention
- Author
-
Wazir Ali, Jay Kumar, Saifullah Tumani, Redhwan Nour, Adeeb Noor, and Zenglin Xu
- Subjects
Attention mechanism ,neural network ,long short-term memory ,representation learning ,word segmentation ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It’s cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
- Published
- 2024
- Full Text
- View/download PDF
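The abstract of entry 34 frames word segmentation as a sequence labeling task. A minimal sketch of that framing, assuming the common B/M/E/S tag scheme (begin, middle, end, single-character word), is given below. The abstract does not say which tag set SGNWS uses, so the scheme and the function names are illustrative only, and a Latin-script toy string stands in for Sindhi text.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one B/M/E/S tag per character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Decode a tag sequence back into words; a word closes on E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without a closing tag
        words.append(current)
    return words

gold = ["the", "cat", "a"]
tags = words_to_tags(gold)
print(tags)
print(tags_to_words("".join(gold), tags))
```

A tagger such as a BiLSTM with a CRF layer predicts the tag sequence from the unsegmented characters, and `tags_to_words` turns the prediction back into a segmentation.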
35. Evaluating Neural Network Models for Word Segmentation in Agglutinative Languages: Comparison With Rule-Based Approaches and Statistical Models
- Author
-
William Villegas-Ch, Rommel Gutierrez, Alexandra Maldonado Navarro, and Aracely Mera-Navarrete
- Subjects
Word segmentation ,agglutinative languages ,neural networks ,natural language processing (NLP) ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Word segmentation in agglutinative languages presents significant challenges due to morphological complexity and variability of linguistic structure. Although practical, traditional rule-based and statistical model-based approaches show limitations in handling these complexities. This study investigates the effectiveness of neural network models, specifically LSTM, Bi-LSTM with CRF, and BERT, in comparison to these traditional methods, using datasets from several agglutinative languages such as Turkish, Finnish, Hungarian, Nahuatl, and Swahili. The methodology includes preprocessing and data augmentation to improve data quality and consistency, followed by training and evaluating the selected models. The results reveal that the neural network models significantly outperform rule-based and statistical model-based approaches on all metrics assessed. Specifically, the BERT model achieved 92% accuracy and a 91% F1-score in Turkish, compared to 70% and 67%, respectively, for the rule-based models. Moreover, the Bi-LSTM with CRF showed 86% recall in Finnish, significantly outperforming traditional models. Implementing advanced preprocessing and data augmentation techniques allows for optimizing the performance of the models. This study confirms the effectiveness of neural network models in word segmentation and provides a valuable framework for future research in natural language processing in complex linguistic contexts.
- Published
- 2024
- Full Text
- View/download PDF
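Entry 35 reports precision-style metrics (recall, F1-score) for segmentation. A minimal sketch of how such scores can be computed is shown below, assuming a segment counts as correct only when both its start and end offsets match the gold standard. The abstract does not state the exact matching criterion, and the predicted segmentation here is invented for illustration.

```python
def spans(segments):
    """Convert a list of segments into a set of (start, end) character spans."""
    out, pos = set(), 0
    for s in segments:
        out.add((pos, pos + len(s)))
        pos += len(s)
    return out

def precision_recall_f1(gold, predicted):
    """Score a predicted segmentation against a gold one by exact span match."""
    g, p = spans(gold), spans(predicted)
    true_pos = len(g & p)
    precision = true_pos / len(p) if p else 0.0
    recall = true_pos / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Turkish "evlerimizde": ev + ler + imiz + de; the prediction is hypothetical
gold = ["ev", "ler", "imiz", "de"]
predicted = ["ev", "lerimiz", "de"]
print(precision_recall_f1(gold, predicted))
```

Here two of the three predicted segments match the gold spans, so precision is 2/3, recall is 2/4, and F1 is their harmonic mean, 4/7.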
36. Unfolding Prosody Guides the Development of Word Segmentation
- Author
-
Sónia Frota, Cátia Severino, and Marina Vigário
- Subjects
prosodic development ,prosodic structure ,word segmentation ,prosodic edge ,monosyllabic words ,bisyllabic words ,Language and Literature - Abstract
Prosody is known to scaffold the learning of language, and thus understanding prosodic development is vital for language acquisition. The present study explored the unfolding prosody model of prosodic development (proposed in the 2016 study by Frota et al.) beyond early production data, to examine whether it predicted the development of early segmentation abilities. European Portuguese-learning infants aged between 5 and 17 months were tested in a series of word segmentation experiments. Developing prosodic structure was evidenced in word segmentation as proposed by the unfolding model: (i) a simple monosyllabic word shape crucially placed at a major prosodic edge was segmented first, before more complex word shapes under similar prosodic conditions; (ii) the segmentation of more complex words was easier at a major prosodic edge than in phrase-medial position; and (iii) the segmentation of complex words with an iambic pattern preceded the segmentation of words with a trochaic pattern. These findings demonstrated that word segmentation evolved with unfolding prosody, suggesting that the prosodic units developed in the unfolding process are used both as speech production planning units and to extract word-forms from continuous speech. Therefore, our study contributes to a better understanding of the mechanisms underlying word segmentation, and to a better understanding of early prosodic development, a cornerstone of language acquisition.
- Published
- 2024
- Full Text
- View/download PDF
37. Heuristic-based text segmentation of bilingual handwritten documents for Gurumukhi-Latin scripts.
- Author
-
Kaur, Sukhandeep, Bawa, Seema, and Kumar, Ravinder
- Subjects
SCRIPTS ,ENGLISH language ,LATIN language ,PATTERN recognition systems ,MARKOV random fields - Abstract
This paper focuses on the segmentation of unconstrained handwritten documents containing bilingual data at the word level. Most of the official documents available in India are bilingual, i.e., in the regional language as well as English language. For the purpose of current research work, the handwritten documents from the academic domain containing Punjabi from Gurumukhi script and English language from Latin script have been considered. A heuristic approach based on projection profiles, statistical and structural properties of script has been designed to segment the textual data of documents. The proposed approach can segment the closed, curved, skewed and touched text lines having large variations in pattern and size. For word segmentation, an end-point detection algorithm has been designed to segment the words with intra word gap. The proposed approaches have been evaluated by designing a domain-based bilingual handwritten dataset after having consultations with academicians, and experts in the field. For text line segmentation, an average accuracy of 95.89% and 92.95% has been achieved for IAM and bilingual dataset respectively. However, for word segmentation, there has been an accuracy of 89.74% and 92.25% respectively for IAM and bilingual dataset. As many as 280 documents with various writing styles and content have been selected for the purpose. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
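Entry 37 describes a heuristic built on projection profiles. As a rough illustration of the general idea only (the paper's end-point detection algorithm is not described in enough detail to reproduce), the sketch below splits a binarized text line into words wherever the vertical projection profile shows a run of empty columns at least `min_gap` wide. The image, the threshold, and the function name are hypothetical.

```python
def segment_words(line, min_gap=2):
    """Split a binary text-line image (rows of 0/1) into word column ranges.

    Returns half-open (start, end) column ranges. Gaps narrower than
    min_gap are treated as intra-word gaps and do not split a word.
    """
    profile = [sum(col) for col in zip(*line)]  # ink pixels per column
    words, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The last inked column was x - gap, so the word ends after it
                words.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        words.append((start, len(profile) - gap))
    return words

# Toy 2 x 10 image: a one-column gap inside the first word, a three-column gap between words
line = [
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
]
print(segment_words(line))
```

The choice of `min_gap` is what separates intra-word gaps from inter-word gaps. In handwriting it would normally be derived from the line's own gap statistics rather than fixed in advance.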
38. Research on multi-granularity password analysis based on LLM.
- Author
-
洪萌, 邱卫东, and 王杨德
- Abstract
Copyright of Chinese Journal of Network & Information Security is the property of Beijing Xintong Media Co., Ltd. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
39. Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language.
- Author
-
Nazir, Shahzad, Asif, Muhammad, Rehman, Mariam, and Ahmad, Shahbaz
- Abstract
In text applications, pre-processing is deemed a significant step for enhancing the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal procedures of text pre-processing whose importance cannot be overstated. Text normalization refers to transforming raw text into scriptural standardized text, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most spoken languages in the world. However, the world's 10th most widely spoken language has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, separating digits, etc. For word tokenization, core features are defined and extracted for each character of the text. A machine learning model is used with specified handcrafted rules to predict the space and to tokenize the text. This experiment is performed while creating the largest human-annotated dataset composed in Urdu script, covering five different domains. The results have been evaluated using precision, recall, F-measure, and accuracy. Further, the results are compared with the state of the art. The normalization approach produced a 20% improvement and the tokenization approach a 6% improvement. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
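Two of the normalization steps named in the abstract of entry 39, removing diacritics and separating digits, can be sketched with the Python standard library. The rules below are assumptions made for illustration and are not the regular expressions proposed in the paper: diacritics are approximated as Unicode nonspacing marks (category `Mn`), and digit separation is taken to mean inserting a space between a digit and an adjacent letter.

```python
import re
import unicodedata

def normalize(text):
    """Apply two illustrative normalization rules to a string."""
    # Drop nonspacing combining marks, which include Arabic-script diacritics
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Insert a space at every boundary between a digit and a letter
    text = re.sub(r"(?<=\d)(?=[^\W\d_])|(?<=[^\W\d_])(?=\d)", " ", text)
    # Collapse any resulting runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

# A word carrying one diacritic (U+0650), written directly against a number
sample = "\u06a9\u0650\u062a\u0627\u0628" + "2024"
print(normalize(sample))
```

Filtering on category `Mn` is a blunt instrument. A production normalizer would list the specific marks to remove so that marks carrying meaning in the target orthography are preserved.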
40. Creation of An Intelligent System for Uzbek Language Teaching Using Phoneme-Based Speech Recognition.
- Author
-
Ibragimova, Sayyora
- Subjects
ARTIFICIAL intelligence ,SPEECH perception ,AUTOMATIC speech recognition ,NEUROLINGUISTICS ,TURKIC languages ,SPEECH ,WAVELET transforms - Abstract
The recent surge in interest in learning the Uzbek language among foreigners has underscored the need for innovative teaching tools. Despite the limited studies on intelligent systems for phonemic speech recognition in the Uzbek context, this research aimed to address this gap. The purpose of this study was to create an intelligent system for teaching the Uzbek language as a foreign language based on the technology of phonemic recognition of speech signals. An intelligent system for Uzbek language instruction was developed using phonemic speech recognition technology. The approach utilized various methods, including pinpointing challenging phonemes, comparative data analyses, and analytical-synthetic breakdowns of linguistic components, all enhanced by the wavelet transform's signal refinement. The system's precision in recognizing speech signals phoneme-by-phoneme, emphasizing difficult sounds for learners, promises broader AI-driven language study applications. Specifically designed for the Uzbek language, the system achieves an accuracy range of 67% to 95%. This breakthrough not only propels AI-driven language processing but also offers a robust tool for improving Uzbek language instruction, especially beneficial for the Turkic language group. Future avenues include its use in computer modeling and automatic speech processing for Turkic languages, solidifying its innovative contribution to AI-driven language teaching. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
41. Normalized dataset for Sanskrit word segmentation and morphological parsing
- Author
-
Krishnan, Sriram, Kulkarni, Amba, and Huet, Gérard
- Published
- 2024
- Full Text
- View/download PDF
42. Reducing Approximation and Estimation Errors with Heterogeneous Annotations
- Author
-
Sun, Weiwei, Ide, Nancy, Series Editor, Huang, Chu-Ren, editor, Hsieh, Shu-Kai, editor, and Jin, Peng, editor
- Published
- 2023
- Full Text
- View/download PDF
43. Practical and Robust Chinese Word Segmentation and PoS Tagging
- Author
-
Huang, Chu-Ren, Ide, Nancy, Series Editor, Huang, Chu-Ren, editor, Hsieh, Shu-Kai, editor, and Jin, Peng, editor
- Published
- 2023
- Full Text
- View/download PDF
44. A Review of Various Line Segmentation Techniques Used in Handwritten Character Recognition
- Author
-
Joseph, Solley, George, Jossy, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Joshi, Amit, editor, Mahmud, Mufti, editor, and Ragel, Roshan G., editor
- Published
- 2023
- Full Text
- View/download PDF
45. Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language
- Author
-
Shahzad Nazir, Muhammad Asif, Mariam Rehman, and Shahbaz Ahmad
- Subjects
Word segmentation ,Text normalization ,Machine learning ,Low resourced languages ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
In text applications, pre-processing is deemed a significant step for enhancing the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal procedures of text pre-processing whose importance cannot be overstated. Text normalization refers to transforming raw text into scriptural standardized text, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most spoken languages in the world. However, the world’s 10th most widely spoken language has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, separating digits, etc. For word tokenization, core features are defined and extracted for each character of the text. A machine learning model is used with specified handcrafted rules to predict the space and to tokenize the text. This experiment is performed while creating the largest human-annotated dataset composed in Urdu script, covering five different domains. The results have been evaluated using precision, recall, F-measure, and accuracy. Further, the results are compared with the state of the art. The normalization approach produced a 20% improvement and the tokenization approach a 6% improvement.
- Published
- 2024
- Full Text
- View/download PDF
46. Using Syntax and Shallow Semantic Analysis for Vietnamese Question Generation.
- Author
-
Tran, Phuoc, Nguyen, Duy Khanh, Tran, Tram, and Vo, Bay
- Subjects
VIETNAMESE language ,SYNTAX (Grammar) ,NATURAL language processing ,SEMANTICS ,PARTS of speech - Abstract
This paper presents a method of using syntax and shallow semantic analysis for Vietnamese question generation (QG). Specifically, our proposed technique concentrates on investigating both the syntactic and shallow semantic structure of each sentence. The main goal of our method is to generate questions from a single sentence. These generated questions are known as factoid questions which require short, fact-based answers. In general, syntax-based analysis is one of the most popular approaches within the QG field, but it requires linguistic expert knowledge as well as a deep understanding of syntax rules in the Vietnamese language. It is thus considered a high-cost and inefficient solution due to the requirement of significant human effort to achieve qualified syntax rules. To deal with this problem, we collected the syntax rules in Vietnamese from a Vietnamese language textbook. Moreover, we also used different natural language processing (NLP) techniques to analyze Vietnamese shallow syntax and semantics for the QG task. These techniques include: sentence segmentation, word segmentation, part of speech, chunking, dependency parsing, and named entity recognition. We used human evaluation to assess the credibility of our model, which means we manually generated questions from the corpus, and then compared them with the generated questions. The empirical evidence demonstrates that our proposed technique has significant performance, in which the generated questions are very similar to those which are created by humans. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
47. CLASSIC Utterance Boundary: A Chunking‐Based Model of Early Naturalistic Word Segmentation.
- Author
-
Cabiddu, Francesco, Bott, Lewis, Jones, Gary, and Gambi, Chiara
- Subjects
- *
VOCABULARY , *ENGLISH language , *LANGUAGE awareness , *LANGUAGE & languages , *LEARNING ability - Abstract
Word segmentation is a crucial step in children's vocabulary learning. While computational models of word segmentation can capture infants' performance in small‐scale artificial tasks, the examination of early word segmentation in naturalistic settings has been limited by the lack of measures that can relate models' performance to developmental data. Here, we extended CLASSIC (Chunking Lexical and Sublexical Sequences in Children; Jones et al., 2021), a corpus‐trained chunking model that can simulate several memory and phonological and vocabulary learning phenomena to allow it to perform word segmentation using utterance boundary information, and we have named this extended version CLASSIC utterance boundary (CLASSIC‐UB). Further, we compared our model to the performance of children on a wide range of new measures, capitalizing on the link between word segmentation and vocabulary learning abilities. We showed that the combination of chunking and utterance‐boundary information used by CLASSIC utterance boundary allowed a better prediction of English‐learning children's output vocabulary than did other models. A one‐page Accessible Summary of this article in non‐technical language is freely available in the Supporting Information online and at https://oasis-database.org [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
48. A Levenshtein distance-based method for word segmentation in corpus augmentation of geoscience texts
- Author
-
Jinqu Zhang, Lang Qian, Shu Wang, Yunqiang Zhu, Zhenji Gao, Hailong Yu, and Weirong Li
- Subjects
Geoscience ,corpus augmentation ,word segmentation ,Chinese ,Mathematical geography. Cartography ,GA1-1776 - Abstract
For geoscience text, rich domain corpora have become the basis for improving model performance in word segmentation. However, the lack of annotated domain-specific corpora has become a major obstacle to professional information mining in geoscience fields. In this paper, we propose a corpus augmentation method based on Levenshtein distance. According to the technique, a geoscience dictionary of 20,137 words was collected and constructed by crawling the keywords from published papers in China National Knowledge Infrastructure (CNKI). The dictionary was further used as the main source of synonyms to enrich the geoscience corpus according to the Levenshtein distance between words. Finally, a Chinese word segmentation model combining BERT, a bidirectional gated recurrent unit network (Bi-GRU), and conditional random fields (CRF) was implemented. A geoscience corpus composed of complex, long, domain-specific vocabulary was selected to test the proposed word segmentation framework. CNN-LSTM, Bi-LSTM-CRF, and Bi-GRU-CRF models were all selected to evaluate the effects of the Levenshtein data augmentation technique. Experimental results prove that the proposed methods achieve a significant performance improvement of more than 10%. It has great potential for natural language processing tasks like named entity recognition and relation extraction.
- Published
- 2023
- Full Text
- View/download PDF
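Entry 48 selects replacement terms from a domain dictionary according to Levenshtein distance. The sketch below gives the standard dynamic-programming edit distance together with a hypothetical helper that picks dictionary terms within a distance threshold. The threshold, the helper name, and the toy dictionary are assumptions, since the abstract does not say how candidates were chosen.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

def closest_terms(word, dictionary, max_dist=1):
    """Hypothetical helper: dictionary terms within max_dist edits of word."""
    return [t for t in dictionary if t != word and levenshtein(word, t) <= max_dist]

dictionary = ["basalt", "basalts", "granite", "basal"]
print(levenshtein("kitten", "sitting"))
print(closest_terms("basalt", dictionary))
```

Swapping a word in an annotated sentence for one of its close dictionary neighbours produces a new training sentence whose word boundaries are already known, which is the sense in which the distance supports corpus augmentation.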
49. The role of format familiarity and word frequency in Chinese reading
- Author
-
Mingjing Chen and Jiamei Lu
- Subjects
Eye movement ,Chinese reading ,word segmentation ,vocabulary recognition ,E-Z reader model ,Human anatomy ,QM1-695 - Abstract
For Chinese readers, reading from left to right is the norm, while reading from right to left is unfamiliar. This study comprises two experiments investigating how format familiarity and word frequency affect reading by Chinese people. Experiment 1 examines the roles of format familiarity (reading from left to right is the familiar Chinese format, and reading from right to left is the unfamiliar Chinese format) and word frequency in vocabulary recognition. Forty students read the same Chinese sentences from left to right and from right to left. Target words were divided into high and low frequency words. In Experiment 2, participants engaged in right-to-left reading training for 10 days to test whether their right-to-left reading performance could be improved. The study yields several main findings. First, format familiarity affects vocabulary recognition. Participants reading from left to right had shorter fixation times, higher skipping rates, and viewing positions closer to word center. Second, word frequency affects vocabulary recognition in Chinese reading. Third, right-to-left reading training could improve reading performance. In the early indexes, the interaction effect of format familiarity and word frequency was significant. There was also a significant word-frequency effect from left to right but not from right to left. Therefore, word segmentation and vocabulary recognition may be sequential in Chinese reading.
- Published
- 2023
- Full Text
- View/download PDF
50. The Role of Format Familiarity and Word Frequency in Chinese Reading.
- Author
-
Chen Ming Jing and Lu Jia Mei
- Subjects
WORD frequency ,CHINESE people ,READING ,CHINESE language ,COMPARATIVE grammar - Abstract
For Chinese readers, reading from left to right is the norm, while reading from right to left is unfamiliar. This study comprises two experiments investigating how format familiarity and word frequency affect reading by Chinese people. Experiment 1 examines the roles of format familiarity (reading from left to right is the familiar Chinese format, and reading from right to left is the unfamiliar Chinese format) and word frequency in vocabulary recognition. Forty students read the same Chinese sentences from left to right and from right to left. Target words were divided into high and low frequency words. In Experiment 2, participants engaged in right-to-left reading training for 10 days to test whether their right-to-left reading performance could be improved. The study yields several main findings. First, format familiarity affects vocabulary recognition. Participants reading from left to right had shorter fixation times, higher skipping rates, and viewing positions closer to word center. Second, word frequency affects vocabulary recognition in Chinese reading. Third, right-to-left reading training could improve reading performance. In the early indexes, the interaction effect of format familiarity and word frequency was significant. There was also a significant word-frequency effect from left to right but not from right to left. Therefore, word segmentation and vocabulary recognition may be sequential in Chinese reading. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library