537 results for '"Corpora"'
Search Results
2. SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.
- Author
-
Oliveira LESE, Peters AC, da Silva AMP, Gebeluca CP, Gumiel YB, Cintho LMM, Carvalho DR, Al Hasan S, and Moro CMC
- Subjects
- Electronic Health Records, Humans, Portugal, Reproducibility of Results, Medicine, Natural Language Processing
- Abstract
Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. Results: This study resulted in SemClinBr, a corpus of 1000 clinical notes labeled with 65,117 entities and 11,263 relations. In addition, both negation cue and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (strict match) to 0.92 (relaxed match, accepting partial overlaps and hierarchically related semantic types). The extrinsic evaluation, in which the corpus was applied to two downstream NLP tasks, demonstrated the reliability and usefulness of the annotations, with the systems achieving results consistent with the agreement scores. Conclusion: The SemClinBr corpus and the other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community and boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus. (© 2022. The Author(s).)
- Published
- 2022
- Full Text
- View/download PDF
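The strict-versus-relaxed agreement range reported for SemClinBr above (0.71 to 0.92) reflects how span matches are counted. Below is a minimal, hypothetical sketch of pairwise annotator agreement as an F1-style score, where the relaxed variant accepts partial span overlaps; the hierarchical semantic-type relaxation mentioned in the abstract is omitted, and none of this is the authors' evaluation code.

```python
# Hypothetical sketch of strict vs. relaxed span agreement between two
# annotators. Each annotation is (start, end, semantic_type).

def overlaps(a, b):
    """True if spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def match(a, b, relaxed=False):
    if a[2] != b[2]:                      # semantic types must agree
        return False
    if relaxed:
        return overlaps(a, b)             # partial overlap is enough
    return a[0] == b[0] and a[1] == b[1]  # exact boundaries required

def agreement_f1(ann1, ann2, relaxed=False):
    precision = sum(any(match(a, b, relaxed) for b in ann2) for a in ann1) / len(ann1)
    recall = sum(any(match(b, a, relaxed) for a in ann1) for b in ann2) / len(ann2)
    return 2 * precision * recall / (precision + recall)

ann1 = [(0, 12, "Disorder"), (20, 28, "Procedure")]
ann2 = [(0, 10, "Disorder"), (20, 28, "Procedure")]
print(agreement_f1(ann1, ann2, relaxed=False))  # 0.5: one exact match
print(agreement_f1(ann1, ann2, relaxed=True))   # 1.0: overlap counts
```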
3. Biomedical Flat and Nested Named Entity Recognition: Methods, Challenges, and Advances.
- Author
-
Park, Yesol, Son, Gyujin, and Rho, Mina
- Subjects
NATURAL language processing ,CORPORA - Abstract
Biomedical named entity recognition (BioNER) aims to identify and classify biomedical entities (i.e., diseases, chemicals, and genes) from text into predefined classes. This process serves as an important initial step in extracting biomedical information from textual sources. Considering the structure of the entities it addresses, BioNER tasks are divided into two categories: flat NER, where entities are non-overlapping, and nested NER, which identifies entities embedded within other entities. While early studies primarily addressed flat NER, recent advances in neural models have enabled more sophisticated approaches to nested NER, which is gaining increasing relevance in the biomedical field, where entity relationships are often complex and hierarchically structured. This review, thus, focuses on the latest progress in large-scale pre-trained language model-based approaches, which have significantly improved NER performance. The state-of-the-art flat NER models have achieved average F1-scores of 84% on BC2GM, 89% on NCBI Disease, and 92% on BC4CHEM, while nested NER models have reached 80% on the GENIA dataset, indicating room for enhancement. In addition, we discuss persistent challenges, including inconsistencies in how named entities are annotated across different corpora and the limited availability of named entities of various entity types, particularly for multi-type or nested NER. To the best of our knowledge, this paper is the first comprehensive review of pre-trained language model-based flat and nested BioNER models, providing a categorical analysis of the methods and related challenges for future research and development in the field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Exploring the Potential of Neural Machine Translation for Cross-Language Clinical Natural Language Processing (NLP) Resource Generation through Annotation Projection.
- Author
-
Rodríguez-Miret, Jan, Farré-Maduell, Eulàlia, Lima-López, Salvador, Vigil, Laura, Briva-Iglesias, Vicent, and Krallinger, Martin
- Subjects
- *
NATURAL language processing , *SPANISH language , *CORPORA , *MACHINE translating , *ANNOTATIONS , *CLINICAL medicine - Abstract
Recent advancements in neural machine translation (NMT) offer promising potential for generating cross-language clinical natural language processing (NLP) resources. There is a pressing need to foster the development of clinical NLP tools that extract key clinical entities in a comparable way across languages, since a multitude of medical application scenarios are hindered by the lack of multilingual annotated data. This study explores the efficacy of using NMT and annotation projection techniques with expert-in-the-loop validation to develop named entity recognition (NER) systems for an under-resourced target language (Catalan) by leveraging Spanish clinical corpora annotated by domain experts. We employed a state-of-the-art NMT system to translate three clinical case corpora. The translated annotations were then projected onto the target language texts and subsequently validated and corrected by clinical domain experts. The efficacy of the resulting NER systems was evaluated against manually annotated test sets in the target language. Our findings indicate that this approach not only facilitates the generation of high-quality training data for the target language (Catalan) but also demonstrates the potential to extend this methodology to other languages, thereby enhancing multilingual clinical NLP resource development. The generated corpora and components are publicly accessible, potentially providing a valuable resource for further research and application in multilingual clinical settings. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
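The annotation-projection workflow described in record 4 (translate, project the entity spans through a word alignment, then have experts correct the result) can be sketched mechanically. The following is a simplified illustration that assumes the translation and the source-to-target token alignment are already available from an external aligner; it is not the authors' pipeline, and the example sentences and spans are invented.

```python
# Simplified annotation projection: map token-level entity spans from a
# source sentence onto its translation using a token alignment.
# The alignment pairs (i, j) are assumed to come from an external word
# aligner; producing them is outside this sketch.

def project_annotations(src_tokens, tgt_tokens, alignment, src_entities):
    """src_entities: list of (start_tok, end_tok, label) on the source side.
    Returns projected (start_tok, end_tok, label) spans on the target side."""
    projected = []
    for start, end, label in src_entities:
        # Collect all target tokens aligned to any source token in the span.
        tgt_idx = sorted({j for i, j in alignment if start <= i < end})
        if tgt_idx:
            # Take the covering target span; experts then validate/correct it.
            projected.append((tgt_idx[0], tgt_idx[-1] + 1, label))
    return projected

src = "El paciente presenta neumonía bilateral".split()
tgt = "El pacient presenta pneumònia bilateral".split()
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
entities = [(3, 5, "DISEASE")]  # "neumonía bilateral"
print(project_annotations(src, tgt, alignment, entities))
# [(3, 5, 'DISEASE')] -> "pneumònia bilateral"
```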
5. Language with vision: A study on grounded word and sentence embeddings.
- Author
-
Shahmohammadi, Hassan, Heitmeier, Maria, Shafaei-Bajestan, Elnaz, Lensch, Hendrik P. A., and Baayen, R. Harald
- Subjects
- *
NATURAL language processing , *MACHINE learning , *ENGLISH language , *CORPORA , *VISION - Abstract
Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many attempts at language grounding, achieving an optimal equilibrium between textual representations of the language and our embodied experiences remains an open field. Some common concerns are the following. Is visual grounding advantageous for abstract words, or is its effectiveness restricted to concrete words? What is the optimal way of bridging the gap between text and vision? To what extent is perceptual knowledge from images advantageous for acquiring high-quality embeddings? Leveraging the current advances in machine learning and natural language processing, the present study addresses these questions by proposing a simple yet very effective computational grounding model for pre-trained word embeddings. Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information while simultaneously preserving the distributional statistics that characterize word usage in text corpora. By applying a learned alignment, we are able to indirectly ground unseen words including abstract words. A series of evaluations on a range of behavioral datasets shows that visual grounding is beneficial not only for concrete words but also for abstract words, lending support to the indirect theory of abstract concepts. Moreover, our approach offers advantages for contextualized embeddings, such as those generated by BERT (Devlin et al, 2018), but only when trained on corpora of modest, cognitively plausible sizes. Code and grounded embeddings for English are available at (https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
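As a rough illustration of the balance record 5 describes (aligning text embeddings with visual information while preserving distributional statistics), one can train a small mapping whose loss trades off closeness to a paired image embedding against reconstruction of the original word vector. This is a generic sketch with made-up dimensions, random stand-in data, and a plain MSE objective; it is not the authors' model.

```python
# Generic grounding sketch: learn a mapping that moves text embeddings toward
# paired image embeddings while a reconstruction term keeps them close to the
# original distributional vectors.
import torch

d_text, d_img, n = 300, 512, 1000
text_emb = torch.randn(n, d_text)          # stand-in for pretrained vectors
img_emb = torch.randn(n, d_img)            # stand-in for image features

to_img = torch.nn.Linear(d_text, d_img)    # alignment head
to_text = torch.nn.Linear(d_img, d_text)   # reconstruction head
opt = torch.optim.Adam(list(to_img.parameters()) + list(to_text.parameters()),
                       lr=1e-3)

alpha = 0.5  # trade-off: visual alignment vs. textual preservation
for step in range(200):
    opt.zero_grad()
    grounded = to_img(text_emb)
    vision_loss = torch.nn.functional.mse_loss(grounded, img_emb)
    preserve_loss = torch.nn.functional.mse_loss(to_text(grounded), text_emb)
    loss = alpha * vision_loss + (1 - alpha) * preserve_loss
    loss.backward()
    opt.step()

# Unseen words (including abstract ones) can be grounded indirectly by
# pushing their pretrained vectors through the learned mapping:
new_word_vec = torch.randn(1, d_text)
grounded_new = to_img(new_word_vec)
```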
6. Probing Effects of Contextual Bias on Number Magnitude Estimation.
- Author
-
Xuehao Du, Ping Ji, Wei Qin, Lei Wang, and Yunshi Lan
- Subjects
NATURAL language processing ,MAGNITUDE estimation ,ENCODING ,CORPORA - Abstract
The semantic understanding of numbers requires association with context. However, powerful neural networks can overfit spurious correlations between context and numbers in the training corpus, leading to contextual bias that may affect the network's accurate estimation of number magnitude when making inferences on real-world data. To investigate the resilience of current methodologies against contextual bias, we introduce a novel out-of-distribution (OOD) numerical question-answering (QA) dataset that features specific correlations between context and numbers in the training data which are not present in the OOD test data. We evaluate the robustness of different numerical encoding and decoding methods when confronted with contextual bias on this dataset. Our findings indicate that encoding methods incorporating more detailed digit information exhibit greater resilience against contextual bias. Inspired by this finding, we propose a digit-aware position embedding strategy, and the experimental results demonstrate that this strategy is highly effective in improving the robustness of neural networks against contextual bias. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
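Record 6's finding that digit-level information helps suggests encodings along the following lines. This sketch is my own construction, not the authors' method: each digit token gets a digit embedding plus an embedding of its place value, so magnitude information is explicit in the representation.

```python
# Sketch of a digit-aware number encoding: each digit gets a digit embedding
# plus an embedding of its place value (units, tens, hundreds, ...).
# Assumes non-negative integers; dimensions are arbitrary toy values.
import torch

class DigitAwareEncoder(torch.nn.Module):
    def __init__(self, dim=32, max_digits=12):
        super().__init__()
        self.digit_emb = torch.nn.Embedding(10, dim)          # digits 0-9
        self.place_emb = torch.nn.Embedding(max_digits, dim)  # place values

    def forward(self, number: int) -> torch.Tensor:
        digits = [int(c) for c in str(number)]
        places = list(range(len(digits) - 1, -1, -1))  # most -> least significant
        d = self.digit_emb(torch.tensor(digits))
        p = self.place_emb(torch.tensor(places))
        return (d + p).mean(dim=0)  # pooled representation of the number

enc = DigitAwareEncoder()
print(enc(2048).shape)  # torch.Size([32])
```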
7. A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish.
- Author
-
Dunstan, Jocelyn, Vakili, Thomas, Miranda, Luis, Villena, Fabián, Aracena, Claudio, Quiroga, Tamara, Vera, Paulina, Viteri Valenzuela, Sebastián, and Rocco, Victor
- Subjects
- *
LANGUAGE models , *NATURAL language processing , *PERSONALLY identifiable information , *CORPORA , *INDUSTRIAL hygiene - Abstract
Despite the high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER) trained on referrals from primary care physicians to identify diseases, body parts, and medications in this work-related text. We analyzed the differences between the models and the gold standard curated by a physician in detail. Moreover, we compared the performance of the NER model on the original narratives, in narratives where personal information has been masked, and in texts where the personal data is replaced by another similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
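The comparison in record 7 between masked and pseudonymized narratives rests on a simple transformation of PII spans. A toy sketch of the two variants follows; the placeholder tags and surrogate values are invented for illustration, not taken from the paper.

```python
# Toy sketch: masking replaces each PII span with a placeholder tag, while
# pseudonymization substitutes a surrogate value of the same category.

SURROGATES = {"NAME": "María Pérez", "DATE": "12-03-1990"}  # invented values

def transform(text, spans, mode="mask"):
    """spans: list of (start, end, category), non-overlapping."""
    out, last = [], 0
    for start, end, cat in sorted(spans):
        out.append(text[last:start])
        out.append(f"<{cat}>" if mode == "mask" else SURROGATES[cat])
        last = end
    out.append(text[last:])
    return "".join(out)

text = "Paciente Juan Soto, nacido el 01-05-1987, refiere dolor lumbar."
spans = [(9, 18, "NAME"), (30, 40, "DATE")]
print(transform(text, spans, "mask"))    # <NAME>, <DATE> placeholders
print(transform(text, spans, "pseudo"))  # surrogate values instead
```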
8. Oxymoron: An Automatic Detection from the Corpus.
- Author
-
Senapati, Apurbalal
- Subjects
LANGUAGE models ,NATURAL language processing ,BENGALI language ,CORPORA ,COMPUTATIONAL linguistics - Abstract
An oxymoron is a linguistic phenomenon in which a pair of opposite or antonymous words are combined to convey a new meaning. Sometimes, it is used to express figurative meaning, irony, or rhetoric within a text. This issue has received relatively little attention in the realms of linguistics and computational disciplines. Oxymorons play a significant role in various language-processing applications. This study represents a pioneering effort in the exploration of oxymorons in the Bengali language. A corpus-based study of oxymorons is a fundamental issue that has not been explored so far. A system has been proposed for the automated recognition of oxymorons from a given corpus. Frequency analysis, semantic similarity, and an antonym dictionary have been employed to discern oxymorons within the corpus. The system achieved promising results when tested on a Bengali corpus, finding 308 distinct oxymorons. Corpus-based descriptive statistics were measured in two different corpora. The most common oxymorons are ranked based on their frequency; their notable presence underscores their importance in the Bengali language. This study aimed to explore fundamental questions concerning oxymorons, such as the automated detection of oxymorons within a corpus, descriptive statistics regarding oxymorons across languages, and the process of their construction and creation. Additionally, efforts were made to extract oxymorons from large language models using zero-shot prompts, but the results were not as promising compared to our proposed system. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
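A crude version of the detection recipe in record 8 (frequency analysis plus an antonym dictionary) can be written in a few lines; the embedding-based semantic-similarity filter the abstract also mentions would be layered on top and is omitted here. The antonym pairs and example corpus are invented for illustration.

```python
# Crude oxymoron detector: flag adjacent word pairs that an antonym
# dictionary marks as opposites, then rank candidates by corpus frequency.
from collections import Counter
from itertools import pairwise  # Python 3.10+

ANTONYMS = {("open", "secret"), ("deafening", "silence"),
            ("living", "dead")}  # toy antonym dictionary

def find_oxymorons(tokens):
    counts = Counter(pairwise(tokens))
    hits = {}
    for (w1, w2), freq in counts.items():
        if (w1, w2) in ANTONYMS or (w2, w1) in ANTONYMS:
            hits[(w1, w2)] = freq
    # Most frequent candidates first, mirroring the paper's frequency ranking.
    return sorted(hits.items(), key=lambda kv: -kv[1])

corpus = ("it was an open secret that the deafening silence "
          "hid an open secret").split()
print(find_oxymorons(corpus))
# [(('open', 'secret'), 2), (('deafening', 'silence'), 1)]
```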
9. Raising the Bar on Acceptability Judgments Classification: An Experiment on ItaCoLA Using ELECTRA.
- Author
-
Guarasci, Raffaele, Minutolo, Aniello, Buonaiuto, Giuseppe, De Pietro, Giuseppe, and Esposito, Massimo
- Subjects
NATURAL language processing ,LEGAL judgments ,LANGUAGE models ,CORPORA ,CLASSIFICATION - Abstract
The task of automatically evaluating acceptability judgments has enjoyed increasing success in Natural Language Processing, starting from the inclusion of the Corpus of Linguistic Acceptability (CoLA) in the GLUE benchmark dataset. CoLA spawned a thread that led to the development of several similar datasets in different languages, broadening the investigation possibilities to many languages other than English. In this study, leveraging the Italian Corpus of Linguistic Acceptability (ItaCoLA), comprising nearly 10,000 sentences with acceptability judgments, we propose a new methodology that utilizes the neural language model ELECTRA. This approach exceeds the scores obtained from current baselines and demonstrates that it can overcome language-specific limitations in dealing with specific phenomena. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
10. AraFast: Developing and Evaluating a Comprehensive Modern Standard Arabic Corpus for Enhanced Natural Language Processing.
- Author
-
Alrayzah, Asmaa, Alsolami, Fawaz, and Saleh, Mostafa
- Subjects
ARABIC language ,TRANSFORMER models ,NATURAL language processing ,CORPORA - Abstract
The research presented in the following paper focuses on the effectiveness of a modern standard Arabic corpus, AraFast, in training transformer models for natural language processing tasks, particularly in Arabic. In the study described herein, four experiments were conducted to evaluate the use of AraFast across different configurations: segmented, unsegmented, and mini versions. The main outcomes of the present study are as follows. First, transformer models trained with larger and cleaner versions of AraFast, especially in question answering, demonstrate the impact of corpus quality and size on model efficacy. Second, a dramatic reduction in training loss was observed with the mini version of AraFast, underscoring the importance of optimizing corpus size for effective training. Moreover, the segmented text format led to a decrease in training loss, highlighting segmentation as a beneficial strategy in Arabic NLP. In addition, the study findings identify challenges in managing noisy data derived from web sources, which were found to significantly hinder model performance. These findings collectively demonstrate the critical role of well-prepared, segmented, and clean corpora in advancing Arabic NLP capabilities. The insights from AraFast's application can guide the development of more efficient NLP models and suggest directions for future research in enhancing Arabic language processing tools. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
11. Corpus Use in Cross-linguistic Research: Paving the Way for Teaching, Translation and Professional Communication.
- Author
-
Xiaoyi Yang and Yuan Ping
- Subjects
- *
NATURAL language processing , *LINGUISTIC minorities , *CONTRASTIVE linguistics , *MACHINE translating , *CORPORA , *PHILOLOGY
- Published
- 2024
- Full Text
- View/download PDF
12. Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion.
- Author
-
Thielmann, Anton, Reuter, Arik, Seifert, Quentin, Bergherr, Elisabeth, and Säfken, Benjamin
- Subjects
- *
NATURAL language processing , *WORD frequency , *CORPORA , *DOCUMENT clustering , *IDENTIFICATION - Abstract
Extracting and identifying latent topics in large text corpora have gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. Through simple corpus expansion, our model can detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared with state-of-the-art topic modeling and document clustering models. The code is available at the following link: https://github.com/AnFreTh/STREAM. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
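The intruder-word metrics mentioned in record 12 build on a simple idea: in a list of topic words plus one intruder, the intruder should be the word least similar to the rest in the semantic space. A minimal sketch with random stand-in vectors follows; it is not the STREAM implementation.

```python
# Minimal word-intrusion check: the predicted intruder is the word with the
# lowest average cosine similarity to the other words in the list.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_intruder(words, vectors):
    scores = {}
    for w in words:
        others = [vectors[u] for u in words if u != w]
        scores[w] = np.mean([cosine(vectors[w], o) for o in others])
    return min(scores, key=scores.get)

rng = np.random.default_rng(0)
base = rng.normal(size=50)
vectors = {w: base + 0.1 * rng.normal(size=50)   # tight "topic" cluster
           for w in ["atom", "proton", "electron", "nucleus"]}
vectors["banana"] = rng.normal(size=50)          # the intruder
print(predict_intruder(list(vectors), vectors))  # likely 'banana'
```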
13. ChineseEEG: A Chinese Linguistic Corpora EEG Dataset for Semantic Alignment and Neural Decoding.
- Author
-
Mou, Xinyu, He, Cuilin, Tan, Liwei, Yu, Junjie, Liang, Huadong, Zhang, Jianyu, Tian, Yan, Yang, Yu-Fang, Xu, Ting, Wang, Qing, Cao, Miao, Chen, Zijiao, Hu, Chuan-Peng, Wang, Xindi, Liu, Quanying, and Wu, Haiyan
- Subjects
NEUROLINGUISTICS ,CORPORA ,LANGUAGE models ,NATURAL language processing ,CHINESE language ,ELECTROENCEPHALOGRAPHY - Abstract
An Electroencephalography (EEG) dataset utilizing rich text stimuli can advance the understanding of how the brain encodes semantic information and contribute to semantic decoding in brain-computer interface (BCI). Addressing the scarcity of EEG datasets featuring Chinese linguistic stimuli, we present the ChineseEEG dataset, a high-density EEG dataset complemented by simultaneous eye-tracking recordings. This dataset was compiled while 10 participants silently read approximately 13 hours of Chinese text from two well-known novels. This dataset provides long-duration EEG recordings, along with pre-processed EEG sensor-level data and semantic embeddings of reading materials extracted by a pre-trained natural language processing (NLP) model. As a pilot EEG dataset derived from natural Chinese linguistic stimuli, ChineseEEG can significantly support research across neuroscience, NLP, and linguistics. It establishes a benchmark dataset for Chinese semantic decoding, aids in the development of BCIs, and facilitates the exploration of alignment between large language models and human cognitive processes. It can also aid research into the brain's mechanisms of language processing within the context of the Chinese natural language. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. A tree-based corpus annotated with Cyber-Syndrome, symptoms, and acupoints.
- Author
-
Wang, Wenxi, Zhao, Zhan, and Ning, Huansheng
- Subjects
NATURAL language processing ,ACUPUNCTURE points ,CORPORA ,MULTICASTING (Computer networks) - Abstract
Prolonged and excessive interaction with cyberspace poses a threat to people's health and leads to the occurrence of Cyber-Syndrome, which covers not only physiological but also psychological disorders. This paper aims to create a tree-shaped gold-standard corpus that annotates the Cyber-Syndrome, clinical manifestations, and the acupoints that can alleviate their symptoms or signs, designating this corpus as CS-A. In the CS-A corpus, this paper defines six types of entities and relations for annotation. In total, 448 texts were annotated manually. After three rounds of updating the annotation guidelines, the inter-annotator agreement (IAA) improved significantly, reaching a final IAA score of 86.05%. The purpose of constructing the CS-A corpus is to increase awareness of Cyber-Syndrome and draw attention to its subtle impact on people's health. Meanwhile, the annotated corpus promotes the development of natural language processing technology. Model experiments can be implemented based on this corpus, such as optimizing and improving models for discontinuous entity recognition, nested entity recognition, etc. The CS-A corpus has been uploaded to figshare. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
15. A machine-based corpus optimization method for extracting domain-oriented technical words: an example of COVID-19 corpus data.
- Author
-
Chen, Liang-Ching, Chang, Kuei-Hu, Wu, Chia-Heng, and Chen, Shin-Chi
- Subjects
- *
NATURAL language processing , *COVID-19 , *CORPORA , *WORD frequency - Abstract
Although natural language processing (NLP) refers to a process involving the development of algorithms or computational models that empower machines to understand, interpret, and generate human language, machines are still unable to fully grasp the meanings behind words. Specifically, they cannot assist humans in categorizing words as general-purpose or technical without predefined standards or baselines. Empirically, prior research has relied on inefficient manual tasks to exclude these words when extracting technical words (i.e., terminology or terms used within a specific field or domain of expertise) for obtaining domain information from the target corpus. Therefore, to enhance the efficiency of extracting domain-oriented technical words in corpus analysis, this paper proposes a machine-based corpus optimization method that compiles an advanced general-purpose word list (AGWL) to serve as the exclusion baseline for the machine to extract domain-oriented technical words. To validate the proposed method, this paper utilizes 52 COVID-19 research articles as the target corpus in an empirical example. Compared to traditional methods, the proposed method offers significant contributions: (1) it can automatically eliminate the most common function words in corpus data; (2) through a machine-driven process, it removes general-purpose words with high frequency and dispersion rates; 57% of word types belong to general-purpose words and constitute 90% of the total words in the target corpus, so the remaining 43% of word types, which represent domain-oriented technical words and make up 10% of the total words, can be extracted. This allows future researchers to focus exclusively on the remaining 43% of word types in the optimized word list (OWL), enhancing the efficiency of corpus analysis for extracting domain knowledge. (3) The proposed method establishes a standard operating procedure (SOP) that can be duplicated and generally applied to optimize any corpus data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
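The exclusion-baseline idea in record 15 amounts to subtracting a general-purpose word list from the corpus vocabulary, using frequency and dispersion to decide what else counts as general. A schematic version follows; the word list, threshold, and toy documents are placeholders, not the paper's AGWL or data.

```python
# Schematic corpus optimization: drop general-purpose and widely dispersed
# words, keeping the remaining word types as candidate technical terms.
from collections import Counter

GENERAL_WORDS = {"the", "of", "and", "to", "in", "is", "study", "results"}

def optimized_word_list(documents, general_words, max_dispersion=0.8):
    n_docs = len(documents)
    freq, doc_freq = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()
        freq.update(tokens)
        doc_freq.update(set(tokens))
    owl = {}
    for word, f in freq.items():
        dispersion = doc_freq[word] / n_docs  # share of documents containing it
        # Exclude baseline words and words dispersed like general vocabulary.
        if word not in general_words and dispersion < max_dispersion:
            owl[word] = f
    return sorted(owl.items(), key=lambda kv: -kv[1])

docs = ["the spike protein of the coronavirus binds ace2",
        "results of the vaccine study in the trial cohort",
        "the ace2 receptor and viral entry"]
print(optimized_word_list(docs, GENERAL_WORDS))
```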
16. Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures.
- Author
-
Chen, Xueqing, Gao, Yang, Wang, Ludi, Cui, Wenjuan, Huang, Jiamin, Du, Yi, and Wang, Bin
- Subjects
LANGUAGE models ,ELECTROCATALYSTS ,SCIENTIFIC literature ,CORPORA ,DATABASES ,NATURAL language processing ,MACHINE learning - Abstract
CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial information related to catalysts from domain literature can help scientists find new and effective electrocatalysts. Herein, we used various advanced machine learning and natural language processing techniques, together with large language model (LLM) approaches, to extract relevant information about the CO2 electrocatalytic reduction process from scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLM techniques. The Extended Corpus I and II contain 77,016 and 30,283 records, respectively. Furthermore, several LLMs fine-tuned on domain literature were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from domain literature using cutting-edge computer techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Review: Durrant, Brenchley and McCallum. 2021. Understanding Development and Proficiency in Writing: Quantitative Corpus Linguistic Approaches. Cambridge: Cambridge University Press.
- Author
-
Alanazi, Zaha
- Subjects
CORPORA ,NATURAL language processing ,SECOND language acquisition ,LANGUAGE ability ,LANGUAGE research ,LITERATURE reviews - Abstract
The book "Understanding Development and Proficiency in Writing: Quantitative Corpus Linguistic Approaches" by Philip Durrant, Mark Brenchley, and Lee McCallum provides a comprehensive synthesis of corpus-based research on writing development. The book covers various aspects of writing development, including syntax, vocabulary, formulaic language, and cohesion. Each chapter critically evaluates definitions and theories, presents research findings, and highlights gaps and methodological issues for further investigation. While the book offers valuable insights for English educators and researchers, it could have provided more information on the application of research in instructional settings and language assessment. Overall, it is a valuable resource for those interested in writing development using corpora. [Extracted from the article]
- Published
- 2024
- Full Text
- View/download PDF
18. Enriching Portuguese Medieval Texts with Named Entity Recognition.
- Author
-
Inês Bico, Maria, Baptista, Jorge, Batista, Fernando, and Cardeira, Esperança
- Subjects
- *
INFORMATION retrieval , *NATURAL language processing , *PORTUGUESE language , *CORPORA , *HISTORICAL source material , *KNOWLEDGE base - Abstract
Historical data poses unique challenges to natural language processing (NLP) and information retrieval (IR) tools, including digitization errors, lack of annotated data, and diachronic-specific issues. However, the increasing recognition of the value in historical documents has promoted efforts to semantically enrich and optimize their analysis. This article contributes to this endeavour by enriching the Corpus de Textos Antigos through NLP tools and techniques to enhance its usability and support research. The corpus undergoes linguistic annotation, including part-of-speech tagging, lemma annotation and named entity recognition (NER). Subsequently, the article delves into the tasks of entity disambiguation and entity linking, which involve identifying and disambiguating named entities by referring to a knowledge base (KB). Addressing the challenges posed by factors such as text state, epoch and the chosen KB, the article presents insights into related work, annotation results and the linguistic interest of a medieval annotated corpus for named entities. It concludes by discussing the challenges and providing avenues for future research in this domain. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. SetembroBR: a social media corpus for depression and anxiety disorder prediction.
- Author
-
Santos, Wesley Ramos dos, de Oliveira, Rafael Lage, and Paraboni, Ivandré
- Subjects
- *
ANXIETY disorders , *MICROBLOGS , *SOCIAL media , *NATURAL language processing , *PORTUGUESE language , *MENTAL illness - Abstract
The present work introduces a novel dataset—hereby called the SetembroBR corpus—for the study and development of depression and anxiety disorder predictive models in the Portuguese language based on the information prior to a diagnosis. The corpus comprises both text- and network-related information related to 3.9 thousand Twitter users who self-reported a diagnosis or treatment for a mental disorder, and its use is illustrated by a number of experiments addressing the issues of depression and anxiety disorder prediction from social media data. Our present results are intended as a first step towards investigating how mental health statuses are expressed on Portuguese-speaking social media, and pave the way for computational applications intended to assist with a pressing issue of great social interest. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
20. Urdu paraphrase detection: A novel DNN-based implementation using a semi-automatically generated corpus.
- Author
-
Iqbal, Hafiz Rizwan, Maqsood, Rashad, Raza, Agha Ali, and Hassan, Saeed-Ul
- Subjects
ARTIFICIAL neural networks ,LANGUAGE models ,PARAPHRASE ,NATURAL language processing ,CORPORA - Abstract
Automatic paraphrase detection is the task of measuring the semantic overlap between two given texts. A major hurdle in the development and evaluation of paraphrase detection approaches, particularly for South Asian languages like Urdu, is the inadequacy of standard evaluation resources. The very few available paraphrase corpora for these languages are manually created. As a result, they are constrained to smaller sizes and are not very suitable for evaluating mainstream data-driven and deep neural network (DNN)-based approaches. Consequently, there is a need to develop semi- or fully automated corpus generation approaches for resource-scarce languages. There is currently no semi- or fully automatically generated sentence-level Urdu paraphrase corpus. Moreover, no study is available that localizes and compares approaches for Urdu paraphrase detection focusing on various mainstream deep neural architectures and pretrained language models. This research study addresses this problem by presenting a semi-automatic pipeline for generating paraphrased corpora for Urdu. It also presents a corpus that is generated using the proposed approach. This corpus contains 3147 semi-automatically extracted Urdu sentence pairs that are manually tagged as paraphrased (854) and non-paraphrased (2293). Finally, this paper proposes two novel approaches based on DNNs for the task of paraphrase detection in Urdu text: Word Embeddings n-gram Overlap (henceforth called WENGO), and a modified approach, Deep Text Reuse and Paraphrase Plagiarism Detection (henceforth called D-TRAPPD). Both of these approaches have been evaluated on two related tasks: (i) paraphrase detection, and (ii) text reuse and plagiarism detection. The results from these evaluations revealed that D-TRAPPD ( $F_1 = 96.80$ for paraphrase detection and $F_1 = 88.90$ for text reuse and plagiarism detection) outperformed WENGO ( $F_1 = 81.64$ for paraphrase detection and $F_1 = 61.19$ for text reuse and plagiarism detection) as well as other state-of-the-art approaches for these two tasks. The corpus, models, and our implementations have been made freely available for download by the research community. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
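Record 20's WENGO baseline combines two familiar ingredients, word embeddings and n-gram overlap. One plausible reading, sketched below with random stand-in vectors, scores a sentence pair by the fraction of n-grams in one sentence whose embedding (mean of word vectors) is sufficiently close to some n-gram embedding of the other. This is a guess at the general shape of such a method, not the published formulation, and the threshold is arbitrary.

```python
# Embedding-based n-gram overlap between two sentences, as a rough
# paraphrase score in [0, 1]. Word vectors here are random stand-ins.
import numpy as np

rng = np.random.default_rng(1)
VOCAB = {}

def vec(word):  # toy lookup that assigns each word a fixed random vector
    if word not in VOCAB:
        VOCAB[word] = rng.normal(size=50)
    return VOCAB[word]

def ngram_vecs(tokens, n=2):
    return [np.mean([vec(w) for w in tokens[i:i + n]], axis=0)
            for i in range(len(tokens) - n + 1)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def overlap_score(s1, s2, n=2, threshold=0.8):
    g1, g2 = ngram_vecs(s1.split(), n), ngram_vecs(s2.split(), n)
    hits = sum(max(cosine(a, b) for b in g2) >= threshold for a in g1)
    return hits / len(g1)

s1 = "the minister announced the new policy"
s2 = "the minister announced a revised policy"
print(overlap_score(s1, s2))
```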
21. State-of-the-Art Review of the Corpus Linguistics Field From the Beginning Until the Development of ChatGPT.
- Author
-
Altameemi, Yaser M.
- Subjects
CORPORA ,NATURAL language processing ,CHATGPT ,ANTHROPOLOGICAL linguistics ,HISTORICAL maps ,THEMATIC maps ,EVIDENCE gaps - Abstract
The present paper highlights the current state of and developments in the corpus linguistics (CL) field. Although several reviews of CL have been conducted, these reviews have focused on specific areas, such as education, or have not provided a clear overall overview of the future implications of the field (Baker et al., 2008; Biber & Reppen, 2020; Biber et al., 1998; G. N. Leech, 1991; McEnery et al., 2019; McEnery & Hardie, 2012). The author begins this paper by providing an overview that can guide new researchers in this field as well as postgraduates who require a general historical and thematic map of CL. The general overview discusses the publications of scholars who have participated in this field as well as the central tools that have been applied in CL. For specific details regarding the development of the field, the author analysed 217 articles from the 3 highest-impact-factor journals according to the Web of Science over the last four years (2019-2022). The findings reveal a rapid development of the field in terms of practical and methodological perspectives, specifically regarding investigations of language use in different contexts. Thus, this paper indicates a significantly strong correlation between CL and technological developments, such as natural language processing (NLP), and shows how this approach could fill the research gap of utilising CL in other areas of linguistics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. ACCELERATING THE PROCESS OF TEXT DATA CORPORA GENERATION BY THE DETERMINISTIC METHOD.
- Author
-
Yusyn, Yakiv and Zabolotnia, Tetiana
- Subjects
CORPORA ,NATURAL language processing ,PROGRAMMING languages ,COMPUTER software testing ,ELECTRONIC data processing - Abstract
The object of research is the process of generating text data corpora using the CorDeGen method. The problem solved in this study is the insufficient efficiency of generating corpora of text data by the CorDeGen method according to the speed criterion. Based on an analysis of the abstract CorDeGen method (the steps it consists of and the algorithm that implements it), the possibilities for its parallelization were determined. As a result, two new modified versions of the base CorDeGen method were developed: a “naive” parallel method and a parallel method. These methods differ from each other in whether they preserve the order of terms in the generated texts compared to the texts generated by the base method (the “naive” parallel method does not preserve it; the parallel method does). Using the .NET platform and the C# programming language, the software implementation of both proposed methods was performed in this work; a property-based testing methodology was used to validate both implementations. The results of efficiency testing showed that for corpora of sufficiently large sizes, the parallel CorDeGen methods speed up generation by a factor of two compared to the base method. The acceleration is explained by parallelizing the generation of each term (its creation, the calculation of its occurrence counts across texts, and its recording), which takes most of the time in the base method. This means that when sufficiently large corpora must be generated in limited time, it is reasonable in practice to use the developed parallel CorDeGen methods instead of the base one. The choice of a particular parallel method (“naive” or order-preserving) for a practical application depends on whether the ability to predict the order of terms in the generated texts is important. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
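The distinction in record 22 between the order-preserving and “naive” parallel variants maps directly onto standard concurrency primitives: mapping over inputs preserves submission order, while collecting results as they complete does not. The sketch below illustrates only this ordering semantics with a stand-in per-term generator (the CorDeGen internals are not described here, and the paper's actual speedup comes from true parallel execution in C#/.NET, not from Python threads).

```python
# Order-preserving vs. "naive" parallel corpus generation with a toy,
# deterministic per-term generator (a stand-in for CorDeGen's term step).
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_for_term(term_id: int) -> str:
    # Deterministic stand-in for the per-term work (term creation,
    # occurrence counting, recording) that dominates base generation time.
    return f"term{term_id} " * (term_id % 3 + 1)

term_ids = range(8)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Parallel method: results come back in submission order, so the
    # generated texts match the base (sequential) method exactly.
    ordered = list(pool.map(generate_for_term, term_ids))

with ThreadPoolExecutor(max_workers=4) as pool:
    # "Naive" parallel method: results arrive in completion order, so the
    # order of terms in the output is not reproducible across runs.
    futures = [pool.submit(generate_for_term, t) for t in term_ids]
    unordered = [f.result() for f in as_completed(futures)]

print("ordered:  ", ordered)
print("unordered:", unordered)  # may differ in order between runs
```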
23. Review of Viana (2022): Teaching English with Corpora: A Resource Book.
- Author
-
Pérez-Paredes, Pascual
- Subjects
- *
ENGLISH language , *NATURAL language processing , *CORPORA , *GRATITUDE , *LANGUAGE teachers , *COMPUTER assisted language instruction - Abstract
"Teaching English with Corpora: A Resource Book" by Vander Viana is a valuable resource for TESOL professionals and students majoring in TESOL. The book fills a gap in the literature by providing ready-to-use activities and guidance on how to effectively use corpora in English language teaching. Divided into two parts, the book offers lesson plans for general English and English for specific purposes, covering various topics such as phrasal verbs, discourse markers, and academic writing. The book includes detailed instructions, materials, and online support, making it a practical tool for language teachers. It also encourages discussions about language use and the importance of understanding grammar in context. [Extracted from the article]
- Published
- 2024
- Full Text
- View/download PDF
24. Learner corpus research for pedagogical purposes: An overview and some research perspectives.
- Author
-
Götz, Sandra and Granger, Sylviane
- Subjects
SECOND language acquisition ,LANGUAGE ability testing ,NATURAL language processing ,COMPUTER assisted language instruction ,CORPORA - Abstract
This article provides an overview of learner corpus research for pedagogical purposes, discussing its use in materials development, language testing and assessment, and classroom practice. It highlights the advantages of using learner corpora in reference tool development and textbook development, as well as the potential for change through large-scale projects and increased accessibility of research findings. The article also explores the obstacles faced by teachers in integrating learner corpora into their practice and discusses the use of computer-assisted language learning programs. It emphasizes the potential of learner corpora to enhance language assessment and tailor it to specific learner backgrounds. The article concludes by calling for stronger collaborations between researchers and teachers and the inclusion of learner corpus-based pedagogy in teacher education programs. [Extracted from the article]
- Published
- 2024
- Full Text
- View/download PDF
25. Parallel-Based Corpus Annotation for Malay Health Documents.
- Author
-
Hafsah, Saad, Saidah, Zakaria, Lailatul Qadri, and Naswir, Ahmad Fadhil
- Subjects
NATURAL language processing ,MANAGEMENT of electronic health records ,MALAY language ,MEDICAL terminology ,CORPORA ,MEDICAL personnel - Abstract
Named entity recognition (NER) is a crucial component of various natural language processing (NLP) applications, particularly in healthcare. It involves accurately identifying and extracting named entities such as medical terms, diseases, drug names, and healthcare professionals, which is essential for tasks like clinical text analysis, electronic health record management, and medical research. However, healthcare NER faces challenges, especially in Malay, for which specialized corpora are limited and no general corpus is available yet. To address this, the paper proposes a method for constructing an annotated corpus of Malay health documents. Because few tools are available for Malay and NER is highly language-dependent, the researchers leverage a parallel source that contains annotated entities in English. Additional credible Malay documents are incorporated as sources to enhance the development. The targeted health entities in this research include penyakit (diseases), simptom (symptoms), and rawatan (treatments). The primary objective is to facilitate the development of NER algorithms specifically tailored to the healthcare domain in the Malay language. The methodology encompasses data collection, preprocessing, annotation of text in both English and Malay, and corpus creation. The outcome of this research is the establishment of the Malay Health Document Annotated Corpus, which serves as a valuable resource for training and evaluating NLP models in the Malay language. Future research directions may focus on developing domain-specific NER models, exploring alternative algorithms, and enhancing performance. Overall, this research aims to address the challenges of healthcare NER in the Malay language by constructing an annotated corpus and facilitating the development of tailored NER algorithms for the healthcare domain. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
26. A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging.
- Author
-
Maulud, Dastan, Jacksi, Karwan, and Ali, Ismael
- Subjects
- *
NATURAL language processing , *HIDDEN Markov models , *NATURAL languages , *PARTS of speech , *CORPORA - Abstract
With the rapid growth of online content written in the Kurdish language, there is an increasing need to make it machine-readable and processable. Part-of-speech (POS) tagging is a critical aspect of natural language processing (NLP), playing a significant role in applications such as speech recognition, natural language parsing, information retrieval, and multiword term extraction. This study details the creation of the DASTAN corpus, the first POS-annotated corpus for the Sorani Kurdish dialect, containing 74,258 words and thirty-eight tags. The accompanying tagger employs a hybrid approach, combining a bigram hidden Markov model with a Kurdish rule-based approach to POS tagging. This approach addresses two key problems that arise with rule-based approaches, namely misclassified words and ambiguity-related unanalyzed words. The proposed approach's accuracy was assessed by training and testing it on the DASTAN corpus, yielding a 96% accuracy rate. Overall, this study's findings demonstrate the effectiveness of the proposed hybrid approach and its potential to enhance NLP applications for Sorani Kurdish. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
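The hybrid design in record 26, a bigram HMM backed by rules for unknown or ambiguous words, is a classic pattern. Below is a compact, generic illustration of Viterbi decoding with a rule-based fallback; all probabilities and rules are toy values, and nothing here is Kurdish-specific or taken from the paper.

```python
# Generic bigram-HMM POS tagger with a rule-based fallback for unknown
# words; transition/emission probabilities are invented toy values.
import math

TAGS = ["DET", "NOUN", "VERB"]
TRANS = {("DET", "NOUN"): 0.7, ("NOUN", "VERB"): 0.6,
         ("NOUN", "NOUN"): 0.2, ("VERB", "DET"): 0.3, ("VERB", "NOUN"): 0.4}
EMIT = {("DET", "the"): 0.9, ("NOUN", "dog"): 0.4,
        ("VERB", "barks"): 0.5, ("NOUN", "barks"): 0.05}
KNOWN = {w for _, w in EMIT}

def rule_tag(word):
    # Rule-based fallback, e.g. a crude suffix cue for unseen words.
    return "VERB" if word.endswith("s") else "NOUN"

def emit(tag, word):
    if word in KNOWN:
        return EMIT.get((tag, word), 1e-6)
    return 0.9 if tag == rule_tag(word) else 1e-6  # hand over to the rules

def viterbi(words):
    paths = {t: (math.log(emit(t, words[0])), [t]) for t in TAGS}
    for word in words[1:]:
        new = {}
        for t in TAGS:
            score, prev = max(
                (paths[p][0] + math.log(TRANS.get((p, t), 1e-6)), p)
                for p in TAGS)
            new[t] = (score + math.log(emit(t, word)), paths[prev][1] + [t])
        paths = new
    return max(paths.values(), key=lambda sp: sp[0])[1]

print(viterbi("the dog barks".split()))   # ['DET', 'NOUN', 'VERB']
print(viterbi("the zorp glims".split()))  # unknown words fall back on rules
```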
27. Building and benchmarking the motivated deception corpus: Improving the quality of deceptive text through gaming.
- Author
-
Barsever, Dan, Steyvers, Mark, and Neftci, Emre
- Subjects
- *
DECEPTION , *MACHINE learning , *NATURAL language processing , *CORPORA , *FAKE news - Abstract
When one studies fake news or false reviews, the first step is to find a corpus of text samples to work with. However, most deceptive corpora suffer from an intrinsic problem: there is little incentive for the providers of the deception to put in their best effort, which risks lowering the quality and realism of the deception. The corpus described in this project, the Motivated Deception Corpus, aims to rectify this problem by gamifying the process of deceptive text collection. By having subjects play the game Two Truths and a Lie, and by rewarding those subjects that successfully fool their peers, we collect samples in such a way that the process itself improves the quality of the text. We have amassed a large corpus of deceptive text that is strongly incentivized to be convincing, and thus more reflective of real deceptive text. We provide results from several configurations of neural network prediction models to establish machine learning benchmarks on the data. This new corpus is demonstrably more challenging to classify with the current state of the art than previous corpora. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
28. Computers' Interpretations of Knowledge Representation Using Pre-Conceptual Schemas: An Approach Based on the BERT and Llama 2-Chat Models.
- Author
-
Insuasti, Jesus, Roa, Felipe, and Zapata-Jaramillo, Carlos Mario
- Subjects
LANGUAGE models ,KNOWLEDGE representation (Information theory) ,CORPORA ,SYSTEMS engineering ,COMPUTERS ,NATURAL language processing - Abstract
Pre-conceptual schemas are a straightforward way to represent knowledge using controlled language regardless of context. Despite the benefits of pre-conceptual schemas for humans, they present challenges when interpreted by computers. We propose an approach to making computers able to interpret the basic pre-conceptual schemas made by humans. To do that, the construction of a linguistic corpus is required to work with large language models (LLMs). The linguistic corpus was mainly fed using Master's and doctoral theses from the digital repository of the University of Nariño to produce a training dataset for re-training the BERT model; in addition, we complement this by explaining the sentences elicited in triads from the pre-conceptual schemas using one of the cutting-edge large language models in natural language processing: Llama 2-Chat by Meta AI. The diverse topics covered in these theses allowed us to expand the spectrum of linguistic use in the BERT model and to empower the generative capabilities using the fine-tuned Llama 2-Chat model and the proposed solution. As a result, a first version of a computational solution was built to consume the language models based on BERT and Llama 2-Chat and thus automatically interpret pre-conceptual schemas by computers via natural language processing, adding generative capabilities at the same time. The validation of the computational solution was performed in two phases. The first phase, for detecting sentences and interacting with pre-conceptual schemas, was carried out with students in the Formal Languages and Automata Theory course (the seventh semester of the systems engineering undergraduate program at the University of Nariño's Tumaco campus). The second phase, for exploring the generative capabilities based on pre-conceptual schemas, was performed with students in the Object-oriented Design course (the second semester of the same program). This validation yielded favorable results for implementing natural language processing using the BERT and Llama 2-Chat models. In this way, some bases were laid for future developments related to this research topic. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
29. esCorpius-m: A Massive Multilingual Crawling Corpus with a Focus on Spanish.
- Author
-
Gutiérrez-Fandiño, Asier, Pérez-Fernández, David, Armengol-Estapé, Jordi, Griol, David, Kharitonova, Ksenia, and Callejas, Zoraida
- Subjects
NATURAL language processing ,SPANISH language ,CORPORA ,TRANSFORMER models ,WEBSITES - Abstract
In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data, and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling. However, there are notable limitations in the results for some languages, including Spanish. These datasets are either smaller than those for other languages or suffer from lower quality due to insufficient cleaning and deduplication. In this paper, we present esCorpius-m, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. For several languages, it is the most extensive corpus with this level of high-quality content extraction, cleanliness, and deduplication. Our data curation process involves an efficient cleaning pipeline and various deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared origin URL. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
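Deduplication that respects document and paragraph boundaries, as described in record 29, can be approximated by hashing normalized paragraphs and dropping repeats while leaving each surviving document's internal structure intact. A simplified sketch follows; it is not the esCorpius-m pipeline, which also employs fuzzier methods.

```python
# Simplified paragraph-level deduplication: drop exact duplicate paragraphs
# (after light normalization) without merging or splitting documents.
import hashlib
import re

def normalize(paragraph: str) -> str:
    return re.sub(r"\s+", " ", paragraph.strip().lower())

def dedup_corpus(documents):
    seen, cleaned = set(), []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            digest = hashlib.sha1(normalize(para).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(para)
        if kept:  # document boundaries are preserved
            cleaned.append("\n\n".join(kept))
    return cleaned

docs = ["Primer párrafo.\n\nAviso legal repetido.",
        "Otro texto.\n\nAviso legal repetido."]
print(dedup_corpus(docs))  # the repeated boilerplate paragraph appears once
```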
30. Identifying and Predicting Stereotype Change in Large Language Corpora: 72 Groups, 115 Years (1900–2015), and Four Text Sources.
- Author
-
Charlesworth, Tessa E. S., Sanjeev, Nishanth, Hatzenbuehler, Mark L., and Banaji, Mahzarin R.
- Subjects
- *
STEREOTYPE content model , *STEREOTYPES , *NATURAL language processing , *LINGUISTIC change , *CORPORA - Abstract
The social world is carved into a complex variety of groups each associated with unique stereotypes that persist and shift over time. Innovations in natural language processing (word embeddings) enabled this comprehensive study on variability and correlates of change/stability in both manifest and latent stereotypes for 72 diverse groups tracked across 115 years of four English-language text corpora. Results showed, first, that group stereotypes changed by a moderate-to-large degree in manifest content (i.e., top traits associated with groups) but remained relatively more stable in latent structure (i.e., average cosine similarity of top traits' embeddings and vectors of valence, warmth, or competence). This dissociation suggests new insights into how stereotypes and their consequences may endure despite documented changes in other aspects of group representations. Second, results showed substantial variability of change/stability across the 72 groups, with some groups revealing large shifts in manifest and latent content, but others showing near-stability. Third, groups also varied in how consistently they were stereotyped across texts, with some groups showing divergent content, but others showing near-identical representations. Fourth, this variability in change/stability across groups was predicted from a combination of linguistic (e.g., frequency of mentioning the group; consistency of group stereotypes across texts) and social (e.g., the type of group) correlates. Groups that were more frequently mentioned in text changed more than those rarely mentioned; sociodemographic groups changed more than other group types (e.g., body-related stigmas, mental illnesses, occupations), providing the first quantitative evidence of specific group features that may support historical stereotype change. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
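The latent-structure measure quoted in record 30 (average cosine similarity between a group's top trait embeddings and, e.g., a valence vector) is straightforward to compute once the vectors exist. A toy sketch with random stand-in embeddings:

```python
# Toy version of the latent stereotype measure: mean cosine similarity of a
# group's top trait vectors to a dimension vector (e.g., valence).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def latent_score(trait_vectors, dimension_vector):
    return float(np.mean([cosine(t, dimension_vector) for t in trait_vectors]))

rng = np.random.default_rng(42)
valence = rng.normal(size=300)                # stand-in valence direction
top_traits_1900 = rng.normal(size=(10, 300))  # stand-in trait embeddings
top_traits_2015 = rng.normal(size=(10, 300))

# Stability of latent structure = how little this score moves across eras.
print(latent_score(top_traits_1900, valence))
print(latent_score(top_traits_2015, valence))
```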
31. Natural language processing for knowledge discovery and information extraction from energetics corpora.
- Author
-
VanGessel, Francis G., Perry, Efrem, Mohan, Salil, Barham, Oliver M., and Cavolowsky, Mark
- Subjects
DATA mining ,NATURAL language processing ,LANGUAGE models ,TRANSFORMER models ,CORPORA - Abstract
We present a demonstration of the utility of Natural Language Processing (NLP) for aiding research into energetic materials and associated systems. The NLP method enables machine understanding of textual data, offering an automated route to knowledge discovery and information extraction from energetics text. We apply three established unsupervised NLP models: Latent Dirichlet Allocation, Word2Vec, and the Transformer to a large curated dataset of energetics‐related scientific articles. We demonstrate that each NLP algorithm is capable of identifying energetic topics and concepts, generating a language model which aligns with Subject Matter Expert knowledge. Furthermore, we present a document classification pipeline for energetics text. Our classification pipeline achieves 59–76 % accuracy depending on the NLP model used, with the highest performing Transformer model rivaling inter‐annotator agreement metrics. The NLP approaches studied in this work can identify concepts germane to energetics and therefore hold promise as a tool for accelerating energetics research efforts and energetics material development. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
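Two of the three unsupervised models named in record 31 are available off the shelf; a minimal gensim pipeline over a tiny stand-in document list looks like this (generic library usage, not the authors' configuration or data):

```python
# Minimal unsupervised pipeline: LDA topics and Word2Vec vectors over a tiny
# stand-in corpus; the real study used a large curated energetics dataset.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [["detonation", "velocity", "of", "the", "explosive"],
        ["thermal", "stability", "of", "energetic", "materials"],
        ["synthesis", "of", "novel", "energetic", "compounds"]]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())

w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1)
print(w2v.wv.most_similar("energetic", topn=3))
```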
32. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms.
- Author
-
Yang, Xiao, Saha, Shyamasree, Venkatesan, Aravind, Tirunagari, Santosh, Vartak, Vid, and McEntyre, Johanna
- Subjects
NATURAL language processing ,DEEP learning ,CORPORA ,EUROPEAN literature ,DATABASES ,HEBBIAN memory - Abstract
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
33. Automatic Text Analysis of Reflective Essays to Quantify the Impact of the Modification of a Mechanical Engineering Course.
- Author
-
Narendranath, Aneet and Allen, Jeffrey S.
- Subjects
- *
MECHANICAL engineering education , *TEXT mining , *NATURAL language processing , *CORPORA , *ENGINEERING students - Abstract
Students' reflective essays in engineering education provide insight and context for instructional modification and assessment. However, the assessment of reflective essays numbering in the thousands can be time-consuming. This is notably important when trying to find specific changes in focus from one essay to another and measuring how strong those changes are across multiple corpora of essays. In this paper we describe and demonstrate an automated text analysis method for the at-scale, corpus-normalized analysis of reflective essays. We apply it to quantitatively measure whether the modification of an undergraduate mechanical engineering course had the conjectured impact of a stronger emphasis on teamwork. Our analytical method is a "pipeline" composed of Text Mining (TM), Natural Language Processing (NLP), and Recurrence Quantification Analysis (RQA). We use this method to measure the presence of a specific thematic element in reflective essays to confirm the impact of the modification of a team-driven, model-based engineering design course. The original course and its modification were visualized using Sandoval's conjecture mapping framework. The novel innovation of this approach is that the input (text from hundreds of reflective essays, sourced one at a time), when passed through this pipeline, quickly produces a quantitative indication of the presence of thematic elements and their recurrence, normalized across a corpus of hundreds of essays. A comparison of this quantitative indicator across separate corpora of reflective essays (each corpus is from a different year) signaled a change in student focus toward the conjectured outcome. We conclude that the TM-NLP-RQA pipeline can be applied for quick and at-scale extraction of the relative magnitude of thematic statements from reflective essays. We observe that our conjectured redesign had the impact that we desired. [ABSTRACT FROM AUTHOR]
- Published
- 2023
34. Optimal Quad Channel Long Short-Term Memory Based Fake News Classification on English Corpus.
- Author
-
Hamza, Manar Ahmed, Alshahrani, Hala J., Tarmissi, Khaled, Yafoz, Ayman, Mehanna, Amal S., Yaseen, Ishfaq, Abdelmageed, Amgad Atta, and Eldesouki, Mohamed I.
- Subjects
FAKE news ,CORPORA ,ENGLISH language ,SOCIAL media ,NATURAL language processing - Abstract
The term 'corpus' refers to a huge volume of structured datasets containing machine-readable texts. Such texts are generated in a natural communicative setting. The explosion of social media has permitted individuals to spread data freely with minimal examination and filtering. Due to this, the old problem of fake news has resurfaced and become an important concern due to its negative impact on the community. To manage the spread of fake news, automatic recognition approaches have been investigated earlier using Artificial Intelligence (AI) and Machine Learning (ML) techniques. ML approaches were earlier applied to medicinal text classification tasks and performed quite effectively; still, a huge effort is required from the human side to generate the labelled training data. The recent progress of Deep Learning (DL) methods seems to be a promising solution for tackling difficult types of Natural Language Processing (NLP) tasks, especially fake news detection. To unlock social media data, an automatic text classifier is highly helpful in the domain of NLP. The current research article focuses on the design of the Optimal Quad Channel Hybrid Long Short-Term Memory-based Fake News Classification (QCLSTM-FNC) approach. The presented QCLSTM-FNC approach aims to identify and differentiate fake news from actual news. To attain this, the proposed QCLSTM-FNC approach follows two methods: a data pre-processing method and a GloVe-based word embedding process. Besides, the QCLSTM model is utilized for classification. To boost the classification results of the QCLSTM model, a Quasi-Oppositional Sandpiper Optimization (QOSPO) algorithm is utilized to fine-tune the hyperparameters. The proposed QCLSTM-FNC approach was experimentally validated against a benchmark dataset and successfully outperformed all other existing DL models under different measures. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
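For orientation, a single-channel LSTM text classifier on placeholder data is sketched below; the quad-channel hybrid architecture, the pretrained GloVe vectors, and the QOSPO hyperparameter tuner from the paper are not reproduced, and the randomly initialized embedding merely stands in for GloVe:

```python
import numpy as np
import tensorflow as tf

# Toy corpus: 1 = fake, 0 = real (placeholder data).
texts = ["shocking miracle cure discovered", "parliament passes budget bill",
         "celebrity secretly an alien", "central bank raises interest rates"]
labels = np.array([1, 0, 1, 0])

# Build an integer-sequence representation of the texts.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=5000, output_sequence_length=20)
vectorizer.adapt(texts)
x = vectorizer(np.array(texts))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=100),  # stand-in for GloVe
    tf.keras.layers.LSTM(64),                                   # single channel, not quad
    tf.keras.layers.Dense(1, activation="sigmoid"),             # fake vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=5, verbose=0)
print(model.predict(x, verbose=0).round(2))
```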
35. Optimal Deep Hybrid Boltzmann Machine Based Arabic Corpus Classification Model.
- Author
-
Al Duhayyim, Mesfer, Al-onazi, Badriyya B., Nour, Mohamed K., Yafoz, Ayman, Mehanna, Amal S., Yaseen, Ishfaq, Abdelmageed, Amgad Atta, and Mohammed, Gouse Pasha
- Subjects
BOLTZMANN machine ,NATURAL language processing ,ARABIC language ,CORPORA ,DEEP learning - Abstract
Natural Language Processing (NLP) for the Arabic language has gained much significance in recent years. The most commonly utilized NLP task is 'Text Classification', whose main intention is to apply Machine Learning (ML) approaches to automatically classify textual files into one or more pre-defined categories. In ML approaches, the first and most crucial step is identifying an appropriate large dataset to train and test the method. One of the trending ML techniques, the Deep Learning (DL) technique, needs huge volumes of diverse datasets for training to yield the best outcomes. Against this background, the current study designs a new Dice Optimization with a Deep Hybrid Boltzmann Machine-based Arabic Corpus Classification (DODHBM-ACC) model. The presented DODHBM-ACC model primarily relies upon different stages of pre-processing and the word2vec word embedding process. For Arabic text classification, the DHBM technique is utilized. This technique is a hybrid version of the Deep Boltzmann Machine (DBM) and the Deep Belief Network (DBN), with the advantage of learning the decisive intention of the classification process. To adjust the hyperparameters of the DHBM technique, the Dice Optimization Algorithm (DOA) is exploited in this study. The experimental analysis established the superior performance of the proposed DODHBM-ACC model over other recent approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
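The DHBM classifier itself is specialized, but the paper's front end (word2vec document representations feeding a classifier) can be sketched; the logistic regression below is a hedged stand-in for the DHBM, and the toy documents and labels are invented:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy tokenized documents with category labels (placeholder data).
docs = [["match", "goal", "league"], ["election", "vote", "minister"],
        ["striker", "goal", "cup"], ["parliament", "vote", "law"]]
labels = [0, 1, 0, 1]  # 0 = sports, 1 = politics

w2v = Word2Vec(docs, vector_size=50, min_count=1, epochs=50, seed=1)

def doc_vector(tokens):
    # Average the word2vec vectors of in-vocabulary tokens.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```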
36. BASPRO: A Balanced Script Producer for Speech Corpus Collection Based on the Genetic Algorithm.
- Author
-
Chen, Yu-Wen, Wang, Hsin-Min, and Tsao, Yu
- Subjects
AUTOMATIC speech recognition ,SPEECH ,NATURAL language processing ,GENETIC algorithms ,COSINE function ,SPEECH enhancement ,CORPORA - Abstract
The performance of speech-processing models is heavily influenced by the speech corpus used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, the syllable distribution has 0.96 cosine similarity to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performance of speech enhancement (SE) and automatic speech recognition (ASR), which are among the most important regression- and classification-based speech-processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
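A toy version of the selection step may clarify the approach: a genetic algorithm evolving k-sentence subsets whose unit distribution best matches a target, with characters standing in for Mandarin syllables and mutation-only evolution standing in for the full GA:

```python
import random
import numpy as np

def syllable_counts(sentences, units):
    # Count occurrences of each unit (characters stand in for syllables here).
    index = {u: i for i, u in enumerate(units)}
    counts = np.zeros(len(units))
    for sent in sentences:
        for ch in sent:
            if ch in index:
                counts[index[ch]] += 1
    return counts

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def select_balanced(candidates, target, units, k=3, pop=8, gens=60):
    # Evolve k-sentence subsets whose unit distribution is maximally
    # cosine-similar to the target distribution (mutation-only GA).
    def fitness(subset):
        return cosine(syllable_counts(subset, units), target)
    population = [random.sample(candidates, k) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            child[random.randrange(k)] = random.choice(candidates)  # point mutation
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

pool = ["ba he", "ma po", "shi zi", "he hua", "po fu", "zi ba"]
units = sorted(set("".join(pool)))
target = syllable_counts(pool, units)  # aim at the whole pool's distribution
print(select_balanced(pool, target, units))
```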
37. The fragility of artists' reputations from 1795 to 2020.
- Author
-
Letian Zhang, Mitali Banerjee, Shinan Wang, and Zhuoqiao Hong
- Subjects
- *
COLLECTIVE memory , *MATTHEW effect , *ARTISTS , *REPUTATION , *NATURAL language processing , *CORPORA - Abstract
This study explores the longevity of artistic reputation. We empirically examine whether artists are more or less venerated after their death. We construct a massive historical corpus spanning 1795 to 2020 and build separate word-embedding models for each five-year period to examine how the reputations of over 3,300 famous artists (including painters, architects, composers, musicians, and writers) evolve after their death. We find that most artists gain their highest reputation right before their death, after which it declines, losing nearly one standard deviation every century. This posthumous decline applies to artists in all domains, includes those who died young or unexpectedly, and contradicts the popular view that artists' reputations endure. Contrary to the Matthew effect, the reputational decline is steepest for those who had the highest reputations while alive. Two mechanisms, artists' reduced visibility and the public's changing taste, are associated with much of the posthumous reputational decline. This study underscores the fragility of human reputation and shows how the collective memory of artists unfolds over time. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
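The embedding methodology can be sketched as follows, assuming gensim, a hypothetical prestige lexicon, and two tiny placeholder period corpora; the study's actual corpus construction and reputation measure are far more elaborate:

```python
from gensim.models import Word2Vec

# Placeholder corpora: tokenized sentences for two five-year periods.
period_corpora = {
    "1900-1904": [["monet", "was", "a", "celebrated", "painter"],
                  ["the", "great", "monet", "exhibition"]],
    "1950-1954": [["monet", "mentioned", "briefly"],
                  ["famous", "art", "fades", "monet"]],
}
PRESTIGE_WORDS = ["celebrated", "great", "famous"]  # hypothetical prestige lexicon

for period, sentences in period_corpora.items():
    # One embedding model per period, as in the study's design.
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100, seed=0)
    sims = [model.wv.similarity("monet", w) for w in PRESTIGE_WORDS if w in model.wv]
    score = sum(sims) / len(sims) if sims else float("nan")
    print(period, round(score, 3))  # reputation proxy per period
```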
38. Hierarchical Clause Annotation: Building a Clause-Level Corpus for Semantic Parsing with Complex Sentences.
- Author
-
Fan, Yunlong, Li, Bin, Sataer, Yikemaiti, Gao, Miao, Shi, Chuanqi, Cao, Siyi, and Gao, Zhiqiang
- Subjects
PARSING (Computer grammar) ,TEXT summarization ,SEMANTICS ,MACHINE translating ,NATURAL language processing ,CORPORA ,DATA mining ,ANNOTATIONS - Abstract
Featured Application: Hierarchical clause annotation could be applied in many downstream tasks of natural language processing, including abstract meaning representation parsing, semantic dependency parsing, text summarization, argument mining, information extraction, question answering, machine translation, etc. Most natural-language-processing (NLP) tasks suffer performance degradation when encountering long complex sentences, including semantic parsing, syntactic parsing, machine translation, and text summarization. Previous works addressed the issue with the intuition of decomposing complex sentences and linking simple ones, such as rhetorical-structure-theory (RST)-style discourse parsing, split-and-rephrase (SPRP), text simplification (TS), and simple sentence decomposition (SSD). However, these works are not applicable to semantic parsing tasks such as abstract meaning representation (AMR) parsing and semantic dependency parsing, due to their misalignment with semantic relations and their inability to preserve the original semantics. Following the same intuition while avoiding these deficiencies, we propose a novel framework, hierarchical clause annotation (HCA), for capturing the clausal structures of complex sentences, based on the linguistic research of clause hierarchy. With the HCA framework, we annotated a large HCA corpus to explore the potential of integrating HCA structural features into semantic parsing with complex sentences. Moreover, we decomposed HCA into two subtasks, i.e., clause segmentation and clause parsing, and provide neural baseline models to support further silver-standard annotation. In evaluating the proposed models on our manually annotated HCA dataset, clause segmentation and parsing achieved 91.3% F1-scores and 88.5% Parseval scores, respectively. Given the identical model architectures employed, the performance differences between the clause/discourse segmentation and parsing subtasks were reflected in our HCA corpus and the compared discourse corpora, where our sentences contained more segment units and fewer interrelations than those in the compared corpora. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
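Segmentation quality of the kind reported here is typically scored as exact-match F1 over predicted clause spans; a small sketch (with invented gold and predicted spans) follows:

```python
def span_f1(gold_spans, pred_spans):
    # Exact-match F1 over clause spans given as (start, end) token offsets.
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One sentence split into clauses by token offsets (toy example).
gold = [(0, 5), (5, 12), (12, 20)]
pred = [(0, 5), (5, 13)]
print(round(span_f1(gold, pred), 3))  # 0.4
```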
39. An Extended AHP-Based Corpus Assessment Approach for Handling Keyword Ranking of NLP: An Example of COVID-19 Corpus Data.
- Author
-
Chen, Liang-Ching and Chang, Kuei-Hu
- Subjects
- *
NATURAL language processing , *ANALYTIC hierarchy process , *ENVIRONMENTAL research , *CORPORA , *COVID-19 , *ENVIRONMENTAL sciences - Abstract
The use of corpus assessment approaches to determine and rank keywords for corpus data is critical due to the issues of information retrieval (IR) in Natural Language Processing (NLP), such as when encountering COVID-19, as it can determine whether people can rapidly obtain knowledge of the disease. The algorithms used for corpus assessment have to consider multiple parameters and simultaneously integrate individuals' subjective evaluation information to meet real-world needs. However, traditional keyword-list-generating approaches are based on only one parameter (i.e., the keyness value) to determine and rank keywords, which is insufficient. To improve on the traditional keyword-list-generating approach, this paper proposed an extended analytic hierarchy process (AHP)-based corpus assessment approach to first refine the corpus data and then use the AHP method to compute the relative weights of three parameters (keyness, frequency, and range). To verify the proposed approach, this paper adopted 53 COVID-19-related environmental science research articles from the Web of Science (WOS) as an empirical example. After comparison with the traditional keyword-list-generating approach and the equal weights (EW) method, the significant contributions are: (1) using a machine-based technique to remove function words and meaningless words, optimizing the corpus data; (2) the ability to consider multiple parameters simultaneously; and (3) the ability to integrate the experts' evaluation results to determine the relative weights of the parameters. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
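The AHP weighting step is standard and easy to illustrate: the relative weights of keyness, frequency, and range come from the principal eigenvector of a pairwise comparison matrix. The matrix values below are hypothetical, not the paper's elicited judgments:

```python
import numpy as np

# Hypothetical pairwise comparison matrix for (keyness, frequency, range):
# entry [i][j] says how much more important criterion i is than j (Saaty scale).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 3.0],
    [1/5, 1/3, 1.0],
])

# The principal eigenvector of A, normalized to sum to 1, gives the weights.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
w = np.abs(eigvecs[:, k].real)
w = w / w.sum()
print("weights (keyness, frequency, range):", np.round(w, 3))

# Consistency check (random index RI = 0.58 for a 3x3 matrix).
ci = (eigvals.real[k] - len(A)) / (len(A) - 1)
print("consistency ratio:", round(ci / 0.58, 3))  # should be < 0.1
```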
40. Semi-supervised method for improving general-purpose and domain-specific textual corpora labels.
- Author
-
Babikov, Igor, Kovalchuk, Sergey, and Soldatov, Ivan
- Subjects
SUPERVISED learning ,LANGUAGE models ,NATURAL language processing ,CORPORA ,PIPELINE failures - Abstract
The study is devoted to the problem of label quality assurance and labeling cost minimization. High-quality labels are important for building efficient supervised machine learning pipelines, but they have proved costly to acquire; however, one can work with weaker instances of supervision that may help create a reliable set of ground-truth labels for classifiers. The authors present a semi-supervised approach using pretrained language models to increase the quality of supervision information in general-purpose as well as domain-specific document sets where weak labels and unlabeled data are present. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
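One common form of such semi-supervision is a pseudo-labeling loop: train on the labeled seed, keep confident predictions on unlabeled text, retrain. The sketch below uses TF-IDF and logistic regression as a hedged stand-in for the authors' pretrained language models, on invented data:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["invoice overdue payment", "meeting agenda attached",
           "pay the outstanding invoice", "schedule the next meeting"]
y = np.array([1, 0, 1, 0])  # 1 = finance, 0 = admin (toy labels)
unlabeled = ["final payment reminder", "agenda for tomorrow", "quarterly invoice due"]

vec = TfidfVectorizer()
X = vec.fit_transform(labeled + unlabeled)
X_lab, X_unl = X[: len(labeled)], X[len(labeled):]

clf = LogisticRegression().fit(X_lab, y)
proba = clf.predict_proba(X_unl)

# Keep only confident pseudo-labels, then retrain on the enlarged set.
confident = proba.max(axis=1) >= 0.6
X_aug = vstack([X_lab, X_unl[confident]])
y_aug = np.concatenate([y, proba.argmax(axis=1)[confident]])
clf = LogisticRegression().fit(X_aug, y_aug)
print(int(confident.sum()), "pseudo-labels added")
```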
41. A Curriculum Learning Approach for Multi-Domain Text Classification Using Keyword Weight Ranking.
- Author
-
Yuan, Zilin, Li, Yinghui, Li, Yangning, Zheng, Hai-Tao, He, Yaobin, Liu, Wenqiang, Huang, Dongxiao, and Wu, Bei
- Subjects
CURRICULUM ,LEARNING ,CLASSIFICATION ,NATURAL language processing ,CORPORA - Abstract
Text classification is a well-established task in NLP, but it has two major limitations. Firstly, text classification is heavily reliant on domain-specific knowledge, meaning that a classifier trained on a given corpus may not perform well when presented with text from another domain. Secondly, text classification models require substantial amounts of annotated data for training, and in certain domains there may be an insufficient quantity of labeled data available. Consequently, it is essential to explore methods for efficiently utilizing text data from various domains to improve the performance of models across a range of domains. One approach is to use multi-domain text classification models that leverage adversarial training to extract the features shared among all domains as well as the specific features of each domain. After observing the varying distinctness of domain-specific features, our paper introduces a curriculum learning approach using a ranking system based on keyword weight to enhance the effectiveness of multi-domain text classification models. The experimental results on the Amazon reviews and FDU-MTL datasets show that our method significantly improves the efficacy of multi-domain text classification models that adopt adversarial learning, reaching state-of-the-art outcomes on these two datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
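A simplified reading of the curriculum idea is to score each domain's keyword distinctness and order training accordingly. The sketch below uses top TF-IDF weights as the distinctness score; the paper's ranking operates inside an adversarial model, so this is only an analogy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy multi-domain corpora (placeholder data).
domains = {
    "books":   ["gripping plot and characters", "the novel drags in the middle"],
    "kitchen": ["the blender broke after a week", "sharp knives, great value"],
    "music":   ["catchy melody throughout", "the album feels flat"],
}

def distinctness(texts, background):
    # Mean top-keyword TF-IDF weight of a domain against a shared
    # vocabulary: higher = more distinctive domain-specific features.
    vec = TfidfVectorizer().fit(background)
    X = vec.transform(texts)
    return float(X.max(axis=1).toarray().mean())

background = [t for texts in domains.values() for t in texts]
scores = {d: distinctness(t, background) for d, t in domains.items()}

# Curriculum: train on the most distinctive domains first.
curriculum = sorted(scores, key=scores.get, reverse=True)
print(curriculum)
```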
42. UNDERRL TAGGER: A Grammatical Tagger for Technologically Under-Resourced and Minority Languages.
- Author
-
Pemberty Tamayo, José Luis, Molina Mejía, Jorge Mauricio, and Vallejo Zapata, Víctor Julián
- Subjects
NATURAL language processing ,LINGUISTIC minorities ,CORPORA ,LANGUAGE & languages ,SOFTWARE architecture ,DESIGN software - Abstract
Copyright of Forma y Funcion is the property of Universidad Nacional de Colombia, Facultad de Ciencias Humanas, Departamento de Linguistica and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2023
- Full Text
- View/download PDF
43. Review of Pascual & Mark (2021): Beyond Concordance Lines: Corpora in language education.
- Author
-
Le Foll, Elen
- Subjects
STUDENT attitudes ,NATURAL language processing ,CORPORA ,LANGUAGE & languages ,PROGRAMMING languages - Abstract
The book "Beyond Concordance Lines: Corpora in language education" edited by Pascual Pérez-Paredes and Geraldine Mark provides an overview of data-driven learning (DDL) approaches in language teaching and learning. The book is divided into three sections, covering the contextualization of DDL, corpus-based learner language research, and practical applications of DDL in diverse educational contexts. The chapters explore topics such as the historical overview of DDL research, the intersection between DDL and second language acquisition theories, and the impact of DDL on English as a Foreign Language learners. The book offers valuable insights into the state of DDL research and its potential for language education. [Extracted from the article]
- Published
- 2023
- Full Text
- View/download PDF
44. Research of natural language processing based on dynamic search corpus in cultural translation and emotional analysis.
- Author
-
Wang, Junya
- Subjects
- *
MACHINE translating , *LANGUAGE research , *NATURAL language processing , *TRANSLATING & interpreting , *CORPORA - Abstract
In order to enable students to directly face empirical data, summarize translation rules, and learn translation skills, this paper studies the basis, motivation, and methods of applying dynamic corpus search in translation and teaching. Presenting data in class is the main method of dynamically searching corpora; it lets learners face a sufficient amount of easily selectable bilingual data and keeps the teaching of translation skills for selected language items relatively focused. In recent years, the sentiment analysis of texts has attracted academic and professional researchers, and the research methods used and the language-related cultural backgrounds studied have become increasingly broad. In this paper, natural language processing is used to analyze the emotions contained in translated texts. Natural language processing not only helps to manage the huge volume of data needed to translate texts efficiently, but also helps to extract the emotions hidden in text translation, achieving twice the result with half the effort. Multi-label classification in natural language processing can reflect the emotional information of the translated text in finer detail, which is helpful for further research. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
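The multi-label emotion classification mentioned here can be illustrated with a one-vs-rest setup; the translations, the three-emotion label set, and the multi-hot annotations below are all invented:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

translations = ["the farewell scene is tender and sorrowful",
                "a joyful festival fills the translated village",
                "anger and grief drive the closing chapter",
                "a calm, joyful morning opens the tale"]
# Multi-hot labels over (joy, sadness, anger) -- placeholder annotations.
Y = np.array([[0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]])

X = TfidfVectorizer().fit_transform(translations)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X))  # one binary classifier per emotion label
```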
45. Constructing a cross-document event coreference corpus for Dutch.
- Author
-
De Langhe, Loic, De Clercq, Orphée, and Hoste, Veronique
- Subjects
- *
NATURAL language processing , *CORPORA , *LIE detectors & detection - Abstract
Event coreference resolution is a task in which different text fragments that refer to the same real-world event are automatically linked together. This task can be performed not only within a single document but also across different documents and can serve as a basis for many useful Natural Language Processing applications. Resources for this type of research, however, are extremely limited. We compiled the first large-scale dataset for cross-document event coreference resolution in Dutch, comparable in size to the most widely used English event coreference corpora. As data for event coreference is notoriously sparse, we took additional steps to maximize the number of coreference links in our corpus. Due to the complex nature of event coreference resolution, many algorithms consist of pipeline architectures which rely on a series of upstream tasks such as event detection, event argument identification and argument coreference. We tackle the task of event argument coreference to both illustrate the potential of our compiled corpus and to lay the groundwork for a Dutch event coreference resolution system in the future. Results show that existing NLP algorithms can be easily retrofitted to contribute to the subtasks of an event coreference resolution pipeline system. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
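A common baseline for cross-document event coreference is to cluster mention representations; the sketch below uses TF-IDF vectors and agglomerative clustering with an illustrative distance cut-off, which is far simpler than the pipeline systems discussed in the article:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Event mentions drawn from different documents (toy examples).
mentions = ["explosion hits the chemical plant",
            "the explosion at the chemical plant injured two",
            "parliament approves the new budget",
            "parliament passes the budget after debate"]

D = cosine_distances(TfidfVectorizer().fit_transform(mentions))
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.7,   # illustrative cut-off
    metric="precomputed", linkage="average",
).fit_predict(D)
print(labels)  # mentions sharing a label are treated as coreferent
```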
46. Assessing spoken lexical and lexicogrammatical proficiency using features of word, bigram, and dependency bigram use.
- Author
-
Kyle, Kristopher and Eguchi, Masaki
- Subjects
- *
LANGUAGE ability , *COLLOCATION (Linguistics) , *CORPORA , *COMPUTATIONAL linguistics , *SECOND language acquisition - Abstract
The measurement of second language (L2) productive lexical proficiency has driven a great deal of research over the past two decades. Research has indicated that more proficient speakers and writers tend to use a wider range of words and that more proficient writers tend to use words that are more sophisticated (less frequent in reference corpora). Research over the past 15 years has also demonstrated that the way words are used in context (i.e., collocation use) is also an important indicator of both written and spoken proficiency. In this study, we extend recent research that has modeled writing proficiency using collocation indices based on grammatical dependencies (e.g., verb–direct object) to spoken contexts. In particular, we model speaking proficiency scores from a large corpus of oral proficiency interview responses using a range of well‐known indices of productive proficiency and newly developed grammatical dependency indices. The results indicated that all index types demonstrated small to moderate correlations with speaking proficiency individually but explained a large proportion of the variance when used in a multivariate model that included dependency collocation indices. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
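Dependency bigrams of the kind used as indices here can be extracted with an off-the-shelf parser. A sketch with spaCy follows (assuming the en_core_web_sm model is installed); real proficiency indices would then compare these pairs against reference-corpus statistics:

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def dependency_bigrams(text, relation="dobj"):
    # Extract (head lemma, dependent lemma) pairs for one grammatical
    # relation, e.g. verb + direct object collocations.
    doc = nlp(text)
    return [(tok.head.lemma_, tok.lemma_) for tok in doc if tok.dep_ == relation]

response = "She answered the question quickly and then asked a question herself."
print(Counter(dependency_bigrams(response)))
# e.g. Counter({('answer', 'question'): 1, ('ask', 'question'): 1})
```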
47. When is a Crisis Really a Crisis? Using NLP and Corpus Linguistic Methods to Reveal Differences in Migration Discourse Across Czech Media.
- Author
-
Pekáček, Ondřej and Elmerot, Irene
- Subjects
- *
NATURAL language processing , *CORPORA , *TERMS & phrases , *ALTERNATIVE mass media , *EUROPEAN Migrant Crisis, 2015-2016 - Abstract
This article presents an interdisciplinary analysis of discourses on refugees, asylum seekers, immigrants, and migrants (RASIM) in mainstream and alternative media in the Czech Republic. Using techniques from corpus linguistics (CL) and natural language processing (NLP) and drawing on insights from media sociology, we demonstrate the value of an interdisciplinary approach for conducting robust research that can inform policymakers and media practitioners. Our analysis of nearly one million documents from January 2015 to February 2023 reveals distinctive terms and phrases used by alternative media, highlighting the growing divergence between the mainstream and alternative media discourse and its intensity over different periods. These findings have implications for understanding the mobilization of anti-systemic groups, particularly those on the far right. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
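Distinctive terms of the kind reported here are standardly identified with keyness statistics such as Dunning's log-likelihood (G2); a self-contained sketch with made-up frequencies follows (values above 3.84 are significant at p < 0.05):

```python
import math

def log_likelihood(a, b, corpus_a_size, corpus_b_size):
    # Dunning's log-likelihood (G2) keyness of a word occurring
    # a times in corpus A and b times in corpus B.
    e1 = corpus_a_size * (a + b) / (corpus_a_size + corpus_b_size)
    e2 = corpus_b_size * (a + b) / (corpus_a_size + corpus_b_size)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# "crisis" appears 1200 times in a 5M-token alternative-media corpus
# and 400 times in a 5M-token mainstream corpus (made-up figures).
print(round(log_likelihood(1200, 400, 5_000_000, 5_000_000), 1))
```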
48. A Bibliometric Review of Methods and Algorithms for Generating Corpora for Learning Vector Word Embeddings
- Author
-
Sagingaliyev, Beibarys, Aitakhunova, Zhuldyzay, Shaimerdenova, Adel, Akhmetov, Iskander, Pak, Alexander, Jaxylykova, Assel, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Pichardo Lagunas, Obdulia, editor, Martínez-Miranda, Juan, editor, and Martínez Seis, Bella, editor
- Published
- 2022
- Full Text
- View/download PDF
49. Natural Language Processing for Corpus Linguistics.
- Author
-
Wen, Ju and Yi, Lan
- Subjects
COMPUTATIONAL linguistics ,DOCUMENT clustering ,LANGUAGE research ,NATURAL language processing ,PRAGMATICS ,CORPORA ,LINGUISTIC analysis ,NATURAL languages - Abstract
As its name suggests, this monograph focuses exclusively on utilizing NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It is noteworthy that most of the analyses presented in this book are based on the "text analytics" Python package, which is dedicated to corpus-based text analysis using NLP techniques. Despite its success in a wide range of fields (Römer, [4]), traditional corpus linguistics has become seemingly disconnected from recent technological advances in artificial intelligence as the computing power and corpus data available for linguistic analysis have continued to grow in the past decades. [Extracted from the article]
- Published
- 2023
- Full Text
- View/download PDF
50. A distributable German clinical corpus containing cardiovascular clinical routine doctor's letters.
- Author
-
Richter-Pechanski, Phillip, Wiesenbach, Philipp, Schwab, Dominic M., Kiriakou, Christina, He, Mingyang, Allers, Michael M., Tiefenbacher, Anna S., Kunz, Nicola, Martynova, Anna, Spiller, Noemie, Mierisch, Julian, Borchert, Florian, Schwind, Charlotte, Frey, Norbert, Dieterich, Christoph, and Geis, Nicolas A.
- Subjects
NATURAL language processing ,GERMAN language ,TEMPORAL databases ,CORPORA ,DATA mining ,LANGUAGE research - Abstract
We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. CARDIO:DE encompasses 500 clinical routine German doctor's letters from Heidelberg University Hospital, which were manually annotated. Our prospective study design complies well with current data protection regulations and allows us to keep the original structure of clinical documents consistent. In order to ease access to our corpus, we manually de-identified all letters. To enable various information extraction tasks, the temporal information in the documents was preserved. We added two high-quality manual annotation layers to CARDIO:DE: (1) medication information and (2) CDA-compliant section classes. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German clinical corpus in the cardiovascular domain. In summary, our corpus offers unique opportunities for collaborative and reproducible research on natural language processing models for German clinical texts. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
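De-identification in CARDIO:DE was performed manually, but a toy rule-based pseudonymization step shows the general shape of the task; the pattern and the sample sentence below are purely illustrative, and note that the corpus deliberately preserves temporal expressions:

```python
import re

# Illustrative pseudonymization rule: titled person names only.
NAME = re.compile(r"\b(?:Dr|Prof)\.\s+[A-ZÄÖÜ][a-zäöüß]+")

def pseudonymize(text):
    # Replace titled person names with a placeholder token,
    # leaving dates intact (CARDIO:DE keeps temporal information).
    return NAME.sub("<PERSON>", text)

letter = "Dr. Mustermann empfahl eine Kontrolle am 12.03.2021."
print(pseudonymize(letter))
# -> '<PERSON> empfahl eine Kontrolle am 12.03.2021.'
```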