"Language Modeling" / Database: Academic Search Index / Journal: plos one - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Language Modeling"' showing total 10 results

Start Over "Language Modeling" Journal plos one Database Academic Search Index

10 results on '"Language Modeling"'

1. Exploring a method for extracting concerns of multiple breast cancer patients in the domain of patient narratives using BERT and its optimization by domain adaptation using masked language modeling.

Author: Watabe, Satoshi, Watanabe, Tomomi, Yada, Shuntaro, Aramaki, Eiji, Yajima, Hiroshi, Kizaki, Hayato, and Hori, Satoko
Subjects: *LANGUAGE models, *BREAST cancer, *CANCER patients, *MEDICAL history taking
Abstract: Narratives posted on the internet by patients contain a vast amount of information about various concerns. This study aimed to extract multiple concerns from interviews with breast cancer patients using the natural language processing (NLP) model bidirectional encoder representations from transformers (BERT). A total of 508 interview transcriptions of breast cancer patients written in Japanese were labeled with five types of concern labels: "treatment," "physical," "psychological," "work/financial," and "family/friends." The labeled texts were used to create a multi-label classifier by fine-tuning a pre-trained BERT model. Prior to fine-tuning, we also created several classifiers with domain adaptation using (1) breast cancer patients' blog articles and (2) breast cancer patients' interview transcriptions. The performance of the classifiers was evaluated in terms of precision through 5-fold cross-validation. The multi-label classifiers with only fine-tuning had precision values of over 0.80 for "physical" and "work/financial" out of the five concerns. On the other hand, precision for "treatment" was low at approximately 0.25. However, for the classifiers using domain adaptation, the precision of this label took a range of 0.40–0.51, with some cases improving by more than 0.2. This study showed combining domain adaptation with a multi-label classifier on target data made it possible to efficiently extract multiple concerns from interviews. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

2. Establishing vocabulary tests as a benchmark for evaluating large language models.

Author: Martínez, Gonzalo, Conde, Javier, Merino-Gómez, Elena, Bermúdez-Margaretto, Beatriz, Hernández, José Alberto, Reviriego, Pedro, and Brysbaert, Marc
Abstract: Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

3. ET-Network: A novel efficient transformer deep learning model for automated Urdu handwritten text recognition.

Author: Hamza, Ameer, Ren, Shengbing, and Saeed, Usman
Subjects: *DEEP learning, *TEXT recognition, *TRANSFORMER models, *FEATURE extraction, *ERROR rates, *HANDWRITING
Abstract: Automatic Urdu handwritten text recognition is a challenging task in the OCR industry. Unlike printed text, Urdu handwriting lacks a uniform font and structure. This lack of uniformity causes data inconsistencies and recognition issues. Different writing styles, cursive scripts, and limited data make Urdu text recognition a complicated task. Major languages, such as English, have experienced advances in automated recognition, whereas low-resource languages, such as Urdu, still lag. Transformer-based models are promising for automated recognition in high- and low-resource languages such as Urdu. This paper presents a transformer-based method called ET-Network that integrates self-attention into EfficientNet for feature extraction and a transformer for language modeling. The use of self-attention layers in EfficientNet helps to extract global and local features that capture long-range dependencies. These features proceeded into a vanilla transformer to generate text, and a prefix beam search is used for the finest outcome. NUST-UHWR, UPTI2.0, and MMU-OCR-21 are three datasets used to train and test the ET Network for a handwritten Urdu script. The ET-Network improved the character error rate by 4% and the word error rate by 1.55%, while establishing a new state-of-the-art character error rate of 5.27% and a word error rate of 19.09% for Urdu handwritten text. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

4. Quantifying gender bias towards politicians in cross-lingual language models.

Author: Stańczak, Karolina, Ray Choudhury, Sagnik, Pimentel, Tiago, Cotterell, Ryan, and Augenstein, Isabelle
Subjects: *LANGUAGE models, *SEX discrimination, *WOMEN politicians, *POLITICIANS, *NATURAL languages
Abstract: Recent research has demonstrated that large pre-trained language models reflect societal biases expressed in natural language. The present paper introduces a simple method for probing language models to conduct a multilingual study of gender bias towards politicians. We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender. To this end, we curate a dataset of 250k politicians worldwide, including their names and gender. Our study is conducted in seven languages across six different language modeling architectures. The results demonstrate that pre-trained language models' stance towards politicians varies strongly across analyzed languages. We find that while some words such as dead, and designated are associated with both male and female politicians, a few specific words such as beautiful and divorced are predominantly associated with female politicians. Finally, and contrary to previous findings, our study suggests that larger language models do not tend to be significantly more gender-biased than smaller ones. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

5. Bat4RCT: A suite of benchmark data and baseline methods for text classification of randomized controlled trials.

Author: Kim, Jenna, Kim, Jinmo, Lee, Aejin, and Kim, Jinseok
Subjects: *RANDOMIZED controlled trials, *TAGS (Metadata), *MACHINE learning, *CLASSIFICATION
Abstract: Randomized controlled trials (RCTs) play a major role in aiding biomedical research and practices. To inform this research, the demand for highly accurate retrieval of scientific articles on RCT research has grown in recent decades. However, correctly identifying all published RCTs in a given domain is a non-trivial task, which has motivated computer scientists to develop methods for identifying papers involving RCTs. Although existing studies have provided invaluable insights into how RCT tags can be predicted for biomedicine research articles, they used datasets from different sources in varying sizes and timeframes and their models and findings cannot be compared across studies. In addition, as datasets and code are rarely shared, researchers who conduct RCT classification have to write code from scratch, reinventing the wheel. In this paper, we present Bat4RCT, a suite of data and an integrated method to serve as a strong baseline for RCT classification, which includes the use of BERT-based models in comparison with conventional machine learning techniques. To validate our approach, all models are applied on 500,000 paper records in MEDLINE. The BERT-based models showed consistently higher recall scores than conventional machine learning and CNN models while producing slightly better or similar precision scores. The best performance was achieved by the BioBERT model when trained on both title and abstract texts, with the F1 score of 90.85%. This infrastructure of dataset and code will provide a competitive baseline for the evaluation and comparison of new methods and the convenience of future benchmarking. To our best knowledge, our study is the first work to apply BERT-based language modeling techniques to RCT classification tasks and to share dataset and code in order to promote reproducibility and improvement in text classification in biomedicine research. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

6. Pre-training Model Based on Parallel Cross-Modality Fusion Layer.

Author: Li, Xuewei, Han, Dezhi, and Chang, Chin-Chen
Subjects: *COMPUTER vision, *NATURAL language processing
Abstract: Visual Question Answering (VQA) is a learning task that combines computer vision with natural language processing. In VQA, it is important to understand the alignment between visual concepts and linguistic semantics. In this paper, we proposed a Pre-training Model Based on Parallel Cross-Modality Fusion Layer (P-PCFL) to learn the fine-grained relationship between vision and language. The P-PCFL model is composed of three Encoders: Object Encoder, Language Encoder, and Parallel Cross-Modality Fusion Encoder, with Transformer as the core. We use four different Pre-training missions, namely, Cross-Modality Mask Language Modeling, Cross-Modality Mask Region Modeling, Image-Text Matching, and Image-Text Q&A, to pre-train the P-PCFL model and improve its reasoning and universality, which help to learn the relationship between Intra-modality and Inter-modality. Experimental results on the platform of Visual Question Answering dataset VQA v2.0 show that the Pre-trained P-PCFL model has a good effect after fine-tuning the parameters. In addition, we also conduct ablation experiments and provide some results of Attention visualization to verify the effectiveness of P-PCFL model. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

7. TweepFake: About detecting deepfake tweets.

Author: Fagni, Tiziano, Falchi, Fabrizio, Gambini, Margherita, Martella, Antonio, and Tesconi, Maurizio
Subjects: *MICROBLOGS, *MARKOV processes, *SOCIAL media
Abstract: The recent advances in language modeling significantly improved the generative capabilities of deep neural models: in 2019 OpenAI released GPT-2, a pre-trained language model that can autonomously generate coherent, non-trivial and human-like text samples. Since then, ever more powerful text generative models have been developed. Adversaries can exploit these tremendous generative capabilities to enhance social bots that will have the ability to write plausible deepfake messages, hoping to contaminate public debate. To prevent this, it is crucial to develop deepfake social media messages detection systems. However, to the best of our knowledge no one has ever addressed the detection of machine-generated texts on social networks like Twitter or Facebook. With the aim of helping the research in this detection field, we collected the first dataset of real deepfake tweets, TweepFake. It is real in the sense that each deepfake tweet was actually posted on Twitter. We collected tweets from a total of 23 bots, imitating 17 human accounts. The bots are based on various generation techniques, i.e., Markov Chains, RNN, RNN+Markov, LSTM, GPT-2. We also randomly selected tweets from the humans imitated by the bots to have an overall balanced dataset of 25,572 tweets (half human and half bots generated). The dataset is publicly available on Kaggle. Lastly, we evaluated 13 deepfake text detection methods (based on various state-of-the-art approaches) to both demonstrate the challenges that Tweepfake poses and create a solid baseline of detection techniques. We hope that TweepFake can offer the opportunity to tackle the deepfake detection on social media messages as well. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

8. Keepin' it real: Linguistic models of authenticity judgments for artificially generated rap lyrics.

Author: Karsdorp, Folgert, Manjavacas, Enrique, and Kestemont, Mike
Subjects: *LINGUISTIC models, *JUDGMENT (Psychology), *MUSIC festivals, *POPULAR music, *SENSORY perception
Abstract: Through advances in neural language modeling, it has become possible to generate artificial texts in a variety of genres and styles. While the semantic coherence of such texts should not be over-estimated, the grammatical correctness and stylistic qualities of these artificial texts are at times remarkably convincing. In this paper, we report a study into crowd-sourced authenticity judgments for such artificially generated texts. As a case study, we have turned to rap lyrics, an established sub-genre of present-day popular music, known for its explicit content and unique rhythmical delivery of lyrics. The empirical basis of our study is an experiment carried out in the context of a large, mainstream contemporary music festival in the Netherlands. Apart from more generic factors, we model a diverse set of linguistic characteristics of the input that might have functioned as authenticity cues. It is shown that participants are only marginally capable of distinguishing between authentic and generated materials. By scrutinizing the linguistic features that influence the participants' authenticity judgments, it is shown that linguistic properties such as 'syntactic complexity', 'lexical diversity' and 'rhyme density' add to the user's perception of texts being authentic. This research contributes to the improvement of the quality and credibility of generated text. Additionally, it enhances our understanding of the perception of authentic and artificial art. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

9. Predictive modeling for odor character of a chemical using machine learning combined with natural language processing.

Author: Nozaki, Yuji and Nakamoto, Takamichi
Subjects: *NATURAL language processing, *MACHINE learning, *SENSORY evaluation, *TEXT mining, *COMPUTER simulation
Abstract: Recent studies on machine learning technology have reported successful performances in some visual and auditory recognition tasks, while little has been reported in the field of olfaction. In this paper we report computational methods to predict the odor impression of a chemical from its physicochemical properties. Our predictive model utilizes nonlinear dimensionality reduction on mass spectra data and performs the clustering of descriptors by natural language processing. Sensory evaluation is widely used to measure human impressions to smell or taste by using verbal descriptors, such as “spicy” and “sweet”. However, as it requires significant amounts of time and human resources, a large-scale sensory evaluation test is difficult to perform. Our model successfully predicts a group of descriptors for a target chemical through a series of computer simulations. Although the training text data used in the language modeling is not specialized for olfaction, the experimental results show that our method is useful for analyzing sensory datasets. This is the first report to combine machine olfaction with natural language processing for odor character prediction. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

10. The Brazilian Portuguese Lexicon: An Instrument for Psycholinguistic Research.

Author: Estivalet, Gustavo L. and Meunier, Fanny
Subjects: *LEXICON, *PSYCHOLINGUISTICS, *COMPUTATIONAL linguistics, *PORTUGUESE language, *INTERNET access, *LANGUAGE & languages
Abstract: In this article, we present the Brazilian Portuguese Lexicon, a new word-based corpus for psycholinguistic and computational linguistic research in Brazilian Portuguese. We describe the corpus development, the specific characteristics on the internet site and database for user access. We also perform distributional analyses of the corpus and comparisons to other current databases. Our main objective was to provide a large, reliable, and useful word-based corpus with a dynamic, easy-to-use, and intuitive interface with free internet access for word and word-criteria searches. We used the Núcleo Interinstitucional de Linguística Computacional’s corpus as the basic data source and developed the Brazilian Portuguese Lexicon by deriving and adding metalinguistic and psycholinguistic information about Brazilian Portuguese words. We obtained a final corpus with more than 30 million word tokens, 215 thousand word types and 25 categories of information about each word. This corpus was made available on the internet via a free-access site with two search engines: a simple search and a complex search. The simple engine basically searches for a list of words, while the complex engine accepts all types of criteria in the corpus categories. The output result presents all entries found in the corpus with the criteria specified in the input search and can be downloaded as a.csv file. We created a module in the results that delivers basic statistics about each search. The Brazilian Portuguese Lexicon also provides a pseudoword engine and specific tools for linguistic and statistical analysis. Therefore, the Brazilian Portuguese Lexicon is a convenient instrument for stimulus search, selection, control, and manipulation in psycholinguistic experiments, as also it is a powerful database for computational linguistics research and language modeling related to lexicon distribution, functioning, and behavior. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

10 results on '"Language Modeling"'

1. Exploring a method for extracting concerns of multiple breast cancer patients in the domain of patient narratives using BERT and its optimization by domain adaptation using masked language modeling.

2. Establishing vocabulary tests as a benchmark for evaluating large language models.

3. ET-Network: A novel efficient transformer deep learning model for automated Urdu handwritten text recognition.

4. Quantifying gender bias towards politicians in cross-lingual language models.

5. Bat4RCT: A suite of benchmark data and baseline methods for text classification of randomized controlled trials.

6. Pre-training Model Based on Parallel Cross-Modality Fusion Layer.

7. TweepFake: About detecting deepfake tweets.

8. Keepin' it real: Linguistic models of authenticity judgments for artificially generated rap lyrics.

9. Predictive modeling for odor character of a chemical using machine learning combined with natural language processing.

10. The Brazilian Portuguese Lexicon: An Instrument for Psycholinguistic Research.

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Region

Database

10 results on '"Language Modeling"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources