Descriptor: "BERT" - Searchworks@Jio Institute Digital Library Search Results

151. Source Code Error Understanding Using BERT for Multi-Label Classification

Author: Md Faizul Ibne Amin, Yutaka Watanobe, Md Mostafizer Rahman, and Atsushi Shirafuji
Subjects: Multi-label classification, BERT, CodeT5, CodeBERT, decision tree, random forest, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: Programming is an essential skill in computer science and across a wide range of engineering disciplines. However, errors, often referred to as ‘bugs’ in code, can be challenging to identify and rectify for both students learning to program and experienced professionals. Understanding, identifying, and effectively addressing these errors are critical aspects of programming education and software development. To aid in understanding and classifying these errors, we propose a multi-label error classification approach for source code using fine-tuned BERT models (BERT_Uncased and BERT_Cased). The models achieved average classification accuracies of 90.58% and 90.80%, exact match accuracies of 48.28% and 49.13%, and weighted F1 scores of 0.796 and 0.799, respectively. Precision, Recall, Hamming Loss, and ROC-AUC metrics further evaluate the effectiveness of our models. Additionally, we employed several combinations of large language models (CodeT5, CodeBERT) with machine learning classifiers (Decision Tree, Random Forest, Ensemble Learning, ML-KNN), demonstrating the superiority of our proposed approach. These findings highlight the potential of multi-label error classification to advance programming education, software engineering, and related research fields.
Published: 2025
Full Text: View/download PDF

152. Named entity recognition based on span and category enhancement for Chinese news

Author: QI Ruiyan, LI Longjie, XU Shicheng, MA Ligong, and MA Zhixin
Subjects: News named entity recognition, BERT, Span, Category enhancement, Word boundary information, Electronic computers. Computer science, QA75.5-76.95
Abstract: In the field of news, the identification of named entities is complicated by complex syntactic structures and long entity names, which pose challenges for determining entity boundaries and lead to interruptions in predicting long entities using sequence labeling methods. To address these challenges, a model named SpaCE (span and category enhancement for Chinese news named entity recognition) was proposed. This model was developed based on the bidirectional encoder representation pre-trained model with a Transformer structure (BERT) and was enhanced by span prediction and category description to improve recognition performance. During the encoding of news text information, category descriptions were incorporated to enhance semantic knowledge, and a span-based decoding method was adopted to address interruptions in predicting long entities. Furthermore, word boundary information was introduced through precise labeling, and the entity matching strategy was optimized, effectively reducing non-entity matching caused by span decoding. Compared to baseline models, SpaCE demonstrated improved performance on three datasets. Furthermore, SpaCE exhibits strong named entity recognition capabilities on disordered texts, indicating its robustness.
Published: 2024
Full Text: View/download PDF

153. Large language models to process, analyze, and synthesize biomedical texts: a scoping review

Author: Simona Emilova Doneva, Sijing Qin, Beate Sick, Tilia Ellendorff, Jean-Philippe Goldman, Gerold Schneider, and Benjamin Victor Ineichen
Subjects: Natural language processing, Bioinformatics, Biomedicine, Large language models, BERT, Evidence synthesis, Computational linguistics. Natural language processing, P98-98.5, Electronic computers. Computer science, QA75.5-76.95
Abstract: Abstract The advent of large language models (LLMs) such as BERT and, more recently, GPT, is transforming our approach of analyzing and understanding biomedical texts. To stay informed about the latest advancements in this area, there is a need for up-to-date summaries on the role of LLM in Natural Language Processing (NLP) of biomedical texts. Thus, this scoping review aims to provide a detailed overview of the current state of biomedical NLP research and its applications, with a special focus on the evolving role of LLMs. We conducted a systematic search of PubMed, EMBASE, and Google Scholar for studies and conference proceedings published from 2017 to December 19, 2023, that develop or utilize LLMs for NLP tasks in biomedicine. We evaluated the risk of bias in these studies using a 3-item checklist. From 13,823 references, we selected 199 publications and conference proceedings for our review. LLMs are being applied to a wide array of tasks in the biomedical field, including knowledge management, text mining, drug discovery, and evidence synthesis. Prominent among these tasks are text classification, relation extraction, and named entity recognition. Although BERT-based models remain prevalent, the use of GPT-based models has substantially increased since 2023. We conclude that, despite offering opportunities to manage the growing volume of biomedical data, LLMs also present challenges, particularly in clinical medicine and evidence synthesis, such as issues with transparency and privacy concerns.
Published: 2024
Full Text: View/download PDF

154. Contextual Semantic Embeddings Based on Transformer Models for Arabic Biomedical Questions Classification

Author: Ismail Ait Talghalit, Hamza Alami, and Said Ouatik El Alaoui
Subjects: arabic question classification, biomedical domain, natural language processing, transformers, bert, fine-tuning, question answering systems, sentence embedding., Technological innovations. Automation, HD45-45.2
Abstract: Arabic biomedical question classification (ABQC) is a challenging task due to various reasons including, the specialized jargon expressed in Arabic language, complex semantics of Arabic vocabulary and the lack of specific datasets and corpora. When representing questions, only a few studies deal with ABQC by taking into account the word context. In this work, we propose a classification model designed for Arabic biomedical questions. We build vector representations capturing the contextual and semantic information of Arabic biomedical text, which presents numerous challenges, such as the derivational morphology of Arabic language, the specialized terminology of biomedical terms and the lack of capitalization in text. Our representation adapts the extensive knowledge encoded in BERT (Bidirectional Encoder Representations from Transformers) and other transformer models, to address the aforementioned challenges. Several experiments have been conducted on a dedicated Arabic biomedical dataset namely: MAQA, with well-known transformer models including BERT, AraBERT, BioBERT, RoBERTa, and DistilBERT fine-tuned for the classification task. Obtained results show that our method achieves remarkable performance with an accuracy of 93.31% and an F1-score of 93.35%. Doi: 10.28991/HIJ-2024-05-04-011 Full Text: PDF
Published: 2024
Full Text: View/download PDF

155. Enhancing early depression detection with AI: a comparative use of NLP models

Author: Bakir Hadzic, Parvez Mohammed, Michael Danner, Julia Ohse, Yihong Zhang, Youssef Shiban, and Matthias Rätsch
Subjects: mental health, depression detection, llm, bert, gpt-4, Control engineering systems. Automatic machinery (General), TJ212-225
Abstract: One of the most underdiagnosed medical conditions worldwide is depression. It has been demonstrated that the current classical procedures for early detection of depression are insufficient, which emphasizes the importance of seeking a more efficient approach to overcome this challenge. One of the most promising opportunities is arising in the field of Artificial Intelligence as AI-based models could have the capacity to offer a fast, widely accessible, unbiased and efficient method to address this problem. In this paper, we compared three natural language processing models, namely, BERT, GPT-3.5 and GPT-4 on three different datasets. Our findings show that different levels of efficacy are shown by fine-tuned BERT, GPT-3.5, and GPT-4 in identifying depression from textual data. By comparing the models on the metrics such as accuracy, precision, and recall, our results have shown that GPT-4 outperforms both BERT and GPT-3.5 models, even without previous fine-tuning, showcasing its enormous potential to be utilized for automated depression detection on textual data. In the paper, we present newly introduced datasets, fine-tuning and model testing processes, while also addressing limitations and discussing further considerations for future research.
Published: 2024
Full Text: View/download PDF

156. A hybrid deep learning method for identifying topics in large-scale urbantext data: Benefits and trade-offs

Author: Lore, Madison, Harten, Julia G, and Boeing, Geoff
Subjects: BERT, LLM, AI, NLP, artificial intelligence, machine learning, text analysis, mixed methods, qualitative research, housing markets, rental housing, affordability, urban informatics, craigslist
Abstract: Large-scale text data from public sources, including social media or online platforms, can expand urban planners’ ability to monitor and analyze urban conditions in near real-time. To overcome scalability challenges of manual techniques for qualitative data analysis, researchers and practitioners have turned to computer-automated methods, such as natural language processing (NLP) and deep learning. However, the benefits, challenges, and trade-offs of these methods remain poorly understood. How much meaning can different NLP techniques capture and how do their results compare to traditional manual techniques? Drawing on 90,000 online rental listings in Los Angeles County, this study proposes and compares manual, semi-automated, and fully automated methods for identifying context-informed topics in unstructured, user-generated text data. We find that fully automated methods perform best with more-structured text, but struggle to separate topics in free-flow text and when handling nuanced language. Introducing a manual technique first on a small data set to train a semi-automated method, however, improves accuracy even as the structure of the text degrades. We argue that while fully automated NLP methods are attractive replacements for scaling manual techniques, leveraging the contextual understanding of human expertise alongside efficient computer-based methods like BERT models generates better accuracy without sacrificing scalability.
Published: 2024

157. exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies.

Author: Yang, Tiancheng, Sucholutsky, Ilia, Jen, Kuang-Yu, and Schonlau, Matthias
Subjects: BERT, Kidney, Language model, NLP, Natural language processing, Pathology, Renal, Transformer, Transplant
Abstract: BACKGROUND: Pathology reports contain key information about the patients diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts is often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they typically require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility. OBJECTIVE: To develop a language model-based neural information retrieval system that can be trained on small datasets and validate it by training it on renal transplant-pathology reports to extract relevant information for two predefined questions: (1) What kind of rejection does the patient show?; (2) What is the grade of interstitial fibrosis and tubular atrophy (IFTA)? METHODS: Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports and 1.5M words. Then, exKidneyBERT was developed by extending Clinical BERTs tokenizer with six technical keywords and repeating the pre-training procedure. This extended the models vocabulary. All three models were fine-tuned with information retrieval heads. RESULTS: The model with extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT in both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate. CONCLUSION: ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on specialized small domains does not necessarily improve performance. Extending the BERT tokenizers vocabulary library is essential for specialized domains to improve performance, especially when pre-training on small corpora.
Published: 2024

158. Electric Vehicle Sentiment Analysis Using Large Language Models

Author: Hemlata Sharma, Faiz Ud Din, and Bayode Ogunleye
Subjects: BERT, electric vehicles, large language models, machine learning, natural language processing, RoBERTa, Electronic computers. Computer science, QA75.5-76.95, Probabilities. Mathematical statistics, QA273-280
Abstract: Sentiment analysis is a technique used to understand the public’s opinion towards an event, product, or organization. For example, sentiment analysis can be used to understand positive or negative opinions or attitudes towards electric vehicle (EV) brands. This provides companies with valuable insight into the public’s opinion of their products and brands. In the field of natural language processing (NLP), transformer models have shown great performance compared to traditional machine learning algorithms. However, these models have not been explored extensively in the EV domain. EV companies are becoming significant competitors in the automotive industry and are projected to cover up to 30% of the United States light vehicle market by 2030 In this study, we present a comparative study of large language models (LLMs) including bidirectional encoder representations from transformers (BERT), robustly optimised BERT (RoBERTa), and a generalised autoregressive pre-training method (XLNet) using Lucid Motors and Tesla Motors YouTube datasets. Results evidenced that LLMs like BERT and her variants are off-the-shelf algorithms for sentiment analysis, specifically when fine-tuned. Furthermore, our findings present the need for domain adaptation whilst utilizing LLMs. Finally, the experimental results showed that RoBERTa achieved consistent performance across the EV datasets with an F1 score of at least 92%.
Published: 2024
Full Text: View/download PDF

159. Detecting Fake Reviews Using BERT and Sublinear_TF Methods on Hotel Reviews in the Lombok Tourism Area

Author: Zulpan Hadi, M. Zulpahmi, Zaenudin ., and Akmaludin Asrory
Subjects: fake reviews, bert, random forest, svm, sublinear_tf, Electronic computers. Computer science, QA75.5-76.95
Abstract: The number of visitors to Lombok, one of the famous tourist destinations in Indonesia, increased from 400,595 in 2020 to 1,376,295 in 2022. Although the government supports the hotel industry, fake reviews are a significant problem that can damage hotel reputations and mislead tourists. This study uses BERT and Sublinear_TF feature extraction techniques to analyze fake reviews from three main areas: Gili Trawangan, Senggigi, and Kuta. BERT detects fake reviews by understanding the context of words, while Sublinear_TF emphasizes more informative words by reducing the weight of irrelevant common words. The results showed that the more extensive and diverse dataset from Gili Trawangan had the best classification results. The combination of BERT and Random Forest achieved the highest accuracy of 0.84. Overall, BERT excels in Gili Trawangan with an accuracy of 0.79 for SVM and 0.84 for Random Forest. In contrast, smaller and more homogeneous datasets such as Senggigi and Kuta have lower accuracy. In addition, Sublinear_TF performed well on Gili Trawangan with an accuracy of 0.82 using SVM and 0.83 using Random Forest; however, its performance declined in Senggigi and Kuta. BERT and Sublinear_TF techniques are more effective on large and diverse datasets such as Gili Trawangan. Sublinear_TF is better for varied data but less effective on more homogeneous datasets, while BERT with Random Forest showed the highest accuracy due to its ability to capture broader language context. This suggests that the size and variety of the dataset highly influence the success of fake review classification techniques.
Published: 2024
Full Text: View/download PDF

160. Enhancing emerging technology discovery in nanomedicine by integrating innovative sentences using BERT and NLDA

Author: Wang Yifan, Liu Xiaoping, and Zhu Xiang-Li
Subjects: bibliometrics, nanomedicine, emerging technologies, bert, topic modeling, Information technology, T58.5-58.64, Electronic computers. Computer science, QA75.5-76.95
Abstract: Nanomedicine has significant potential to revolutionize biomedicine and healthcare through innovations in diagnostics, therapeutics, and regenerative medicine. This study aims to develop a novel framework that integrates advanced natural language processing, noise-free topic modeling, and multidimensional bibliometrics to systematically identify emerging nanomedicine technology topics from scientific literature.
Published: 2024
Full Text: View/download PDF

161. Unraveling the DNA methylation landscape in dog blood across breeds

Author: Miyuki Nakamura, Yuki Matsumoto, Keiji Yasuda, Masatoshi Nagata, Ryo Nakaki, Masahiro Okumura, and Jumpei Yamazaki
Subjects: DNA methylation, WGBS, Dog, Natural language processing, BERT, Biotechnology, TP248.13-248.65, Genetics, QH426-470
Abstract: Abstract Background DNA methylation is a covalent bond modification that is observed mainly at cytosine bases in the context of CG pairs. DNA methylation patterns reflect the status of individual tissues, such as cell composition, age, and the local environment, in mammals. Genetic factors also impact DNA methylation, and the genetic diversity among various dog breeds provides a valuable platform for exploring this topic. Compared to those in the human genome, studies on the profiling of methylation in the dog genome have been less comprehensive. Results Our study provides extensive profiling of DNA methylation in the whole blood of three dog breeds using whole-genome bisulfite sequencing. The difference in DNA methylation between breeds was moderate after removing CpGs overlapping with potential genetic variation. However, variance in methylation between individuals was common and often occurred in promoters and CpG islands (CGIs). Moreover, we adopted contextual awareness methodology to characterize DNA primary sequences using natural language processing (NLP). This method could be used to effectively separate unmethylated CGIs from highly methylated CGIs in the sequences that are identified by the conventional criteria. Conclusions This study presents a comprehensive DNA methylation landscape in the dog blood. Our observations reveal the similar methylation patterns across dog breeds, while CGI regions showed high variations in DNA methylation level between individuals. Our study also highlights the potential of NLP approach for analyzing low-complexity DNA sequences, such as CGIs.
Published: 2024
Full Text: View/download PDF

162. Exploring Transformer Models and Domain Adaptation for Detecting Opinion Spam in Reviews

Author: Christopher G Harris
Subjects: transfer learning, bert, roberta, distilbert, opinion spam detection, nlp, llm, transformer models, Telecommunication, TK5101-6720
Abstract: As online reviews play a crucial role in purchasing decisions, businesses are increasingly incentivized to generate positive reviews, sometimes resorting to fake reviews or opinion spam. Detecting opinion spam requires well-trained models, but obtaining annotated training data in the same domain (e.g., hotels) can be challenging. Transfer learning addresses this by leveraging training data from a similar domain (e.g., restaurants). This paper examines three popular transformer models—BERT, RoBERTa, and DistilBERT—to evaluate how training data from different domains, including imbalanced datasets, impacts Transformer model performance. Notably, our evaluation of hotel opinion spam detection achieved an AUC of 0.927 using RoBERTa trained on YelpChi restaurant data.
Published: 2024
Full Text: View/download PDF

163. A data-driven multi-perspective approach to cybersecurity knowledge discovery through topic modelling

Author: Fahad Alqurashi and Istiak Ahmad
Subjects: Cyber security, Knowledge discovery, Semantic analysis, Bert, Topic modelling, Natural language processing, Engineering (General). Civil engineering (General), TA1-2040
Abstract: Cybersecurity is crucial for protecting the privacy of digital systems, maintaining economic stability, and ensuring national security. This study presents a comprehensive approach to cybersecurity knowledge discovery through topic modelling, using a multi-perspective analysis of academic and industry sources. The datasets include 15,751 articles from the Web of Science (WoS) database and 5,831 articles from Security Magazine, spanning from 2011 to 2023. We employed BERTopic for topic modelling, UMAP for dimensionality reduction, and HDBSCAN clustering algorithm for grouping and analysing distinct article clusters to uncover latent topics, enhancing the understanding of the evolving cybersecurity landscape. This study found 12 micro-clusters and three macro-clusters, namely technology, smart city and education, from the WoS database and 12 more micro-clusters and four macro-clusters, including organization, public security, governance, and education, from magazines. This study reveals key cybersecurity research and practice trends, such as the increasing focus on malware, ransomware, and cyber-attack mitigation. Additionally, temporal analysis indicates a significant rise in cybersecurity interest around 2020, followed by a diversification of topics. The results highlight the importance of integrating diverse data sources to capture a holistic view of cybersecurity developments. Future work will aim to refine the clustering algorithms to further improve topic extraction and analysis and expand the datasets to include more diverse sources and perspectives. This approach helps discover current cybersecurity trends and provides a foundation for more targeted and effective cybersecurity strategies.
Published: 2024
Full Text: View/download PDF

164. Temporal Relational Graph Convolutional Network Approach to Financial Performance Prediction

Author: Brindha Priyadarshini Jeyaraman, Bing Tian Dai, and Yuan Fang
Subjects: knowledge graph, finance, BERT, tweets, text, LSTM, Computer engineering. Computer hardware, TK7885-7895
Abstract: Accurately predicting financial entity performance remains a challenge due to the dynamic nature of financial markets and vast unstructured textual data. Financial knowledge graphs (FKGs) offer a structured representation for tackling this problem by representing complex financial relationships and concepts. However, constructing a comprehensive and accurate financial knowledge graph that captures the temporal dynamics of financial entities is non-trivial. We introduce FintechKG, a comprehensive financial knowledge graph developed through a three-dimensional information extraction process that incorporates commercial entities and temporal dimensions and uses a financial concept taxonomy that ensures financial domain entity and relationship extraction. We propose a temporal and relational graph convolutional network (RGCN)-based representation for FintechKG data across multiple timesteps, which captures temporal dependencies. This representation is then combined with FinBERT embeddings through a projection layer, enabling a richer feature space. To demonstrate the efficacy of FintechKG, we evaluate its performance using the example task of financial performance prediction. A logistic regression model uses these combined features and social media embeddings for performance prediction. We classify whether the revenue will increase or decrease. This approach demonstrates the effectiveness of FintechKG combined with textual information for accurate financial forecasting. Our work contributes a systematic FKG construction method and a framework that utilizes both relational and textual embeddings for improved financial performance prediction.
Published: 2024
Full Text: View/download PDF

165. User identification across online social networks based on gated multi-feature extraction

Author: Yan Mao and Cuicui Ye
Subjects: User identification, Across social networks, Bert, Gated mechanism, User display name, Engineering (General). Civil engineering (General), TA1-2040
Abstract: User identification is an essential technical support for downstream tasks such as recommendation systems, information retrieval, and collaborative filtering. Computing the similarity between user display names through classifiers is an effective solution for user identification across social networks. However, there are two problems with existing methods. Applying expert domain knowledge to extract handcrafted features of display names ignores the semantic information, resulting in poor performance of these methods. Selecting influential display names, and handcrafted features in user identification problems are also one of the difficulties. To solve these two problems, we propose a method based on the multi-feature fusion of display names using gated units. First, we extract the deep semantic features of display names through the BERT pre-trained multi-language model. Then, the gated mechanism is applied to select the handcrafted features we extracted to retain the essential features. Then, the adaptive factors are used to fuse handcrafted features and deep features to obtain user identification results across social networks. Finally, the efficiency of our model is verified on three constructed real-world multilingual display names datasets across multiple online social networks and compared with existing state-of-the-art methods. Experimental results show that the proposed algorithm outperforms the compared algorithms.
Published: 2024
Full Text: View/download PDF

166. Biomedical relation extraction method based on ensemble learning and attention mechanism

Author: Yaxun Jia, Haoyang Wang, Zhu Yuan, Lian Zhu, and Zuo-lin Xiang
Subjects: Biomedical relation extraction, Deep learning, BERT, Stacking, Attention mechanism, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Relation extraction (RE) plays a crucial role in biomedical research as it is essential for uncovering complex semantic relationships between entities in textual data. Given the significance of RE in biomedical informatics and the increasing volume of literature, there is an urgent need for advanced computational models capable of accurately and efficiently extracting these relationships on a large scale. Results This paper proposes a novel approach, SARE, combining ensemble learning Stacking and attention mechanisms to enhance the performance of biomedical relation extraction. By leveraging multiple pre-trained models, SARE demonstrates improved adaptability and robustness across diverse domains. The attention mechanisms enable the model to capture and utilize key information in the text more accurately. SARE achieved performance improvements of 4.8, 8.7, and 0.8 percentage points on the PPI, DDI, and ChemProt datasets, respectively, compared to the original BERT variant and the domain-specific PubMedBERT model. Conclusions SARE offers a promising solution for improving the accuracy and efficiency of relation extraction tasks in biomedical research, facilitating advancements in biomedical informatics. The results suggest that combining ensemble learning with attention mechanisms is effective for extracting complex relationships from biomedical texts. Our code and data are publicly available at: https://github.com/GS233/Biomedical .
Published: 2024
Full Text: View/download PDF

167. Enhancing Literature Review Efficiency: A Case Study on Using Fine-Tuned BERT for Classifying Focused Ultrasound-Related Articles

Author: Reanna K. Panagides, Sean H. Fu, Skye H. Jung, Abhishek Singh, Rose T. Eluvathingal Muttikkal, R. Michael Broad, Timothy D. Meakem, and Rick A. Hamilton
Subjects: focused ultrasound, machine learning, text classification, BERT, Electronic computers. Computer science, QA75.5-76.95
Abstract: Over the past decade, focused ultrasound (FUS) has emerged as a promising therapeutic modality for various medical conditions. However, the exponential growth in the published literature on FUS therapies has made the literature review process increasingly time-consuming, inefficient, and error-prone. Machine learning approaches offer a promising solution to address these challenges. Therefore, the purpose of our study is to (1) explore and compare machine learning techniques for the text classification of scientific abstracts, and (2) integrate these machine learning techniques into the conventional literature review process. A classified dataset of 3588 scientific abstracts related and unrelated to FUS therapies sourced from the PubMed database was used to train various traditional machine learning and deep learning models. The fine-tuned Bio-ClinicalBERT (Bidirectional Encoder Representations from Transformers) model, which we named FusBERT, had comparatively optimal performance metrics with an accuracy of 0.91, a precision of 0.85, a recall of 0.99, and an F1 of 0.91. FusBERT was then successfully integrated into the literature review process. Ultimately, the integration of this model into the literature review pipeline will reduce the number of irrelevant manuscripts that the clinical team must screen, facilitating efficient access to emerging findings in the field.
Published: 2024
Full Text: View/download PDF

168. IPerFEX-2023: Indonesian personal financial entity extraction using indoBERT-BiGRU-CRF model

Author: Emmanuel Dave and Andry Chowanda
Subjects: BERT, BiGRU, BiLSTM, CRF Indonesian NER, Indonesian personal financial entity extraction, Computer engineering. Computer hardware, TK7885-7895, Information technology, T58.5-58.64, Electronic computers. Computer science, QA75.5-76.95
Abstract: Abstract There is minimal research focusing on applications of Indonesian named entity recognition (NER) in a specific domain. This study proposes an Indonesian personal financial entity extraction task that can be utilized in a financial assistant chatbot system to interpret user’s financial situation for personalization. Due to the simplicity that a chatbot has, it can promote financial management practices to youth as early as possible in their career. However, the challenge in financial NER is numerical entity extraction that relies heavily on contextual information and suffers the out-of-vocabulary (OOV) problem. Therefore, to extract 15 personal financial entities in daily Indonesian discussions (expense-type, expense-amount, income-type, income-amount, asset-type, asset-amount, saving-type, saving-amount, liability-type, liability-amount, family, time, financial-goal, age, and occupation), this research proposes a dataset, IPerFEX-2023, trained using the Bidirectional Gated Recurrent Unit BiGRU and Conditional Random Field (CRF) with Indonesian Bidirectional Encoder Representations from Transformers (IndoBERT) pre-trained model for feature embeddings (IndoBERT-BiGRU-CRF). It is compared with the corresponding Bidirectional Long Short-Term Memory (BiLSTM) (IndoBERT-BiLSTM-CRF) as baseline. Not only the IndoBERT-BiGRU-CRF model achieves the best performance with a 0.73 F1-score, but it is also 14% faster on average compared to the corresponding baseline model due to its simpler unit structure. This paper also discusses future directions covering model enhancement strategy based on the error analysis result and complementary tasks needed to complete personalization
Published: 2024
Full Text: View/download PDF

169. An approach to automatic answering for English reading comprehension tests

Author: Phat Tien Bui, Hieu Chi Tran, and Thanh Huu Duong
Subjects: bert, multiple choice, masked language model (mlm), question answering (qa), tokenization, transformer, word embeddings, Biotechnology, TP248.13-248.65
Abstract: This study focuses on the reading comprehension problem with multiple-choice answers, using the BERT model to achieve the highest performance. The ultimate goal is to create a solution to help solve reading comprehension problems without any reasoning or knowledge, suitable for the level of students in grades six and seven. The model will solve factoid questions from a given text. Our research topic will use a deep learning model-based approach to create a model that automatically answers the English reading comprehension question. We obtain promising results to give an accuracy of 78 percent.
Published: 2024
Full Text: View/download PDF

170. TTG-Text: A Graph-Based Text Representation Framework Enhanced by Typical Testors for Improved Classification.

Author: Sánchez-Antonio, Carlos, Valdez-Rodríguez, José E., and Calvo, Hiram
Subjects: *NATURAL language processing, *FEATURE selection, *ALGORITHMS, *CLASSIFICATION, *SYMBOLIC computation
Abstract: Recent advancements in graph-based text representation, particularly with embedding models and transformers such as BERT, have shown significant potential for enhancing natural language processing (NLP) tasks. However, challenges related to data sparsity and limited interpretability remain, especially when working with small or imbalanced datasets. This paper introduces TTG-Text, a novel framework that strengthens graph-based text representation by integrating typical testors—a symbolic feature selection technique that refines feature importance while reducing dimensionality. Unlike traditional TF-IDF weighting, TTG-Text leverages typical testors to enhance feature relevance within text graphs, resulting in improved model interpretability and performance, particularly for smaller datasets. Our evaluation on a text classification task using a graph convolutional network (GCN) demonstrates that TTG-Text achieves a 95% accuracy rate, surpassing conventional methods and BERT with fewer required training epochs. By combining symbolic algorithms with graph-based models, this hybrid approach offers a more interpretable, efficient, and high-performing solution for complex NLP tasks. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

171. The Impact of Policy Thematic Differences on Industrial Development: An Empirical Study Based on China's Electric Vehicle Industry Policies at the Central and Local Levels.

Author: Liu, Zizheng and Xie, Tao
Subjects: *INDUSTRIALIZATION, *ELECTRIC vehicle industry, *INDUSTRIAL policy, *MARKET penetration, *DEEP learning
Abstract: Since the 21st century, the electric vehicle (EV) industry has become a key driver of global transformation, with increasing emphasis on the study and evaluation of industrial policies across nations. However, traditional frameworks struggle to capture the dynamic interactions between policies at different government levels or effectively analyze large volumes of policy texts. This study adopted a central–local policy interaction perspective, employing the BERT deep semantic learning model and a threshold regression model to investigate the impact of policy differences on industrial development. The findings reveal an inverted U-shaped relationship between central–local policy thematic similarity and EV market penetration, with the optimal similarity shifting as policy volume increases. This suggests the necessity of dynamically allocating central and local policies to balance national consistency with regional flexibility and promote synergy among regions. Recommendations include optimizing multi-level coordination, maintaining a balance between uniformity and specialization, strengthening policy error tolerance mechanisms, and fostering innovation. By integrating text analysis with econometric modeling, this study offers a novel framework aligned with China's political system, providing insights into central–local policy interactions and serving as a reference for other countries seeking to refine their industrial strategies. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

172. A data-driven multi-perspective approach to cybersecurity knowledge discovery through topic modelling.

Author: Alqurashi, Fahad and Ahmad, Istiak
Subjects: NATURAL language processing, DATABASES, DATA privacy, SMART cities, INTERNET security, RANSOMWARE
Abstract: Cybersecurity is crucial for protecting the privacy of digital systems, maintaining economic stability, and ensuring national security. This study presents a comprehensive approach to cybersecurity knowledge discovery through topic modelling, using a multi-perspective analysis of academic and industry sources. The datasets include 15,751 articles from the Web of Science (WoS) database and 5,831 articles from Security Magazine, spanning from 2011 to 2023. We employed BERTopic for topic modelling, UMAP for dimensionality reduction, and HDBSCAN clustering algorithm for grouping and analysing distinct article clusters to uncover latent topics, enhancing the understanding of the evolving cybersecurity landscape. This study found 12 micro-clusters and three macro-clusters, namely technology, smart city and education, from the WoS database and 12 more micro-clusters and four macro-clusters, including organization, public security, governance, and education, from magazines. This study reveals key cybersecurity research and practice trends, such as the increasing focus on malware, ransomware, and cyber-attack mitigation. Additionally, temporal analysis indicates a significant rise in cybersecurity interest around 2020, followed by a diversification of topics. The results highlight the importance of integrating diverse data sources to capture a holistic view of cybersecurity developments. Future work will aim to refine the clustering algorithms to further improve topic extraction and analysis and expand the datasets to include more diverse sources and perspectives. This approach helps discover current cybersecurity trends and provides a foundation for more targeted and effective cybersecurity strategies. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

173. Unraveling the DNA methylation landscape in dog blood across breeds.

Author: Nakamura, Miyuki, Matsumoto, Yuki, Yasuda, Keiji, Nagata, Masatoshi, Nakaki, Ryo, Okumura, Masahiro, and Yamazaki, Jumpei
Subjects: NATURAL language processing, WHOLE genome sequencing, DNA methylation, GENETIC variation, HUMAN genome, DOG breeds
Abstract: Background: DNA methylation is a covalent bond modification that is observed mainly at cytosine bases in the context of CG pairs. DNA methylation patterns reflect the status of individual tissues, such as cell composition, age, and the local environment, in mammals. Genetic factors also impact DNA methylation, and the genetic diversity among various dog breeds provides a valuable platform for exploring this topic. Compared to those in the human genome, studies on the profiling of methylation in the dog genome have been less comprehensive. Results: Our study provides extensive profiling of DNA methylation in the whole blood of three dog breeds using whole-genome bisulfite sequencing. The difference in DNA methylation between breeds was moderate after removing CpGs overlapping with potential genetic variation. However, variance in methylation between individuals was common and often occurred in promoters and CpG islands (CGIs). Moreover, we adopted contextual awareness methodology to characterize DNA primary sequences using natural language processing (NLP). This method could be used to effectively separate unmethylated CGIs from highly methylated CGIs in the sequences that are identified by the conventional criteria. Conclusions: This study presents a comprehensive DNA methylation landscape in the dog blood. Our observations reveal the similar methylation patterns across dog breeds, while CGI regions showed high variations in DNA methylation level between individuals. Our study also highlights the potential of NLP approach for analyzing low-complexity DNA sequences, such as CGIs. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

174. Rethinking of BERT sentence embedding for text classification.

Author: Galal, Omar, Abdel-Gawad, Ahmed H., and Farouk, Mona
Subjects: *LANGUAGE models, *SENTIMENT analysis, *TASK analysis, *SARCASM, *CLASSIFICATION
Abstract: Text classification is a fundamental task in NLP that is used in several real-life tasks and applications. Large pre-trained language models such as BERT achieve state-of-the-art performance in several NLP tasks including text classification tasks. Although BERT boosts text classification performance, the common way of using it for classification lacks many aspects of its advantages. This work rethinks the way of using BERT final layer and hidden layers embeddings by proposing different aggregation architectures for text classification tasks such as sentiment analysis and sarcasm detection. This research also proposes different approaches for using BERT as a feature extractor without fine-tuning whose performance surpasses its fine-tuning counterpart. It also proposes promising multi-task learning aggregation architectures to improve the performance of the related classification problems. The experiments of the different architectures show that freezing BERT can outperform fine-tuning it for sentiment analysis. The experiments also show that multi-task learning while freezing BERT boosts the performance of yet hard tasks such as sarcasm detection. The best-performing models achieved new state-of-the-art performance on the ArSarcasm-v2 dataset for Arabic sarcasm detection and sentiment analysis. For multi-task learning and freezing BERT, a new SOTA F1-score of 64.41 was achieved for the sarcasm detection with a 3.47% improvement and near SOTA FPN of 75.78 for the sentiment classification. For single-task learning, a new SOTA FPN of 75.26 was achieved for the sentiment with a 1.81% improvement. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

175. Introduction to Neural Transfer Learning With Transformers for Social Science Text Analysis.

Author: Wankmüller, Sandra
Subjects: *MACHINE learning, *NATURAL language processing, *DEEP learning, *TRANSFORMER models, *SOCIAL science research, *SUPERVISED learning
Abstract: Transformer-based models for transfer learning have the potential to achieve high prediction accuracies on text-based supervised learning tasks with relatively few training data instances. These models are thus likely to benefit social scientists that seek to have as accurate as possible text-based measures, but only have limited resources for annotating training data. To enable social scientists to leverage these potential benefits for their research, this article explains how these methods work, why they might be advantageous, and what their limitations are. Additionally, three Transformer-based models for transfer learning, BERT, RoBERTa, and the Longformer, are compared to conventional machine learning algorithms on three applications. Across all evaluated tasks, textual styles, and training data set sizes, the conventional models are consistently outperformed by transfer learning with Transformers, thereby demonstrating the benefits these models can bring to text-based social science research. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

176. Exploring ensemble optimized voting and stacking classifiers through Cross-validation for early detection of suicidal ideation.

Author: Shukla, Shiv Shankar Prasad and Singh, Maheshwari Prasad
Subjects: *MACHINE learning, *NATURAL language processing, *SUICIDAL ideation, *LONG short-term memory, *RANDOM forest algorithms, *DEEP learning
Abstract: Detecting behavioral changes associated with suicidal ideation on social media is essential yet complex. While machine learning and deep learning hold promise in this regard, current studies often lack generalizability due to single dataset reliance. Traditional embedding techniques struggle with semantic analysis,leading to challenges in achieving high accuracy models and conventional validation methods have data drift limitations. To address these challenges, this study proposes a novel evaluation approach using natural language processing across diverse platforms like Twitter and Reddit. By integrating BERT embedding, adept at handling semantic nuances, with an optimized Stacked Classifier combining different base classifiers and XGBoost as the meta-classifier, the model excels in swiftly detecting signs of suicidal ideation compared to the Voting Classifier, i.e., the combination of Decision Tree, Random Forest, Gradient Boost and XGBoost and several machine learning models. Additionally, the study explores advanced embedding techniques like MUSE and LLM, and deep learning models including Bi-LSTM, Bi-GRU, and Text-CNN for comparison.This ensemble approach aims to create a model that is not only interpretable but also robust, reducing computational complexity and enhancing resilience against noisy data—common challenges faced in text classification tasks. Through K-fold validation, which involves partitioning the dataset into k equal-sized subsets or "folds" and training the model k times, using k-1 folds for training and one-fold for testing each time, the proposed model achieves impressive accuracy rates of 97% on Reddit and 96% on Twitter datasets, underscoring its effectiveness in identifying suicidal ideation across social media platforms. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

177. Exploring transformer models for sentiment classification: A comparison of BERT, RoBERTa, ALBERT, DistilBERT, and XLNet.

Author: Areshey, Ali and Mathkour, Hassan
Subjects: *LANGUAGE models, *TRANSFORMER models, *STATISTICAL learning, *SENTIMENT analysis, *MACHINE learning
Abstract: Transfer learning models have proven superior to classical machine learning approaches in various text classification tasks, such as sentiment analysis, question answering, news categorization, and natural language inference. Recently, these models have shown exceptional results in natural language understanding (NLU). Advanced attention‐based language models like BERT and XLNet excel at handling complex tasks across diverse contexts. However, they encounter difficulties when applied to specific domains. Platforms like Facebook, characterized by continually evolving casual and sophisticated language, demand meticulous context analysis even from human users. The literature has proposed numerous solutions using statistical and machine learning techniques to predict the sentiment (positive or negative) of online customer reviews, but most of them rely on various business, review, and reviewer features, which leads to generalizability issues. Furthermore, there have been very few studies investigating the effectiveness of state‐of‐the‐art pre‐trained language models for sentiment classification in reviews. Therefore, this study aims to assess the effectiveness of BERT, RoBERTa, ALBERT, DistilBERT, and XLNet in sentiment classification using the Yelp reviews dataset. The models were fine‐tuned, and the results obtained with the same hyperparameters are as follows: 98.30 for RoBERTa, 98.20 for XLNet, 97.40 for BERT, 97.20 for ALBERT, and 96.00 for DistilBERT. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

178. A smart video analytical framework for sarcasm detection using novel adaptive fusion network and SarcasNet-99 model.

Author: Murthy, Jamuna S. and Siddesh, G. M.
Subjects: *ARTIFICIAL neural networks, *FACIAL expression, *SELF-expression, *DEEP learning, *VIDEO excerpts
Abstract: Sarcasm is often related to something that has created a mass confusion among the general uninformed public. It is always associated with a mockery tone or trenchancy facial expression or weird language. Existing literatures that are profound in the field of sarcasm detection mainly focused on text-based input with sarcastic comments or facial expression-based analysis, i.e., image input. But both text and image input are not sufficient to analyze the underlying sarcasm behind the scene. This kind of analysis can also be misleading sometimes as the emotional expression can change with social circumstances (i.e., audio tone) over time. Hence to address these challenges, "A Smart Video Analytical framework for Sarcasm Detection using Deep Learning" is introduced where sarcasm detection is done by considering video modality. Proposed model extracts three important features from the video, i.e., text using proposed Enhanced-BERT, image using ImageNet and audio using Librosa. After extraction, each modality is addressed individually and is finally fused using proposed adaptive early fusion approach. The final task prediction of classification is done using novel deep neural network called "SarcasNet-99" to detect sarcasm in video over distributed framework called Apache Storm. TedX and GIF Reply datasets are used for model training and testing with around 10,000 + video clips. When compared against existing state-of-the-art techniques such as AlexNet, DenseNet, SqueezeNet and ResNet, the proposed model predicted accuracy 99.005% with LeakyReLU activation function. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

179. Approaches based on language models for aspect extraction for sentiment analysis in the Portuguese language.

Author: Neto, José Carlos Ferreira, Pereira, Denilson Alves, Barbosa, Bruno Henrique Groenner, and Ferreira, Danton Diego
Subjects: *LANGUAGE models, *NATURAL language processing, *SENTIMENT analysis, *EXTRACTION techniques, *PORTUGUESE language
Abstract: This work addresses the gap in aspect extraction techniques for Portuguese by adapting methods originally designed for English. It focuses on TV devices and literary reviews in the TV and ReLi datasets. For this, models based on the BERT architecture were employed, including pre-trained general domain (BERTimbau) and specific domain models (BERTtv and BERTreli). Also, this paper contributes with a novel double embedding technique that merges these models. We further explored the potential of large language models (LLMs) with a Portuguese-trained LLaMa variant, Cabrita. Efficient fine-tuning techniques such as LoRA (low-rank adaptation) for BERTimbau and QLoRA (quantized low-rank adaptation) for Cabrita were applied to optimize resource demands. The BERTimbau model, adjusted with LoRA, achieved the highest F1 scores (0.846 for TV and 0.615 for ReLi), while Cabrita showed lower performance (0.68 for TV and 0.46 for ReLi). This study underscores the potential of adapting and optimizing existing techniques for aspect extraction in Portuguese, marking a significant advancement in the field [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

180. IBNNER: A Biaffine Model-Based Chinese Nested Named Entity Recognition Method for Medical Texts.

Author: Ping Lu, Chongkun Shao, Shan Deng, Jiaying Zeng, and Kaibiao Lin
Subjects: CONVOLUTIONAL neural networks, CLINICAL decision support systems, LANGUAGE models, DATA mining, MEDICAL terminology
Abstract: Medical named entity recognition (MNER) enables the automatic identification of key entities in the medical domain, facilitating the extraction of useful information from many medical texts. Consequently, MNER holds significant application value in medical information processing and clinical decision support. Existing approaches encounter challenges such as complex and diverse medical terminologies, contextual ambiguity, and the presence of nested named entities, all of which are pivotal for accurate information extraction. This paper proposes an Improved Biaffine-based Nested Named Entity Recognition Model (IBNNER) to address the above challenges. Firstly, IBNNER utilizes BERT pre-training to obtain token embeddings representations from nested medical datasets, employing two feed-forward neural network layers to learn the starting and ending information of entity fragments; Additionally, introducing a biaffine mapping technique with position Embeddings to generate a score matrix from the two hidden layers; Finally, IBNNER employ a Convolutional neural network (CNN) to enhance the spatial distribution of the score matrix and employ a suitable loss function for training in multi-classification tasks. Experimental results show that IBNNER achieves a recall of 69.12% on the CMeEE dataset, surpassing the BERT model by 5.04%. Moreover, when compared to the BiLSTM model on the CLUENER2020 dataset, IBNNER exhibits improvements of 7.37%, 10.51%, and 8.95% in terms of precision, recall, and F1-score, respectively. [ABSTRACT FROM AUTHOR]
Published: 2024

181. Leveraging Large Language Models for Enhancing Literature-Based Discovery.

Author: Taleb, Ikbal, Navaz, Alramzana Nujum, and Serhani, Mohamed Adel
Subjects: LANGUAGE models, GENERATIVE artificial intelligence, DRUG discovery, DATA scrubbing, DATA mining, GARLIC
Abstract: The exponential growth of biomedical literature necessitates advanced methods for Literature-Based Discovery (LBD) to uncover hidden, meaningful relationships and generate novel hypotheses. This research integrates Large Language Models (LLMs), particularly transformer-based models, to enhance LBD processes. Leveraging LLMs' capabilities in natural language understanding, information extraction, and hypothesis generation, we propose a framework that improves the scalability and precision of traditional LBD methods. Our approach integrates LLMs with semantic enhancement tools, continuous learning, domain-specific fine-tuning, and robust data cleansing processes, enabling automated analysis of vast text and identification of subtle patterns. Empirical validations, including scenarios on the effects of garlic on blood pressure and nutritional supplements on health outcomes, demonstrate the effectiveness of our LLM-based LBD framework in generating testable hypotheses. This research advances LBD methodologies, fosters interdisciplinary research, and accelerates discovery in the biomedical domain. Additionally, we discuss the potential of LLMs in drug discovery, highlighting their ability to extract and present key information from the literature. Detailed comparisons with traditional methods, including Swanson's ABC model, highlight our approach's advantages. This comprehensive approach opens new avenues for knowledge discovery and has the potential to revolutionize research practices. Future work will refine LLM techniques, explore Retrieval-Augmented Generation (RAG), and expand the framework to other domains, with a focus on dehallucination. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

182. Preservation of emotional context in tweet embeddings on social networking sites.

Author: Maruyama, Osamu, Yoshinaga, Asato, and Sawai, Ken-ichi
Abstract: In communication, emotional information is crucial, yet its preservation in tweet embeddings remains a challenge. This study aims to address this gap by exploring three distinct methods for generating embedding vectors of tweets: word2vec models, pre-trained BERT models, and fine-tuned BERT models. We conducted an analysis to assess the degree to which emotional information is conserved in the resulting embedding vectors. Our findings indicate that the fine-tuned BERT model exhibits a higher level of preservation of emotional information compared to other methods. These results underscore the importance of utilizing advanced natural language processing techniques for preserving emotional context in text data, with potential implications for enhancing sentiment analysis and understanding human communication in social media contexts. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

183. ConBGAT: a novel model combining convolutional neural networks, transformer and graph attention network for information extraction from scanned image.

Author: Ho Vo Hoang, Duy, Vo Quoc, Huy, and Hung, Bui Thanh
Subjects: LANGUAGE models, OPTICAL character recognition, DATA mining, CONVOLUTIONAL neural networks, DEEP learning
Abstract: Extracting information from scanned images is a critical task with far-reaching practical implications. Traditional methods often fall short by inadequately leveraging both image and text features, leading to less accurate and efficient outcomes. In this study, we introduce ConBGAT, a cutting-edge model that seamlessly integrates convolutional neural networks (CNNs), Transformers, and graph attention networks to address these shortcomings. Our approach constructs detailed graphs from text regions within images, utilizing advanced Optical Character Recognition to accurately detect and interpret characters. By combining superior extracted features of CNNs for image and Distilled Bidirectional Encoder Representations from Transformers (DistilBERT) for text, our model achieves a comprehensive and efficient data representation. Rigorous testing on real-world datasets shows that ConBGAT significantly outperforms existing methods, demonstrating its superior capability across multiple evaluation metrics. This advancement not only enhances accuracy but also sets a new benchmark for information extraction in scanned image. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

184. Unified extractive-abstractive summarization: a hybrid approach utilizing BERT and transformer models for enhanced document summarization.

Author: S., Divya, N., Sripriya, Andrew, J., and Mazzara, Manuel
Subjects: AUTOMATIC summarization, TEXT summarization, TRANSFORMER models, SYMMETRIC functions, UNITARY transformations
Abstract: With the exponential proliferation of digital documents, there arises a pressing need for automated document summarization (ADS). Summarization, a compression technique, condenses a source document into concise sentences that encapsulate its salient information for summary generation. A primary challenge lies in crafting a dependable summary, contingent upon both extracted features and human-established parameters. This article introduces an intelligent methodology that seamlessly integrates extractive and abstractive techniques to ensure heightened relevance between the input document and its summary. Initially, input sentences undergo transformation into representations utilizing BERT, subsequently transposed into a symmetric matrix based on their similarity. Semantically congruent sentences are then extracted from this matrix to construct an extractive summary. The transformer model integrates an objective function highly symmetric and invariant under unitary transformation for language generation. This model refines the extracted informative sentences and generates an abstractive summary akin to manually crafted summaries. Employing this hybrid summarization technique on the CNN/DailyMail dataset and DUC2004, we evaluate its efficacy using ROUGE metrics. Results demonstrate the superiority of our proposed technique over conventional summarization methods. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

185. Development of a Knowledge Base for Construction Risk Assessments Using BERT and Graph Models.

Author: Lee, Wonjong and Lee, Seulki
Subjects: DATA warehousing, INFORMATION resources management, KNOWLEDGE graphs, RISK assessment, SYSTEM safety, KNOWLEDGE base, NATURAL language processing
Abstract: As a significant percentage of disasters and fatal accidents still occur in the construction sector, it is legally obligatory to conduct workplace risk assessments to avoid accidents and enhance safety. Identifying harmful and hazardous elements is crucial to discern the distinctive characteristics of potential accidents. However, conventional risk-assessment approaches, which rely on the skills and experience of safety managers, may overlook important factors, leading to inconsistencies in the procedures employed across different sites. Such unstructured safety knowledge reduces accessibility and utility, increases reliance on individual skills, and renders information management inefficient. Recently, the focus has shifted from efficient data storage to obtaining valuable knowledge tailored to specific use-cases. Knowledge-graph-based systems integrate and manage the relationships between knowledge entities, thereby enhancing the development of knowledge bases. Research on automatically extracting and managing predefined knowledge from various forms of data through natural language processing (NLP) is ongoing. This study proposes a novel method that uses NLP and graph models to automatically extract predefined knowledge from unstructured construction data and build an entity-relationship-based risk-assessment knowledge base. We developed an entity-name recognition and keyword-extraction engine that defines the core knowledge related to construction safety and risk assessments. This engine can automatically extract predefined knowledge from unstructured data by learning from NLP data. The extracted risk-assessment knowledsge was used to create a knowledge base, and its efficiency and effectiveness were validated through comparisons with existing methods. The results of this study are significant because they lay the foundation for an automatic knowledge-management system for construction safety and risk assessment, offering both practical and academic contributions to the field of construction safety. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

186. Research on Medical Text Parsing Method Based on BiGRU-BiLSTM Multi-Task Learning.

Author: Fan, Yunli, Kang, Ruiqing, Huang, Wenjie, and Li, Lingyan
Subjects: NATURAL language processing, TASK performance, THERAPEUTICS, PARSING (Computer grammar), INTENTION, MEDICAL research
Abstract: With the development of technology, the popularity of online medical treatment is becoming more and more widespread. However, the accuracy and credibility of online medical treatment are affected by model design and semantic understanding. In particular, there are still some problems in the accurate understanding of complex structured texts, which affects the accuracy of judging users' intentions and needs. Therefore, this paper proposes a new method for medical text parsing, which realizes core tasks such as named entity recognition, intention recognition, and slot filling through a multi-task learning framework; uses BERT to obtain contextual semantic information; and combines BiGRU and BiLSTM neural networks, and uses CRF to realize sequence annotation and DPCNN to realize classification prediction. Thus, the task of entity recognition and intention recognition can be accomplished. On this basis, this paper builds a multi-task learning model based on BiGRU-BiLSTM, and uses CBLUE and CMID databases to verify the method. The verification results show that the accuracy of named entity recognition and intention recognition reaches 86% and 89%, respectively, which improves the performance of various tasks. The ability of the model to process complex text is enhanced. This method can improve the text generalization ability and improve the accuracy of online medical intelligent dialogue when it is used to analyze medical texts. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

187. Enhancing Misinformation Detection in Spanish Language with Deep Learning: BERT and RoBERTa Transformer Models.

Author: Blanco-Fernández, Yolanda, Otero-Vizoso, Javier, Gil-Solla, Alberto, and García-Duque, Jorge
Subjects: LANGUAGE models, TRANSFORMER models, SPANISH language, FAKE news, POLITICIANS
Abstract: This paper presents an approach to identifying political fake news in Spanish using Transformer architectures. Current methodologies often overlook political news due to the lack of quality datasets, especially in Spanish. To address this, we created a synthetic dataset of 57,231 Spanish political news articles, gathered via automated web scraping and enhanced with generative large language models. This dataset is used for fine-tuning and benchmarking Transformer models like BERT and RoBERTa for fake news detection. Our fine-tuned models showed outstanding performance on this dataset, with accuracy ranging from 97.4% to 98.6%. However, testing with a smaller, independent hand-curated dataset, including statements from political leaders during Spain's July 2023 electoral debates, revealed a performance drop to 71%. Although this suggests that the model needs additional refinements to handle the complexity and variability of real-world political discourse, achieving over 70% accuracy seems a promising result in the under-explored domain of Spanish political fake news detection. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

188. Fake Review Detection Model Based on Comment Content and Review Behavior.

Author: Sun, Pengfei, Bi, Weihong, Zhang, Yifan, Wang, Qiuyu, Kou, Feifei, Lu, Tongwei, and Chen, Jinpeng
Subjects: CONSUMER behavior, DEEP learning, TRANSFORMER models, GROWTH industries, CATERING services
Abstract: With the development of the Internet, services such as catering, beauty, accommodation, and entertainment can be reserved or consumed online. Therefore, consumers increasingly rely on online information to choose merchants, products, and services, with reviews becoming a crucial factor in their decision making. However, the authenticity of reviews is highly debated in the field of Internet-based process-of-life service consumption. In recent years, due to the rapid growth of these industries, the detection of fake reviews has gained increasing attention. Fake reviews seriously mislead customers and damage the authenticity of online reviews. Various fake review classifiers have been developed, taking into account the content of the reviews and the behavior involved in the reviews, such as rating, time, etc. However, there has been no research considering the credibility of reviewers and merchants as part of identifying fake reviews. In order to improve the accuracy of existing fake review classification and detection methods, this study utilizes a comment text processing module to model the content of reviews, utilizes a reviewer behavior processing module and a reviewed merchant behavior processing module to model consumer review behavior sequences that imply reviewer credibility and merchant review behavior sequences that imply merchant credibility, respectively, and finally merges the two features for fake review classification. The experimental results show that, compared to other models, the model proposed in this paper improves the classification performance by simultaneously modeling the content of reviews and the credibility of reviewers and merchants. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

189. Phishing website detection using novel integration of BERT and XLNet with deep learning sequential models.

Author: Rao, Kongara Srinivasa, Valluru, Dinesh, Patnala, Satishkumar, Devareddi, Ravi Babu, Rama Krishna, Tummalapalli Siva, and Sravani, Andavarapu
Subjects: LANGUAGE models, RECURRENT neural networks, DEEP learning, SEQUENTIAL learning, PHISHING
Abstract: Phishing websites pose a significant threat to online security, necessitating robust detection mechanisms to safeguard users' sensitive information. This study explores the efficacy of various deep learning architectures for phishing website detection. Initially, traditional sequential models, including recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU), achieve accuracies of 95%, 96%, and 96.5%, respectively, on a curated dataset. Building upon these results, hybrid architectures that combine the strengths of traditional sequential models with state-of-the-art language representation models, bidirectional encoder representations from transformers (BERT) and XLNet, are investigated. Combinations such as RNN with BERT, BERT with LSTM, BERT with GRU, RNN with XLNet, XLNet with LSTM, and XLNet with GRU are evaluated. Through experimentation, accuracies of 94.5%, 96.5%, 96.1%, 95.7%, 97.4%, and 97%, respectively, are achieved, demonstrating the effectiveness of hybrid deep learning architectures in enhancing phishing detection performance. These findings contribute to advancing the state-of-the-art in cybersecurity practices and underscore the importance of leveraging diverse model types to combat online threats effectively. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

190. Sentiment analysis of student evaluation feedback using transformer-based language models.

Author: Ibnu Daqiqil, Hendy Saputra, Syamsudhuha, Rahmad Kurniawan, and Yanti Andriyani
Subjects: LANGUAGE models, GENERATIVE pre-trained transformers, TRANSFORMER models, SENTIMENT analysis, DEEP learning
Abstract: This paper proposes an approach to sentiment analysis of student evaluation feedback using transformer-based language models. The primary objective of this study is to conduct an in-depth analysis of sentiment expressed in student evaluation feedback, with a focus on introducing contextual understanding into the sentiment classification process. In this research, four different variants of transformer language models were assessed, namely multilingual bidirectional encoder representations from transformers (MBERT), IndoBERT, RoBERTa Indonesia, and generative pre-trained transformer (GPT-2 Indonesia). Additionally, we also compared the performance of transformer models with two traditional models, namely support vector machine (SVM) and Naive Bayes (NB). The evaluation was conducted using feedback data collected from the Evaluasi Dosen oleh Mahasiswa (EDOM) system at Riau University, which had been categorized as either positive or negative. The outcomes indicate that IndoBERT base uncased exhibits the highest performance, with precision, accuracy, and recall values of 0.858, 0.929, and 0.911, respectively. This observation highlights the effectiveness of transformer-based language models in sentiment analysis of student evaluation feedback and provides insights for improving educational assessment practices. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

191. An ontology-based knowledge representation framework for aircraft maintenance processes to support work optimization.

Author: Kang, Zixu, Zhou, Dong, Guo, Ziyue, Zhou, Qidi, and Wu, Hongduo
Subjects: *KNOWLEDGE representation (Information theory), *SEMANTIC Web, *KNOWLEDGE management, *INFORMATION sharing
Abstract: As a critical business activity in the aircraft life cycle, maintenance processes are highly complex and require multidisciplinary knowledge. Knowledge integration and representation oriented toward aircraft maintenance processes are necessary to improve work efficiency. Nonetheless, conventional approaches lack effective unified management, which obstructs domain knowledge sharing and ultimately impedes maintenance work. In this context, this paper proposes a knowledge representation framework based on the benefits of ontology, which formalizes multidisciplinary knowledge for aircraft maintenance processes. An ontology of aircraft maintenance processes is developed for knowledge conceptualization and reuse. On this basis, a domain knowledge extraction model based on the bidirectional encoder representation from transformers (BERT) is constructed to automatically extract entities and relationships related to maintenance processes. With a series of Semantic Web Rule Language (SWRL) rules, a knowledge reasoning method is proposed based on the aircraft maintenance process ontology to mine hidden knowledge. We evaluate the developed ontology and demonstrate the feasibility and usefulness of the proposed knowledge reasoning method in a case study. The results show that the proposed knowledge representation framework provides an effective knowledge formalization method for complex knowledge in aircraft maintenance processes to support work optimization. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

192. A collaborative system for recommending a service within the cloud using deep learning.

Author: Bourenane, Djihene, Sad-Houari, Nawal, and Taghezout, Noria
Abstract: The purpose of this article is to conceive an advanced decision support environment that assists a company in selecting the best service requiring several constraints (expert or machine). The proposed contribution integrates the recommendation in deep learning technique deployed in Cloud. The first step involves preprocessing data using deep learning techniques, particularly NMT and BERT encoders, while the second step uses the neural network to generate recommendations. The neural network's architecture includes two hidden layers consisting of 16 and 8 neurons, configured with the "ReLU" activation function, while the output layer uses the "Softmax" function. The experiments have been conducted on a dataset of 20, 000 services. Results demonstrate that migrating the DL-based approach to cloud computing significantly reduces response time and memory consumption by approximately 15% and 10% for classification tasks, and 50% and 30% for recommendation tasks, respectively. Compared to an earlier version of the proposed approach based on machine learning, the DL-approach improves recommendation accuracy by approximately 3%. However, the results in terms of response time and memory usage remain variable, suggesting that deep learning requires considerable computing resources. The proposed method was benchmarked using evaluation metrics such as accuracy, MAE, response time, and user satisfaction, demonstrating its practical superiority and applicability across various industries. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

193. Fusion of BERT embeddings and elongation-driven features.

Author: Rafae, Abderrahim, Erritali, Mohammed, and Roche, Mathieu
Subjects: LANGUAGE models, SENTIMENT analysis, ORAL communication, WRITTEN communication, SOCIAL media
Abstract: Elongated words such as "Wiiiiiin" or "allloooo" are common in oral communication and are often used to emphasize or exaggerate the hidden message of the root word. While elongated words are rarely found in written languages and dictionaries, they are prevalent in social media networks. Considering elongation in sentiment analysis can provide valuable insights into user sentiments. In this article, we analyze the impact of elongation on sentiment classification, along with an in-depth study of lexical forms of elongation. We propose a method to enhance sentiment classification accuracy by incorporating elongation-based features using BERT (bidirectional encoder representations from transformers) approaches. Experimental results conducted on Twitter data demonstrate that our model achieves an average accuracy of 87% through 10-fold cross-validation experiments. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

194. MREDTA: A BERT and transformer‐based molecular representation encoder for predicting drug‐target binding affinity.

Author: Sun, Xu, Huang, Juanjuan, Fang, Yabo, Jin, Yixuan, Wu, Jiageng, Wang, Guoqing, and Jia, Jiwei
Abstract: Drug‐target binding affinity (DTA) prediction is vital for drug repositioning. The accuracy and generalizability of DTA models remain a major challenge. Here, we develop a model composed of BERT‐Trans Block, Multi‐Trans Block, and DTI Learning modules, referred to as Molecular Representation Encoder‐based DTA prediction (MREDTA). MREDTA has three advantages: (1) extraction of both local and global molecular features simultaneously through skip connections; (2) improved sensitivity to molecular structures through the Multi‐Trans Block; (3) enhanced generalizability through the introduction of BERT. Compared with 12 advanced models, benchmark testing of KIBA and Davis datasets demonstrated optimal performance of MREDTA. In case study, we applied MREDTA to 2034 FDA‐approved drugs for treating non‐small‐cell lung cancer (NSCLC), all of which act on mutant EGFRT790M protein. The corresponding molecular docking results demonstrated the robustness of MREDTA. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

195. DeepIndel: An Interpretable Deep Learning Approach for Predicting CRISPR/Cas9-Mediated Editing Outcomes.

Author: Zhang, Guishan, Xie, Huanzeng, and Dai, Xianhua
Subjects: *NUCLEOTIDE sequence, *CRISPRS, *DNA sequencing, *GENOME editing, *REGRESSION analysis
Abstract: CRISPR/Cas9 has been applied to edit the genome of various organisms, but our understanding of editing outcomes at specific sites after Cas9-mediated DNA cleavage is still limited. Several deep learning-based methods have been proposed for repair outcome prediction; however, there is still room for improvement in terms of performance regarding frameshifts and model interpretability. Here, we present DeepIndel, an end-to-end multi-label regression model for predicting repair outcomes based on the BERT-base module. We demonstrate that our model outperforms existing methods in terms of accuracy and generalizability across various metrics. Furthermore, we utilized Deep SHAP to visualize the importance of nucleotides at various positions for DNA sequence and found that mononucleotides and trinucleotides in DNA sequences surrounding the cut site play a significant role in repair outcome prediction. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

196. An Efficient Fusion Network for Fake News Classification.

Author: Alzaidi, Muhammad Swaileh A., Alshammari, Alya, Hassan, Abdulkhaleq Q. A., Yousafzai, Samia Nawaz, Thaljaoui, Adel, Fitriyani, Norma Latif, Kim, Changgyun, and Syafrudin, Muhammad
Subjects: *LANGUAGE models, *FAKE news, *TECHNOLOGICAL progress, *TRUST, *SOCIAL networks
Abstract: Nowadays, it is very tough to differentiate between real news and fake news due to fast-growing social networks and technological progress. Manipulative news is defined as calculated misinformation with the aim of creating false beliefs. This kind of fake news is highly detrimental to society since it deepens political division and weakens trust in authorities and institutions. Therefore, the identification of fake news has emerged as a major field of research that seeks to validate content. The proposed model operates in two stages: First, TF-IDF is applied to an entire document to obtain its global features, and its spatial and temporal features are simultaneously obtained by employing Bidirectional Encoder Representations from Transformers and Bidirectional Long Short-Term Memory with a Gated Recurrent Unit. The Fast Learning Network efficiently classifies the extracted features. Comparative experiments were conducted on three easily and publicly obtainable large-scale datasets for the purposes of analyzing the efficiency of the approach proposed. The results also show how well the model performs compared with past methods of classification. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

197. A Survey of Hate Speech Detection for Arabic Social Media: Methods and Datasets.

Author: Al-Saqqa, Samar, Awajan, Arafat, and Hammo, Bassam
Subjects: SOCIAL media, HATE speech, EVIDENCE gaps, DEEP learning, RACE, ETHNICITY
Abstract: The last decade has witnessed an increase in harmful content on social media. The great proliferation of hate speech and other forms of aggressive language serves as evidence of this trend. The significant growth of user-generated content has prompted researchers to explore advanced techniques for hate speech detection, and most social media platforms have implemented measures to prevent posts targeting individuals or groups based on characteristics such as race, ethnicity, religion, gender, or nationality. This survey summarizes the methods used for hate speech detection in Arabic contexts, focusing on machine learning, deep learning, and transfer learning approaches. Furthermore, it presents Arabic datasets specifically constructed for this task, and identifies existing research gaps, offering insights to guide future studies in this field. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

198. Application of machine learning in technological forecasting.

Author: Nkolongo, Franck Tshibanda, Mehdi, Adda, and Echchakoui, Said
Subjects: CIRCULAR economy, PLASTIC analysis (Engineering), WEB databases, SCIENCE databases, POLLUTION
Abstract: The plastics industry is vital to Canada's economy, particularly in Quebec. However, environmental challenges persist, prompting companies to invest in research to enhance product performance and sustainability. Recent developments include biodegradable polymers and composite materials. This research aims to develop an automated method for extracting and analyzing text data through text similarity analysis and LDA (Latent Dirichlet Allocation) topic modeling. This approach helps identify both existing and emerging patented innovations, creating new categories within the patent classification system. The RoBERTa model, based on BERT and trained on patent data, has proven highly effective in identifying semantic similarities between technological classes and their patent summaries, achieving an accuracy significantly greater than 80%, regardless of the similarity threshold. The LDA topic analysis showed a 52% topic consistency score. A review of academic publication summaries from the Web of Science database revealed, for example, transitional approaches to the circular economy. These approaches represent a promising option for managing the end-of-life of plastics while reducing environmental pollution. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

199. Biomedical relation extraction method based on ensemble learning and attention mechanism.

Author: Jia, Yaxun, Wang, Haoyang, Yuan, Zhu, Zhu, Lian, and Xiang, Zuo-lin
Subjects: DEEP learning, MEDICAL research, ENSEMBLE learning
Abstract: Background: Relation extraction (RE) plays a crucial role in biomedical research as it is essential for uncovering complex semantic relationships between entities in textual data. Given the significance of RE in biomedical informatics and the increasing volume of literature, there is an urgent need for advanced computational models capable of accurately and efficiently extracting these relationships on a large scale. Results: This paper proposes a novel approach, SARE, combining ensemble learning Stacking and attention mechanisms to enhance the performance of biomedical relation extraction. By leveraging multiple pre-trained models, SARE demonstrates improved adaptability and robustness across diverse domains. The attention mechanisms enable the model to capture and utilize key information in the text more accurately. SARE achieved performance improvements of 4.8, 8.7, and 0.8 percentage points on the PPI, DDI, and ChemProt datasets, respectively, compared to the original BERT variant and the domain-specific PubMedBERT model. Conclusions: SARE offers a promising solution for improving the accuracy and efficiency of relation extraction tasks in biomedical research, facilitating advancements in biomedical informatics. The results suggest that combining ensemble learning with attention mechanisms is effective for extracting complex relationships from biomedical texts. Our code and data are publicly available at: https://github.com/GS233/Biomedical. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

200. Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models.

Author: Zhang, Kai, Li, Jinqiu, Wang, Bingqian, and Meng, Haoran
Subjects: LANGUAGE models, NATURAL language processing, DISTILLATION, GLUE, TEACHERS
Abstract: Pre-trained language models perform well in various natural language processing tasks. However, their large number of parameters poses significant challenges for edge devices with limited resources, greatly limiting their application in practical deployment. This paper introduces a simple and efficient method called Autocorrelation Matrix Knowledge Distillation (AMKD), aimed at improving the performance of smaller BERT models for specific tasks and making them more applicable in practical deployment scenarios. The AMKD method effectively captures the relationships between features using the autocorrelation matrix, enabling the student model to learn not only the performance of individual features from the teacher model but also the correlations among these features. Additionally, it addresses the issue of dimensional mismatch between the hidden states of the student and teacher models. Even in cases where the dimensions are smaller, AMKD retains the essential features from the teacher model, thereby minimizing information loss. Experimental results demonstrate that BERTTINY-AMKD outperforms traditional distillation methods and baseline models, achieving an average score of 83.6% on GLUE tasks. This represents a 4.1% improvement over BERTTINY-KD and exceeds the performance of BERT4-PKD and DistilBERT4 by 2.6% and 3.9%, respectively. Moreover, despite having only 13.3% of the parameters of BERTBASE, the BERTTINY-AMKD model retains over 96.3% of the performance of the teacher model, BERTBASE. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

5,481 results on '"BERT"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources