48 results for '"glove"'
Search Results
2. Inter project defect classification based on word embedding.
- Author
-
Kumar, Sushil, Sharma, Meera, Muttoo, S. K., and Singh, V. B.
- Abstract
Defect classification is a process to classify defects based on predefined categories. It is a time-consuming, manual process, and many automatic defect classification methods have been proposed to speed it up. However, these methods have not utilized the inter-relations among defect reports. In the defect classification literature, Term Frequency-Inverse Document Frequency and bag-of-words based approaches have been proposed. In this paper, we propose a word-embedding-based model for defect classification, which proves better than the existing methods. We also propose models for inter-project defect classification that combine different datasets from the same domain. We tested the proposed approach on 4096 defect reports using K-nearest neighbor, Random forest, Decision tree, Support vector machine, Stochastic gradient descent, and AdaBoost classifiers in terms of accuracy, precision, recall, and F1-score. Experimental results show that the Decision tree achieves the highest accuracy, 98.21%, when trained and tested on GloVe word embeddings. We have also generated new word embeddings from the bug reports corpus. Further, we compare the proposed model with Lopes et al. (2020), and the results show that our model outperforms theirs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
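The pipeline in result 2 (represent each defect report with word embeddings, then apply a standard classifier) can be sketched roughly as follows. The tiny 3-d embedding table, the class centroids, and the nearest-centroid rule are invented stand-ins for the paper's 300-d GloVe vectors and trained classifiers such as decision trees:

```python
# Sketch: represent a defect report as the mean of its word vectors,
# then classify by nearest class centroid. Toy 3-d "embeddings" stand
# in for real 300-d GloVe vectors.

TOY_GLOVE = {
    "crash":  [0.9, 0.1, 0.0],
    "null":   [0.8, 0.2, 0.1],
    "slow":   [0.1, 0.9, 0.0],
    "memory": [0.2, 0.8, 0.1],
}

def doc_vector(tokens, emb):
    """Average the vectors of in-vocabulary tokens."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * len(next(iter(emb.values())))
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def nearest_centroid(vec, centroids):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: sqdist(vec, centroids[lbl]))

# Hypothetical defect categories with hand-placed centroids.
centroids = {
    "reliability": [0.85, 0.15, 0.05],   # crash-like reports
    "performance": [0.15, 0.85, 0.05],   # slowness reports
}

report = "app crash on null pointer".split()
print(nearest_centroid(doc_vector(report, TOY_GLOVE), centroids))  # reliability
```

Averaging word vectors is the simplest way to get a fixed-length document vector; the paper's classifiers then operate on such vectors.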
3. SMS Spam Classification–Simple Deep Learning Models With Higher Accuracy Using BUNOW And GloVe Word Embedding
- Author
-
Surajit Giri, Sayak Das, Sutirtha Bharati Das, and Siddhartha Banerjee
- Subjects
sms spam ,machine learning ,cnn ,cnn-lstm ,word embedding ,glove ,bunow ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Chemical engineering ,TP155-156 ,Physics ,QC1-999 - Abstract
Unwanted text messages are called spam SMSs. It has been proven that machine learning models can categorize spam messages efficiently and with great accuracy. However, owing to the lack of proper spam-filtering software, and to the misclassification of genuine SMSs as spam by existing software, spam-detection applications have not become popular. In this paper, we propose multiple deep neural network models to classify spam messages. Tiago's dataset is used for this research. Initially, a preprocessing step is applied to the messages in the dataset, which involves lowercasing the text, tokenization, lemmatization, and removal of numbers, punctuation, and stop words. These preprocessed messages are fed into two deep learning models with simple architectures, namely a Convolutional Neural Network and a hybrid Convolutional Neural Network with a Long Short-Term Memory network, for classification. To increase the accuracy of these two simple architectures, the BUNOW and GloVe word-embedding techniques are incorporated into the deep learning models. BUNOW and GloVe are popular choices in sentiment analysis, but in this work these two word-embedding techniques are applied to text classification to improve accuracy. The best accuracy of 98.44% is achieved by the CNN-LSTM BUNOW model after 15 epochs on a 70%-30% train-test split. The proposed model can be used in many practical applications like real-time SMS spam detection, email spam detection, sentiment analysis, text categorization, etc.
- Published
- 2023
- Full Text
- View/download PDF
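The preprocessing steps listed in result 3 (lowercasing, tokenization, removal of numbers, punctuation, and stop words) can be sketched as below; the stop-word list is a small illustrative subset, and lemmatization is omitted since it needs a lexical resource:

```python
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "to", "you", "your"}  # tiny illustrative subset

def preprocess(message):
    """Lowercase, strip numbers and punctuation, tokenize, drop stop words."""
    message = message.lower()
    message = re.sub(r"\d+", " ", message)                    # remove numbers
    message = message.translate(str.maketrans("", "", string.punctuation))
    return [t for t in message.split() if t not in STOP_WORDS]

print(preprocess("WIN a FREE prize!! Call 0800-555 now."))
# ['win', 'free', 'prize', 'call', 'now']
```

The resulting token lists are what gets mapped to embedding indices before being fed to the CNN or CNN-LSTM model.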
4. Text Vectorization Techniques Based on Wordnet.
- Author
-
Držík, Dávid and Šteflovič, Kirsten
- Subjects
- *
NATURAL language processing , *DATA augmentation , *DATABASES - Abstract
The utilization of text vectorization techniques has become essential for numerous classification tasks in present-day natural language processing. Word embedding methods commonly used today, such as Word2Vec and GloVe, are based on the semantic similarity of words. WordNet, as a lexical database of words, provides a rich source of semantic information. In our article, we propose a text vectorization technique that extends the text data via data augmentation, specifically by replacing words with synonyms obtained from WordNet. The results obtained from text classification tasks using multiple classifiers demonstrate that expanding the corpus with this method leads to improved vector representations of words. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
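The synonym-replacement augmentation described in result 4 can be sketched as follows. The synonym table here is hypothetical; a real implementation would query WordNet for it (e.g. via NLTK's `nltk.corpus.wordnet.synsets(word)` and the synsets' lemma names), which is omitted so the sketch is self-contained:

```python
import random

# Hypothetical synonym table; a real implementation would fill this
# from WordNet, e.g. lemma names of nltk.corpus.wordnet.synsets(word).
SYNONYMS = {
    "good": ["fine", "excellent"],
    "movie": ["film"],
}

def augment(tokens, synonyms, p=1.0, seed=0):
    """Replace each token that has synonyms with one of them, with probability p."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in synonyms and rng.random() < p:
            out.append(rng.choice(synonyms[tok]))
        else:
            out.append(tok)
    return out

print(augment(["a", "good", "movie"], SYNONYMS))
```

Each augmented copy of a sentence enlarges the training corpus, which is what the authors report improves the learned word vectors.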
5. A supervised learning‐based approach for focused web crawling for IoMT using global co‐occurrence matrix.
- Author
-
Rajiv, S and Navaneethan, C
- Subjects
- *
SUPERVISED learning , *WEBSITES , *ARTIFICIAL neural networks , *INTERNET content , *SUPPORT vector machines , *RANDOM forest algorithms - Abstract
Irrelevant search results for a given topic waste search engine users' time. A learning-focused web crawler downloads URLs relevant to a given topic using machine-learning algorithms. The dynamic nature of the web makes relevance computation a challenge for focused web crawlers. Studies have shown that learning-focused crawlers use term frequency-inverse document frequency (TF-IDF) to compute the relevance between a web page and a given topic; TF-IDF measures the similarity of the given topic to its co-occurrences on the web page. The need for an efficient mechanism to compute the relevance of URLs both syntactically and semantically motivates this paper's word-embedding approach to computing web page relevance. The cosine similarity between the global vector representations of a topic and of the web page contents is calculated and provided as input to a trained random forest classifier, which predicts the relevancy of the web page. The evaluation results show that the proposed crawler produced an average hrate of 0.41 and prate of 0.59, outperforming learning-focused crawlers based on support vector machines, Naive Bayes, and artificial neural networks. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
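The topic-page relevance score at the heart of result 5 is a cosine similarity between embedding vectors. A minimal sketch, with toy vectors and a plain threshold standing in for the paper's trained random forest:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy GloVe-style vectors for a topic and two candidate pages.
topic = [0.7, 0.2, 0.1]
page_relevant = [0.6, 0.3, 0.1]
page_offtopic = [0.0, 0.1, 0.9]

for name, page in [("relevant", page_relevant), ("offtopic", page_offtopic)]:
    score = cosine(topic, page)
    # In the paper this score feeds a trained classifier; a plain
    # threshold stands in here.
    print(name, round(score, 3), "fetch" if score > 0.8 else "skip")
```

Pages scoring above the threshold would be fetched and their out-links queued, which is the core loop of a focused crawler.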
6. Data-Driven Solution to Identify Sentiments from Online Drug Reviews.
- Author
-
Haque, Rezaul, Laskar, Saddam Hossain, Khushbu, Katura Gania, Hasan, Md Junayed, and Uddin, Jia
- Subjects
NATURAL language processing ,MACHINE learning ,USER-generated content ,DRUG side effects ,MEDICAL personnel ,DEEP learning - Abstract
With the proliferation of the internet, social networking sites have become a primary source of user-generated content, including vast amounts of information about medications, diagnoses, treatments, and disorders. Comments on previously used medicines, contained within these data, can be leveraged to identify crucial adverse drug reactions, and machine learning (ML) approaches such as sentiment analysis (SA) can be employed to derive valuable insights. However, given the sheer volume of comments, it is often impractical for consumers to manually review all of them before making a purchase decision. Therefore, drug assessments can serve as a valuable source of medical information for both healthcare professionals and the general public, aiding in decision making and improving public monitoring systems by revealing collective experiences. Nonetheless, the unstructured and linguistic nature of the comments poses a significant challenge for effective categorization, with previous studies having utilized machine and deep learning (DL) algorithms to address this challenge. Despite both approaches showing promising results, DL classifiers outperformed ML classifiers in previous studies. Therefore, the objective of our study was to improve upon earlier research by applying SA to medication reviews and training five ML algorithms on two distinct feature extractions and four DL classifiers on two different word-embedding approaches to obtain higher categorization scores. Our findings indicated that the random forest trained on the count vectorizer outperformed all other ML algorithms, achieving an accuracy and F1 score of 96.65% and 96.42%, respectively. Furthermore, the bidirectional LSTM (Bi-LSTM) model trained on GloVe embedding resulted in an even better accuracy and F1 score, reaching 97.40% and 97.42%, respectively. Hence, by utilizing appropriate natural language processing and ML algorithms, we were able to achieve superior results compared to earlier studies. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
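The count-vectorizer feature extraction that result 6's best ML model (random forest) was trained on reduces to counting word occurrences against a fixed vocabulary. A minimal sketch with an invented two-document corpus:

```python
from collections import Counter

def fit_vocabulary(corpus):
    """Map each word in the corpus to a column index, in sorted order."""
    words = sorted({w for doc in corpus for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def count_vectorize(doc, vocab):
    """Turn a document into a vector of per-word counts."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

corpus = ["good drug no side effects", "bad drug bad effects"]
vocab = fit_vocabulary(corpus)
print(vocab)
print(count_vectorize(corpus[1], vocab))  # [2, 1, 1, 0, 0, 0]
```

Unlike the GloVe embeddings fed to the Bi-LSTM, these count vectors carry no semantic similarity between words; each word is an independent dimension.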
7. Exploring Semantic Similarity Measure Based on Word Embedding Representation for Arabic Passages Retrieval
- Author
-
Lahbari, Imane, Alaoui, Said Ouatik El, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Balas, Valentina E., editor, and Ezziyyani, Mostafa, editor
- Published
- 2022
- Full Text
- View/download PDF
8. Sentiment Analysis: Choosing the Right Word Embedding for Deep Learning Model
- Author
-
Garg, Sarita Bansal, Subrahmanyam, V. V., Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Bianchini, Monica, editor, Piuri, Vincenzo, editor, Das, Sanjoy, editor, and Shaw, Rabindra Nath, editor
- Published
- 2022
- Full Text
- View/download PDF
9. Cyberbullying detection from tweets using deep learning
- Author
-
Bharti, Shubham, Yadav, Arun Kumar, Kumar, Mohit, and Yadav, Divakar
- Published
- 2022
- Full Text
- View/download PDF
10. Exploring the effectiveness of word embedding based deep learning model for improving email classification
- Author
-
Asudani, Deepak Suresh, Nagwani, Naresh Kumar, and Singh, Pradeep
- Published
- 2022
- Full Text
- View/download PDF
11. Comparison of Various Word Embeddings for Hate-Speech Detection
- Author
-
Jain, Minni, Goel, Puneet, Singla, Puneet, Tehlan, Rahul, Xhafa, Fatos, Series Editor, Khanna, Ashish, editor, Gupta, Deepak, editor, Pólkowski, Zdzisław, editor, Bhattacharyya, Siddhartha, editor, and Castillo, Oscar, editor
- Published
- 2021
- Full Text
- View/download PDF
12. Textual Analysis of News for Stock Market Prediction
- Author
-
Bogdanov, Alexander V., Bogan, Maxim, Stankus, Alexey, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Gervasi, Osvaldo, editor, Murgante, Beniamino, editor, Misra, Sanjay, editor, Garau, Chiara, editor, Blečić, Ivan, editor, Taniar, David, editor, Apduhan, Bernady O., editor, Rocha, Ana Maria A. C., editor, Tarantino, Eufemia, editor, and Torre, Carmelo Maria, editor
- Published
- 2021
- Full Text
- View/download PDF
13. TEXT EMOTION RECOGNITION USING FAST TEXT WORD EMBEDDING IN BI-DIRECTIONAL GATED RECURRENT UNIT.
- Author
-
DEVI C., AKALYA, RENUKA D., KARTHIKA, HARISUDHAN T., JEEVANANTHAM V. K., JHANANI J., and VARSHINI S., KAVI
- Subjects
EMOTION recognition ,ARTIFICIAL neural networks ,TEXT recognition ,CONVOLUTIONAL neural networks ,EMOTIONS ,SPEECH perception - Abstract
Emotions are states of readiness in the mind that result from evaluations of one's own thinking or of events. Although almost all of the important events in our lives are marked by emotions, the nature, causes, and effects of emotions are among the least understood parts of the human experience. Emotion recognition is playing a promising role in the domains of human-computer interaction and artificial intelligence. A human's emotions can be detected using a variety of methods, including facial gestures, blood pressure, body movements, heart rate, and textual data. From an application standpoint, the ability to identify human emotions in text is becoming more and more crucial in computational linguistics. In this work, we present a classification methodology based on deep neural networks. The Bi-directional Gated Recurrent Unit (Bi-GRU) employed here demonstrates its effectiveness on the Multimodal EmotionLines Dataset (MELD) when compared to Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM). For word encoding, a comparison of three pre-trained word embeddings, namely GloVe, Word2Vec, and fastText, is made. The GloVe experiment utilized the "glove.6B.300d" vector space, while the fastText vectors comprise two million word representations in 300 dimensions trained on Common Crawl with sub-word information (600 billion tokens). The accuracy scores of GloVe, Word2Vec, and fastText (300 dimensions each) are tabulated and studied in order to highlight the improved results with fastText on the MELD dataset. It is observed that the Bi-directional Gated Recurrent Unit (Bi-GRU) with fastText word embedding outperforms GloVe and Word2Vec with an accuracy of 79.7%. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
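Result 13 compares pre-trained embeddings such as "glove.6B.300d". The released GloVe files are plain text, one word per line followed by its vector components, so loading them reduces to the sketch below (a 3-d in-memory sample stands in for the real 300-d file):

```python
# The released GloVe files (e.g. glove.6B.300d.txt) are plain text:
# each line holds a word followed by its vector components, space-separated.
sample = """\
the 0.1 0.2 0.3
movie 0.4 0.5 0.6
"""

def load_glove(text):
    """Parse GloVe-format text into a {word: vector} dict."""
    embeddings = {}
    for line in text.strip().splitlines():
        word, *values = line.split()
        embeddings[word] = [float(v) for v in values]
    return embeddings

emb = load_glove(sample)
print(emb["movie"])  # [0.4, 0.5, 0.6]
```

In practice the parsed dictionary is used to build the embedding matrix handed to the Bi-GRU's embedding layer.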
14. Deep Neural Network Framework Based on Word Embedding for Protein Glutarylation Sites Prediction.
- Author
-
Liu, Chuan-Ming, Ta, Van-Dai, Le, Nguyen Quoc Khanh, Tadesse, Direselign Addis, and Shi, Chongyang
- Subjects
- *
ARTIFICIAL neural networks , *AMINO acid sequence , *FORECASTING - Abstract
In recent years, much research has found that dysregulation of glutarylation is associated with many human diseases, such as diabetes, cancer, and glutaric aciduria type I. Therefore, glutarylation identification and characterization are essential tasks for determining modification-specific proteomics. This study aims to propose a novel deep neural network framework based on word embedding techniques for glutarylation sites prediction. Multiple deep neural network models are implemented to evaluate the performance of glutarylation sites prediction. Furthermore, an extensive experimental comparison of word embedding techniques is conducted to identify the most efficient method for improving protein sequence data representation. The results suggest that the proposed deep neural networks not only improve protein sequence representation but also work effectively in glutarylation sites prediction, obtaining a higher accuracy and confidence rate than the previous work. Moreover, embeddings trained on the glutarylation sequence corpus proved more productive than pre-trained word embeddings for glutarylation sequence representation. Our proposed method significantly outperforms the advanced integrated support vector approach on all traditional performance metrics, with accuracy, specificity, sensitivity, and correlation coefficient of 0.79, 0.89, 0.59, and 0.51, respectively. It shows the potential to detect new glutarylation sites and to uncover the relationships between glutarylation and well-known lysine modifications. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
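Applying word-embedding techniques to protein sequences, as in result 14, requires first turning a sequence into "words". A common choice (assumed here; the abstract does not specify the exact tokenization) is overlapping k-mers:

```python
def kmer_tokens(sequence, k=3):
    """Split a protein sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A short amino-acid window around a hypothetical lysine (K) site.
window = "MKGLV"
print(kmer_tokens(window))  # ['MKG', 'KGL', 'GLV']
```

The resulting k-mer "sentences" can then be fed to standard embedding trainers or embedding layers, exactly as word sequences would be in text classification.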
15. Study of Information Retrieval and Machine Learning-Based Software Bug Localization Models
- Author
-
Tamanna, Sangwan, Om Prakash, Bansal, Jagdish Chand, Series Editor, Deep, Kusum, Series Editor, Nagar, Atulya K., Series Editor, Sharma, Harish, editor, Govindan, Kannan, editor, Poonia, Ramesh C., editor, Kumar, Sandeep, editor, and El-Medany, Wael M., editor
- Published
- 2020
- Full Text
- View/download PDF
16. Emoticon Prediction on Textual Data Using Stacked LSTM Model
- Author
-
Mittal, Mamta, Arora, Maanak, Pandey, Tushar, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Bansal, Jagdish Chand, editor, Gupta, Mukesh Kumar, editor, Sharma, Harish, editor, and Agarwal, Basant, editor
- Published
- 2020
- Full Text
- View/download PDF
17. Data-Driven Solution to Identify Sentiments from Online Drug Reviews
- Author
-
Rezaul Haque, Saddam Hossain Laskar, Katura Gania Khushbu, Md Junayed Hasan, and Jia Uddin
- Subjects
deep learning ,word embedding ,Bi-LSTM ,GloVe ,drug sentiment analysis ,drug discovery ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
With the proliferation of the internet, social networking sites have become a primary source of user-generated content, including vast amounts of information about medications, diagnoses, treatments, and disorders. Comments on previously used medicines, contained within these data, can be leveraged to identify crucial adverse drug reactions, and machine learning (ML) approaches such as sentiment analysis (SA) can be employed to derive valuable insights. However, given the sheer volume of comments, it is often impractical for consumers to manually review all of them before making a purchase decision. Therefore, drug assessments can serve as a valuable source of medical information for both healthcare professionals and the general public, aiding in decision making and improving public monitoring systems by revealing collective experiences. Nonetheless, the unstructured and linguistic nature of the comments poses a significant challenge for effective categorization, with previous studies having utilized machine and deep learning (DL) algorithms to address this challenge. Despite both approaches showing promising results, DL classifiers outperformed ML classifiers in previous studies. Therefore, the objective of our study was to improve upon earlier research by applying SA to medication reviews and training five ML algorithms on two distinct feature extractions and four DL classifiers on two different word-embedding approaches to obtain higher categorization scores. Our findings indicated that the random forest trained on the count vectorizer outperformed all other ML algorithms, achieving an accuracy and F1 score of 96.65% and 96.42%, respectively. Furthermore, the bidirectional LSTM (Bi-LSTM) model trained on GloVe embedding resulted in an even better accuracy and F1 score, reaching 97.40% and 97.42%, respectively. Hence, by utilizing appropriate natural language processing and ML algorithms, we were able to achieve superior results compared to earlier studies.
- Published
- 2023
- Full Text
- View/download PDF
18. Geoscience language models and their intrinsic evaluation
- Author
-
Christopher J.M. Lawley, Stefania Raimondo, Tianyi Chen, Lindsay Brin, Anton Zakharov, Daniel Kur, Jenny Hui, Glen Newton, Sari L. Burgoyne, and Geneviève Marquis
- Subjects
Word embedding ,Language models ,Machine learning ,Artificial intelligence ,BERT ,GloVe ,Geography. Anthropology. Recreation ,Geology ,QE1-996.5 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Geoscientists use observations and descriptions of the rock record to study the origins and history of our planet, which has resulted in a vast volume of scientific literature. Recent progress in natural language processing (NLP) has the potential to parse through and extract knowledge from unstructured text, but there has, so far, been only limited work on the concepts and vocabularies that are specific to geoscience. Herein we harvest and process public geoscientific reports (i.e., Canadian federal and provincial geological survey publications databases) and a subset of open access and peer-reviewed publications to train new, geoscience-specific language models to address that knowledge gap. Language model performance is validated using a series of new geoscience-specific NLP tasks (i.e., analogies, clustering, relatedness, and nearest neighbour analysis) that were developed as part of the current study. The raw and processed national geological survey corpora, language models, and evaluation criteria are all made public for the first time. We demonstrate that non-contextual (i.e., Global Vectors for Word Representation, GloVe) and contextual (i.e., Bidirectional Encoder Representations from Transformers, BERT) language models updated using the geoscientific corpora outperform the generic versions of these models for each of the evaluation criteria. Principal component analysis further demonstrates that word embeddings trained on geoscientific text capture meaningful semantic relationships, including rock classifications, mineral properties and compositions, and the geochemical behaviour of elements. Semantic relationships that emerge from the vector space have the potential to unlock latent knowledge within unstructured text, and perhaps more importantly, also highlight the potential for other downstream geoscience-focused NLP tasks (e.g., keyword prediction, document similarity, recommender systems, rock and mineral classification).
- Published
- 2022
- Full Text
- View/download PDF
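One of the intrinsic evaluation tasks named in result 18 is analogy solving, conventionally done with the vector-offset method (solve a : b :: c : ? as the nearest neighbour of vec(b) - vec(a) + vec(c)). A sketch with invented 2-d geoscience-flavoured vectors:

```python
import math

# Toy 2-d vectors; real models would supply 300-d GloVe or BERT vectors.
TOY_VECS = {
    "basalt":  [1.0, 1.0],
    "mafic":   [0.0, 1.0],
    "granite": [1.0, -1.0],
    "felsic":  [0.0, -1.0],
    "quartz":  [0.5, -0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(emb, a, b, c):
    """Solve a : b :: c : ? by nearest neighbour to vec(b) - vec(a) + vec(c)."""
    target = [yb - ya + yc for ya, yb, yc in zip(emb[a], emb[b], emb[c])]
    candidates = {w: v for w, v in emb.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy(TOY_VECS, "basalt", "mafic", "granite"))  # felsic
```

Scoring a model is then just the fraction of domain analogies it solves correctly, which is how such intrinsic benchmarks compare the geoscience-tuned models against the generic ones.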
19. AltibbiVec: A Word Embedding Model for Medical and Health Applications in the Arabic Language
- Author
-
Maria Habib, Mohammad Faris, Alaa Alomari, and Hossam Faris
- Subjects
Arabic ,fastText ,GloVe ,healthcare ,pre-trained ,word embedding ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
In recent years, the utilization of natural language processing (NLP) and machine learning (ML) techniques in clinical decision support systems has shown its ability to improve and automate the diagnosis process and to reduce potential clinical errors. NLP in the Arabic language is more intricate due to several limitations, such as the lack of datasets and analytical resources compared to other languages like English. However, a clinical decision support system in the Arabic context is of significant importance. A fundamental process in NLP is extracting features from text-based data via text embedding. Word embedding is a representation of words in a numeric format that encodes statistical, semantic, or contextual information. Building a neural word embedding model requires hundreds of thousands of data instances to find hidden patterns of relationships within sentences. Essentially, extracting relevant and informative features promotes the performance of the learning algorithms. The objective of this paper is to propose an Arabic neural word embedding model for the medical and healthcare context (called "AltibbiVec"). Around 1.5 million medical consultations and questions written in different dialects were obtained from the Altibbi telemedicine company and used to train the embedding model. Three different embedding models are developed and compared: Word2Vec, fastText, and GloVe. The trained models were evaluated by different criteria, including word clustering, word similarity, and a specialty-based question classification task. The results show that Word2Vec and fastText capture the semantics of the text better than GloVe. Hence, they are recommended for healthcare NLP-based applications.
- Published
- 2021
- Full Text
- View/download PDF
20. Measuring associational thinking through word embeddings.
- Author
-
Periñán-Pascual, Carlos
- Subjects
NATURAL language processing ,PEARSON correlation (Statistics) - Abstract
The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article aims to automatically estimate the strength of association between words, whether or not they are semantically related. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies not only on the rank ordering of word pairs but also on the strength of associations can reveal findings that go unnoticed by traditional measures such as Spearman's and Pearson's correlation coefficients. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
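The combination result 20 reports (a weighted average of cosine similarities from two independently built embedding spaces) can be sketched as follows; the two toy spaces and the 0.5 weight are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Two independently built toy spaces over the same vocabulary,
# standing in for the paper's corpus- and network-based embeddings.
corpus_emb  = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
network_emb = {"cat": [0.7, 0.3], "dog": [0.6, 0.4], "car": [0.2, 0.8]}

def association(w1, w2, weight=0.5):
    """Weighted average of per-space cosine similarities (weight assumed)."""
    s1 = cosine(corpus_emb[w1], corpus_emb[w2])
    s2 = cosine(network_emb[w1], network_emb[w2])
    return weight * s1 + (1 - weight) * s2

print(round(association("cat", "dog"), 3))
print(round(association("cat", "car"), 3))
```

Associated pairs (cat/dog) score well above unrelated ones (cat/car), and in the paper such scores are correlated against human association judgements.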
21. Modern Approaches to Detecting and Classifying Toxic Comments Using Neural Networks.
- Author
-
Morzhov, S. V.
- Abstract
The rising popularity of online platforms on which users communicate with each other, share opinions about various events, and leave comments has spurred the development of natural language processing algorithms. Content moderation requires analyzing tens of millions of messages published daily by users of a given social network, in real time, in order to prevent the spread of various illegal or offensive information, threats, and other types of toxic comments. Of course, such a large amount of data can be processed quickly enough only automatically. That leads to the problem of teaching computers to "understand" human written speech, which is nontrivial even if "understand" here means nothing more than "classify". The rapid evolution of machine learning technologies has led to the ubiquitous implementation of new algorithms. With the use of deep learning technologies, we are now able to quite successfully solve many problems that had for years been considered almost impossible. This article considers algorithms constructed using deep learning technologies and neural networks that solve the problem of detecting and classifying toxic comments. In addition, the article presents the results of testing both the developed algorithms and an ensemble of all considered algorithms on a large training set collected and tagged by Google and Jigsaw. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
22. Extracting information from textual descriptions for actuarial applications.
- Author
-
Manski, Scott, Yang, Kaixu, Lee, Gee Y., and Maiti, Tapabrata
- Subjects
VECTORS (Calculus) ,MATRICES (Mathematics) ,INSURANCE companies - Abstract
Initial insurance losses are often reported with a textual description of the claim. The claims manager must determine the adequate case reserve for each known claim. In this paper, we present a framework for predicting the amount of loss given a textual description of the claim using a large number of words found in the descriptions. Prior work has focused on classifying insurance claims based on keywords selected by a human expert, whereas in this paper the focus is on loss amount prediction with automatic word selection. In order to transform words into numeric vectors, we use word cosine similarities and word embedding matrices. When we consider all unique words found in the training dataset and impose a generalised additive model to the resulting explanatory variables, the resulting design matrix is high dimensional. For this reason, we use a group lasso penalty to reduce the number of coefficients in the model. The scalable, analytical framework proposed provides for a parsimonious and interpretable model. Finally, we discuss the implications of the analysis, including how the framework may be used by an insurance company and how the interpretation of the covariates can lead to significant policy change. The code can be found in the TAGAM R package (github.com/scottmanski/TAGAM). [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
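The group lasso penalty that result 22 applies to its high-dimensional design matrix takes the following standard form; the notation (loss function, group sizes) is the conventional one and is assumed here rather than taken from the paper:

```latex
\min_{\beta_0,\,\beta}\; \sum_{i=1}^{n} \ell\!\left(y_i,\; \beta_0 + \sum_{g=1}^{G} x_{i,g}^{\top}\beta_g\right) \;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta_g \rVert_2
```

where $\beta_g$ holds the coefficients of group $g$ (size $p_g$) and $\ell$ is the model's loss. Because the $\ell_2$ norm is applied unsquared per group, entire groups of coefficients are driven to zero together, which is what removes whole words from the model and yields the parsimonious, interpretable fit the authors describe.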
23. Avoiding Unintended Bias in Toxicity Classification with Neural Networks
- Author
-
Sergey Morzhov
- Subjects
toxicity ,natural language processing ,nlp ,deep learning ,recurrent neural networks ,rnn ,lstm ,gru ,attention mechanism ,word embedding ,fasttext ,glove ,bert ,Telecommunication ,TK5101-6720 - Abstract
The growing popularity of online platforms that allow users to communicate with each other, exchange opinions about various events, and leave comments has contributed to the development of natural language processing algorithms. Tens of millions of messages per day published by users of a given social network must be analyzed in real time for moderation, to prevent the spread of various illegal or offensive information, threats, and other types of toxic comments. Of course, such a large amount of information can be processed quickly only automatically. That is why it is necessary to find a way to teach a computer to "understand" a text written by a human. This is a non-trivial task, even if the word "understand" here means only to detect or classify. The rapid development of machine learning technologies has led to the widespread adoption of new algorithms. Many tasks that for years were considered almost impossible to solve using a computer can now be successfully solved with deep learning technologies. In this article the author presents new algorithms that can successfully solve the problem of toxic comment detection using deep learning technologies and neural networks. Furthermore, this article presents the results of the developed algorithms, as well as of their ensemble, tested on a large training set gathered and marked up by Google and Jigsaw.
- Published
- 2020
- Full Text
- View/download PDF
24. Deep Neural Network Framework Based on Word Embedding for Protein Glutarylation Sites Prediction
- Author
-
Chuan-Ming Liu, Van-Dai Ta, Nguyen Quoc Khanh Le, Direselign Addis Tadesse, and Chongyang Shi
- Subjects
glutarylation site prediction ,deep neural networks ,word embedding ,LSTM ,ELMo ,GloVe ,Science - Abstract
In recent years, much research has found that dysregulation of glutarylation is associated with many human diseases, such as diabetes, cancer, and glutaric aciduria type I. Therefore, glutarylation identification and characterization are essential tasks for determining modification-specific proteomics. This study aims to propose a novel deep neural network framework based on word embedding techniques for glutarylation sites prediction. Multiple deep neural network models are implemented to evaluate the performance of glutarylation sites prediction. Furthermore, an extensive experimental comparison of word embedding techniques is conducted to identify the most efficient method for improving protein sequence data representation. The results suggest that the proposed deep neural networks not only improve protein sequence representation but also work effectively in glutarylation sites prediction, obtaining a higher accuracy and confidence rate than the previous work. Moreover, embeddings trained on the glutarylation sequence corpus proved more productive than pre-trained word embeddings for glutarylation sequence representation. Our proposed method significantly outperforms the advanced integrated support vector approach on all traditional performance metrics, with accuracy, specificity, sensitivity, and correlation coefficient of 0.79, 0.89, 0.59, and 0.51, respectively. It shows the potential to detect new glutarylation sites and to uncover the relationships between glutarylation and well-known lysine modifications.
- Published
- 2022
- Full Text
- View/download PDF
25. Parallel implementation of solving linear equations using OpenMP
- Author
-
Paliwal, Maitreyee, Chilla, Rishita Reddy, Prasanth, N Narayanan, Goundar, Sam, and Raja, S. P.
- Published
- 2022
- Full Text
- View/download PDF
26. An Enhanced Sentiment Analysis Framework Based on Pre-Trained Word Embedding.
- Author
-
Mohamed, Ensaf Hussein, Moussa, Mohammed ElSaid, and Haggag, Mohamed Hassan
- Subjects
- *
SENTIMENT analysis , *NATURAL language processing , *DEEP learning , *MACHINE theory , *MACHINE learning , *FEATURE extraction - Abstract
Sentiment analysis (SA) is a technique that lets people in fields such as business, economics, research, government, and politics learn about public opinion, which greatly affects the process of decision-making. SA techniques are classified into lexicon-based techniques, machine learning techniques, and hybrids of both approaches. Each approach has its limitations and drawbacks: the machine learning approach depends on manual feature extraction, while the lexicon-based approach relies on sentiment lexicons that are usually unscalable, unreliable, and manually annotated by human experts. Nowadays, word-embedding techniques are commonly used in SA classification. Currently, Word2Vec and GloVe are among the most accurate and usable word embedding techniques, which can transform words into meaningful semantic vectors. However, these techniques ignore the sentiment information of texts and require a huge corpus of texts for training and generating accurate vectors, which are used as inputs of deep learning models. In this paper, we propose an enhanced ensemble classifier framework. Our framework is based on our previously published lexicon-based method, bag-of-words, and pre-trained word embedding. First, the sentence is preprocessed by removing stop-words, POS tagging, stemming and lemmatization, and shortening exaggerated words. Second, the processed sentence is passed to three modules that produce feature vectors: our previous lexicon-based method (Sum Votes), a bag-of-words module, and a semantic module (Word2Vec and GloVe). Finally, these feature vectors are fed into 11 different classifiers. The proposed framework is tested and evaluated over four datasets with five different lexicons; the experimental results show that our proposed model outperforms the previous lexicon-based and machine learning methods individually. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
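The framework above passes a sentence through three modules that each emit a feature vector. A common way to combine such per-module vectors before classification, and an assumption about how this framework wires them together, is simple concatenation. A hypothetical sketch with illustrative dimensions:

```python
import numpy as np

def combine_features(lexicon_vec, bow_vec, semantic_vec):
    """Concatenate the per-module feature vectors into one classifier input.
    The module names mirror the framework's description; the dimensions
    below are illustrative, not the paper's."""
    return np.concatenate([lexicon_vec, bow_vec, semantic_vec])

lexicon = np.array([0.7, -0.2])             # e.g. lexicon polarity scores
bow = np.array([1.0, 0.0, 2.0])             # e.g. bag-of-words counts
semantic = np.array([0.1, 0.3, -0.5, 0.9])  # e.g. averaged Word2Vec/GloVe vector

features = combine_features(lexicon, bow, semantic)
print(features.shape)  # (9,)
```

The combined vector would then be fed to each of the 11 classifiers; any ensemble rule (e.g. majority voting over their predictions) sits on top of this step.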
27. Word Embedding for Rhetorical Sentence Categorization on Scientific Articles
- Author
-
Ghoziyah Haitan Rachman, Masayu Leylia Khodra, and Dwi Hendratmo Widyantoro
- Subjects
GloVe ,rhetorical sentence categorization ,scientific article ,word embedding ,Word2Vec. ,Telecommunication ,TK5101-6720 ,Information technology ,T58.5-58.64 - Abstract
A common task in summarizing scientific articles is employing the rhetorical structure of sentences. Determining rhetorical sentences itself passes through the process of text categorization. In order to get good performance, some works in text categorization have been done by employing word embedding. This paper presents rhetorical sentence categorization of scientific articles by using word embedding to capture semantically similar words. A comparison of employing Word2Vec and GloVe is shown. First, two experiments are evaluated using five classifiers, namely Naïve Bayes, Linear SVM, IBK, J48, and Maximum Entropy. Then, the best classifier from the first two experiments was employed. This research showed that Word2Vec CBOW performed better than Skip-Gram and GloVe. The best experimental result was from Word2Vec CBOW for 20,155 resource papers from ACL-ARC, features from Teufel and the previous label feature. In this experiment, Linear SVM produced the highest F-measure performance at 43.44%.
- Published
- 2018
- Full Text
- View/download PDF
28. Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages
- Author
-
Yang Yuan, Xiao Li, and Ya-Ting Yang
- Subjects
word embedding ,word alignment probability ,distance attenuation function ,word2vec ,glove ,Information technology ,T58.5-58.64 - Abstract
To overcome the data sparseness of word embeddings trained on low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with the intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embedding. Experimental results show that, compared with several widely used baseline models such as GloVe and Word2Vec, our model significantly improves word embedding performance for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves the word analogy task by 0.71 percentage points and achieves the best results on all of the word similarity tasks.
- Published
- 2019
- Full Text
- View/download PDF
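The abstract does not give the exact form of the punctuation-based distance attenuation function, but the idea of down-weighting word pairs separated by punctuation can be sketched. The attenuation form `(1/d) * penalty**n_punct` below is a guess for illustration, not the paper's actual definition:

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=5, punct_penalty=0.5,
                        punctuation=frozenset({",", ".", ";", "!", "?"})):
    """Build a weighted word-pair co-occurrence dict in which each
    intervening punctuation mark further attenuates the pair's weight.
    Distance d counts token positions, punctuation included."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        if w in punctuation:
            continue  # punctuation is never a target word
        n_punct = 0
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] in punctuation:
                n_punct += 1
                continue
            d = j - i
            counts[(w, tokens[j])] += (1.0 / d) * punct_penalty ** n_punct
    return counts

m = cooccurrence_matrix("the cat sat , the dog ran".split())
# A pair separated by a comma is attenuated relative to an adjacent pair.
print(m[("cat", "sat")], m[("sat", "the")])
```

The resulting matrix plays the same role as GloVe's global co-occurrence matrix, which is where the paper integrates it with vectors from the bilingual parallel corpus.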
29. Hoax analyzer for Indonesian news using RNNs with fasttext and glove embeddings
- Author
-
Derwin Suhartono, Bagas Pradipabista Nayoga, Ryan Adipradana, and Ryan Suryadi
- Subjects
Control and Optimization ,Word embedding ,Computer Networks and Communications ,Computer science ,Recurrent neural network ,computer.software_genre ,Supervised text classification ,Resource (project management) ,Indonesian language ,Computer Science (miscellaneous) ,Fake news analyzer ,Misinformation ,Electrical and Electronic Engineering ,Instrumentation ,fastText ,Hoax ,business.industry ,language.human_language ,Indonesian ,Hardware and Architecture ,Control and Systems Engineering ,language ,Embedding ,GloVe ,The Internet ,Artificial intelligence ,business ,computer ,Natural language processing ,Information Systems - Abstract
Misinformation has become an innocuous yet potentially harmful problem ever since the development of the internet. A number of efforts have been made to prevent the consumption of misinformation, including the use of artificial intelligence (AI), mainly natural language processing (NLP). Unfortunately, most natural language processing uses English as its linguistic approach, since English is a high-resource language. By contrast, Indonesian is considered a low-resource language, so the amount of effort to diminish the consumption of misinformation is low compared to English-based natural language processing. This experiment compares fastText and GloVe embeddings for four deep neural network (DNN) models: long short-term memory (LSTM), bidirectional long short-term memory (BI-LSTM), gated recurrent unit (GRU) and bidirectional gated recurrent unit (BI-GRU), in terms of metric scores when classifying news into three classes: fake, valid, and satire. The results show that fastText embedding is better than GloVe embedding for supervised text classification, with BI-GRU + fastText yielding the best result.
- Published
- 2021
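Before a GRU or LSTM classifier can consume fastText or GloVe vectors, the pretrained vectors are typically packed into an embedding matrix indexed by the model's vocabulary. A sketch of that common preparation step (the toy words and dimensions are illustrative, not from the paper):

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    """Pack pretrained vectors into a matrix whose row i holds the vector
    for the word with index i. Out-of-vocabulary words get small random
    vectors; row 0 is reserved for padding."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        vec = pretrained.get(word)
        matrix[idx] = vec if vec is not None else rng.normal(0, 0.1, dim)
    return matrix

# Toy vectors standing in for a GloVe/fastText file.
pretrained = {"berita": np.array([0.1, 0.2]), "palsu": np.array([0.3, -0.1])}
word_index = {"berita": 1, "palsu": 2, "satire": 3}  # 'satire' is OOV here
matrix = build_embedding_matrix(word_index, pretrained, dim=2)
print(matrix.shape)  # (4, 2)
```

This matrix is what an embedding layer of the recurrent model would be initialized with; one advantage fastText has here is that it can compose vectors for OOV words from subword units instead of falling back to random initialization.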
30. Word Embedding for Rhetorical Sentence Categorization on Scientific Articles.
- Author
-
Rachman, Ghoziyah Haitan, Khodra, Masayu Leylia, and Widyantoro, Dwi Hendratmo
- Subjects
COMPUTATIONAL linguistics ,RHETORICAL criticism ,ARTIFICIAL intelligence ,INFORMATION retrieval ,MACHINE learning - Abstract
A common task in summarizing scientific articles is employing the rhetorical structure of sentences. Determining rhetorical sentences itself passes through the process of text categorization. In order to get good performance, some works in text categorization have been done by employing word embedding. This paper presents rhetorical sentence categorization of scientific articles by using word embedding to capture semantically similar words. A comparison of employing Word2Vec and GloVe is shown. First, two experiments are evaluated using five classifiers, namely Naïve Bayes, Linear SVM, IBK, J48, and Maximum Entropy. Then, the best classifier from the first two experiments was employed. This research showed that Word2Vec CBOW performed better than Skip-Gram and GloVe. The best experimental result was from Word2Vec CBOW for 20,155 resource papers from ACL-ARC, features from Teufel and the previous label feature. In this experiment, Linear SVM produced the highest F-measure performance at 43.44%. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
31. Content Tree Word Embedding for document representation.
- Author
-
Kamkarhaghighi, Mehran and Makrehchi, Masoud
- Subjects
- *
SENTIMENT analysis , *COMPUTATIONAL linguistics , *LANGUAGE & languages , *DEEP learning , *DATA mining - Abstract
Only humans can understand and comprehend the actual meaning that underlies natural written language, whereas machines can form semantic relationships only after humans have provided the parameters that are necessary to model the meaning. To enable computer models to access the underlying meaning in written language, accurate and sufficient document representation is crucial. Recently, word embedding approaches have drawn much attention in text mining research. One of the main benefits of such approaches is the use of global corpuses with the generation of pre-trained word vectors. Although very effective, these approaches have their disadvantages. Relying only on pre-trained word vectors may neglect the local context and increase word ambiguity. In this study, a new approach, Content Tree Word Embedding (CTWE), is introduced to mitigate the risk of word ambiguity and inject a local context into globally pre-trained word vectors. CTWE is basically a framework for document representation while using word embedding feature learning. The CTWE structure is locally learned from training data and ultimately represents the local context. While CTWE is constructed, each word vector is updated based on its location in the content tree. For the task of classification, the results show an improvement in F-score and accuracy measures when using two deep learning-based word embedding approaches, namely GloVe and Word2Vec. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
32. A ConvBiLSTM Deep Learning Model-Based Approach for Twitter Sentiment Classification
- Author
-
Rachid Ben Said, Ömer Özgür Tanrıöver, and Sakirin Tam
- Subjects
Word embedding ,General Computer Science ,Computer science ,Bi-LSTM ,Feature extraction ,Context (language use) ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Convolutional neural network ,0202 electrical engineering, electronic engineering, information engineering ,Feature (machine learning) ,General Materials Science ,Word2vec ,Word2Vec ,0105 earth and related environmental sciences ,Natural Language Processing ,business.industry ,Deep learning ,Sentiment analysis ,General Engineering ,Pattern recognition ,sentiment analysis ,020201 artificial intelligence & image processing ,GloVe ,Artificial intelligence ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,business ,lcsh:TK1-9971 ,CNN - Abstract
Being one of the most widely used social media tools, Twitter is seen as an important source of information for acquiring people’s attitudes, emotions, views and feedback. Within this context, Twitter sentiment analysis techniques were developed to decide whether textual tweets express a positive or negative opinion. In contrast to the lower classification performance of traditional algorithms, deep learning models, including the Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM), have achieved significant results in sentiment analysis. Although a CNN can efficiently extract high-level local features using convolutional and max-pooling layers, it cannot effectively learn sequential correlations. On the other hand, Bi-LSTM uses two LSTM directions to improve the context available to deep learning algorithms, but it cannot extract local features in a parallel way. Therefore, applying a single CNN or a single Bi-LSTM for sentiment analysis cannot achieve the optimal classification result. An integrated structure of the CNN and Bi-LSTM models is proposed in this study. ConvBiLSTM is implemented as follows: a word embedding model converts tweets into numerical values, a CNN layer receives the feature embedding as input and produces a smaller feature dimension, and the Bi-LSTM model takes the input from the CNN layer and produces the classification result. Word2Vec and GloVe were each applied to observe the impact of the word embedding on the proposed model. ConvBiLSTM was applied to the retrieved Tweets and SST-2 datasets. The ConvBiLSTM model with Word2Vec on the retrieved Tweets dataset outperformed the other models with 91.13% accuracy.
- Published
- 2021
33. A Personality Mining System for German Twitter Posts With Global Vectors Word Embedding
- Author
-
Henning Usselmann, Rangina Ahmad, and Dominik Siemon
- Subjects
Word embedding ,Source code ,General Computer Science ,Computer science ,media_common.quotation_subject ,Sample (statistics) ,German ,big five ,Personality ,Web application ,General Materials Science ,Electrical and Electronic Engineering ,Big Five personality traits ,Personality mining ,media_common ,Information retrieval ,business.industry ,General Engineering ,LIWC ,language.human_language ,TK1-9971 ,machine learning ,Feeling ,language ,GloVe ,Electrical engineering. Electronics. Nuclear engineering ,business - Abstract
People’s personality influences their behaviors, attitudes, beliefs, and feelings. Therefore, many scientific studies already benefit from easy ways of measuring personality. By analyzing the written text of a person, it is possible to derive Big Five personality traits. One approach to this is to apply the unsupervised learning algorithm Global Vectors Word Embedding (or Representation), abbreviated GloVe, to English Twitter posts. The overall objective of our research is to show that this algorithm can also be applied to German Twitter posts. Therefore, we built a framework for training and applying machine learning models for personality predictions. We tested if a working prediction model for English Twitter users can be adapted for German users. This could reduce efforts for collecting training data. We evaluated our models based on a personality survey with a sample of German users. The method of adapting an existing model does not perform as well as expected but helps prepare the framework for higher volumes of data. In the end, the final model is based on the evaluation data, which results in an acceptable performance. Via a web application (https://www.miping.de) anyone can easily retrieve personality scores for any public German Twitter user. Altogether, it is shown that GloVe is suitable to predict personality based on German language. The published framework and source code allow for independent improvements to and easy application of the trained model. Now, scientific studies and other applications, e.g., chatbots, could easily incorporate personality data.
- Published
- 2021
34. AltibbiVec: A Word Embedding Model for Medical and Health Applications in the Arabic Language
- Author
-
Alaa Alomari, Hossam Faris, Maria Habib, and Mohammad Faris
- Subjects
Word embedding ,General Computer Science ,Computer science ,Context (language use) ,computer.software_genre ,Semantics ,Data modeling ,General Materials Science ,Word2vec ,fastText ,Context model ,Arabic ,business.industry ,pre-trained ,General Engineering ,healthcare ,word embedding ,TK1-9971 ,Embedding ,GloVe ,Artificial intelligence ,Electrical engineering. Electronics. Nuclear engineering ,business ,computer ,Natural language processing ,Word (computer architecture) - Abstract
In recent years, the utilization of natural language processing (NLP) and machine learning (ML) techniques in clinical decision support systems has shown the ability to improve and automate the diagnosis process and reduce potential clinical errors. NLP in the Arabic language is more intricate due to several limitations, such as the lack of datasets and analytical resources compared to other languages like English. However, a clinical decision support system in the Arabic context is of significant importance. A fundamental process in NLP is extracting features from text-based data via text embedding. A word embedding is a representation of words in numeric format that encodes statistical, semantic, or context information. Building a neural word embedding model requires hundreds of thousands of data instances to find hidden patterns of relationships within sentences. Essentially, extracting relevant and informative features promotes the performance of the learning algorithms. The objective of this paper is to propose an Arabic neural word embedding model for the medical and healthcare context (called “AltibbiVec”). Around 1.5 million medical consultations and questions written in different dialects were obtained from the Altibbi telemedicine company and used to train the embedding model. Three different embedding models are developed and compared: Word2Vec, fastText, and GloVe. The trained models were evaluated by different criteria, including word clustering and word similarity, as well as by performing specialty-based question classification. The results show that Word2Vec and fastText capture the semantics of the text better than GloVe. Hence, they are recommended for healthcare NLP-based applications.
- Published
- 2021
35. A word embedding trained on South African news data
- Author
-
Martin Canaan Mafunda, Maria Schuld, Kevin Durrheim, and Sindisiwe Mazibuko
- Subjects
South Africa ,natural language processing (NLP) ,news data ,Word2Vec ,GloVe ,word embedding - Abstract
This article presents results from a study that developed and tested a word embedding trained on a dataset of South African news articles. A word embedding is an algorithm-generated word representation that can be used to analyse the corpus of words that the embedding is trained on. The embedding on which this article is based was generated using the Word2Vec algorithm, which was trained on a dataset of 1.3 million African news articles published between January 2018 and March 2021, containing a vocabulary of approximately 124,000 unique words. The efficacy of this Word2Vec South African news embedding was then tested, and compared to the efficacy provided by the globally used GloVe algorithm. The testing of the local Word2Vec embedding showed that it performed well, with similar efficacy to that provided by GloVe. The South African news word embedding generated by this study is freely available for public use.
- Published
- 2022
36. Comparative study of word embedding methods in topic segmentation.
- Author
-
Naili, Marwa, Chaibi, Anja Habacha, and Ben Ghezala, Henda Hajjami
- Subjects
VOCABULARY ,NATURAL languages ,LANGUAGE & languages ,SEMANTICS ,TOPIC & comment (Grammar) - Abstract
The vector representations of words are very useful in different natural language processing tasks for capturing the semantic meaning of words. In this context, the three best-known methods are LSA, Word2Vec and GloVe. In this paper, these methods are investigated in the field of topic segmentation for both Arabic and English. Moreover, Word2Vec is studied in depth by using different models and approximation algorithms. As a result, we found that LSA, Word2Vec and GloVe depend on the language used. However, Word2Vec presents the best word vector representation, though it depends on the choice of model. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
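Topic segmentation with word vectors, as studied above, typically scores adjacent text blocks by the cosine similarity of their averaged word vectors and hypothesizes a boundary where similarity drops. A minimal sketch with toy two-dimensional embeddings (the vectors and threshold logic are illustrative, not the paper's):

```python
import numpy as np

def sentence_vector(words, embeddings, dim=2):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy embeddings: two loose "topics" along different axes.
emb = {"stock": np.array([1.0, 0.1]), "market": np.array([0.9, 0.2]),
       "rain": np.array([0.1, 1.0]), "storm": np.array([0.2, 0.9])}

s1 = sentence_vector(["stock", "market"], emb)
s2 = sentence_vector(["market", "stock"], emb)
s3 = sentence_vector(["rain", "storm"], emb)

# High similarity within a topic, low across it -> candidate segment boundary.
print(cosine(s1, s2) > cosine(s2, s3))  # True
```

Swapping in LSA, Word2Vec, or GloVe vectors changes only the `embeddings` lookup, which is what makes the three methods directly comparable on this task.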
37. Modern Approaches to Detect and Classify Comment Toxicity Using Neural Networks
- Author
-
Sergey V. Morzhov
- Subjects
Word embedding ,Computer science ,02 engineering and technology ,Information technology ,Convolutional neural network ,lstm ,Task (project management) ,World Wide Web ,convolutional neural networks ,0202 electrical engineering, electronic engineering, information engineering ,recurrent neural networks ,natural language processing ,cnn ,Artificial neural network ,business.industry ,gru ,Deep learning ,toxicity ,deep learning ,020206 networking & telecommunications ,fasttext ,nlp ,word embedding ,T58.5-58.64 ,Popularity ,Recurrent neural network ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,glove ,Word (computer architecture) - Abstract
The growth in popularity of online platforms which allow users to communicate with each other, share opinions about various events, and leave comments has boosted the development of natural language processing algorithms. The tens of millions of messages per day published by users of a particular social network need to be analyzed in real time for moderation in order to prevent the spread of various illegal or offensive information, threats and other types of toxic comments. Such a large amount of information can be processed quickly enough only automatically; that is why there is a need to find a way to teach computers to “understand” a text written by humans. It is a non-trivial task even if the word “understand” here means only “to classify”. The rapid evolution of machine learning technologies has led to the ubiquitous implementation of new algorithms. A lot of tasks, which for many years were considered almost impossible to solve, are now quite successfully solved using deep learning technologies. This article considers algorithms built using deep learning technologies and neural networks which can successfully solve the problem of detection and classification of toxic comments. In addition, the article presents the results of the developed algorithms, as well as the results of an ensemble of all the considered algorithms, on a large training set collected and tagged by Google and Jigsaw.
- Published
- 2020
38. Improved GloVe Word Embedding Using Linear Weighting Scheme for Word Similarity Tasks
- Author
-
Lu, Qinglan
- Subjects
ComputingMethodologies_PATTERNRECOGNITION ,Computer Sciences ,Word embedding ,GloVe ,Word2Vec ,Word co-occurrence - Abstract
One of the trends in Natural Language Processing (NLP) is the use of word embedding. Its aim is to build a low-dimensional vector representation of words from text corpora. Global Vectors for Word Representation (GloVe) and Skip-Gram with Negative Sampling (SGNS) are two representative word embedding methods. Existing papers reach different conclusions on the performance of these two methods. This thesis focuses on GloVe and studies its commonalities and differences with SGNS. Word co-occurrence is the cornerstone of all word embedding algorithms. One difference between GloVe and SGNS is the definition of co-occurrence. The weight of co-occurring words tapers off with the distance between them, and GloVe and SGNS adopt different weighting schemes. In SGNS, the weight decreases linearly with distance. In GloVe, the weight decreases harmonically, giving relatively less weight to words in the middle of the window. We propose GloVe-L (GloVe Linear) by changing the weighting scheme to linear weighting. We find that GloVe-L outperforms GloVe consistently in word similarity tasks. The conclusion is supported by extensive experiments on 8 word evaluation benchmarks with a Wikipedia training corpus. The thesis also explores the impact of hyper-parameters on the result, including the window size and xmax in GloVe. Another interesting observation is that GloVe-L does not work well for word analogy tasks.
- Published
- 2021
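The difference between the two weighting schemes compared in the thesis above can be made concrete. Writing the harmonic weight as 1/d and the linear weight as (L - d + 1)/L for window size L (the forms commonly attributed to GloVe and SGNS, respectively), the schemes agree at the nearest and farthest positions but the harmonic one down-weights everything in between:

```python
def harmonic_weight(d):
    """GloVe-style co-occurrence weight: decays as 1/distance."""
    return 1.0 / d

def linear_weight(d, window):
    """SGNS-style weight: decays linearly with distance."""
    return (window - d + 1) / window

window = 5
for d in range(1, window + 1):
    print(d, round(harmonic_weight(d), 2), round(linear_weight(d, window), 2))
# At d=1 and d=window the two schemes agree; in between, the harmonic
# scheme gives less weight. Switching GloVe to the linear scheme is
# exactly the change the thesis calls GloVe-L.
```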
39. Improved GloVe Word Embedding Using Linear Weighting Scheme for Word Similarity Tasks
- Author
-
Lu, Qinglan
- Subjects
Computer Sciences ,Word embedding ,GloVe ,Word2Vec ,Word co-occurrence - Abstract
One of the trends in Natural Language Processing (NLP) is the use of word embedding. Its aim is to build a low-dimensional vector representation of words from text corpora. Global Vectors for Word Representation (GloVe) and Skip-Gram with Negative Sampling (SGNS) are two representative word embedding methods. Existing papers reach different conclusions on the performance of these two methods. This thesis focuses on GloVe and studies its commonalities and differences with SGNS. Word co-occurrence is the cornerstone of all word embedding algorithms. One difference between GloVe and SGNS is the definition of co-occurrence. The weight of co-occurring words tapers off with the distance between them, and GloVe and SGNS adopt different weighting schemes. In SGNS, the weight decreases linearly with distance. In GloVe, the weight decreases harmonically, giving relatively less weight to words in the middle of the window. We propose GloVe-L (GloVe Linear) by changing the weighting scheme to linear weighting. We find that GloVe-L outperforms GloVe consistently in word similarity tasks. The conclusion is supported by extensive experiments on 8 word evaluation benchmarks with a Wikipedia training corpus. The thesis also explores the impact of hyper-parameters on the result, including the window size and xmax in GloVe. Another interesting observation is that GloVe-L does not work well for word analogy tasks.
- Published
- 2021
40. MBiLSTMGloVe: Embedding GloVe knowledge into the corpus using multi-layer BiLSTM deep learning model for social media sentiment analysis.
- Author
-
Pimpalkar, Amit and Raj R, Jeberson Retna
- Subjects
- *
DEEP learning , *SENTIMENT analysis , *SOCIAL media , *ARTIFICIAL intelligence , *NATURAL language processing , *MACHINE learning - Abstract
The fast improvement and transformation of online media and review sites with critical reviews of items, movies, goods, etc. have created a tremendous assortment of resources for users around the globe. This information might contain a great deal of data, including product reviews, anticipated market changes, and the polarity of film assessments. Sentiment Analysis (SA) technology produces linguistic comprehension from the viewpoint of machines through the handling and investigation of immense amounts of information, and is a hot research direction in the field of Artificial Intelligence (AI). To address content extraction from short texts, we need to investigate the deeper semantics of words by exploiting Machine Learning (ML) and Deep Learning (DL) strategies; in this way, AI, ML, and DL techniques can be applied to these challenges. Our proposed model, based on the DL method and the GloVe word embedding approach, learns features using a CNN layer and then feeds them into a Multi-Layered Bi-Directional Long Short-Term Memory (MBiLSTM) network to capture long-range embedded context. The main aim of this experiment is to give an adequate answer for examining feelings and user reviews in positive and negative classifications. Our runs show that a test accuracy of 92.05% and a validation accuracy of 93.55% can be attained with the given model. The framework is assessed using IMDB datasets. The proposed model outperforms existing baseline models, which shows that going beyond the content of a tweet is valuable in sentiment classification since it gives the classifier a deeper understanding of the task. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
41. Automatic Retail Product Identification System for Cashierless Stores
- Author
-
Zhong, Shiting
- Subjects
Word Embedding ,Teknik och teknologier ,Neural Network ,Retail Product Identification ,Engineering and Technology ,Word2Vec ,GloVe ,Text Classification - Abstract
The introduction of artificial intelligence techniques in the retail market is revolutionizing the shopping experience. It allows shoppers to walk into a store, grab what they want and simply walk out without scanning barcodes or having to stand in long queues. That is what we call cashierless stores. This project aims to provide an efficient solution for automatic retail product identification. The solution presents an artifact one can use to build an end-to-end smart system for cashierless stores. Hence, a solution based on text classification is proposed to recognize and identify the products. For that, deep learning techniques such as RNNs and LSTMs are used to build the classifier. The performance of this classifier is evaluated using various metrics, and it shows its efficiency with an accuracy exceeding 86%.
- Published
- 2021
42. Avoiding Unintended Bias in Toxicity Classification with Neural Networks
- Author
-
Sergey V. Morzhov
- Subjects
Computer science ,lstm ,bert ,lcsh:Telecommunication ,Task (project management) ,lcsh:TK5101-6720 ,recurrent neural networks ,natural language processing ,Artificial neural network ,Social network ,gru ,business.industry ,Deep learning ,Offensive ,toxicity ,deep learning ,fasttext ,nlp ,rnn ,word embedding ,Data science ,Popularity ,Jigsaw ,Artificial intelligence ,attention mechanism ,glove ,business ,Word (computer architecture) - Abstract
The growing popularity of online platforms that allow users to communicate with each other, exchange opinions about various events and leave comments has contributed to the development of natural language processing algorithms. Tens of millions of messages per day published by users of a certain social network must be analyzed in real time for moderation to prevent the spread of various illegal or offensive information, threats and other types of toxic comments. Such a large amount of information can be processed quickly enough only automatically. That is why it is necessary to find a way to teach a computer to “understand” a text written by a human. It is a non-trivial task, even if the word “understand” here means only to detect or classify. The rapid development of machine learning technologies has led to the widespread adoption of new algorithms. Many tasks that for years were considered almost impossible to solve using computers can now be successfully solved with deep learning technologies. In this article, the author presents modern approaches to solving the problem of toxic comment detection using deep learning technologies and neural networks. The author introduces two state-of-the-art neural network architectures and also demonstrates how to use a contextual language representation model to detect toxicity. Furthermore, this article presents the results of the developed algorithms, as well as the results of their ensemble, tested on a large training set gathered and marked up by Google and Jigsaw.
- Published
- 2020
43. Natural language understanding in argumentative dialogue systems
- Author
-
Shigehalli, Pavan Rajashekhar, Minker, Wolfgang, Wesner, Stefan, and Rach, Niklas
- Subjects
Artificial intelligence ,Digitale Sprachverarbeitung ,Natürliche Sprache ,WordNet ,Word embedding ,Automatic speech recognition ,Semantic similarity measures ,Computational linguistics ,Natural language understanding ,Semantics ,Data processing ,Dialogue game for argumentation ,Natural language processing (Computer science) ,Künstliche Intelligenz ,Utterance mapping ,Automatische Sprachanalyse ,Word2Vec ,GloVe ,ddc:004 ,DDC 004 / Data processing & computer science ,Intent classification ,RASA ,BERT - Abstract
This thesis presents various techniques to implement natural language understanding in argumentative dialogue systems. We examine various models, including ones that exploit the linguistic properties of the English language and ones that rely on vector representations of words, with different similarity measurement techniques. In order to structure the user responses, we formalize the user interaction as an argument game, and we explore chatbot designs for understanding the user intents in the game. The models are tested on data obtained from Wikipedia pages. We also collect real user responses and evaluate the model. The output indicates that the training data, model architecture and similarity measurement techniques play a significant role in the performance. In addition, the models based on linguistic properties perform better than the ones based on vector representations.
- Published
- 2020
- Full Text
- View/download PDF
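The vector-representation approach with a similarity measure that the thesis above evaluates can be sketched as a common baseline: average the word vectors of an utterance and compare candidates by cosine similarity. The embedding values and words below are invented for illustration, not taken from the thesis:

```python
import math

# Toy pretrained embeddings (hypothetical 3-D values for illustration only).
EMBEDDINGS = {
    "ban":      [0.9, 0.1, 0.0],
    "prohibit": [0.8, 0.2, 0.1],
    "allow":    [-0.7, 0.3, 0.2],
}

def sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens (a common utterance baseline)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

With vectors like these, an utterance containing "ban" scores closer to "prohibit" than to "allow", which is the behavior utterance-mapping by embedding similarity relies on.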
44. The Evaluation of Word Embedding Models and Deep Learning Algorithms for Turkish Text Classification
- Author
-
Zeynep Hilal Kilimci, Selim Akyokus, Doğuş Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü, and Kilimci, Zeynep Hilal
- Subjects
Word embedding ,Text Categorization ,Fasttext ,Computer science ,business.industry ,Deep learning ,Convolutional Neural Networks ,Convolutional neural network ,Statistical classification ,Recurrent neural network ,Categorization ,Glove ,Word2Vec ,Word2vec ,Artificial intelligence ,business ,Representation (mathematics) ,Long Short Term Memory ,Algorithm ,Recurrent Neural Networks - Abstract
Kilimci, Zeynep Hilal (Dogus Author) -- Conference full title: 4th International Conference on Computer Science and Engineering, UBMK 2019; Samsun; Turkey; 11 September 2019 through 15 September 2019. The use of word embedding models and deep learning algorithms is currently the most common and popular approach to enhancing the overall performance of a text classification/categorization system. Word embedding models map words with similar meanings to similar vector representations learned from a corpus. Deep learning algorithms produce more successful results in many application areas when compared to conventional machine learning algorithms. In this study, three different word embedding models, Word2Vec, Glove, and FastText, are employed for word representation. Instead of conventional classification algorithms, three different deep learning architectures, Recurrent Neural Networks (RNN), Long Short Term Memory Networks (LSTM) and Convolutional Neural Networks (CNN), are used for the classification task in experiments on collections of different Turkish documents. Experimental results show that using deep learning algorithms together with word embedding models advances the performance of text classification systems.
- Published
- 2019
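To feed pretrained vectors such as GloVe into a deep model, the plain-text distribution format (one word per line followed by its float components, space-separated) is typically parsed into a lookup table first. A minimal sketch; the Turkish words and the vector values below are invented:

```python
import io

def load_glove(file_obj):
    """Parse GloVe's plain-text format into a {word: vector} dict."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# A two-line sample in the same format (values are made up).
sample = io.StringIO("hata 0.1 0.2 0.3\nkusur 0.1 0.25 0.28\n")
vecs = load_glove(sample)
```

The resulting dict is then used to build the embedding matrix that initializes the first layer of an RNN, LSTM, or CNN classifier.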
45. Semantic Unsupervised Automatic Keyphrases Extraction by Integrating Word Embedding with Clustering Methods
- Author
-
Maria Teresa Artese and Isabella Gagliardi
- Subjects
word embedding models ,Word embedding ,Computer Networks and Communications ,Computer science ,Neuroscience (miscellaneous) ,Keyword extraction ,Context (language use) ,02 engineering and technology ,Semantics ,computer.software_genre ,lcsh:Technology ,01 natural sciences ,010305 fluids & plasmas ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Feature (machine learning) ,unsupervised automatic keyword extraction ,Word2vec ,information retrieval ,lcsh:Science ,clustering algorithms ,Cluster analysis ,evaluation ,lcsh:T ,business.industry ,word2vec ,Computer Science Applications ,Human-Computer Interaction ,Metadata ,w2v ,Italian datasets ,lcsh:Q ,GloVe ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
Increasingly, the web produces massive volumes of texts, alone or associated with images, videos and photographs, together with metadata indispensable for finding and retrieving them. Keywords/keyphrases that characterize the semantic content of documents should be extracted, automatically or manually, and/or associated with them. The paper presents a novel method for the automatic unsupervised extraction of keywords/keyphrases from texts, expressed both in English and in Italian. The main feature of this approach is the integration of two methods that have given interesting results: word embedding models such as Word2Vec or GloVe, able to capture the semantics of words and their context, and clustering algorithms, able to identify the essence of the terms and choose the most significant one(s) to represent the contents of a text. The paper presents the datasets used, together with the method implemented and the results obtained. These results are discussed and compared with those obtained in previous experiments using TextRank, Rapid Automatic Keyword Extraction (RAKE), and TF-IDF.
- Published
- 2020
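The embedding-plus-clustering idea above can be sketched with a tiny k-means over candidate-term vectors, keeping the term nearest each centroid as a keyword. This is only a toy illustration under assumed inputs: the paper's actual clustering algorithms and parameters may differ, and the 2-D "embeddings" below are invented:

```python
import random

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means with fixed iteration count."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep old centroid if a cluster empties
                dim = len(members[0])
                centroids[c] = [sum(m[i] for m in members) / len(members)
                                for i in range(dim)]
    return centroids

# Toy 2-D "embeddings" for candidate terms (invented for illustration).
terms = {"museum": [0.9, 0.1], "gallery": [0.85, 0.15],
         "archive": [0.8, 0.2], "visitor": [0.1, 0.9],
         "tourist": [0.15, 0.85]}
cents = kmeans(list(terms.values()), k=2)
# For each cluster, keep the term closest to the centroid as the keyword.
keywords = {min(terms, key=lambda t: dist2(terms[t], c)) for c in cents}
```

With two well-separated groups of terms, one keyword is picked from each group, which is the sense in which clustering "chooses the most significant" representative.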
46. Classification of Large-Scale Biological Annotations Using Word Embeddings Derived from Corpora of Biomedical Research Literature
- Author
-
Baćac, Adriano and Šikić, Mile
- Subjects
TECHNICAL SCIENCES. Computing ,TEHNIČKE ZNANOSTI. Računarstvo ,Word2vec ,GloVe ,phenotype classification ,klasifikacija fenotipova ,specifičnost korpusa ,vektorski prikaz riječi ,LSTM ,word embedding ,RNN ,corpus specificity - Abstract
Custom Word2vec and GloVe embeddings for scientific literature in the biomedical domain were trained, as well as three classification methods for discriminating phenotype traits: two based on aggregating word embeddings and one on recurrent neural networks. The word embeddings were trained on a large corpus of scientific articles and on its more subject-specific subsets. Classification performance was tested on 6 document sources. It was shown that Word2vec achieves better performance when trained on a subject-specific subset comprising 4.9% of the articles than when trained on the entire corpus. Using recurrent neural networks led to overfitting, possibly because the documents were too long or the training set too small.
Although the proposed models did not outperform a support vector machine using a bag-of-words representation, it was shown that using the aggregation methods alongside the baseline model increases the number of correctly classified minority-class instances for some phenotype traits by around 10%.
- Published
- 2017
47. Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages.
- Author
-
Yuan, Yang, Li, Xiao, and Yang, Ya-Ting
- Subjects
- *
EMBEDDINGS (Mathematics) , *PUNCTUATION , *CORPORA , *VOCABULARY , *LANGUAGE & languages - Abstract
To overcome data sparseness in word embeddings trained on low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with the intermediate word vectors generated from the small-scale bilingual parallel corpus to train word embeddings. Experimental results show that, compared with several widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding performance for low-resource languages. Trained on the restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
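The distance-attenuated co-occurrence matrix described above can be sketched as follows. The 1/d base weight mirrors GloVe's decreasing weighting with word distance; halving the weight per intervening punctuation mark is a hypothetical attenuation function, since the paper's exact formula is not reproduced here:

```python
from collections import defaultdict

PUNCT = {",", ".", ";", "!", "?"}

def cooccurrence(tokens, window=4):
    """Weighted word-pair co-occurrence counts: base weight 1/d, as in
    GloVe, then halved for every punctuation mark between the two words
    (a hypothetical attenuation; the paper's exact function may differ)."""
    counts = defaultdict(float)
    words = [(i, t) for i, t in enumerate(tokens) if t not in PUNCT]
    for a in range(len(words)):
        for b in range(a + 1, len(words)):
            if b - a > window:
                break
            i, w1 = words[a]
            j, w2 = words[b]
            d = b - a  # distance in words, punctuation excluded
            marks = sum(1 for t in tokens[i + 1 : j] if t in PUNCT)
            counts[(w1, w2)] += (1.0 / d) * (0.5 ** marks)
    return counts

counts = cooccurrence(["low", "resource", ",", "language"])
```

Here the pair across the comma receives half the weight of an equally distant pair without punctuation, capturing the intuition that punctuation weakens word-pair association.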
48. Word Embedding for Rhetorical Sentence Categorization on Scientific Articles
- Author
-
Masayu Leylia Khodra, Ghoziyah Haitan Rachman, and Dwi H. Widyantoro
- Subjects
Information Systems and Management ,Word embedding ,General Computer Science ,Computer science ,TK5101-6720 ,Information technology ,computer.software_genre ,Naive Bayes classifier ,Classifier (linguistics) ,Rhetorical question ,Word2vec ,Electrical and Electronic Engineering ,scientific article ,060201 languages & linguistics ,business.industry ,Principle of maximum entropy ,06 humanities and the arts ,word embedding ,T58.5-58.64 ,rhetorical sentence categorization ,Word2Vec ,ComputingMethodologies_PATTERNRECOGNITION ,Categorization ,0602 languages and literature ,Telecommunication ,GloVe ,Artificial intelligence ,business ,computer ,Sentence ,Natural language processing - Abstract
A common task in summarizing scientific articles is employing the rhetorical structure of sentences. Determining rhetorical sentences is itself a text categorization problem. To achieve good performance, prior work in text categorization has employed word embeddings. This paper presents rhetorical sentence categorization of scientific articles using word embeddings to capture semantically similar words, comparing Word2Vec and GloVe. First, two experiments are evaluated using five classifiers: Naïve Bayes, Linear SVM, IBK, J48, and Maximum Entropy. Then, the best classifier from the first two experiments is employed. This research showed that Word2Vec CBOW performed better than Skip-Gram and GloVe. The best experimental result was obtained with Word2Vec CBOW on 20,155 resource papers from ACL-ARC, features from Teufel, and the previous-label feature. In this experiment, Linear SVM produced the highest F-measure performance at 43.44%.
- Published
- 2018