Descriptor: "Bag-of-words model" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bag-of-words model"' showing total 2,466 results

201. A Comprehensive Look at Coding Techniques on Riemannian Manifolds

Author: Mehrtash Harandi, Fatih Porikli, and Masoud Faraki
Subjects: Theoretical computer science, Computer Networks and Communications, Computer science, Euclidean space, 010103 numerical & computational mathematics, 02 engineering and technology, Riemannian geometry, 01 natural sciences, Facial recognition system, Manifold, Computer Science Applications, symbols.namesake, Artificial Intelligence, Bag-of-words model, Euclidean geometry, 0202 electrical engineering, electronic engineering, information engineering, symbols, 020201 artificial intelligence & image processing, 0101 mathematics, Neural coding, Software, Coding (social sciences)
Abstract: Core to many learning pipelines is visual recognition such as image and video classification. In such applications, having a compact yet rich and informative representation plays a pivotal role. An underlying assumption in traditional coding schemes [e.g., sparse coding (SC)] is that the data geometrically comply with the Euclidean space. In other words, the data are presented to the algorithm in vector form and Euclidean axioms are fulfilled. This is of course restrictive in machine learning, computer vision, and signal processing, as shown by a large number of recent studies. This paper takes a further step and provides a comprehensive mathematical framework to perform coding in curved and non-Euclidean spaces, i.e., Riemannian manifolds. To this end, we start by the simplest form of coding, namely, bag of words. Then, inspired by the success of vector of locally aggregated descriptors in addressing computer vision problems, we will introduce its Riemannian extensions. Finally, we study Riemannian form of SC, locality-constrained linear coding, and collaborative coding. Through rigorous tests, we demonstrate the superior performance of our Riemannian coding schemes against the state-of-the-art methods on several visual classification tasks, including head pose classification, video-based face recognition, and dynamic scene recognition.
Published: 2018

202. Predicting of anaphylaxis in big data EMR by exploring machine learning approaches

Author: Cristóbal Colón-Ruiz, Miguel A. Tejedor-Alonso, Isabel Segura-Bedmar, and Mar Moro-Moro
Subjects: Big Data, 0301 basic medicine, Word embedding, Computer science, Decision Making, Big data, Health Informatics, 02 engineering and technology, Machine learning, computer.software_genre, Convolutional neural network, Machine Learning, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Cluster Analysis, Electronic Health Records, Humans, Cluster analysis, Anaphylaxis, Language, business.industry, Computer Science Applications, Random forest, Identification (information), 030104 developmental biology, Bag-of-words model, Multilayer perceptron, Linear Models, Regression Analysis, 020201 artificial intelligence & image processing, Neural Networks, Computer, Artificial intelligence, business, computer, Algorithms, Medical Informatics
Abstract: Anaphylaxis is a life-threatening allergic reaction that occurs suddenly after contact with an allergen. Epidemiological studies about anaphylaxis are very important in planning and evaluating new strategies that prevent this reaction, but also in providing a guide to the treatment of patients who have just suffered an anaphylactic reaction. Electronic Medical Records (EMR) are one of the most effective and richest sources for the epidemiology of anaphylaxis, because they provide a low-cost way of accessing rich longitudinal data on large populations. However, a negative aspect is that researchers have to manually review a huge amount of information, which is a very costly and highly time-consuming task. Therefore, our goal is to explore different machine learning techniques to process Big Data EMR, lessening the effort needed to perform epidemiological studies about anaphylaxis. In particular, we aim to study the incidence of anaphylaxis by the automatic classification of EMR. To do this, we employ the most widely used and efficient classifiers in text classification and compare different document representations, which range from well-known methods such as Bag Of Words (BoW) to more recent ones based on word embedding models, such as a simple average of word embeddings or a bag of centroids of word embeddings. Because the identification of anaphylaxis cases in EMR is a class-imbalanced problem (less than 1% describe anaphylaxis cases), we employ a novel undersampling technique based on clustering to balance our dataset. In addition to classical machine learning algorithms, we also use a Convolutional Neural Network (CNN) to classify our dataset. In general, experiments show that most classifiers and representations are effective (F1 above 90%). Logistic Regression, Linear SVM, Multilayer Perceptron and Random Forest achieve an F1 around 95%; however, linear methods have considerably lower training times. CNN provides slightly better performance (F1 = 95.6%).
Published: 2018

203. Semantic text classification: A survey of past and recent advances

Author: Murat Can Ganiz and Berna Altinel
Subjects: business.industry, Computer science, Deep learning, Context (language use), 02 engineering and technology, Library and Information Sciences, Management Science and Operations Research, computer.software_genre, Semantics, Computer Science Applications, Statistical classification, Bag-of-words model, 020204 information systems, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, Domain knowledge, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing, Sentence, Information Systems, Meaning (linguistics)
Abstract: Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. Generally speaking, it is one of the most important methods to organize and make use of the gigantic amounts of information that exist in unstructured textual format. Text classification is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words where the words (in other words, terms) are cut from their finer context, i.e., their location in a sentence or in a document. Only the broader context of the document is used, with some type of term frequency information in the vector space. Consequently, the semantics of words that can be inferred from the finer context of their location in a sentence and their relations with neighboring words are usually ignored. However, the meaning of words and the semantic connections between words, documents and even classes are obviously important, since methods that capture semantics generally reach better classification performance. Several surveys have been published to analyze diverse approaches for the traditional text classification methods. Most of these surveys cover the application of different semantic term relatedness methods in text classification up to a certain degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. In order to fill this gap, we undertake a comprehensive discussion of semantic text classification vs. traditional text classification. This survey explores the past and recent advancements in semantic text classification and attempts to organize existing approaches under five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning-based approaches, word/character sequence enhanced approaches and linguistic enriched approaches. Furthermore, this survey highlights the advantages of semantic text classification algorithms over the traditional text classification algorithms.
Published: 2018
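
The loss of "finer context" that this abstract describes can be illustrated with a minimal sketch. The `bag_of_words` helper below is a hypothetical name introduced only for this illustration, and whitespace tokenization is an assumption; the survey does not prescribe either.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two sentences with opposite meanings share exactly the same tokens.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Word order is discarded, so the two representations are identical.
print(a == b)  # True
```

Because both sentences map to the same counts, a classifier that sees only this representation cannot tell them apart, which is the gap that semantic approaches aim to close.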

204. DETECTION OF SUSPICIOUS TERRORIST EMAILS USING TEXT CLASSIFICATION: A REVIEW

Author: Ram Gopal Raj, Ghulam Mujtaba, Liyana Shuib, and Roshan Gunalan
Subjects: Naive Bayes classifier, Statistical classification, Information retrieval, General Computer Science, Bag-of-words model, Computer science, Feature (computer vision), Feature extraction, Decision tree, Context (language use), Feature selection
Abstract: This paper provides a comprehensive review and analysis of the detection of suspicious terrorist electronic mails (e-mails) using various phases and methods of text classification. We explored, analyzed, and compared different datasets, features, feature extraction techniques, feature representation techniques, feature selection schemes, text classification techniques, and performance measurement metrics used in the detection of suspicious terrorist e-mails. 30 articles were retrieved from 6 well-known academic databases after rigorous selection. From the study, we found that researchers often generate their own e-mail datasets since no public dataset is available in the research area of detecting suspicious terrorist e-mails. In most of the studies, researchers used content and context-based features to detect terrorist e-mails. Our findings also show that the most commonly used feature extraction techniques are the bag of words and n-gram, the most typically applied feature representation schemes are binary representation and term frequency, the most usually adopted feature selection method is information gain, the most common and most accurate text classification algorithms are naive Bayes, decision trees, and support vector machines, and the widely employed performance measurement metrics are accuracy, precision, and recall. Open research challenges and research issues that involve significant research efforts are also summarized in this review for future researchers in the area of suspicious terrorist e-mail detection using text classification techniques, where the critical analysis presented in this paper also provides valuable insights to guide these researchers. Finally, the indicated issues and challenges presented in this paper can be used as future research directions in this area.
Published: 2018

205. Text sentiment analysis based on CBOW model and deep learning in big data environment

Author: Bing Liu
Subjects: Training set, General Computer Science, business.industry, Computer science, Deep learning, 010102 general mathematics, Sentiment analysis, 02 engineering and technology, Machine learning, computer.software_genre, 01 natural sciences, Convolutional neural network, Bag-of-words model, Softmax function, 0202 electrical engineering, electronic engineering, information engineering, Feedforward neural network, 020201 artificial intelligence & image processing, Artificial intelligence, Language model, 0101 mathematics, business, computer, Dropout (neural networks)
Abstract: To address the need for accurate and rapid sentiment analysis of comment texts in the network big data environment, a text sentiment analysis method combining the Continuous Bag of Words (CBOW) language model and deep learning is proposed. First, a vector representation of text is constructed by a CBOW language model based on feedforward neural networks. Then, the Convolutional Neural Network (CNN) is trained through the labeled training set to capture the semantic features of the text. Finally, the Dropout strategy is introduced in the Softmax classifier of traditional CNN, which can effectively prevent the model from over-fitting and has better classification ability. Experimental results on COAE2014 and IMDB datasets show that this method can accurately determine the emotional category of the text and is robust; the accuracy on the two datasets reached 90.5% and 87.2%, respectively.
Published: 2018

206. Signal discrimination using category-preserving bag-of-words model for condition monitoring

Author: Yu-Hsiang Hsiao
Subjects: 0209 industrial biotechnology, Computer science, business.industry, Euclidean space, Feature extraction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Codebook, Intelligent decision support system, Condition monitoring, Pattern recognition, 02 engineering and technology, ComputingMethodologies_PATTERNRECOGNITION, 020901 industrial engineering & automation, Artificial Intelligence, Bag-of-words model, Metric (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Vector space model, Waveform, 020201 artificial intelligence & image processing, Artificial intelligence, Cluster analysis, business, Software
Abstract: Signal discrimination contributes to the development of machine–machine and human–machine interactive intelligent systems. In this study, a novel framework for signal discrimination was proposed. The proposed framework comprised three phases. In Phase I, a waveform shape-based feature extraction method was used for parameterizing signals. In Phase II, a novel category-preserving bag-of-words (CPBoW) model was proposed. In Phase III, signals were discriminated using a vector space model with term frequency–inverse document frequency. The bag-of-words model generally demonstrated promising performance for signal discrimination. However, the inherent connections among signals of homogeneous categories were considerably lost during signal framing and codebook generation processes. This was because the codebook was simply generated by clustering signal frame samples in the Euclidean space. In the proposed CPBoW model, Taguchi’s quality engineering method was used to develop a category-preserving distance metric for executing a clustering process to generate category-preserving codewords. This preserved category information in the codebook and consequently increased the effectiveness of the discrimination process. The proposed framework was verified through three condition monitoring applications that involved a musical instrument recognition problem, motor bearing fault recognition problem, and heart disease recognition problem. The results indicated the superior performance and effectiveness of the proposed framework.
Published: 2018

207. Visual analysis of asphalt pavement for detection and localization of potholes

Author: Fawad Hussain, Muhammad Haroon Yousaf, Fiza Murtaza, and Kanza Azhar
Subjects: Ground truth, business.industry, Computer science, 0211 other engineering and technologies, Scale-invariant feature transform, Pattern recognition, 02 engineering and technology, Support vector machine, Asphalt pavement, Artificial Intelligence, Bag-of-words model, Histogram, 021105 building & construction, 0202 electrical engineering, electronic engineering, information engineering, Pothole, 020201 artificial intelligence & image processing, Artificial intelligence, business, Precision and recall, Information Systems
Abstract: Identifying and restoring distresses in asphalt pavement have key significance in durability and long life of roads and highways. A vast number of accidents occur on the roads and highways due to pavement distresses. This paper aims to detect and localize one of the critical roadway distresses, the potholes, using computer vision. We have processed images of asphalt pavement for experimentation containing the pothole and non-pothole regions. We proposed a top-down scheme for the detection and localization of potholes in the pavement images. First, we classified pothole/non-pothole images using a bag of words (BoW) approach. We computed the well-known scale-invariant feature transform (SIFT) features to establish the visual vocabulary of words to represent the pavement surface. A support vector machine (SVM) is employed for the training and testing of histograms of words of pavement images. Secondly, we proposed a graph cut segmentation scheme to localize the potholes in the labelled pothole images. This paper presents both subjective and objective evaluation of pothole localization results against the ground truth. We evaluated the proposed scheme on a pavement surface dataset containing wide-ranging pavement images in different scenarios. Experimentation results show that we achieved an accuracy of 95.7% for the identification of pothole images with significant precision and recall. Subjective evaluation of pothole localization results in high recall with relatively good accuracy. However, the objective assessment shows 91.4% accuracy for localization of potholes.
Published: 2018
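
The "histograms of words" step in this abstract can be sketched as follows, under stated assumptions. The `quantize` function, the toy 2-D descriptors, and the fixed three-entry codebook are all hypothetical; real SIFT descriptors are 128-dimensional, and the abstract does not say how the visual vocabulary was built.

```python
from math import dist

def quantize(descriptors, codebook):
    """Count, for each codeword, how many descriptors are nearest to it."""
    hist = [0] * len(codebook)
    for d in descriptors:
        # Index of the codeword with the smallest Euclidean distance to d.
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        hist[nearest] += 1
    return hist

# Toy visual vocabulary (assumed) and local descriptors from one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (4.8, 5.3), (5.1, 4.9)]

print(quantize(descriptors, codebook))  # [1, 1, 2]
```

The resulting fixed-length histogram is the kind of per-image feature vector that a classifier such as an SVM can then be trained on.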

208. Word Embedding Bootstrapped Deep Active Learning Method to Information Extraction on Chinese Electronic Medical Record

Author: Junyi Yuan, Xingxing Cen, Qunsheng Ma, and Xumin Hou
Subjects: Conditional random field, Multidisciplinary, Word embedding, Computer science, business.industry, Active learning (machine learning), Electronic medical record, computer.software_genre, Information extraction, Named-entity recognition, Bag-of-words model, Feature (machine learning), Artificial intelligence, business, computer, Natural language processing
Abstract: Electronic medical record (EMR) containing rich biomedical information has a great potential in disease diagnosis and biomedical research. However, the EMR information is usually in the form of unstructured text, which increases the use cost and hinders its applications. In this work, an effective named entity recognition (NER) method is presented for information extraction on Chinese EMR, which is achieved by word embedding bootstrapped deep active learning to promote the acquisition of medical information from Chinese EMR and to release its value. In this work, deep active learning of bi-directional long short-term memory followed by conditional random field (Bi-LSTM+CRF) is used to capture the characteristics of different information from a labeled corpus, and the word embedding models of continuous bag of words and skip-gram are combined in the above model to capture the text features of Chinese EMR from an unlabeled corpus. To evaluate the performance of the above method, the tasks of NER on Chinese EMR with “medical history” content were used. Experimental results show that the word embedding bootstrapped deep active learning method using an unlabeled medical corpus can achieve better performance compared with other models.
Published: 2021

209. Intelligent Detection of False Information in Arabic Tweets Utilizing Hybrid Harris Hawks Based Feature Selection and Machine Learning Models

Author: Hamouda Chantar, Hamza Turabieh, Mahmoud Saheb, and Thaer Thaher
Subjects: false information, natural language processing, machine learning, feature selection, meta-heuristics, Twitter, Physics and Astronomy (miscellaneous), Computer science, General Mathematics, Feature extraction, Feature selection, 02 engineering and technology, Machine learning, computer.software_genre, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Social media, Metaheuristic, business.industry, lcsh:Mathematics, Rank (computer programming), lcsh:QA1-939, Term (time), Chemistry (miscellaneous), Bag-of-words model, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Curse of dimensionality
Abstract: Fake or false information on social media platforms is a significant challenge that leads to deliberately misleading users due to the inclusion of rumors, propaganda, or deceptive information about a person, organization, or service. Twitter is one of the most widely used social media platforms, especially in the Arab region, where the number of users is steadily increasing, accompanied by an increase in the rate of fake news. This drew the attention of researchers to provide a safe online environment free of misleading information. This paper aims to propose a smart classification model for the early detection of fake news in Arabic tweets utilizing Natural Language Processing (NLP) techniques, Machine Learning (ML) models, and Harris Hawks Optimizer (HHO) as a wrapper-based feature selection approach. Arabic Twitter corpus composed of 1862 previously annotated tweets was utilized by this research to assess the efficiency of the proposed model. The Bag of Words (BoW) model is utilized using different term-weighting schemes for feature extraction. Eight well-known learning algorithms are investigated with varying combinations of features, including user-profile, content-based, and words-features. Reported results showed that the Logistic Regression (LR) with Term Frequency-Inverse Document Frequency (TF-IDF) model scores the best rank. Moreover, feature selection based on the binary HHO algorithm plays a vital role in reducing dimensionality, thereby enhancing the learning model’s performance for fake news detection. Interestingly, the proposed BHHO-LR model can yield a better enhancement of 5% compared with previous works on the same dataset.
Published: 2021

210. Efficient Creation of Japanese Tweet Emotion Dataset Using Sentence-Final Expressions

Author: Hidekatsu Ito, Masaki Ishii, Kohji Dohsaka, and Tatsuki Akahori
Subjects: business.industry, Computer science, Deep learning, computer.software_genre, Bag-of-words model, Classifier (linguistics), Task analysis, Emotional expression, Artificial intelligence, Language model, business, computer, Sentence, Natural language processing, Natural language
Abstract: Emotion recognition in natural language text is one of the critical technologies in the human-computer interface in a wide range of fields, including health and well-being, and labeled data plays a significant role in developing such technology. This paper presents a method for efficiently collecting Japanese emotion tweets carrying the first-person's emotion using emotional expressions and sentence-final expressions. By exploiting sentence-final expressions, we can identify the targeted tweets even though the subjects of sentences are often omitted and first-person pronouns are often not explicit in Japanese. By applying the method to Japanese tweet data, we constructed a Japanese tweet dataset comprising 2,234 tweets with labels of emotion types and intensities for two types of emotions: joy and sadness. The evaluation results show that the proposed method can improve the collection efficiency of targeted tweets and the reliability of data labels. We developed classifiers from the dataset that recognize emotion intensities. We show that a classifier using a deep learning-based language model outperforms conventional baseline methods using a Bag of Words model and that the Japanese tweet emotion dataset constructed by our method is useful for emotion intensity recognition.
Published: 2021

211. Measuring Model Biases in the Absence of Ground Truth

Author: Ken Burke, Alex Bäuerle, Osman Aka, Christina Greer, and Margaret Mitchell
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Contextual image classification, business.industry, Computer science, Association (object-oriented programming), Computer Vision and Pattern Recognition (cs.CV), Rank (computer programming), Computer Science - Computer Vision and Pattern Recognition, Pointwise mutual information, computer.software_genre, Machine Learning (cs.LG), Information extraction, Bag-of-words model, Identity (object-oriented programming), Artificial intelligence, business, Set (psychology), computer, Natural language processing
Abstract: The measurement of bias in machine learning often focuses on model performance across identity subgroups (such as man and woman) with respect to groundtruth labels. However, these methods do not directly measure the associations that a model may have learned, for example between labels and identity subgroups. Further, measuring a model's bias requires a fully annotated evaluation dataset which may not be easily available in practice. We present an elegant mathematical solution that tackles both issues simultaneously, using image classification as a working example. By treating a classification model's predictions for a given image as a set of labels analogous to a "bag of words", we rank the biases that a model has learned with respect to different identity labels. We use man, woman as a concrete example of an identity label set (although this set need not be binary), and present rankings for the labels that are most biased towards one identity or the other. We demonstrate how the statistical properties of different association metrics can lead to different rankings of the most "gender biased" labels, and conclude that normalized pointwise mutual information (nPMI) is most useful in practice. Finally, we announce an open-sourced nPMI visualization tool using TensorBoard.
Published: 2021
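
As a rough illustration of the association metric named in this abstract, the sketch below computes the standard normalized pointwise mutual information, nPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), over per-image label sets. The `npmi` function, the toy predictions, and the handling of the edge cases are assumptions made here, not the authors' exact formulation or tool.

```python
import math

def npmi(predictions, x, y):
    """Standard nPMI between labels x and y over a list of label sets."""
    n = len(predictions)
    p_x = sum(x in labels for labels in predictions) / n
    p_y = sum(y in labels for labels in predictions) / n
    p_xy = sum(x in labels and y in labels for labels in predictions) / n
    if p_xy == 0:
        return -1.0  # convention chosen here: labels never co-occur
    if p_xy == 1:
        return 0.0   # convention chosen here: 0/0 when both labels are always present
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# Each set is a hypothetical model's predicted labels for one image.
predictions = [
    {"man", "tie"},
    {"man", "tie"},
    {"woman", "dress"},
    {"woman", "tie"},
]

# In this toy data, "tie" is more strongly associated with "man" than with "woman".
print(npmi(predictions, "man", "tie") > npmi(predictions, "woman", "tie"))  # True
```

Values range from -1 (never together) through 0 (independent) to 1 (always together), and no ground-truth annotation is needed because only the model's own predictions are counted.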

212. User recommendation based on Hybrid filtering in Telegram messenger

Author: Mohammad Ali Zare Chahooki, Ali Hashemi, and Davod Karimpour
Subjects: Information retrieval, Social network, business.industry, Bag-of-words model, Group (mathematics), Process (engineering), Computer science, Feature (machine learning), Target audience, Graph (abstract data type), Cloud computing, business
Abstract: Over the past decade, social networks and messengers have found a special place in the creation and development of businesses. User recommendation is a very important feature in social networks that has attracted the attention of many users to these environments. Using this system in an instant messenger environment is very useful. Telegram is a cloud-based messenger with more than 400 million monthly active users. Telegram is used as a social network in Iran, but does not offer the most widely used features of social networks, such as recommending users. This feature is important for marketers to find their target audience. This paper presents a hybrid filtering-based algorithm to recommend Telegram users. This method combines the membership graph of users with the profile of groups. The membership graph models users based on their membership in groups. Also, the profile of each group includes the name and description of the group. We have created a bag of words for each group based on natural language processing methods to combine it with the membership graph. After the combination process, users are recommended based on the list of groups obtained. The data used in this study is the information of more than 120 million users and 900,000 supergroups in Telegram. This data is obtained through the Telegram API by the Idekav system. The evaluation of the proposed method has been done separately on two categories of specialized supergroups. Each category includes 25 specialized supergroups in Telegram. Selected supergroups for evaluation have between 2,000 and 10,000 members. Experimental results show the integrity of the model and error reduction in RMSE.
Published: 2021

213. A Performance Comparison of Supervised Classifiers and Deep-learning Approaches for Predicting Toxicity in Thai Tweets

Author: Pree Thiengburanathum and Phasit Charoenkwan
Subjects: business.industry, Computer science, Deep learning, Sentiment analysis, Feature extraction, Machine learning, computer.software_genre, Convolutional neural network, Bag-of-words model, Test set, Classifier (linguistics), Feature (machine learning), Artificial intelligence, business, computer
Abstract: There are numerous Twitter user accounts in Thailand and many toxic comments are being generated every day on this platform. Sentiment analysis can be used as a tool to identify toxic comments. In this study, two feature extraction techniques, Bag of Words (BOW) and term frequency-inverse document frequency (TF-IDF), were investigated. Additionally, the performance of ten well-known traditional classifiers, along with three deep-learning approaches including Convolutional Neural Network (CNN), long short-term memory (LSTM) and pretrained Bidirectional Encoder Representations from Transformers (BERT), were compared on the public Toxicity Thai tweeter corpus. The experiments reveal that by combining Bag of Words (BOW) with the Extra-Tree classifier, researchers were able to achieve the highest F1-score of 0.72, classification accuracy rate of 72.27% and AUC value of 0.77 on the test set, in contrast to other classifiers and other deep-learning techniques. Feature importance, correlation and impacts were also investigated through the use of SHapley Additive exPlanations (SHAP) diagrams.
Published: 2021

214. Document Level Emotion Detection from Bangla Text Using Machine Learning Techniques

Author: Md. Khairul Hasan, Mobasshira Jabin, Sadia Afrin Purba, Sadia Tasnim, and Tahmim Hossen
Subjects: Word embedding, Artificial neural network, business.industry, Computer science, Deep learning, Feature extraction, Machine learning, computer.software_genre, Convolutional neural network, ComputingMethodologies_PATTERNRECOGNITION, Bag-of-words model, Test set, Classifier (linguistics), Artificial intelligence, business, computer
Abstract: Understanding emotion from documents automatically is an interesting research topic in the machine learning field. Nowadays, many applications such as email and blogs have the ability to suggest joyful or angry expressions from written documents. In spite of being a popular language, Bangla lacks a rich corpus with annotated emotion labels, so recognizing emotion from documents is still not as developed as in other languages. In this work, we have proposed a new dataset containing Bangla documents with annotation of three emotions: Happy, Sad and Angry. Two major feature extraction techniques, Bag of Words (BoW) and Word Embedding, are used to extract features from the documents. BoW is used by Logistic Regression and Multinomial Naive Bayes classifiers. Word Embedding is used by Artificial Neural Network (ANN) and Convolutional Neural Network (CNN) classifiers. Among all, the Multinomial Naive Bayes classifier has given the best performance on the test set, with an accuracy of 68.27%. We have made our dataset (https://doi.org/10.6084/m9.figshare.13052789.v1) available for all to be used for further research purposes.
Published: 2021

215. Fake News detection Using Machine Learning

Author: Abdelhamid Djeffal and Nihel Fatima Baarir
Subjects: business.industry, Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Feature extraction, Machine learning, computer.software_genre, Term (time), Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, Web mining, Bag-of-words model, Classifier (linguistics), Social media, Artificial intelligence, tf–idf, business, computer
Abstract: The phenomenon of fake news is growing rapidly with the evolution of the means of communication and social media. Fake news detection is an emerging research area which is gaining considerable interest. It faces, however, some challenges due to limited resources such as datasets and processing and analysis techniques. In this work, we propose a system for fake news detection that uses machine learning techniques. We used term frequency-inverse document frequency (TF-IDF) of bag of words and n-grams as the feature extraction technique, and Support Vector Machine (SVM) as a classifier. We also propose a dataset of fake and true news to train the proposed system. Obtained results show the efficiency of the system.
Published: 2021

216. BOW-GBDT: A GBDT Classifier Combining With Artificial Neural Network for Identifying GPCR–Drug Interaction Based on Wordbook Learning From Sequences

Author: Zhe Lv, Jian-Hua Jia, Wangren Qiu, Xuan Xiao, and Yaoqiu Hong
Subjects: 0301 basic medicine, Discrete wavelet transform, Computer science, GPCR-drug interaction, Discrete Fourier transform, Cell and Developmental Biology, 03 medical and health sciences, 0302 clinical medicine, Dimension (vector space), Classifier (linguistics), Feature (machine learning), bag-of-words, lcsh:QH301-705.5, Original Research, discrete wavelet transform, Artificial neural network, business.industry, Pattern recognition, weighted silhouette coefficient, Cell Biology, Function (mathematics), 030104 developmental biology, lcsh:Biology (General), Bag-of-words model, 030220 oncology & carcinogenesis, Artificial intelligence, business, artificial neural network, Developmental Biology
Abstract: Background: As a class of membrane protein receptors, G protein-coupled receptors (GPCRs) are very important for cells to complete normal life functions and have been proven to be a major drug target for widespread clinical application. Hence, it is of great significance to find GPCR targets that interact with drugs in the process of drug development. However, identifying the interaction of GPCR–drug pairs by experimental methods is very expensive and time-consuming on a large scale. As more and more databases of GPCR–drug pairs are opened, it is viable to develop machine learning models to accurately predict whether there is an interaction in a GPCR–drug pair. Methods: In this paper, the proposed model aims to improve the accuracy of predicting the interactions of GPCR–drug pairs. For GPCRs, the work extracts protein sequence features based on a novel bag-of-words (BOW) model improved with a weighted Silhouette Coefficient, and it has been confirmed that this model can extract more pattern information and limit the dimension of the features. For drug molecules, discrete wavelet transform (DWT) is used to extract features from the original molecular fingerprints. Subsequently, the two types of features are concatenated, and the SMOTE algorithm is selected to balance the training dataset. Then, an artificial neural network is used to extract features further. Finally, a gradient boosting decision tree (GBDT) model is trained with the selected features. In this paper, the proposed model is named BOW-GBDT. Results: D92M and Check390 are selected for testing BOW-GBDT. D92M is used as a cross-validation dataset, which contains 635 interactive GPCR–drug pairs and 1,225 non-interactive pairs. Check390 is used as an independent test dataset, which consists of 130 interactive GPCR–drug pairs and 260 non-interactive GPCR–drug pairs, and no element in Check390 can be found in D92M. According to the results, the proposed model has better generalization ability than the existing machine learning models. Conclusion: The proposed predictor improves the accuracy of predicting the interactions of GPCR–drug pairs. To make BOW-GBDT easier for more researchers to use, the predictor has been deployed on a new server, which is available at http://www.jci-bioinfo.cn/bowgbdt.
Published: 2021

217. Comparison of Deep Learning Approaches for Sentiment Classification

Author: C.S. Kanimozhiselvi, S. Uma, and K. S. Kalaivani
Subjects: 0209 industrial biotechnology, business.industry, Computer science, Deep learning, Sentiment analysis, Feature extraction, 02 engineering and technology, Machine learning, computer.software_genre, Convolutional neural network, Field (computer science), 020901 industrial engineering & automation, Recurrent neural network, Bag-of-words model, Scalability, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer
Abstract: Word embeddings are used to convert unstructured text to numerical values for further analysis. Nowadays, prediction-based embedding models like Continuous Bag of Words (CBOW) and Skip-gram are used in preference to frequency-based embeddings. Unlike frequency-based embeddings, prediction-based embeddings are able to model the semantics of the terms present in a sentence. Sentiment Analysis (SA) is a field of study that aims to automatically extract opinions from data and to further classify them as positive or negative. The application of sentiment analysis in almost all domains is a motivating factor for this work. It suffers from the problem of non-availability of sufficient labeled data to train the model. Due to the scalability of deep learning models and their ability to perform automatic feature extraction from the data, they can be introduced to address this problem. They are also used for various applications due to their capability to extract hierarchical structures from complex data. Keras is a Deep Learning (DL) framework that provides an embedding layer to produce the vector representation of words present in the document. The objective of this work is to analyze the performance of three deep learning models, namely Convolutional Neural Network (CNN), Simple Recurrent Neural Network (RNN) and Long Short Term Memory (LSTM), for classifying book reviews. From the experiments conducted, it is found that the LSTM model performs better than CNN and simple RNN for sentiment classification.
Published: 2021

218. Phrases based Document Classification from Semi Supervised Hierarchical LDA

Author: Rohit Agarwal
Subjects: Phrase, Computer science, business.industry, Document classification, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, computer.software_genre, Semantics, Latent Dirichlet allocation, Support vector machine, Naive Bayes classifier, symbols.namesake, ComputingMethodologies_PATTERNRECOGNITION, Bag-of-words model, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, symbols, Vector space model, Artificial intelligence, business, computer, Natural language processing
Abstract: Different state-of-the-art document classification models, such as Support Vector Machine, Naive Bayes and Neural Network, are based on the bag of words model. These models do not capture the semantic meaning of words. In any document, the meaning of words can be demonstrated by their presence and by the vicinity of particular words. Bag of Phrases is one technique by which the author can preserve the vicinity of the words. This model is able to identify the usefulness of phrases in document classification. In this paper, the author proposes a Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA) model, which uses the outstanding theme to isolate the phrases from the corpus. The proposed model incorporates the phrases in a vector space model for document classification. Experiments are performed on the organic documents with the Bag of Phrases technique and show effective classification when compared with state-of-the-art models.
Published: 2021

219. Fine Grained Sentiment Analysis of Malayalam Tweets Using Lexicon Based and Machine Learning Based Approaches

Author: Soumya S and Pramod K
Subjects: business.industry, Computer science, Feature vector, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Sentiment analysis, Machine learning, computer.software_genre, Random forest, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, Bag-of-words model, Kernel (statistics), Feature (machine learning), Artificial intelligence, business, tf–idf, computer
Abstract: Fine-Grained Sentiment Analysis (FGSA) of Malayalam tweets has been implemented in this work. The tweets are classified into positive, strongly positive, negative, strongly negative, and neutral sentiments. Both lexicon-based and machine learning-based approaches are used for sentiment classification of Malayalam tweets. The lexicon-based approach can be either dictionary-based or corpus-based; the dictionary-based approach is used in this work. Machine learning algorithms such as Support Vector Machine (SVM) and Random Forest (RF) classifiers are used for sentiment classification of the dataset. Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Sentiwordnet feature matrices are used to vectorize the input dataset. The lexicon-based approach achieved an accuracy of 84.8%. Among the machine learning algorithms, SVM (kernel = linear), SVM (kernel = RBF) and RF with the Sentiwordnet feature vector achieved accuracies of 92.6%, 92.9%, and 93.4%, respectively.
Published: 2021

220. Significant Trajectories and Locality Constrained Linear Coding for Hand Gesture Representation

Author: Viet Sang Dinh, Tien Hai Nguyen, and Thanh-Hai Tran
Subjects: Computer science, business.industry, 05 social sciences, Locality, 050801 communication & media studies, Context (language use), Pattern recognition, Convolutional neural network, Support vector machine, 0508 media and communications, Bag-of-words model, Gesture recognition, 0502 economics and business, 050211 marketing, Artificial intelligence, Representation (mathematics), business, Gesture
Abstract: Recently, action recognition has gained a lot of attention from researchers thanks to its potential applications in real life. In particular, hand gestures, which are actions performed by the human hand, have been widely studied and have started to be deployed as an efficient means of human machine interaction (HMI). In this paper, we focus on hand gesture recognition in the context of HMI, which requires balancing the trade-off between recognition accuracy and computation time. While the convolutional neural network (CNN) has been shown to be very effective in many tasks, it requires a powerful computer and huge training data, which are not always available in common use. In this paper, we study a method based on hand-crafted features (i.e. dense trajectories for hand gesture representation). We then select the most significant trajectories and compute a descriptor for each of them. For the final representation of a gesture, we utilize locality constrained linear coding (LLC) and compare it with the Bag of Words (BoW) model. Finally, a Support Vector Machine (SVM) is deployed to classify gestures. We test the proposed method on a dataset of hand gestures captured from different viewpoints and study the impact of viewpoint changes on this dataset. Experiments show that the proposed method keeps a balance between accuracy and computational time and is comparable with the CNN-based method.
Published: 2021

221. SpaML: a Bimodal Ensemble Learning Spam Detector based on NLP Techniques

Author: Jaouhar Fattahi and Mohamed Mejri
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, business.industry, Computer science, Detector, Cryptography, 02 engineering and technology, computer.software_genre, Ensemble learning, Term (time), Set (abstract data type), ComputingMethodologies_PATTERNRECOGNITION, Bag-of-words model, 020204 information systems, Classifier (linguistics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Cryptography and Security (cs.CR), computer, Natural language processing
Abstract: In this paper, we put forward a new tool, called SpaML, for spam detection using a set of supervised and unsupervised classifiers, and two techniques imbued with Natural Language Processing (NLP), namely Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). We first present the NLP techniques used. Then, we present our classifiers and their performance on each of these techniques. Then, we present our overall Ensemble Learning classifier and the strategy we are using to combine them. Finally, we present the interesting results shown by SpaML in terms of accuracy and precision. Comment: This paper was accepted on October 13, 2020, for publication and oral presentation at the 2021 IEEE 5th International Conference on Cryptography, Security and Privacy (CSP 2021), to be held in Zhuhai, China during January 8-10, 2021 and hosted by Beijing Normal University (Zhuhai).
Published: 2021

222. The Temporal Dictionary Ensemble (TDE) Classifier for Time Series Classification

Author: Gavin C. Cawley, Anthony J. Bagnall, James Large, and Matthew Middlehurst
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, business.industry, Deep learning, Pattern recognition, 02 engineering and technology, Machine Learning (cs.LG), Tree (data structure), symbols.namesake, ComputingMethodologies_PATTERNRECOGNITION, Transformation (function), Boss, Bag-of-words model, 020204 information systems, Histogram, Classifier (linguistics), 0202 electrical engineering, electronic engineering, information engineering, symbols, 020201 artificial intelligence & image processing, Artificial intelligence, business, Gaussian process
Abstract: Using bag of words representations of time series is a popular approach to time series classification. These algorithms involve approximating and discretising windows over a series to form words, then forming a count of words over a given dictionary. Classifiers are constructed on the resulting histograms of word counts. A 2017 evaluation of a range of time series classifiers found the bag of symbolic-Fourier approximation symbols (BOSS) ensemble the best of the dictionary based classifiers. It forms one of the components of hierarchical vote collective of transformation-based ensembles (HIVE-COTE), which represents the current state of the art. Since then, several new dictionary based algorithms have been proposed that are more accurate or more scalable (or both) than BOSS. We propose a further extension of these dictionary based classifiers that combines the best elements of the others combined with a novel approach to constructing ensemble members based on an adaptive Gaussian process model of the parameter space. We demonstrate that the temporal dictionary ensemble (TDE) is more accurate than other dictionary based approaches. Furthermore, unlike the other classifiers, if we replace BOSS in HIVE-COTE with TDE, HIVE-COTE is significantly more accurate. We also show this new version of HIVE-COTE is significantly more accurate than the current best deep learning approach, a recently proposed hybrid tree ensemble and a recently introduced competitive classifier making use of highly randomised convolutional kernels. This advance represents a new state of the art for time series classification. arXiv admin note: text overlap with arXiv:1911.12008
Published: 2021
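
Illustrative note (not part of the catalog record): the abstract above describes dictionary-based time series classifiers as sliding windows over a series, discretising each window into a word, and counting the words. The sketch below shows that general pipeline only. It assumes a much simpler discretisation (segment means compared with the global mean, two-letter alphabet) than the symbolic Fourier approximation used by BOSS and TDE, and all names and parameters are hypothetical.

```python
from collections import Counter


def series_to_words(series, window=4, word_len=2):
    """Slide a window over the series and turn each window into a short word.

    Each window is cut into `word_len` equal segments. A segment becomes 'b'
    if its mean is above the global mean of the series, otherwise 'a'.
    """
    if window % word_len != 0:
        raise ValueError("window must be a multiple of word_len")
    global_mean = sum(series) / len(series)
    seg_len = window // word_len
    words = []
    for start in range(len(series) - window + 1):
        win = series[start:start + window]
        letters = []
        for i in range(word_len):
            segment = win[i * seg_len:(i + 1) * seg_len]
            seg_mean = sum(segment) / len(segment)
            letters.append("b" if seg_mean > global_mean else "a")
        words.append("".join(letters))
    return words


def word_histogram(series, window=4, word_len=2):
    """Bag-of-words view of a series: how often each word occurs."""
    return Counter(series_to_words(series, window, word_len))


# Hypothetical toy series: a step from a low level to a high level.
STEP_SERIES = [0, 0, 0, 0, 1, 1, 1, 1]
STEP_HISTOGRAM = word_histogram(STEP_SERIES)
```

A classifier would then be trained on histograms such as `STEP_HISTOGRAM`, one per series, in the same way a text classifier is trained on word counts.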

223. Detection of COVID-19 Using Textual Clinical Data: A Machine Learning Approach

Author: Manish Mahajan, Virendra Kumar Shrivastava, Reenu Batra, and Amit Kumar Goel
Subjects: Feature engineering, Computer science, business.industry, Machine learning, computer.software_genre, Ensemble learning, Field (computer science), Projection (relational algebra), Naive Bayes classifier, Bag-of-words model, Artificial intelligence, tf–idf, F1 score, business, computer
Abstract: Rapid innovation in technology results in fast growth in every field, whether medical, non-medical, or any other field of life. In the current century, human society has witnessed at least five pandemics. Apart from these five, COVID-19 has also been declared a pandemic by the World Health Organization (WHO) and has brought a worldwide threat to humankind. The COVID-19 pandemic originated in the city of Wuhan, China. Due to the deadly and unpredictable behavior of this virus, 2,00,26,209 positive cases have been reported to date and the death toll has reached 7,34,025 in 213 countries. In this back-breaking situation, artificial intelligence (AI) is playing an extremely important role. As in the diagnosis of many diseases, various AI tools can be used. Given the situation, there is a need for strong tools that will help in the detection of the coronavirus. To fight COVID-19, AI can be used in different areas, namely, (a) alarming/alerts (b) tracing and forecasting (c) data console/dashboard (d) diagnosis and prognosis (e) projection and foresight. Machine learning, which is mainly a sub-part of artificial intelligence (AI), can also be helpful in the fight against COVID-19. Some machine learning algorithms can be used in the classification of clinical disease reports. Machine learning algorithms may be either classical algorithms or ensemble algorithms. Textual clinical reports can be classified using machine learning, and feature engineering can be applied to improve the performance of ML algorithms. Term frequency/inverse document frequency (TF/IDF), Bag of Words (BOW) and Report length (RL) are three different mechanisms by which feature engineering can be applied to textual clinical reports. In terms of accuracy, Naive Bayes achieves higher accuracy than the other ML algorithms. A comparative study of accuracy can also be done for classical and ensemble machine learning algorithms. Besides accuracy, the precision, recall, and F1 score of various machine learning algorithms can also be compared.
Published: 2021

224. Labeling News Article’s Subject Using Uncertainty Based Active Learning

Author: Meet Parekh and Yash Patel
Subjects: Text corpus, Active learning (machine learning), business.industry, Computer science, computer.software_genre, Task (project management), Support vector machine, Naive Bayes classifier, ComputingMethodologies_PATTERNRECOGNITION, Bag-of-words model, Classifier (linguistics), Artificial intelligence, business, computer, Natural language processing, Test data
Abstract: In Natural Language Processing, labeling a text corpus is often an expensive task that requires a lot of human effort and cost, whereas unlabeled text corpora in varying domains are readily available. For a couple of decades, research efforts have concentrated on algorithms that can be used for labeling the corpus, thus minimizing the number of articles required to be labeled manually. Semi-Supervised Learning and Active Learning have shown great promise for labeling the articles using a trained model. Also, Semi-Supervised Learning algorithms and Active Learning algorithms have strong theoretical guarantees. This study aims to tag 1183 articles from The New York Times and The Wall Street Journal with the subject (i.e. the primary organization related to the news article) employing an Active Learning algorithm. We used an Active Learning algorithm which uses Random Sampling along with Uncertainty Based Querying. This Active Learning approach is used to train a Naive Bayes classifier using Bag of Words features. This classifier is used to tag 1183 articles, of which only 167 required manual review, thus achieving a reduction of 85.89% with 78.18% accuracy. Also, to verify the quality of the labeled corpus, an SVM classifier using the same features was trained on the labeled corpus, giving an accuracy of 74.45% on test data.
Published: 2021

225. Sentiment Analysis of Bangla Language Using Deep Learning Approaches

Author: Promila Haque, Mohammed Nazim Uddin, and Muntasir Hoq
Subjects: Computer science, business.industry, Deep learning, Sentiment analysis, computer.software_genre, Convolutional neural network, Support vector machine, Naive Bayes classifier, Bag-of-words model, Word2vec, Artificial intelligence, F1 score, business, computer, Natural language processing
Abstract: Emotion is central to human textual communication via social media. Nowadays, people use text for reviewing or recommending things, sharing opinions, rating their likes or dislikes, providing feedback on different services, and so on. Bangladeshi people use Bangla to express their emotions. Current sentiment analysis research has achieved low performance with several approaches to detecting sentiment polarity and emotion from Bangla texts. In this study, we have developed four models that combine a hybrid of Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) with various word embeddings, including Embedding Layer, Word2Vec, Global Vectors (GloVe), and Continuous Bag of Words (CBOW), to detect emotion from Bangla texts (words, sentences). Our models can identify three basic emotions: happiness, anger, and sadness. It will make interaction lively and interesting. Our models are compared against CNN and LSTM with different word embeddings, and also against previous research on the same dataset based on classical machine learning techniques such as Support Vector Machine (SVM), Naive Bayes, and K-Nearest Neighbors (K-NN). We used Facebook Bangla comments as the dataset. In our study, we have tried to detect the exact emotion from the text. The best model, integrating a Word2Vec embedding layer with a hybrid of CNN-LSTM, detected emotions from raw textual data with an accuracy of 90.49% and an F1 score of 92.83%.
Published: 2021

226. Imprecision of Semantic Meaning in a Natural Language

Author: Farida Huseynova
Subjects: Scheme (programming language), 0209 industrial biotechnology, business.industry, Computer science, 02 engineering and technology, Semantics, computer.software_genre, Fuzzy logic, Set (abstract data type), 020901 industrial engineering & automation, Bag-of-words model, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Cluster analysis, computer, Natural language processing, Natural language, Meaning (linguistics), computer.programming_language
Abstract: Most linguistic computing models cause loss of information due to approximation processes and imprecision in the results. Traditionally, the Bag of Words (BoW) representation is used to model documents in a vector space. The BoW scheme is simple and popular, but it suffers from numerous drawbacks. It also fails to capture the semantics contained in the documents properly as automatic classification and clustering are carried out in the most common operations. We tried to use fuzzy logic and its extension, computing with words, to analyze the semantic meanings of a set of terms given in a natural language.
Published: 2021

227. Video Categorization Based on Sentiment Analysis of YouTube Comments

Author: Monika Verma, Shraddha Mantri, Debabrata Swain, Anirudha Kulkarni, and Sayali Phadke
Subjects: Hazard (logic), Scheme (programming language), Computer science, business.industry, Lemmatisation, Sentiment analysis, computer.software_genre, Constructive, Categorization, Bag-of-words model, Artificial intelligence, Data pre-processing, business, computer, Natural language processing, computer.programming_language
Abstract: With recent developments in digital technologies, the amount of multimedia statistics is increasing every day. Abusive video constitutes a hazard to public safety, and thus constructive detection algorithms are urgently needed. In order to improve the detection accuracy, sentiment analysis-based video classification is proposed here. The sentiment analysis-based video classification system is used to classify video content into two categories, i.e., abusive videos and non-abusive videos. We use the YouTube comments of a video as the input, which is analyzed by our sentiment analysis model, and the model determines the category to which that particular video belongs. Many techniques such as Bag of Words, lemmatization, logistic regression and NLP are used. The proposed scheme obtains competitive results on abusive content detection. The empirical outcome shows that our method is elementary and productive.
Published: 2021

228. Identifying bot activity in GitHub pull request and issue comments

Author: Eleni Constantinou, Mehdi Golzadeh, Tom Mens, and Alexandre Decan
Subjects: FOS: Computer and information sciences, Code review, Information retrieval, business.industry, Computer science, Software development, Commit, ENCODE, computer.software_genre, Software Engineering (cs.SE), Identification (information), Computer Science - Software Engineering, Bag-of-words model, Test set, Classifier (linguistics), business, computer
Abstract: Development bots are used on GitHub to automate repetitive activities. Such bots communicate with human actors via issue comments and pull request comments. Identifying such bot comments allows preventing bias in socio-technical studies related to software development. To automate their identification, we propose a classification model based on natural language processing. Starting from a balanced ground-truth dataset of 19,282 PR and issue comments, we encode the comments as vectors using a combination of the bag of words and TF-IDF techniques. We train a range of binary classifiers to predict the type of comment (human or bot) based on this vector representation. A multinomial Naive Bayes classifier provides the best results. Its performance on a test set containing 50% of the data achieves an average precision, recall, and F1 score of 0.88. Although the model shows a promising result on the pull request and issue comments, further work is required to generalize the model on other types of activities, like commit messages and code reviews. Comment: 4 pages, 1 page of references, 1 figure, 3 tables
Published: 2021
Full Text: View/download PDF
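
Illustrative note (not part of the catalog record): the abstract above encodes comments as vectors using bag of words and TF-IDF. The sketch below shows one common way these fit together, assuming raw term counts weighted by the textbook inverse document frequency log(N / df). The abstract does not specify the exact combination or weighting, and libraries often use smoothed variants. The sample comments and all names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())


def tfidf_vectors(comments):
    """Return (vocabulary, vectors), one TF-IDF vector per comment.

    Weight of term t in comment d: count(t, d) * log(N / df(t)), where N is
    the number of comments and df(t) the number of comments containing t.
    """
    counts = [bag_of_words(c) for c in comments]
    n_docs = len(counts)
    doc_freq = Counter()
    for bow in counts:
        doc_freq.update(bow.keys())  # each comment counts once per term
    vocab = sorted(doc_freq)
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    vectors = [[bow[t] * idf[t] for t in vocab] for bow in counts]
    return vocab, vectors


# Hypothetical comments standing in for pull request and issue comments.
SAMPLE_COMMENTS = [
    "build passed",
    "build failed please fix",
    "thanks for the fix",
]
VOCAB, VECTORS = tfidf_vectors(SAMPLE_COMMENTS)
```

Terms that appear in many comments, such as "build" here, receive lower weights than terms confined to a single comment, so the vectors emphasise what distinguishes one comment from the rest.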

229. Minimalist Fitted Bayesian Classifier-Based on Likelihood Estimations and Bag-of-Words

Author: Elias de Oliveira, Jean-Rémi Bourguet, and Wesley Silva
Subjects: Feature engineering, Class (computer programming), Computer science, business.industry, Autonomous agent, Big data, Machine learning, computer.software_genre, Metadata, Naive Bayes classifier, Bag-of-words model, Statistical inference, Artificial intelligence, business, computer
Abstract: The expansion of institutional repositories involves new challenges for autonomous agents that control the quality of semantic annotations in large amounts of scholarly knowledge. While evaluating metadata integrity in documents was already widely tackled in the literature, a majority of the frameworks are intractable when confronted with a big data environment. In this paper, we propose an optimal strategy based on feature engineering to identify spurious objects in large academic repositories. Through an application case dealing with a Brazilian institutional repository containing objects like PhD theses and MSc dissertations, we use maximum likelihood estimations and bag-of-words techniques to fit a minimalist Bayesian classifier that can quickly detect inconsistencies in class assertions guaranteeing approximately 94% of accuracy.
Published: 2021

230. SVMBPI: Support Vector Machine-Based Propaganda Identification

Author: Qamar Rayees Khan, Syed Tanzeel Rabani, and Akib Mohi Ud Din Khanday
Subjects: Feature engineering, World Wide Web, Support vector machine, Identification (information), Bag-of-words model, business.industry, Computer science, Deep learning, Disinformation, Card stacking, Artificial intelligence, tf–idf, business
Abstract: Online social networks are being used to express and freely communicate the information. Some of the popular social networking sites (SNS) used for this purpose are Facebook, Twitter, Instagram, etc. Most of the people/bots use these SNS for spreading hoaxes, misinformation, disinformation and propaganda. Propaganda is the latest trend that is used mainly to gain religious and political influence by the help of various techniques like bandwagon, card stacking and glittering. In this research paper, efforts were made to differentiate propagandist text from non-propagandist text using supervised machine learning algorithm. Data was collected from the news sources from July 2018–August 2018. After annotating the text, feature engineering was done using various techniques like term frequency/inverse document frequency (TF/IDF) and bag of words (BOW). These features were supplied to support vector machine classifier (SVM) which showed a good accuracy having an F1-score of 0.81 for non-propagandist text and 0.58 for propagandist text. This paper will act as a base for researchers to use various other machine and deep learning techniques in differentiating the propagandist text from non-propagandist text.
Published: 2021

231. Improved Retrieval of Programming Solutions With Code Examples Using a Multi-featured Score

Author: Chanchal K. Roy, Mohammad Masudur Rahman, Rodrigo Fernandes Gomes da Silva, Marcelo de Almeida Maia, Foutse Khomh, and Carlos Eduardo de Carvalho Dantas
Subjects: FOS: Computer and information sciences, Information retrieval, Computer science, Ranking (information retrieval), Set (abstract data type), Software Engineering (cs.SE), Computer Science - Software Engineering, Task (computing), Hardware and Architecture, Bag-of-words model, Code (cryptography), Vocabulary mismatch, Mean reciprocal rank, Software, Information Systems, Semantic gap
Abstract: Developers often depend on code search engines to obtain solutions for their programming tasks. However, finding an expected solution containing code examples along with their explanations is challenging due to several issues. There is a vocabulary mismatch between the search keywords (the query) and the appropriate solutions. Semantic gap may increase for similar bag of words due to antonyms and negation. Moreover, documents retrieved by search engines might not contain solutions containing both code examples and their explanations. So, we propose CRAR (Crowd Answer Recommender) to circumvent those issues aiming at improving retrieval of relevant answers from Stack Overflow containing not only the expected code examples for the given task but also their explanations. Given a programming task, we investigate the effectiveness of combining information retrieval techniques along with a set of features to enhance the ranking of important threads (i.e., the units containing questions along with their answers) for the given task and then selects relevant answers contained in those threads, including semantic features, like word embeddings and sentence embeddings, for instance, a Convolutional Neural Network (CNN). CRAR also leverages social aspects of Stack Overflow discussions like popularity to select relevant answers for the tasks. Our experimental evaluation shows that the combination of the different features performs better than each one individually. We also compare the retrieval performance with the state-of-the-art CROKAGE (Crowd Knowledge Answer Generator), which is also a system aimed at retrieving relevant answers from Stack Overflow. We show that CRAR outperforms CROKAGE in Mean Reciprocal Rank and Mean Recall with small and medium effect sizes, respectively. Comment: 31 pages, 5 figures, 9 tables
Published: 2021
Full Text: View/download PDF

232. VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words

Author: Tiancheng Zhao, Xiaopeng Lu, and Kyusong Lee
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Speedup, Matching (graph theory), business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, Inverted index, Image (mathematics), Machine Learning (cs.LG), Acceleration, Bag-of-words model, Scalability, Artificial intelligence, business, Computation and Language (cs.CL), Transformer (machine learning model)
Abstract: Text-to-image retrieval is an essential task in cross-modal information retrieval, i.e., retrieving relevant images from a large and unlabelled dataset given textual queries. In this paper, we propose VisualSparta, a novel (Visual-text Sparse Transformer Matching) model that shows significant improvement in terms of both accuracy and efficiency. VisualSparta is capable of outperforming previous state-of-the-art scalable methods in MSCOCO and Flickr30K. We also show that it achieves substantial retrieving speed advantages, i.e., for a 1 million image index, VisualSparta using CPU gets ~391X speedup compared to CPU vector search and ~5.4X speedup compared to vector search with GPU acceleration. Experiments show that this speed advantage even gets bigger for larger datasets because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for large-scale datasets, with significant accuracy improvement compared to previous state-of-the-art methods. Comment: Accepted to ACL2021 (10 pages)
Published: 2021
Full Text: View/download PDF
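
The abstract above attributes the speed advantage to an inverted index over weighted bag-of-words representations. The general idea can be sketched as follows (a simplified illustration, not the VisualSparta implementation; the per-image term weights, function names, and image ids are hypothetical, and in the paper such weights would come from a trained model):

```python
from collections import defaultdict


def build_inverted_index(item_term_weights):
    """Map each term to the list of (item_id, weight) pairs in which it occurs."""
    index = defaultdict(list)
    for item_id, weights in item_term_weights.items():
        for term, weight in weights.items():
            index[term].append((item_id, weight))
    return index


def search(index, query_terms, top_k=3):
    """Score items by summing the stored weights of the query terms they contain."""
    scores = defaultdict(float)
    for term in query_terms:
        for item_id, weight in index.get(term, []):
            scores[item_id] += weight
    # Highest score first; ties broken by item id for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:top_k]


# Hypothetical precomputed term weights for two images.
images = {
    "img1": {"dog": 0.9, "grass": 0.4},
    "img2": {"cat": 0.8, "grass": 0.5},
}
index = build_inverted_index(images)
print(search(index, ["dog", "grass"]))
```

Only the postings of the query terms are touched at search time, which is why this structure tends to scale better than comparing the query against every item vector.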

233. Conference Paper Acceptance Prediction: Using Machine Learning

Author: Ajinkya Kulkarni, Siddharth Patil, Ishwari Kulkarni, Deepali Joshi, Riya Pande, and Nikhil Saini
Subjects: Computer science, business.industry, Decision tree, Logistic regression, Machine learning, computer.software_genre, Domain (software engineering), Random forest, Support vector machine, Bag-of-words model, Artificial intelligence, Representation (mathematics), business, computer
Abstract: The paper presents a model that predicts whether a paper will be accepted at a particular conference. The model is designed for conferences that accept research in the Machine Learning domain. The dataset used to develop this model is from ICLR 2017 (International Conference on Learning Representations). The model bases its prediction on extracted features. The features that most conferences consider are Number of References, Number of Figures, Number of Tables, Bag of words for ML-related terms, etc. Some more features are taken into consideration to give better results, such as Length of Title, Frequency of ML-related words, Number of ML algorithms, and Average length of sentences. The model is trained on the above-mentioned dataset, which contains 70 accepted and 100 rejected papers. Different Machine Learning algorithms are used for the prediction: Logistic Regression, Decision Tree, Random Forest, KNN, and SVM. A comparative study of the algorithms on the dataset shows that Decision Tree works effectively, providing 85% accuracy.
Published: 2021

234. Text Visualization Using t-Distributed Stochastic Neighborhood Embedding (t-SNE)

Author: Kamal Kumar and Chelimilla Natraj Naveen
Subjects: Perplexity, business.industry, Computer science, computer.software_genre, Visualization, Data visualization, Bag-of-words model, Embedding, Word2vec, Image tracing, Data mining, business, tf–idf, computer
Abstract: Data visualization is an important task to perform before classification and model building. Through data visualization we can easily see whether a problem is directly classifiable or not. This paper presents t-SNE visualizations of the DonorsChoose data vectorized by different techniques such as term frequency-inverse document frequency (TF-IDF), Bag of words, and TF-IDF weighted Word2Vec. Data visualization here implies reducing the dimensionality of the data. The visualizations in this paper are produced using different perplexity values of t-SNE. Pre-processing and vectorization are done accordingly for the different features, and numerical features are standardized before being passed to t-SNE.
Published: 2021

235. A Comprehensive Survey of Sentiment Analysis: Word Embeddings Approach, Research Challenges and Opportunities

Author: Shameer Bashir and Arvind Selwal
Subjects: Word embedding, business.industry, Computer science, Sentiment analysis, Overfitting, computer.software_genre, Bag-of-words model, Word2vec, Artificial intelligence, tf–idf, business, computer, Natural language processing, Sentence, Word (computer architecture)
Abstract: In this paper, we present a review of sentiment analysis along with the concepts of word embeddings and natural language processing, and the crucial aspects that are essential to a model for sentiment analysis. First, we discuss the basic steps for collecting a corpus of sentences expressing different sentiments or opinions, such as movie reviews, Twitter data, etc. Usually, the corpus is reduced so that only the desirable or subjective part of each sentence is retained, whereas the rest is discarded. The resultant dataset is split into train and test segments in a balanced ratio so that the problem of classifier overfitting is overcome. We also discuss various techniques such as Bag-of-Words, TF-IDF, and word embeddings like word2vec and BERT (the latest one) for converting the entire dataset into machine-readable, i.e., numerical, form. These vectors are fed as input to machine learning or deep learning classifiers to predict the polarity of the subjective sentences. In our study we explain the techniques from the very basic bag-of-words to the latest word embeddings such as BERT. Lastly, we identify a few research issues that are open to the research community in this active field of sentiment analysis. One of the major challenges is to design a domain-independent model for sentiment analysis using word embeddings. A further issue relates to the use of word embeddings for translation in local languages such as Dogri, Kashmiri, and many more.
Published: 2021

236. Characterising Players of a Cube Puzzle Game with a Two-level Bag of Words

Author: Angeles López, Pablo Sanahuja, V. Javier Traver, José Ribelles, and Xavier Anadón
Subjects: Theoretical computer science, Computer science, media_common.quotation_subject, Space (commercial competition), videogame, performance prediction, player characterisation, bag of words, machine learning, Bag-of-words model, Histogram, Perception, Feature (machine learning), Embedding, Cube, Cluster analysis, media_common, clustering
Abstract: Paper presented at UMAP '21: Adjunct Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, Utrecht (Netherlands), June 21 - 25, 2021. This work explores an unsupervised approach for modelling players of a 2D cube puzzle game with the ultimate goal of customising the game for particular players based solely on their interaction data. To that end, user interactions when solving puzzles are coded as images. Then, a feature embedding is learned for each puzzle with a convolutional network trained to regress the players’ completion effort in terms of time and number of clicks. Next, the known bag-of-words technique is used at two levels. First, sets of puzzles are represented using the puzzle feature embeddings as the input space. Second, the resulting first-level histograms are used as input space for characterising players. As a result, new players can be characterised in terms of the resulting second-level histograms. Preliminary results indicate that the approach is effective for characterising players in terms of performance. It is also tentatively observed that other personal perceptions and preferences, beyond performance, are somehow implicitly captured from behavioural data.
Published: 2021

237. MARE: Self-supervised multi-attention REsu-net for semantic segmentation in remote sensing

Author: Nikos Komodakis, Simone Scardapane, and Valerio Marsocci
Subjects: Self supervised learning, Artificial neural network, Computer science, Science, deep learning, Land cover, Net (mathematics), vaihingen dataset, semantic segmentation, Task (project management), linear attention, self-supervised learning, remote sensing, Bag-of-words model, Remote sensing (archaeology), General Earth and Planetary Sciences, Segmentation, Remote sensing
Abstract: Scene understanding of satellite and aerial images is a pivotal task in various remote sensing (RS) practices, such as land cover and urban development monitoring. In recent years, neural networks have become a de facto standard in many of these applications. However, semantic segmentation remains a challenging task. Compared with other computer vision (CV) areas, large labeled datasets are seldom available in RS, due to their high cost and the manpower required. On the other hand, self-supervised learning (SSL) is attracting more and more interest in CV, reaching the state of the art in several tasks. In spite of this, most SSL models, pretrained on huge datasets like ImageNet, do not perform particularly well on RS data. For this reason, we propose a combination of an SSL algorithm (particularly, Online Bag of Words) and a semantic segmentation algorithm shaped for aerial images (namely, Multistage Attention ResU-Net), to show new encouraging results (i.e., 81.76% mIoU with a ResNet-18 backbone) on the ISPRS Vaihingen dataset.
Published: 2021

238. Interpretation of Learning-Based Automatic Source Code Vulnerability Detection Model Using LIME

Author: Huiqiang Wang, Feng Yang, Meikang Qiu, Lianxiao Meng, Shuangyin Ren, Weipeng Cao, Gaigai Tang, Long Zhang, and Lin Yang
Subjects: Source code, business.industry, Computer science, media_common.quotation_subject, Deep learning, Machine learning, computer.software_genre, Field (computer science), Bag-of-words model, Credibility, Preprocessor, Artificial intelligence, Representation (mathematics), business, computer, Vulnerability (computing), media_common
Abstract: The existing advanced automatic vulnerability detection methods based on source code are mainly learning-based, such as machine learning and deep learning. These models can capture the vulnerability pattern through learning, which is more automatic and intelligent. However, the outputs of many learning-based vulnerability detection models are unexplainable, even though they usually show high accuracy. It is meaningful to verify the credibility of the models so that we can better understand and use them in practice. To alleviate the above issue, we use an interpretation method called LIME to explain the learning-based automatic vulnerability detection model. First, the preprocessing methods are all interpretable, including symbolization and vector representation, where the Bag of words model is chosen for source code vector representation. Second, the vulnerability detection models we select are based on Logistic Regression and Bi-LSTM. The former is interpretable and is used to verify the effectiveness of LIME in the field of source code vulnerability detection. The latter is not inherently explainable and is interpreted by LIME to assess its credibility on source code vulnerability detection. The experimental results show that LIME can effectively explain the learning-based automatic vulnerability detection model. Moreover, we find that under the condition of local interpretation, the predictions of the model based on Bi-LSTM are credible.
Published: 2021

239. Automatic Classification of Zingiberales from RGB Images

Author: Manuel G. Forero, Christian González-Santos, and Carlos E. Beltrán
Subjects: Support vector machine, GrabCut, Contextual image classification, Bag-of-words model, Computer science, business.industry, Color balance, RGB color model, Segmentation, Pattern recognition, Visual Word, Artificial intelligence, business
Abstract: Colombia is the country with the largest number of plant species in the world. Within it, Zingiberaceae plays an important ecological role within ecosystems, acting as pioneers in the process of natural regeneration of vegetation and restoration of degraded soils. In addition, they maintain important coevolutionary relationships with other animal and plant species, becoming an important element within the complex web of life in the tropics. Manual classification is time consuming, expensive and requires experts who often have limited availability. To address these problems, three image classification methods were used in this work: SVM, and KNN with Euclidean and intersection distances. The database used for training, testing and validation of the methods comprises RGB images taken in the natural habitat of the Zingiberales, from their germination to their optimal cutting time. The images were pre-processed by adjusting white balance, contrast and color temperature. To separate the Zingiberales from the background, a graph-based segmentation technique, GrabCut, was used. The descriptors were obtained using the technique known as BoW, finding that the number of visual words most suitable for classification was between 20 and 40. It was found that a better classification result was obtained by separating the flowers of a species into two subclasses, due to their different coloration. The best results were obtained with the KNN method, using the three closest neighbors, obtaining an accuracy of 97%.
Published: 2021

240. The Relevance of Preprocessing in Text Classification

Author: Manisha Mali and Mohammad Atique
Subjects: Naive Bayes classifier, ComputingMethodologies_PATTERNRECOGNITION, Stochastic gradient descent, Knowledge extraction, Computer science, Bag-of-words model, Classifier (linguistics), Preprocessor, Unstructured data, Relevance (information retrieval), Data mining, computer.software_genre, computer
Abstract: The tremendous growth of information emerging from diverse sources has resulted in digital information overload. The vast availability of information has created the need for automatic text classification for managing and organizing massive data and for knowledge discovery. Most of this data is in an unstructured form. The main hindrance to achieving good classification accuracy is the natural, i.e., unstructured, form of the data. Such issues can be handled by applying preprocessing steps so that unstructured data can be transformed into structured data. In this paper, we have applied different preprocessing steps like bag of words, stemming, and lemmatization on raw data to study the impact of preprocessing in the text classification process. We have verified this by providing preprocessed data to various classifiers like multinomial Naive Bayes, logistic regression, stochastic gradient descent classifier, and k-nearest neighbors, and found improvement in classification accuracy results. We have carried out experimentation on the 20-newsgroup dataset.
Published: 2021
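
The bag-of-words step named in the abstract above turns unstructured text into fixed-length count vectors that a classifier can consume. A minimal sketch, assuming a simple lowercase alphanumeric tokenizer (the helper names and example sentences are illustrative and not taken from the paper):

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Assign a column index to every distinct token, in sorted order."""
    terms = sorted({token for doc in documents for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs; word order is discarded."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        if term in vocabulary:  # terms unseen at vocabulary-building time are ignored
            vector[vocabulary[term]] = count
    return vector


docs = ["The cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)       # columns: cat, dog, sat, saw, the
print(bag_of_words(docs[1], vocab))  # [1, 1, 0, 1, 2]
```

Stemming or lemmatization, also mentioned in the abstract, would be applied inside `tokenize` so that inflected forms share one column.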

241. Sentiment analysis and classification of Indian farmers’ protest using twitter data

Author: Ram Krishn Mishra, Yogesh K. Dwivedi, Ashwin Sanjay Neogi, and Kirti Anilkumar Garg
Subjects: Government, Microblogging, Bag-of-words, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Sentiment analysis, TF-IDF, Decision tree, Advertising, Information technology, T58.5-58.64, Farmers’ protest, Random forest, Naive Bayes classifier, Bag-of-words model, Machine learning, Social media, Sociology
Abstract: Protests are an integral part of democracy and an important source for citizens to convey their demands and/or dissatisfaction to the government. As citizens become more aware of their rights, there has been an increasing number of protests all over the world for various reasons. With the advancement of technology, there has also been an exponential rise in the use of social media to exchange information and ideas. In this research, we gathered data from the microblogging website Twitter concerning farmers’ protest to understand the sentiments that the public shared on an international level. We used models to categorize and analyze the sentiments based on a collection of around 20,000 tweets on the protest. We conducted our analysis using Bag of Words and TF-IDF and discovered that Bag of Words performed better than TF-IDF. In addition, we also used Naive Bayes, Decision Trees, Random Forests, and Support Vector Machines and also discovered that Random Forest had the highest classification accuracy.
Published: 2021
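
The abstract above compares Bag of Words with TF-IDF features. The difference is that TF-IDF rescales each raw count by how rare the term is across the collection. A minimal sketch of one common variant, tf × log(N / df) (libraries often add smoothing, and the weighting the authors used is not specified in the abstract; the function name and toy tokens are assumptions):

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weighted.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical pre-tokenized tweets.
tweets = [["farm", "protest"], ["protest", "news"]]
for weights in tf_idf(tweets):
    print(weights)
```

In this toy input, "protest" occurs in every document and therefore receives weight 0, whereas a plain bag of words would count it once in each tweet.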

242. Learning to Organize a Bag of Words into Sentences with Neural Networks: An Empirical Study

Author: Chongyang Tao, Dongyan Zhao, Shen Gao, Juntao Li, Rui Yan, and Yansong Feng
Subjects: Sequence, Artificial neural network, Computer science, business.industry, 02 engineering and technology, 010501 environmental sciences, computer.software_genre, 01 natural sciences, Convolutional neural network, Recurrent neural network, Empirical research, Bag-of-words model, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing, Natural language, Sentence, 0105 earth and related environmental sciences
Abstract: Sequential information, a.k.a. order, is assumed to be essential for processing a sequence with recurrent neural network or convolutional neural network based encoders. However, is it possible to encode natural languages without order? Given a bag of words from a disordered sentence, humans may still be able to understand what those words mean by reordering or reconstructing them. Inspired by such an intuition, in this paper, we perform a study to investigate how “order” information takes effect in natural language learning. By running comprehensive comparisons, we quantitatively compare the ability of several representative neural models to organize sentences from a bag of words under three typical scenarios, and summarize some empirical findings and challenges, which can shed light on future research on this line of work.
Published: 2021

243. A Deep Learning Model Based on Neural Bag-of-Words Attention for Sentiment Analysis

Author: Jing Liao and Zhixiang Yi
Subjects: business.industry, Bag-of-words model, Computer science, Deep learning, Sentiment analysis, Benchmark (computing), Artificial intelligence, business, Machine learning, computer.software_genre, computer, Field (computer science)
Abstract: In the field of Natural Language Processing, sentiment analysis is one of the core research directions. A hot issue in sentiment analysis is how to avoid the shortcoming of using a fixed vector to calculate the attention distribution. In this paper, we propose a novel sentiment analysis model based on neural bag-of-words attention, which utilizes Bidirectional Long Short-Term Memory (BiLSTM) to capture the deep semantic features of text and fuses these features using an attention distribution based on neural bag-of-words. The experimental results show that the proposed method improves accuracy by 2.53%–6.46% compared with the benchmark.
Published: 2021

244. UniParma at SemEval-2021 Task 5: Toxic Spans Detection Using CharacterBERT and Bag-of-Words Model

Author: Akbar Karimi, Andrea Prati, and Leonardo Rossi
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Character (computing), business.industry, Computer science, Context (language use), computer.software_genre, Spelling, SemEval, Bag-of-words model, Code (cryptography), Language model, Artificial intelligence, business, computer, Computation and Language (cs.CL), Word (computer architecture), Natural language processing
Abstract: With the ever-increasing availability of digital information, toxic content is also on the rise. Therefore, the detection of this type of language is of paramount importance. We tackle this problem utilizing a combination of a state-of-the-art pre-trained language model (CharacterBERT) and a traditional bag-of-words technique. Since the content is full of toxic words that have not been written according to their dictionary spelling, attention to individual characters is crucial. Therefore, we use CharacterBERT to extract features based on the word characters. It consists of a CharacterCNN module that learns character embeddings from the context. These are then fed into the well-known BERT architecture. The bag-of-words method, on the other hand, further improves upon that by making sure that some frequently used toxic words get labeled accordingly. With a ∼4 percent difference from the first team, our system ranked 36th in the competition. The code is available for further research and reproduction of the results.
Published: 2021
Full Text: View/download PDF

245. An ensemble predictive analytics of COVID-19 infodemic tweets using bag of words

Author: Taiwo Olaleye, Adebayo Abayomi-Alli, A.K. Adesemowo, and Oluwasefunmi 'Tale Arogundade
Subjects: Information retrieval, Computer science, Bag-of-words model, Classification rule, Social media, Decision stump, Predictive analytics, Perceptron, Classifier (UML), Cross-validation
Abstract: Fake COVID-19 tweets appear legitimate and appealing to unsuspecting internet users because of a lack of prior knowledge of the novel pandemic. Such news could be misleading, counterproductive, unethical, unprofessional, and sometimes constitute a clog in the wheel of global efforts toward flattening the virus spread curve. Therefore, aside from the COVID-19 pandemic itself, dealing with fake news and myths about the virus constitutes an infodemic issue which must be tackled to ensure that only valid information is consumed by the public. Following the research approach, this chapter aims at a predictive analytics of COVID-19 infodemic tweets that generates a classification rule and validates genuine information from verified accredited health institutions/sources. On deployment of classifier Vote ensembles formed by the base classifiers SMO, Voted Perceptron, Liblinear, Reptree, and Decision Stump on a dataset of 81,456 tokenized Bag of Words, which encapsulates 2964 COVID-19 tweet instances and 3169 extracted numeric vector attributes, the experimental result shows a novel 99.93% prediction accuracy on 10-fold cross validation, while the information gain of each of the 3169 extracted attributes is ranked to ascertain the most significant COVID-19 tweet-words for the detection system. Other performance metrics, including ROC area and Relief-F, validate the reliability of the model and return SMO as the most efficient base classifier. The thrust of the model centers more on the trustworthiness of the COVID-19 tweet source than on the truthfulness of the tweet, which underscores the prominence of verified health institutions and contributes to the discourse on the inhibition and impact of fake news, especially during societal pandemics. The COVID-19 infodemic detection algorithm provides insight into a new spin on fake news in the age of social media and the era of pandemics.
Published: 2021

246. Deep Learning and Natural Language Processing for fake news detection: A Survey

Author: Aditya Chokshi and Rejo Mathew
Subjects: Fallacy, Recurrent neural network, business.industry, Computer science, Bag-of-words model, Deep learning, Internet privacy, The Internet, Artificial intelligence, Broadcasting, business, Construct (philosophy), Convolutional neural network
Abstract: Fake news is a term for falsehood in information, content, or statistics and facts revealed to the public in order to gain attention, to abuse someone as a means of acquiring benefits at another entity's expense, or to incite bloodshed among people. A survey found that 86% of users have been tricked by fake news, for which the foremost disseminator is Facebook. A recent example occurred during the Citizenship Amendment Act (CAA) protests in 2020, where a survey by a team of news reporters showed that almost 95% of the protestors did not know about the act and had been led by fake news to think that their citizenship would be taken away. This paper discusses various deep learning methods that focus on stopping the spread of vague news over the internet. Deep learning architectures are considered because fake news detection deals with colossal amounts of data. These include the artificial neural network, which concentrates on classifying text-based news, and the convolutional neural network, which deals with classifying the text or images of the updates people receive online. They can also be used to verify information based on the title or the source of the data. Other architectures, like the recurrent neural network, can be used to spot otherwise unrecognizable patterns in the content with the aid of its long short-term memory (LSTM) variant.
Published: 2021

247. Cricket Stroke Recognition Using Hard and Soft Assignment Based Bag of Visual Words

Author: Ashish Karel, Arpan Gupta, and Sakthi Balan Muthiah
Subjects: Computer science, business.industry, Orientation (computer vision), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Optical flow, Pattern recognition, Activity recognition, ComputingMethodologies_PATTERNRECOGNITION, Bag-of-words model, Bag-of-words model in computer vision, Histogram, Feature (machine learning), Noise (video), Artificial intelligence, business
Abstract: In this work, we deal with the problem of activity recognition in Cricket telecast videos. We present a supervised approach for recognizing Cricket stroke categories using two variants of the Bag of Visual words (BoV) model applied to dense optical flow based motion features (i.e., grid-based flattened vectors and orientation histograms), 3D ResNet extracted features, and 2D ResNet extracted spatial features. These globally extracted features, in spite of the noise due to camera motion, give good results on the Cricket strokes dataset of 562 trimmed stroke videos. We independently labeled the strokes based on the direction of stroke play and the direction of camera motion into five and three categories, respectively, and provide experimental analysis of Hard Assignment (HA) and Soft Assignment (SA) based BoV methods.
Published: 2021
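
The abstract above contrasts Hard Assignment and Soft Assignment when building Bag of Visual words histograms. The distinction can be sketched as follows (an illustrative example, not the authors' code; the Gaussian kernel, the `sigma` parameter, and the toy codebook are assumptions, since the abstract does not specify the soft-assignment weighting):

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_histogram(descriptors, codebook):
    """Each descriptor votes only for its single nearest codeword."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda j: squared_distance(d, codebook[j]))
        hist[nearest] += 1.0
    return hist


def soft_histogram(descriptors, codebook, sigma=1.0):
    """Each descriptor spreads one unit of vote over all codewords by proximity."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        weights = [math.exp(-squared_distance(d, c) / (2 * sigma ** 2))
                   for c in codebook]
        total = sum(weights)
        if total == 0.0:  # descriptor too far from every codeword; skip it
            continue
        for j, w in enumerate(weights):
            hist[j] += w / total
    return hist


# Hypothetical 2-D codebook (e.g. cluster centres) and two feature descriptors.
codebook = [[0.0, 0.0], [2.0, 0.0]]
descriptors = [[0.0, 0.0], [1.9, 0.0]]
print(hard_histogram(descriptors, codebook))
print(soft_histogram(descriptors, codebook))
```

Hard assignment discards how close a descriptor was to the runner-up codeword, while soft assignment retains that information at the cost of a denser histogram.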

248. A Novel Image Classification Algorithm

Author: Shu-jian Shi
Subjects: Contextual image classification, Computer science, business.industry, Feature vector, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Corner detection, Visual dictionary, Scale-invariant feature transform, Pattern recognition, Fuzzy logic, ComputingMethodologies_PATTERNRECOGNITION, Bag-of-words model, Histogram, Artificial intelligence, business
Abstract: In this paper the bag-of-words model is applied to image classification, and existing problems of the traditional bag-of-words method are addressed. We propose a method that combines corner detection and graph theory for ROI extraction with fuzzy membership degrees. First, corner detection is applied to the images, and the ROI is then defined by a graph-theoretic method. Then the SIFT features of the ROI are extracted and the visual dictionary is generated. This visual dictionary describes the image features more accurately, which reduces the influence of background and other interfering information. Second, the concept of a fuzzy membership function and information about the feature space are introduced to improve the visual histogram of the image. Finally, a support vector machine classifier is used for classification. Experiments on the Caltech 100 database show that the method improves classification accuracy compared with the traditional method.
Published: 2021

249. Classification of Noisy Free-Text Prostate Cancer Pathology Reports Using Natural Language Processing

Author: Henning Müller, Manfredo Atzori, Anjani Dhrangadhariya, and Sebastian Otálora
Subjects: Pathology, medicine.medical_specialty, Computer science, Paragraph embeddings, computer.software_genre, ENCODE, Logistic regression, 01 natural sciences, 010309 optics, 03 medical and health sciences, 0302 clinical medicine, Natural language processing, Pathology reports, 0103 physical sciences, Classifier (linguistics), medicine, Interpretability, Structure (mathematical logic), Clinical pathology, business.industry, Bag-of-words model, 030220 oncology & carcinogenesis, Artificial intelligence, Paragraph, business, computer
Abstract: Free-text reporting has been the main approach in clinical pathology practice for decades. Pathology reports are an essential information source to guide the treatment of cancer patients and for cancer registries, which process high volumes of free-text reports annually. Information coding and extraction are usually performed manually and it is an expensive and time-consuming process, since reports vary widely between institutions, usually contain noise and do not have a standard structure. This paper presents strategies based on natural language processing (NLP) models to classify noisy free-text pathology reports of high and low-grade prostate cancer from the open-source repository TCGA (The Cancer Genome Atlas). We used paragraph vectors to encode the reports and compared them with n-grams and TF-IDF representations. The best representation based on distributed bag of words of paragraph vectors obtained an \(f_{1}\)-score of 0.858 and an AUC of 0.854 using a logistic regression classifier. We investigate the classifier’s more relevant words in each case using the LIME interpretability tool, confirming the classifiers’ usefulness to select relevant diagnostic words. Our results show the feasibility of using paragraph embeddings to represent and classify pathology reports.
Published: 2021

250. A Review on Word Embedding Techniques for Text Classification

Author: R. Kanniga Devi and S. Selva Birunda
Subjects: Word embedding, Artificial neural network, business.industry, Computer science, Space (commercial competition), computer.software_genre, Bag-of-words model, Artificial intelligence, business, computer, Scope (computer science), Natural language processing, Word (computer architecture), Sentence, Transformer (machine learning model)
Abstract: Word embeddings are fundamentally a form of word representation that links the human understanding of knowledge meaningfully to the understanding of a machine. The representations can be a set of real numbers (a vector). Word embeddings are a distributed depiction of a text in an n-dimensional space, which tries to capture the word meanings. This paper aims to provide an overview of the different types of word embedding techniques. The review finds that there are three dominant kinds of word embeddings, namely traditional word embedding, static word embedding, and contextualized word embedding. BERT is a bidirectional transformer-based contextualized word embedding which is more efficient as it can be pre-trained and fine-tuned. As a future scope, this word embedding along with neural network models can be used to increase model accuracy, and it excels in sentiment classification, text classification, next sentence prediction, and other Natural Language Processing tasks. Some open issues and the future research scope for improving word representation are also discussed.
Published: 2021
