LDA filter: A Latent Dirichlet Allocation preprocess method for Weka
- Author
-
A. Seara Vieira, Lourdes Borrajo, P. Celard, and Eva Lorenzo Iglesias
- Subjects
Latent Dirichlet allocation, Text mining, Machine learning, Statistical classification, Support vector machine, Naive Bayes classifier, Bag-of-words model, Preprocessing, Cluster analysis, Pattern recognition, Bayes theorem, Information storage and retrieval, Artificial intelligence, Semantics, Vocabulary, Algorithms, Computer science - Abstract
This work presents an alternative method of representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms compared to a common text representation. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter extending the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on several document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015), and compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW while greatly improving classification processing times. Funding: Xunta de Galicia | Ref. ED431C2018/55
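The core idea in the abstract — replacing the wide Bag-of-Words vector with each document's probability distribution over LDA topics — can be sketched outside Weka. The paper's actual implementation is a Java filter for Weka; the snippet below is only an illustrative analogue using scikit-learn, with a toy corpus and topic count chosen for the example:

```python
# Illustrative sketch (not the paper's Weka filter): represent documents
# by their LDA topic probabilities instead of raw Bag-of-Words counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for a real collection such as OHSUMED.
docs = [
    "the heart muscle pumps blood through the body",
    "machine learning algorithms classify text documents",
    "support vector machines separate classes with a margin",
    "cardiac muscle physiology regulates blood flow",
]

# Step 1: the baseline Bag-of-Words representation (one column per term).
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# Step 2: LDA compresses each document into a distribution over k topics.
k = 2  # number of topics, an assumption for this toy example
lda = LatentDirichletAllocation(n_components=k, random_state=0)
X_topics = lda.fit_transform(X_bow)  # shape: (n_docs, k)

# Each row is a probability vector over topics (it sums to 1). This short,
# dense vector replaces the much wider BoW vector as the classifier input,
# which is why classification is faster at similar accuracy.
print(X_topics.shape)
```

The dimensionality drop is the key point: a BoW vector has one component per vocabulary term (often tens of thousands), while the topic vector has only `k` components, so downstream classifiers such as SVM, k-NN, or Naive Bayes train and predict on far fewer features.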
- Published
- 2020