1,184 results on '"TRIGRAM"'
Search Results
2. Context-Based Clustering of Assamese Words using N-gram Model
- Author
-
Bhuyan, M. P., Sarma, S. K., Sarma, P., Sengodan, Thangaprakash, editor, Murugappan, M., editor, and Misra, Sanjay, editor
- Published
- 2021
- Full Text
- View/download PDF
3. Text Predictor for RTL Languages (Urdu, Arabic, Persian, and similar) of Middle East.
- Author
-
Ahmad, Ashfaq, Idrees, Muhammad, and Danish, Hafiz Muhammad
- Abstract
It is difficult to predict, especially about the future. But predicting some future words, such as the next few words someone is going to say, seems easier. For example, after "please come", the words "in" and "on" are likely to come next. A next-word predictor is an important tool for content writing in any language, especially where the correct word must be anticipated in the current context. Writing content in other languages using the English keyboard is a tedious task; predicting the next word can make this task simpler, easier, and more interesting, and a significant amount of time and human effort can be saved. In this paper, an Urdu Text Predictor is introduced in which statistical N-gram language models are used to build a word/text prediction system. [ABSTRACT FROM AUTHOR]
- Published
- 2022
4. Online Textual Symptomatic Assessment Chatbot Based on Q&A Weighted Scoring for Female Breast Cancer Prescreening.
- Author
-
Chen, Jen-Hui, Agbodike, Obinna, Kuo, Wen-Ling, Wang, Lei, Huang, Chiao-Hua, Shen, Yu-Shian, and Chen, Bing-Hong
- Subjects
MEDICAL personnel ,ARTIFICIAL intelligence ,BREAST cancer ,MEDICAL consultation ,SYMPTOMS ,NATURAL language processing ,MEDICAL communication - Abstract
The increasing number of female breast cancer (FBC) incidences in the East predominated by Chinese language speakers has generated concerns over women's medicare. To minimize the mortality rate associated with FBC in the region, governments and health experts are jointly encouraging women to undergo mammography screening at the earliest suspicion of FBC symptoms. However, studies show that a huge number of women affected by FBC tend to delay medical consultation at its early stage as a result of factors such as complacency due to unawareness of FBC symptoms, procrastination due to lifestyle, and the feeling of embarrassment in discussing private matters especially with medical personnel of the opposite gender. To address these issues, we propose a symptomatic assessment chatbot (SAC) based on artificial intelligence (AI) designed to prescreen women for FBC symptoms via a textual question-and-answer (Q&A) approach. The purpose of our chatbot is to assist women in engaging in communication regarding FBC symptoms, so as to subsequently initiate formal medical consultations for early FBC diagnosis and treatment. We implemented the SAC systematically with some of the latest natural language processing (NLP) techniques suitable for Chinese word segmentation (CWS) and trained the model with real-world FBC Q&A data obtained from a major hospital in Taiwan. The results from our experiments showed that the SAC achieved very high accuracy in FBC assessment scoring in comparison to FBC patients' screening benchmark scores obtained from doctors. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
5. Enhanced DSSM (deep semantic structure modelling) technique for job recommendation
- Author
-
Sheetal Rathi and Ravita Mishra
- Subjects
Structure (mathematical logic) ,General Computer Science ,Computer science ,business.industry ,Job description ,Recommender system ,Machine learning ,computer.software_genre ,Information overload ,Cold start ,Scalability ,Collaborative filtering ,Trigram ,Artificial intelligence ,business ,computer - Abstract
Nowadays, recommendation systems address the problem of massive information overload and help candidates concentrate on relevant information in the job domain. The job recommender system plays an important role in the recruitment of both fresh and experienced candidates. Existing job recommender systems mainly focus on content-based filtering to extract profile content and on collaborative filtering to capture user behaviour in the form of ratings. The dynamic nature of the job market leads to cold-start and scalability issues. This problem can be addressed by item-based collaborative filtering with a machine learning technique that learns a job embedding vector and finds content-wise similar jobs. Existing models in the job recommender domain use confined models to address the cold-start and scalability issues and provide better recommendations, but they fail to capture the complex relationships between job descriptions and candidate profiles. In this paper, we propose a deep semantic structure algorithm that overcomes these issues. The deep semantic structure modelling (DSSM) system uses a semantic representation of sparse data and represents job descriptions and skill entities in character-trigram format, which increases the efficacy of the system. We compare the results of three variations of the DSSM model on two different datasets (Naukri.com and CareerBuilder.com), with satisfactory results. Experimental results show that the DSSM embedding model and its variants provide promising results in solving the cold-start problem in comparison with several other variants of embedding models. We used the Xavier initializer to initialize the model parameters and the Adam optimizer to optimize system performance.
- Published
- 2022
- Full Text
- View/download PDF
6. AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
- Author
-
Md. Tarek Habib, Md. Masudul Haque, and Md. Mokhlesur Rahman
- Subjects
FOS: Computer and information sciences ,Computer science ,Speech recognition ,Bigram ,computer.software_genre ,deleted interpolation ,stochastic model ,Typing ,natural language processing ,Computer Science - Computation and Language ,backoff method ,business.industry ,corpus, N-gram ,Word prediction ,language.human_language ,Bengali ,language ,Trigram ,Language model ,Artificial intelligence ,business ,computer ,Computation and Language (cs.CL) ,Natural language processing ,Word (computer architecture) ,Sentence - Abstract
Word completion and word prediction are two important phenomena in typing that benefit users who type using a keyboard or other similar devices, and they can have a profound impact on typing for disabled people. Our work is based on word prediction for Bangla sentences using stochastic, i.e. N-gram, language models such as unigram, bigram, trigram, deleted interpolation, and backoff models, auto-completing a sentence by predicting the correct word, which saves typing time and keystrokes and also reduces misspelling. We use a large Bangla corpus of different word types to predict the correct word with as much accuracy as possible. We have found promising results and hope that our work will serve as a baseline for automated Bangla typing., Comment: in International Journal in Foundations of Computer Science & Technology (IJFCST) Vol.5, No.6, November 2015
- Published
- 2023
- Full Text
- View/download PDF
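The backoff idea described in the abstract above (try the trigram context first, then fall back to bigram and unigram counts) can be sketched in a few lines. This is a minimal illustration on an invented toy corpus, not the paper's Bangla data or its deleted-interpolation weighting:

```python
from collections import Counter

# Toy corpus standing in for the paper's Bangla corpus (invented sentences,
# purely for illustration).
corpus = [
    "i want to eat rice",
    "i want to read books",
    "i like to read books",
]

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
    trigrams.update(zip(words, words[1:], words[2:]))

def predict_next(w1, w2):
    """Predict the next word, backing off from trigram to bigram to unigram."""
    cands = {w3: c for (a, b, w3), c in trigrams.items() if (a, b) == (w1, w2)}
    if not cands:
        cands = {b: c for (a, b), c in bigrams.items() if a == w2}
    if not cands:
        cands = dict(unigrams)
    return max(cands, key=cands.get)

print(predict_next("like", "to"))   # trigram context is available here
print(predict_next("books", "i"))   # no trigram match, backs off to bigram
```

A production model would use smoothed probabilities rather than raw counts, but the fallback order is the same.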
7. MIRD: Trigram-Based Malicious URL Detection Implanted with Random Domain Name Recognition
- Author
-
Xiong, Cuiwen, Li, Pengxiao, Zhang, Peng, Liu, Qingyun, Tan, Jianlong, Niu, Wenjia, editor, Li, Gang, editor, Liu, Jiqiang, editor, Tan, Jianlong, editor, Guo, Li, editor, Han, Zhen, editor, and Batten, Lynn, editor
- Published
- 2015
- Full Text
- View/download PDF
8. Analysis of Public Sentiment on COVID-19 Vaccination Using Twitter
- Author
-
Vinay Kumar, Binod Kumar Singh, Gutti Gowri Jayasurya, and Sanjay Kumar
- Subjects
Computer science ,business.industry ,Bigram ,Sentiment analysis ,Logistic regression ,computer.software_genre ,Human-Computer Interaction ,Naive Bayes classifier ,Modeling and Simulation ,Classifier (linguistics) ,Trigram ,Social media ,Artificial intelligence ,business ,computer ,Social Sciences (miscellaneous) ,Natural language processing ,Natural language - Abstract
Social media has become a vital platform for individuals, organizations, and governments worldwide to communicate and express their views. During the coronavirus disease 2019 (COVID-19) pandemic, social media sites played a crucial role in how people communicated, shared, and expressed their perceptions on various topics. Analyzing such textual data can improve the response time of governments and organizations to act on alarming issues. This study aims to perform sentiment analysis on the subject of COVID-19 vaccination, perform temporal and spatial analyses of the textual data, and find the most frequently discussed topics, which may help organizations bring awareness to those topics. In this work, sentiment analysis of tweets was performed using 14 different machine learning classifiers and natural language processing (NLP). The lexicon-based TextBlob and Vader are used for annotating the data, and a natural language toolkit is used for preprocessing the textual data. Our analysis observed that unigram models outperform bigram and trigram models for all four datasets. Models using term frequency-inverse document frequency (TF-IDF) have higher accuracy than models using a count vectorizer. In the count vectorizer class, logistic regression has the best average accuracy with 91.925%. In the TF-IDF class, logistic regression has the best average accuracy of 92%; logistic regression has the highest average recall, F1-score, and ten-fold cross-validation score, and a ridge classifier has the highest average precision. The unigram models show a standard deviation (SD) of less than 1 for all classifiers except Gaussian Naive Bayes, at 1.18. The experimental results reveal the dates and times at which most positive, negative, and neutral tweets are posted.
- Published
- 2022
- Full Text
- View/download PDF
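Since the abstract above contrasts count-vectorizer and TF-IDF features, a minimal sketch of the TF-IDF weighting itself may help. This uses the plain textbook formula on invented toy documents; library implementations (e.g., scikit-learn) add smoothing, so exact values differ:

```python
import math

# Invented toy "tweets", tokenized; not the paper's dataset.
docs = [["vaccine", "is", "safe"],
        ["vaccine", "causes", "worry"],
        ["safe", "and", "effective", "safe"]]

def tf_idf(term, doc, docs):
    """Term frequency times (unsmoothed) inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)   # number of documents containing the term
    return tf * math.log(len(docs) / df)

# "safe" is frequent in the third document, but because it appears in 2 of
# 3 documents, its idf discounts it relative to a rarer term.
print(round(tf_idf("safe", docs[2], docs), 4))
```

A count-vectorizer feature would be just `doc.count(term)`; TF-IDF reweights it so that terms common across all documents contribute less.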
9. Online Textual Symptomatic Assessment Chatbot Based on Q&A Weighted Scoring for Female Breast Cancer Prescreening
- Author
-
Jen-Hui Chen, Obinna Agbodike, Wen-Ling Kuo, Lei Wang, Chiao-Hua Huang, Yu-Shian Shen, and Bing-Hong Chen
- Subjects
female breast cancer (FBC) ,chatbot ,patient-centric healthcare ,NLP ,trigram ,CWS ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
The increasing number of female breast cancer (FBC) incidences in the East predominated by Chinese language speakers has generated concerns over women’s medicare. To minimize the mortality rate associated with FBC in the region, governments and health experts are jointly encouraging women to undergo mammography screening at the earliest suspicion of FBC symptoms. However, studies show that a huge number of women affected by FBC tend to delay medical consultation at its early stage as a result of factors such as complacency due to unawareness of FBC symptoms, procrastination due to lifestyle, and the feeling of embarrassment in discussing private matters especially with medical personnel of the opposite gender. To address these issues, we propose a symptomatic assessment chatbot (SAC) based on artificial intelligence (AI) designed to prescreen women for FBC symptoms via a textual question-and-answer (Q&A) approach. The purpose of our chatbot is to assist women in engaging in communication regarding FBC symptoms, so as to subsequently initiate formal medical consultations for early FBC diagnosis and treatment. We implemented the SAC systematically with some of the latest natural language processing (NLP) techniques suitable for Chinese word segmentation (CWS) and trained the model with real-world FBC Q&A data obtained from a major hospital in Taiwan. The results from our experiments showed that the SAC achieved very high accuracy in FBC assessment scoring in comparison to FBC patients’ screening benchmark scores obtained from doctors.
- Published
- 2021
- Full Text
- View/download PDF
10. Neurovisual training (TRIGRAM) in young patients with visual-perceptive dyslexia
- Author
-
Fernanda Pacella, Raffaele Migliorini, Chiara Marchegiani, Alessandro Segnalini, Paolo Turchetti, Sandra Cinzia Carlesimo, and Elena Pacella
- Subjects
Dyslexia ,visual-perceptive ,protocol TETRA ,Neurovisual training ,TRIGRAM ,children ,Medicine ,Science - Abstract
Dyslexia is a language-based learning disability. Although this condition is characterized by anatomical malformation of the brain, the typical reading pattern of dyslexics may also be related to more complex sensory deficits. Among them, visual-perceptive deficits have been described in a subtype of dyslexia called visual-perceptive dyslexia. The distinctive feature of a patient suffering from this form of dyslexia is the difficulty of effortlessly recognizing the characteristics of each individual stimulus. The TETRA protocol is a visual-perceptual evaluation protocol introduced for the diagnosis and rehabilitation of visual-perceptive dyslexia. The diagnostic tests include: the eidomorphometry test, designed to evaluate the perception of spatial relationships; the contrast sensitivity threshold test, especially at low spatial frequencies; and the word REPORT TEST, to assess reading speed and efficiency. The rehabilitation phase is carried out with the visual neuro-enhancement program TRIGRAM, a visual training program designed to reduce the lateral masking phenomenon in visual-perceptive dyslexics. In this study we used the diagnostic tests of the TETRA® Protocol to determine the presence of visual-perceptual abnormalities in children with dyslexia. Once the presence of these visual-perceptual alterations was confirmed, the patients also underwent the TRIGRAM rehabilitation sessions, in order to investigate whether this visual training may improve the reading pattern. At the end of the program (t1) and after three months (t2), the same subjects underwent the same TETRA® Protocol diagnostic tests to evaluate and confirm the results obtained during the rehabilitation program. The results showed a significant increase in contrast sensitivity at low and high spatial frequencies. Moreover, the same improvements in the visual system's ability to discriminate the contours of an object within the field of view were maintained three months after the end of treatment. We also observed a significant improvement in the perception of spatial relationships, with a reduction of the SRA value. In conclusion, this study demonstrates that the visual rehabilitation training (TRIGRAM) is able to improve the perception of spatial relationships and increase contrast sensitivity in young patients affected by "visual dyslexia". Nonetheless, these data need to be confirmed in a larger cohort of subjects to establish whether these effects can also increase lexical ability (increased reading speed and reduced errors during lexical tasks).
- Published
- 2017
- Full Text
- View/download PDF
11. Comparative analysis of machine learning-based classification models using sentiment classification of tweets related to COVID-19 pandemic
- Author
-
Kamal Gulati, S. Saravana Kumar, Raja Sarath Kumar Boddu, Ketan Sarvakar, Dilip Kumar Sharma, and M.Z.M. Nomani
- Subjects
010302 applied physics ,Data collection ,business.industry ,Computer science ,Bigram ,Sentiment analysis ,02 engineering and technology ,021001 nanoscience & nanotechnology ,Perceptron ,Lexicon ,Machine learning ,computer.software_genre ,01 natural sciences ,ComputingMethodologies_PATTERNRECOGNITION ,0103 physical sciences ,Classifier (linguistics) ,Trigram ,AdaBoost ,Artificial intelligence ,0210 nano-technology ,business ,computer - Abstract
Sentiment analysis (SA) is the area of research that extracts useful information from the sentiments people share on social networking platforms like Twitter and Facebook. Such analysis is used to classify sentiments as positive, negative, or neutral. Sentiment classification can be done with a traditional lexicon-based approach or with machine learning techniques. In this research paper, we present a comparative analysis of popular machine learning-based classifiers. We experimented with tweet datasets related to the COVID-19 epidemic, applying seven machine learning-based classifiers to more than 72,000 tweets related to COVID-19. We performed experiments in three modes, i.e., unigram, bigram, and trigram. As per the results, Linear SVC, Perceptron, Passive Aggressive Classifier, and Logistic Regression achieve a maximum classification accuracy above 98% (unigram, bigram, trigram) and are very close to each other in performance. The average accuracies achieved by Linear SVC, Perceptron, Passive Aggressive Classifier, and Logistic Regression are 0.981573613, 0.976506357, 0.981573613, and 0.976690621, respectively. The AdaBoost classifier performs worst among all classifiers, with an average accuracy of 0.731435416. Details regarding data collection, experiments, and results are presented in the paper.
- Published
- 2022
- Full Text
- View/download PDF
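The unigram/bigram/trigram "modes" mentioned above come down to how feature strings are cut from each tweet before any classifier sees them. A minimal sketch (the example text is invented, not from the paper's dataset):

```python
def ngram_features(text, n):
    """Return the word n-grams of a text as strings
    (n = 1, 2, 3 for the unigram, bigram, and trigram modes)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tweet = "vaccination drive starts today"
for n in (1, 2, 3):
    print(ngram_features(tweet, n))
```

Each such feature list is then vectorized (e.g., by count or TF-IDF) and fed to the classifiers being compared.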
12. Performance Analysis of Case Based Word Sense Disambiguation with Minimal Features Using Neural Network
- Author
-
Tamilselvi, P., Srivatsa, S. K., Krishna, P. Venkata, editor, Babu, M. Rajasekhara, editor, and Ariwa, Ezendu, editor
- Published
- 2012
- Full Text
- View/download PDF
13. Real-Word Error Correction with Trigrams: Correcting Multiple Errors in a Sentence
- Author
-
Seyed Mohammadsadegh Dashti
- Subjects
FOS: Computer and information sciences ,Linguistics and Language ,Computer science ,Computer Science - Artificial Intelligence ,media_common.quotation_subject ,Speech recognition ,WordNet ,02 engineering and technology ,Library and Information Sciences ,computer.software_genre ,Language and Linguistics ,Education ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,media_common ,Computer Science - Computation and Language ,Grammar ,business.industry ,Probabilistic logic ,Spelling ,Artificial Intelligence (cs.AI) ,020201 artificial intelligence & image processing ,Trigram ,Language model ,Artificial intelligence ,Computational linguistics ,business ,Error detection and correction ,Computation and Language (cs.CL) ,computer ,Natural language processing ,Sentence - Abstract
Spelling correction is a fundamental task in text mining. In this study, we assess the real-word error correction model proposed by Mays, Damerau and Mercer and describe several drawbacks of the model. We propose a new variation which focuses on detecting and correcting multiple real-word errors in a sentence, by manipulating a Probabilistic Context-Free Grammar (PCFG) to discriminate between items in the search space. We test our approach on the Wall Street Journal corpus and show that it outperforms Hirst and Budanitsky's WordNet-based method and Wilcox-O'Hearn, Hirst, and Budanitsky's fixed-window-size method.
- Published
- 2023
14. Improving Trigram Language Modeling with the World Wide Web
- Author
-
Roni Rosenfeld and Xiaojin Zhu
- Subjects
FOS: Computer and information sciences ,Information retrieval ,Phrase ,Computer science ,business.industry ,Word error rate ,computer.software_genre ,World Wide Web ,Trigram tagger ,Test set ,Web page ,89999 Information and Computing Sciences not elsewhere classified ,Trigram ,Language model ,Artificial intelligence ,business ,computer ,80107 Natural Language Processing ,Natural language ,Natural language processing - Abstract
We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models improve speech recognition word error rate significantly over a small test set.
- Published
- 2023
- Full Text
- View/download PDF
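The estimate described in the abstract above, a phrase-count ratio from search-engine hit counts interpolated with a corpus model, can be sketched as follows. All counts, the corpus probability, and the interpolation weight are invented for illustration, not taken from the paper:

```python
# Hypothetical page counts for the phrase queries "the stock market"
# and "the stock" (invented numbers).
web_counts = {
    ("the", "stock", "market"): 90_000,
    ("the", "stock"): 400_000,
}

def web_trigram_p(w1, w2, w3):
    """Web-based estimate P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)."""
    return web_counts[(w1, w2, w3)] / web_counts[(w1, w2)]

def interpolate(p_web, p_corpus, lam=0.3):
    """Linear interpolation of the two estimates; lam is a tunable
    weight assumed here, not a value from the paper."""
    return lam * p_web + (1 - lam) * p_corpus

p_corpus = 0.12  # assumed corpus-based trigram probability
p = interpolate(web_trigram_p("the", "stock", "market"), p_corpus)
print(round(p, 4))
```

In practice the interpolation weight would be tuned on held-out data, and raw hit counts would need the normalization the paper discusses.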
15. Performance Study of N-grams in the Analysis of Sentiments
- Author
-
A. Gelbukh, O. O. Adebanji, Hiram Calvo, and O. E. Ojo
- Subjects
Computer science ,business.industry ,Physics ,QC1-999 ,General Mathematics ,Deep learning ,media_common.quotation_subject ,Bigram ,Sentiment analysis ,deep learning ,General Physics and Astronomy ,General Chemistry ,computer.software_genre ,Domain (software engineering) ,economic texts ,machine learning ,sentiment analysis ,Trigram ,ngrams ,Artificial intelligence ,Function (engineering) ,business ,computer ,Natural language processing ,media_common - Abstract
In this work, we investigated the use of n-grams to classify sentiments with different machine learning and deep learning methods. We used this approach, which combines existing techniques, on the problem of predicting sequence tags to understand the advantages and problems of using unigrams, bigrams, and trigrams to analyse economic texts. Our study aims to fill the gap by evaluating the performance of these n-gram features on different texts in the economic domain using nine sentiment analysis techniques. We show that by comparing the performance of these features on different datasets using multiple learning techniques, we extracted useful intelligence. The evaluation assesses the precision, recall, F1-score, and accuracy of the output of the several machine learning algorithms proposed. The methods were tested using the Amazon, IMDB, Reuters, and Yelp economic review datasets, and our comprehensive experiments show the effectiveness of n-grams in the analysis of sentiments.
- Published
- 2021
- Full Text
- View/download PDF
16. Sentence Classification Using N-Grams in Urdu Language Text
- Author
-
Sikandar Ali, Malik Daler Ali Awan, Ali Samad, Nadeem Iqbal, Malik Muhammad Saad Missen, and Niamat Ullah
- Subjects
Learning classifier system ,Article Subject ,business.industry ,Computer science ,Bigram ,computer.software_genre ,language.human_language ,Computer Science Applications ,QA76.75-76.765 ,language ,The Internet ,Social media ,Trigram ,Computer software ,Local language ,Urdu ,Artificial intelligence ,business ,computer ,Software ,Natural language processing ,Sentence - Abstract
The usage of local languages is becoming common in social media and news channels. People share worthy insights on various topics related to their lives in different languages. A bulk of text in various local languages exists on the Internet and contains invaluable information; the analysis of such local-language text will certainly help improve a number of Natural Language Processing (NLP) tasks, and the information extracted from local languages can be used to develop various applications and add new milestones in the field of NLP. In this paper, we present an applied research task: multiclass sentence classification of Urdu-language text at the sentence level on social networks (i.e., Twitter and Facebook) and news channels, using N-gram features. Our dataset consists of more than 100,000 instances of twelve (12) different topics. The well-known machine learning classifier Random Forest is used to classify the sentences; it showed 80.15%, 76.88%, and 64.41% accuracy for unigram, bigram, and trigram features, respectively.
- Published
- 2021
- Full Text
- View/download PDF
17. Kyd, Shakespeare, and Arden of Faversham: Rerunning a Two-Horse Race
- Author
-
Darren Freebury-Jones
- Subjects
Cultural Studies ,Literature ,History ,Race (biology) ,Literature and Literary Theory ,business.industry ,Trigram ,business - Abstract
In a general essay published in the Times Literary Supplement in 2008, Brian Vickers combined close study of trigrams (sequences of three or more words) highlighted by anti-plagiarism software with...
- Published
- 2021
- Full Text
- View/download PDF
18. Kūn’s ‘body’: the many appearances and meanings of the pure even-numbered trigram in the early Yìjīng and related texts
- Author
-
Adam Schwartz
- Subjects
Chemistry ,General Engineering ,Trigram ,Linguistics - Abstract
One of the main reasons why words (i.e., ‘images’) in the Yìjīng and Guīcáng might appear so enigmatic is that they have become detached from the ‘pictures’ (guàhuà 卦畫) or ‘bodies’ (guàtǐ 卦體), as divination results, in which diviners first recognized them. This paper has two objectives. The first, as part of a larger database project, uses early Chinese excavated materials to reconstruct and reimage the many configurations and appearances of trigram Kūn’s ‘body’ (Kūn tǐ 坤體). Seeing and thinking about the pure even-numbered, yīn trigram in its original configurations leads us toward a deeper appreciation and understanding of the complexity of this early system of divination, and doing so is integral to investigating, as a thought experiment, the complex relationships between divination results (i.e., trigrams and hexagrams) and numbers, numbers and images, and images and predictions. Users of the Changes should no longer visualize Kūn’s ‘body’ as one-dimensional ☷ and . The second examines images of trigram Kūn in the Yìjīng, starting from the images in the canonical commentaries, and the Shuō guà commentary in particular, using hermeneutic principles of the ‘numbers and images’ tradition. The Shuō guà presents images either found in or to be extrapolated from the base text within a structured and highly interpretive system that creates ‘image programs’ for each of the eight trigrams. I argue that the Shuō guà’s image programs have a defined architecture, and that its images are not random lists of words collected without an agenda and devoid of relationships and mutual interaction with others.
- Published
- 2021
- Full Text
- View/download PDF
19. Fonts of wider letter shapes improve letter recognition in parafovea and periphery
- Author
-
Chiron A. T. Oderkerk and Sofie Beier
- Subjects
Computer science ,Speech recognition ,Parafovea ,Recognition, Psychology ,Physical Therapy, Sports Therapy and Rehabilitation ,Human Factors and Ergonomics ,Legibility ,Pattern Recognition, Visual ,Reading ,Typography ,Typeface ,Font ,Fixation (visual) ,Humans ,Trigram ,Set (psychology) - Abstract
Most text on modern electronic displays is set in fonts of regular letter width. Little is known about whether this is the optimal font width for letter recognition. We tested three variants of the font family Helvetica Neue (Condensed, Standard, and Extended). We ran two separate experiments at different distances and different retinal locations. In Experiment 1, the stimuli were presented in the parafovea at 2° eccentricity; in Experiment 2, the stimuli were presented in the periphery at 9° eccentricity. In both experiments, we employed a short-exposure single-report trigram paradigm in which a string of three letters was presented left or right off-centre. Participants were instructed to report the middle letter while maintaining fixation on the fixation cross. Wider fonts resulted in better recognition and fewer misreadings of neighbouring letters than narrower fonts, demonstrating that wider letter shapes improve recognition in glance reading in peripheral vision. Practitioner summary: Most text is set in fonts of regular letter width. In two single-target trigram letter recognition experiments, we showed that wider letter shapes facilitate better recognition than narrower letter shapes. This indicates that when letter identification is a priority, it is beneficial to choose fonts of wider letter shapes.
- Published
- 2021
- Full Text
- View/download PDF
20. The Auditory Consonant Trigram (ACT) Test: A norm updating study for university students
- Author
-
Banu Cangöz-Tavat, Özlem Ertan-Kaya, Funda Salman, Müge Kademli, and Zeynel Baran
- Subjects
Consonant ,medicine.medical_specialty ,Working memory ,Validity ,Sample (statistics) ,Audiology ,Test (assessment) ,Interval (music) ,Neuropsychology and Physiological Psychology ,Developmental and Educational Psychology ,medicine ,Trigram ,Young adult ,Psychology - Abstract
The Auditory Consonant Trigram (ACT) Test is accepted as a pure measure of verbal working memory, but its norms and psychometric properties have not been sufficiently researched. This study aims to update the norm data of the ACT, whose validity and reliability studies were previously conducted on an adult Turkish sample, using a broader young sample and in a way that resolves some methodological limitations. For this purpose, data were collected from 304 healthy young adult volunteers (aged 18-26; 152 females, 152 males). According to the results, a difference is found among all delay intervals. While test scores decrease in females as the delay interval increases, there is no difference in males between the delay intervals of 9 and 18 sec. While there is no difference between the genders at very short delay intervals (0-3 sec), males perform better than females as the delay interval increases (9-18 sec). Males are also more successful than females in terms of total ACT test scores. It is concluded that the ACT, with a total-score reliability coefficient of 0.75, is a reliable measure of working memory.
- Published
- 2021
- Full Text
- View/download PDF
21. Divination by Khulils among the Mongols
- Author
-
Anna D. Tsendina (Staraya Basmannaya St., Moscow, Russian Federation)
- Subjects
Linguistics and Language ,Archeology ,History ,Buddhism ,Subject (documents) ,Character (symbol) ,Language and Linguistics ,Realia ,Divination ,Anthropology ,Trigram ,China ,Classics ,Period (music) - Abstract
Introduction. Various collections of Mongolian xylographs and manuscripts contain works on the divination practice with eight khulils. What does the word khulil mean? Why does one use eight khulils? What texts are devoted to khulil divination? This article deals with the practice of khulil divination in Mongolia, while introducing a Mongolian text devoted to this form of divination. Results. The divination practice goes back to the oldest Chinese source on divination, the Yijing (I Ching, Book of Changes, about the seventh century BC). Divination is carried out with the help of the trigram, or three dashes, which result from casting coins or some other method. A combination of trigrams signifies a particular future. These three lines are called khulil in Mongolian (gua in Chinese). Divination by 8 gua, or 8 khulils, and their 64 (8 × 8) or 512 (8 × 8 × 8) combinations is the most common form of divination in China. Later, each trigram was represented by a year of the 12-year animal cycle, so that the ninth year began the next cycle; thus, each of the 8 years symbolizes a certain trigram, or khulil, according to the latter's ordinal number. Given the number of Mongolian manuscripts on khulil divination in various collections, this form of divination was widely practiced by Mongolians. By way of introducing the literature on the subject, the present article presents a Russian translation of the initial fragment of manuscript MN 1145 from the Ts. Damdinsuren museum in Ulaanbaatar. This is a Mongolian translation from Chinese, made relatively late, that shows few traces of Mongolization or adaptation to nomadic realia. Besides concerns about the illnesses of relatives or the choice of a son-in-law or a bride, which are of a universal character, the most popular topics are questions about farming, such as: should one expect rain? what will the harvest of grain and raw silk be? There are also many questions related to promotion and career, e.g., passing the examinations for an official degree. The text contains numerous Sinicisms, including idioms, expressions, and names of Chinese astrological signs; there is also a reference to buying a jins, which points to the Manchu period. Notably, neither Tibetan items nor Buddhist deities are mentioned in the text.
- Published
- 2021
- Full Text
- View/download PDF
22. Malaria Trigram: improving the visualization of recurrence data for malaria elimination
- Author
-
Kayo Henrique de Carvalho Monteiro, Vanderson de Souza Sampaio, Jose Diego Brito-Sousa, Patricia Takako Endo, Cleber Matos de Morais, Wuelton Marcelo Monteiro, and Judith Kelner
- Subjects
medicine.medical_specialty ,Computer science ,Elimination ,Plasmodium vivax ,RC955-962 ,Psychological intervention ,Infectious and parasitic diseases ,RC109-216 ,Recurrence ,Malaria elimination ,Arctic medicine. Tropical medicine ,parasitic diseases ,Malaria, Vivax ,medicine ,Humans ,Disease Eradication ,Visualization ,Data collection ,Surveillance ,biology ,Research ,Data Visualization ,Public health ,biology.organism_classification ,medicine.disease ,Malaria ,Infectious Diseases ,Population Surveillance ,Epidemiological Monitoring ,Parasitology ,Trigram ,Medical emergency ,Brazil - Abstract
Background Although considerable success in reducing the incidence of malaria has been achieved in Brazil in recent years, an increase in the proportion of cases caused by the harder-to-eliminate Plasmodium vivax parasite can be noted. Recurrences of P. vivax malaria are due to new mosquito-bite infections, drug resistance or, especially, relapses arising from hypnozoites. As such, new innovative surveillance strategies are needed. The aim of this study was to develop an infographic visualization tool to improve individual-level malaria surveillance focused on malaria elimination in the Brazilian Amazon. Methods Action Research methodology was employed to deal with the complex malaria surveillance problem in the Amazon region. Four iterative cycles were used, with a formal validation of an operational version of the Malaria Trigram tool at the end of the process. Probabilistic data linkage was then carried out so that records belonging to the same patient could be linked, allowing for follow-up analysis, since the official system was not designed with this purpose in mind. Results An infographic user interface was developed for the Malaria Trigram that incorporates all the visual and descriptive power of the Trigram concept. It is a multidimensional, interactive historical representation of malaria cases per patient over time and provides visual input to decision-makers on recurrences of malaria. Conclusions The Malaria Trigram aims to help public health professionals and policy makers recognise and analyse different types of patterns in malaria events, including recurrences and reinfections, based on the current Brazilian health surveillance system, the SIVEP-Malária system, with no additional primary data collection or change to the current process.
By using the Malaria Trigram, it is possible to plan and coordinate interventions for malaria elimination that are integrated with other parallel actions in the Brazilian Amazon region, such as vector control management, effective drug and vaccine deployment strategies.
- Published
- 2021
23. Stochastic Analysis of Lexical and Semantic Enhanced Structural Language Model
- Author
-
Wang, Shaojun, Wang, Shaomin, Cheng, Li, Greiner, Russell, Schuurmans, Dale, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Carbonell, Jaime G., editor, Siekmann, Jörg, editor, Sakakibara, Yasubumi, editor, Kobayashi, Satoshi, editor, Sato, Kengo, editor, Nishino, Tetsuro, editor, and Tomita, Etsuji, editor
- Published
- 2006
- Full Text
- View/download PDF
24. Language models, surprisal and fantasy in Slavic intercomprehension
- Author
-
Tania Avgustinova, Andrea K. Fischer, Irina Stenger, and Klara Jagrova
- Subjects
Czech ,Think-aloud protocols ,Computer science ,491.8 ,Context (language use) ,02 engineering and technology ,01 natural sciences ,Theoretical Computer Science ,Slavic languages ,Statistical language modelling ,Polish ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Sentential context ,Multilingualism ,010301 acoustics ,020206 networking & telecommunications ,Linguistic distance ,language.human_language ,Linguistics ,Receptive multilingualism ,Human-Computer Interaction ,Reading ,Surprisal ,language ,Trigram ,Language model ,Intercomprehension ,Software ,Sentence - Abstract
In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. With regard to receptive multilingualism, it is unclear to what extent predictability in context interplays with other linguistic factors in understanding a related but unknown language – a process called intercomprehension. We distinguish two dimensions influencing processing effort during intercomprehension: surprisal in sentential context and linguistic distance. Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them to the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, orthographic and lexical distances are reliable predictors of linguistic similarity. On the other hand, we obtain the predictability of words in a sentence with the help of trigram language models. We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) appear to be complementary, with neither factor outweighing the other, and that our distinguishing of these two measurable dimensions is helpful in understanding certain unexpected effects in human behaviour.
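The trigram-based predictability measure described above can be sketched with a simple maximum-likelihood trigram model. This is an illustrative sketch only, with an invented toy corpus and add-alpha smoothing, not the authors' implementation:

```python
import math
from collections import Counter

def train_trigram_counts(sentences):
    """Collect trigram and bigram-context counts with <s> padding."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split()
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return tri, bi

def surprisal(tri, bi, w1, w2, w3, vocab_size, alpha=1.0):
    """Surprisal in bits, -log2 P(w3 | w1, w2), with add-alpha smoothing."""
    p = (tri[(w1, w2, w3)] + alpha) / (bi[(w1, w2)] + alpha * vocab_size)
    return -math.log2(p)

corpus = ["the cat sat", "the cat ran", "the dog sat"]
tri, bi = train_trigram_counts(corpus)
vocab = {w for s in corpus for w in s.split()}
# A word that often follows its context is less surprising than an unseen one:
print(surprisal(tri, bi, "the", "cat", "sat", len(vocab)))  # ≈ 1.81 bits
print(surprisal(tri, bi, "the", "dog", "ran", len(vocab)))  # ≈ 2.58 bits
```

Low surprisal corresponds to high in-context predictability, which is the quantity the study contrasts with linguistic distance.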
- Published
- 2022
- Full Text
- View/download PDF
25. Implementation of the Naïve Bayes Classifier (NBC) Algorithm for Sentiment Analysis of Comments on the Full Day School Policy
- Author
-
Muhammad Halmi Dar, Volvo Sihombing, and Yarma Agustya Dewi Utami
- Subjects
Computer science ,business.industry ,Sentiment analysis ,Feature selection ,Bayes classifier ,Lexicon ,Machine learning ,computer.software_genre ,Naive Bayes classifier ,Trigram ,Artificial intelligence ,business ,computer ,Selection (genetic algorithm) ,Test data - Abstract
Sentiment analysis is an important and actively developing research topic. It is carried out to determine a person's opinion of, or disposition towards, a problem or object, i.e., whether the expressed view tends to be negative or positive. The main purpose of this research is to determine public sentiment towards the Full Day School policy, based on comments from the Facebook page of the Ministry of Education and Culture of the Republic of Indonesia, and to assess the performance of the Naïve Bayes Classifier (NBC) algorithm. The results of this study indicate that negative public sentiment towards the Full Day School policy is higher than positive or neutral sentiment. The highest accuracy, 80%, was obtained by the Naïve Bayes Classifier with trigram feature selection on the 300-document training model. This simulation demonstrates that the amount of training data and the features selected for the NBC algorithm affect the accuracy of the results. Meanwhile, the simulation results from 10 test documents with 5 different NBC and Lexicon algorithm variants also show that most Facebook users who express opinions through comments view the Full Day School policy proposed by the Indonesian Minister of Education and Culture more negatively than positively or neutrally.
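A minimal from-scratch sketch of a multinomial Naïve Bayes classifier over word-trigram features, in the spirit of the NBC-with-trigram setup reported above. The toy training comments below are invented for illustration; the study itself used Indonesian Facebook comments:

```python
import math
from collections import Counter, defaultdict

def word_trigrams(text):
    """Word-level trigram features of a document."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + 3]) for i in range(len(toks) - 2)]

class TrigramNB:
    """Minimal multinomial Naive Bayes over trigram features (Laplace smoothing)."""

    def fit(self, docs, labels):
        self.n_docs = len(docs)
        self.prior = Counter(labels)          # class -> document count
        self.counts = defaultdict(Counter)    # class -> trigram -> count
        self.vocab = set()
        for doc, y in zip(docs, labels):
            for g in word_trigrams(doc):
                self.counts[y][g] += 1
                self.vocab.add(g)
        return self

    def predict(self, doc):
        feats = word_trigrams(doc)

        def log_posterior(y):
            total = sum(self.counts[y].values())
            lp = math.log(self.prior[y] / self.n_docs)
            for g in feats:
                lp += math.log((self.counts[y][g] + 1) / (total + len(self.vocab)))
            return lp

        return max(self.prior, key=log_posterior)

train_docs = ["full day school is a bad idea",
              "full day school is a good policy",
              "children are too tired this is bad"]
train_labels = ["negative", "positive", "negative"]
clf = TrigramNB().fit(train_docs, train_labels)
print(clf.predict("this policy is a bad idea"))  # -> negative
```

On this toy data, the shared trigrams "is a bad" and "a bad idea" plus the class prior tip the posterior towards the negative class.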
- Published
- 2021
- Full Text
- View/download PDF
26. An in-depth exploration of Bangla blog post classification
- Author
-
Md. Ismail Jabiullah, Md. Tarek Habib, Md. Mehedee Zaman Khan, Ashik Iqbal Prince, and Tanvirul Islam
- Subjects
Control and Optimization ,Computer Networks and Communications ,Computer science ,Bigram ,Decision tree ,02 engineering and technology ,Machine learning ,computer.software_genre ,01 natural sciences ,Unigram ,010305 fluids & plasmas ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Computer Science (miscellaneous) ,Supervised machine learning ,Electrical and Electronic Engineering ,tf–idf ,Instrumentation ,business.industry ,Supervised learning ,TF-IDF ,Trigram ,Bangla text classification ,Perceptron ,Random forest ,Support vector machine ,ComputingMethodologies_PATTERNRECOGNITION ,Hardware and Architecture ,Control and Systems Engineering ,Bangla blog ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Information Systems - Abstract
Bangla blogging is increasing rapidly in the era of information, and consequently blogs have diverse layouts and categorizations. In this situation, automated blog post classification is a comparatively efficient solution for organizing Bangla blog posts in a standard way, so that users can easily find the articles they are interested in. In this research, nine supervised learning models, namely Support Vector Machine (SVM), multinomial naïve Bayes (MNB), multi-layer perceptron (MLP), k-nearest neighbours (k-NN), stochastic gradient descent (SGD), decision tree, perceptron, ridge classifier and random forest, are utilized and compared for classification of Bangla blog posts. Moreover, to evaluate performance in predicting blog posts across eight categories, three feature extraction techniques are applied, namely unigram TF-IDF (term frequency-inverse document frequency), bigram TF-IDF, and trigram TF-IDF. The majority of the classifiers show above 80% accuracy. Other performance evaluation metrics also show good results when comparing the selected classifiers.
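The three feature extraction schemes (unigram, bigram and trigram TF-IDF) can be sketched from scratch as follows. This uses one common smoothed TF-IDF variant and invented toy documents, not the authors' exact weighting or data:

```python
import math
from collections import Counter

def ngrams(text, n):
    """Word n-grams of a document."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def tfidf_vectors(docs, n):
    """n-gram TF-IDF: tf(t, d) * (log(N / (1 + df(t))) + 1)."""
    N = len(docs)
    grams = [Counter(ngrams(d, n)) for d in docs]
    df = Counter()
    for g in grams:
        df.update(g.keys())
    idf = {t: math.log(N / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in g.items()} for g in grams]

docs = ["bangla blog posts grow fast",
        "blog posts need categories",
        "users read bangla blog articles"]
for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    vecs = tfidf_vectors(docs, n)
    print(name, sorted(vecs[0], key=vecs[0].get, reverse=True)[:2])
```

Each scheme yields one sparse vector per document; in practice these vectors are fed to the nine classifiers compared in the paper.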
- Published
- 2021
- Full Text
- View/download PDF
27. Algorithms for correcting recognition results using N-grams.
- Author
-
Manzhikov, T., Slavin, O., Faradjev, I., and Janiszewski, I.
- Abstract
This paper studies the application of N-grams to correcting the results of pattern recognition of words in documents, using the recognition of passport fields of a citizen of the Russian Federation as an example. Three trigram-based algorithms for correcting recognition results are presented. One of them uses trigram probabilities in combination with recognition confidence estimates. The other algorithms are based on the computation of marginal distributions by means of graphical probability models. The results of experiments on applying the algorithms, and a comparison of their characteristics, are presented. [ABSTRACT FROM AUTHOR]
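The first idea, combining trigram probabilities with recognition confidence, can be sketched as a simple log-linear rescoring of recognizer candidates. Everything below (the trigram probabilities, the candidate words, the mixing weight) is invented for illustration and is not the paper's algorithm:

```python
import math

# Toy trigram model: P(word | two preceding tokens). Invented values.
TRIGRAM_P = {
    ("surname:", "ivanov", "ivan"): 0.08,
    ("surname:", "ivanov", "ivam"): 0.0001,
}

def rescore(context, candidates, lam=0.5, floor=1e-6):
    """Pick the candidate maximizing a mix of recognizer confidence
    and trigram language-model probability (log-linear interpolation)."""
    def score(cand):
        word, conf = cand
        p_lm = TRIGRAM_P.get((*context, word), floor)
        return lam * math.log(conf) + (1 - lam) * math.log(p_lm)
    return max(candidates, key=score)[0]

# The recognizer slightly prefers the garbled "ivam";
# the trigram model overrides it.
best = rescore(("surname:", "ivanov"), [("ivam", 0.55), ("ivan", 0.45)])
print(best)  # -> ivan
```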
- Published
- 2017
- Full Text
- View/download PDF
28. The Use of Hidden Markov Model in Natural ARABIC Language Processing: a survey.
- Author
-
Suleiman, Dima, Awajan, Arafat, and Al Etaiwi, Wael
- Subjects
HIDDEN Markov models ,NATURAL language processing ,SEMANTIC computing ,ARABIC language ,COMPARATIVE studies - Abstract
The Hidden Markov Model (HMM) is an empirical tool that can be used in many applications related to natural language processing. In this paper, a comparative study was conducted of different applications in natural Arabic language processing that use Hidden Markov Models, such as morphological analysis, part-of-speech tagging, text classification, and named entity recognition. Comparative results showed that HMMs can be used in different layers of natural language processing, but mainly in the pre-processing phase, e.g., part-of-speech tagging, morphological analysis and syntactic structure; in higher-level applications such as text classification, their use is limited to a small number of studies. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
29. Prosody prediction for arabic via the open-source boundary-annotated qur’an corpus
- Author
-
Eric Atwell, C Brierley, and Majdi Sawalha
- Subjects
Phrase ,Computer science ,business.industry ,Speech recognition ,Intonation (linguistics) ,computer.software_genre ,language.human_language ,Annotation ,ComputingMethodologies_PATTERNRECOGNITION ,Trigram tagger ,Classifier (linguistics) ,Modern Standard Arabic ,language ,Trigram ,Artificial intelligence ,Prosody ,business ,computer ,Natural language processing - Abstract
A phrase break classifier is needed to predict natural prosodic pauses in text to be read out loud by humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert linguist. For Arabic, there are no existing suitable resources. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur’an, which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive, and signifies a widely used recitation style, one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur’an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. We then use this dataset to train, test, and compare two probabilistic taggers (trigram and HMM) for Arabic phrase break prediction, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with a trigram tagger, and significant gains in the recognition of minority-class instances with both taggers, as measured by the Balanced Classification Rate metric. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
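The break/non-break classification task above can be sketched as a tiny tag-context model: count how often a boundary follows each (previous tag, current tag) context in training data, and back off to the majority class (non-break) for unseen contexts, mirroring the 85.56% baseline. The toy tag sequence is invented; this is not the paper's trigram tagger:

```python
from collections import Counter, defaultdict

def train_break_model(tagged_sentences):
    """Count break vs non-break outcomes after each (prev_tag, tag) context.

    tagged_sentences: lists of (pos_tag, is_break) pairs, one per word.
    """
    ctx = defaultdict(Counter)
    for sent in tagged_sentences:
        tags = ["<s>"] + [t for t, _ in sent]
        for i, (_, brk) in enumerate(sent):
            ctx[(tags[i], tags[i + 1])][brk] += 1
    return ctx

def predict(ctx, prev_tag, tag):
    """Majority vote within the context; unseen contexts default to
    non-break, the dominant class in the training data."""
    c = ctx.get((prev_tag, tag))
    if c:
        return c.most_common(1)[0][0]
    return False

train = [[("NOUN", False), ("VERB", False), ("NOUN", True),
          ("CONJ", False), ("NOUN", False), ("VERB", True)]]
model = train_break_model(train)
print(predict(model, "VERB", "NOUN"))  # -> True (seen break context)
print(predict(model, "ADJ", "NOUN"))   # -> False (unseen, back off)
```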
- Published
- 2021
- Full Text
- View/download PDF
30. Development of Bangla Spell and Grammar Checkers: Resource Creation and Evaluation
- Author
-
Nahid M. Hossain, Salekul Islam, and Mohammad Nurul Huda
- Subjects
General Computer Science ,Computer science ,grammar checker ,Bigram ,media_common.quotation_subject ,corpus ,Lexicon ,computer.software_genre ,Bangla ,spell checker ,General Materials Science ,Electrical and Electronic Engineering ,media_common ,Grammar ,business.industry ,Cosine similarity ,General Engineering ,Spell ,Metaphone ,TK1-9971 ,lexicon ,Edit distance ,Trigram ,Electrical engineering. Electronics. Nuclear engineering ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
A spell and grammar checker is profoundly essential for diverse publications, especially for the Bangla language, which is spoken by millions of native speakers around the world. Considering the lack of research efforts, we demonstrate the development of a comprehensive Bangla spell and grammar checker together with the necessary resources. First, a full-fledged, generalised Bangla monolingual corpus comprising over 100 million words was built by scraping reputed, diversified online sources, and an extensive Bangla lexicon consisting of over 1 million unique words was then extracted from that corpus. Based on this corpus and lexicon, we have developed a combined spell and grammar checker application that simultaneously detects distinct spelling and grammatical mistakes and provides appropriate suggestions for both. The spell checker uses the Double Metaphone algorithm and edit distance, based on the distributed lexicons and a numerical-suffix dataset, to detect all types of Bangla spelling mistakes with an individual accuracy rate of 97.21%. The grammar checker detects errors based on language model probability, i.e. a combination of bigram and trigram probabilities, and generates suggestions based on the cosine similarity measure, with an individual accuracy rate of 94.29%. The datasets and codes used in this work are freely available at https://git.io/JzJ4w.
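The edit-distance component of the spell checker can be sketched as classic Levenshtein distance plus a lexicon ranking step. This is a generic illustration with an invented English toy lexicon, not the paper's Bangla implementation (which also combines Double Metaphone):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, k=3):
    """Rank lexicon entries by edit distance to the misspelt word."""
    return sorted(lexicon, key=lambda w: edit_distance(word, w))[:k]

lexicon = ["school", "scholar", "spell", "shell", "smell"]
print(edit_distance("speling", "spelling"))  # -> 1
print(suggest("spel", lexicon, k=2))
```

In the paper's pipeline, a phonetic key (Double Metaphone) first narrows the candidate set before distances are computed, which keeps suggestion generation fast on a million-word lexicon.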
- Published
- 2021
- Full Text
- View/download PDF
31. A Term Weighted Neural Language Model and Stacked Bidirectional LSTM Based Framework for Sarcasm Identification
- Author
-
Aytug Onan and Mansur Alp Tocoglu
- Subjects
Word embedding ,General Computer Science ,Computer science ,media_common.quotation_subject ,02 engineering and technology ,term weighting ,computer.software_genre ,neural language model ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,Word2vec ,media_common ,Sarcasm ,business.industry ,General Engineering ,020206 networking & telecommunications ,Sarcasm identification ,Weighting ,Term (time) ,bidirectional long shortterm memory ,Task analysis ,020201 artificial intelligence & image processing ,Trigram ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,Language model ,Artificial intelligence ,business ,lcsh:TK1-9971 ,computer ,Natural language processing - Abstract
Sarcasm identification in text documents is one of the most challenging tasks in natural language processing (NLP) and has become an essential research direction due to its prevalence in social media data. The purpose of our research is to present an effective sarcasm identification framework for social media data by pursuing the paradigms of neural language models and deep neural networks. To represent text documents, we introduce an inverse-gravity-moment-based term-weighted word embedding model with trigrams. In this way, critical words/terms receive higher values while word-ordering information is retained. In our model, we present a three-layer stacked bidirectional long short-term memory architecture to identify sarcastic text documents. For the evaluation task, the presented framework has been evaluated on three sarcasm identification corpora. In the empirical analysis, three neural language models (i.e., word2vec, fastText and GloVe), two unsupervised term weighting functions (i.e., term frequency and TF-IDF) and eight supervised term weighting functions (i.e., odds ratio, relevance frequency, balanced distributional concentration, inverse question frequency-question frequency-inverse category frequency, short text weighting, inverse gravity moment, regularized entropy and inverse false negative-true positive-inverse category frequency) have been evaluated. For the sarcasm identification task, the presented model yields promising results with a classification accuracy of 95.30%.
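The inverse gravity moment (IGM) weighting named above can be sketched as follows, using the definition common in the term-weighting literature: igm(t) = f1 / Σ(f_r · r) over a term's per-class frequencies sorted in descending order, with a weight factor 1 + λ·igm. The λ default and the example frequencies are assumptions for illustration; the paper's exact integration with embeddings is not reproduced here:

```python
def igm_weight(class_freqs, lam=7.0):
    """Weight factor 1 + lam * igm(t), where igm(t) = f1 / sum(f_r * r)
    over per-class frequencies f_r sorted in descending order."""
    f = sorted(class_freqs, reverse=True)
    gravity = sum(fr * r for r, fr in enumerate(f, start=1))
    return 1 + lam * (f[0] / gravity)

# A term concentrated in one class is weighted higher than a term
# spread evenly across classes:
print(round(igm_weight([90, 5, 5]), 3))    # -> 6.478 (concentrated)
print(round(igm_weight([30, 35, 35]), 3))  # -> 2.256 (evenly spread)
```

The intuition is that class-discriminative terms should dominate the document representation, which is why IGM appears among the supervised weighting functions compared in the paper.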
- Published
- 2021
- Full Text
- View/download PDF
32. Machine Learning Approaches for Classifying the Distribution of Covid-19 Sentiments
- Author
-
M. Kuyo, E. Okang’o, and S. Mwalili
- Subjects
Computer science ,business.industry ,Bigram ,Sentiment analysis ,Confusion matrix ,General Medicine ,Machine learning ,computer.software_genre ,Naive Bayes classifier ,n-gram ,Tokenization (data security) ,Trigram ,Language model ,Artificial intelligence ,business ,computer - Abstract
Previously, rapid disease detection and prevention was difficult, because disease modeling and prediction depended on manually obtained datasets such as surveys. With the increased use of social media platforms like Twitter, Facebook, Instagram, etc., data mining and sentiment analysis can help prevent diseases. Sentiment analysis is a powerful tool for analyzing people’s perceptions, emotions, value assessments, attitudes, and feelings as expressed in texts. The purpose of this research is to use machine learning techniques to classify and predict the spatial distribution of positive and negative sentiments about the Covid-19 pandemic. This study employed machine learning to classify the spatial distribution of Covid-19 Twitter sentiments as positive or negative. The data were geo-tagged tweets concerning COVID-19, live-streamed using the streamR package. The key terms used for streaming the data were: corona, Covid-19, sanitizer, virus, lockdown, quarantine, and social distance. The classification used Naive Bayes algorithms with n-gram approaches. An n-gram model is a probabilistic language model used to predict the next item in a sequence, in the form of an (n − 1)-order Markov model. It relies on the Markov assumption: the probability of a word depends only on the previous word, without looking too far into the past. The steps followed in this research were: cleaning and preprocessing the data; text tokenization using n-grams, i.e. 1-grams, 2-grams, and 3-grams; and converting or weighting the tweets into a matrix of numeric vectors using Term Frequency-Inverse Document Frequency (TF-IDF). The data were split 80:20 between training and test sets. A confusion matrix was utilized to evaluate the classification accuracy, precision, and recall of the various algorithms tested. Prediction was done using the best-performing Naive Bayes algorithm.
The results of this research showed that under Multinomial Naive Bayes, unigram accuracy was 92.02%, bigram accuracy was 97.37%, and trigram accuracy was 94.40%. Using Bernoulli Naive Bayes, unigram accuracy was 89.34%, bigram 96.80%, and trigram 94.90%. Using Gaussian Naive Bayes, unigram accuracy was 90.43%, bigram 95.67%, and trigram 92.89%. Bigram tokenization outperformed unigram and trigram tokenization, so Bigram Multinomial Naive Bayes, the most accurate classifier on the training data, was used to predict the test data. Prediction accuracy was 84.92%, precision 85.50%, recall 81.02%, and the F1 measure 83.20%. TF-IDF was employed to increase prediction accuracy, obtaining 87.06%. The results were then plotted on a globe map. The study indicates that machine learning can identify patterns and emotions in public tweets, which may then be used to steer targeted intervention programs aimed at limiting disease spread.
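The confusion-matrix metrics reported above are internally consistent, as a quick sketch of the standard definitions shows (the helper function is generic; the tp/fp/fn/tn counts in the example are invented):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Sanity check on the reported figures: precision 85.50% and recall 81.02%
# combine to the reported F1 of 83.20%.
p, r = 0.8550, 0.8102
print(round(2 * p * r / (p + r), 4))  # -> 0.832
```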
- Published
- 2021
- Full Text
- View/download PDF
33. Multi-class Sports News Categorization using Machine Learning Techniques: Resource Creation and Evaluation
- Author
-
Mohammed Moshiul Hoque, Omar Sharif, and Adrita Barua
- Subjects
Computer science ,business.industry ,Feature vector ,Bigram ,Decision tree ,Machine learning ,computer.software_genre ,language.human_language ,Random forest ,ComputingMethodologies_PATTERNRECOGNITION ,Bengali ,Categorization ,language ,Feature (machine learning) ,General Earth and Planetary Sciences ,Trigram ,Artificial intelligence ,business ,computer ,General Environmental Science - Abstract
The proliferation of the Internet and social media usage creates enormous textual data (specifically, news content) on the web. Most of this content is unstructured. Extracting meaningful insights from unstructured content is nearly impossible, or at least extremely hard and time-consuming, by human labor. Thus, automatic text classification has gained much attention from NLP experts in recent years. Several techniques have been developed to classify news text in high-resource languages (e.g., English, Chinese, French). However, automatic classification of Bengali news text is still at a primitive stage. This paper investigates the six most popular machine learning techniques (such as Logistic Regression (LR), Support Vector Classifier (SVC), Decision Tree (DT), Multinomial Naive Bayes (MNB), Random Forest (RF), etc.) with Term Frequency-Inverse Document Frequency (TF-IDF) features for automatic sports news classification in Bengali. Due to the unavailability of a benchmark corpus, this work also developed a Bengali news corpus (called BNeC) consisting of 43306 news documents with 202830 unique words in multiple classes: Cricket, Football, Tennis, and Athletics. Experimental results on the test dataset show that the Support Vector Classifier (SVC) with the unigram+bigram+trigram feature space obtained the highest weighted f1-score, 97.60%, among all the classifiers and feature combinations.
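The winning unigram+bigram+trigram configuration combines all three n-gram orders into one feature space, which can be sketched as follows (toy English sentence invented for illustration; the paper works on Bengali text with TF-IDF weighting on top of these counts):

```python
from collections import Counter

def ngram_features(text, n_max=3):
    """Combined unigram+bigram+trigram count features for one document."""
    toks = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(toks) - n + 1):
            feats[" ".join(toks[i:i + n])] += 1
    return feats

f = ngram_features("bangladesh wins the cricket match")
print(len(f))  # -> 12 (5 unigrams + 4 bigrams + 3 trigrams)
```

The combined space lets the classifier exploit both individual keywords ("cricket") and multi-word cues ("cricket match") in a single vector.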
- Published
- 2021
- Full Text
- View/download PDF
34. Deep Learning Approach for Automatic Romanian Lemmatization
- Author
-
Maria Nuţu
- Subjects
Lemma (mathematics) ,Computer science ,business.industry ,Lemmatisation ,Deep learning ,Romanian ,Context (language use) ,computer.software_genre ,language.human_language ,Text processing ,language ,General Earth and Planetary Sciences ,Trigram ,Artificial intelligence ,business ,computer ,Word (computer architecture) ,Natural language processing ,General Environmental Science - Abstract
This paper proposes a deep learning sequence-to-sequence approach to improve automatic Romanian lemmatization. The study compares 24 systems using different combinations of recurrent, convolutional and attention layers, with text input consisting of word-lemma pairs at both word and trigram level. As Romanian is a low-resourced language in the field of text processing, the aim of this study is to use as little input information as possible; thus, to increase lemmatization accuracy, additional lexical features (such as context and POS tags) have been provided only gradually. For the trigram case, two scenarios were proposed: predicting the lemma for every word in the sequence, or predicting the lemma only for the word in the middle. The lemmatizers have been analyzed on two Romanian datasets: the Romanian Explicative Dictionary (DEX) and the belletristic subset of the CoRoLa corpus. For the DEX dataset, the best results were obtained with the LSTM-based systems at both word (99.32%) and character level (99.43%). For the CoRoLa subset, the CNN-based architecture performs best at trigram (95.86%) and word level (99.09%), while the LSTM-stacked system obtained the highest accuracy at character level (98.78%).
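The second trigram scenario (predict the lemma only for the middle word) amounts to building (trigram context → middle-word lemma) training pairs, which can be sketched as follows. The Romanian surface forms and lemmas below are invented toy data, not taken from DEX or CoRoLa:

```python
def trigram_lemma_pairs(words, lemmas):
    """Build (trigram context -> lemma of the middle word) training pairs."""
    pairs = []
    for i in range(1, len(words) - 1):
        context = (words[i - 1], words[i], words[i + 1])
        pairs.append((context, lemmas[i]))
    return pairs

words = ["copiii", "citesc", "cartile", "bune"]
lemmas = ["copil", "citi", "carte", "bun"]
for ctx, lemma in trigram_lemma_pairs(words, lemmas):
    print(ctx, "->", lemma)
```

Each pair gives the seq2seq model the surrounding words as disambiguating context while asking for only one lemma per example.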
- Published
- 2021
- Full Text
- View/download PDF
35. The inhibitory effect of word neighborhood size when reading with central field loss is modulated by word predictability and reading proficiency
- Author
-
Carlos A. Aguilar, Aurélie Calabrèse, Eric Castet, N. Stolowy, Núria Gala, Thomas François, Frédéric Matonti, Lauren Sauvan, Ophtalmologie [Hôpital de la Timone - APMH], Aix Marseille Université (AMU)-Assistance Publique - Hôpitaux de Marseille (APHM)- Hôpital de la Timone [CHU - APHM] (TIMONE), Amaris Research Unit [Biot], Université Catholique de Louvain = Catholic University of Louvain (UCL), Laboratoire Parole et Langage (LPL), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), Centre Paradis Monticelli [Marseille], Laboratoire de psychologie cognitive (LPC), Biologically plausible Integrative mOdels of the Visual system : towards synergIstic Solutions for visually-Impaired people and artificial visiON (BIOVISION), Inria Sophia Antipolis - Méditerranée (CRISAM), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), ANR-16-CONV-0002,ILCB,ILCB: Institute of Language Communication and the Brain(2016), and Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU)
- Subjects
0301 basic medicine ,Adult ,Male ,Quality of life ,Text simplification ,media_common.quotation_subject ,Science ,Visual impairment ,Vision, Low ,Article ,03 medical and health sciences ,Fluency ,0302 clinical medicine ,Reading (process) ,Human behaviour ,medicine ,Humans ,Attention ,media_common ,Aged ,Aged, 80 and over ,Multidisciplinary ,[SDV.NEU.PC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Psychology and behavior ,Macular degeneration ,Rehabilitation ,[SDV.NEU.SC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Cognitive Sciences ,[SCCO.LING]Cognitive science/Linguistics ,Middle Aged ,Translational research ,030104 developmental biology ,Reading ,Word recognition ,[SCCO.PSYC]Cognitive science/Psychology ,030221 ophthalmology & optometry ,Medicine ,Trigram ,Female ,medicine.symptom ,Psychology ,Sentence ,Word (computer architecture) ,Cognitive psychology - Abstract
International audience; Background: For normally sighted readers, word neighborhood size (i.e., the total number of words that can be formed from a single word by changing only one letter) has a facilitator effect on word recognition. When reading with central field loss (CFL), however, individual letters may not be correctly identified, leading to possible misidentifications and a reverse neighborhood size effect. Here we investigate this inhibitory effect of word neighborhood size on reading performance and whether it is modulated by word predictability and reading proficiency. Methods: Nineteen patients with binocular CFL from 32 to 89 years old (mean ± SD = 75 ± 15) read short sentences presented with the self-paced reading paradigm. Accuracy and reading time were measured for each target word read, along with its predictability, i.e., its probability of occurrence following the two preceding words in the sentence using a trigram analysis. Linear mixed effects models were then fit to estimate the individual contributions of word neighborhood size, predictability, frequency and length on accuracy and reading time, while taking patients’ reading proficiency into account.Results: For the less proficient readers, who have given up daily reading as a consequence of their visual impairment, we found that the effect of neighborhood size was reversed compared to normally sighted readers and of higher amplitude than the effect of frequency. Furthermore, this inhibitory effect is of greater amplitude (up to 50% decrease in reading speed) when a word is not easily predictable because its chances to occur after the two preceding words in a specific sentence are rather low.Conclusion: Severely impaired patients with CFL often quit reading on a daily basis because this task becomes simply too exhausting. Based on our results, we envision lexical text simplification as a new alternative to promote effective rehabilitation in these patients. 
By increasing reading accessibility for those who struggle the most, text simplification might be used as an efficient rehabilitation tool and daily reading assistive technology, fostering overall reading ability and fluency through increased practice.
- Published
- 2020
- Full Text
- View/download PDF
36. Part of Speech Tagging Using Hidden Markov Models
- Author
-
Daniel Morariu and Adrian Bărbulescu
- Subjects
Computer science ,Brown Corpus ,Speech recognition ,Bigram ,Trigram ,Hidden Markov model ,Tag system ,Sentence ,Word (computer architecture) ,Decoding methods - Abstract
In this paper, we present a wide range of models based on non-adaptive and adaptive approaches to a PoS tagging system. The parameters of the adaptive approach are based on the n-grams of a Hidden Markov Model, evaluated for bigrams and trigrams, and on three different decoding methods: forward, backward, and bidirectional. We used the Brown Corpus for the training and testing phases. The bidirectional trigram model almost reaches state-of-the-art accuracy but is held back by its decoding time, while the backward trigram reaches almost the same results with a considerably faster decoding time. From these results, we can conclude that decoding performs better when it evaluates the sentence from the last word to the first, and although the backward trigram model is very good, we still recommend the bidirectional trigram model when good precision on real data is wanted.
- Published
- 2020
- Full Text
- View/download PDF
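The HMM decoding the paper evaluates can be sketched with a minimal bigram Viterbi decoder. The three tags and the hand-set probabilities below are illustrative stand-ins, not Brown Corpus estimates.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Find the most probable tag sequence for `words` under a bigram HMM."""
    # V[i][t] = best probability of any tag path ending in tag t at position i
    V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({}); back.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 1e-6), p)
                for p in tags)
            V[i][t], back[i][t] = prob, prev
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.15, "VERB": 0.05}
trans_p = {"DET": {"DET": 0.01, "NOUN": 0.9, "VERB": 0.09},
           "NOUN": {"DET": 0.05, "NOUN": 0.15, "VERB": 0.8},
           "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}}
emit_p = {"DET": {"the": 0.9}, "NOUN": {"dog": 0.5, "walk": 0.1},
          "VERB": {"walks": 0.5, "dog": 0.05}}
print(viterbi(["the", "dog", "walks"], tags, start_p, trans_p, emit_p))
# ['DET', 'NOUN', 'VERB']
```

A backward decoder as described in the paper would run the same recurrence over the reversed sentence with reversed transition estimates; the data structures stay identical.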
37. Public opinion mining using natural language processing technique for improvisation towards smart city
- Author
-
M. Nithya and S. Leelavathy
- Subjects
Direct voice input ,Linguistics and Language ,Service (systems architecture) ,Computer science ,Public opinion ,computer.software_genre ,Article ,Language and Linguistics ,Smart city ,Hidden Markov models ,Government ,business.industry ,Natural language processing ,Covid 19 ,Transparency (behavior) ,Purchasing ,Fuzzy logic ,Human-Computer Interaction ,Speech processing ,Trigram ,Computer Vision and Pattern Recognition ,Artificial intelligence ,business ,computer ,Software - Abstract
In this digital world integrating smart city concepts, there is tremendous scope and need for e-governance applications. People now analyze the opinions of others before purchasing any product, booking a hotel, stepping into restaurants, etc., and the respective users share their experience as feedback on the service. But there is no e-governance platform to collect public opinions and grievances on COVID-19, new government laws, policies, etc. With the growing availability and emergence of opinion-rich information, new opportunities and challenges arise in developing a technology for mining the huge set of public messages and opinions, alerting the respective departments to take necessary actions, and also alerting nearby ambulances when a message is related to COVID-19. To overcome this pandemic situation, an efficient natural-language-processing-based e-governance platform is needed to detect corona-positive patients, provide transparency on the COVID count, and alert the respective health ministry and nearby ambulances based on user voice inputs. To convert the public voice messages into text, we used Hidden Markov Models (HMMs). To identify the government department responsible for each user voice input, we perform pre-processing, part-of-speech, unigram, bigram, and trigram analysis and fuzzy logic (a machine learning technique). After identifying the responsible department, we apply two methods: (1) automatic alert e-mail and message to the government departmental officials and to a nearby ambulance or COVID camp if the user input is related to COVID-19; (2) a ticketing system for public and government officials' monitoring. For experimental results, we used a Java-based web and mobile application to execute the proposed methodology. The integration of HMM and fuzzy logic provides promising results.
- Published
- 2020
- Full Text
- View/download PDF
38. Sentiment Analysis on KAI Twitter Post Using Multiclass Support Vector Machine (SVM)
- Author
-
Dhina Nur Fitriana and Yuliant Sibaroni
- Subjects
text classification ,lcsh:T58.5-58.64 ,lcsh:Information technology ,business.industry ,Computer science ,Bigram ,Sentiment analysis ,Supervised learning ,Machine learning ,computer.software_genre ,lcsh:TA168 ,Support vector machine ,ComputingMethodologies_PATTERNRECOGNITION ,lcsh:Systems engineering ,sentiment analysis ,term frequency-invers document frequency ,Feature (machine learning) ,Trigram ,Social media ,twitter data ,Artificial intelligence ,Tag cloud ,business ,computer - Abstract
Information in the form of unstructured text is increasingly commonplace on the internet. This information is easily found and utilized by business people or companies through social media, one of which is Twitter, currently ranked 6th among the most widely accessed social media. The use of Twitter has the disadvantage of unstructured and large data; consequently, it is difficult for business people or companies with limited resources to know public opinion about their services. To make it easier for businesses to know the public's sentiment and provide better service in the future, public sentiment on Twitter needs to be classified as positive, neutral, or negative. The Multiclass Support Vector Machine (SVM) method is a supervised learning classification method that handles three-class classification. This paper uses the One Against All (OAA) approach as the method to determine the class. It reports the results of classifying with the OAA Multiclass SVM method using five different weighting features (unigram, bigram, trigram, unigram+bigram, and word cloud) for analyzing tweet data, finding the best accuracy and the important features when processing large data. The highest accuracy, 80.59%, is achieved by the unigram TF-IDF model combined with the OAA Multiclass SVM with gamma 0.7.
- Published
- 2020
- Full Text
- View/download PDF
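The one-against-all decision scheme used above can be sketched without any ML library: train one scorer per class against the rest and pick the class whose scorer responds most strongly. Here a term-frequency centroid with cosine similarity stands in for the per-class SVM, purely to show the OAA decision rule; the toy documents are invented.

```python
from collections import Counter
import math

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_oaa(docs):
    """One-against-all: one scorer per class. A TF centroid stands in
    for the per-class SVM in this dependency-free sketch."""
    centroids = {}
    for label, texts in docs.items():
        c = Counter()
        for t in texts:
            c.update(tf_vector(t))
        centroids[label] = c
    return centroids

def classify(model, text):
    v = tf_vector(text)
    # OAA decision rule: take the class whose scorer responds most strongly
    return max(model, key=lambda lbl: cosine(v, model[lbl]))

docs = {"positive": ["great service fast train", "love the quick ride"],
        "negative": ["late train awful delay", "terrible late service"],
        "neutral": ["the train departs at nine", "schedule posted today"]}
model = train_oaa(docs)
print(classify(model, "awful delay again"))  # negative
```

Swapping the centroid scorer for a binary SVM per class, as in the paper, changes only `train_oaa` and the per-class score function; the argmax over classes is the same.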
39. Verse Search System for Sound Differences in the Qur’an Based on the Text of Phonetic Similarities
- Author
-
Agni Octavia, Moch Arif Bijaksana, and Kemas Muslim Lhaksmana
- Subjects
lcsh:T58.5-58.64 ,Recall ,string matching ,lcsh:Information technology ,Computer science ,business.industry ,phonetic search ,Stop sign ,String searching algorithm ,Longest increasing subsequence ,Pronunciation ,computer.software_genre ,Phonetic search technology ,Trigram ,Artificial intelligence ,business ,computer ,n-grams ,Natural language processing ,Coding (social sciences) - Abstract
The Al-Qur'an has a great deal of content, so a system for searching its verses is needed; doing so manually is difficult. One search system for verses of the Al-Qur'an based on Indonesian pronunciation is Lafzi. The Lafzi system can search for verse fragments using keywords in Latin characters. Lafzi has been developed into Lafzi+, which can search verses of the Al-Qur'an with sound differences at stop signs. However, Lafzi+ can only handle sound differences at stop signs and cannot be applied throughout the Al-Qur'an. Based on these problems, the system needs to be developed to handle sound differences in the middle of a verse and to be applicable throughout the Al-Qur'an. The method used in the verse search process is the N-gram method; the N-gram used in this research is the trigram. The process flow of this system is: the query is first normalized in the phonetic coding process, then tokenized into trigrams, and the trigrams are matched between the query and the corpus and passed to the ranking process to produce candidate outputs. In the ranking process, the LIS (Longest Increasing Subsequence) method is used to obtain an ordered, strictly increasing trigram sequence. The highest order score becomes the top output. The results of this study obtained a recall value of 100% and a MAP of 87%.
- Published
- 2020
- Full Text
- View/download PDF
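The trigram matching plus LIS ranking described above can be sketched as follows: collect the corpus positions of each query trigram, then score a verse by the length of the longest increasing subsequence of those positions, which rewards matches that occur in order. The strings below are illustrative, not actual phonetic codes.

```python
from bisect import bisect_left

def trigrams(s):
    return [s[i:i + 3] for i in range(len(s) - 2)]

def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def order_score(query, verse):
    """Score a verse by how many query trigrams occur in it in order."""
    pos = {t: i for i, t in enumerate(trigrams(verse))}  # last occurrence wins
    matched = [pos[t] for t in trigrams(query) if t in pos]
    return lis_length(matched)

print(order_score("bismillah", "bismillahirrahman"))  # 7
```

A full system would normalize both strings with the phonetic coding step first and keep all occurrence positions per trigram; the single-position dictionary here keeps the sketch short.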
40. Experimenting with factored language model and generalized back-off for Hindi
- Author
-
Arun Babhulgaonkar and Shefali Sonavane
- Subjects
Perplexity ,Computer Networks and Communications ,Computer science ,media_common.quotation_subject ,02 engineering and technology ,computer.software_genre ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Conversation ,Electrical and Electronic Engineering ,media_common ,Hindi ,business.industry ,Applied Mathematics ,020206 networking & telecommunications ,language.human_language ,Computer Science Applications ,Computational Theory and Mathematics ,language ,Factored language model ,020201 artificial intelligence & image processing ,Trigram ,Language model ,Artificial intelligence ,business ,computer ,Word (computer architecture) ,Sentence ,Natural language processing ,Information Systems - Abstract
Language modeling is a statistical technique to represent text data in machine-readable format. It finds the probability distribution of the sequences of words present in the text. A language model estimates the likelihood of upcoming words in some spoken or written conversation. The Markov assumption enables a language model to predict the next word depending on the previous n − 1 words, called an n-gram, in the sentence. A limitation of the n-gram technique is that it utilizes only preceding words to predict the upcoming word. Factored language modeling is an extension of the n-gram technique that makes it possible to integrate grammatical and linguistic knowledge about the words, such as number, gender, and part-of-speech tag, into the model for predicting the next word. Back-off is a method of resorting to fewer preceding words when more words are unavailable in the contextual history. This research work studies the effect of various combinations of linguistic features and generalized back-off strategies on the word-prediction capability of language models for Hindi. The paper empirically compares the results obtained after utilizing linguistic features of Hindi words in a factored language model against a baseline n-gram technique. The language models are compared using the perplexity metric. In summary, the factored language model with the product combine strategy produces the lowest perplexity, 1.881235, which is about 50% less than that of the traditional baseline trigram model.
- Published
- 2020
- Full Text
- View/download PDF
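The perplexity metric used to compare the models above is the exponential of the negative mean per-token log-probability. The per-token probabilities below come from a hypothetical model, purely to show the computation.

```python
import math

def perplexity(logprobs):
    """Perplexity = exp(-mean log-probability) over the test tokens."""
    n = len(logprobs)
    return math.exp(-sum(logprobs) / n)

# Per-token log-probabilities assigned by a hypothetical language model
lp = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
print(round(perplexity(lp), 4))  # 4.0
```

Lower is better: a model that assigned every token probability 0.5 would score perplexity 2, which is why the factored model's 1.88 beats the trigram baseline.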
41. Summarizing Online Movie Reviews: A Machine Learning Approach to Big Data Analytics
- Author
-
Mazen Zaindin, Muhammad Adnan Gul, Shafiq Ahmad, Syed Atif Ali Shah, Muhammad Firdausi, Atif Khan, and M. Irfan Uddin
- Subjects
Article Subject ,Computer science ,business.industry ,Bigram ,Feature vector ,Feature extraction ,Big data ,02 engineering and technology ,Machine learning ,computer.software_genre ,Automatic summarization ,Computer Science Applications ,QA76.75-76.765 ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Trigram ,Word2vec ,Computer software ,Artificial intelligence ,Computational linguistics ,business ,computer ,Software - Abstract
Information is exploding on the web at an exponential pace, and online movie reviews are a substantial source of information for online users. However, users write millions of movie reviews on a regular basis, and it is not possible for users to condense them. Classification and summarization of reviews is a difficult task in computational linguistics. Hence, an automatic method is needed to summarize the vast number of movie reviews, one that permits users to speedily distinguish the positive and negative features of a movie. This work proposes a classification and summarization method for movie reviews. For movie review classification, the bag-of-words feature extraction technique is used to extract unigrams, bigrams, and trigrams as a feature set from the given review documents and represent each review document as a vector. Next, the Naïve Bayes algorithm is employed to categorize the movie reviews (represented as feature vectors) into negative and positive reviews. For the task of movie review summarization, a word2vec model is used to extract features from classified movie review sentences, and then a semantic clustering technique is used to cluster semantically related review sentences. Different text features are employed to compute the salience score of all review sentences in the clusters. Finally, the best-ranked review sentences are picked based on top salience scores to form a summary of the movie reviews. Empirical results indicate that the suggested machine learning approach performs better than benchmark summarization approaches.
- Published
- 2020
- Full Text
- View/download PDF
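The classification half of the pipeline above (bag-of-n-grams features fed to Naïve Bayes) can be sketched in a few lines of stdlib Python. The two-document classes and the reviews are invented; class priors are uniform here and omitted.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """Unigram, bigram, and trigram features from a token list."""
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def train_nb(docs):
    """Multinomial Naive Bayes: per-class n-gram counts plus the shared vocabulary."""
    counts = {c: Counter() for c in docs}
    for c, texts in docs.items():
        for t in texts:
            counts[c].update(ngrams(t.split()))
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(counts, vocab, text):
    """Pick the class with the highest Laplace-smoothed log-likelihood."""
    scores = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        scores[c] = sum(math.log((cnt[f] + 1) / (total + len(vocab)))
                        for f in ngrams(text.split()))
    return max(scores, key=scores.get)

docs = {"pos": ["a truly great movie", "great acting and plot"],
        "neg": ["a truly boring movie", "terrible plot and acting"]}
counts, vocab = train_nb(docs)
print(classify(counts, vocab, "great plot"))  # pos
```

The bigram and trigram features let phrases like "not great" carry weight that unigrams alone would miss, which is the motivation for mixing the three feature orders.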
42. Exploring multinomial naïve Bayes for Yorùbá text document classification
- Author
-
Ikechukwu I. Ayogu
- Subjects
business.industry ,Computer science ,Bigram ,Yoruba ,Supervised learning ,Representation (arts) ,computer.software_genre ,language.human_language ,ComputingMethodologies_PATTERNRECOGNITION ,Text mining ,Categorization ,Bag-of-words model ,language ,Trigram ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
The recent increase in the emergence of Nigerian-language text online motivates this paper, in which the problem of classifying text documents written in the Yorùbá language into one of a few pre-designated classes is considered. Text document classification/categorization research is well established for English and many other languages; this is not so for Nigerian languages. This paper evaluated the performance of a multinomial Naive Bayes model learned on a research dataset consisting of 100 text samples each from the business, sporting, entertainment, technology, and political domains, separately on unigram, bigram, and trigram features obtained using the bag-of-words representation approach. Results show that the performance of the model on unigram and bigram features is comparable, and both are significantly better than a model learned on trigram features. The results generally indicate the possibility of practical application of the NB algorithm to the classification of text documents written in the Yorùbá language. Keywords: supervised learning, text classification, Yorùbá language, text mining, BoW representation
- Published
- 2020
- Full Text
- View/download PDF
43. Identification of Misconceptions about Corona Outbreak Using Trigrams and Weighted TF-IDF Model
- Author
-
Sujatha Arun Kokatnoor and Balachandran Krishnan
- Subjects
General Computer Science ,Computer science ,business.industry ,Word count ,General Engineering ,computer.software_genre ,Naive Bayes classifier ,Bag-of-words model ,Classifier (linguistics) ,Vector space model ,Trigram ,Social media ,Artificial intelligence ,business ,tf–idf ,computer ,Natural language processing - Abstract
Misconceptions about particular issues like health, diseases, politics, government policies, epidemics, and pandemics have been a social problem for many years, particularly since the advent of social media, and often spread faster than the truth. Twitter, one of the most prominent news outlets, continues to be a major source of information today, particularly information distributed around the network. In this paper, the efficacy of a Misconception Detection System was tested on a corona pandemic dataset extracted from Twitter posts. A trigram and weighted TF-IDF model followed by a supervised classifier was used to categorize the dataset into two classes: one with misconceptions about the COVID-19 virus and the other comprising correct and authenticated information. Trigrams were more reliable because the functional words related to coronavirus appeared more frequently in the corpus created. The proposed system, using a combination of trigrams and weighted TF-IDF, gave relevant and normalized scores, leading to efficient creation of a vector space model; this yielded good performance results when compared with traditional approaches using Bag of Words and the Count Vectorizer technique, where the vector space model was created only through word counts. © 2020, Institute of Advanced Scientific Research, Inc. All rights reserved.
- Published
- 2020
- Full Text
- View/download PDF
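The trigram plus TF-IDF weighting above can be sketched as follows: count word trigrams per document, weight each by tf · log(N/df), and L2-normalize, so trigrams shared across the whole corpus are down-weighted relative to distinctive ones. The three sentences are invented examples.

```python
import math
from collections import Counter

def trigrams(tokens):
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tfidf_vectors(docs):
    """TF-IDF over word trigrams: tf * log(N / df), L2-normalized per document."""
    grams = [Counter(trigrams(d.split())) for d in docs]
    df = Counter()
    for g in grams:
        df.update(g.keys())
    n = len(docs)
    vecs = []
    for g in grams:
        v = {t: tf * math.log(n / df[t]) for t, tf in g.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({t: x / norm for t, x in v.items()})
    return vecs

docs = ["the virus spreads by air contact",
        "the virus spreads by water only",
        "masks block spread by air contact"]
vecs = tfidf_vectors(docs)
v0 = vecs[0]
print(v0["spreads by air"] > v0["the virus spreads"])  # True: rarer trigrams weigh more
```

These normalized vectors are what a downstream classifier would consume, in place of the raw counts a Count Vectorizer produces.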
44. Exploratory examination of lexical and neuroanatomic correlates of neglect dyslexia
- Author
-
Olga Boukrina, Peii Chen, Tamara Budinoska, and Anna M. Barrett
- Subjects
Adult ,Male ,medicine.medical_specialty ,media_common.quotation_subject ,Precuneus ,Audiology ,Concreteness ,Functional Laterality ,Article ,050105 experimental psychology ,Neglect ,Dyslexia ,Perceptual Disorders ,medicine ,Humans ,0501 psychology and cognitive sciences ,Aged ,media_common ,Aged, 80 and over ,Brain Mapping ,05 social sciences ,Brain ,Inferior temporal sulcus ,Middle Aged ,medicine.disease ,Magnetic Resonance Imaging ,Stroke ,Word lists by frequency ,Neuropsychology and Physiological Psychology ,medicine.anatomical_structure ,Reading ,Quality of Life ,Female ,Trigram ,Tomography, X-Ray Computed ,Psychology ,Psychomotor Performance ,Orthography - Abstract
OBJECTIVE: This study examined lexical and neuroanatomic correlates of reading errors in individuals with spatial neglect, defined as a failure to respond to stimuli in the side of space opposite a brain lesion, causing functional disability. METHOD: One hundred and ten participants with left spatial neglect after right-hemisphere stroke read aloud a list of 36 words. Reading errors were scored as “contralesional” (error in the left half of the word) or as “other”. The influence of lexical processing on neglect dyslexia was studied with a stepwise regression using word frequency, orthographic neighborhood (number of same length neighbors that differ by 1 letter), bigram and trigram counts (number of words with the same 2- and 3-letter combinations), length, concreteness, and imageability as predictors. MRI/CT images of 92 patients were studied in a voxelwise lesion-symptom analysis (VLSM). RESULTS: Longer length and more trigram neighbors increased, while higher concreteness reduced, the rate of contralesional errors. VLSM revealed lesions in the inferior temporal sulcus, middle temporal and angular gyri, precuneus, temporal pole, and temporo-parietal white matter associated with the rate of contralesional errors. CONCLUSIONS: Orthographic competitors may decrease word salience, while semantic concreteness may help constrain the selection of available word options when it is based on degraded information from the left side of the word. PUBLIC SIGNIFICANCE STATEMENT: Reading impairments arising after stroke represent a devastating problem, restricting an individual’s life participation, independence and quality of life. In this study, we examined reading impairments in neglect dyslexia, a symptom characterized by reading errors in the half of the word opposite a brain lesion. 
To help improve the current understanding of this symptom, we identified specific word characteristics and stroke locations that are associated with increased rates of neglect dyslexia reading errors.
- Published
- 2020
- Full Text
- View/download PDF
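The orthographic neighborhood measure defined above (same-length words differing by exactly one letter) is easy to compute against a lexicon; the tiny lexicon below is illustrative.

```python
import string

def neighbors(word, lexicon):
    """Orthographic neighborhood: same-length words differing by one letter."""
    out = set()
    for i in range(len(word)):
        for ch in string.ascii_lowercase:
            cand = word[:i] + ch + word[i + 1:]
            if cand != word and cand in lexicon:
                out.add(cand)
    return out

lexicon = {"cat", "cot", "cut", "bat", "hat", "car", "dog"}
print(sorted(neighbors("cat", lexicon)))  # ['bat', 'car', 'cot', 'cut', 'hat']
```

The study's bigram and trigram neighbor counts extend the same idea to shared 2- and 3-letter combinations rather than single-letter substitutions.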
45. Performance of Methods in Identifying Similar Languages Based on String to Word Vector
- Author
-
Herry Sujaini
- Subjects
Language identification ,identification of languages ,business.industry ,Computer science ,string to word vector ,String (computer science) ,QA75.5-76.95 ,computer.software_genre ,local languages ,Naive Bayes classifier ,Character (mathematics) ,Electronic computers. Computer science ,Classifier (linguistics) ,Feature (machine learning) ,Trigram ,Artificial intelligence ,business ,computer ,Natural language processing ,Word (computer architecture) - Abstract
Indonesia has a large number of local languages that have cognate words, some of which are similar to each other. Automatic identification within a family of languages faces problems, so it is necessary to learn which language identification method performs the task best. This study made an effort to identify Indonesian local languages using the String to Word Vector approach. A string vector refers to a collection of ordered words; a word is represented as an element or value in the string vector and becomes an attribute or feature in each numeric vector. Among the Naïve Bayes, SMO, J48, and ZeroR classifiers, SMO is found to be the most accurate, with an accuracy of 95.7% for 10-fold cross-validation and 94.4% for a 60%:40% split. The best tokenizer for this classification is the Character N-Gram. All classifiers except ZeroR show increased accuracy when using the Character N-Gram tokenizer compared to the Word tokenizer. The best features for this system are the TriGram and FourGram characters; the TriGram is preferred because it requires less training data. The highest accuracy value in the combination experiment, 0.965, is obtained with the combination IDF = FALSE and WC = TRUE, regardless of the TF setting.
- Published
- 2020
- Full Text
- View/download PDF
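Character-trigram language identification of the kind evaluated above can be sketched by comparing trigram profiles: build a profile per language from training text and assign a probe to the language with the largest profile overlap. The two one-sentence "corpora" and the overlap measure are simplifications for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character trigram profile of a text (spaces kept as word boundaries)."""
    t = f" {text.lower()} "
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def similarity(a, b):
    """Profile overlap: sum of shared trigram counts (min of the two sides)."""
    return sum(min(a[g], b[g]) for g in a if g in b)

def identify(text, profiles):
    probe = char_ngrams(text)
    return max(profiles, key=lambda lang: similarity(probe, profiles[lang]))

# Toy training texts standing in for local-language corpora
profiles = {
    "indonesian": char_ngrams("saya makan nasi goreng setiap pagi"),
    "javanese": char_ngrams("aku mangan sega goreng saben esuk"),
}
print(identify("saya makan sega", profiles))  # indonesian
```

Character n-grams work even when cognates share substrings, because the full distribution of trigrams, not any single word, drives the decision, which matches the study's finding that the Character N-Gram tokenizer beats the Word tokenizer.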
46. Indonesian part of speech tagging using maximum entropy markov model on Indonesian manually tagged corpus
- Author
-
Denis Eka Cahyani and Winda Mustikaningtyas
- Subjects
Maximum entropy markov model ,Part of speech tagging ,Information Systems and Management ,Artificial Intelligence ,Control and Systems Engineering ,N-gram ,Trigram ,Bigram ,Electrical and Electronic Engineering - Abstract
This research discusses the development of a part-of-speech (POS) tagging system to solve the problem of word ambiguity. This paper presents a new method, the maximum entropy Markov model (MEMM), to solve word ambiguity on an Indonesian dataset. A manually labeled "Indonesian manually tagged corpus" was used as data. The corpus is processed using the entropy formula to obtain a weight for the word being searched for, which is then fed into the MEMM bigram and MEMM trigram algorithms, together with the previously obtained rules, to determine the POS tag with the highest probability. The results show that POS tagging using the MEMM method has advantages over previously used methods on the same data, and this paper improves on the performance evaluation of that previous research. The resulting average accuracy is 83.04% for the MEMM bigram algorithm and 86.66% for the MEMM trigram; the MEMM trigram algorithm is better than the MEMM bigram algorithm.
- Published
- 2022
47. Machine Learning Based Approach For Prediction Of Suicide Related Activity
- Author
-
Neha Soni and Hemal Patel
- Subjects
business.industry ,Computer science ,Bigram ,Commit ,Machine learning ,computer.software_genre ,Random forest ,Support vector machine ,Naive Bayes classifier ,medicine ,Trigram ,Social media ,Artificial intelligence ,medicine.symptom ,business ,computer ,Suicidal ideation - Abstract
Suicidal ideation is a significant public health issue that kills a large number of people every year: over 800,000 people commit suicide worldwide annually. As their use of social networking sites has grown, users have utilized them to discuss personal problems, including exchanging suicide plans. Suicide can be prevented by analyzing such posts on social media sites, either manually or automatically. Suicide-related topics gathered from social media sites can reveal spikes in suicidal ideation. Prior research has identified individual terms or phrases from tweets that predict suicide risk factors, but this study uses an n-gram model to compute scores, combining unigrams, bigrams, and trigrams with a dictionary. Suicide attempts are categorized into three groups: low, medium, and high risk. The purpose of this research is to use a range of important data, such as linguistic and location-based factors as well as emoticons, to better identify cases of suicide risk. Machine-learning-based approaches such as Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), and Extra Tree are used to categorize suicidal cases.
- Published
- 2021
- Full Text
- View/download PDF
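The dictionary-combined n-gram scoring described above can be sketched as follows: sum dictionary weights over matched unigrams, bigrams, and trigrams, then bucket the total into low/medium/high. The phrase dictionary, weights, and thresholds here are all hypothetical illustrations, not a clinically validated lexicon.

```python
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def risk_score(text, lexicon):
    """Sum dictionary weights over matched unigrams, bigrams, and trigrams."""
    tokens = text.lower().split()
    score = 0.0
    for n in (1, 2, 3):
        score += sum(lexicon.get(g, 0.0) for g in ngrams(tokens, n))
    return score

def risk_band(score):
    """Map a score to a coarse risk band; thresholds are illustrative."""
    return "high" if score >= 3.0 else "medium" if score >= 1.0 else "low"

# Hypothetical weighted phrase dictionary (invented for this sketch)
lexicon = {"hopeless": 1.0, "end it": 1.5, "no reason to": 2.0}
s = risk_score("i feel hopeless and want to end it", lexicon)
print(risk_band(s))  # medium
```

In the study, such scores become features alongside location and emoticon signals for the SVM/NB/RF classifiers rather than a standalone decision rule.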
48. Prediction of words and morphological features using machine learning methods based on n-grams
- Author
-
Kujundžić, Pavao and Vuković, Marin
- Subjects
quadgram ,obrada prirodnog jezika ,ngramski nizovi ,Good Turing smoothing method ,TEHNIČKE ZNANOSTI. Računarstvo ,unigram ,ngram sequences ,ngramski jezični modeli ,bigram ,Good Turing metoda zaglađivanja [Laplasova metoda zaglađivanja] ,ngram language models ,linear interpolation ,Laplace smoothing method ,TECHNICAL SCIENCES. Computing ,morfološke značajke riječi ,classla ,morphological features of words ,natural language processing ,trigram - Abstract
This thesis addresses the prediction of words and morphological features using machine learning methods based on n-gram sequences. It presents the technologies and tools used to load different training data, from which various n-gram language models are created and trained using the Laplace smoothing method or the Good-Turing smoothing method. The use of linear interpolation in language models that combine multiple n-gram language models is explained, as well as the determination of the computational parameters λ that set the contribution of each n-gram language model to the probability estimates derived from the training data. The Classla pipeline, which contains all the technologies needed to recognize and analyze the morphological features of words, is also described in detail.
Keywords: natural language processing, n-gram sequences, n-gram language models, unigram, bigram, trigram, quadgram, Laplace smoothing method, Good-Turing smoothing method, linear interpolation, morphological features of words, classla.
- Published
- 2021
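Two of the techniques named above, Laplace (add-one) smoothing and linear interpolation of n-gram orders, fit in a short sketch. The corpus and the interpolation weight λ = 0.7 are illustrative choices.

```python
from collections import Counter

def train(tokens):
    uni, bi = Counter(tokens), Counter(zip(tokens, tokens[1:]))
    return uni, bi

def p_laplace_bigram(uni, bi, w1, w2, vocab_size):
    """Add-one (Laplace) smoothed bigram probability P(w2 | w1)."""
    return (bi[(w1, w2)] + 1) / (uni[w1] + vocab_size)

def p_interpolated(uni, bi, w1, w2, total, vocab_size, lam=0.7):
    """Linear interpolation of the bigram and unigram estimates."""
    p_uni = (uni[w2] + 1) / (total + vocab_size)
    return lam * p_laplace_bigram(uni, bi, w1, w2, vocab_size) + (1 - lam) * p_uni

tokens = "the cat sat on the mat the cat ran".split()
uni, bi = train(tokens)
V = len(uni)
p = p_interpolated(uni, bi, "the", "cat", len(tokens), V)
print(round(p, 3))  # 0.293
```

In the thesis's setup, the λ weights are themselves tuned on held-out data rather than fixed, and a Good-Turing discount can replace the add-one counts without changing the interpolation step.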
49. Effectiveness Analysis of Different POS Tagging Techniques for Bangla Language
- Author
-
Jueal Mia, Al Amin Biswas, and Mehedee Hassan
- Subjects
Machine translation ,business.industry ,Computer science ,Bigram ,Deep learning ,Sentiment analysis ,Context (language use) ,computer.software_genre ,language.human_language ,n-gram ,Bengali ,language ,Trigram ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
Parts-of-speech (POS) tagging plays an important role in natural language processing (NLP): retrieval of information, machine translation, spell checking, language processing, sentiment analysis, and so on. Much work has been done on Bangla part-of-speech (POS) tagging using machine learning, but the results are not good enough. In fact, hardly any effective research has been conducted on Bangla POS tagging using deep learning, owing to data scarcity. Our context is therefore Bangla POS tagging employing both machine learning and deep learning approaches. In our research, we have compared some well-known supervised POS tagging approaches (Brill, HMM, unigram, bigram, trigram, and a recurrent neural network) for the Bangla language. Supervised POS tagging techniques require a large dataset to tag accurately; that is why we have used a large dataset for POS tagging of Bangla, accepting raw Bangla text and producing POS-tagged output that can be used directly in other NLP applications. After the comparison, we identify the best tagging approach in terms of performance. Bangla is an inflectional language, which makes determining the grammatical categories of Bangla words a very tough job; nevertheless, our proposed model works well for Bangla.
- Published
- 2021
- Full Text
- View/download PDF
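The unigram/bigram/trigram taggers compared above are usually chained with backoff: try the trigram context first, fall back to bigram, then unigram, then a default tag. A minimal stdlib version, with a tiny hand-made corpus whose Bangla tokens are illustrative only:

```python
from collections import Counter, defaultdict

class NGramBackoffTagger:
    """Backoff POS tagger: try trigram context, then bigram, then unigram."""
    def __init__(self, tagged_sents, default="NOUN"):
        self.default = default
        # tables[n] maps (previous-n-tags, word) -> Counter of observed tags
        self.tables = [defaultdict(Counter) for _ in range(3)]
        for sent in tagged_sents:
            words = [w for w, _ in sent]
            tags = [t for _, t in sent]
            for i in range(len(words)):
                for n in range(3):
                    if i - n >= 0:
                        ctx = tuple(tags[i - n:i])
                        self.tables[n][(ctx, words[i])][tags[i]] += 1

    def tag(self, words):
        out = []
        for i, w in enumerate(words):
            tag = self.default
            for n in (2, 1, 0):          # back off toward smaller contexts
                if i - n < 0:
                    continue
                dist = self.tables[n].get((tuple(out[i - n:i]), w))
                if dist:
                    tag = dist.most_common(1)[0][0]
                    break
            out.append(tag)
        return out

corpus = [[("ami", "PRN"), ("bhat", "NOUN"), ("khai", "VERB")],
          [("tumi", "PRN"), ("boi", "NOUN"), ("poro", "VERB")]]
tagger = NGramBackoffTagger(corpus)
print(tagger.tag(["ami", "boi", "poro"]))  # ['PRN', 'NOUN', 'VERB']
```

This mirrors the backoff chaining available in toolkits such as NLTK's n-gram taggers; unknown words fall through every table to the default tag, which is where the Brill, HMM, and RNN approaches in the comparison gain their advantage.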
50. Sentiment Analysis of The Covid-19 Vaccine For Arabic Tweets Using Machine Learning
- Author
-
Abdullah Alsaeedi, Ruba Alhejaili, Wael M.S. Yafooz, and Enas S. Alhazmi
- Subjects
Computer science ,business.industry ,Bigram ,Sentiment analysis ,Decision tree ,Machine learning ,computer.software_genre ,Random forest ,Support vector machine ,Naive Bayes classifier ,Trigram ,AdaBoost ,Artificial intelligence ,business ,computer - Abstract
Through social media, everyone has the freedom to publish their opinions and feelings without restriction. These opinions and feelings may concern individuals or general trends in society, and those related to topics or general trends must be analyzed. The purpose of this paper is to analyze sentiment in tweets posted about the COVID-19 vaccine. For this purpose, a dataset of tweets related to the COVID-19 vaccine recorded between February 19th and March 20th, 2021 was collected. Several classic machine learning methods were utilized, namely Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), AdaBoost, K-Nearest Neighbors (KNN), and Gaussian Naive Bayes (GNB), after the preprocessing and annotation steps on the proposed dataset. The TF-IDF feature extraction method was applied, with the dataset trained on unigram, bigram, and trigram features. The best results were achieved with bigrams, where the LR model achieved the highest accuracy, with a value of 87.0%.
- Published
- 2021
- Full Text
- View/download PDF