Descriptor: "n-gram" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"n-gram"' showing total 1,919 results

1. Comparison of Perplexity Scores of Language Models for Telugu Data Corpus in the Agricultural Domain

Author: Rajesh, Pooja, Gupta, Akshita, Immadisetty, Praneeta, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Hassanien, Aboul Ella, editor, Anand, Sameer, editor, Jaiswal, Ajay, editor, and Kumar, Prabhat, editor
Published: 2025
Full Text: View/download PDF

2. Exploring low-level statistical features of n-grams in phishing URLs: a comparative analysis with high-level features.

Author: Tashtoush, Yahya, Alajlouni, Moayyad, Albalas, Firas, and Darwish, Omar
Subjects: *MACHINE learning, *PATTERN recognition systems, *CONVOLUTIONAL neural networks, *FEATURE selection, *STATISTICAL learning, *UNIFORM Resource Locators, *DEEP learning
Abstract: Phishing attacks are the biggest cybersecurity threats in the digital world. Attackers exploit users by impersonating real, authentic websites to obtain sensitive information such as passwords and bank statements. One common technique in these attacks is using malicious URLs. These malicious URLs mimic legitimate URLs, misleading users into interacting with malicious websites. This practice, URL phishing, presents a big threat to internet security, emphasizing the need for advanced detection methods. So we aim to enhance phishing URL detection by using machine learning and deep learning models, leveraging a set of low-level URL features derived from n-gram analysis. In this paper, we present a method for detecting malicious URLs using statistical features extracted from n-grams. These n-grams are extracted from the hexadecimal representation of URLs. We employed 4 experiments in our paper. The first 3 experiments used machine learning with the statistical features extracted from these n-grams, and the fourth experiment used these grams directly with deep learning models to evaluate their effectiveness. Also, we used Explainable AI (XAI) to explore the extracted features and evaluate their importance and role in phishing detection. A key advantage of our method is its ability to reduce the number of features required and reduce the training time by using fewer features after applying XAI techniques. This stands in contrast to the previous study, which relies on high-level URL features and needs pre-processing and a high number of features (87 high-level URL-based features). So our technique only uses statistical features extracted from n-grams and the n-gram itself, without the need for any high-level features. Our method is evaluated across different n-gram lengths (2, 4, 6, and 8), aiming to optimize detection accuracy. We conducted four experiments in our study. 
In the first experiment, we focused on extracting and using 12 common statistical features like mean, median, etc. In the first experiment, the XGBoost model achieved the highest accuracy using 8-gram features with 82.41%. In the second experiment, we expanded the feature set and extracted an additional 13 features, so our feature count became 25. XGBoost in the second experiment achieved the highest accuracy with 86.40%. Accuracy improvement continued in the third experiment, we extracted an additional 16 features (character count features), and these features increased XGBoost accuracy to 88.15% in the third experiment. In the fourth experiment, we directly fed n-gram representations into deep learning models. The Convolutional Neural Network (CNN) model achieved the highest accuracy of 94.09% in experiment four. Also, we applied XAI techniques, SHapley Additive exPlanations (SHAP), and Local Interpretable Model-agnostic Explanations (LIME). Through the explanation provided by XAI methods, we were able to determine the most important features in our feature set, enabling a reduction in feature count. Using fewer features (4, 7, 10, 13, 15), we got good accuracy compared to the 41 features used in experiment three and reduced the models' training times and complexity. This research aimed to enhance phishing URL detection by using machine learning and deep learning models, leveraging a set of low-level URL features derived from n-gram analysis. Our findings show the importance of using minimal statistical features to identify malicious URLs. Notably, the use of CNN had a great advancement, achieving an accuracy rate of 94.09% with using n-grams of URLs, surpassing traditional machine learning models. 
This achievement not only validates the efficacy of deep learning models in complex pattern recognition tasks but also highlights the efficiency of our feature selection approach, which relies on a lower number of features and is less complex compared to existing high-level feature-based studies. The research outcomes demonstrate a promising pathway toward developing more robust, efficient, and scalable phishing detection systems. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
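
Illustrative note on item 2: the abstract describes extracting n-grams from the hexadecimal representation of a URL and computing common statistical features (mean, median, etc.) over them. The sketch below is one possible reading, not the authors' implementation. The function names, the use of overlapping character-level n-grams over the hex string, and the choice of six features are assumptions made for illustration; the paper's actual 12, 25, and 41 feature sets are not listed in the abstract.

```python
from statistics import mean, median, pstdev


def hex_ngrams(url: str, n: int) -> list[str]:
    """Return overlapping n-grams over the hex encoding of a URL.

    Assumption: n counts hex characters and windows overlap with stride 1.
    """
    hex_text = url.encode("utf-8").hex()
    return [hex_text[i:i + n] for i in range(len(hex_text) - n + 1)]


def ngram_stats(url: str, n: int) -> dict[str, float]:
    """Compute a few simple statistics over the numeric values of the n-grams.

    The feature choice here is hypothetical; it only mirrors the
    "mean, median, etc." wording of the abstract.
    """
    grams = hex_ngrams(url, n)
    if not grams:
        raise ValueError("URL is too short for the requested n-gram length")
    # Interpret each hex n-gram as an integer so statistics can be taken.
    values = [int(g, 16) for g in grams]
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": pstdev(values),
        "min": min(values),
        "max": max(values),
        "distinct": len(set(grams)),
    }


if __name__ == "__main__":
    # The abstract evaluates n-gram lengths 2, 4, 6, and 8.
    for n in (2, 4, 6, 8):
        print(n, ngram_stats("http://example.com/login", n))
```

A feature dictionary of this kind could then be passed to a classifier such as the XGBoost model named in the abstract; that step is omitted here.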

3. Interpreter-mediated political communication N-Grammed: a corpus-driven discourse analysis of government interpreters' (ideological) use of formulaic language.

Author: Gu, Chonglong and Li, Dechao
Subjects: CRITICAL discourse analysis, POLITICAL communication, DISCURSIVE practices, LANGUAGE acquisition, DISCOURSE analysis
Abstract: Formulaic expressions/prefabricated chunks are established as crucial in fluent speech production in psycholinguistics and language acquisition/learning yet have been largely underexplored in interpreting studies, barring a few experimental studies. Formulaic expressions are particularly underexplored from a discursive perspective in interpreting, that is, how interpreters might employ formulaic expressions for discursive mediation. Drawing on 20 years of China's political press conferences data, this study conducts a corpus-driven critical discourse analysis to explore the ideological/discursive properties of the linguistic category using the N-Gram function. The findings reveal that the interpreters' formulaic language use manifests at multiple levels: Instead of being seemingly routine and innocuous, formulaic expressions used in interpreting can be ideologically salient in (re)constructing versions of truth, fact, and reality, discursively further strengthening China's stance in the global lingua franca English. Contextualised bilingual examples are provided to demonstrate interpreters' mediation. This interdisciplinary study contributes to current understanding of government interpreters' agency in a changing sociocultural and geopolitical context. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

4. ViSL model: The model automatically generates sentences of Vietnamese sign language

Author: Khanh Dang and Igor A. Bessmertny
Subjects: vietnamese sign language, sign language model, automatic sentence generation, n-gram, markov model, breadth-first search, data enrichment, grammatical rules, Optics. Light, QC350-467, Electronic computers. Computer science, QA75.5-76.95
Abstract: The main problem in building intelligent systems is the lack of data for machine learning, which is especially important for sign language recognition for the deaf and hard of hearing. One of the ways to increase the amount of data for training is synthesis. Unlike speech synthesis, it is impossible to create a sequence of gestures in Vietnamese and some other languages that exactly repeat the text. This is due to the significant limitations of the gesture dictionary and the different word order in sentences. The aim of the work is to enrich the educational corpus of video data for use in creating recognition systems for the Vietnamese Sign Language (ViSL). Since it is impossible to translate the words of the source text into gestures one to one, the problem of translating from a regular language into a sign language arises. The paper proposes to use a two-phase process for this. The first phase involves pre-processing the text with standardization of the text format, segmentation of words and sentences, and then encoding the words using the sign language dictionary. At this stage, it should be noted that there is no need to remove punctuation marks and stop words, since they are related to the accuracy of the N-gram model. Next, instead of using syntactic analysis, a statistical method for forming a sequence of gestures is used, and the Markov model on the transition graph between words is taken as a basis in which the probability of the next word depends only on the two previous words. Transition probabilities are calculated on the existing marked corpus of the ViSL. The Breadth-first Search method is used to compile a list of all sentences generated based on a given grammatical rule and a matrix of semantic interactions between words. The inverse of the logarithm of the product of the probabilities of co-occurrence of consecutive 3-word phrases in a sentence is used to estimate the frequency of occurrence of that sentence in a given data set. 
Based on the ViSL data of 3,234 words, we calculated probability matrices representing the relationships between words based on Vietnamese natural language data with 50 million sentences collected from Vietnamese newspapers and magazines. For different grammar rules, we compare the number of generated sentences and evaluate the accuracy of the 50 most frequent sentences. The average accuracy is 88 %. The accuracy of the generated sentences is estimated by manual statistical methods. The number of generated sentences depends on the number of word parts that are labeled according to the grammar rules. The semantic accuracy of the generated sentences will be very high if the search words are labeled with the correct part-of-speech tagging. Compared with machine learning methods, our proposed method gives very good results for languages without inflections and word order that follow certain rules, such as Vietnamese, and does not require large computational resources. The disadvantage of this method is that its accuracy largely depends on the type of word, sentence, and word segmentation. The relationship of words depends on the observed dataset. Future research direction is to generate paragraphs in sign language. The obtained data can be used in machine learning models for sign language processing tasks.
Published: 2024
Full Text: View/download PDF
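
Illustrative note on item 4: the abstract describes a Markov model in which the next word depends only on the two previous words, with transition probabilities estimated from a corpus and a sentence score derived from the logarithm of the product of 3-word probabilities. The sketch below is a minimal reading of that idea, not the authors' code. The function names and the toy corpus are hypothetical, the probabilities are plain maximum-likelihood estimates, and the score is taken as the negative log-probability, which is only one plausible interpretation of the abstract's "inverse of the logarithm".

```python
import math
from collections import Counter


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    trigram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        for i in range(len(words) - 2):
            trigram_counts[tuple(words[i:i + 3])] += 1
            context_counts[tuple(words[i:i + 2])] += 1
    return trigram_counts, context_counts


def neg_log_prob(words, trigram_counts, context_counts):
    """Sum of -log P(w3 | w1, w2) over consecutive trigrams in a sentence.

    Lower values mean the sentence is more likely under the model.
    Unseen trigrams get an infinite score (no smoothing is applied here).
    """
    total = 0.0
    for i in range(len(words) - 2):
        count = trigram_counts[tuple(words[i:i + 3])]
        if count == 0:
            return math.inf
        total -= math.log(count / context_counts[tuple(words[i:i + 2])])
    return total


if __name__ == "__main__":
    # Toy English corpus standing in for the ViSL corpus.
    corpus = [
        ["i", "go", "home"],
        ["i", "go", "school"],
        ["i", "go", "home"],
    ]
    tri, ctx = train_trigrams(corpus)
    for candidate in (["i", "go", "home"], ["i", "go", "school"]):
        print(candidate, neg_log_prob(candidate, tri, ctx))
```

Candidate sentences produced by a search over grammar rules could be ranked by this score; the breadth-first generation step described in the abstract is not shown.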

5. Ensemble machine learning technique-based plagiarism detection over opinions in social media

Author: Sethu Vinayaga Vadivu, Palanigurupackiam Nagaraj, and Bagavathi Ammai Shanmugam Murugan
Subjects: Plagiarism, n-gram, support vector machine, African vulture optimization, opinion mining, social media, Control engineering systems. Automatic machinery (General), TJ212-225, Automation, T59.5
Abstract: With the progressive enhancement of social media, several people prefer posting their opinions on various social media instead of posting on radios, television or newspapers. The postings differ in dimensions and include various titles and comments. Nowadays, the formation of plagiarism is increasing tremendously which occurs by rewriting or repeating one’s work. There are many ways to detect plagiarism by browsing through the internet. The significant intention of this paper involves the detection of plagiarism in social media using four different phases, namely the data pre-processing phase, n-gram evaluation, similarity/distance computation analysis and the plagiarism detection phase. The pre-processing includes data cleaning processes, such as the removal of redundant data, upper case letters, noise, irrelevant punctuations and characterizing into a vector form. After pre-processing the data are fed for n-gram evaluation to develop a posting attribution system. Then finally, an ensemble support vector machine-based African vulture optimization (ESVM-AVO) approach is employed to detect plagiarism which signifies that the performance based on detection is enhanced and the execution time in obtaining a high rate of detection accuracy is very low. Finally, the performance evaluation and the comparative analysis are carried out to determine the performance of the proposed system.
Published: 2024
Full Text: View/download PDF

6. A Combinatorial Strategy for API Completion: Deep Learning and Heuristics.

Author: Liu, Yi, Yin, Yiming, Deng, Jia, Li, Weimin, and Peng, Zhichao
Subjects: STATISTICS, SOFTWARE libraries (Computer programming), TRANSFORMER models, HEURISTIC, DEEP learning, SEMANTICS
Abstract: Remembering software library components and mastering their application programming interfaces (APIs) is a daunting task for programmers, due to the sheer volume of available libraries. API completion tools, which predict subsequent APIs based on code context, are essential for improving development efficiency. Existing API completion techniques, however, face specific weaknesses that limit their performance. Pattern-based code completion methods that rely on statistical information excel in extracting common usage patterns of API sequences. However, they often struggle to capture the semantics of the surrounding code. In contrast, deep-learning-based approaches excel in understanding the semantics of the code but may miss certain common usages that can be easily identified by pattern-based methods. Our insight into overcoming these challenges is based on the complementarity between these two types of approaches. This paper proposes a combinatorial method of API completion that aims to exploit the strengths of both pattern-based and deep-learning-based approaches. The basic idea is to utilize a confidence-based selector to determine which type of approach should be utilized to generate predictions. Pattern-based approaches will only be applied if the frequency of a particular pattern exceeds a pre-defined threshold, while in other cases, deep learning models will be utilized to generate the API completion results. The results showed that our approach dramatically improved the accuracy and mean reciprocal rank (MRR) in large-scale experiments, highlighting its utility. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
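
Illustrative note on item 6: the abstract describes a confidence-based selector that uses a pattern-based prediction only when the pattern's frequency exceeds a pre-defined threshold and otherwise falls back to a deep learning model. The sketch below shows that control flow under stated assumptions: the pattern table is a simple n-gram frequency table over API call sequences, the names `build_pattern_table` and `complete` are hypothetical, and the deep learning model is represented by an arbitrary callable.

```python
from collections import Counter
from typing import Callable, Sequence


def build_pattern_table(sequences, context_len: int = 2):
    """Map each context of API calls to a Counter of the calls that follow it."""
    table: dict[tuple[str, ...], Counter] = {}
    for seq in sequences:
        for i in range(len(seq) - context_len):
            context = tuple(seq[i:i + context_len])
            table.setdefault(context, Counter())[seq[i + context_len]] += 1
    return table


def complete(
    context: Sequence[str],
    table: dict,
    fallback: Callable[[Sequence[str]], str],
    threshold: int = 3,
    context_len: int = 2,
) -> str:
    """Return the pattern-based suggestion if it is frequent enough,
    otherwise defer to the fallback model (a stand-in for a neural model)."""
    candidates = table.get(tuple(context[-context_len:]))
    if candidates:
        api, freq = candidates.most_common(1)[0]
        if freq > threshold:  # pattern frequency must exceed the threshold
            return api
    return fallback(context)


if __name__ == "__main__":
    history = [["open", "read", "close"]] * 4 + [["open", "write", "flush"]]
    table = build_pattern_table(history)
    model = lambda ctx: "model_guess"  # placeholder for a learned model
    print(complete(["open", "read"], table, model))   # frequent pattern
    print(complete(["open", "write"], table, model))  # rare pattern
```

The threshold value and context length are tuning choices in this sketch; the abstract does not state the values used in the paper.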

7. Using Machine Learning and Natural Language Processing for Unveiling Similarities between Microbial Data.

Author: Brezočnik, Lucija, Žlender, Tanja, Rupnik, Maja, and Podgorelec, Vili
Subjects: *NATURAL language processing, *CATTLE manure, *POLLUTANTS, *HYPERVARIABLE regions, *HIERARCHICAL clustering (Cluster analysis)
Abstract: Microbiota analysis can provide valuable insights in various fields, including diet and nutrition, understanding health and disease, and in environmental contexts, such as understanding the role of microorganisms in different ecosystems. Based on the results, we can provide targeted therapies, personalized medicine, or detect environmental contaminants. In our research, we examined the gut microbiota of 16 animal taxa, including humans, as well as the microbiota of cattle and pig manure, where we focused on 16S rRNA V3-V4 hypervariable regions. Analyzing these regions is common in microbiome studies but can be challenging since the results are high-dimensional. Thus, we utilized machine learning techniques and demonstrated their applicability in processing microbial sequence data. Moreover, we showed that techniques commonly employed in natural language processing can be adapted for analyzing microbial text vectors. We obtained the latter through frequency analyses and utilized the proposed hierarchical clustering method over them. All steps in this study were gathered in a proposed microbial sequence data processing pipeline. The results demonstrate that we not only found similarities between samples but also sorted groups' samples into semantically related clusters. We also tested our method against other known algorithms like the Kmeans and Spectral Clustering algorithms using clustering evaluation metrics. The results demonstrate the superiority of the proposed method over them. Moreover, the proposed microbial sequence data pipeline can be utilized for different types of microbiota, such as oral, gut, and skin, demonstrating its reusability and robustness. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

8. Automatic Semantic Annotation of Indonesian Language Phrase Using N-Gram Language Model.

Author: Wardani, Dewi and Evangelista, Chania
Abstract: Building semantic data populations in unstructured data or text is challenging. In this type of data, several problems can be raised, some of which are difficult to analyze. Some groups of words or expressions cannot be defined according to their meaning and can be a source of ambiguity. It can have a different meaning depending on the context of its use. This work aims to automatically annotate Indonesian Language text, especially phrases, with the existing knowledge base. The result is text with semantic markup. Machines can automatically process this type of text because it describes its meaning. This work applies an n-gram language model to identify meaningful phrases and defines them as a unit so that every existing word or phrase is automatically semantically tagged. This work uses the DBpedia and schema.org knowledge base. The percentage of successfully labeled data in this job was 78% with 84.95% accuracy using DBpedia and 5.9% with 97.46% accuracy using schema.org. Some factors affect the accuracy score, including the availability of the required data with the data contained in the knowledge base, the system's ability in the POS tagging process, and many new terminology and local cultures that have not yet been contained in the knowledge bases, especially schema.org that is utilized as a standard for all search engines. This work will help the machine understand the semantics of text data. All pages obtained will be semantically tagged and, therefore, will be understood by machines. This ability will support the following processes. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

9. Ensemble machine learning technique-based plagiarism detection over opinions in social media.

Author: Vadivu, Sethu Vinayaga, Nagaraj, Palanigurupackiam, and Shanmugam Murugan, Bagavathi Ammai
Subjects: MACHINE learning, PLAGIARISM, SOCIAL media, DATA scrubbing, USER-generated content
Abstract: With the progressive enhancement of social media, several people prefer posting their opinions on various social media instead of posting on radios, television or newspapers. The postings differ in dimensions and include various titles and comments. Nowadays, the formation of plagiarism is increasing tremendously which occurs by rewriting or repeating one’s work. There are many ways to detect plagiarism by browsing through the internet. The significant intention of this paper involves the detection of plagiarism in social media using four different phases, namely the data pre-processing phase, n-gram evaluation, similarity/distance computation analysis and the plagiarism detection phase. The pre-processing includes data cleaning processes, such as the removal of redundant data, upper case letters, noise, irrelevant punctuations and characterizing into a vector form. After pre-processing the data are fed for n-gram evaluation to develop a posting attribution system. Then finally, an ensemble support vector machine-based African vulture optimization (ESVM-AVO) approach is employed to detect plagiarism which signifies that the performance based on detection is enhanced and the execution time in obtaining a high rate of detection accuracy is very low. Finally, the performance evaluation and the comparative analysis are carried out to determine the performance of the proposed system. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

10. A Technological Framework to Support Asthma Patient Adherence Using Pictograms.

Author: Figueroa, Rosa, Taramasco, Carla, Lagos, María Elena, Martínez, Felipe, Rimassa, Carla, Godoy, Julio, Pino, Esteban, Navarrete, Jean, Pinto, Jose, Nazar, Gabriela, Pérez, Cristhian, and Herrera, Daniel
Subjects: PATIENT compliance, ASTHMATICS, CLINICAL indications, OLDER patients, ARTIFICIAL intelligence
Abstract: Background: Low comprehension and adherence to medical treatment among the elderly directly and negatively affect their health. Many elderly patients forget medical instructions immediately after their appointments, misunderstand them, or fail to recall them altogether. Some identified causes include the short time slots allocated for appointments in the public health system in Chile, the complex terminology used by healthcare professionals, and the stress experienced by patients during appointments. One approach to improving patients' adherence to medical treatment is to combine written and oral instructions with graphical elements such as pictograms. However, several challenges arise due to the ambiguity of natural language and the need for pictograms to accurately represent various medication combinations, doses, and frequencies. Objective: This study introduces SIMAP (System for Integrating Medical Instructions with Pictograms), a technological framework aimed at enhancing adherence among asthma patients through the delivery of pictograms via a computational system. SIMAP utilizes a collaborative and user-centered methodology, involving health professionals and patients in the construction and validation of its components. Methods: The technological framework presented in this study is composed of three parts. The first two are medical indications and pictograms related to the treatment of the disease. Both components were developed through a comprehensive and iterative methodology that incorporates both qualitative and quantitative approaches. This methodology includes the utilization of focus groups, interviews, paper and online surveys, as well as expert validation, ensuring a robust and thorough development. 
The core of SIMAP is the technological component that leveraged artificial intelligence methods for natural language processing to analyze, tokenize, and associate words and their context to a set of one or more pictograms, addressing issues such as the ambiguity in the text, the cultural factor that involves many ways of expressing the same indication, and typographical errors in the indications. Results: Firstly, we successfully validated 18 clinical indications along with their respective pictograms. Some of the pictograms were redesigned based on the validation results. However, in the final validation, the comprehension percentages of the pictograms exceeded 70%. Furthermore, we developed a software called SIMAP, which translates medical indications into previously validated pictograms. Our proposed software, SIMAP, achieves a correct mapping rate of 96.69%. Conclusions: SIMAP demonstrates great potential as a technological component for supplementing medical instructions with pictograms when tested in a laboratory setting. The use of artificial intelligence for natural language processing can successfully map medical instructions, both structured and unstructured, into pictograms. This integration of textual instructions and pictograms holds promise for enhancing the comprehension and adherence of elderly patients to their medical indications, thereby improving their long-term health. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

11. Advancing Sentiment Analysis in Restaurant Reviews through Unsupervised Machine Learning Algorithms.

Author: Gupta, Vijay and Rattan, Punam
Subjects: MACHINE learning, SENTIMENT analysis, RESTAURANT reviews, FOOD quality, RESTAURANTS, FEATURE extraction, CONSUMER preferences
Abstract: Restaurant reviews play a pivotal role in shaping consumer decisions and perceptions. Analyzing these reviews through sentiment analysis provides valuable insights into customer sentiments towards various aspects of dining experiences, such as food quality, service, ambiance, and pricing. By leveraging sentiment analysis techniques, businesses can better understand customer preferences, identify areas for improvement, and enhance overall customer satisfaction. This research focuses on utilizing aspect-based sentiment analysis to predict restaurant survival, leveraging customer-generated content from online reviews. The proposed methodology encompasses data acquisition, pre-processing, feature extraction, and unsupervised approaches-based classification. Data pre-processing involves tokenization, stop word removal, lemmatization, punctuation removal, and filtering short and long words to standardize the format. Feature extraction includes lexicon-based and word encoding methods, leveraging Term Frequency-Inverse Document Frequency (TF-IDF) vectors, Ngram, Bag of Words, and Word Embedding. Unsupervised approaches-based classification entails Fuzzy C-Means (FCM), K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Hierarchical Method, Hybrid Binary Particle Swarm-Optimized FCM, and HBPSO-Optimized Kmeans. Evaluation parameters are defined to assess the performance of each approach. The results showcase the effectiveness of aspect-based sentiment analysis in predicting restaurant survival, with HBPSO-Optimized FCM demonstrating the highest accuracy at 89.50%. These findings underscore the significance of leveraging customer-generated content for informed decision-making in the restaurant industry. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

12. NGST-Net: A N-Gram based Swin Transformer Network for improving multispectral and hyperspectral image fusion

Author: Ziyuan Feng, Xianfeng Zhang, Bo Zhou, Miao Ren, and Xiao Chen
Subjects: Image fusion, hyperspectral, multispectral, Swin Transformer, N-Gram, Mathematical geography. Cartography, GA1-1776
Abstract: Transformer-based deep networks have been widely employed for fusing multispectral and hyperspectral images because the global self-attention mechanism ensures that more pixel information is used in the fusion. However, the unreasonable information interactions between local windows in window self-attention-based Transformer can adversely affect the network’s modeling of the global relationship, resulting in low-quality fusion results. This study proposes a novel N-Gram based Swin Transformer Network (NGST-Net). It utilizes more pixels for modeling the global relationship, mining more spectral information of similar pixels. A N-Gram strategy is proposed to learn the spectral feature relationship between the local window and previous and subsequent windows to guide the information interactions between the local windows. A maskless shifted window strategy enables the efficient modeling of the global relationship. Experimental results show that the proposed NGST-Net has fewer parameters, higher inference speed, and smaller memory footprint than several recently published Transformer methods. The improvements in the SAM, ERGAS, and PSNR over the second-best method are 9.9%, 6.5%, and 1%, respectively, on the CAVE dataset. The inference speed of NGST-Net is 10.7 times faster than that of Fusformer. The proposed NGST-Net is a novel deep network model for the effective fusion of multispectral and hyperspectral images.
Published: 2024
Full Text: View/download PDF

13. Machine Translation of Chinese–Hindi Simple Sentences Using Moses

Author: Rawat, Devendra Singh, Lobiyal, Daya Krishan, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Hirche, Sandra, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Tomar, Anuradha, editor, Mishra, Sukumar, editor, Sood, Y. R., editor, and Kumar, Pramod, editor
Published: 2024
Full Text: View/download PDF

14. Emotional Markers As Indicators of Investor Attitudes: EDA Sub-process Proposal

Author: Kruszewski, Tomasz, Michalak, Joanna, Gaul, Wolfgang, Managing Editor, Vichi, Maurizio, Managing Editor, Weihs, Claus, Managing Editor, Baier, Daniel, Editorial Board Member, Critchley, Frank, Editorial Board Member, Decker, Reinhold, Editorial Board Member, Diday, Edwin, Editorial Board Member, Greenacre, Michael, Editorial Board Member, Lauro, Carlo Natale, Editorial Board Member, Meulman, Jacqueline, Editorial Board Member, Monari, Paola, Editorial Board Member, Nishisato, Shizuhiko, Editorial Board Member, Ohsumi, Noboru, Editorial Board Member, Opitz, Otto, Editorial Board Member, Ritter, Gunter, Editorial Board Member, Schader, Martin, Editorial Board Member, Giordano, Giuseppe, editor, and Misuraca, Michelangelo, editor
Published: 2024
Full Text: View/download PDF

15. Development and Realization of Bigram Models for Recognizing Homonyms in the Uzbek Language

Author: Abjalova, Manzura, Tukeyev, Ualsher, Abduraxmanova, Mukaddas, Adilova, Munojot, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Nguyen, Ngoc Thanh, editor, Chbeir, Richard, editor, Manolopoulos, Yannis, editor, Fujita, Hamido, editor, Hong, Tzung-Pei, editor, Nguyen, Le Minh, editor, and Wojtkiewicz, Krystian, editor
Published: 2024
Full Text: View/download PDF

16. Extraction of Disease Symptoms from Free Text Using Natural Language Processing Techniques

Author: Laabidi, Adil, Aissaoui, Mohammed, Madani, Mohamed Amine, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Yang, Xin-She, editor, Sherratt, Simon, editor, Dey, Nilanjan, editor, and Joshi, Amit, editor
Published: 2024
Full Text: View/download PDF

17. Detection of Cyberbullying Text in Bangla Using N-Gram Analysis and Machine Learning Approaches

Author: Jahan, Busrat, Chowdhury, Muntasir Karim, Mazumder, Shazzad Hossain, Akter, Mariam, Rayan, Muhammad Abu, Rahman, Mohammad Abdur, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Yang, Xin-She, editor, Sherratt, Simon, editor, Dey, Nilanjan, editor, and Joshi, Amit, editor
Published: 2024
Full Text: View/download PDF

18. Building a Corpus for Teaching and Learning a Second Language by Using Sketch Engine

Author: Thao, Phan Thi Thanh, Kacprzyk, Janusz, Series Editor, Bui, Hung Phu, editor, and Namaziandost, Ehsan, editor
Published: 2024
Full Text: View/download PDF

19. A Corpus-Based Study of Lexical Chunks in Chinese Academic Discourse: Extraction, Classification, and Application

Author: Zhou, Qihong, Mou, Li, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Dong, Minghui, editor, Hong, Jia-Fei, editor, Lin, Jingxia, editor, and Jin, Peng, editor
Published: 2024
Full Text: View/download PDF

20. Data Mining Efficiency in the ESG Indexes Verbalization Analysis (on the Example of the MSCI Site)

Author: Goncharova, Oxana V., Khaleeva, Svetlana A., Ladonina, Natalia A., Eremeev, Igor D., Fioktistova, Varvara V., Pisello, Anna Laura, Editorial Board Member, Hawkes, Dean, Editorial Board Member, Bougdah, Hocine, Editorial Board Member, Rosso, Federica, Editorial Board Member, Abdalla, Hassan, Editorial Board Member, Boemi, Sofia-Natalia, Editorial Board Member, Mohareb, Nabil, Editorial Board Member, Mesbah Elkaffas, Saleh, Editorial Board Member, Bozonnet, Emmanuel, Editorial Board Member, Pignatta, Gloria, Editorial Board Member, Mahgoub, Yasser, Editorial Board Member, De Bonis, Luciano, Editorial Board Member, Kostopoulou, Stella, Editorial Board Member, Pradhan, Biswajeet, Editorial Board Member, Abdul Mannan, Md., Editorial Board Member, Alalouch, Chaham, Editorial Board Member, Gawad, Iman O., Editorial Board Member, Nayyar, Anand, Editorial Board Member, Amer, Mourad, Series Editor, Sergi, Bruno S., editor, Popkova, Elena G., editor, Ostrovskaya, Anna A., editor, Chursin, Alexander A., editor, and Ragulina, Yulia V., editor
Published: 2024
Full Text: View/download PDF

21. Emotional Visualization Analysis Based on Online Book User Comments

Author: Xu, Jingxiu, Vinluan, Albert A., Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Zhang, Junjie James, Series Editor, Tan, Kay Chen, Series Editor, Lin, Jerry Chun-Wei, editor, Shieh, Chin-Shiuh, editor, Horng, Mong-Fong, editor, and Chu, Shu-Chuan, editor
Published: 2024
Full Text: View/download PDF

22. Multimodal Authentication Token Through Automatic Part of Speech (POS) Tagged Word Embedding

Author: Kumar, Dharmendra, Sharma, Sudhansh, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Tiwari, Shailesh, editor, Trivedi, Munesh C., editor, Kolhe, Mohan L., editor, and Singh, Brajesh Kumar, editor
Published: 2024
Full Text: View/download PDF

23. KSRE-CNER: A Knowledge and Semantic Relation Enhancement Framework for Chinese NER

Author: Dong, Jikun, Long, Kaifang, Zhu, Jiran, Yu, Hui, Lv, Chen, Shao, Zengzhen, Xu, Weizhi, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Fenrong, editor, Sadanandan, Arun Anand, editor, Pham, Duc Nghia, editor, Mursanto, Petrus, editor, and Lukose, Dickson, editor
Published: 2024
Full Text: View/download PDF

24. Statistical cryptanalysis of seven classical lightweight ciphers

Author: Chatterjee, Runa and Chakraborty, Rajdeep
Published: 2024
Full Text: View/download PDF

25. İnşaat Mühendisleri için Açılan İş İlanlarının Metin Madenciliği ile Değerlendirilmesi [Evaluation of Job Postings for Civil Engineers Using Text Mining].

Author: ÖZBAŞARAN, Hakan, TAŞCI, Gökçenur, TUNA, Mert, KARATAŞ, İpek, ÇELİK, Arif, GÜMÜŞ, Mesut, and BİRŞEN, Mert
Published: 2024
Full Text: View/download PDF

26. Query based biomedical document retrieval for clinical information access with the semantic similarity.

Author: Gupta, Supriya, Sharaff, Aakanksha, and Nagwani, Naresh Kumar
Subjects: CLINICAL decision support systems, INFORMATION retrieval, ACCESS to information, NATURAL language processing, SIMILARITY (Geometry)
Abstract: The amount of exploration done for the available medical literature is quite less and at the same time, there is less awareness of information mining in this specific field. The accessibility of immense quantity of biomedical literature has opened up additional opportunities to apply Information Retrieval and NLP methods for mining existing archives. Therefore, a query based retrieval application (QBR) based on hybrid similarity of string and semantic similarity can help medical professionals in their ongoing research. There are multiple benefits of utilizing various NLP applications, for example, information retrieval engine, and clinical diagnosis frameworks for decision support in medical field. These applications depend on the capacity to gauge Hybrid textual similarity (HTS) and N-Gram similarity. Hybrid similarity is the combination of weighting function and word embedding models providing similarity scores with optimum results. In this work, the main focus is on building of a new biomedical document retrieval model which can pull relevant literature for clinical decision support system based on the specific query. There is also an attempt to compare the statistical and NLP based approaches of query based biomedical document retrieval with the baseline systems. Analysis of the proposed method inclusive of semantic word embeddings shows promising results for both of the suggested similarities. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

27. Phrase Detection's Impact on Sentiment Analysis of Public Opinion and online Media Toward Political Figures

Author: Muhammad Irsa Nurodin and Yan Puspitarani
Subjects: sentiment analysis, text preprocessing, n-gram, politic, opinion, online media, twitter, Electronic computers. Computer science, QA75.5-76.95, Computer engineering. Computer hardware, TK7885-7895
Abstract: Public opinion of political figures and policy significantly impacts general elections. Sentiment analysis, as a method to comprehend opinion and emotion in texts, requires the step of text preprocessing to improve data quality. However, textual data often contains irrelevant words and ambiguous language. These conditions can impact the accuracy of sentiment analysis. Given the significance of precisely interpreting public opinion toward political figures, these issues may result in biased or inaccurate sentiment analysis outcomes. Irregular punctuation or unclear language can disturb the text's intended context, compromising sentiment analysis quality. Additionally, irrelevant words can obscure the focus of the analysis, causing fundamental changes in the original text's meaning. This research focuses on the impact of a specific preprocessing technique, namely Phrase Detection with N-Gram, on sentiment analysis of political figures. By applying this method, the study aims to detail the effects of using Bigram, Trigram, and Unigram on the quality of sentiment analysis, particularly in the context of political figures on Twitter and online media articles. This research indicates that using Bigram in Phrase Detection provides more significant results than Trigram and Unigram for most political figures on Twitter, with the highest accuracy score of 88.23%. Sentiment analysis of articles in online media also indicates various results depending on the type of N-Gram. The results indicate that using N-gram phrase detection can influence the accuracy of sentiment analysis, and the resulting accuracy values are fairly high.
Published: 2024
Full Text: View/download PDF

28. Korean Voice Phishing Detection Applying NER With Key Tags and Sentence-Level N-Gram

Author: Seunguk Yu, Yejin Kwon, Minju Kim, and Kiseong Lee
Subjects: Voice phishing detection, named entity recognition, N-gram, machine learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: Voice phishing is the criminal act of tricking others into transferring funds or of seeking financial gain based on personal information obtained illegally. The importance of this crime is recognized worldwide, and technical solutions have been proposed to reduce the increasing damage. In this paper, we propose a process for voice phishing detection in Korean by applying named entity recognition (NER) with Key Tags and Sentence-level N-gram. From the perspective of humans, we collect financial counseling texts as a non-phishing dataset since victims confuse voice phishing with them. We carefully select Key Tags that can be meaningful for distinguishing voice phishing and financial counseling texts and combine sentence bundles to effectively detect voice phishing. The experimental results, using ten types of machine learning models, showed that results were maintained when generalizing information by Key Tags and improved when combining text bundles. We hope that the proposed process can be effectively applied to other criminal scenarios in the future.
Published: 2024
Full Text: View/download PDF

29. An Ensemble-Based Multi-Classification Machine Learning Classifiers Approach to Detect Multiple Classes of Cyberbullying

Author: Abdulkarim Faraj Alqahtani and Mohammad Ilyas
Subjects: ensemble models, cyberbullying, multi-classification, multiclass, TF-IDF, N-gram, Computer engineering. Computer hardware, TK7885-7895
Abstract: The impact of communication through social media is currently considered a significant social issue. This issue can lead to inappropriate behavior using social media, which is referred to as cyberbullying. Automated systems are capable of efficiently identifying cyberbullying and performing sentiment analysis on social media platforms. This study focuses on enhancing a system to detect six types of cyberbullying tweets. Employing multi-classification algorithms on a cyberbullying dataset, our approach achieved high accuracy, particularly with the TF-IDF (bigram) feature extraction. Our experiment achieved high performance compared with that stated for previous experiments on the same dataset. Two ensemble machine learning methods, employing the N-gram with TF-IDF feature-extraction technique, demonstrated superior performance in classification. Three popular multi-classification algorithms: Decision Trees, Random Forest, and XGBoost, were combined into two varied ensemble methods separately. These ensemble classifiers demonstrated superior performance compared to traditional machine learning classifier models. The stacking classifier reached 90.71% accuracy and the voting classifier 90.44%. The results of the experiments showed that the framework can detect six different types of cyberbullying more efficiently, with an accuracy rate of 0.9071.
Published: 2024
Full Text: View/download PDF

30. Arithmetic N-gram: an efficient data compression technique

Author: Hassan, Ali, Javed, Sadaf, Hussain, Sajjad, Ahmad, Rizwan, and Qazi, Shams
Published: 2024
Full Text: View/download PDF

31. Detection of content-based cybercrime in Roman Kashmiri using ensemble learning.

Author: Farooq, Umar, Singh, Parvinder, Khurana, Surinder Singh, and Kumar, Munish
Abstract: The official language of Kashmir, Kashmiri language or Koshur, is spoken by more than 7 million people, yet its content-based cybercrime detection remains unexplored in theoretical and experimental research. Furthermore, the absence of programming libraries for sentimental analysis and a benchmark corpus has impeded advancements in this field. Challenges persist in working with diverse scripts of Kashmiri, including Perso-Arabic, Sharada, Devanagari, and Roman. Detecting cybercrime in this language is challenging due to its complex morphological nature, lack of resources, scarcity of annotated datasets, and varied linguistic characteristics, emphasizing the importance of overcoming these obstacles to develop effective detection systems. This paper attempts to detect content-based cybercrime in Roman Kashmiri script, extensively utilized on online platforms like social media, chat rooms, emails, etc., by the Kashmiri community. A well-balanced and meaningful dataset, the first of its kind in this context, is compiled, incorporating positive and negative comments, and three strategies were employed for analysis. The findings reveal that the Tf-Idf Vectorizer outperforms other tokenization methods (Count Vectorizer and Tf-Idf Transformer), bi-gram notation exhibits superior performance compared to one and tri-gram notations, and the XGBM proves to be the most effective in terms of evaluation metrics. Leveraging these strategies, Python applications were developed for text classification, successfully distinguishing cyberbullying (unsafe) from non-cyberbullying (safe) instances, with the XGBM exhibiting exceptional accuracy using the Tf-Idf Vectorizer with bi-gram, a Bag of Words, and lexical features. 
This pioneering research underscores the urgent need for content-based cybercrime detection advancements in the Kashmiri language, paving the way for effective detection systems to address language-specific challenges and promote a safer online environment for the Kashmiri community. Furthermore, this research opens new avenues for further advancements in detecting and preventing cybercrime in Kashmiri and potentially in other languages lacking robust cybercrime detection methodologies. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

32. AGRAMP: machine learning models for predicting antimicrobial peptides against phytopathogenic bacteria.

Author: Shao, Jonathan, Yan Zhao, Wei Wei, and Vaisman, Iosif I.
Subjects: MACHINE learning, PHYTOPATHOGENIC bacteria, ANTIMICROBIAL peptides, AMINO acid residues, RANDOM forest algorithms, AMINO acid sequence
Abstract: Introduction: Antimicrobial peptides (AMPs) are promising alternatives to traditional antibiotics for combating plant pathogenic bacteria in agriculture and the environment. However, identifying potent AMPs through laborious experimental assays is resource-intensive and time-consuming. To address these limitations, this study presents a bioinformatics approach utilizing machine learning models for predicting and selecting AMPs active against plant pathogenic bacteria. Methods: N-gram representations of peptide sequences with 3-letter and 9-letter reduced amino acid alphabets were used to capture the sequence patterns and motifs that contribute to the antimicrobial activity of AMPs. A 5-fold cross-validation technique was used to train the machine learning models and to evaluate their predictive accuracy and robustness. Results: The models were applied to predict putative AMPs encoded by intergenic regions and small open reading frames (ORFs) of the citrus genome. Approximately 7% of the 10,000-peptide dataset from the intergenic region and 7% of the 685,924-peptide dataset from the whole genome were predicted as probable AMPs. The prediction accuracy of the reported models range from 0.72 to 0.91. A subset of the predicted AMPs was selected for experimental test against Spiroplasma citri, the causative agent of citrus stubborn disease. The experimental results confirm the antimicrobial activity of the selected AMPs against the target bacterium, demonstrating the predictive capability of the machine learning models. Discussion: Hydrophobic amino acid residues and positively charged amino acid residues are among the key features in predicting AMPs by the Random Forest Algorithm. Aggregation propensity appears to be correlated with the effectiveness of the AMPs. The described models would contribute to the development of effective AMP-based strategies for plant disease management in agricultural and environmental settings. 
To facilitate broader accessibility, our model is publicly available on the AGRAMP (Agricultural Ngrams Antimicrobial Peptides) server. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

33. Combining semi-supervised model and optimized LSTM for image caption generation based on pseudo labels.

Author: Padate, Roshni, Jain, Amit, Kalla, Mukesh, and Sharma, Arvind
Abstract: Image captioning is a crucial area of artificial intelligence, and it was a very difficult problem before advances in deep learning (DL). Many open challenges remain, such as robustness, generalization, and accuracy, and results are still far from satisfactory. Because image captioning schemes are data-hungry, pre-training on larger-scale datasets, even if not well curated, is becoming a solid approach. In addition to precisely identifying the scene, objects, relationships, and attributes of the items in the image, an image caption generation method should produce natural, fluent, precise, and useful sentences. However, since not all visual information may be utilized, it can be difficult to convey the image's content effectively when writing captions. Here, image captioning is done with two models: the Neural Image Caption (NIC) model and an LSTM-based model. First, the NIC process is carried out, in which CNN-based caption generation is performed for unlabelled and labelled datasets. Next, features, namely improved BOW and N-gram, are derived and used to train the CNN model. The final caption is generated by an optimized LSTM whose weights are tuned by Harris Hawks with Sinusoidal Chaotic Map Assisted Exploitation (HH-SCME). Finally, BLEU, ROUGE, and CIDEr scores are computed to demonstrate the efficiency of HH-SCME. The proposed LSTM+HH-SCME model achieves a BLEU score 1 value of 0.84, compared with other existing methods such as CNN, SSO, PRO, AOA, RNN, LSTM and LSTM+HH-SCME. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

34. An Ensemble-Based Multi-Classification Machine Learning Classifiers Approach to Detect Multiple Classes of Cyberbullying.

Author: Alqahtani, Abdulkarim Faraj and Ilyas, Mohammad
Subjects: CYBERBULLYING, MACHINE learning, SOCIAL media, SENTIMENT analysis, RANDOM forest algorithms, FEATURE extraction
Abstract: The impact of communication through social media is currently considered a significant social issue. This issue can lead to inappropriate behavior using social media, which is referred to as cyberbullying. Automated systems are capable of efficiently identifying cyberbullying and performing sentiment analysis on social media platforms. This study focuses on enhancing a system to detect six types of cyberbullying tweets. Employing multi-classification algorithms on a cyberbullying dataset, our approach achieved high accuracy, particularly with the TF-IDF (bigram) feature extraction. Our experiment achieved high performance compared with that stated for previous experiments on the same dataset. Two ensemble machine learning methods, employing the N-gram with TF-IDF feature-extraction technique, demonstrated superior performance in classification. Three popular multi-classification algorithms: Decision Trees, Random Forest, and XGBoost, were combined into two varied ensemble methods separately. These ensemble classifiers demonstrated superior performance compared to traditional machine learning classifier models. The stacking classifier reached 90.71% accuracy and the voting classifier 90.44%. The results of the experiments showed that the framework can detect six different types of cyberbullying more efficiently, with an accuracy rate of 0.9071. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

35. N-gram and Kernel Performance Using Support Vector Machine Algorithm for Fake News Detection System

Author: Deny Jollyta, Gusrianty Gusrianty, Prihandoko Prihandoko, and Darmanta Sukrianto
Subjects: fake news, kernel, n-gram, support vector machine, text classification, Electronic computers. Computer science, QA75.5-76.95
Abstract: The modern technological advancements have made it simpler for fake news to circulate online. The researchers have developed several strategies to overcome this obstacle, including text classification, distribution network analysis, and human-machine hybrid methods. The most common method is text categorization, and many researchers offer deep learning and machine learning models as remedies. An Indonesian language fake news detection system based on news headlines was developed in this work using the Support Vector Machine (SVM) kernel and n-gram. The objective of this research is to identify the model that produces the best performance outcomes. The system deployment on the web will employ the model that produces the greatest outcomes. According to the research findings, the linear kernel SVM algorithm produces the best results, with an accuracy value of 0.974. Furthermore, the bigram feature used in the development of a classification model does not increase the precision of fake news identification in Indonesian. Utilizing the unigram function yields the most accurate results.
Published: 2023
Full Text: View/download PDF

36. Using Machine Learning and Natural Language Processing for Unveiling Similarities between Microbial Data

Author: Lucija Brezočnik, Tanja Žlender, Maja Rupnik, and Vili Podgorelec
Subjects: machine learning, NLP, hierarchical clustering, microbial data, microbiome, n-gram, Mathematics, QA1-939
Abstract: Microbiota analysis can provide valuable insights in various fields, including diet and nutrition, understanding health and disease, and in environmental contexts, such as understanding the role of microorganisms in different ecosystems. Based on the results, we can provide targeted therapies, personalized medicine, or detect environmental contaminants. In our research, we examined the gut microbiota of 16 animal taxa, including humans, as well as the microbiota of cattle and pig manure, where we focused on 16S rRNA V3-V4 hypervariable regions. Analyzing these regions is common in microbiome studies but can be challenging since the results are high-dimensional. Thus, we utilized machine learning techniques and demonstrated their applicability in processing microbial sequence data. Moreover, we showed that techniques commonly employed in natural language processing can be adapted for analyzing microbial text vectors. We obtained the latter through frequency analyses and utilized the proposed hierarchical clustering method over them. All steps in this study were gathered in a proposed microbial sequence data processing pipeline. The results demonstrate that we not only found similarities between samples but also sorted groups’ samples into semantically related clusters. We also tested our method against other known algorithms like the Kmeans and Spectral Clustering algorithms using clustering evaluation metrics. The results demonstrate the superiority of the proposed method over them. Moreover, the proposed microbial sequence data pipeline can be utilized for different types of microbiota, such as oral, gut, and skin, demonstrating its reusability and robustness.
Published: 2024
Full Text: View/download PDF

37. A Technological Framework to Support Asthma Patient Adherence Using Pictograms

Author: Rosa Figueroa, Carla Taramasco, María Elena Lagos, Felipe Martínez, Carla Rimassa, Julio Godoy, Esteban Pino, Jean Navarrete, Jose Pinto, Gabriela Nazar, Cristhian Pérez, and Daniel Herrera
Subjects: natural language processing, machine learning, pictograms, tokenization, n-gram, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Background: Low comprehension and adherence to medical treatment among the elderly directly and negatively affect their health. Many elderly patients forget medical instructions immediately after their appointments, misunderstand them, or fail to recall them altogether. Some identified causes include the short time slots allocated for appointments in the public health system in Chile, the complex terminology used by healthcare professionals, and the stress experienced by patients during appointments. One approach to improving patients’ adherence to medical treatment is to combine written and oral instructions with graphical elements such as pictograms. However, several challenges arise due to the ambiguity of natural language and the need for pictograms to accurately represent various medication combinations, doses, and frequencies. Objective: This study introduces SIMAP (System for Integrating Medical Instructions with Pictograms), a technological framework aimed at enhancing adherence among asthma patients through the delivery of pictograms via a computational system. SIMAP utilizes a collaborative and user-centered methodology, involving health professionals and patients in the construction and validation of its components. Methods: The technological framework presented in this study is composed of three parts. The first two are medical indications and pictograms related to the treatment of the disease. Both components were developed through a comprehensive and iterative methodology that incorporates both qualitative and quantitative approaches. This methodology includes the utilization of focus groups, interviews, paper and online surveys, as well as expert validation, ensuring a robust and thorough development. 
The core of SIMAP is the technological component that leveraged artificial intelligence methods for natural language processing to analyze, tokenize, and associate words and their context to a set of one or more pictograms, addressing issues such as the ambiguity in the text, the cultural factor that involves many ways of expressing the same indication, and typographical errors in the indications. Results: Firstly, we successfully validated 18 clinical indications along with their respective pictograms. Some of the pictograms were redesigned based on the validation results. However, in the final validation, the comprehension percentages of the pictograms exceeded 70%. Furthermore, we developed a software called SIMAP, which translates medical indications into previously validated pictograms. Our proposed software, SIMAP, achieves a correct mapping rate of 96.69%. Conclusions: SIMAP demonstrates great potential as a technological component for supplementing medical instructions with pictograms when tested in a laboratory setting. The use of artificial intelligence for natural language processing can successfully map medical instructions, both structured and unstructured, into pictograms. This integration of textual instructions and pictograms holds promise for enhancing the comprehension and adherence of elderly patients to their medical indications, thereby improving their long-term health.
Published: 2024
Full Text: View/download PDF

38. On the Most Frequent Sequences of Words in Russian Spoken Everyday Language (Bigrams and Trigrams): An Experience of Classification

Author: Khokhlova, Maria V., Blinova, Olga V., Bogdanova-Beglarian, Natalia, Sherstinova, Tatiana, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Karpov, Alexey, editor, Samudravijaya, K., editor, Deepak, K. T., editor, Hegde, Rajesh M., editor, Agrawal, Shyam S., editor, and Prasanna, S. R. Mahadeva, editor
Published: 2023
Full Text: View/download PDF

39. N-Gram based Convolutional Neural Network Approach for Authorship Identification

Author: Alamanda, Sirisha, Pabboju, Suresh, Gugulothu, Narsimha, Powers, David M. W., Series Editor, Leibbrandt, Richard, Series Editor, Kumar, Amit, editor, Ghinea, Gheorghita, editor, and Merugu, Suresh, editor
Published: 2023
Full Text: View/download PDF

40. Analysis and Prediction of Datasets for Deep Learning: A Systematic Review

Author: Deshmukh, Vaishnavi J., Ambhaikar, Asha, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Chinara, Suchismita, editor, Tripathy, Asis Kumar, editor, Li, Kuan-Ching, editor, Sahoo, Jyoti Prakash, editor, and Mishra, Alekha Kumar, editor
Published: 2023
Full Text: View/download PDF

41. Bangla Spelling Error Detection and Correction Using N-Gram Model

Author: Bagchi, Promita, Arafin, Mursalin, Akther, Aysha, Alam, Kazi Masudul, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin, Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Satu, Md. Shahriare, editor, Moni, Mohammad Ali, editor, Kaiser, M. Shamim, editor, and Arefin, Mohammad Shamsul, editor
Published: 2023
Full Text: View/download PDF

42. Detecting Unknown Vulnerabilities in Smart Contracts with Binary Classification Model Using Machine Learning

Author: Li, Xiangbin, Xing, Xiaofei, Wang, Guojun, Li, Peiqiang, Liu, Xiangyong, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Wang, Guojun, editor, Choo, Kim-Kwang Raymond, editor, Wu, Jie, editor, and Damiani, Ernesto, editor
Published: 2023
Full Text: View/download PDF

43. A Study on the Analysis of Related Keywords on the Perception of Untact Coding Education in the Post-COVID Era Using Big Data Analysis

Author: Yoon, Soo-Yeon, Kim, Jong-Bae, Kacprzyk, Janusz, Series Editor, and Lee, Roger, editor
Published: 2023
Full Text: View/download PDF

44. N-Gram Based Amharic Grammar Checker

Author: Sharma, Deepak, Mattu, Gurjeet Singh, Sharma, Sukhdeep, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, and Arai, Kohei, editor
Published: 2023
Full Text: View/download PDF

45. A Quantum Algorithm to Locate Unknown Hashgrams

Author: Allgood, Nicholas R., Nicholas, Charles K., Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, and Arai, Kohei, editor
Published: 2023
Full Text: View/download PDF

46. The role of surprisal in issue trackers

Author: Caddy, James, Treude, Christoph, Wagner, Markus, and Barr, Earl T.
Published: 2025
Full Text: View/download PDF

47. AGRAMP: machine learning models for predicting antimicrobial peptides against phytopathogenic bacteria

Author: Jonathan Shao, Yan Zhao, Wei Wei, and Iosif I. Vaisman
Subjects: antimicrobial peptide, AGRAMP, Spiroplasma, N-gram, random forest, AMP, Microbiology, QR1-502
Abstract: Introduction: Antimicrobial peptides (AMPs) are promising alternatives to traditional antibiotics for combating plant pathogenic bacteria in agriculture and the environment. However, identifying potent AMPs through laborious experimental assays is resource-intensive and time-consuming. To address these limitations, this study presents a bioinformatics approach utilizing machine learning models for predicting and selecting AMPs active against plant pathogenic bacteria. Methods: N-gram representations of peptide sequences with 3-letter and 9-letter reduced amino acid alphabets were used to capture the sequence patterns and motifs that contribute to the antimicrobial activity of AMPs. A 5-fold cross-validation technique was used to train the machine learning models and to evaluate their predictive accuracy and robustness. Results: The models were applied to predict putative AMPs encoded by intergenic regions and small open reading frames (ORFs) of the citrus genome. Approximately 7% of the 10,000-peptide dataset from the intergenic region and 7% of the 685,924-peptide dataset from the whole genome were predicted as probable AMPs. The prediction accuracy of the reported models ranges from 0.72 to 0.91. A subset of the predicted AMPs was selected for experimental testing against Spiroplasma citri, the causative agent of citrus stubborn disease. The experimental results confirm the antimicrobial activity of the selected AMPs against the target bacterium, demonstrating the predictive capability of the machine learning models. Discussion: Hydrophobic amino acid residues and positively charged amino acid residues are among the key features in predicting AMPs by the Random Forest Algorithm. Aggregation propensity appears to be correlated with the effectiveness of the AMPs. The described models would contribute to the development of effective AMP-based strategies for plant disease management in agricultural and environmental settings.
To facilitate broader accessibility, our model is publicly available on the AGRAMP (Agricultural Ngrams Antimicrobial Peptides) server.
Published: 2024
Full Text: View/download PDF

48. Exploring the Subjectivity of English Academic Discourse in the Context of Big Data

Author: Pan Ying
Subjects: n-gram, bi-lstm, sentiment lexicon, word vectors, sentiment levelness, 93c62, Mathematics, QA1-939
Abstract: This study develops a sentiment analysis model for English academic discourse based on word information to effectively understand and analyze the sentiment tendencies in English literary texts. The structure of the model includes a word embedding layer, character-level feature extraction, word-level feature extraction, and a feature fusion and classification layer. The word embedding layer maps words to word vectors using word vectors pre-trained on microblog text. The character-level feature extraction stage uses a multi-window convolutional layer to capture N-gram information. In contrast, the word-level feature extraction obtains deeper semantic information through a Bi-LSTM layer and fuses it with character-level information to enhance robustness. The feature fusion and classification layer further combines these features and determines the fusion weights through a linear layer to achieve sentiment classification. In performance tests, the model achieves 92.5% sentiment classification accuracy on the standard dataset, an improvement of about 6% compared to traditional methods. In particular, the accuracy is improved by 5% when dealing with text with sentiment polarity transition, showing good adaptability. In addition, using 657 positive and 679 negative sentiment words as seed words effectively expands the sentiment lexicon and enhances the comprehensiveness and accuracy of sentiment analysis.
Published: 2024
Full Text: View/download PDF

49. A FEATURE EXTRACTION BASED IMPROVED SENTIMENT ANALYSIS ON APACHE SPARK FOR REAL-TIME TWITTER DATA.

Author: KANUNGO, PIYUSH and SINGH, HARI
Subjects: SENTIMENT analysis, FEATURE extraction, CLASSIFICATION algorithms, SUPPORT vector machines, RANDOM forest algorithms
Abstract: This paper aims to improve the accuracy of sentiment analysis on Apache Spark for real-time general Twitter data. Many works exist on sentiment analysis of offline or stored Twitter data that use several classification algorithms on relevant features extracted using well-known feature extraction methodologies on pre-processed text data. However, few works exist on sentiment analysis of real-time Twitter data, especially for generic data on big data processing platforms such as Apache Spark. This paper proposes real-time sentiment analysis for generic Twitter data through Apache Spark using six classification algorithms with N-gram and Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction methodologies on the pre-processed data. An exhaustive comparison is done using Logistic Regression (LR), Multinomial Naive Bayes (MNB), Random Forest Classifier (RFC), Support Vector Machine (SVM), K-Nearest Neighbour (K-NN), and Decision Tree (DT) classification algorithms. It is observed that the trigram feature extraction method performs best on LR and SVM, and the RFC results are also comparable on the considered general tweets data. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

50. Formulaic Competence in College-Level Asian English Learner's Argumentative Writing: Examining the Effects of Language Background and Topic.

Author: Li, Hang and Yao, Yao
Subjects: ENGLISH language, ENGLISH as a foreign language, SECOND language acquisition, ESSAYS, NATIVE language, PHILOSOPHY of language
Abstract: The present study examined the effect of language background and topic on productive formulaic competence. Guided by usage-based theory of language learning, this study used a distribution-based approach to the examination of the native-likeness of formulaic usage in English timed argumentative writing. Six indices informed by a large-scale native corpus were chosen to gauge the frequency and association strength of bigrams and trigrams in a total of 778 English timed independent argumentative essays on two topics by 100 native speaker writers, 210 English as a foreign language (EFL) writers and 79 English as a second language (ESL) writers in Asia. Results of a series of linear mixed-effects regression analyses showed that, while EFL writers scored lower on most of the indices than both NS writers and ESL writers, ESL writers did not differ much from NS writers in their use of n-grams across topics. Meanwhile, the topic that was deemed more cognitively demanding and of a stronger technical nature elicited more native-like performance in bigram and trigram use across all three language groups. Findings of the study highlight the important effect of input on the development of formulaic competence in a second language, offer empirical support to the use of cognitively complex topics in second language writing practice, and carry important implications for L2 writing pedagogy. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
