Search Results (1,257 results)
2. Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature
- Author
- Jose Dixon and Md Rahman
- Subjects
- text retrieval, text classification, imbalanced sampling, feature engineering, statistical analysis, data preprocessing, Computer engineering. Computer hardware, TK7885-7895
- Abstract
The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology combines two supervised learning approaches to feature engineering and data preprocessing with five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, and statistical analysis in R with tidyverse. The dataset comprises 1000 portable document format files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles (COVID-19 papers) and the PubMed Central database (non-COVID-19 papers), for binary classification evaluated on precision, recall, area under the receiver operating characteristic curve, and accuracy. One approach, which labels rows of sentences using regular expressions, significantly improved the performance of the imbalanced sampling techniques, as verified by t-tests on the performance metrics across iterations, compared with another approach that labels sentences automatically according to how the documents are organized into positive and negative classes. The study demonstrates the effectiveness of ML classifiers and sampling techniques on text classification datasets, with different performance levels and class imbalance issues observed between the manual and automatic data processing methods.
- Published
- 2023
- Full Text
- View/download PDF
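The comparison this record describes, imbalanced sampling techniques evaluated against a baseline with t-tests over repeated iterations, can be illustrated with a minimal sketch; the synthetic data, logistic regression classifier, and F1 metric below are stand-ins, not the authors' setup.

```python
# Illustrative sketch: compare a SMOTE pipeline against a plain classifier
# with a t-test over repeated train/test iterations. Hypothetical synthetic
# data stands in for the paper's PDF-derived corpus.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

def run(seed, use_smote):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    if use_smote:
        X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))

baseline = [run(s, False) for s in range(30)]
resampled = [run(s, True) for s in range(30)]
t, p = ttest_ind(resampled, baseline)  # significance of the improvement
print(f"mean F1 {np.mean(baseline):.3f} -> {np.mean(resampled):.3f}, p={p:.4f}")
```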
3. Malware Detection with Neural Network Using Combined Features
- Author
- Zhou, Huan, Barbosa, Simone Diniz Junqueira, Editorial Board Member, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Kotenko, Igor, Editorial Board Member, Yuan, Junsong, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Yun, Xiaochun, editor, Wen, Weiping, editor, Lang, Bo, editor, Yan, Hanbing, editor, Ding, Li, editor, Li, Jia, editor, and Zhou, Yu, editor
- Published
- 2019
- Full Text
- View/download PDF
4. Enabling Self-Diagnosis of Automation Devices through Industrial Analytics
- Author
- Gatica, Carlos Paiz, Boschmann, Alexander, inIT - Institut für industrielle Informa, Beyerer, Jürgen, editor, Kühnert, Christian, editor, and Niggemann, Oliver, editor
- Published
- 2019
- Full Text
- View/download PDF
5. Prediction of surface roughness using deep learning and data augmentation
- Author
- Guo, Miaoxian, Wei, Shouheng, Han, Chentong, Xia, Wanliang, Luo, Chao, and Lin, Zhijian
- Published
- 2024
- Full Text
- View/download PDF
7. Scientific papers citation analysis using textual features and SMOTE resampling techniques
- Author
- Muhammad Umer, Malik Muhammad Saad Missen, Saima Sadiq, Zahid Aslam, Muhammad Abubakar Siddique, Michele Nappi, and Zahid Hameed
- Subjects
- Feature engineering, Computer science, business.industry, Citation sentiment analysis, Sentiment analysis, TF-IDF, computer.software_genre, Artificial Intelligence, Citation analysis, Signal Processing, Classifier (linguistics), Pattern recognition (psychology), Machine learning, Feature (machine learning), Computer Vision and Pattern Recognition, Artificial intelligence, Citation, business, tf–idf, computer, Software, Natural language processing, SMOTE
- Abstract
Ascertaining the impact of research is significant for the research community and academia of all disciplines. The only prevalent measure associated with the quantification of research quality is the citation count. Although the number of citations plays a significant role in academic research, citations can sometimes be biased or made only to discuss the weaknesses and shortcomings of the research. Considering the sentiment of citations and recognizing patterns in text can aid in understanding the opinion of the peer research community and help quantify the quality of research articles. Efficient feature representation combined with machine learning classifiers has yielded significant improvement in text classification. However, the effectiveness of such combinations has not been analyzed for citation sentiment analysis. This study investigates pattern recognition using machine learning models in combination with frequency-based and prediction-based feature representation techniques, with and without the Synthetic Minority Oversampling Technique (SMOTE), on a publicly available citation sentiment dataset. The sentiment of citation instances is classified as positive, negative, or neutral. Results indicate that the Extra Trees classifier in combination with Term Frequency-Inverse Document Frequency achieved 98.26% accuracy on the SMOTE-balanced dataset.
- Published
- 2021
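A minimal sketch of the pipeline shape this abstract reports, TF-IDF features, SMOTE balancing, and an Extra Trees classifier, is given below; the toy citation sentences and class ratios are invented placeholders.

```python
# Illustrative sketch: TF-IDF features, SMOTE balancing, Extra Trees classifier.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented citation snippets: positive and negative classes are minorities.
pos = ["this method clearly outperforms prior work"] * 10
neg = ["the results could not be reproduced"] * 10
neu = ["we follow the experimental setup of [12]"] * 80
citations = pos + neg + neu
labels = [2] * 10 + [0] * 10 + [1] * 80  # positive / negative / neutral

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(citations)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, labels)  # balance classes
clf = ExtraTreesClassifier(n_estimators=300, random_state=0).fit(X_bal, y_bal)
print(clf.predict(X[:3]))
```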
8. Feature engineering of EEG applied to mental disorders: a systematic mapping study.
- Author
- García-Ponsoda, Sandra, García-Carrasco, Jorge, Teruel, Miguel A., Maté, Alejandro, and Trujillo, Juan
- Subjects
- MENTAL illness, MACHINE learning, ELECTROENCEPHALOGRAPHY, ARTIFICIAL intelligence, ENGINEERING
- Abstract
Around a third of the total population of Europe suffers from mental disorders. The use of electroencephalography (EEG) together with Machine Learning (ML) algorithms to diagnose mental disorders has recently been shown to be a prominent research area, as evidenced by several reviews of the field. Nevertheless, prior to the application of ML algorithms, EEG data should be correctly preprocessed and prepared via Feature Engineering (FE). In fact, the choice of FE techniques can make the difference between an unusable ML model and a simple, effective model. In other words, FE is crucial, especially when using complex, non-stationary data such as EEG. To this aim, in this paper we present a Systematic Mapping Study (SMS) focused on FE from EEG data used to identify mental disorders. Our SMS covers more than 900 papers, making it one of the most comprehensive to date, to the best of our knowledge. We gathered the mental disorder addressed, all the FE techniques used, and the Artificial Intelligence (AI) algorithm applied for classification from each paper. Our main contributions are: (i) we offer a starting point for new researchers on these topics, (ii) we extract the most used FE techniques to classify mental disorders, (iii) we show several graphical distributions of all used techniques, and (iv) we provide critical conclusions for detecting mental disorders. To provide a better overview of existing techniques, the FE process is divided into three parts: (i) signal transformation, (ii) feature extraction, and (iii) feature selection. Moreover, we classify and analyze the distribution of existing papers according to the mental disorder they treat, the FE processes used, and the ML techniques applied. As a result, we provide a valuable reference for the scientific community to identify which techniques have been proven and tested and where the gaps are located in the current state of the art. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
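One of the most common feature-extraction steps catalogued by such mapping studies is band-power computation; the sketch below shows this on a synthetic single-channel signal with Welch's method, with the sampling rate and band edges assumed.

```python
# Illustrative sketch of EEG band-power feature extraction via Welch's method.
import numpy as np
from scipy.signal import welch

fs = 256  # assumed sampling rate, Hz
rng = np.random.default_rng(0)
eeg = rng.standard_normal(fs * 10)  # 10 s of stand-in single-channel EEG

bands = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)  # power spectral density

# Integrate the PSD over each band to get one scalar feature per band.
features = {name: np.trapz(psd[(freqs >= lo) & (freqs < hi)],
                           freqs[(freqs >= lo) & (freqs < hi)])
            for name, (lo, hi) in bands.items()}
print(features)  # band-power values, ready to feed an ML classifier
```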
9. Heart disease prediction using ML through enhanced feature engineering with association and correlation analysis.
- Author
- Lakshmanarao, Annemneedi, Krishna, Thotakura Venkata Sai, Kiran, Tummala Srinivasa Ravi, Krishna, Chinta Venkata Murali, Ushanag, Samsani, and Supriya, Nandikolla
- Subjects
- HEART diseases, STATISTICAL correlation, MACHINE learning, SUPPORT vector machines, K-nearest neighbor classification, CLASSIFICATION algorithms
- Abstract
Heart disease remains a prevalent and critical health concern globally. This paper addresses the critical task of heart disease prediction through the utilization of advanced machine learning techniques. Our approach focuses on the enhancement of feature engineering by incorporating a novel integration of association and correlation analyses. A heart disease dataset from Kaggle was used for the experiments. Association analysis was applied to the categorical and binary features in the dataset. Correlation analysis was applied to the numerical features in the dataset. Based on the insights from association analysis and correlation analysis, a new dataset was created with combinations of features. Later, newly created features are integrated with the original dataset, and classification algorithms are applied. Five machine learning (ML) classifiers, namely decision tree, k-nearest neighbors (KNN), random forest, XG-Boost, and support vector machine (SVM), were applied to the final dataset and achieved a good accuracy rate for heart disease detection. By systematically exploring associations and relationships with categorical, binary, and numerical features, this paper unveils innovative insights that contribute to a more comprehensive understanding of the heart disease dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
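The correlation-guided feature construction described here can be illustrated with a small sketch; the column names and the product/ratio combinations are hypothetical, since the abstract does not specify the exact combinations used.

```python
# Illustrative sketch: build new features from the columns most correlated
# with the label, then append them to the original dataset.
import pandas as pd

df = pd.DataFrame({
    "age": [52, 61, 45, 70, 58],
    "chol": [230, 280, 190, 310, 250],
    "max_hr": [160, 140, 175, 120, 150],
    "target": [0, 1, 0, 1, 1],
})

corr = df.drop(columns="target").corrwith(df["target"]).abs()
top = corr.sort_values(ascending=False).index[:2]  # two strongest features

# Combine them into new candidate features before classification.
df[f"{top[0]}_x_{top[1]}"] = df[top[0]] * df[top[1]]
df[f"{top[0]}_ratio_{top[1]}"] = df[top[0]] / df[top[1]]
print(df.head())
```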
10. Predicting Employee Absence from Historical Absence Profiles with Machine Learning.
- Author
- Zupančič, Peter and Panov, Panče
- Subjects
- PERSONNEL management, JOB absenteeism, TECHNOLOGICAL innovations, SICK leave, MACHINE learning
- Abstract
In today's dynamic business world, organizations are increasingly relying on innovative technologies to improve the efficiency and effectiveness of their human resource (HR) management. Our study uses historical time and attendance data collected with the MojeUre time and attendance system to predict employee absenteeism, including sick and vacation leave, using machine learning methods. We integrate employee demographic data and the absence profiles on timesheets showing daily attendance patterns as fundamental elements for our analysis. We also convert the absence data into a feature-based format suitable for the machine learning methods used. Our primary goal in this paper is to evaluate how well we can predict sick leave and vacation leave over short- and long-term intervals using tree-based machine learning methods based on the predictive clustering paradigm. This paper compares the effectiveness of these methods in different learning settings and discusses their impact on improving HR decision-making processes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
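A minimal sketch of converting absence profiles into a feature-based format suitable for tree learners, assuming a hypothetical timesheet layout, is shown below.

```python
# Illustrative sketch: aggregate daily absence profiles into fixed-length
# per-employee features. The timesheet columns are hypothetical.
import pandas as pd

timesheet = pd.DataFrame({
    "employee": ["a"] * 6 + ["b"] * 6,
    "day": list(range(6)) * 2,
    "absent": [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})

features = timesheet.groupby("employee")["absent"].agg(
    total_absences="sum",
    absence_rate="mean",
    recent=lambda s: s.tail(5).sum(),  # count in the most recent window
)
print(features)  # one row per employee, ready for a tree-based model
```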
11. FN-GNN: A Novel Graph Embedding Approach for Enhancing Graph Neural Networks in Network Intrusion Detection Systems.
- Author
- Tran, Dinh-Hau and Park, Minho
- Subjects
- ARTIFICIAL neural networks, GRAPH neural networks, RECURRENT neural networks, CONVOLUTIONAL neural networks, DEEP learning, INTRUSION detection systems (Computer security)
- Abstract
With the proliferation of the Internet, network complexities for both commercial and state organizations have significantly increased, leading to more sophisticated and harder-to-detect network attacks. This evolution poses substantial challenges for intrusion detection systems, threatening the cybersecurity of organizations and national infrastructure alike. Although numerous deep learning techniques such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and graph neural networks (GNNs) have been applied to detect various network attacks, they face limitations due to the lack of standardized input data, affecting model accuracy and performance. This paper proposes a novel preprocessing method for flow data from network intrusion detection systems (NIDSs), enhancing the efficacy of a graph neural network model in malicious flow detection. Our approach initializes graph nodes with data derived from flow features and constructs graph edges through the analysis of IP relationships within the system. Additionally, we propose a new graph model based on the combination of the graph convolutional network (GCN) model and SAGEConv, a variant of the GraphSAGE model. The proposed model leverages the strengths of both while addressing the limitations encountered by the previous models. Evaluations on two IDS datasets, CICIDS-2017 and UNSW-NB15, demonstrate that our model outperforms existing methods, offering a significant advancement in the detection of network threats. This work not only addresses a critical gap in the standardization of input data for deep learning models in cybersecurity but also proposes a scalable solution for improving intrusion detection accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
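The GCN-plus-SAGEConv combination named in the abstract can be sketched in a few lines of PyTorch Geometric; the layer sizes, depth, and wiring below are assumptions, not the authors' exact architecture.

```python
# Illustrative sketch of a two-layer graph model mixing GCN and GraphSAGE
# convolutions, echoing the abstract's GCN + SAGEConv combination.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, SAGEConv

class FlowGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, n_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))  # spectral-style smoothing
        return self.conv2(x, edge_index)       # GraphSAGE-style aggregation

# Nodes carry flow-derived features; edges encode assumed IP relationships.
x = torch.randn(4, 16)                         # 4 flows, 16 features each
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
logits = FlowGNN(16, 32, 2)(x, edge_index)     # benign vs malicious
print(logits.shape)
```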
12. An online monitoring method of milling cutter wear condition driven by digital twin.
- Author
- Zi, Xintian, Gao, Shangshang, and Xie, Yang
- Subjects
- DIGITAL twins, TRAFFIC safety, INFORMATION storage & retrieval systems, MILLING cutters, MANUFACTURING processes, LARGE deviations (Mathematics), FORECASTING
- Abstract
Real-time online tracking of tool wear is an indispensable element in automated machining, and tool wear directly impacts the processing quality of workpieces and overall productivity. Because the milling tool wear state is difficult to monitor visually in real time, and because individual tool wear prediction models suffer from large, unstable deviations, a novel digital twin-driven ensemble learning method for online monitoring of milling tool wear is proposed in this paper. Firstly, a digital twin-based milling tool wear monitoring system is built and the system model structure is clarified. Secondly, the digital twin (DT) data multi-level processing system optimizes the signal characteristic data and, combined with the ensemble learning model, predicts the milling cutter wear status and wear values in real time; the two are verified against each other to enhance the prediction accuracy of the system. Finally, taking a milling wear experiment as an application case, the results show that the predictive precision of the monitoring method is more than 96% and the prediction time is below 0.1 s, which verifies the effectiveness of the presented method and provides a novel idea and approach for real-time online tracking of milling cutter wear in intelligent manufacturing processes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
13. A Novel Artificial Intelligence Prediction Process of Concrete Dam Deformation Based on a Stacking Model Fusion Method.
- Author
- Wu, Wenyuan, Su, Huaizhi, Feng, Yanming, Zhang, Shuai, Zheng, Sen, Cao, Wenhan, and Liu, Hongchen
- Subjects
- CONCRETE dams, ARTIFICIAL intelligence, HYDRAULIC structures, DEFORMATIONS (Mechanics), MULTIPLE intelligences, COMPOSITE columns, MACHINE learning
- Abstract
Deformation effectively represents the structural integrity of concrete dams and acts as a clear indicator of their operational performance. Predicting deformation is critical for monitoring the safety of hydraulic structures. To this end, this paper proposes an artificial intelligence-based process for predicting concrete dam deformation. Initially, using the principles of feature engineering, the preprocessing of deformation safety monitoring data is conducted. Subsequently, employing a stacking model fusion method, a novel prediction process embedded with multiple artificial intelligence algorithms is developed. Moreover, three new performance indicators—a superiority evaluation indicator, an accuracy evaluation indicator, and a generalization evaluation indicator—are introduced to provide a comprehensive assessment of the model's effectiveness. Finally, an engineering example demonstrates that the ensemble artificial intelligence method proposed herein outperforms traditional statistical models and single machine learning models in both fitting and predictive accuracy, thereby providing a scientific and effective foundation for concrete dam deformation prediction and safety monitoring. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. House Prices Prediction Using Statistics with Machine Learning.
- Author
- Alqubati, Loai Nagib and Loai Nagib, Kiran Kumari Patil
- Subjects
- HOME prices, BEDROOMS, MACHINE learning, HOUSING, STANDARD deviations, FEATURE selection, PRICE indexes
- Abstract
After the housing crisis in 2009 that affected the global economy and the bubble that burst, researchers began to focus on how to estimate house prices. In the United States, for instance, the hedonic price index (HPI) method was adopted for estimating house prices. The Ames house pricing dataset, which contains house data from 2006 to 2010 along with detailed features, supports the study of house price estimation. In this paper, we suggest that house prices are determined by many features such as area, utilities, house style, location, age, grade living area, number of bedrooms, garage, and so on. Statistical methods were applied with two models, multiple and stepwise linear regression, along with two machine learning algorithms, LASSO and XGBoost regression. The accuracy of prediction was evaluated by the root mean square error (RMSE). XGBoost with 25 features (R2 of 0.973 and RMSE of 0.027) is the best model. LASSO helped in feature selection for XGBoost. [ABSTRACT FROM AUTHOR]
- Published
- 2023
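The LASSO-for-selection, XGBoost-for-prediction pairing reported here can be sketched as follows; synthetic regression data stands in for the Ames dataset, and the hyperparameters are assumptions.

```python
# Illustrative sketch: LASSO prunes features, XGBoost fits the survivors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=60, noise=10, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)
keep = np.flatnonzero(lasso.coef_)  # features LASSO did not zero out
print(f"LASSO kept {keep.size} of {X.shape[1]} features")

model = XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=0)
model.fit(X[:, keep], y)
print("train R^2:", model.score(X[:, keep], y))
```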
15. A Systematic Review of Time Series Classification Techniques Used in Biomedical Applications.
- Author
- Wang, Will Ke, Chen, Ina, Hershkovich, Leeor, Yang, Jiamu, Shetty, Ayush, Singh, Geetika, Jiang, Yihang, Kotla, Aditya, Shang, Jason Zisheng, Yerrabelli, Rushil, Roghanizad, Ali R., Shandhi, Md Mobashir Hasan, and Dunn, Jessilyn
- Subjects
- TIME series analysis, SMARTWATCHES, DEEP learning, SCIENCE databases, CLASSIFICATION, ELECTRONIC data processing, ELECTRONICS engineers, TECHNOLOGY assessment
- Abstract
Background: Digital clinical measures collected via various digital sensing technologies such as smartphones, smartwatches, wearables, and ingestible and implantable sensors are increasingly used by individuals and clinicians to capture the health outcomes or behavioral and physiological characteristics of individuals. Time series classification (TSC) is very commonly used for modeling digital clinical measures. While deep learning models for TSC are very common and powerful, there exist some fundamental challenges. This review presents the non-deep learning models commonly used for time series classification in biomedical applications that can achieve high performance. Objective: We performed a systematic review to characterize the techniques that are used in time series classification of digital clinical measures throughout all the stages of data processing and model building. Methods: We conducted a literature search on PubMed, as well as the Institute of Electrical and Electronics Engineers (IEEE), Web of Science, and SCOPUS databases using a range of search terms to retrieve peer-reviewed articles that report on the academic research about digital clinical measures from a five-year period between June 2016 and June 2021. We identified and categorized the research studies based on the types of classification algorithms and sensor input types. Results: We found 452 papers in total from four different databases: PubMed, IEEE, Web of Science Database, and SCOPUS. After removing duplicates and irrelevant papers, 135 articles remained for detailed review and data extraction. Among these, engineered features using time series methods that were subsequently fed into widely used machine learning classifiers were the most commonly used technique, and also most frequently achieved the best performance metrics (77 out of 135 articles). Statistical modeling (24 out of 135 articles) algorithms were the second most common and also the second-best classification technique. Conclusions: In this review paper, the time series classification models and interpretation methods for biomedical applications are summarized and categorized. While high time series classification performance has been achieved in digital clinical, physiological, or biomedical measures, no standard benchmark datasets, modeling methods, or reporting methodology exist. There is no single widely used method for time series model development or feature interpretation; however, many different methods have proven successful. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
16. Fundamental Components and Principles of Supervised Machine Learning Workflows with Numerical and Categorical Data.
- Author
- Kampezidou, Styliani I., Tikayat Ray, Archana, Bhat, Anirudh Prabhakara, Pinon Fischer, Olivia J., and Mavris, Dimitri N.
- Subjects
- SUPERVISED learning, WORKFLOW, DATA augmentation, ENGINEERING models, AUTOMATION, MACHINE learning, RESEARCH personnel
- Abstract
This paper offers a comprehensive examination of the process involved in developing and automating supervised end-to-end machine learning workflows for forecasting and classification purposes. It offers a complete overview of the components (i.e., feature engineering and model selection), principles (i.e., bias–variance decomposition, model complexity, overfitting, model sensitivity to feature assumptions and scaling, and output interpretability), models (i.e., neural networks and regression models), methods (i.e., cross-validation and data augmentation), metrics (i.e., Mean Squared Error and F1-score) and tools that rule most supervised learning applications with numerical and categorical data, as well as their integration, automation, and deployment. The end goal and contribution of this paper is the education and guidance of the non-AI expert academic community regarding complete and rigorous machine learning workflows and data science practices, from problem scoping to design and state-of-the-art automation tools, including basic principles and reasoning in the choice of methods. The paper delves into the critical stages of supervised machine learning workflow development, many of which are often omitted by researchers, and covers foundational concepts essential for understanding and optimizing a functional machine learning workflow, thereby offering a holistic view of task-specific application development for applied researchers who are non-AI experts. This paper may be of significant value to academic researchers developing and prototyping machine learning workflows for their own research or as customer-tailored solutions for government and industry partners. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Static analysis framework for permission-based dataset generation and android malware detection using machine learning.
- Author
- Pathak, Amarjyoti, Kumar, Th. Shanta, and Barman, Utpal
- Abstract
Since Android is the most popular mobile operating system worldwide, malicious attackers seek out Android smartphones as targets. Android malware can be identified through a number of established detection techniques. However, the issues presented by modern malware cannot be met by traditional signature- or heuristic-based malware detection methods. Previous research suggests that machine-learning classifiers can be utilised to analyse permissions, making it possible to differentiate between malicious and benign applications on the Android platform. There exist machine-learning methods that utilise permission-based attributes to build models for the detection of malware on Android devices. Nevertheless, the performance of these detection methods is dependent on the raw or feature datasets. Android malware research frequently faces a major obstacle due to the lack of adequate and up-to-date raw malware datasets. In this paper, we put forward a systematic approach to generate an Android permission-based dataset using static analysis. To create the dataset, we collect recent raw malware samples (APK files) and focus on the reverse engineering approach and permission-based feature extraction. We also conduct a thorough feature analysis to determine the important Android permissions and present a machine-learning-based Android malware detection mechanism. The experimental result of our study demonstrates that with just 48 features, the random forest classifier-based Android malware detection model obtains the best accuracy of 97.5%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
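A minimal sketch of the downstream step, a random forest over a binary permission matrix, is shown below; the extract_permissions helper and the permission names are hypothetical placeholders for the static-analysis stage.

```python
# Illustrative sketch: random forest over binary permission features.
# extract_permissions() is a hypothetical stand-in for the static-analysis
# step that parses an APK's manifest; here it returns canned permission sets.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

def extract_permissions(apk_id):
    # Placeholder: real extraction would reverse-engineer the APK manifest.
    fake_db = {
        "benign.apk": ["INTERNET"],
        "mal1.apk": ["INTERNET", "SEND_SMS", "READ_CONTACTS"],
        "mal2.apk": ["SEND_SMS", "RECEIVE_BOOT_COMPLETED"],
    }
    return {perm: 1 for perm in fake_db[apk_id]}

apks = ["benign.apk", "mal1.apk", "mal2.apk"] * 10
labels = [0, 1, 1] * 10  # 0 = benign, 1 = malware

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([extract_permissions(a) for a in apks])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(dict(zip(vec.get_feature_names_out(), clf.feature_importances_.round(3))))
```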
18. Feature-based detection of breast cancer using convolutional neural network and feature engineering.
- Author
- Essa, Hiba Allah, Ismaiel, Ebrahim, and Hinnawi, Mhd Firas Al
- Abstract
Breast cancer (BC) is a prominent cause of female mortality on a global scale. Recently, there has been growing interest in utilizing blood- and tissue-based biomarkers to detect and diagnose BC, as this offers a non-invasive approach. To improve the classification and prediction of BC using large biomarker datasets, several machine-learning techniques have been proposed. In this paper, we present a multi-stage approach that consists of computing new features and then sorting them into an input image for the ResNet50 neural network. The method involves transforming the original values into normalized values based on their membership in the Gaussian distributions of the healthy and BC samples of each feature. To test the effectiveness of our proposed approach, we employed the Coimbra and Wisconsin datasets. The results demonstrate efficient performance improvement, with an accuracy of 100% on both the Coimbra and Wisconsin datasets. Furthermore, comparison with the existing literature validates the reliability and effectiveness of our methodology, as the normalized values reduce the misclassified samples of ML techniques owing to their generality. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
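The Gaussian-membership normalization the abstract describes might be read as follows; this sketch is an assumed interpretation with made-up biomarker distributions, not the authors' exact transform.

```python
# Illustrative sketch: re-express a raw biomarker value by its likelihood
# under per-class Gaussian fits (healthy vs BC), normalized to [0, 1].
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
healthy = rng.normal(5.0, 1.0, 200)  # one biomarker, healthy samples
cancer = rng.normal(7.0, 1.5, 200)   # same biomarker, BC samples

def membership_features(x):
    """Map raw values to normalized membership in the healthy distribution."""
    p_h = norm.pdf(x, healthy.mean(), healthy.std())
    p_c = norm.pdf(x, cancer.mean(), cancer.std())
    return p_h / (p_h + p_c)  # close to 1 -> healthy-like, close to 0 -> BC-like

print(membership_features(np.array([4.0, 6.0, 8.0])))
```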
19. Radio Signal Modulation Recognition Method Based on Hybrid Feature and Ensemble Learning: For Radar and Jamming Signals.
- Author
- Zhou, Yu, Cao, Ronggang, Zhang, Anqi, and Li, Ping
- Subjects
- MACHINE learning, ELECTRONIC modulation, RANDOM forest algorithms, SIGNAL classification, FRACTAL dimensions, RADAR interference
- Abstract
The detection performance of radar is significantly impaired by active jamming and mutual interference from other radars. This paper proposes a radio signal modulation recognition method to accurately recognize these signals, which supports jamming cancellation decisions. Based on the ensemble learning stacking algorithm improved by meta-feature enhancement, the proposed method adopts random forests, K-nearest neighbors, and Gaussian naive Bayes as the base-learners, with logistic regression serving as the meta-learner. It takes the multi-domain features of signals as input: time-domain features, including fuzzy entropy, slope entropy, and Hjorth parameters; frequency-domain features, including spectral entropy; and fractal-domain features, including fractal dimension. A simulation experiment covering seven common radar and active jamming signal types was performed for validation and performance evaluation. The results showed that the proposed method outperforms other classification methods and meets the requirements of low signal-to-noise-ratio and few-shot learning scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
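The base-learner/meta-learner layout named above maps directly onto scikit-learn's stacking API; the sketch below uses simulated features and default hyperparameters as assumptions.

```python
# Illustrative sketch: stacking with random forest, k-NN, and Gaussian naive
# Bayes base-learners and a logistic regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Simulated multi-domain signal features; 7 classes echo the experiment.
X, y = make_classification(n_samples=700, n_classes=7, n_informative=10,
                           random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier()),
                ("gnb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions feed the meta-learner
)
print(stack.fit(X, y).score(X, y))
```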
20. Extreme Rare Events Identification Through Jaynes Inferential Approach
- Author
- Yochai Cohen, Eden Shalom Erez, and Yair Neuman
- Subjects
- Feature engineering, inference, Information Systems and Management, extreme rare events, Computer science, business.industry, Industrial production, Inference, macromolecular substances, Original Articles, Machine learning, computer.software_genre, Jaynes, Computer Science Applications, feature engineering, Rare events, Industry, Identification (biology), Artificial intelligence, business, computer, pulp-and-paper, Information Systems
- Abstract
The identification of extreme rare events is a challenge that appears in several real-world contexts, from screening for solo perpetrators to the prediction of failures in industrial production. In this article, we explain the challenge and present a new methodology for addressing it, a methodology that may be considered in terms of feature engineering. This methodology, which is based on Jaynes' inferential approach, is tested on a dataset dealing with failures in production in the pulp-and-paper industry. The results are discussed in the context of the benefits of using the approach for feature engineering in practical contexts involving measurable risks.
- Published
- 2021
21. The Language of Deception: Applying Findings on Opinion Spam to Legal and Forensic Discourses.
- Author
- Jakupov, Alibek, Longhi, Julien, and Zeddini, Besma
- Subjects
- DECEPTION, LEGAL professions, LEGAL opinions, INTELLECTUAL property theft, DIGITAL forensics, LEGAL discourse, FORENSIC psychiatry
- Abstract
Digital forensic investigations are becoming increasingly crucial in criminal investigations and civil litigations, especially in cases of corporate espionage and intellectual property theft as more communication occurs online via e-mail and social media. Deceptive opinion spam analysis is an emerging field of research that aims to detect and identify fraudulent reviews, comments, and other forms of deceptive online content. In this paper, we explore how the findings from this field may be relevant to forensic investigation, particularly the features that capture stylistic patterns and sentiments, which are psychologically relevant aspects of truthful and deceptive language. To assess these features' utility, we demonstrate the potential of our proposed approach using the real-world dataset from the Enron Email Corpus. Our findings suggest that deceptive opinion spam analysis may be a valuable tool for forensic investigators and legal professionals looking to identify and analyze deceptive behavior in online communication. By incorporating these techniques into their investigative and legal strategies, professionals can improve the accuracy and reliability of their findings, leading to more effective and just outcomes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. Novel approach for quantitative and qualitative authors research profiling using feature fusion and tree-based learning approach.
- Author
- Umer, Muhammad, Aljrees, Turki, Ullah, Saleem, and Bashir, Ali Kashif
- Subjects
- MACHINE learning, QUALITATIVE research, RESEARCH personnel, TEXT mining, RANDOM forest algorithms
- Abstract
Article citation creates a link between the cited and citing articles and is used as a basis for several parameters like author and journal impact factor, H-index, i10 index, etc., for scientific achievements. Citations also include self-citation, which refers to article citation by the author himself. Self-citation is important to evaluate an author's research profile and has gained popularity recently. Although different criteria are found in the literature regarding appropriate self-citation, self-citation does have a huge impact on a researcher's scientific profile. This study carries out two cases in this regard. In case 1, the qualitative aspect of the author's profile is analyzed using hand-crafted feature engineering techniques. The sentiments conveyed through citations are integral in assessing research quality, as they can signify appreciation, critique, or serve as a foundation for further research. Analyzing sentiments within in-text citations remains a formidable challenge, even with the utilization of automated sentiment annotations. For this purpose, this study employs machine learning models using term frequency (TF) and term frequency-inverse document frequency (TF-IDF). Random forest using TF with the Synthetic Minority Oversampling Technique (SMOTE) achieved an accuracy score of 0.9727. Case 2 deals with quantitative analysis and investigates direct and indirect self-citation. In this study, the top 2% of researchers in 2020 are considered as a baseline. For this purpose, the data of the top 25 Pakistani researchers are manually retrieved from this dataset, in addition to the citation information from the Web of Science (WoS). Self-citation is estimated using the proposed model, and the results are compared with those obtained from WoS. Experimental results show a substantial difference between the two, as the ratio of self-citation from the proposed approach is higher than that from WoS. It is observed that the citations from the WoS for authors are overstated. For a comprehensive evaluation of a researcher's profile, both direct and indirect self-citation must be included. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
23. Data-Driven Modeling of Appliance Energy Usage.
- Author
- Assadian, Cameron Francis and Assadian, Francis
- Subjects
- REGRESSION trees, ENERGY consumption, STANDARD deviations, MACHINE learning, RANDOM forest algorithms, WIND speed, HOME improvement centers
- Abstract
Due to the transition toward the Internet of Everything (IOE), the prediction of energy consumed by household appliances has become a progressively more difficult topic to model. Even with advancements in data analytics and machine learning, several challenges remain to be addressed. Therefore, providing highly accurate and optimized models has become the primary research goal of many studies. This paper analyzes appliance energy consumption through a variety of machine learning-based strategies. Utilizing data recorded from a single-family home, input variables comprised internal temperatures and humidities, lighting consumption, and outdoor conditions including wind speed, visibility, and pressure. Various models were trained and evaluated: (a) multiple linear regression, (b) support vector regression, (c) random forest, (d) gradient boosting, (e) xgboost, and (f) the extra trees regressor. Both feature engineering and hyperparameter tuning methodologies were applied to not only extend existing features but also create new ones that provided improved model performance across all metrics: root mean square error (RMSE), coefficient of determination (R2), mean absolute error (MAE), and mean absolute percentage error (MAPE). The best model (extra trees) was able to explain 99% of the variance in the training set and 66% in the testing set when using all the predictors. The results were compared with those obtained using a similar methodology. The objective of performing these actions was to show a unique perspective in simulating building performance through data-driven models, identifying how to maximize predictive performance through the use of machine learning-based strategies, as well as understanding the potential benefits of utilizing different models. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
24. An Automated Machine Learning Framework for Adaptive and Optimized Hyperspectral-Based Land Cover and Land-Use Segmentation.
- Author
- Vali, Ava, Comai, Sara, and Matteucci, Matteo
- Subjects
- ENGINEERING models, DEEP learning, LAND cover, REMOTE sensing, WORKFLOW
- Abstract
Hyperspectral imaging holds significant promise in remote sensing applications, particularly for land cover and land-use classification, thanks to its ability to capture rich spectral information. However, leveraging hyperspectral data for accurate segmentation poses critical challenges, including the curse of dimensionality and the scarcity of ground truth data, that hinder the accuracy and efficiency of machine learning approaches. This paper presents a holistic approach for adaptive optimized hyperspectral-based land cover and land-use segmentation using automated machine learning (AutoML). We address the challenges of high-dimensional hyperspectral data through a revamped machine learning pipeline, thus emphasizing feature engineering tailored to hyperspectral classification tasks. We propose a framework that dissects feature engineering into distinct steps, thus allowing for comprehensive model generation and optimization. This framework incorporates AutoML techniques to streamline model selection, hyperparameter tuning, and data versioning, thus ensuring robust and reliable segmentation results. Our empirical investigation demonstrates the efficacy of our approach in automating feature engineering and optimizing model performance, even without extensive ground truth data. By integrating automatic optimization strategies into the segmentation workflow, our approach offers a systematic, efficient, and scalable solution for hyperspectral-based land cover and land-use classification. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. Optimizing Tourism Accommodation Offers by Integrating Language Models and Knowledge Graph Technologies.
- Author
- Cadeddu, Andrea, Chessa, Alessandro, De Leo, Vincenzo, Fenu, Gianni, Motta, Enrico, Osborne, Francesco, Reforgiato Recupero, Diego, Salatino, Angelo, and Secchi, Luca
- Subjects
- LANGUAGE models, NATURAL language processing, KNOWLEDGE graphs, CLASSIFICATION, LANGUAGE acquisition
- Abstract
Online platforms have become the primary means for travellers to search, compare, and book accommodations for their trips. Consequently, online platforms and revenue managers must acquire a comprehensive understanding of these dynamics to formulate competitive and appealing offerings. Recent advancements in natural language processing, specifically through the development of large language models, have demonstrated significant progress in capturing the intricate nuances of human language. On the other hand, knowledge graphs have emerged as potent instruments for representing and organizing structured information. Nevertheless, effectively integrating these two powerful technologies remains an ongoing challenge. This paper presents an innovative deep learning methodology that combines large language models with domain-specific knowledge graphs for classification of tourism offers. The main objective of our system is to assist revenue managers in the following two fundamental dimensions: (i) comprehending the market positioning of their accommodation offerings, taking into consideration factors such as accommodation price and availability, together with user reviews and demand, and (ii) optimizing presentations and characteristics of the offerings themselves, with the intention of improving their overall appeal. For this purpose, we developed a domain knowledge graph covering a variety of information about accommodations and implemented targeted feature engineering techniques to enhance the information representation within a large language model. To evaluate the effectiveness of our approach, we conducted a comparative analysis against alternative methods on four datasets about accommodation offers in London. The proposed solution obtained excellent results, significantly outperforming alternative methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. AutoAIViz
- Author
- Erick Oduor, Alexander G. Gray, Justin D. Weisz, Dakuo Wang, Michael Muller, Daniel Karl I. Weidele, and Josh Andres
- Subjects
- FOS: Computer and information sciences, Feature engineering, Hyperparameter, Computer Science - Machine Learning, Process (engineering), Computer science, business.industry, 05 social sciences, Short paper, Computer Science - Human-Computer Interaction, Machine Learning (stat.ML), 020207 software engineering, 02 engineering and technology, Machine Learning (cs.LG), Human-Computer Interaction (cs.HC), Visualization, Workflow, Experimental system, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, 0501 psychology and cognitive sciences, Artificial intelligence, business, 050107 human factors, Parallel coordinates
- Abstract
Artificial Intelligence (AI) can now automate the algorithm selection, feature engineering, and hyperparameter tuning steps in a machine learning workflow. Commonly known as AutoML or AutoAI, these technologies aim to relieve data scientists of tedious manual work. However, today's AutoAI systems often present only limited to no information about the process of how they select and generate model results. Thus, users often do not understand the process, nor do they trust the outputs. In this short paper, we provide a first user evaluation by 10 data scientists of an experimental system, AutoAIViz, that aims to visualize AutoAI's model generation process. We find that the proposed system helps users to complete the data science tasks and increases their understanding, toward the goal of increasing trust in the AutoAI system. (Comment: 5 pages, 1 figure, IUI2020)
- Published
- 2020
27. A review on customer segmentation methods for personalized customer targeting in e-commerce use cases.
- Author
- Alves Gomes, Miguel and Meisen, Tobias
- Subjects
- TARGET marketing, CONSUMERS' reviews, FEATURE selection, ELECTRONIC commerce
- Abstract
The importance of customer-oriented marketing has increased for companies in recent decades. With the advent of one-customer strategies, especially in e-commerce, traditional mass marketing in this area is becoming increasingly obsolete as customer-specific targeting becomes realizable. Such a strategy makes it essential to develop an underlying understanding of the interests and motivations of the individual customer. One method frequently used for this purpose is segmentation, which has evolved steadily in recent years. The aim of this paper is to provide a structured overview of the different segmentation methods and their current state of the art. For this purpose, we conducted an extensive literature search in which 105 publications between the years 2000 and 2022 were identified that deal with the analysis of customer behavior using segmentation methods. Based on this paper corpus, we identified a four-phase process consisting of information (data) collection, customer representation, customer analysis via segmentation, and customer targeting. With respect to customer representation and customer analysis by segmentation, we provide a comprehensive overview of the methods used in these process steps, examine temporal trends, and assess applicability to different dataset dimensionalities. In summary, customer representation is mainly solved by manual feature selection or RFM analysis. The most commonly used segmentation method is k-means, regardless of the use case and the amount of data, and it has remained widely used in recent years. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
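As context for the review's headline finding, RFM representation paired with k-means segmentation, here is a minimal sketch; the transaction table and cluster count are made up.

```python
# Illustrative sketch: RFM features plus k-means customer segmentation.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "c", "c", "c"],
    "days_ago": [2, 40, 90, 5, 7, 30],
    "amount": [50, 20, 15, 200, 120, 80],
})

rfm = orders.groupby("customer").agg(
    recency=("days_ago", "min"),      # days since last purchase
    frequency=("days_ago", "count"),  # number of purchases
    monetary=("amount", "sum"),       # total spend
)
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(rfm))
print(rfm)
```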
28. Investigation of Feature Engineering Methods for Domain-Knowledge-Assisted Bearing Fault Diagnosis.
- Author
- Bienefeld, Christoph, Becker-Dombrowsky, Florian Michael, Shatri, Etnik, and Kirchner, Eckhard
- Subjects
- FAULT diagnosis, METHODS engineering, MACHINE learning, RANDOM forest algorithms, HILBERT-Huang transform, DEEP learning, WAVELET transforms
- Abstract
The engineering challenge of rolling bearing condition monitoring has led to a large number of method developments over the past few years. Most commonly, vibration measurement data are used for fault diagnosis using machine learning algorithms. In current research, purely data-driven deep learning methods are becoming increasingly popular, aiming for accurate predictions of bearing faults without requiring bearing-specific domain knowledge. Opposing this trend in popularity, the present paper takes a more traditional approach, incorporating domain knowledge by evaluating a variety of feature engineering methods in combination with a random forest classifier. For a comprehensive feature engineering study, a total of 42 mathematical feature formulas are combined with the preprocessing methods of envelope analysis, empirical mode decomposition, wavelet transforms, and frequency band separations. While each single processing method and feature formula is known from the literature, the presented paper contributes to the body of knowledge by investigating novel series connections of processing methods and feature formulas. Using the CWRU bearing fault data for performance evaluation, feature calculation based on the processing method of frequency band separation leads to particularly high prediction accuracies, while at the same time being very efficient in terms of low computational effort. Additionally, in comparison with deep learning approaches, the proposed feature engineering method provides excellent accuracies and enables explainability. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
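The frequency-band-separation idea that performed best in this study can be sketched as band-pass filtering followed by per-band feature formulas; the band edges, filter order, and the three features below are assumptions, not the paper's 42 formulas.

```python
# Illustrative sketch: band-pass the vibration signal, then compute simple
# feature formulas per band for a downstream random forest.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 12_000  # assumed sampling rate, Hz
rng = np.random.default_rng(0)
vibration = rng.standard_normal(fs)  # 1 s of stand-in bearing vibration

def band_features(sig, lo, hi):
    b, a = butter(4, [lo, hi], btype="band", fs=fs)
    x = filtfilt(b, a, sig)                     # zero-phase band-pass
    return [x.std(),                            # RMS-like energy
            np.abs(x).max() / x.std(),          # crest factor
            ((x[:-1] * x[1:]) < 0).mean()]      # zero-crossing rate

bands = [(50, 500), (500, 2000), (2000, 5000)]  # assumed band edges
features = np.concatenate([band_features(vibration, lo, hi) for lo, hi in bands])
print(features)  # one row of the feature matrix fed to the classifier
```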
29. Automated Detection of Radiology Reports that Require Follow-up Imaging Using Natural Language Processing Feature Engineering and Machine Learning Classification
- Author
- Hanna M. Zafar, Robert Lou, Tessa S. Cook, Darco Lalevic, and Charles Chambers
- Subjects
- Feature engineering, medicine.medical_specialty, Computer science, Decision tree, computer.software_genre, 030218 nuclear medicine & medical imaging, Machine Learning, 03 medical and health sciences, Naive Bayes classifier, 0302 clinical medicine, medicine, Humans, Radiology, Nuclear Medicine and imaging, Natural Language Processing, Original Paper, Radiological and Ultrasound Technology, business.industry, Bayes Theorem, Computer Science Applications, Statistical classification, Tokenization (data security), Binary classification, Trigram, Artificial intelligence, Radiology, business, F1 score, computer, 030217 neurology & neurosurgery, Natural language processing, Follow-Up Studies
- Abstract
While radiologists regularly issue follow-up recommendations, our preliminary research has shown that anywhere from 35 to 50% of patients who receive follow-up recommendations for findings of possible cancer on abdominopelvic imaging do not return for follow-up. As such, they remain at risk for adverse outcomes related to missed or delayed cancer diagnosis. In this study, we develop an algorithm to automatically detect free text radiology reports that have a follow-up recommendation using natural language processing (NLP) techniques and machine learning models. The data set used in this study consists of 6000 free text reports from the author's institution. NLP techniques are used to engineer 1500 features, which include the most informative unigrams, bigrams, and trigrams in the training corpus after performing tokenization and Porter stemming. On this data set, we train naive Bayes, decision tree, and maximum entropy models. The decision tree model, with an F1 score of 0.458 and accuracy of 0.862, outperforms both the naive Bayes (F1 score of 0.381) and maximum entropy (F1 score of 0.387) models. The models were analyzed to determine predictive features, with term frequency of n-grams such as "renal neoplasm" and "evalu with enhanc" being most predictive of a follow-up recommendation. Key to maximizing performance was feature engineering that extracts predictive information and appropriate selection of machine learning algorithms based on the feature set.
- Published
- 2019
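The reported feature engineering, Porter-stemmed unigram-to-trigram counts feeding a decision tree, can be sketched as below; the report snippets are invented, and wiring the stemmer into the vectorizer this way is one possible implementation.

```python
# Illustrative sketch: Porter-stemmed n-gram counts feeding a decision tree.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

stem = PorterStemmer()

def make_analyzer():
    # Reuse the default word n-gram analyzer, then stem each token so that
    # features look like the paper's "evalu with enhanc" stemmed trigrams.
    base = CountVectorizer(ngram_range=(1, 3)).build_analyzer()
    return lambda doc: [" ".join(stem.stem(w) for w in g.split())
                        for g in base(doc)]

reports = ["recommend follow-up CT to evaluate renal neoplasm",
           "no acute findings, no follow-up required"] * 20
labels = [1, 0] * 20  # 1 = follow-up recommendation present

vec = CountVectorizer(analyzer=make_analyzer())
X = vec.fit_transform(reports)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict(vec.transform(["follow-up imaging recommended"])))
```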
30. Reliable Deep Learning–Based Detection of Misplaced Chest Electrodes During Electrocardiogram Recording: Algorithm Development and Validation
- Author
- Stephen J Leslie, Raymond Bond, Aleeha Iftikhar, Charles Knoery, Victoria McGilligan, Ali Rababah, Dewar Finlay, Khaled Rjoob, Daniel Guldenring, Anne McShane, and Aaron Peace
- Subjects
- physicians, engineering, Computer applications to medicine. Medical informatics, R858-859.7, Health Informatics, ECG interpretation, 02 engineering and technology, Precordial examination, 030204 cardiovascular system & hematology, Left ventricular hypertrophy, 03 medical and health sciences, 0302 clinical medicine, Health Information Management, cardiovascular disease, electrode misplacement, 0202 electrical engineering, electronic engineering, information engineering, medicine, Second intercostal space, Myocardial infarction, Medical diagnosis, Trial registration, Original Paper, business.industry, ECG, Deep learning, deep learning, myocardial, medicine.disease, feature engineering, machine learning, myocardial infarction, 020201 artificial intelligence & image processing, Artificial intelligence, business, Algorithm
- Abstract
Background: A 12-lead electrocardiogram (ECG) is the most commonly used method to diagnose patients with cardiovascular diseases. However, there are a number of possible misinterpretations of the ECG that can be caused by several different factors, such as the misplacement of chest electrodes. Objective: The aim of this study is to build advanced algorithms to detect precordial (chest) electrode misplacement. Methods: In this study, we used traditional machine learning (ML) and deep learning (DL) to autodetect the misplacement of electrodes V1 and V2 using features from the resultant ECG. The algorithms were trained using data extracted from high-resolution body surface potential maps of patients who were diagnosed with myocardial infarction, diagnosed with left ventricular hypertrophy, or a normal ECG. Results: DL achieved the highest accuracy in this study for detecting V1 and V2 electrode misplacement, with an accuracy of 93.0% (95% CI 91.46-94.53) for misplacement in the second intercostal space. The performance of DL in the second intercostal space was benchmarked against physicians (n=11; age 47.3 years, SD 15.5) who were experienced in reading ECGs (mean number of ECGs read in the past year 436.54, SD 397.9). Physicians were poor at recognizing chest electrode misplacement on the ECG and achieved a mean accuracy of 60% (95% CI 56.09-63.90), which was significantly poorer than that of DL. Conclusions: DL provides the best performance for detecting chest electrode misplacement when compared with the ability of experienced physicians. DL and ML could be used to help flag ECGs that have been incorrectly recorded and flag that the data may be flawed, which could reduce the number of erroneous diagnoses.
- Published
- 2021
31. Noninvasive Real-Time Mortality Prediction in Intensive Care Units Based on Gradient Boosting Method: Model Development and Validation Study
- Author
- Hao Wang, Congpu Zhao, Yun Long, Dongkai Li, Weiguo Zhu, Na Hong, Longxiang Su, and Huizhen Jiang
- Subjects
- 0301 basic medicine, Feature engineering, medicine.medical_specialty, Computer applications to medicine. Medical informatics, R858-859.7, Health Informatics, intensive care unit, law.invention, 03 medical and health sciences, 0302 clinical medicine, Health Information Management, law, noninvasive, Intensive care, Medicine, 030212 general & internal medicine, Mortality prediction, mortality prediction, Intensive care medicine, Oxygen saturation (medicine), Original Paper, real time, business.industry, Area under the curve, Intensive care unit, 030104 developmental biology, Mean blood pressure, Gradient boosting, business
- Abstract
Background: Monitoring critically ill patients in intensive care units (ICUs) in real time is vitally important. Although scoring systems are most often used in risk prediction of mortality, they are usually not highly precise, and the clinical data are often simply weighted. This method is inefficient and time-consuming in the clinical setting. Objective: The objective of this study was to integrate all medical data and noninvasively predict the real-time mortality of ICU patients using a gradient boosting method. Specifically, our goal was to predict mortality using a noninvasive method to minimize the discomfort to patients. Methods: In this study, we established five models to predict mortality in real time based on different features. According to the monitoring, laboratory, and scoring data, we constructed the feature engineering. The five real-time mortality prediction models were RMM (based on monitoring features), RMA (based on monitoring features and the Acute Physiology and Chronic Health Evaluation [APACHE]), RMS (based on monitoring features and Sequential Organ Failure Assessment [SOFA]), RMML (based on monitoring and laboratory features), and RM (based on all monitoring, laboratory, and scoring features). All models were built using LightGBM and tested with XGBoost. We then compared the performance of all models, with particular focus on the noninvasive method, the RMM model. Results: After extensive experiments, the area under the curve of the RMM model was 0.8264, which was superior to that of the RMA and RMS models. Therefore, predicting mortality using the noninvasive method was both efficient and practical, as it eliminated the need for extra physical interventions on patients, such as the drawing of blood. In addition, we explored the top nine features relevant to real-time mortality prediction: invasive mean blood pressure, heart rate, invasive systolic blood pressure, oxygen concentration, oxygen saturation, balance of input and output, total input, invasive diastolic blood pressure, and noninvasive mean blood pressure. These nine features should be given more focus in routine clinical practice. Conclusions: The results of this study may be helpful in real-time mortality prediction in patients in the ICU, especially the noninvasive method. It is efficient and favorable to patients, which offers a strong practical significance.
- Published
- 2021
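A minimal sketch of the model family used, gradient boosting via LightGBM on noninvasive monitoring features, follows; the feature names echo the abstract's list, but the values and labels are simulated.

```python
# Illustrative sketch: LightGBM classifier on simulated monitoring features.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.normal(80, 15, n),   # mean blood pressure
                     rng.normal(90, 20, n),   # heart rate
                     rng.normal(97, 2, n)])   # oxygen saturation
y = rng.integers(0, 2, n)                     # mortality label (simulated)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_tr, y_tr)
print("risk scores:", model.predict_proba(X_te)[:3, 1])  # inputs to an AUC
```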
32. Malware Detection Issues, Challenges, and Future Directions: A Survey.
- Author
- Aboaoja, Faitouri A., Zainal, Anazida, Ghaleb, Fuad A., Al-rimy, Bander Ali Saleh, Eisa, Taiseer Abdalla Elfadil, and Elnour, Asma Abbas Hassan
- Subjects
- MALWARE, FEATURE extraction, SCIENTIFIC community, COMPUTER crimes
- Abstract
The evolution of recent malicious software with the rising use of digital services has increased the probability of corrupting data, stealing information, or other cybercrimes by malware attacks. Therefore, malicious software must be detected before it impacts a large number of computers. Recently, many malware detection solutions have been proposed by researchers. However, many challenges limit these solutions to effectively detecting several types of malware, especially zero-day attacks due to obfuscation and evasion techniques, as well as the diversity of malicious behavior caused by the rapid rate of new malware and malware variants being produced every day. Several review papers have explored the issues and challenges of malware detection from various viewpoints. However, there is a lack of a deep review article that associates each analysis and detection approach with the data type. Such an association is imperative for the research community as it helps to determine the suitable mitigation approach. In addition, the current survey articles stopped at a generic detection approach taxonomy. Moreover, some review papers presented the feature extraction methods as static, dynamic, and hybrid based on the utilized analysis approach and neglected the feature representation methods taxonomy, which is considered essential in developing the malware detection model. This survey bridges the gap by providing a comprehensive state-of-the-art review of malware detection model research. This survey introduces a feature representation taxonomy in addition to the deeper taxonomy of malware analysis and detection approaches and links each approach with the most commonly used data types. The feature extraction method is introduced according to the techniques used instead of the analysis approach. The survey ends with a discussion of the challenges and future research directions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
33. Depression Detection on Reddit With an Emotion-Based Attention Network: Algorithm Development and Validation
- Author
-
Bo Xu, Shaowu Zhang, Hongfei Lin, Shichang Sun, Liang Yang, and Lu Ren
- Subjects
Feature engineering ,Computer science ,social media ,emotion ,Health Informatics ,02 engineering and technology ,Task (project management) ,emotional semantic information ,Health Information Management ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Social media ,natural language processing ,Depression (differential diagnoses) ,Original Paper ,algorithm ,Recall ,business.industry ,Deep learning ,deep learning ,Mental health ,dynamic fusion strategy ,attention network ,depression detection ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,mental health ,Cognitive psychology ,Computer technology - Abstract
Background As a common mental disease, depression seriously affects people's physical and mental health. According to the statistics of the World Health Organization, depression is one of the main causes of suicide and self-harm events worldwide. Therefore, strengthening depression detection can effectively reduce the occurrence of suicide or self-harm events, saving more people and families. With the development of computer technology, some researchers are trying to apply natural language processing techniques to detect depressed people automatically. Many existing feature engineering methods for depression detection are based on emotional characteristics, but these methods do not consider high-level emotional semantic information. Current deep learning methods for depression detection cannot accurately extract effective emotional semantic information. Objective In this paper, we propose an emotion-based attention network, including a semantic understanding network and an emotion understanding network, which can effectively capture high-level emotional semantic information to improve the depression detection task. Methods The semantic understanding network module is used to capture contextual semantic information. The emotion understanding network module is used to capture emotional semantic information and contains two units: a positive emotion understanding unit and a negative emotion understanding unit, which capture positive and negative emotional information, respectively. We further propose a dynamic fusion strategy in the emotion understanding network module to fuse the positive and negative emotional information. Results We evaluated our method on the Reddit data set. The experimental results showed that the proposed emotion-based attention network model achieved an accuracy, precision, recall, and F-measure of 91.30%, 91.91%, 96.15%, and 93.98%, respectively, results comparable with state-of-the-art methods. Conclusions The experimental results showed that our model is competitive with state-of-the-art models. The semantic understanding network module, the emotion understanding network module, and the dynamic fusion strategy are effective modules for depression detection. In addition, the experimental results verified that emotional semantic information is effective in depression detection.
- Published
- 2021
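For readers unfamiliar with the architecture described in the entry above, the following PyTorch sketch shows one plausible reading of it: a BiLSTM semantic-understanding module, separate positive and negative emotion attention units, and a learned gate as the dynamic fusion strategy. All layer sizes and wiring are assumptions; this is not the authors' code.

```python
# Illustrative sketch (not the authors' implementation) of an emotion-based
# attention network with gated "dynamic fusion" of two emotion channels.
import torch
import torch.nn as nn

class EmotionAttentionNet(nn.Module):
    def __init__(self, vocab=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.semantic = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.pos_attn = nn.Linear(2 * dim, 1)   # positive-emotion attention
        self.neg_attn = nn.Linear(2 * dim, 1)   # negative-emotion attention
        self.gate = nn.Linear(4 * dim, 1)       # dynamic fusion gate
        self.out = nn.Linear(2 * dim, 2)        # depressed vs. not

    def forward(self, tokens):
        h, _ = self.semantic(self.emb(tokens))             # contextual semantics
        pos = (torch.softmax(self.pos_attn(h), 1) * h).sum(1)
        neg = (torch.softmax(self.neg_attn(h), 1) * h).sum(1)
        g = torch.sigmoid(self.gate(torch.cat([pos, neg], -1)))
        fused = g * pos + (1 - g) * neg                    # dynamic fusion
        return self.out(fused)

logits = EmotionAttentionNet()(torch.randint(0, 10000, (4, 50)))
print(logits.shape)  # (4, 2)
```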
34. Advanced Algorithmic Approaches for Scam Profile Detection on Instagram.
- Author
-
Bokolo, Biodoumoye George and Liu, Qingzhong
- Subjects
MACHINE learning ,SWINDLERS & swindling ,SOCIAL media ,RANDOM forest algorithms ,DECISION trees ,LOGISTIC regression analysis - Abstract
Social media platforms like Instagram have become a haven for online scams, which employ various deceptive tactics to exploit unsuspecting users. This paper investigates advanced algorithmic approaches to combat this growing threat. We explore various machine learning models for scam profile detection on Instagram. Our methodology involves collecting a comprehensive dataset from a trusted source and meticulously preprocessing the data for analysis. We then evaluate the effectiveness of a suite of machine learning algorithms, including decision trees, logistic regression, SVMs, and several ensemble methods. Each model's performance is measured using established metrics such as accuracy, precision, recall, and F1-scores. Our findings indicate that ensemble methods, particularly random forest, XGBoost, and gradient boosting, outperform the other models, achieving an accuracy of 90%. The insights garnered from this study contribute significantly to the body of knowledge in social media forensics, offering practical implications for the development of automated tools to combat online deception. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
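The model bake-off described in the entry above maps directly onto scikit-learn. A compact sketch follows, with make_classification standing in for the Instagram profile dataset, which is an assumption; feature layout and class balance are also illustrative.

```python
# Sketch of the described comparison on placeholder profile features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8, 0.2],
                           random_state=0)  # stand-in for profile features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("random_forest", RandomForestClassifier(n_estimators=300)),
                  ("gradient_boosting", GradientBoostingClassifier())]:
    clf.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, clf.predict(X_te), digits=3))
```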
35. A Novel Feature Engineering-Based SOH Estimation Method for Lithium-Ion Battery with Downgraded Laboratory Data.
- Author
-
Wang, Jinyu, Zhang, Caiping, Meng, Xiangfeng, Zhang, Linjing, Li, Xu, and Zhang, Weige
- Subjects
ELECTRIC vehicle batteries ,MEAN square algorithms ,FEATURE extraction ,STANDARD deviations ,ELECTRIC vehicles ,PIPELINE transportation - Abstract
Accurate estimation of lithium-ion battery state of health (SOH) can effectively improve the operational safety of electric vehicles and optimize the battery operation strategy. However, previous SOH estimation algorithms developed on high-precision laboratory data have ignored the discrepancies between field and laboratory data, leading to difficulties in field application. Therefore, aiming to bridge the gap between lab-developed models and field operational data, this paper presents a feature engineering-based SOH estimation method with downgraded laboratory battery data, applicable to real vehicles under different operating conditions. First, a data processing pipeline is proposed to downgrade laboratory data to operational fleet-level data. Six key features are then extracted on partial ranges to capture the battery's aging state. Finally, three machine learning (ML) algorithms suitable for easy online deployment are employed for SOH assessment. The results show that the hybrid feature set performs well and achieves high accuracy in SOH estimation for downgraded data, with a minimum root mean square error (RMSE) of 0.36%. Only three mechanism features derived from the incremental capacity curve can still provide a proper assessment, with a minimum RMSE of 0.44%. Voltage-based features can assist in evaluating battery state, improving accuracy by up to 20%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
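The incremental-capacity features mentioned in the entry above can be illustrated concretely. The sketch below derives dQ/dV-based features from a partial charge curve; the smoothing window, the three feature names, and the toy sigmoid curve are assumptions for illustration, not the paper's exact mechanism features.

```python
# Sketch of incremental-capacity (dQ/dV) feature extraction for SOH models.
import numpy as np

def ic_features(voltage, capacity, window=25):
    dqdv = np.gradient(capacity, voltage)              # incremental capacity
    smooth = np.convolve(dqdv, np.ones(window) / window, mode="same")
    peak = smooth.argmax()
    return {"ic_peak_height": smooth[peak],
            "ic_peak_voltage": voltage[peak],
            "ic_area": smooth.sum() * (voltage[1] - voltage[0])}

v = np.linspace(3.4, 4.1, 500)                         # partial voltage range
q = 1.8 / (1 + np.exp(-(v - 3.7) * 20))                # toy charge curve (Ah)
print(ic_features(v, q))
```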
36. Revolutionizing Wind Power Prediction—The Future of Energy Forecasting with Advanced Deep Learning and Strategic Feature Engineering.
- Author
-
Habib, Md. Ahasan and Hossain, M. J.
- Subjects
DEEP learning ,WIND power ,ENERGY futures ,STANDARD deviations ,WIND forecasting ,FORECASTING - Abstract
This paper introduces an innovative framework for wind power prediction that focuses on the future of energy forecasting utilizing intelligent deep learning and strategic feature engineering. This research investigates the application of a state-of-the-art deep learning model for wind energy prediction to make extremely short-term forecasts using real-time data on wind generation from New South Wales, Australia. In contrast with typical approaches to wind energy forecasting, this model relies entirely on historical data and strategic feature engineering to make predictions, rather than relying on meteorological parameters. A hybrid feature engineering strategy that integrates features from several feature generation techniques to obtain the optimal input parameters is a significant contribution to this work. The model's performance is assessed using key metrics, yielding optimal results with a Mean Absolute Error (MAE) of 8.76, Mean Squared Error (MSE) of 139.49, Root Mean Squared Error (RMSE) of 11.81, R-squared score of 0.997, and Mean Absolute Percentage Error (MAPE) of 4.85%. Additionally, the proposed framework outperforms six other deep learning and hybrid deep learning models in terms of wind energy prediction accuracy. These findings highlight the importance of advanced data analysis for feature generation in data processing, pointing to its key role in boosting the precision of forecasting applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
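Because the framework in the entry above forecasts from historical generation alone, its feature engineering amounts to constructing lagged and rolling statistics from the power series itself. A hypothetical pandas sketch follows; the lag choices and the 5-minute resolution are assumptions, not the paper's settings.

```python
# Sketch: features built purely from the historical power series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ts = pd.Series(rng.random(1000),
               index=pd.date_range("2024-01-01", periods=1000, freq="5min"),
               name="power_mw")                    # stand-in generation data

feats = pd.DataFrame({f"lag_{k}": ts.shift(k) for k in (1, 2, 3, 6, 12)})
feats["roll_mean_1h"] = ts.shift(1).rolling(12).mean()
feats["roll_std_1h"] = ts.shift(1).rolling(12).std()
feats["hour"] = ts.index.hour                      # simple temporal encoding
data = feats.assign(target=ts).dropna()            # extremely short-term target
print(data.head())
```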
37. Urban Vegetation Classification for Unmanned Aerial Vehicle Remote Sensing Combining Feature Engineering and Improved DeepLabV3+.
- Author
-
Cao, Qianyang, Li, Man, Yang, Guangbin, Tao, Qian, Luo, Yaopei, Wang, Renru, and Chen, Panfang
- Subjects
VEGETATION classification ,DRONE aircraft ,REMOTE sensing ,URBAN plants ,DEEP learning ,VISIBLE spectra ,THEMATIC mapper satellite ,LANDSAT satellites - Abstract
Addressing the problems of misclassification and omissions in fine-grained urban vegetation classification with current remote sensing classification methods, this research proposes an intelligent urban vegetation classification method that combines feature engineering and an improved DeepLabV3+, based on unmanned aerial vehicle visible spectrum images. The method constructs feature engineering under the ReliefF algorithm to increase the number of features in the samples, enabling the deep learning model to learn more detailed information about the vegetation. Moreover, the method improves the classical DeepLabV3+ network structure by (1) replacing the backbone network with MobileNetV2; (2) adjusting the atrous spatial pyramid pooling dilation rates; and (3) adding an attention mechanism via the convolutional block attention module. Experiments were conducted with self-constructed sample datasets, where the method was compared with a fully convolutional network (FCN), U-Net, and ShuffleNetV2; the transferability of the method was tested as well. The results show that the method in this paper outperforms FCN, U-Net, and ShuffleNetV2, reaching 92.27%, 91.48%, and 85.63% on the accuracy evaluation indices of overall accuracy, macro F1, and mean intersection over union, respectively. Furthermore, the segmentation results are accurate and complete, which effectively alleviates misclassifications and omissions of urban vegetation; moreover, the method has a certain transfer ability and can quickly and accurately classify vegetation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
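The ReliefF stage in the entry above can be sketched independently of the segmentation network. Below is a minimal Relief-style scorer (a simplified single-nearest-neighbor variant, not the paper's exact ReliefF configuration) of the kind used to rank candidate features before feeding the deep model.

```python
# Minimal Relief-style feature scoring; a sketch, not the paper's setup.
import numpy as np

def relief_scores(X, y, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    X = (X - X.min(0)) / (np.ptp(X, axis=0) + 1e-12)   # scale to [0, 1]
    w = np.zeros(X.shape[1])
    for i in rng.integers(0, len(X), n_iter):
        d = np.abs(X - X[i]).sum(1)                    # L1 distance to sample i
        d[i] = np.inf
        hit = np.argmin(np.where(y == y[i], d, np.inf))    # nearest same class
        miss = np.argmin(np.where(y != y[i], d, np.inf))   # nearest other class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter

X = np.random.rand(300, 6)
y = (X[:, 0] > 0.5).astype(int)
print(relief_scores(X, y))    # feature 0 should score highest
```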
38. The knowledge graph as the default data model for learning on heterogeneous knowledge
- Author
-
Victor de Boer, Xander Wilcke, Peter Bloem, Spatial Economics, CLUE+, Artificial intelligence, Network Institute, Knowledge Representation and Reasoning, Business Web and Media, and Intelligent Information Systems
- Subjects
Feature engineering ,business.industry ,Computer science ,02 engineering and technology ,Machine learning ,computer.software_genre ,Domain (software engineering) ,Machine Learning ,Knowledge-based systems ,Knowledge extraction ,Data model ,020204 information systems ,End-to-End Learning ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Use case ,Position paper ,Artificial intelligence ,Knowledge Graphs ,business ,Raw data ,Semantic Web ,computer - Abstract
In modern machine learning, raw data is the preferred input for our models. Where a decade ago data scientists were still engineering features, manually picking out the details they thought salient, they now prefer the data in their raw form. As long as we can assume that all relevant and irrelevant information is present in the input data, we can design deep models that build up intermediate representations to sift out relevant features. However, these models are often domain specific and tailored to the task at hand, and therefore unsuited for learning on heterogeneous knowledge: information of different types and from different domains. If we can develop methods that operate on this form of knowledge, we can dispense with a great deal of ad-hoc feature engineering and train deep models end-to-end in many more domains. To accomplish this, we first need a data model capable of expressing heterogeneous knowledge naturally in various domains, in as usable a form as possible, and satisfying as many use cases as possible. In this position paper, we argue that the knowledge graph is a suitable candidate for this data model. This paper describes current research and discusses some of the promises and challenges of this approach.
- Published
- 2017
39. Developing a Process for the Analysis of User Journeys and the Prediction of Dropout in Digital Health Interventions: Machine Learning Approach
- Author
-
Vincent Bremer, Burkhardt Funk, Frances P. Thorndike, Lee M. Ritterband, and Philip I. Chow
- Subjects
Feature engineering ,Adult ,Male ,020205 medical informatics ,Computer science ,Psychological intervention ,digital health ,Health Informatics ,Context (language use) ,02 engineering and technology ,Machine learning ,computer.software_genre ,dropout ,lcsh:Computer applications to medicine. Medical informatics ,Machine Learning ,03 medical and health sciences ,Young Adult ,0302 clinical medicine ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,030212 general & internal medicine ,Dropout (neural networks) ,Aged ,Original Paper ,business.industry ,Dropout ,User journey ,lcsh:Public aspects of medicine ,Business informatics ,lcsh:RA1-1270 ,Middle Aged ,Digital health ,Mobile Applications ,Alternating decision tree ,lcsh:R858-859.7 ,The Internet ,Female ,Artificial intelligence ,business ,computer ,Internet-Based Intervention - Abstract
Background: User dropout is a widespread concern in the delivery and evaluation of digital (ie, web and mobile apps) health interventions. Researchers have yet to fully realize the potential of the large amount of data generated by these technology-based programs. Of particular interest is the ability to predict who will drop out of an intervention. This may be possible through the analysis of user journey data—self-reported as well as system-generated data—produced by the path (or journey) an individual takes to navigate through a digital health intervention. Objective: The purpose of this study is to provide a step-by-step process for the analysis of user journey data and eventually to predict dropout in the context of digital health interventions. The process is applied to data from an internet-based intervention for insomnia as a way to illustrate its use. The completion of the program is contingent upon completing 7 sequential cores, which include an initial tutorial core. Dropout is defined as not completing the seventh core. Methods: Steps of user journey analysis, including data transformation, feature engineering, and statistical model analysis and evaluation, are presented. Dropouts were predicted based on data from 151 participants from a fully automated web-based program (Sleep Healthy Using the Internet) that delivers cognitive behavioral therapy for insomnia. Logistic regression with L1 and L2 regularization, support vector machines, and boosted decision trees were used and evaluated based on their predictive performance. Relevant features from the data are reported that predict user dropout. Results: Accuracy of predicting dropout (area under the curve [AUC] values) varied depending on the program core and the machine learning technique. After model evaluation, boosted decision trees achieved AUC values ranging between 0.6 and 0.9. Additional handcrafted features, including time to complete certain steps of the intervention, time to get out of bed, and days since the last interaction with the system, contributed to the prediction performance. Conclusions: The results support the feasibility and potential of analyzing user journey data to predict dropout. Theory-driven handcrafted features increased the prediction performance. The ability to predict dropout at an individual level could be used to enhance decision making for researchers and clinicians as well as inform dynamic intervention regimens.
- Published
- 2020
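The modeling step in the entry above is a standard supervised bake-off. A sketch follows with synthetic stand-ins for the user-journey features; the 151-participant sample size is kept for flavor, and everything else (feature count, cross-validation setup) is an assumption.

```python
# Sketch of the model comparison: L1/L2 logistic regression, SVM, and
# boosted decision trees, evaluated by cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=151, n_features=20, random_state=0)
models = {
    "logreg_l1": LogisticRegression(penalty="l1", solver="liblinear"),
    "logreg_l2": LogisticRegression(),
    "svm": SVC(),
    "boosted_trees": GradientBoostingClassifier(),
}
for name, m in models.items():
    auc = cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC={auc:.2f}")
```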
40. A novel tolerance geometric method based on machine learning.
- Author
-
Cui, Lu-jun, Sun, Man-ying, Cao, Yan-long, Zhao, Qi-jian, Zeng, Wen-han, and Guo, Shi-rui
- Subjects
MACHINE learning ,MACHINE performance ,PRODUCT costing ,TECHNICAL specifications ,PRODUCT design - Abstract
In most cases, designers must manually specify geometric tolerance types and values when designing mechanical products. For the same nominal geometry, different designers may specify different types and values of geometric tolerances. To reduce this uncertainty and realize tolerance specification automatically, a tolerance specification method based on machine learning is proposed. The innovation of this paper is to identify the information that affects geometric tolerance selection and to use machine learning methods to generate tolerance specifications, changing the realization of tolerance specifications from rule-driven to data-driven. In this paper, feature engineering is performed on the data for the application scenarios of tolerance specifications, which improves the performance of the machine learning model. The approach first treats past tolerance specification schemes as cases and assembles them into a tolerance specification database containing information such as the datum reference frame, positional relationship, spatial relationship, and product cost. Feature engineering is then performed on the data, and a machine learning algorithm is established to convert the tolerance specification problem into an optimization problem. Finally, a gear reducer case study is given to verify the method. The results are evaluated with three different machine learning evaluation indicators and compared with the tolerance specification method used in industry. The final results show that the machine learning algorithm can automatically generate tolerance specifications and that, after feature engineering, the accuracy of the tolerance specification results is improved. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
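One way to read the data-driven tolerance specification in the entry above is as multiclass classification over encoded case attributes. The sketch below is a hypothetical rendering: the column names, category values, and the random-forest choice are all assumptions rather than the paper's method.

```python
# Hypothetical sketch: tolerance-type selection as supervised classification
# over one-hot-encoded design-case features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

cases = pd.DataFrame({   # toy stand-in for a tolerance specification database
    "positional_relation": ["coaxial", "parallel", "coaxial", "perpendicular"],
    "spatial_relation": ["contact", "gap", "contact", "contact"],
    "product_cost": [1.0, 2.5, 1.2, 3.0],
    "tolerance_type": ["runout", "position", "runout", "perpendicularity"],
})
model = Pipeline([
    ("enc", ColumnTransformer(
        [("cat", OneHotEncoder(), ["positional_relation", "spatial_relation"])],
        remainder="passthrough")),
    ("clf", RandomForestClassifier(random_state=0)),
])
model.fit(cases.drop(columns="tolerance_type"), cases["tolerance_type"])
print(model.predict(cases.drop(columns="tolerance_type")))
```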
41. GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text
- Author
-
Cécile Pereira, Xiaolin Li, Qile Zhu, and Ana Conesa
- Subjects
0301 basic medicine ,Statistics and Probability ,Feature engineering ,Computer science ,Context (language use) ,02 engineering and technology ,computer.software_genre ,Biochemistry ,Convolutional neural network ,03 medical and health sciences ,Deep Learning ,Named-entity recognition ,0202 electrical engineering, electronic engineering, information engineering ,Molecular Biology ,Artificial neural network ,business.industry ,Deep learning ,Computational Biology ,Original Papers ,Computer Science Applications ,Computational Mathematics ,030104 developmental biology ,Computational Theory and Mathematics ,020201 artificial intelligence & image processing ,Artificial intelligence ,Data and Text Mining ,business ,computer ,Natural language processing ,Word (computer architecture) ,Software - Abstract
Motivation Best performing named entity recognition (NER) methods for biomedical literature are based on hand-crafted features or task-specific rules, which are costly to produce and difficult to generalize to other corpora. End-to-end neural networks achieve state-of-the-art performance without hand-crafted features and task-specific knowledge in non-biomedical NER tasks. However, in the biomedical domain, using the same architecture does not yield competitive performance compared with conventional machine learning models. Results We propose a novel end-to-end deep learning approach for biomedical NER tasks that leverages local contexts based on n-gram character and word embeddings via a Convolutional Neural Network (CNN). We call this approach GRAM-CNN. To automatically label a word, this method uses the local information around the word. Therefore, the GRAM-CNN method does not require any specific knowledge or feature engineering and can, in principle, be applied to a wide range of existing NER problems. The GRAM-CNN approach was evaluated on three well-known biomedical datasets containing different BioNER entities. It obtained an F1-score of 87.26% on the Biocreative II dataset, 87.26% on the NCBI dataset and 72.57% on the JNLPBA dataset. These results place GRAM-CNN among the leading biological NER methods. To the best of our knowledge, we are the first to apply CNN-based structures to BioNER problems. Availability and implementation The GRAM-CNN source code, datasets and pre-trained model are available online at: https://github.com/valdersoul/GRAM-CNN. Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2017
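The core of GRAM-CNN described above, CNN-derived n-gram features over character embeddings combined with word embeddings, can be sketched briefly. Layer sizes and kernel widths below are assumptions; the released implementation at the GitHub link above is authoritative.

```python
# Sketch of n-gram character features via CNN, concatenated with a word
# embedding; an illustration of the idea, not the GRAM-CNN codebase.
import torch
import torch.nn as nn

class CharNgramCNN(nn.Module):
    def __init__(self, n_chars=100, c_dim=30, n_words=5000, w_dim=100):
        super().__init__()
        self.c_emb = nn.Embedding(n_chars, c_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(c_dim, 32, k, padding=k // 2) for k in (2, 3, 4))
        self.w_emb = nn.Embedding(n_words, w_dim)

    def forward(self, chars, words):
        # chars: (batch, word_len); words: (batch,)
        c = self.c_emb(chars).transpose(1, 2)            # (batch, c_dim, len)
        grams = [conv(c).max(dim=2).values for conv in self.convs]
        return torch.cat(grams + [self.w_emb(words)], dim=1)

feat = CharNgramCNN()(torch.randint(0, 100, (8, 12)), torch.randint(0, 5000, (8,)))
print(feat.shape)  # (8, 32*3 + 100): per-word feature for a downstream tagger
```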
42. Prediction and evaluation of health state for power battery based on Ridge linear regression model.
- Author
-
Huang, Bixiong, Liao, Haiyu, Wang, Yiquan, Liu, Xintian, and Yan, Xiao
- Abstract
The state of health (SOH) of a power battery reflects the difference between the battery's current performance and its performance when it left the factory. Accurate prediction of SOH is key to improving battery cycle efficiency. This paper studies the application of data-driven algorithms to power battery health estimation. First, data from actual operating vehicles monitored on the data platform are used as the research objects: a charging-event segmentation algorithm is designed for the full dataset, and a K-means clustering model is used to extract slow-charging events. Second, feature engineering is performed on the data, including Pearson and Spearman correlation analysis for numerical features and one-hot encoding for categorical features, to determine the final input features of the SOH model. Finally, a Ridge linear regression model is used to predict the health status of the power battery. The research shows that the MAE is less than 5%, which meets the needs of practical use. In addition, this paper compares Ridge with three other models: linear regression, Lasso, and Elastic Net. The results show that the linear regression model with L2 regularization is more applicable in low-dimensional feature application scenarios without cell data for SOH prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
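The four linear models named in the entry above compare directly in scikit-learn. A sketch on synthetic low-dimensional features follows; the regularization strengths and the data are assumptions.

```python
# Sketch comparing the four linear models by cross-validated MAE.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))                 # low-dimensional feature scenario
soh = 100 - 2 * X[:, 0] + rng.normal(scale=0.5, size=400)  # toy SOH target (%)

for name, m in [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0)),
                ("lasso", Lasso(alpha=0.1)), ("elastic", ElasticNet(alpha=0.1))]:
    mae = -cross_val_score(m, X, soh, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE={mae:.3f}")
```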
43. Upper-Limb Motion Recognition Based on Hybrid Feature Selection: Algorithm Development and Validation
- Author
-
Zhi Chen, Lidian Chen, Juan Li, Qiaoqin Li, Bin Zhu, Liu Lang, Yang Shangming, Liu Yongguo, Guanyi Zhu, Jiajing Zhu, Jing Tao, and Rongjiang Jin
- Subjects
Feature engineering ,Original Paper ,Computer science ,inertial measurement unit ,Bayes Theorem ,Health Informatics ,Feature selection ,rehabilitation exercises ,Filter (signal processing) ,Random forest ,Machine Learning ,Upper Extremity ,Naive Bayes classifier ,feature selection ,Inertial measurement unit ,Classifier (linguistics) ,motion recognition ,Feature (machine learning) ,Cluster Analysis ,Humans ,Algorithm ,Algorithms - Abstract
Background For rehabilitation training systems, it is essential to automatically record and recognize exercises, especially when more than one type of exercise is performed without a predefined sequence. Most motion recognition methods are based on feature engineering and machine learning algorithms. Time-domain and frequency-domain features are extracted from original time series data collected by sensor nodes. For high-dimensional data, feature selection plays an important role in improving the performance of motion recognition. Existing feature selection methods can be categorized into filter and wrapper methods. Wrapper methods usually achieve better performance than filter methods; however, in most cases, they are computationally intensive, and the feature subset obtained is usually optimized only for the specific learning algorithm. Objective This study aimed to provide a feature selection method for motion recognition of upper-limb exercises and improve the recognition performance. Methods Motion data from 5 types of upper-limb exercises performed by 21 participants were collected by a customized inertial measurement unit (IMU) node. A total of 60 time-domain and frequency-domain features were extracted from the original sensor data. A hybrid feature selection method by combining filter and wrapper methods (FESCOM) was proposed to eliminate irrelevant features for motion recognition of upper-limb exercises. In the filter stage, candidate features were first selected from the original feature set according to the significance for motion recognition. In the wrapper stage, k-nearest neighbors (kNN), Naïve Bayes (NB), and random forest (RF) were evaluated as the wrapping components to further refine the features from the candidate feature set. The performance of the proposed FESCOM method was verified using experiments on motion recognition of upper-limb exercises and compared with the traditional wrapper method. Results Using kNN, NB, and RF as the wrapping components, the classification error rates of the proposed FESCOM method were 1.7%, 8.9%, and 7.4%, respectively, and the feature selection time in each iteration was 13 seconds, 71 seconds, and 541 seconds, respectively. Conclusions The experimental results demonstrated that, in the case of 5 motion types performed by 21 healthy participants, the proposed FESCOM method using kNN and NB as the wrapping components achieved better recognition performance than the traditional wrapper method. The FESCOM method dramatically reduces the search time in the feature selection process. The results also demonstrated that the optimal number of features depends on the classifier. This approach serves to improve feature selection and classification algorithm selection for upper-limb motion recognition based on wearable sensor data, which can be extended to motion recognition of more motion types and participants.
- Published
- 2021
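The filter-then-wrapper structure of FESCOM, described in the entry above, can be approximated with off-the-shelf components. The sketch below uses an ANOVA filter and scikit-learn's SequentialFeatureSelector wrapping kNN; the paper's own significance criterion and search procedure may differ.

```python
# Sketch of a filter-then-wrapper pipeline in the spirit of FESCOM.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, f_classif,
                                       SequentialFeatureSelector)
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=60, n_informative=8,
                           random_state=0)   # 60 features, as in the study
filt = SelectKBest(f_classif, k=20).fit(X, y)       # filter stage: shortlist
X_cand = filt.transform(X)
wrap = SequentialFeatureSelector(KNeighborsClassifier(),
                                 n_features_to_select=8)
wrap.fit(X_cand, y)                                 # wrapper stage (kNN)
print("selected candidates:", wrap.get_support().nonzero()[0])
```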
44. A Comparative Study of Deep Learning Models on Tropospheric Ozone Forecasting Using Feature Engineering Approach.
- Author
-
Rezaei, Reza, Naderalvojoud, Behzad, and Güllü, Gülen
- Subjects
ARTIFICIAL neural networks ,DEEP learning ,TROPOSPHERIC ozone ,FORECASTING ,ENGINEERING ,ARCHITECTURAL design ,AIR quality - Abstract
This paper investigates the effect of the architectural design of deep learning models in combination with a feature engineering approach that considers the temporal variation in the features, in the case of tropospheric ozone forecasting. Although deep neural network models have shown successful results by extracting features automatically from raw data, their performance in the domain of air quality forecasting is influenced by different feature analysis approaches and model architectures. This paper proposes a simple but effective analysis of tropospheric ozone time series data that can reveal temporal phases of the ozone evolution process and assist neural network models to reflect these temporal variations. We demonstrate that addressing the ozone evolution phases when developing the model architecture improves the performance of deep neural network models. We evaluated our approach on a CNN model and showed not only that it improves the CNN model's performance, but also that the CNN model in combination with our approach boosts the performance of other deep neural network models such as LSTM. Developing the CNN, LSTM-CNN, and CNN-LSTM models using the proposed approach improved their prediction performance by 3.58%, 1.68%, and 3.37%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
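One simple way to "address the ozone evolution phases," as the entry above puts it, is to encode the phase of day as explicit inputs alongside lagged concentrations. The phase boundaries and toy series below are illustrative assumptions, not the paper's analysis.

```python
# Sketch: diurnal ozone phases encoded as features next to a lag.
import numpy as np
import pandas as pd

idx = pd.date_range("2023-06-01", periods=24 * 14, freq="h")
ozone = pd.Series(40 + 20 * np.sin((idx.hour - 8) / 24 * 2 * np.pi),
                  index=idx)                       # toy diurnal ozone signal

phase = pd.cut(idx.hour, bins=[-1, 6, 11, 17, 23],
               labels=["night", "buildup", "peak", "decline"])
features = pd.get_dummies(pd.DataFrame({"ozone_lag1": ozone.shift(1),
                                        "phase": phase}), columns=["phase"])
print(features.dropna().head())
```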
45. Feature Engineering for Anti-Fraud Models Based on Anomaly Detection.
- Author
-
Przekop, Damian
- Subjects
ENGINEERING models ,STATISTICAL models ,FORECASTING ,ALGORITHMS - Abstract
The paper presents two algorithms as a solution to the problem of identifying the fraud intentions of a customer. Their purpose is to generate variables that improve the predictive power of fraud models. In this article, a novel approach to feature engineering, based on anomaly detection, is presented. As the choice of statistical model improves the predictive capabilities of a solution only to some extent, most of the attention should be paid to the choice of proper predictors. The main finding of the research is that enriching a model with additional predictors leads to further improvement in predictive power and better interpretability of the anti-fraud model. The paper is a contribution to the fraud prediction problem, but the method presented may generate variable input for any tool equipped with a variable-selection algorithm. The cost is the increased complexity of the models obtained. The approach is illustrated on a dataset from one of the European banks. [ABSTRACT FROM AUTHOR]
- Published
- 2020
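The central idea in the entry above, deriving anomaly-score variables and adding them to the fraud model's input, can be demonstrated generically. IsolationForest below is a stand-in detector, not necessarily the algorithm used in the paper, and the data are synthetic.

```python
# Sketch: an unsupervised anomaly score added as a predictor to a fraud model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, weights=[0.97, 0.03], random_state=0)
score = IsolationForest(random_state=0).fit(X).score_samples(X)
X_aug = np.column_stack([X, score])               # anomaly score as new feature

for label, data in [("raw", X), ("raw + anomaly score", X_aug)]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), data, y,
                          scoring="roc_auc", cv=5).mean()
    print(f"{label}: AUC={auc:.3f}")
```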
46. Deep learning for healthcare: review, opportunities and challenges
- Author
-
Riccardo Miotto, Xiaoqian Jiang, Shuang Wang, Fei Wang, and Joel T. Dudley
- Subjects
Diagnostic Imaging ,Paper ,0301 basic medicine ,Feature engineering ,Computer science ,02 engineering and technology ,Data type ,Domain (software engineering) ,03 medical and health sciences ,Deep Learning ,Health care ,0202 electrical engineering, electronic engineering, information engineering ,Data Mining ,Electronic Health Records ,Humans ,Cluster analysis ,Molecular Biology ,Interpretability ,business.industry ,Deep learning ,Computational Biology ,Genomics ,Data science ,Telemedicine ,030104 developmental biology ,Domain knowledge ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Delivery of Health Care ,Information Systems - Abstract
Gaining knowledge and actionable insights from complex, high-dimensional and heterogeneous biomedical data remains a key challenge in transforming health care. Various types of data have been emerging in modern biomedical research, including electronic health records, imaging, -omics, sensor data and text, which are complex, heterogeneous, poorly annotated and generally unstructured. Traditional data mining and statistical learning approaches typically need to first perform feature engineering to obtain effective and more robust features from those data, and then build prediction or clustering models on top of them. Both steps are challenging in scenarios with complicated data and a lack of sufficient domain knowledge. The latest advances in deep learning technologies provide new effective paradigms to obtain end-to-end learning models from complex data. In this article, we review the recent literature on applying deep learning technologies to advance the health care domain. Based on the analyzed work, we suggest that deep learning approaches could be the vehicle for translating big biomedical data into improved human health. However, we also note limitations and the need for improved methods development and applications, especially in terms of ease of understanding for domain experts and citizen scientists. We discuss such challenges and suggest developing holistic and meaningful interpretable architectures to bridge deep learning models and human interpretability.
- Published
- 2017
47. Privacy-Preserving Methods for Feature Engineering Using Blockchain: Review, Evaluation, and Proof of Concept
- Author
-
Michael Jones, Noah Zimmerman, Matthew Johnson, Mark M. Shervey, and Joel T. Dudley
- Subjects
Feature engineering ,blockchain ,Information privacy ,Blockchain ,data collection ,020205 medical informatics ,Smart contract ,Computer science ,Health Informatics ,Cryptography ,02 engineering and technology ,Computer security ,computer.software_genre ,privacy ,Proof of Concept Study ,03 medical and health sciences ,0302 clinical medicine ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,030212 general & internal medicine ,mobile health ,Computer Security ,Original Paper ,cryptography ,business.industry ,trusted execution environment ,Trusted third party ,confidentiality ,feature engineering ,machine learning ,geolocation ,Key (cryptography) ,business ,Raw data ,smart contract ,computer - Abstract
Background The protection of private data is a key responsibility for research studies that collect identifiable information from study participants. Limiting the scope of data collection and preventing secondary use of the data are effective strategies for managing these risks. An ideal framework for data collection would incorporate feature engineering, a process where secondary features are derived from sensitive raw data in a secure environment without a trusted third party. Objective This study aimed to compare current approaches based on how they maintain data privacy and the practicality of their implementations. These approaches include traditional approaches that rely on trusted third parties, and cryptographic, secure hardware, and blockchain-based techniques. Methods A set of properties were defined for evaluating each approach. A qualitative comparison was presented based on these properties. The evaluation of each approach was framed with a use case of sharing geolocation data for biomedical research. Results We found that approaches that rely on a trusted third party for preserving participant privacy do not provide sufficiently strong guarantees that sensitive data will not be exposed in modern data ecosystems. Cryptographic techniques incorporate strong privacy-preserving paradigms but are appropriate only for select use cases or are currently limited because of computational complexity. Blockchain smart contracts alone are insufficient to provide data privacy because transactional data are public. Trusted execution environments (TEEs) may have hardware vulnerabilities and lack visibility into how data are processed. Hybrid approaches combining blockchain and cryptographic techniques or blockchain and TEEs provide promising frameworks for privacy preservation. For reference, we provide a software implementation where users can privately share features of their geolocation data using the hybrid approach combining blockchain with TEEs as a supplement. Conclusions Blockchain technology and smart contracts enable the development of new privacy-preserving feature engineering methods by obviating dependence on trusted parties and providing immutable, auditable data processing workflows. The overlap between blockchain and cryptographic techniques or blockchain and secure hardware technologies are promising fields for addressing important data privacy needs. Hybrid blockchain and TEE frameworks currently provide practical tools for implementing experimental privacy-preserving applications.
- Published
- 2019
48. Generating Medical Assessments Using a Neural Network Model: Algorithm Development and Validation
- Author
-
Hong Yu, Baotian Hu, and Adarsha S. Bajracharya
- Subjects
Feature engineering ,020205 medical informatics ,Computer science ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Inference ,Health Informatics ,02 engineering and technology ,medical assessment generation ,computer.software_genre ,Machine learning ,Clinical decision support system ,Domain (software engineering) ,03 medical and health sciences ,0302 clinical medicine ,Health Information Management ,deep neural network model ,0202 electrical engineering, electronic engineering, information engineering ,030212 general & internal medicine ,Medical diagnosis ,natural language processing ,Baseline (configuration management) ,Original Paper ,Artificial neural network ,business.industry ,artificial intelligence ,Expert system ,Artificial intelligence ,electronic health record note ,business ,computer - Abstract
Background Since its inception, artificial intelligence has aimed to use computers to help make clinical diagnoses. Evidence-based medical reasoning is important for patient care. Inferring clinical diagnoses is a crucial step during the patient encounter. Previous works mainly used expert systems or machine learning–based methods to predict the International Classification of Diseases - Clinical Modification codes based on electronic health records. We report an alternative approach: inference of clinical diagnoses from patients’ reported symptoms and physicians’ clinical observations. Objective We aimed to report a natural language processing system for generating medical assessments based on patient information described in the electronic health record (EHR) notes. Methods We processed EHR notes into the Subjective, Objective, Assessment, and Plan sections. We trained a neural network model for medical assessment generation (N2MAG). Our N2MAG is an innovative deep neural model that uses the Subjective and Objective sections of an EHR note to automatically generate an “expert-like” assessment of the patient. N2MAG can be trained in an end-to-end fashion and does not require feature engineering and external knowledge resources. Results We evaluated N2MAG and the baseline models both quantitatively and qualitatively. Evaluated by both the Recall-Oriented Understudy for Gisting Evaluation metrics and domain experts, our results show that N2MAG outperformed the existing state-of-the-art baseline models. Conclusions N2MAG could generate a medical assessment from the Subject and Objective section descriptions in EHR notes. Future work will assess its potential for providing clinical decision support.
- Published
- 2019
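At its core, the N2MAG task in the entry above is a sequence-to-sequence mapping from the Subjective and Objective sections to assessment text. The toy encoder-decoder below illustrates only that framing; N2MAG's actual architecture is more elaborate, and every size here is an assumption.

```python
# Toy encoder-decoder sketch of the S+O -> Assessment framing (not N2MAG).
import torch
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    def __init__(self, vocab=8000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        _, h = self.enc(self.emb(src))      # summarize S+O section tokens
        y, _ = self.dec(self.emb(tgt), h)   # teacher-forced decoding
        return self.out(y)

model = Seq2SeqSketch()
logits = model(torch.randint(0, 8000, (2, 120)), torch.randint(0, 8000, (2, 30)))
print(logits.shape)  # (2, 30, 8000): next-token scores for the assessment
```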
49. Applying Multivariate Segmentation Methods to Human Activity Recognition From Wearable Sensors’ Data
- Author
-
Eleanne van Vliet, Frank D. Gilliland, Kenan Li, Alex A. T. Bui, Robert Urman, Rima Habre, Anahita Hosseini, Yijun Lin, Yao-Yi Chiang, John Morrison, Christine E. King, Huiyu Deng, José Luis Ambite, Majid Sarrafzadeh, Sandrah P. Eckel, and Dimitris Stripelis
- Subjects
Adult ,Male ,Feature engineering ,Multivariate statistics ,Time Factors ,020205 medical informatics ,Computer science ,Wearable computer ,physical activity ,Health Informatics ,02 engineering and technology ,Information technology ,smartphone ,Machine Learning ,Activity recognition ,Smartwatch ,Wearable Electronic Devices ,03 medical and health sciences ,0302 clinical medicine ,Sliding window protocol ,Accelerometry ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,Human Activities ,Segmentation ,statistical data analysis wearable devices ,030212 general & internal medicine ,Wearable technology ,Original Paper ,business.industry ,Recognition, Psychology ,Pattern recognition ,Middle Aged ,T58.5-58.64 ,3. Good health ,machine learning ,Multivariate Analysis ,Female ,Artificial intelligence ,Public aspects of medicine ,RA1-1270 ,business - Abstract
BackgroundTime-resolved quantification of physical activity can contribute to both personalized medicine and epidemiological research studies, for example, managing and identifying triggers of asthma exacerbations. A growing number of reportedly accurate machine learning algorithms for human activity recognition (HAR) have been developed using data from wearable devices (eg, smartwatch and smartphone). However, many HAR algorithms depend on fixed-size sampling windows that may poorly adapt to real-world conditions in which activity bouts are of unequal duration. A small sliding window can produce noisy predictions under stable conditions, whereas a large sliding window may miss brief bursts of intense activity. ObjectiveWe aimed to create an HAR framework adapted to variable duration activity bouts by (1) detecting the change points of activity bouts in a multivariate time series and (2) predicting activity for each homogeneous window defined by these change points. MethodsWe applied standard fixed-width sliding windows (4-6 different sizes) or greedy Gaussian segmentation (GGS) to identify break points in filtered triaxial accelerometer and gyroscope data. After standard feature engineering, we applied an Xgboost model to predict physical activity within each window and then converted windowed predictions to instantaneous predictions to facilitate comparison across segmentation methods. We applied these methods in 2 datasets: the human activity recognition using smartphones (HARuS) dataset where a total of 30 adults performed activities of approximately equal duration (approximately 20 seconds each) while wearing a waist-worn smartphone, and the Biomedical REAl-Time Health Evaluation for Pediatric Asthma (BREATHE) dataset where a total of 14 children performed 6 activities for approximately 10 min each while wearing a smartwatch. To mimic a real-world scenario, we generated artificial unequal activity bout durations in the BREATHE data by randomly subdividing each activity bout into 10 segments and randomly concatenating the 60 activity bouts. Each dataset was divided into ~90% training and ~10% holdout testing. ResultsIn the HARuS data, GGS produced the least noisy predictions of 6 physical activities and had the second highest accuracy rate of 91.06% (the highest accuracy rate was 91.79% for the sliding window of size 0.8 second). In the BREATHE data, GGS again produced the least noisy predictions and had the highest accuracy rate of 79.4% of predictions for 6 physical activities. ConclusionsIn a scenario with variable duration activity bouts, GGS multivariate segmentation produced smart-sized windows with more stable predictions and a higher accuracy rate than traditional fixed-size sliding window approaches. Overall, accuracy was good in both datasets but, as expected, it was slightly lower in the more real-world study using wrist-worn smartwatches in children (BREATHE) than in the more tightly controlled study using waist-worn smartphones in adults (HARuS). We implemented GGS in an offline setting, but it could be adapted for real-time prediction with streaming data.
- Published
- 2019
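The windowed-prediction stage in the entry above generalizes across segmentation methods: compute summary features per window, then classify each window. The sketch below uses fixed windows and scikit-learn's gradient boosting as a stand-in for the XGBoost package; GGS change points would define the window boundaries instead, and all data here are synthetic.

```python
# Sketch of per-window featurization and classification for HAR.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def window_features(acc, size=50):
    # Mean/std/range per axis over fixed windows of triaxial data.
    wins = acc[: len(acc) // size * size].reshape(-1, size, 3)
    return np.hstack([wins.mean(1), wins.std(1), np.ptp(wins, axis=1)])

rng = np.random.default_rng(0)
acc = rng.normal(size=(5000, 3))                  # stand-in accelerometer data
labels = rng.integers(0, 6, size=len(acc) // 50)  # one activity label/window
clf = GradientBoostingClassifier().fit(window_features(acc), labels)
print(clf.predict(window_features(acc))[:10])
```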
50. Data Augmentation Methods for Special Materials Genome Engineering with Small-Sample Data.
- Author
-
杨涛, 张兆波, 郑添屹, and 彭保
- Abstract
Copyright of Big Data Research (2096-0271) is the property of Beijing Xintong Media Co., Ltd. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF