2,272 results on '"Data Preprocessing"'
Search Results
2. On many-objective feature selection and the need for interpretability
- Author
-
Njoku, Uchechukwu F., Abelló, Alberto, Bilalli, Besim, and Bontempi, Gianluca
- Published
- 2025
- Full Text
- View/download PDF
3. A review of methods and applications in structural health monitoring (SHM) for bridges
- Author
-
Zhang, Bangcheng, Ren, Yuheng, He, Siming, Gao, Zhi, Li, Bo, and Song, Jingyuan
- Published
- 2025
- Full Text
- View/download PDF
4. Real-time monitoring and prediction of glucose content in enzymatic reactions: Application of a deep neural network mode
- Author
-
Guo, FengYing and Zhuang, DianZheng
- Published
- 2025
- Full Text
- View/download PDF
5. A hybrid load forecasting system based on data augmentation and ensemble learning under limited feature availability
- Author
-
Yang, Qing and Tian, Zhirui
- Published
- 2025
- Full Text
- View/download PDF
6. Harnessing artificial intelligence for predictive modelling in oral oncology: Opportunities, challenges, and clinical Perspectives
- Author
-
Veeraraghavan, Vishnu Priya, Daniel, Shikhar, Dasari, Arun Kumar, Aileni, Kaladhar Reddy, Patil, Chaitra, and Patil, Santosh R.
- Published
- 2024
- Full Text
- View/download PDF
7. Advancements in rice disease detection through convolutional neural networks: A comprehensive review
- Author
-
Gülmez, Burak
- Published
- 2024
- Full Text
- View/download PDF
8. Robust resampling and stacked learning models for electricity theft detection in smart grid
- Author
-
Ullah, Ashraf, Khan, Inam Ullah, Younas, Muhammad Zeeshan, Ahmad, Maqbool, and Kryvinska, Natalia
- Published
- 2025
- Full Text
- View/download PDF
9. Analysis of selected algorithms for detecting outliers in data
- Author
-
Brzezińska, Agnieszka Nowak and Jasiak, Dawid
- Published
- 2024
- Full Text
- View/download PDF
10. Advanced Data Analysis for Machine Learning-powered Recommender Systems
- Author
-
Antal, Lidia-Monica and Iantovics, László Barna
- Published
- 2024
- Full Text
- View/download PDF
11. Artificial Neural Networks for Predicting Student Performance: A Case Study on Student Scores Dataset
- Author
-
Habeeb, Rasha Jasim Habeeb, Fadare, Olusolade Aribake, Al-Turjman, Fadi, Shehata, Hany Farouk, Editor-in-Chief, ElZahaby, Khalid M., Advisory Editor, Chen, Dar Hao, Advisory Editor, Amer, Mourad, Series Editor, and Al-Turjman, Fadi, editor
- Published
- 2025
- Full Text
- View/download PDF
12. Dataset Ratio Influence on kNN Classification Results
- Author
-
Gaće, Marin, Galba, Tomislav, Baumgartner, Alfonzo, Livada, Časlav, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Glavaš, Hrvoje, editor, Hadzima-Nyarko, Marijana, editor, Ademović, Naida, editor, and Hanák, Tomáš, editor
- Published
- 2025
- Full Text
- View/download PDF
13. Refining Human-Data Interaction: Advanced Techniques for EEGEyeNet Dataset Precision
- Author
-
Wu, Jade, Dou, Jingwen, Utoft, Sofia, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Kurosu, Masaaki, editor, Hashizume, Ayako, editor, Mori, Hirohiko, editor, Asahi, Yumi, editor, Schmorrow, Dylan D., editor, and Fidopiastis, Cali M., editor
- Published
- 2025
- Full Text
- View/download PDF
14. A Weighted Discrete Wavelet Transform-Based Capsule Network for Malware Classification
- Author
-
Qiao, Tonghua, Cao, Chunjie, Zou, Binghui, Tao, Fangjian, Cheng, Yinan, Zhang, Qi, Sun, Jingzhang, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Antonacopoulos, Apostolos, editor, Chaudhuri, Subhasis, editor, Chellappa, Rama, editor, Liu, Cheng-Lin, editor, Bhattacharya, Saumik, editor, and Pal, Umapada, editor
- Published
- 2025
- Full Text
- View/download PDF
15. Machine Learning in IoT: An In-Depth Dataset Analysis Based on Attack Detection
- Author
-
Tyagi, Kajal, Ahlawat, Anil, Chaudhary, Himanshi, Ghosh, Ashish, Editorial Board Member, Dev, Amita, editor, Sharma, Arun, editor, Agrawal, S. S., editor, and Rani, Ritu, editor
- Published
- 2025
- Full Text
- View/download PDF
16. Resource Forecasting of Geographical Area Using Big Data Analytics
- Author
-
Rejeti, Venkata Kishore Kumar, Chandra, G. Rajesh, Mounika, V., Thanveer, S. K. Thasleema, Jyothika, V., Harika, Y., Anand, D., Chaari, Fakher, Series Editor, Gherardini, Francesco, Series Editor, Ivanov, Vitalii, Series Editor, Haddar, Mohamed, Series Editor, Cavas-Martínez, Francisco, Editorial Board Member, di Mare, Francesca, Editorial Board Member, Kwon, Young W., Editorial Board Member, Tolio, Tullio A. M., Editorial Board Member, Trojanowska, Justyna, Editorial Board Member, Schmitt, Robert, Editorial Board Member, Xu, Jinyang, Editorial Board Member, Deepak, B B V L, editor, Bahubalendruni, M.V.A. Raju, editor, Parhi, D.R.K., editor, and Biswal, B. B., editor
- Published
- 2025
- Full Text
- View/download PDF
17. Optimizing News Categorization with Machine Learning: A Comprehensive Study Using Naive Bayes (MultinomialNB) Classifier
- Author
-
Mansoori, Ahmed, Tahat, Khalaf, Tahat, Dina Naser, Habes, Mohammad, Salloum, Said A., Kacprzyk, Janusz, Series Editor, and Hamdan, Allam, editor
- Published
- 2025
- Full Text
- View/download PDF
18. Brain Tumor Detection and Classification Using Deep Learning Models
- Author
-
Pujar, Manjunath, Kavanashree, H., Jitendra, M., Halemani, Shankaraling, Handur, Vidya, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Shrivastava, Vivek, editor, Bansal, Jagdish Chand, editor, and Panigrahi, B. K., editor
- Published
- 2025
- Full Text
- View/download PDF
19. Unlocking Insights in Healthcare: A Comparative Study of Hyperparameter Tuned Machine Learning Algorithms
- Author
-
Ferdous, Shahriar Faysal, Efat, Anwar Hossain, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Mahmud, Mufti, editor, Kaiser, M. Shamim, editor, Bandyopadhyay, Anirban, editor, Ray, Kanad, editor, and Al Mamun, Shamim, editor
- Published
- 2025
- Full Text
- View/download PDF
20. Data preprocessing methods for selective sweep detection using convolutional neural networks.
- Author
-
Zhao, Hanqing and Alachiotis, Nikolaos
- Subjects
-
*CONVOLUTIONAL neural networks, *CLASSIFICATION algorithms, *POPULATION genetics, *ALGORITHMS, *PIXELS, *BOOSTING algorithms
- Abstract
The identification of positive selection has been framed as a classification task, with Convolutional Neural Networks (CNNs) already outperforming summary statistics and likelihood-based approaches in accuracy. Despite the prevalence of CNN-based methods that manipulate the pixels of images representing raw genomic data as a preprocessing step to improve classification accuracy, the efficacy of these pixel-rearrangement techniques remains inadequately examined, particularly in the presence of confounding factors like population bottlenecks, migration and recombination hotspots. We introduce a set of pixel rearrangement algorithms aimed at enhancing CNN classification accuracy in detecting selective sweeps. These algorithms are employed to assess the performance of four CNN models for selective sweep detection. Our findings illustrate that the judicious application of rearrangement algorithms notably enhances the overall classification accuracy of a CNN across various datasets simulating confounding factors. We observed that sorting the columns of the genomic matrices has a greater effect on CNN performance than rearranging the sequences. To some extent, these rearrangement algorithms are more robust to misspecified demographic models compared with the utilization of the default preprocessing algorithm as suggested by the respective authors of each CNN architecture. We provide the data rearrangement algorithms as a distinct package available for download at: https://github.com/Zhaohq96/Genetic-data-rearrangement. • Data rearrangement algorithms can boost the overall classification accuracy of CNNs in identifying selective sweeps. • To some extent, data rearrangement algorithms improve classification robustness to demographic model misspecification. • Suitable rearrangement algorithms per CNN are robust to varying genomic window sizes. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
21. Enhancing movie recommendations using quantum support vector machine (QSVM).
- Author
-
Shahid, Maida, Hassan, Muhammad Awais, Iqbal, Faiza, Altaf, Ayesha, Shah, Sayyed Wajihul Husnain, Elizaincin, Ana Visiers, and Ashraf, Imran
- Abstract
The rising demand for high-quality movie recommendations in streaming services necessitates more efficient algorithms capable of handling large datasets. Traditional recommendation systems often struggle with long training times and high computational costs. This study introduces a novel movie recommendation system utilizing a quantum support vector machine (QSVM) to overcome these limitations. By leveraging quantum algorithms, QSVM enhances both the speed and accuracy of recommendations. Our approach involves collecting and preprocessing data, implementing classical SVM for baseline comparison, encoding data for QSVM, and executing QSVM using a publicly accessible IBM quantum computer. The results demonstrate that QSVM outperforms classical SVM, achieving a 96% accuracy and an F1 score of 0.9693, compared to the classical SVM’s 95.33% accuracy and 0.9641 F1 score. This signifies QSVM’s superior capability in handling complex datasets. Our findings highlight the potential of QSVM in movie recommendation systems, suggesting future research directions in quantum machine learning and its applications. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
22. Evaluation of machine learning algorithms in tunnel boring machine applications: a case study in Mashhad metro line 3.
- Author
-
Abbasi, Morteza, Namadchi, Amir Hossein, Abbasi, Mehdi, and Abbasi, Mohsen
- Subjects
-
MACHINE learning, FEATURE selection, ARTIFICIAL intelligence, EARTH pressure, DECISION trees
- Abstract
Accurately predicting the performance of Earth Pressure Balance Tunnel Boring Machines (EPB-TBMs) in soft ground conditions is crucial yet challenging due to the complex interaction of geological and operational factors. This study investigates Mashhad Metro Line 3, where a TBM was employed to excavate a 1831-m section through variable soil compositions, including significant cobble and boulder content, presenting unique challenges to performance prediction. To address these complexities, several machine learning models, namely Multiple Linear Regression (MLR), Decision Trees (DT), and Multi-Layer Perceptron (MLP) neural networks, were applied to predict TBM penetration rates and assess model efficacy. Beginning with a dataset of 438,960 rows, rigorous feature selection and data processing yielded a final dataset of 1833 rows. Among the models, MLR achieved an R² score of 0.991, closely matching the more complex MLP model, which reached an R² score of 0.988. In contrast, the Decision Tree model demonstrated a lower R² score of 0.923, suggesting a tendency to overfit. While MLR provided an effective, straightforward approach, MLP proved valuable for capturing non-linear patterns that could improve predictive accuracy in more variable tunneling conditions. These findings underscore the practical applications of both simple and complex machine learning models in enhancing TBM performance prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Performance Evaluation of Hybrid PSO-BPNN-AdaBoost and PSO-BPNN-XGBoost Models for Rockburst Prediction with Imbalanced Datasets.
- Author
-
Li, Shujian, Lu, Pengpeng, Liang, Weizhang, Chen, Ying, and Da, Qi
- Subjects
-
MACHINE learning, ELASTIC deformation, COMPRESSIVE strength, TENSILE strength, PREDICTION models
- Abstract
The rockburst hazard is a primary geological disaster endangering the environment in underground engineering. Due to the complexity of the rockburst mechanism, traditional methods are insufficient to predict the rockburst hazard objectively, especially when dealing with an imbalanced dataset. To address this issue, the hybrid models of PSO-BPNN-AdaBoost and PSO-BPNN-XGBoost were developed to predict rockburst hazards in this study. First, a rockburst dataset with 266 cases was constructed, containing six indicators: the maximum tangential stress, uniaxial compressive strength, uniaxial tensile strength, elastic deformation energy index, tangential stress index, and brittleness coefficient of strength. Then, the original dataset was oversampled using the synthetic minority oversampling technique (SMOTE) for dataset balancing. Subsequently, the PSO-BPNN-AdaBoost and PSO-BPNN-XGBoost models were constructed and evaluated, achieving best accuracies of 0.901 and 0.851, respectively. Finally, the developed models were applied to predict the rockburst hazard in the Daxiangling Tunnel, the Cangling Tunnel, and the Zhongnanshan Tunnel shaft. The results indicate that the obtained rockburst hazard levels are consistent with engineering records, and the developed PSO-BPNN-AdaBoost and PSO-BPNN-XGBoost models are reliable for rockburst prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
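Note on record 23: the abstract names SMOTE for balancing the dataset but does not describe the implementation. The sketch below is a minimal, standard-library illustration of the general SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours). The function name `smote`, its parameters, and the 2-D toy points are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Illustrative sketch only: `minority` is a list of equal-length numeric tuples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the base itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two features each.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, n_new=5, k=2, seed=1)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new samples stay inside the region already occupied by the minority class.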
24. Normalization Strategies for Lipidome Data in Cell Line Panels.
- Author
-
Leegwater, Hanneke, Zhang, Zhengzheng, Zhang, Xiaobing, Hankemeier, Thomas, Harms, Amy C., Zweemer, Annelien J. M., Le Dévédec, Sylvia E., and Kindt, Alida
- Subjects
-
*PHENOMENOLOGICAL biology, *CELL lines, *LIPIDOMICS, *BIOMATERIALS, *CELL morphology, *INTRACLASS correlation
- Abstract
Sample collection can significantly affect lipid concentration measurements in cell line panels, concealing intrinsic differences between cancer subtypes. Most quality control steps in lipidomic data analysis focus on controlling technical variation. Correcting for the total amount of biological material remains an additional challenge for cell line panels. Here, we investigated how we can normalize lipidomic data acquired from multiple cell lines to correct for differences in sample biomass. We studied how commonly used data normalization and transformation strategies influence the resulting lipid data distributions. We compared normalization by biological properties, including cell count and total protein concentration, to statistical and data-based approaches, such as median, mean, or probabilistic quotient-based normalization. We used intraclass correlations to estimate how normalization influenced the similarity between replicates. Normalizing lipidomic data by cell count improved the similarity between replicates but only for cell lines with similar morphologies. When comparing cell line panels with diverse morphologies, neither cell count nor protein concentration was sufficient to increase the similarity of lipid abundances between cell line replicates. Data-based normalizations increased these similarities but resulted in a bias towards the large and variable lipid class of triglycerides. These artifacts are reduced by normalizing for the abundance of only structural lipids. We conclude that there is a delicate balance between improving the similarity between replicates and avoiding artifacts in lipidomic data and emphasize the importance of an appropriate normalization strategy in studying biological phenomena using lipidomics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
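Note on record 24: the abstract compares median and probabilistic quotient normalization, among others, without giving formulas. The sketch below shows one common textbook form of each, assuming strictly positive intensities. The function names and the toy matrix are hypothetical, and the PQN variant here omits the initial total-intensity scaling step that is often applied first; it is not the authors' pipeline.

```python
from statistics import median


def median_normalize(sample):
    """Divide every intensity in one sample by that sample's median."""
    m = median(sample)
    return [x / m for x in sample]


def pqn_normalize(samples):
    """Probabilistic quotient normalization against the feature-wise median sample.

    `samples` is a list of rows (one row per sample, one column per lipid).
    Assumes all values are positive so that no division by zero occurs.
    """
    # Reference profile: median of each lipid across all samples.
    reference = [median(column) for column in zip(*samples)]
    normalized = []
    for sample in samples:
        quotients = [x / r for x, r in zip(sample, reference)]
        factor = median(quotients)  # most probable dilution factor for this sample
        normalized.append([x / factor for x in sample])
    return normalized


# Hypothetical intensities: three replicates that differ only by total biomass.
replicates = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
    [3.0, 6.0, 12.0],
]
pqn_result = pqn_normalize(replicates)
```

In this toy case the replicates differ only by a constant factor, so PQN maps all three rows to the same profile; real lipidomic data would not collapse this cleanly.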
25. Predicting Peritoneal Dialysis Failure Within the Next Three Months Based on Deep Learning and Important Features Analysis.
- Author
-
Hsu, Fang-Yu, Hwang, Ren-Hung, Tsai, Ming-Hsien, and Wang, Jing-Tong
- Subjects
-
*PERITONEAL dialysis, *MEDICAL protocols, *EQUILIBRIUM testing, *MACHINE learning, *TIME series analysis, *DEEP learning
- Abstract
This study aims to develop a deep learning model to predict peritoneal dialysis (PD) failure within the next three months using data from the preceding three months. Background: PD patients typically perform treatments at home and visit the clinic only once per month, leading to significant gaps in clinical care and increased risks of PD failure, which may necessitate a transition to hemodialysis (HD). Current studies on PD patients largely focus on predicting PD failure, mortality risk, and hospitalization through traditional machine learning methods, with limited application of deep learning for this purpose. Methods: We collected comprehensive patient data, including demographic information, comorbidities, medication history, biochemical test results, dialysis prescriptions, and peritoneal equilibrium test outcomes. After preprocessing, we employed time-series deep learning models to train and make predictions. Results: The model achieved a prediction accuracy of 89% and an AUROC of 92%, outperforming previous methods used for PD failure prediction. Conclusion: This approach not only improves prediction accuracy but also identifies key features that can aid clinicians in developing more precise treatment plans and enhancing patient care. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. Machine and Deep Learning Models for Hypoxemia Severity Triage in CBRNE Emergencies.
- Author
-
Nanini, Santino, Abid, Mariem, Mamouni, Yassir, Wiedemann, Arnaud, Jouvet, Philippe, and Bourassa, Stephane
- Subjects
-
*MACHINE learning, *EARLY warning score, *ALARM fatigue, *ARTIFICIAL intelligence, *DEEP learning
- Abstract
Background/Objectives: This study develops machine learning (ML) models to predict hypoxemia severity during emergency triage, particularly in Chemical, Biological, Radiological, Nuclear, and Explosive (CBRNE) scenarios, using physiological data from medical-grade sensors. Methods: Tree-based models (TBMs) such as XGBoost, LightGBM, CatBoost, Random Forests (RFs), Voting Classifier ensembles, and sequential models (LSTM, GRU) were trained on the MIMIC-III and IV datasets. A preprocessing pipeline addressed missing data, class imbalances, and synthetic data flagged with masks. Models were evaluated using a 5-min prediction window with minute-level interpolations for timely interventions. Results: TBMs outperformed sequential models in speed, interpretability, and reliability, making them better suited for real-time decision-making. Feature importance analysis identified six key physiological variables from the enhanced NEWS2+ score and emphasized the value of mask and score features for transparency. Voting Classifier ensembles showed slight metric gains but did not outperform individually optimized models, facing a precision-sensitivity tradeoff and slightly lower F1-scores for key severity levels. Conclusions: TBMs were effective for real-time hypoxemia prediction, while sequential models, though better at temporal handling, were computationally costly. This study highlights ML's potential to improve triage systems and reduce alarm fatigue, with future plans to incorporate multi-hospital datasets for broader applicability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. Analysis of power in preprocessing methodologies for datasets with missing values.
- Author
-
Carvalho, Iago A. and Moreira, Arthur F.
- Subjects
-
*DECISION making, *MISSING data (Statistics), *INFERENTIAL statistics, *STATISTICS, *ALGORITHMS
- Abstract
The empirical evaluation of algorithms usually produces a large set of data that needs to be assessed through an appropriate statistical methodology. Sometimes, the generated dataset has missing entries due to the inability of an algorithm to compute a solution for a given benchmark. These missing entries largely restrict the use of statistical tests in such a way that classic parametric or non-parametric tests cannot correctly evaluate such datasets. There are some preprocessing methods in the literature to deal with this problem. In this paper, we evaluate four of these methods: the Bi-objective Lexicographical Ranking Scheme, PAR10 scores, the Skillings–Mack test, and the Wittkowski test. We measure the power of Friedman's test when each one of them is used. Our results indicate that the Bi-objective Lexicographical Ranking Scheme or the PAR10 scores should be used when the number of missing entries is small or unknown in advance, while the Skillings–Mack test is recommended when more than 30% of the entries are missing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
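Note on record 27: the abstract lists PAR10 scores as one way to fill missing entries but does not define them. PAR10 is usually understood as a penalized average runtime in which every unsolved run counts as ten times the time limit. The sketch below follows that common definition; the function name, the solver names, and the runtimes are hypothetical and do not come from the paper.

```python
def par10(runtimes, cutoff):
    """Mean runtime where each unsolved run counts as 10 * cutoff.

    A run is treated as unsolved if it is None (missing entry) or exceeds cutoff.
    """
    if not runtimes:
        raise ValueError("runtimes must not be empty")
    penalty = 10 * cutoff
    penalized = [penalty if t is None or t > cutoff else t for t in runtimes]
    return sum(penalized) / len(penalized)


# Hypothetical runtimes in seconds per benchmark instance; None = no solution found.
results = {
    "solver_a": [12.0, 30.0, None, 45.0],
    "solver_b": [20.0, 25.0, 50.0, 55.0],
}
scores = {name: par10(times, cutoff=60.0) for name, times in results.items()}
# scores == {"solver_a": 171.75, "solver_b": 37.5}
```

Replacing the missing entry with a fixed penalty yields a complete table, which is what allows a rank-based test such as Friedman's test to be applied afterwards.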
28. Robust Network Security: A Deep Learning Approach to Intrusion Detection in IoT.
- Author
-
Odeh, Ammar and Taleb, Anas Abu
- Subjects
-
CONVOLUTIONAL neural networks, LONG short-term memory, COMPUTER network traffic, ENGINEERS, FEATURE selection, DEEP learning, INTRUSION detection systems (Computer security)
- Abstract
The proliferation of Internet of Things (IoT) technology has exponentially increased the number of devices interconnected over networks, thereby escalating the potential vectors for cybersecurity threats. In response, this study rigorously applies and evaluates deep learning models—namely Convolutional Neural Networks (CNN), Autoencoders, and Long Short-Term Memory (LSTM) networks—to engineer an advanced Intrusion Detection System (IDS) specifically designed for IoT environments. Utilizing the comprehensive UNSW-NB15 dataset, which encompasses 49 distinct features representing varied network traffic characteristics, our methodology focused on meticulous data preprocessing including cleaning, normalization, and strategic feature selection to enhance model performance. A robust comparative analysis highlights the CNN model's outstanding performance, achieving an accuracy of 99.89%, precision of 99.90%, recall of 99.88%, and an F1 score of 99.89% in binary classification tasks, outperforming other evaluated models significantly. These results not only confirm the superior detection capabilities of CNNs in distinguishing between benign and malicious network activities but also illustrate the model's effectiveness in multiclass classification tasks, addressing various attack vectors prevalent in IoT setups. The empirical findings from this research demonstrate deep learning's transformative potential in fortifying network security infrastructures against sophisticated cyber threats, providing a scalable, high-performance solution that enhances security measures across increasingly complex IoT ecosystems. This study's outcomes are critical for security practitioners and researchers focusing on the next generation of cyber defense mechanisms, offering a data-driven foundation for future advancements in IoT security strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Research on Gas Emission Prediction Based on KPCA-ICSA-SVR.
- Author
-
Liu, Li, Dai, Linchao, Mao, Xinyi, Chen, Yutao, and Jing, Yongheng
- Subjects
-
OPTIMIZATION algorithms, PRINCIPAL components analysis, SEARCH algorithms, PREDICTION models, MINE safety
- Abstract
In the context of deep mining, the uncertainty of gas emission levels presents significant safety challenges for mines. This study proposes a gas emission prediction model based on Kernel Principal Component Analysis (KPCA), an Improved Crow Search Algorithm (ICSA) incorporating adaptive neighborhood search, and Support Vector Regression (SVR). Initially, data preprocessing is conducted to ensure a clean and complete dataset. Subsequently, KPCA is applied to reduce dimensionality by extracting key nonlinear features from the gas emission influencing factors, thereby enhancing computational efficiency. The ICSA is then employed to optimize SVR hyperparameters, improving the model's optimization capabilities and generalization performance, leading to the development of a robust KPCA-ICSA-SVR prediction model. The results indicate that the KPCA-ICSA-SVR model achieves the best performance, with RMSE values of 0.17898 and 0.3071 for the training and testing sets, respectively, demonstrating superior robustness and generalization capability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. On the Utilization of Emoji Encoding and Data Preprocessing with a Combined CNN-LSTM Framework for Arabic Sentiment Analysis.
- Author
-
Alawneh, Hussam, Hasasneh, Ahmad, and Maree, Mohammed
- Subjects
-
SENTIMENT analysis, MACHINE learning, WEBSITES, ENCODING, EMOTIONS, EMOTICONS & emojis, MICROBLOGS
- Abstract
Social media users often express their emotions through text in posts and tweets, and these can be used for sentiment analysis, identifying text as positive or negative. Sentiment analysis is critical for different fields such as politics, tourism, e-commerce, education, and health. However, sentiment analysis approaches that perform well on English text encounter challenges with Arabic text due to its morphological complexity. Effective data preprocessing and machine learning techniques are essential to overcome these challenges and provide insightful sentiment predictions for Arabic text. This paper evaluates a combined CNN-LSTM framework with emoji encoding for Arabic Sentiment Analysis, using the Arabic Sentiment Twitter Corpus (ASTC) dataset. Three experiments were conducted with eight-parameter fusion approaches to evaluate the effect of data preprocessing, namely the effect of emoji encoding on their real and emotional meaning. Emoji meanings were collected from four websites specialized in finding the meaning of emojis in social media. Furthermore, the Keras tuner optimized the CNN-LSTM parameters during the 5-fold cross-validation process. The highest accuracy rate (91.85%) was achieved by keeping non-Arabic words and removing punctuation, using the Snowball stemmer after encoding emojis into Arabic text, and applying Keras embedding. This approach is competitive with other state-of-the-art approaches, showing that emoji encoding enriches text by accurately reflecting emotions, and enabling investigation of the effect of data preprocessing, allowing the hybrid model to achieve comparable results to the study using the same ASTC dataset, thereby improving sentiment analysis accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
31. Research on ECG Signal Classification Based on Hybrid Residual Network.
- Author
-
Qi, Tianyu, Zhang, He, Zhao, Huijun, Shen, Chong, and Liu, Xiaochen
- Subjects
-
BUTTERWORTH filters (Signal processing), DISCRETE wavelet transforms, CONVOLUTIONAL neural networks, SIGNAL classification, DEEP learning, ARRHYTHMIA
- Abstract
Arrhythmia detection in electrocardiogram (ECG) signals is essential for monitoring cardiovascular health. Current automated arrhythmia classification methods frequently encounter difficulties in detecting multiple cardiac abnormalities, particularly when dealing with imbalanced datasets. This paper proposes a novel deep learning approach for the detection and classification of arrhythmias in ECG signals using a Hybrid Residual Network (Hybrid ResNet). Our method employs a Hybrid Residual Network architecture that integrates standard convolution, depthwise separable convolution, and residual connections to enhance the feature extraction efficiency and classification accuracy. To guarantee superior input signals, we preprocess the ECG signals by removing baseline drift with a high-pass Butterworth filter, denoising via discrete wavelet transform, and segmenting heartbeat cycles through R-peak detection. Additionally, we rectify the class imbalance in the MIT-BIH Arrhythmia Database by applying the Synthetic Minority Oversampling Technique (SMOTE), therefore enhancing the model's ability to detect infrequent arrhythmia types. The suggested system achieves a classification accuracy of 99.09% on the MIT-BIH dataset, surpassing conventional convolutional neural networks and other state-of-the-art methodologies. Compared to existing approaches, our strategy exhibits superior effectiveness and robustness in managing diverse irregular heartbeats and arrhythmias. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Influence of Modal Decomposition Algorithms on Nonlinear Time Series Machine Learning Prediction Models in Engineering: A Case Study of Subway Tunnel Settlement.
- Author
-
Shen, Qingmeng, Wu, Yuming, Wan, Limin, Chen, Qian, Li, Yue, Liao, Zichao, Wang, Wenbo, Li, Feng, Li, Tao, and Shu, Jiajun
- Subjects
-
MACHINE learning, SUBWAY tunnels, ENGINEERING models, PREDICTION models, DECOMPOSITION method
- Abstract
Featured Application: This study provides a method that can quickly and accurately predict subway tunnel settlement, which can be effectively applied to prevent and control the safety of subway projects. The settlement values of subway tunnels during the construction period exhibit significant nonlinear and spatial–temporal variation characteristics. To overcome the problems of historical data interference and spatiotemporal characteristics in tunnel settlement prediction models, this paper proposes a tunnel settlement prediction method based on data decomposition, reconstruction, and optimization. First, the original data are optimized via the SSA, which has global optimization capability, high noise immunity, and high adaptivity. The original signal is subsequently decomposed into multiple subsignal sequences via a variational modal decomposition (VMD) algorithm combined with a rolling decomposition technique. Finally, the decomposed signals are fed into the machine learning model to construct a high-precision settlement prediction model based on rolling decomposition. The prediction accuracy of different models was analyzed via the measured settlement data during the construction period of the Beijing subway as an example. The results show that the prediction model with the integrated decomposition algorithm reduces the RMSE and MAE by 33% and 37%, respectively, which significantly improves the prediction accuracy and generalization ability of the neural network to meet the demand of practical engineering prediction and simultaneously enhances the risk warning ability of the model. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. Scalable Transformer Accelerator with Variable Systolic Array for Multiple Models in Voice Assistant Applications.
- Author
-
Chang, Seok-Woo and Kim, Dong-Sun
- Subjects
LANGUAGE models ,NATURAL language processing ,GENERATIVE artificial intelligence ,TRANSFORMER models ,TEXT summarization ,DEEP learning - Abstract
The Transformer model is a type of deep learning model that has quickly become fundamental in natural language processing (NLP) and other machine learning tasks. Transformer hardware accelerators are usually designed for specific models, such as Bidirectional Encoder Representations from Transformers (BERT), and vision Transformer models, like the ViT. In this study, we propose a Scalable Transformer Accelerator Unit (STAU) for multiple models, enabling efficient handling of various Transformer models used in voice assistant applications. A centralized Variable Systolic Array (VSA) design, along with control and data preprocessing in embedded processors, enables matrix operations of varying sizes. In addition, we propose an efficient variable structure and a row-wise data input method for natural language processing where the word count changes. The proposed scalable Transformer accelerator accelerates text summarization, audio processing, image search, and generative AI used in voice assistance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
34. Contribution rates of data asset valuation indicators based on the DP-FS-BP prediction framework and the SHAP algorithm.
- Author
-
周翠平, 李少波, 张仪宗, 袁攀亮, 廖子豪, and 张星星
- Abstract
Data asset valuation is of strategic significance to the development of data elementalization. To clarify the contribution rates of data asset valuation indicators and to balance the accuracy and interpretability of machine learning models, a data preprocessing-feature selection-back propagation neural network (DP-FS-BP) prediction framework was proposed, and the Shapley Additive exPlanations (SHAP) algorithm was used to explain the indicator contributions of the prediction model. Taking the transaction block data collected by Youe data network as an example, data preprocessing and feature selection were used to clean the data and select indicators, and then the values of R², root mean squared error (RMSE), and mean absolute error (MAE) were compared with the original data on linear regression, support vector machine (SVM), decision tree, k-nearest neighbors (KNN), random forest, XGBoost, and DP-FS-BP models. The results show that the DP-FS-BP model obtains the most ideal prediction results and has a significant advantage over the other models in prediction accuracy. The results of explaining the BP neural network model using the SHAP algorithm show that the average absolute SHAP values for scientific research techniques and data sample sizes are 209.25 and 191.24, respectively, ranking first and second. By visualizing the contribution rate of features to the output, a decision-making basis is provided for establishing a corresponding data asset value evaluation index system. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. Automatic modulation recognition algorithm based on phase transformation and CNN-BiLSTM.
- Author
-
胡国乐, 李 鹏, 林事力, and 纵 彪
- Subjects
SIGNAL classification ,PHASE modulation ,TRANSFORMER models ,WIRELESS communications ,PROBLEM solving - Abstract
Copyright of Telecommunication Engineering is the property of Telecommunication Engineering and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
36. Reliable Autism Spectrum Disorder Diagnosis for Pediatrics Using Machine Learning and Explainable AI.
- Author
-
Jeon, Insu, Kim, Minjoong, So, Dayeong, Kim, Eun Young, Nam, Yunyoung, Kim, Seungsoo, Shim, Sehoon, Kim, Joungmin, and Moon, Jihoon
- Subjects
*MACHINE learning , *AUTISM spectrum disorders , *MEDICAL personnel , *ARTIFICIAL intelligence , *MISSING data (Statistics) - Abstract
Background: As the demand for early and accurate diagnosis of autism spectrum disorder (ASD) increases, the integration of machine learning (ML) and explainable artificial intelligence (XAI) is emerging as a critical advancement that promises to revolutionize intervention strategies by improving both accuracy and transparency. Methods: This paper presents a method that combines XAI techniques with a rigorous data-preprocessing pipeline to improve the accuracy and interpretability of ML-based diagnostic tools. Our preprocessing pipeline included outlier removal, missing data handling, and selecting pertinent features based on clinical expert advice. Using R and the caret package (version 6.0.94), we developed and compared several ML algorithms, validated using 10-fold cross-validation and optimized by grid search hyperparameter tuning. XAI techniques were employed to improve model transparency, offering insights into how features contribute to predictions, thereby enhancing clinician trust. Results: Rigorous data-preprocessing improved the models' generalizability and real-world applicability across diverse clinical datasets, ensuring a robust performance. Neural networks and extreme gradient boosting models achieved the best performance in terms of accuracy, precision, and recall. XAI techniques demonstrated that behavioral features significantly influenced model predictions, leading to greater interpretability. Conclusions: This study successfully developed highly precise and interpretable ML models for ASD diagnosis, connecting advanced ML methods with practical clinical application and supporting the adoption of AI-driven diagnostic tools by healthcare professionals. This study's findings contribute to personalized intervention strategies and early diagnostic practices, ultimately improving outcomes and quality of life for individuals with ASD. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. Enhanced Plant Leaf Classification over a Large Number of Classes Using Machine Learning.
- Author
-
Elbasi, Ersin, Topcu, Ahmet E., Cina, Elda, Zreikat, Aymen I., Shdefat, Ahmed, Zaki, Chamseddine, and Abdelbaki, Wiem
- Subjects
PLANT classification ,FEATURE selection ,FEATURE extraction ,AGRICULTURE ,PLANT identification - Abstract
In botany and agriculture, classifying leaves is a crucial process that yields vital information for studies on biodiversity, ecological studies, and the identification of plant species. The Cope Leaf Dataset offers a comprehensive collection of leaf images from various plant species, enabling the development and evaluation of advanced classification algorithms. This study presents a robust methodology for classifying leaf images within the Cope Leaf Dataset by enhancing the feature extraction and selection process. The Cope Leaf Dataset has 99 classes and 64 features with 1584 records. Features are extracted based on the margin, texture, and shape of the leaves. It is challenging to classify a large number of labels because of class imbalance, feature complexity, overfitting, and label noise. Our approach combines advanced feature selection techniques with robust preprocessing methods, including normalization, imputation, and noise reduction. By systematically integrating these techniques, we aim to reduce dimensionality, eliminate irrelevant or redundant features, and improve data quality. Increasing accuracy in classification, especially when dealing with large datasets and many classes, involves a combination of data preprocessing, model selection, regularization techniques, and fine-tuning. The results indicate accuracies of 89.48% for the Multilayer Perceptron algorithm, 89.63% for the Naïve Bayes Classifier, 88.72% for Convolutional Neural Networks, and 89.92% for the Hoeffding Tree algorithm on the 99-class plant leaf classification problem. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. The Integration of Internet of Things and Machine Learning for Energy Prediction of Wind Turbines.
- Author
-
Emexidis, Christos and Gkonis, Panagiotis
- Subjects
WIND turbine efficiency ,STANDARD deviations ,ENERGY industries ,AKAIKE information criterion ,WIND power - Abstract
Wind power has emerged as a crucial substitute for conventional fossil fuels. The combination of advanced technologies such as the internet of things (IoT) and machine learning (ML) has given rise to a new generation of energy systems that are intelligent, reliable, and efficient. The wind energy sector utilizes IoT devices to gather vital data, subsequently converting them into practical insights. The aforementioned information aids among others in the enhancement of wind turbine efficiency, precise anticipation of energy production, optimization of maintenance approaches, and detection of potential risks. In this context, the main goal of this work is to combine the IoT with ML in the wind energy sector by processing weather data acquired from sensors to predict wind power generation. To this end, three different regression models are evaluated. The models under comparison include Linear Regression, Random Forest, and Lasso Regression, which were evaluated using metrics such as coefficient of determination (R²), adjusted R², mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). Moreover, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were taken into consideration as well. After examining a dataset from IoT devices that included weather data, the models provided substantial insights regarding their capabilities and responses to preprocessing, as well as each model's reaction in terms of statistical performance deviation indicators. Ultimately, the data analysis and the results from metrics and criteria show that Random Forest regression is more suitable for weather condition datasets than the other two regression models. Both the advantages and shortcomings of the three regression models indicate that their integration with IoT devices will facilitate successful energy prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. A method for long car-following pair extraction and comprehensive data quality assessment: a case study using Zen Traffic Data.
- Author
-
Li, Ruijie, Zheng, Zuduo, Ni, Daiheng, and Li, Linbo
- Subjects
*DATA extraction , *DATA quality , *ZEN Buddhism , *MOTOR vehicle driving , *OVERTAKING - Abstract
This paper introduces a car-following (CF) extraction algorithm to address challenges in aerial-based trajectory data extraction. The algorithm, comprising four steps – vehicle grouping, elimination of false overtaking behavior, vehicle sorting, and CF pair matching – was applied to Zen Traffic Data, extracting 246 CF pairs. Three datasets were then generated: kilopost-based, geography-based, and velocity-based. A quality analysis revealed significant inconsistencies between data fields, with the geography-based dataset being least affected by high-frequency noise. The extracted CF data also demonstrated a more comprehensive driving regime than NGSIM, with complete driving regimes identified. Furthermore, the impact of data noise on CF model calibration and heterogeneity analysis was thoroughly assessed. This study enhances our understanding of trajectory data quality and highlights the richness of driving behavior information in Zen Traffic Data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Research on Fine-Tuning Optimization Strategies for Large Language Models in Tabular Data Processing.
- Author
-
Zhao, Xiaoyong, Leng, Xingxin, Wang, Lei, and Wang, Ningning
- Subjects
*LANGUAGE models , *NATURAL language processing , *DATA structures , *ELECTRONIC data processing , *LANGUAGE acquisition - Abstract
Recent advancements in natural language processing (NLP) have been significantly driven by the development of large language models (LLMs). Despite their impressive performance across various language tasks, these models still encounter challenges when processing tabular data. This study investigates the optimization of fine-tuning strategies for LLMs specifically in the context of tabular data processing. The focus is on the effects of decimal truncation, multi-dataset mixing, and the ordering of JSON key–value pairs on model performance. Experimental results indicate that decimal truncation reduces data noise, thereby enhancing the model's learning efficiency. Additionally, multi-dataset mixing improves the model's generalization and stability, while the random shuffling of key–value pair orders increases the model's adaptability to changes in data structure. These findings underscore the significant impact of these strategies on model performance and robustness. The research provides novel insights into improving the practical effectiveness of LLMs and offers effective data processing methods for researchers in related fields. By thoroughly analyzing these strategies, this study aims to establish theoretical foundations and practical guidance for the future optimization of LLMs across a broader range of application scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
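The abstract of record 40 names decimal truncation and random shuffling of JSON key–value pair order as data-preparation strategies for fine-tuning. The sketch below is only an illustration of what such a step could look like; the function names, the two-decimal default, and the sample row are assumptions and are not taken from the paper.

```python
import json
import math
import random


def truncate_decimals(value, places=2):
    """Truncate (not round) a float to a fixed number of decimal places."""
    if isinstance(value, float):
        factor = 10 ** places
        return math.trunc(value * factor) / factor
    return value  # non-float values are passed through unchanged


def serialize_row(row, places=2, rng=None):
    """Serialize one table row as JSON with truncated floats and shuffled keys."""
    rng = rng or random.Random()
    items = [(key, truncate_decimals(val, places)) for key, val in row.items()]
    rng.shuffle(items)  # randomize key order so a model cannot rely on position
    return json.dumps(dict(items))


# Hypothetical example row; the column names are illustrative only.
row = {"age": 41, "income": 52341.98765, "score": 0.123456}
text = serialize_row(row, places=2, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the shuffled order reproducible between runs, which can help when comparing training configurations.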
41. Optimal hyperparameter combination for improving the table data recognition rate of large language model-based retrieval-augmented generation systems.
- Author
-
정민수 and 이정훈
- Subjects
LANGUAGE models ,NATURAL languages ,CORPORA ,QUESTION answering systems - Abstract
Large Language Models are highly proficient at handling unstructured data, like natural language, but their performance significantly declines when processing structured data, such as tables or other similar formats. To address this limitation, this study proposes an optimal combination of hyperparameters aimed at improving the recognition of table data in a retrieval-augmented question-answering system. Preprocessing techniques are applied to ensure the effective handling of table data, and the experiments conducted use corpora based on preprocessed tables. The main focus was on discovering the best-performing hyperparameter combination by adjusting chunk sizes and varying overlap settings. The experimental results revealed that the optimal hyperparameters differed based on the specific language model being used. Although chunk size had little effect on overall response quality, introducing overlap consistently led to notable performance improvements. Future research will extend these findings by conducting further experiments with structured data across various domains. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
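Record 41 tunes two hyperparameters, chunk size and overlap, when splitting a corpus for retrieval. As a minimal sketch of what those two settings control, the hypothetical helper below splits a string into character-based chunks; the study's actual splitter and its units (characters or tokens) are not stated in the abstract, so both are assumptions here.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


with_overlap = chunk_text("abcdefghij", chunk_size=4, overlap=2)
without_overlap = chunk_text("abcdefghij", chunk_size=4, overlap=0)
```

With overlap, text that straddles a chunk boundary appears whole in at least one chunk, which is one plausible reason overlap could help on table rows that would otherwise be cut in two.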
42. ProteinFlow: An advanced framework for feature engineering in protein data analysis.
- Author
-
Mi, Yanlin, Marcu, Stefan‐Bogdan, Yallapragada, Venkata V. B., and Tabirca, Sabin
- Abstract
In the burgeoning field of proteins, the effective analysis of intricate protein data remains a formidable challenge, necessitating advanced computational tools for data processing, feature extraction, and interpretation. This study introduces ProteinFlow, an innovative framework designed to revolutionize feature engineering in protein data analysis. ProteinFlow stands out by offering enhanced efficiency in data collection and preprocessing, along with advanced capabilities in feature extraction, directly addressing the complexities inherent in multidimensional protein data sets. Through a comparative analysis, ProteinFlow demonstrated a significant improvement over traditional methods, notably reducing data preprocessing time and expanding the scope of biologically significant features identified. The framework's parallel data processing strategy and advanced algorithms ensure not only rapid data handling but also the extraction of comprehensive, meaningful insights from protein sequences, structures, and interactions. Furthermore, ProteinFlow exhibits remarkable scalability, adeptly managing large‐scale data sets without compromising performance, a crucial attribute in the era of big data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. Data Collection and Preprocessing in Web Usage Mining: Implementation and Analysis
- Author
-
Mohammed Ali Mohammed, Rula A. Hamid, and Reem Razzaq AbdulHussein
- Subjects
web usage mining ,access log file ,data collection ,data preprocessing ,Technology - Abstract
Data collection and data preprocessing are crucial stages in web usage mining, mainly because of the unstructured, diverse, and noisy nature of log data. During data collection, log file datasets are loaded and merged. Effective and comprehensive data preprocessing plays a vital role in ensuring the efficiency and scalability of algorithms used in the pattern discovery phase of web usage mining. This work aims to address these phases by introducing two innovative approaches. The first approach focuses on determining the device used for accessing the web, distinguishing between computers and mobile devices. The second approach aims to determine user sessions and complete paths by utilizing the referrer URL. The entire preprocessing pipeline has been implemented using the C# programming language, and the source code is available on GitHub at the following link: https://github.com/Mohammed91/Web-Usage-Mining.
- Published
- 2024
- Full Text
- View/download PDF
44. Rapid determination of the geographical origin of kimchi by Fourier transform near-infrared spectroscopy coupled with chemometric techniques
- Author
-
Su-Yeon Kim and Ji-Hyoung Ha
- Subjects
Supervised classification ,Discrimination ,Data preprocessing ,K-nearest neighbors ,Support vector machine ,Partial least squares-discriminant analysis ,Medicine ,Science - Abstract
Determining the geographical origin of kimchi holds significance because of the considerable variation in quality and price among kimchi products from different regions. This study explored the feasibility of employing Fourier transform near-infrared spectroscopy in conjunction with supervised chemometric techniques to differentiate domestic and imported kimchi products. A total of 30 domestic and 30 imported kimchi products were used to build datasets. Three categories of preprocessing methods such as scattering correction (multiplicative signal correction and standard normal variate), spectral derivatives (the first derivative and the second derivative), and data smoothing (Savitzky–Golay filtering and Norris derivative filtering) were used. K-nearest neighbors, support vector machine, random forest, and partial least squares-discriminant analysis were employed. By appropriately preprocessing spectral data, these four methods successfully distinguished between the two sample groups based on their origin. Notably, the k-nearest neighbors method exhibited exceptional performance, accurately classifying the sample groups irrespective of the preprocessing method employed and swiftly achieving this classification. In comparison, classification and regression tree as well as naïve Bayes methods were outperformed by the aforementioned four classification techniques. Particularly, the efficiency and accuracy of the k-nearest neighbors method make it the most recommended chemometric tool for determining the geographical origins of kimchi.
- Published
- 2024
- Full Text
- View/download PDF
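One of the scattering corrections listed in record 44 is the standard normal variate (SNV), which centers each spectrum on its own mean and scales it by its own standard deviation. The sketch below shows that transformation on a plain list of absorbance values; the function name and the sample numbers are illustrative, and the use of the sample standard deviation is an assumption rather than a detail from the paper.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: per-spectrum centering and scaling."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    if std == 0:
        raise ValueError("a constant spectrum cannot be SNV-scaled")
    return [(value - mean) / std for value in spectrum]


# Two hypothetical spectra that differ only by an offset and a gain,
# which is the kind of scatter effect SNV is meant to remove.
raw = [1.0, 2.0, 3.0, 4.0]
shifted = [10.0 + 2.0 * value for value in raw]
corrected_raw = snv(raw)
corrected_shifted = snv(shifted)
```

After correction the two spectra coincide, so a downstream classifier sees the spectral shape rather than the additive and multiplicative differences between them.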
45. Machine learning-guided strategies for reaction conditions design and optimization
- Author
-
Lung-Yi Chen and Yi-Pei Li
- Subjects
data preprocessing ,reaction conditions prediction ,reaction data mining ,reaction optimization ,reaction representation ,Science ,Organic chemistry ,QD241-441 - Abstract
This review surveys the recent advances and challenges in predicting and optimizing reaction conditions using machine learning techniques. The paper emphasizes the importance of acquiring and processing large and diverse datasets of chemical reactions, and the use of both global and local models to guide the design of synthetic processes. Global models exploit the information from comprehensive databases to suggest general reaction conditions for new reactions, while local models fine-tune the specific parameters for a given reaction family to improve yield and selectivity. The paper also identifies the current limitations and opportunities in this field, such as the data quality and availability, and the integration of high-throughput experimentation. The paper demonstrates how the combination of chemical engineering, data science, and ML algorithms can enhance the efficiency and effectiveness of reaction conditions design, and enable novel discoveries in synthetic chemistry.
- Published
- 2024
- Full Text
- View/download PDF
46. lab2clean: a novel algorithm for automated cleaning of retrospective clinical laboratory results data for secondary uses
- Author
-
Ahmed Medhat Zayed, Arne Janssens, Pavlos Mamouris, and Nicolas Delvaux
- Subjects
Electronic medical records ,Clinical laboratories ,Data integrity ,Algorithms ,Data preprocessing ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Background: The integrity of clinical research and machine learning models in healthcare heavily relies on the quality of underlying clinical laboratory data. However, the preprocessing of this data to ensure its reliability and accuracy remains a significant challenge due to variations in data recording and reporting standards. Methods: We developed lab2clean, a novel algorithm aimed at automating and standardizing the cleaning of retrospective clinical laboratory results data. lab2clean was implemented as two R functions specifically designed to enhance data conformance and plausibility by standardizing result formats and validating result values. The functionality and performance of the algorithm were evaluated using two extensive electronic medical record (EMR) databases, encompassing various clinical settings. Results: lab2clean effectively reduced the variability of laboratory results and identified potentially erroneous records. Upon deployment, it demonstrated effective and fast standardization and validation of substantial laboratory data records. The evaluation highlighted significant improvements in the conformance and plausibility of lab results, confirming the algorithm’s efficacy in handling large-scale data sets. Conclusions: lab2clean addresses the challenge of preprocessing and cleaning clinical laboratory data, a critical step in ensuring high-quality data for research outcomes. It offers a straightforward, efficient tool for researchers, improving the quality of clinical laboratory data, a major portion of healthcare data, thereby enhancing the reliability and reproducibility of clinical research outcomes and clinical machine learning models. Future developments aim to broaden its functionality and accessibility, solidifying its vital role in healthcare data management.
- Published
- 2024
- Full Text
- View/download PDF
47. Rapid determination of the geographical origin of kimchi by Fourier transform near-infrared spectroscopy coupled with chemometric techniques.
- Author
-
Kim, Su-Yeon and Ha, Ji-Hyoung
- Subjects
*FOURIER transform spectroscopy , *K-nearest neighbor classification , *SUPPORT vector machines , *STATISTICAL smoothing , *REGRESSION trees , *NAIVE Bayes classification - Abstract
Determining the geographical origin of kimchi holds significance because of the considerable variation in quality and price among kimchi products from different regions. This study explored the feasibility of employing Fourier transform near-infrared spectroscopy in conjunction with supervised chemometric techniques to differentiate domestic and imported kimchi products. A total of 30 domestic and 30 imported kimchi products were used to build datasets. Three categories of preprocessing methods such as scattering correction (multiplicative signal correction and standard normal variate), spectral derivatives (the first derivative and the second derivative), and data smoothing (Savitzky–Golay filtering and Norris derivative filtering) were used. K-nearest neighbors, support vector machine, random forest, and partial least squares-discriminant analysis were employed. By appropriately preprocessing spectral data, these four methods successfully distinguished between the two sample groups based on their origin. Notably, the k-nearest neighbors method exhibited exceptional performance, accurately classifying the sample groups irrespective of the preprocessing method employed and swiftly achieving this classification. In comparison, classification and regression tree as well as naïve Bayes methods were outperformed by the aforementioned four classification techniques. Particularly, the efficiency and accuracy of the k-nearest neighbors method make it the most recommended chemometric tool for determining the geographical origins of kimchi. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
48. Short-Term Water Demand Forecasting from Univariate Time Series of Water Reservoir Stations.
- Author
-
Myllis, Georgios, Tsimpiris, Alkiviadis, and Vrana, Vasiliki
- Subjects
*WATER management , *STATISTICAL models , *WATER levels , *DEEP learning , *TIME series analysis , *DEMAND forecasting , *WATER demand management - Abstract
This study presents an improved data-centric approach to short-term water demand forecasting using univariate time series from water reservoir levels. The dataset comprises water level recordings from 21 reservoirs in Eastern Thessaloniki collected over 15 months via a SCADA system provided by the water company EYATH S.A. The methodology involves data preprocessing, anomaly detection, data imputation, and the application of predictive models. Techniques such as the Interquartile Range method and moving standard deviation are employed to identify and handle anomalies. Missing values are imputed using LSTM networks optimized through the Optuna framework. This study emphasizes a data-centric approach in deep learning, focusing on improving data quality before model application, which has proven to enhance prediction accuracy. This strategy is crucial, especially in regions where reservoirs are the primary water source, and demand distribution cannot be solely determined by flow meter readings. LSTM, Random Forest Regressor, ARIMA, and SARIMA models are utilized to extract and analyze water level trends, enabling more accurate future water demand predictions. Results indicate that combining deep learning techniques with traditional statistical models significantly improves the accuracy and reliability of water demand predictions, providing a robust framework for optimizing water resource management. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
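Record 48 identifies anomalies in reservoir level series with the Interquartile Range method and a moving standard deviation. The sketch below shows one common form of each check; the 1.5 fence multiplier, the window length, the threshold of three standard deviations, and the sample readings are conventional or illustrative assumptions, not values reported in the abstract.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def moving_std_outliers(values, window=5, k=3.0):
    """Flag points deviating more than k standard deviations from the preceding window."""
    flagged = []
    for i in range(window, len(values)):
        past = values[i - window:i]  # the `window` readings before index i
        mean = statistics.fmean(past)
        std = statistics.pstdev(past)
        if std > 0 and abs(values[i] - mean) > k * std:
            flagged.append(i)
    return flagged


# Hypothetical water level readings with one spike at index 5.
levels = [2.0, 2.1, 2.0, 2.2, 2.1, 9.0, 2.1, 2.0, 2.2, 2.1]
global_flags = iqr_outliers(levels)
local_flags = moving_std_outliers(levels, window=5, k=3.0)
```

The IQR check compares each reading against the whole series, while the moving check compares it only against recent history, so the two can disagree on series with slow trends.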
49. ATR‐FTIR Spectroscopy Preprocessing Technique Selection for Identification of Geographical Origins of Gastrodia elata Blume.
- Author
-
Liu, Hong, Liu, Honggao, Li, Jieqing, and Wang, Yuanzhong
- Subjects
*ATTENUATED total reflectance , *CHINESE medicine , *SUPPORT vector machines , *PRODUCT counterfeiting , *SOIL classification , *PARTIAL least squares regression - Abstract
Gastrodia elata Blume from different regions varies in growth conditions, soil types, and climate, which directly affects the content and quality of its medicinal components. Accurately identifying the origin can effectively ensure the medicinal value of G. elata Bl., prevent the circulation of counterfeit products, and thus protect the interests and health of consumers. Attenuated total reflectance Fourier transform infrared (ATR‐FTIR) spectroscopy is a rapid and effective method for verifying the authenticity of traditional Chinese medicines. However, the presence of scattering effects in the spectra poses challenges in establishing reliable discrimination models. Therefore, employing appropriate scattering correction techniques is crucial for improving the quality of spectral data and the accuracy of discrimination models. This study uses two ensemble preprocessing approaches: series fusion of scatter correction technologies (SCSF) and sequential preprocessing through orthogonalization (SPORT). Four discriminant models were established using a single scattering correction technique and two ensemble preprocessing approaches. The results show that the data‐driven version of the soft independent modeling of class analogy (DD‐SIMCA) model built based on multiplicative scatter correction (MSC) preprocessing has a sensitivity of 0.98 and a specificity of 0.91 and is able to effectively distinguish whether a sample of G. elata Bl. originates from Zhaotong. In addition, three discriminant models including support vector machine (SVM), partial least squares discriminant analysis (PLS‐DA), and three gradient boosting machine (GBM) algorithms built using the ensemble preprocessing approach have good classification and generalization capabilities. Among them, the SCSF‐PLS‐DA model has the best performance with 99.68% and 98.08% accuracy for the training and test sets, respectively, and F1 of 0.97; the SPORT‐SVM model achieved the second‐best classification ability. The results show that the ensemble preprocessing approach used can improve the success rate of G. elata Bl. geographical origin classification. There are differences in the chemical composition of Gastrodia elata Blume from different regions. This study used single scanning correction technology and two ensemble preprocessing approaches to process ATR‐FTIR spectroscopy and established discrimination models for identifying the origin of G. elata Blume. The results indicate that the discriminative model established using two ensemble preprocessing approaches has the best discriminative performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. Evaluating the Sensitivity of Machine Learning Models to Data Preprocessing Technique in Concrete Compressive Strength Estimation.
- Author
-
Habib, Maan and Okayli, Maan
- Subjects
*MACHINE learning , *STANDARD deviations , *CONCRETE industry , *PRINCIPAL components analysis , *COMPRESSIVE strength - Abstract
This study rigorously examines the impact of various data preprocessing techniques on the accuracy of machine learning models in predicting concrete's compressive strength. It develops ten regression models under nine distinct preprocessing scenarios, including normalization, standardization, principal component analysis (PCA), and polynomial features, utilizing a comprehensive dataset featuring normal and high-strength performances. The results reveal that using polynomial features and kernel PCA significantly enhanced model performance, with R values soaring to 93.27 and 94.65% during training and 88.51 and 88.77% during testing, respectively. This indicates their strong ability to capture the hidden nonlinear relationships within data. Conversely, discretization exhibited the least effectiveness, with the highest normalized root mean square error values of 14.2 (training) and 16.8 (testing) and normalized mean absolute error values of 11.6 (training) and 13.6 (testing), suggesting a potential loss of essential data granularity. Additionally, the study found that machine learning techniques generally surpassed traditional regression models, with higher R values being a consistent trend. These findings offer a nuanced understanding of the importance of preprocessing choice in concrete strength prediction and provide valuable insights for the concrete industry and data scientists, emphasizing the critical role of data preprocessing in achieving optimal model accuracy in materials science. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF