Back to Search Start Over

Using machine learning techniques for exploration and classification of laboratory data.

Authors :
Trulson, Inga
Holdenrieder, Stefan
Hoffmann, Georg
Source :
Journal of Laboratory Medicine; 2024, Vol. 48 Issue 5, p203-214, 12p
Publication Year :
2024

Abstract

The study aims to acquaint readers with six widely used machine learning (ML) techniques (Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), k-means, hierarchical clustering and the decision tree models (rpart and random forest)) that might be useful for the analysis of laboratory data. Utilizing a recently validated data set from lung cancer diagnostics, we investigate how ML can support the search for a suitable tumor marker panel for the differentiation of small cell (SCLC) and non-small cell lung cancer (NSCLC). The ML techniques used here effectively helped to gain a quick overview of the data structures and provide initial answers to the clinical questions. Dimensionality reduction techniques such as PCA and UMAP offered insightful visualization and impression of the data structure, suggesting the existence of two tumor groups with a large overlap of largely inconspicuous values. This impression was confirmed by a cluster analysis with the k-means algorithm, indicative of unsupervised learning. For supervised learning, decision tree models like rpart or random forest demonstrated their utility in differential diagnosis of the two tumor types. The rpart model, which constructs binary decision trees based on the recursive partitioning algorithm, suggests a tree involving four serum tumor markers (STMs), which were confirmed by the random forest approach. Both highlighted pro-gastrin-releasing peptide (ProGRP), neuron specific enolase (NSE), cytokeratin-19 fragment (CYFRA 21-1) and cancer antigen (CA) 72-4 as key tumor markers, aligning with the outcomes of the initial statistical analysis. Cross-validation of the two proposals showed a higher area under the receiver operating characteristic (AUROC) curve of 0.95 with a 95 % confidence interval (CI) of 0.92–0.97 for the random forest model compared to an AUROC curve of 0.88 (95 % CI: 0.83–0.93). ML can provide a useful overview of inherent medical data structures and distinguish significant from less pertinent features. While by no means replacing human medical and statistical expertise, ML can significantly accelerate the evaluation of medical data, supporting a more informed diagnostic dialogue between physicians and statisticians. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
25679430
Volume :
48
Issue :
5
Database :
Complementary Index
Journal :
Journal of Laboratory Medicine
Publication Type :
Academic Journal
Accession number :
180156704
Full Text :
https://doi.org/10.1515/labmed-2024-0100