Advanced data fusion: Random forest proximities and pseudo-sample principle towards increased prediction accuracy and variable interpretation

Authors :
Robert van Vorstenbosch
John Penders
Jane E. Hill
Georgios Stavropoulos
Frederik-Jan van Schooten
Daisy Jonkers
Agnieszka Smolinska
Farmacologie en Toxicologie
RS: NUTRIM - R3 - Respiratory & Age-related Health
Interne Geneeskunde
RS: NUTRIM - R2 - Liver and digestive health
Med Microbiol, Infect Dis & Infect Prev
Source :
Analytica Chimica Acta, 1183:339001. Elsevier Science
Publication Year :
2021
Publisher :
Elsevier Science, 2021.

Abstract

Data fusion has gained much attention in the life sciences because the analysis of biological samples may require data from multiple complementary sources to characterize the samples fully. Data fusion rests on the idea that different data platforms detect different biological entities; combining these entities can therefore provide comprehensive profiling and understanding of the research question at hand. Data fusion is traditionally performed in three ways: low-level, mid-level, and high-level data fusion. However, the increasing complexity and amount of generated data require the development of more sophisticated fusion approaches. In that regard, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four different data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. More specifically, 104 samples were used to train and validate the models, whereas the remaining 26 samples were used to validate the models externally. Mid-level, high-level, and individual platform classification predictions were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model for the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples and subsequently projecting the test samples onto that space. The implementation of pseudo-samples allowed for the identification of the most important variables per platform, the finding of relations among variables of the different data platforms, and the examination of how variables behave in the samples.
The proximities stacking approach outperforms both mid-level and high-level fusion approaches, as well as all individual platform predictions. At the same time, it tackles significant bottlenecks of the traditional fusion approaches and of another advanced fusion approach discussed in the paper. Finally, it contradicts the general belief that more data always yields a better result; considerations therefore have to be taken into account before any data fusion analysis is conducted. (c) 2021 Published by Elsevier B.V.
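
The abstract does not spell out the algorithm, so the following Python snippet is only a minimal sketch of how random forest proximities from several platforms might be stacked and scored. It assumes scikit-learn, uses synthetic data in place of the four platforms, and treats "stacking" as concatenating per-platform proximity matrices before fitting a second random forest; the function names, the synthetic data, and that stacking step are illustrative assumptions, not the authors' implementation. The pseudo-sample and principal component analysis steps are not covered.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_proximity(forest, X_a, X_b):
    """Fraction of trees in which a row of X_a and a row of X_b share a leaf."""
    leaves_a = forest.apply(X_a)  # shape (n_a, n_trees), leaf index per tree
    leaves_b = forest.apply(X_b)  # shape (n_b, n_trees)
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return same_leaf.mean(axis=2)  # shape (n_a, n_b), values in [0, 1]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels coded as 0 and 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic stand-in data (assumption): 130 samples, four platforms with
# different numbers of variables, and a class-dependent shift in the mean.
rng = np.random.default_rng(0)
y = np.arange(130) % 2
platforms = [rng.normal(size=(130, p)) + 0.5 * y[:, None] for p in (20, 30, 15, 25)]

# 104 samples for training, 26 held out as an external test set.
train, test = np.arange(104), np.arange(104, 130)

prox_train, prox_test = [], []
for X in platforms:
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X[train], y[train])
    # Proximities are always measured against the training samples.
    prox_train.append(rf_proximity(forest, X[train], X[train]))
    prox_test.append(rf_proximity(forest, X[test], X[train]))

# Assumed stacking step: concatenate the per-platform proximity matrices
# column-wise and fit a second-level random forest on the result.
stacked_train = np.hstack(prox_train)  # shape (104, 4 * 104)
stacked_test = np.hstack(prox_test)    # shape (26, 4 * 104)

meta = RandomForestClassifier(n_estimators=100, random_state=0)
meta.fit(stacked_train, y[train])
sens, spec = sensitivity_specificity(y[test], meta.predict(stacked_test))
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

One caveat of this sketch: the training proximities are computed on the same samples the forests were fitted to, so they are optimistic; an out-of-bag variant would be a natural refinement.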

Details

Language :
English
ISSN :
1873-4324 and 0003-2670
Volume :
1183
Database :
OpenAIRE
Journal :
Analytica Chimica Acta
Accession number :
edsair.doi.dedup.....acafcf0ef53126db56f64365d5cfbbf7