Advanced data fusion: Random forest proximities and pseudo-sample principle towards increased prediction accuracy and variable interpretation
- Source :
- Analytica Chimica Acta, 1183:339001. Elsevier Science
- Publication Year :
- 2021
- Publisher :
- Elsevier Science, 2021.
Abstract
- Data fusion has gained much attention in the life sciences because the analysis of biological samples may require data from multiple complementary sources to characterize the samples fully. Data fusion rests on the idea that different data platforms detect different biological entities. Therefore, if the data on these different entities are combined, they can provide comprehensive profiling and a fuller understanding of the research question at hand. Data fusion is traditionally performed in three ways: low-level, mid-level, and high-level fusion. However, the increasing complexity and amount of generated data require more sophisticated fusion approaches. To that end, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. Specifically, 104 samples were used to train and validate the models, and the remaining 26 samples were used to validate the models externally. Mid-level, high-level, and individual platform classification predictions were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model on the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples and then projecting the test samples onto that space. The use of pseudo-samples allowed the identification of the most important variables per platform, the discovery of relations among variables of the different data platforms, and the examination of how variables behave in the samples. The proximities stacking approach outperforms both mid-level and high-level fusion approaches, as well as all individual platform predictions. It also addresses significant bottlenecks of the traditional fusion approaches and of another advanced fusion approach discussed in the paper. Finally, it contradicts the general belief that more data always gives a better result; careful consideration is therefore needed before any data fusion analysis is conducted. (c) 2021 Published by Elsevier B.V.
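The abstract names random forest proximities, stacking, and sensitivity/specificity without giving formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the usual definition of a random forest proximity (the fraction of trees in which two samples fall in the same terminal node) and assumes that "stacking" means concatenating each sample's proximity rows across platforms before a second-level classifier is trained; the paper may combine them differently. The leaf indices below are hypothetical toy values; in practice they could come from a fitted forest, for example scikit-learn's `RandomForestClassifier.apply`. All function names are illustrative.

```python
from itertools import chain
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def proximity_matrix(leaves_a: Sequence[Sequence[int]],
                     leaves_b: Sequence[Sequence[int]]) -> Matrix:
    """Proximity between every sample in A and every sample in B.

    Each input row holds one sample's terminal-node index in every tree.
    The proximity is the fraction of trees where both samples share a node.
    """
    n_trees = len(leaves_a[0])
    return [
        [sum(a == b for a, b in zip(row_a, row_b)) / n_trees for row_b in leaves_b]
        for row_a in leaves_a
    ]


def stack_proximities(per_platform: Sequence[Matrix]) -> Matrix:
    """Assumed stacking step: join each sample's proximity rows across platforms."""
    return [list(chain.from_iterable(rows)) for rows in zip(*per_platform)]


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int],
                            positive: int = 1) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical leaf indices: rows are samples, columns are trees.
platform_1_leaves = [[0, 2, 1], [0, 2, 3], [4, 5, 3]]
platform_2_leaves = [[1, 1, 0], [2, 1, 0], [2, 3, 0]]

prox_1 = proximity_matrix(platform_1_leaves, platform_1_leaves)
prox_2 = proximity_matrix(platform_2_leaves, platform_2_leaves)

# One row per sample, one column per (platform, training sample) pair.
stacked = stack_proximities([prox_1, prox_2])

# Hypothetical labels and predictions for an external test set.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0, 1],
                                     [1, 1, 0, 0, 0, 1, 0, 1])
```

For test samples, the same `proximity_matrix` function can be called with the test leaves as the first argument and the training leaves as the second, which gives test-to-training proximities with the same column layout as the training matrix.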
- Subjects :
- SELECTION
PARTIAL LEAST-SQUARES
Sample (statistics)
Biochemistry
Biological Science Disciplines
Field (computer science)
Analytical Chemistry
Humans
Environmental Chemistry
Proximities
Spectroscopy
Data Management
Profiling (computer programming)
Variable behaviour
Chemistry
Data fusion
Sensor fusion
Classification
NMR
Random forest
Identification (information)
Crohn's disease
Stacking
Data Interpretation, Statistical
Test set
Principal component analysis
Data mining
Details
- Language :
- English
- ISSN :
- 1873-4324 and 0003-2670
- Volume :
- 1183
- Database :
- OpenAIRE
- Journal :
- Analytica Chimica Acta
- Accession number :
- edsair.doi.dedup.....acafcf0ef53126db56f64365d5cfbbf7