Ranked MSD: A New Feature Ranking and Feature Selection Approach for Biomarker Identification
- Source :
- Lecture Notes in Computer Science, 3rd International Cross-Domain Conference for Machine Learning and Knowledge Extraction (CD-MAKE), Aug 2019, Canterbury, United Kingdom, pp. 147-167, ⟨10.1007/978-3-030-29726-8_10⟩, ISBN: 9783030297251
- Publication Year :
- 2019
- Publisher :
- HAL CCSD, 2019.
Abstract
- In the era of big data, when huge amounts of data are continuously generated, situations commonly arise in which the number of samples is much smaller than the number of features (variables) per sample. This is often the case in biomedical domains, where there may be relatively few patients compared to the amount of data per patient. For example, gene expression data typically has between 10,000 and 60,000 features per sample. A separate issue arises from the “right to explanation” found in the European General Data Protection Regulation (GDPR), which may prevent the use of black-box models in applications where explainability is required. In such situations, there is a need for robust algorithms that can identify the relevant features in experimental data and discard irrelevant ones, yielding a simpler subset that facilitates explanation. To address these needs, we have developed a new algorithm for feature ranking and feature selection, named Ranked MSD. We tested the proposed approach on two real-world gene expression data sets, both of which relate to respiratory viral infections. The Ranked MSD feature selection algorithm reduces the feature set from 12,023 genes (features) to 65 genes on the first data set and from 20,737 genes to 31 genes on the second, in both cases without any significant loss in disease prediction accuracy. In an alternative configuration, the algorithm identifies a small subset of features that gives better accuracy than the full feature set. The algorithm can also identify important biomarkers (genes), together with an importance score for a particular disease; the identified top-ranked biomarkers can play a vital role in drug discovery and precision medicine.
- Subjects :
- Feature ranking
Computer science
Big data
Experimental data
Pattern recognition
Sample (statistics)
Feature selection
Precision medicine
Classification
Data set
General Data Protection Regulation
Machine learning
Explainable AI
[INFO]Computer Science [cs]
Artificial intelligence
Respiratory viral infection
Details
- Language :
- English
- ISBN :
- 978-3-030-29725-1
- Database :
- OpenAIRE
- Accession number :
- edsair.doi.dedup.....7c770cc5663771f05dd0cf29250bb43c