HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction
- Source :
- Bioinformatics (Oxford, England).
- Publication Year :
- 2021
Abstract
- Motivation: Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). ‘Hierarchy-unaware’ classifiers, also known as ‘flat’ methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while ‘hierarchy-aware’ approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO.
- Results: To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide ‘TPR-safe’ predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms firstly show that HEMDAG can be used as a general tool to improve the predictions of flat classifiers, and secondly that HEMDAG is competitive versus state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges.
- Availability and implementation: Fully tested R code freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation at https://hemdag.readthedocs.io.
- Supplementary information: Supplementary data are available at Bioinformatics online.
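The idea of a ‘TPR-safe’ prediction mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the HEMDAG implementation and does not use the R package's API. It assumes the simplest possible top-down heuristic: each term's flat score is clipped so that it never exceeds the corrected score of any of its parents. The function name `top_down_correct`, the toy DAG, and the scores are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def top_down_correct(scores, parents):
    """Clip each term's score to the minimum corrected score of its parents.

    scores:  dict mapping term -> flat classifier score in [0, 1]
    parents: dict mapping term -> list of parent terms (roots may be omitted)
    """
    # TopologicalSorter expects node -> predecessors, so parents are
    # emitted before their children by static_order().
    graph = {term: parents.get(term, ()) for term in scores}
    corrected = {}
    for term in TopologicalSorter(graph).static_order():
        parent_scores = [corrected[p] for p in parents.get(term, ())]
        corrected[term] = min([scores[term], *parent_scores])
    return corrected


# Hypothetical toy DAG: "a" and "b" are children of "root";
# "c" has two parents, "a" and "b".
parents = {"a": ["root"], "b": ["root"], "c": ["a", "b"]}
flat_scores = {"root": 0.9, "a": 0.4, "b": 0.7, "c": 0.8}

corrected = top_down_correct(flat_scores, parents)
print(corrected)
```

In this toy example the flat score of "c" (0.8) is higher than that of its parent "a" (0.4), which violates the TPR; after correction, no term scores higher than any of its ancestors.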
- Subjects :
- Statistics and Probability
Computer science
Ontology (information science)
Machine learning
Biochemistry
03 medical and health sciences
0302 clinical medicine
Ensembles of classifiers
Controlled vocabulary
Protein function prediction
Molecular Biology
030304 developmental biology
0303 health sciences
Modular design
Directed acyclic graph
Ensemble learning
Computer Science Applications
Computational Mathematics
Computational Theory and Mathematics
030220 oncology & carcinogenesis
Scalability
Artificial intelligence
Details
- ISSN :
- 1367-4811
- Database :
- OpenAIRE
- Journal :
- Bioinformatics (Oxford, England)
- Accession number :
- edsair.doi.dedup.....40b1cd5bdb45c4b44360a4acc035afe6