1. HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction
- Author
- Giorgio Valentini, Peter N. Robinson, Alessandro Petrini, Jessica Gliozzo, Marco Frasca, Marco Notaro, and Elena Casiraghi
- Subjects
- Statistics and Probability, Computer science, Ontology (information science), Machine learning, Biochemistry, 03 medical and health sciences, 0302 clinical medicine, Ensembles of classifiers, Controlled vocabulary, Protein function prediction, Molecular Biology, 030304 developmental biology, 0303 health sciences, Modular design, Directed acyclic graph, Ensemble learning, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, 030220 oncology & carcinogenesis, Scalability, Artificial intelligence
- Abstract
Motivation: Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). ‘Hierarchy-unaware’ classifiers, also known as ‘flat’ methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while ‘hierarchy-aware’ approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO.

Results: To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide ‘TPR-safe’ predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms show, first, that HEMDAG can be used as a general tool to improve the predictions of flat classifiers and, second, that HEMDAG is competitive with state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges.

Availability and implementation: Fully tested R code is freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation are at https://hemdag.readthedocs.io.

Supplementary information: Supplementary data are available at Bioinformatics online.
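
To make the True-Path-Rule concrete, the following minimal Python sketch checks whether a set of flat scores is TPR-consistent on a DAG (no term scores higher than any of its parents) and applies a simple top-down capping correction. This is an illustration only: the toy DAG, the scores, and the function names `tpr_violations` and `top_down_correct` are hypothetical, and the capping heuristic is a stand-in, not the isotonic regression or TPR ensemble strategies that the abstract describes, nor the HEMDAG R API.

```python
# Hypothetical toy DAG (not real GO terms): each term maps to its parents.
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],  # a term with two parents, as allowed in a DAG
}


def tpr_violations(scores, parents):
    """Return (child, parent) pairs where the child scores higher than a parent."""
    return [
        (child, parent)
        for child, plist in parents.items()
        for parent in plist
        if scores[child] > scores[parent]
    ]


def top_down_correct(scores, parents):
    """Cap each term's score at the minimum corrected score of its parents.

    Assumed illustrative heuristic: it only lowers scores, moving from the
    root towards the leaves, so the result never violates the TPR.
    """
    corrected = {}

    def visit(term):
        if term not in corrected:
            cap = min((visit(p) for p in parents[term]), default=1.0)
            corrected[term] = min(scores[term], cap)
        return corrected[term]

    for term in parents:
        visit(term)
    return corrected


# Made-up scores from a hypothetical flat classifier.
flat_scores = {"root": 0.9, "a": 0.4, "b": 0.7, "c": 0.8}
consistent = top_down_correct(flat_scores, parents)

print(tpr_violations(flat_scores, parents))
print(tpr_violations(consistent, parents))
```

In this toy case the flat scores give term `c` a higher score than both of its parents, which is exactly the kind of inconsistency a hierarchy-aware correction has to remove.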
- Published
- 2021