
HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction

Authors :
Giorgio Valentini
Peter N. Robinson
Alessandro Petrini
Jessica Gliozzo
Marco Frasca
Marco Notaro
Elena Casiraghi
Source :
Bioinformatics (Oxford, England).
Publication Year :
2021

Abstract

Motivation: Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). ‘Hierarchy-unaware’ classifiers, also known as ‘flat’ methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while ‘hierarchy-aware’ approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO.

Results: To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide ‘TPR-safe’ predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms firstly show that HEMDAG can be used as a general tool to improve the predictions of flat classifiers, and secondly that HEMDAG is competitive versus state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges.

Availability and implementation: Fully tested R code freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation at https://hemdag.readthedocs.io.

Supplementary information: Supplementary data are available at Bioinformatics online.
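
To make the notion of a ‘TPR-safe’ prediction concrete, the following is a minimal Python sketch of one simple way to enforce it: walking the ontology top-down and capping each term's score at the lowest score among its parents. This is an illustration only and is not HEMDAG's algorithm, which according to the abstract combines isotonic regression with TPR learning strategies. The function name `tpr_safe_scores`, the term labels and the scores are all hypothetical.

```python
from graphlib import TopologicalSorter


def tpr_safe_scores(flat_scores, parents):
    """Return scores in which no term scores higher than any of its parents.

    flat_scores: dict mapping term -> score from a flat classifier.
    parents:     dict mapping term -> iterable of parent terms in the DAG.
    Assumption: every term that appears as a parent is also in flat_scores.
    """
    # TopologicalSorter takes node -> predecessors, so passing the parent
    # lists yields every parent before its children.
    graph = {term: tuple(parents.get(term, ())) for term in flat_scores}
    corrected = {}
    for term in TopologicalSorter(graph).static_order():
        score = flat_scores[term]
        for parent in graph[term]:
            # Parents were already corrected, so the cap propagates downward.
            score = min(score, corrected[parent])
        corrected[term] = score
    return corrected


# Hypothetical toy ontology: C has two parents, A and B, both under root.
toy_parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"]}
toy_scores = {"root": 0.9, "A": 0.6, "B": 0.8, "C": 0.7}

result = tpr_safe_scores(toy_scores, toy_parents)
print(result)
```

In this toy case the flat score of C (0.7) exceeds that of its parent A (0.6), which would violate the TPR; the sketch lowers C to 0.6 and leaves the other terms unchanged. A plain cap like this only ever lowers scores, which is one reason more refined corrections such as those named in the abstract may be preferable.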

Details

ISSN :
1367-4811
Database :
OpenAIRE
Journal :
Bioinformatics (Oxford, England)
Accession number :
edsair.doi.dedup.....40b1cd5bdb45c4b44360a4acc035afe6