Ten quick tips for sequence-based prediction of protein properties using machine learning

Authors :
Katharina Waury
Anton Feenstra
Qingzhen Hou
Dea Gogishvili
Computer Science
Bio Informatics (IBIVU)
AIMMS
Bioinformatics
Integrative Bioinformatics
Source :
Hou, Q, Waury, K, Gogishvili, D & Feenstra, K A 2022, 'Ten quick tips for sequence-based prediction of protein properties using machine learning', PLoS Computational Biology, vol. 18, no. 12, e1010669, pp. 1-15. https://doi.org/10.1371/journal.pcbi.1010669. Public Library of Science
Publication Year :
2022
Publisher :
Public Library of Science, 2022.

Abstract

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work and reading submitted manuscripts as well as published papers, we have noticed several recurring issues that make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology or, conversely, to machine learning experts lacking some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: if you sneak in structural information, your method is not sequence-based; if you compare your own model to “state-of-the-art,” take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. We will cover these and other issues in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.

Details

Language :
English
ISSN :
1553-7358 and 1553-734X
Volume :
18
Issue :
12
Database :
OpenAIRE
Journal :
PLoS Computational Biology
Accession number :
edsair.doi.dedup.....aa4794c05d1a210f7cb458d01ec53110