Back to Search Start Over

Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning.

Authors :
Lu, Alex X.
Lu, Amy X.
Pritišanac, Iva
Zarin, Taraneh
Forman-Kay, Julie D.
Moses, Alan M.
Source :
PLoS Computational Biology; 6/29/2022, Vol. 18 Issue 6, p1-31, 31p, 2 Diagrams, 4 Graphs
Publication Year :
2022

Abstract

A major challenge to the characterization of intrinsically disordered regions (IDRs), which are widespread in the proteome, but relatively poorly understood, is the identification of molecular features that mediate functions of these regions, such as short motifs, amino acid repeats and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call "reverse homology", exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homolog from another set of IDRs sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques, and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of what residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences. Author summary: Intrinsically disordered regions (IDRs) are widespread in proteins but are poorly understood on a systematic level because they evolve too rapidly for classic bioinformatics methods to be effective. We designed a neural network that learns what features (for example, electrostatic charge, or the presence of certain motifs) might be important to the function of IDRs, even when we don't have prior knowledge of function. Our neural network learns by exploiting principles of evolution. Important features tend to be conserved over species, so guessing what sequences evolved from the same common ancestor helps the neural network identify these features. Importantly, training a neural network this way can be defined as a fully automatic operation, so no manual effort is required. After our neural network is trained, we can apply interpretation techniques to understand what kinds of features are important to IDRs globally in the proteome, and to form hypotheses about specific IDRs. We show that many of the features our neural network learns are consistent with features we already know to be important to IDRs. We hope that our neural network can be applied to help biologists form hypotheses about poorly characterized IDRs. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1553734X
Volume :
18
Issue :
6
Database :
Complementary Index
Journal :
PLoS Computational Biology
Publication Type :
Academic Journal
Accession number :
157708212
Full Text :
https://doi.org/10.1371/journal.pcbi.1010238