On convolutional neural networks for selection inference: Revealing the effect of preprocessing on model learning and the capacity to discover novel patterns.

Authors :
Cecil, Ryan M.
Sugden, Lauren A.
Source :
PLoS Computational Biology. 11/27/2023, Vol. 19 Issue 11, p1-20. 20p.
Publication Year :
2023

Abstract

A central challenge in population genetics is the detection of genomic footprints of selection. As machine learning tools including convolutional neural networks (CNNs) have become more sophisticated and applied more broadly, these provide a logical next step for increasing our power to learn and detect such patterns; indeed, CNNs trained on simulated genome sequences have recently been shown to be highly effective at this task. Unlike previous approaches, which rely upon human-crafted summary statistics, these methods are able to be applied directly to raw genomic data, allowing them to potentially learn new signatures that, if well-understood, could improve the current theory surrounding selective sweeps. Towards this end, we examine a representative CNN from the literature, paring it down to the minimal complexity needed to maintain comparable performance; this low-complexity CNN allows us to directly interpret the learned evolutionary signatures. We then validate these patterns in more complex models using metrics that evaluate feature importance. Our findings reveal that preprocessing steps, which determine how the population genetic data is presented to the model, play a central role in the learned prediction method. This results in models that mimic previously-defined summary statistics; in one case, the summary statistic itself achieves similarly high accuracy. For evolutionary processes that are less well understood than selective sweeps, we hope this provides an initial framework for using CNNs in ways that go beyond simply achieving high classification performance. Instead, we propose that CNNs might be useful as tools for learning novel patterns that can translate to easy-to-implement summary statistics available to a wider community of researchers.

Author summary: The ever-increasing power and complexity of machine learning tools presents the scientific community with both unique opportunities and unique challenges. On the one hand, these data-driven approaches have led to state-of-the-art advances on a variety of research problems spanning many fields. On the other, these apparent performance improvements come at the cost of interpretability: it is difficult to know how a model makes its predictions. This is compounded by the computational sophistication of machine learning models which can lend an air of objectivity, often masking ways in which bias may be baked into the modeling decisions or the data itself. We present here a case study, examining these issues in the context of a central problem in population genetics: detecting patterns of selection from genome data. Through this application, we show how human decision-making can encourage the model to see what we want it to see in various ways. By understanding how these models work, and how they respond to the particular way in which data is presented, we have a chance of creating new frameworks that are capable of discovering novel patterns. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1553734X
Volume :
19
Issue :
11
Database :
Academic Search Index
Journal :
PLoS Computational Biology
Publication Type :
Academic Journal
Accession number :
173857090
Full Text :
https://doi.org/10.1371/journal.pcbi.1010979