
Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data.

Authors :
Mo, Ziyi
Siepel, Adam
Source :
PLoS Genetics; 11/7/2023, Vol. 19 Issue 11, p1-22, 22p
Publication Year :
2023

Abstract

Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.

Author summary: Population genetic simulation is a powerful tool in the study of evolution. A number of supervised machine learning methods have been developed that take advantage of inexpensive simulations as training data. Despite their outstanding performance in benchmarks, these models can fail when the simulated training data deviate from the real data. In this work, we employed domain adaptation techniques to address this "simulation mis-specification" problem by training the machine learning model jointly with simulated and real data. We performed extensive benchmark experiments to demonstrate the improvement of the domain-adaptive models over standard machine learning models in the presence of different types of mis-specification. In addition, we applied dadaSIA, a domain-adaptive selection inference model, to improve the estimates of selection coefficients at selected loci in a European population. The domain adaptation framework proposed in our work is widely applicable to models relying on synthetic training data and therefore opens the door to many more applications in population genetics. [ABSTRACT FROM AUTHOR]
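
The gradient reversal layer named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which this record does not describe. It is a toy, standard-library-only example under stated assumptions: a one-parameter "feature extractor" (f = w * x), a one-parameter domain classifier (p = sigmoid(v * f)), and a binary domain label (0 = simulated, 1 = real). The function name `domain_loss_grads` and the scaling parameter `lam` are hypothetical and chosen only for illustration. The point it shows is that the layer acts as the identity in the forward pass and multiplies the gradient by -lam in the backward pass, so the feature extractor is pushed to make the two domains harder to tell apart while the domain classifier is still trained normally.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss_grads(w, v, x, domain, lam=1.0):
    """Toy domain-classification branch with a gradient reversal layer.

    Hypothetical setup (not from the paper):
      feature            f = w * x
      GRL forward        identity, so the classifier sees f unchanged
      domain classifier  p = sigmoid(v * f)
      loss               binary cross-entropy against the domain label

    Returns (loss, grad_w, grad_v), where grad_w has passed back
    through the reversal layer and grad_v has not.
    """
    f = w * x                      # feature extractor output
    p = sigmoid(v * f)             # GRL is the identity going forward
    loss = -(domain * math.log(p) + (1 - domain) * math.log(1.0 - p))

    dloss_dlogit = p - domain      # derivative of BCE w.r.t. the logit v * f
    grad_v = dloss_dlogit * f      # ordinary gradient for the classifier
    grad_f = dloss_dlogit * v      # gradient arriving at the GRL output
    grad_f_reversed = -lam * grad_f  # GRL backward: flip sign, scale by lam
    grad_w = grad_f_reversed * x   # gradient reaching the feature extractor
    return loss, grad_w, grad_v


if __name__ == "__main__":
    loss, grad_w, grad_v = domain_loss_grads(w=1.0, v=1.0, x=2.0, domain=1)
    print("loss:", loss)
    print("grad_w (reversed):", grad_w)
    print("grad_v (ordinary):", grad_v)
```

With these toy values, a gradient-descent step on `v` lowers the domain loss, whereas the same step on `w` raises it, which is the adversarial effect the reversal is meant to produce. Setting `lam=0` switches the reversal branch off, so no domain signal reaches the feature extractor.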

Details

Language :
English
ISSN :
15537390
Volume :
19
Issue :
11
Database :
Complementary Index
Journal :
PLoS Genetics
Publication Type :
Academic Journal
Accession number :
173472467
Full Text :
https://doi.org/10.1371/journal.pgen.1011032