1. McClintock: An integrated pipeline for detecting transposable element insertions in whole genome shotgun sequencing data
- Author
- Raquel S. Linheiro, Michael G. Nelson, and Casey M. Bergman
- Subjects
Transposable element, Genetics, 0303 health sciences, Shotgun sequencing, Context (language use), Computational biology, Biology, Genome, Pipeline (software), Replication (computing), 03 medical and health sciences, 0302 clinical medicine, Gene duplication, Gene, 030217 neurology & neurosurgery, 030304 developmental biology
- Abstract
Background: Transposable element (TE) insertions are among the most challenging types of variants to detect in genomic data because of their repetitive nature and complex mechanisms of replication. Nevertheless, the recent availability of large resequencing datasets has spurred the development of many new methods to detect TE insertions in whole genome shotgun sequences. These methods generate output in diverse formats and have a large number of software and data dependencies, making their comparative evaluation challenging for potential users.
Results: Here we develop an integrated bioinformatics pipeline for the detection of TE insertions in whole genome shotgun data, called McClintock (https://github.com/bergmanlab/mcclintock), that automatically runs and generates standardized output for multiple TE detection methods. We demonstrate the utility of the McClintock system by performing comparative evaluation of six TE detection methods using simulated and real genome data from the model microbial eukaryote, Saccharomyces cerevisiae. We find substantial variation among McClintock component methods in their ability to detect non-reference insertions in the yeast genome, but show that non-reference TEs at nearly all biologically-realistic locations can be detected in simulated data by combining multiple methods that use split-read and read-pair evidence. In general, our results reveal that split-read methods detect fewer non-reference TE insertions than read-pair methods, but generally have much higher positional accuracy. Analysis of a large sample of real yeast genomes reveals that most, but not all, McClintock component methods can recover known aspects of TE biology in yeast such as the transpositional activity status of families, tRNA gene target preferences, and target site duplication structure, albeit with varying levels of positional accuracy.
Conclusions: Our results suggest that no single TE detection method currently provides comprehensive detection of non-reference TEs, even in the context of a simplified model eukaryotic genome like S. cerevisiae. In spite of these limitations, the McClintock system provides a framework for testing, developing and integrating results from multiple TE detection methods to achieve this ultimate aim, as well as useful guidance for yeast researchers to select appropriate TE detection tools.
- Published
- 2016