RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
- Source :
- BMC Bioinformatics, Vol 21, Iss 1, Pp 1-24 (2020)
- Publication Year :
- 2019
Abstract
- Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identifying repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling the high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions in the genome with existing tools. Results: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR selects the high-frequency reads from the full set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled to generate repeats using SPAdes, which is considered an outstanding genome assembler for NGS sequences. Conclusions: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
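The first two steps described in the abstract (finding high-frequency k-mers, then selecting high-frequency reads) can be illustrated with a minimal Python sketch. This is not the RepAHR implementation: the abstract only says reads are selected "according to certain rules", so the rule used here (keep a read when a given fraction of its k-mers are high-frequency) is an assumption, as are the function names and the `min_count` and `min_fraction` parameters. Reverse complements are ignored for brevity, and the final SPAdes assembly step is not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Return the k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def filter_high_frequency_reads(reads, k, hf_kmers, min_fraction):
    """Keep reads whose share of high-frequency k-mers is >= min_fraction.

    Assumed selection rule for illustration only; the abstract does not
    specify the rules RepAHR actually applies.
    """
    selected = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read is shorter than k, so it contains no k-mers
        hits = sum(1 for i in range(total) if read[i:i + k] in hf_kmers)
        if hits / total >= min_fraction:
            selected.append(read)
    return selected


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
    counts = count_kmers(reads, k=4)
    hf = high_frequency_kmers(counts, min_count=2)
    print(filter_high_frequency_reads(reads, 4, hf, min_fraction=0.5))
```

In a pipeline of this shape, the selected reads would then be written to a FASTQ or FASTA file and passed to an assembler such as SPAdes.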
- Subjects :
- Time Factors
Computer science
Assembly
0206 medical engineering
Statistics as Topic
Repetitive Sequences
Sequence assembly
De novo repeat identification
02 engineering and technology
Computational biology
Saccharomyces cerevisiae
lcsh:Computer applications to medicine. Medical informatics
Biochemistry
Genome
Structural variation
03 medical and health sciences
Mice
Structural Biology
Databases, Genetic
The high-frequency reads
Animals
Humans
lcsh:QH301-705.5
Molecular Biology
The high-frequency k-mers
030304 developmental biology
Sequence (medicine)
Gene Library
Repetitive Sequences, Nucleic Acid
0303 health sciences
Base Sequence
Genome, Human
Methodology Article
Applied Mathematics
High-Throughput Nucleotide Sequencing
Reproducibility of Results
Sequence Analysis, DNA
Reference Standards
Computer Science Applications
Identification (information)
Drosophila melanogaster
lcsh:Biology (General)
lcsh:R858-859.7
NGS reads
DNA microarray
Sequence Alignment
020602 bioinformatics
Software
Details
- ISSN :
- 14712105
- Volume :
- 21
- Issue :
- 1
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics
- Accession number :
- edsair.doi.dedup.....1a3befc69889120886584449ce90a2d4