1. UMI-Gen: A UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries
- Author
Caroline Bérard, Thierry Lecroq, Fabrice Jardin, Vincent Sater, Pierre-Julien Viailly, Philippe Ruminy, and Élise Prieur-Gaston
Affiliations: Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS); Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS); Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie); Institut National des Sciences Appliquées (INSA); Normandie Université (NU); Université de Rouen Normandie (UNIROUEN); Université Le Havre Normandie (ULH); Génomique et Médecine Personnalisée du Cancer et des Maladies Neuropsychiatriques (GPMCND); Institut National de la Santé et de la Recherche Médicale (INSERM)
- Subjects
Computer science, lcsh:Biotechnology, Biophysics, Word error rate, Biochemistry, DNA sequencing, chemistry.chemical_compound, 03 medical and health sciences, 0302 clinical medicine, Structural Biology, lcsh:TP248.13-248.65, Variant calling, Genetics, Nucleotide, Copy-number variation, MIT License, ComputingMilieux_MISCELLANEOUS, Paired-end tag, Simulation, [INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM], 030304 developmental biology, chemistry.chemical_classification, 0303 health sciences, Biological data, Sequence analysis, UMI, Pipeline (software), Computer Science Applications, Identifier, chemistry, NGS, 030220 oncology & carcinogenesis, Simulator, DNA, Research Article, Biotechnology
- Abstract
Motivation: With Next Generation Sequencing (NGS) becoming more affordable every year, NGS technologies have established themselves as the fastest and most reliable way to detect Single Nucleotide Variants (SNV) and Copy Number Variations (CNV) in cancer patients. These technologies can sequence DNA at very high depths, which makes it possible to detect abnormalities present in tumor cells at very low frequencies. Many variant callers are publicly available and generally perform well. However, when variant frequencies drop below 1%, the specificity of these tools suffers greatly, because true variants at very low frequencies are easily confused with sequencing or PCR artifacts. The recent use of Unique Molecular Identifiers (UMI) in NGS experiments offers a way to accurately separate true variants from artifacts. UMI-based variant callers are gradually replacing raw-read-based variant callers as the standard method for accurate detection of variants at very low frequencies. However, the benchmarks reported in the tools' publications are usually performed on real biological data in which the true variants are not known, making it difficult to assess the tools' accuracy.
Results: We present UMI-Gen, a UMI-based read simulator for targeted paired-end sequencing data. UMI-Gen generates reference reads covering the targeted regions at a user-customizable depth. Then, using a number of control files, it estimates the background error rate at each position and modifies the generated reads to mimic real biological data. Finally, it inserts real variants into the reads from a list provided by the user.
Availability: The entire pipeline is available at https://gitlab.com/vincent-sater/umigen-master under the MIT license.
Contact: vincent.sater@gmail.com
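The three steps described in the abstract (generate reference reads, apply a per-position background error rate estimated from controls, insert known variants) can be illustrated with a minimal Python sketch. This is not UMI-Gen's implementation: the function names, the in-memory pileup format, and the uniform choice of substituted base are all assumptions made for illustration, and UMI tagging, paired-end layout, and file parsing are left out entirely.

```python
import random

BASES = "ACGT"


def background_error_rates(reference, control_counts):
    """Estimate the per-position mismatch fraction from control samples.

    control_counts is a hypothetical format: one list per control sample,
    holding one {base: count} dict per reference position.
    """
    rates = []
    for pos, ref_base in enumerate(reference):
        total = 0
        mismatches = 0
        for sample in control_counts:
            counts = sample[pos]
            total += sum(counts.values())
            mismatches += sum(n for base, n in counts.items() if base != ref_base)
        # Positions without coverage are treated as error-free.
        rates.append(mismatches / total if total else 0.0)
    return rates


def simulate_reads(reference, depth, error_rates, variants, rng):
    """Generate `depth` full-length reads with background noise and true variants.

    variants maps a 0-based position to (alt_base, allele_frequency).
    """
    reads = []
    for _ in range(depth):
        read = list(reference)
        # Step 2: background errors, drawn independently at each position.
        for pos, rate in enumerate(error_rates):
            if rng.random() < rate:
                read[pos] = rng.choice([b for b in BASES if b != reference[pos]])
        # Step 3: true variants, each carried by a fraction of the reads.
        for pos, (alt_base, allele_frequency) in variants.items():
            if rng.random() < allele_frequency:
                read[pos] = alt_base
        reads.append("".join(read))
    return reads


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    # One toy control sample: 1% mismatches at position 0, clean elsewhere.
    control = [{reference[0]: 99, "C": 1}] + [{b: 100} for b in reference[1:]]
    rates = background_error_rates(reference, [control])
    reads = simulate_reads(reference, 1000, rates, {4: ("T", 0.005)}, random.Random(0))
    print(sum(r[4] == "T" for r in reads), "of", len(reads), "reads carry the variant")
```

Because the inserted variants and their allele frequencies are chosen by the user, the output of a variant caller on such reads can be compared against a known truth set, which is the evaluation gap the abstract points to.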
- Published
- 2020