96 results on '"Chikhi, R"'
Search Results
2. Seedability: Optimizing alignment parameters for sensitive sequence comparison
- Author
-
Ayad, L.A.K. (Lorraine), Chikhi, R. (Rayan), and Pissis, S. (Solon)
- Abstract
Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results: The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
- Published
- 2023
3. SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads
- Author
-
Denti, L (Luca), Khorsand, P (Parsoa), Bonizzoni, P (Paola), Hormozdiari, F (Fereydoun), and Chikhi, R (Rayan)
- Abstract
Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS)—a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.
- Published
- 2023
4. Critical Assessment of Metagenome Interpretation: the second round of challenges
- Author
-
Meyer, F, Fritz, A, Deng, Z-L, Koslicki, D, Lesker, TR, Gurevich, A, Robertson, G, Alser, M, Antipov, D, Beghini, F, Bertrand, D, Brito, JJ, Brown, CT, Buchmann, J, Buluc, A, Chen, B, Chikhi, R, Clausen, PTLC, Cristian, A, Dabrowski, PW, Darling, AE, Egan, R, Eskin, E, Georganas, E, Goltsman, E, Gray, MA, Hansen, LH, Hofmeyr, S, Huang, P, Irber, L, Jia, H, Jorgensen, TS, Kieser, SD, Klemetsen, T, Kola, A, Kolmogorov, M, Korobeynikov, A, Kwan, J, LaPierre, N, Lemaitre, C, Li, C, Limasset, A, Malcher-Miranda, F, Mangul, S, Marcelino, VR, Marchet, C, Marijon, P, Meleshko, D, Mende, DR, Milanese, A, Nagarajan, N, Nissen, J, Nurk, S, Oliker, L, Paoli, L, Peterlongo, P, Piro, VC, Porter, JS, Rasmussen, S, Rees, ER, Reinert, K, Renard, B, Robertsen, EM, Rosen, GL, Ruscheweyh, H-J, Sarwal, V, Segata, N, Seiler, E, Shi, L, Sun, F, Sunagawa, S, Sorensen, SJ, Thomas, A, Tong, C, Trajkovski, M, Tremblay, J, Uritskiy, G, Vicedomini, R, Wang, Z, Warren, A, Willassen, NP, Yelick, K, You, R, Zeller, G, Zhao, Z, Zhu, S, Zhu, J, Garrido-Oter, R, Gastmeier, P, Hacquard, S, Haeussler, S, Khaledi, A, Maechler, F, Mesny, F, Radutoiu, S, Schulze-Lefert, P, Smit, N, Strowig, T, Bremges, A, Sczyrba, A, and McHardy, AC
- Abstract
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.
- Published
- 2022
5. Critical Assessment of Metagenome Interpretation - the second round of challenges
- Author
-
Meyer, F., Fritz, A., Deng, Z.-L., Koslicki, D., Gurevich, A., Robertson, G., Alser, M., Antipov, D., Beghini, F., Bertrand, D., Brito, J. J., Brown, C.T., Buchmann, J., Buluç, A., Chen, B., Chikhi, R., Clausen, P. T., Cristian, A., Dabrowski, P. W., Darling, A. E., Egan, R., Eskin, E., Georganas, E., Goltsman, E., Gray, M. A., Hansen, L. H., Hofmeyr, S., Huang, P., Irber, L., Jia, H., Jørgensen, T. S., Kieser, S. D., Klemetsen, T., Kola, A., Kolmogorov, M., Korobeynikov, A., Kwan, J., LaPierre, N., Lemaitre, C., Li, C., Limasset, A., Malcher-Miranda, F., Mangul, S., Marcelino, V. R., Marchet, C., Marijon, P., Meleshko, D., Mende, D. R., Milanese, A., Nagarajan, N., Nissen, J., Nurk, S., Oliker, L., Paoli, L., Peterlongo, P., Piro, V. C., Porter, J. S., Rasmussen, S., Rees, E. R., Reinert, K., Renard, B., Robertsen, E. M., Rosen, G. L., Ruscheweyh, H.-J., Sarwal, V., Segata, N., Seiler, E., Shi, L., Sun, F., Sunagawa, S., Sørensen, S. J., Thomas, A., Tong, C., Trajkovski, M., Tremblay, J., Uritskiy, G., Vicedomini, R., Wang, Zi., Wang, Zhe., Wang, Zho., Warren, A., Willassen, N. P., Yelick, K., You, R., Zeller, G., Zhao, Z., Zhu, S., Zhu, J., Garrido-Oter, R., Gastmeier, P., Hacquard, S., Häußler, S., Khaledi, A., Maechler, F., Mesny, F., Radutoiu, S., Schulze-Lefert, P., Smit, N., Strowig, T., Bremges, A., Sczyrba, A., and McHardy, A. C.
- Published
- 2021
6. Comparative genome analysis using sample-specific string detection in accurate long reads
- Author
-
Khorsand, P (Parsoa), Denti, L (Luca), Bonizzoni, P (Paola), Chikhi, R (Rayan), and Hormozdiari, F (Fereydoun)
- Abstract
Motivation: Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results: We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome ('sample-specific' strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach will give us the ability to perform comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data).
- Published
- 2021
7. Automated strain separation in low-complexity metagenomes using long reads
- Author
-
Vicedomini, R, Quince, C, Darling, AE, and Chikhi, R
- Published
- 2021
8. STRONG: metagenomics strain resolution on assembly graphs
- Author
-
Quince, C, Nurk, S, Raguideau, S, James, R, Soyer, OS, Summers, JK, Limasset, A, Eren, AM, Chikhi, R, and Darling, AE
- Abstract
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo, from multiple metagenome samples. STRONG performs coassembly, and binning into metagenome assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs and their unitig per-sample coverages, for individual single-copy core genes (SCGs) in each MAG, to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and abundances. STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads.
- Published
- 2021
9. Strainberry: automated strain separation in low-complexity metagenomes using long reads
- Author
-
Vicedomini, R, Quince, C, Darling, AE, and Chikhi, R
- Abstract
High-throughput short-read metagenomics has enabled large-scale species-level analysis and functional characterization of microbial communities. Microbiomes often contain multiple strains of the same species, and different strains have been shown to have important differences in their functional roles. Recent advances on long-read based methods enabled accurate assembly of bacterial genomes from complex microbiomes and an as-yet-unrealized opportunity to resolve strains. Here we present Strainberry, a metagenome assembly pipeline that performs strain separation in single-sample low-complexity metagenomes and that relies uniquely on long-read data. We benchmarked Strainberry on mock communities for which it produces strain-resolved assemblies with near-complete reference coverage and 99.9% base accuracy. We also applied Strainberry on real datasets for which it improved assemblies generating 20-118% additional genomic material than conventional metagenome assemblies on individual strain genomes. We show that Strainberry is also able to refine microbial diversity in a complex microbiome, with complete separation of strain genomes. We anticipate this work to be a starting point for further methodological improvements on strain-resolved metagenome assembly in environments of higher complexities.
- Published
- 2021
10. Automated strain separation in low-complexity metagenomes using long reads
- Author
-
Vicedomini, R., Quince, C., Darling, A. E., and Chikhi, R.
- Published
- 2021
11. Computational pan-genomics: Status, promises and challenges
- Author
-
Marschall, T. (Tobias), Marz, M. (Manja), Abeel, T. (Thomas), Dijkstra, L. (Louis), Dutilh, B.E. (Bas), Ghaffaari, A. (Ali), Kersey, P. (Paul), Kloosterman, W.P. (Wigard), Mäkinen, V. (Veli), Novak, A.M. (Adam M.), Paten, B. (Benedict), Porubsky, D. (David), Rivals, E. (Eric), Alkan, C. (Can), Baaijens, J.A. (Jasmijn A.), Bakker, P.I.W. (Paul) de, Boeva, V. (Valentina), Bonnal, R.J.P. (Raoul J.P.), Chiaromonte, F. (Francesca), Chikhi, R. (Rayan), Ciccarelli, F.D. (Francesca D.), Cijvat, R. (Robin), Datema, E. (Erwin), Duijn, C.M. (Cornelia) van, Eichler, E.E. (Evan), Ernst, C. (Corinna), Eskin, E. (E.), Garrison, E. (Erik), El-Kebir, M. (Mohammed), Klau, G.W. (Gunnar W.), Korbel, J.O. (Jan), Lameijer, E.-W. (Eric-Wubbo), Langmead, B. (Benjamin), Martin, M. (Marcel), Medvedev, P. (Paul), Mu, J.C. (John C.), Neerincx, P.B.T. (Pieter B T), Ouwens, K. (Klaasjan), Peterlongo, P. (Pierre), Pisanti, N. (Nadia), Rahmann, S. (S.), Raphael, B.J. (Benjamin J.), Reinert, K. (Knut), Ridder, D. (Dick) de, de Ridder, J. (Jeroen), Schlesner, M. (Matthias), Schulz-Trieglaff, O. (Ole), Sanders, A.D. (Ashley D.), Sheikhizadeh, S. (Siavash), Shneider, C. (Carl), Smit, S. (Sandra), Valenzuela, D. (Daniel), Wang, J. (Jiayin), Wessels, L. (Lodewyk), Zhang, Y. (Ying), Guryev, V. (Victor), Vandin, F. (Fabio), Ye, K. (Kai), and Schönhuth, A. (Alexander)
- Abstract
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations a
- Published
- 2018
12. Computational pan-genomics: status, promises and challenges.
- Author
-
Marschall, T., Marz, M., Abeel, T., Dijkstra, L., Dutilh, B.E., Ghaffaari, A., Kersey, P., Kloosterman, W.P., Mäkinen, V., Novak, A.M., Paten, B., Porubsky, D., Rivals, E., Alkan, C., Baaijens, J., Bakker, P.I. de, Boeva, V., Bonnal, R.J., Chiaromonte, F., Chikhi, R., Ciccarelli, F.D., Cijvat, R., Datema, E., Duijn, C.M. van, Eichler, E.E., Ernst, C., Eskin, E., Garrison, E., El-Kebir, M., Klau, G.W., Korbel, J.O., Lameijer, E.W., Langmead, B., Martin, M., Medvedev, P., Mu, J.C., Neerincx, P., Ouwens, K., Peterlongo, P., Pisanti, N., Rahmann, S., Raphael, B., Reinert, K., Ridder, D. de, Ridder, J. de, Schlesner, M., Schulz-Trieglaff, O., Sanders, A.D., Sheikhizadeh, S., Shneider, C., Smit, S., Valenzuela, D., Wang, J, Wessels, L., Zhang, Y, Guryev, V., Vandin, F., Ye, K., and Schönhuth, A.
- Abstract
Contains fulltext: 190288.pdf (publisher's version) (Open Access)
- Published
- 2018
13. Dualities in tree representations
- Author
-
Chikhi, R. (Rayan) and Schönhuth, A. (Alexander)
- Abstract
A characterization of the tree $T^*$ such that $\mathrm{BP}(T^*) = \overleftrightarrow{\mathrm{DFUDS}(T)}$, the reversal of $\mathrm{DFUDS}(T)$, is given. An immediate consequence is a rigorous characterization of the tree $\hat{T}$ such that $\mathrm{BP}(\hat{T}) = \mathrm{DFUDS}(\hat{T})$. In summary, BP and DFUDS are unified within an encompassing framework, which might have the potential to imply future simplifications with regard to queries in BP and/or DFUDS. Immediate benefits displayed here are to identify so far unnoted commonalities in most recent work on the Range Minimum Query problem, and to provide improvements for the Minimum Length Interval Query problem.
- Published
- 2018
14. Computational pan-genomics: status, promises and challenges
- Author
-
The Computational Pan-Genomics Consortium, Marschall, T. (Tobias), Marz, M. (Manja), Abeel, T. (Thomas), Dijkstra, L.J. (Louis), Dutilh, B.E. (Bas), Ghaffaari, A. (Ali), Kersey, P. (Paul), Kloosterman, W.P. (Wigard), Mäkinen, V. (Veli), Novak, A.M. (Adam), Paten, B. (Benedict), Porubsky, D. (David), Rivals, E. (Eric), Alkan, C. (Can), Baaijens, J.A. (Jasmijn), Bakker, P.I.W. (Paul) de, Boeva, V. (Valentina), Bonnal, R.J.P. (Raoul), Chiaromonte, F. (Francesca), Chikhi, R. (Rayan), Ciccarelli, F.D. (Francesca), Cijvat, C.P. (Robin), Datema, E. (Erwin), Duijn, C.M. (Cornelia) van, Eichler, E.E. (Evan), Ernst, C. (Corinna), Eskin, E. (Eleazar), Garrison, E. (Erik), El-Kebir, M. (Mohammed), Klau, G.W. (Gunnar), Korbel, J.O. (Jan), Lameijer, E.-W. (Eric-Wubbo), Langmead, B. (Benjamin), Martin, M. (Marcel), Medvedev, P. (Paul), Mu, J.C. (John), Neerincx, P.B.T. (Pieter), Ouwens, K. (Klaasjan), Peterlongo, P. (Pierre), Pisanti, N. (Nadia), Rahmann, S. (Sven), Raphael, B.J. (Benjamin), Reinert, K. (Knut), Ridder, D. (Dick) de, Ridder, J. (Jeroen) de, Schlesner, M. (Matthias), Schulz-Trieglaff, O. (Ole), Sanders, A.D. (Ashley), Sheikhizadeh, S. (Siavash), Shneider, C. (Carl), Smit, S. (Sandra), Valenzuela, D. (Daniel), Wang, J. (Jiayin), Wessels, L.F.A. (Lodewyk), Zhang, Y. (Ying), Guryev, V. (Victor), Vandin, F. (Fabio), Ye, K. (Kai), and Schönhuth, A. (Alexander)
- Abstract
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
- Published
- 2018
15. Computational pan-genomics: status, promises and challenges
- Author
-
Marschall, T, Marz, M, Abeel, T, Dijkstra, L, Dutilh, BE, Ghaffaari, A, Kersey, P, Kloosterman, WP, Makinen, V, Novak, AM, Paten, B, Porubsky, D, Rivals, E, Alkan, C, Baaijens, J A, de Bakker, PIW, Boeva, V, Bonnal, RJP, Chiaromonte, F, Chikhi, R, Ciccarelli, FD, Cijvat, R, Datema, E, Duijn, Cornelia, Eichler, EE, Ernst, C, Eskin, E, Garrison, E, El-Kebir, M, Klau, GW, Korbel, JO, Lameijer, EW, Langmead, B, Martin, M, Medvedev, P, Mu, JC, Neerincx, P, Ouwens, K, Peterlongo, P, Pisanti, N, Rahmann, S, Raphael, B, Reinert, K, Ridder, D, Ridder, J (Jannemarie), Schlesner, M, Schulz-Trieglaff, O, Sanders, AD, Sheikhizadeh, S, Shneider, C, Smit, S, Valenzuela, D, Wang, JY, Wessels, L, Zhang, Y, Guryev, V, Vandin, F, Ye, K, and Schonhuth, A
- Published
- 2018
16. Critical Assessment of Metagenome Interpretation - A benchmark of metagenomics software
- Author
-
Sczyrba, A, Hofmann, P, Belmann, P, Koslicki, D, Janssen, S, Dröge, J, Gregor, I, Majda, S, Fiedler, J, Dahms, E, Bremges, A, Fritz, A, Garrido-Oter, R, Jørgensen, TS, Shapiro, N, Blood, PD, Gurevich, A, Bai, Y, Turaev, D, Demaere, MZ, Chikhi, R, Nagarajan, N, Quince, C, Meyer, F, Balvočiutė, M, Hansen, LH, Sørensen, SJ, Chia, BKH, Denis, B, Froula, JL, Wang, Z, Egan, R, Don Kang, D, Cook, JJ, Deltel, C, Beckstette, M, Lemaitre, C, Peterlongo, P, Rizk, G, Lavenier, D, Wu, YW, Singer, SW, Jain, C, Strous, M, Klingenberg, H, Meinicke, P, Barton, MD, Lingner, T, Lin, HH, Liao, YC, Silva, GGZ, Cuevas, DA, Edwards, RA, Saha, S, Piro, VC, Renard, BY, Pop, M, Klenk, HP, Göker, M, Kyrpides, NC, Woyke, T, Vorholt, JA, Schulze-Lefert, P, Rubin, EM, Darling, AE, Rattei, T, and McHardy, AC
- Abstract
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
- Published
- 2017
17. GATB: Toolbox for developing efficient NGS software
- Author
-
Drezen, Erwan, Rizk, G, Chikhi, R, Deltel, Charles, Lemaitre, C, Peterlongo, P, Lavenier, D, Scalable, Optimized and Parallel Algorithms for Genomics (GenScale), Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-GESTION DES DONNÉES ET DE LA CONNAISSANCE (IRISA-D7), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA), Dept. 
of Computer Science and Engineering, Pennsylvania State University (Penn State), Penn State System-Penn State System, ANR-12-EMMA- 0019-01, Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), and ANR-12-EMMA-0019,GATB,Boite à outils ' Assemblage pour la Génomique '(2012)
- Subjects
[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS], [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]
- Abstract
The analysis of NGS data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (Suffix Array, Burrows-Wheeler transform, Bloom Filter, etc.). Mapper tools, genome assemblers, SNP callers, etc., make intensive use of these data structures to keep their memory footprint as low as possible. The overall efficiency of NGS software comes from a smart combination of how data are represented in computer memory and how they are processed by the available processing units inside a processor. Developing such software is thus a real challenge, as it requires a broad range of skills, from high-level data structure and algorithm concepts to tiny implementation details. The GATB software toolbox aims to ease the design of NGS algorithms. It offers a panel of high-level optimized building blocks to speed up the development of NGS tools related to genome assembly and/or genome analysis. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computer, small server) with a few GB of memory. From a high-level C++ API, NGS software designers can rapidly build their own software based on state-of-the-art algorithms and data structures of the domain.
- Published
- 2014
18. Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species
- Author
-
Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J. A., Chapuis, G., Chikhi, R., Chitsaz, H., Chou, W. -C, Corbeil, J., Fabbro, C. D., Docking, T. R., Durbin, R., Earl, D., Emrich, S., Fedotov, P., Fonseca, N. A., Ganapathy, G., Gibbs, R. A., Gnerre, S., Godzaridis, E., Goldstein, S., Haimel, M., Hall, G., Haussler, D., Hiatt, J. B., Ho, I. Y., Howard, J., Hunt, M., Jackman, S. D., Jaffe, D. B., Jarvis, E. D., Jiang, H., Kazakov, S., Kersey, P. J., Kitzman, J. O., Knight, J. R., Koren, S., Lam, T. -W, Lavenier, D., Laviolette, F., Li, Y., Li, Z., Liu, B., Liu, Y., Luo, R., MacCallum, I., MacManes, M. D., Maillet, N., Melnikov, S., Naquin, D., Ning, Z., Otto, T. D., Paten, B., Paulo, O. S., Phillippy, A. M., Pina-Martins, F., Place, M., Przybylski, D., Qin, X., Qu, C., Ribeiro, F. J., Richards, S., Rokhsar, D. S., Ruby, J. G., Scalabrin, S., Schatz, M. C., Schwartz, D. C., Sergushichev, A., Sharpe, T., Shaw, T. I., Shendure, J., Shi, Y., Simpson, J. T., Song, H., Tsarev, F., Vezzi, F., Vicedomini, R., Vieira, B. M., Wang, J., Worley, K. C., Yin, S., Yiu, S. -M, Yuan, J., Zhang, G., Zhang, H., Zhou, S., and Korf, I. F.
- Abstract
Background: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
- Published
- 2013
- Full Text
- View/download PDF
19. Assemblathon 1: A competitive assessment of de novo short read assembly methods
- Author
-
Earl, D, Bradnam, K, St. John, J, Darling, A, Lin, D, Fass, J, Yu, HOK, Buffalo, V, Zerbino, DR, Diekhans, M, Nguyen, N, Ariyaratne, PN, Sung, WK, Ning, Z, Haimel, M, Simpson, JT, Fonseca, NA, Birol, I, Docking, TR, Ho, IY, Rokhsar, DS, Chikhi, R, Lavenier, D, Chapuis, G, Naquin, D, Maillet, N, Schatz, MC, Kelley, DR, Phillippy, AM, Koren, S, Yang, SP, Wu, W, Chou, WC, Srivastava, A, Shaw, TI, Ruby, JG, Skewes-Cox, P, Betegon, M, Dimon, MT, Solovyev, V, Seledtsov, I, Kosarev, P, Vorobyev, D, Ramirez-Gonzalez, R, Leggett, R, MacLean, D, Xia, F, Luo, R, Li, Z, Xie, Y, Liu, B, Gnerre, S, MacCallum, I, Przybylski, D, Ribeiro, FJ, Sharpe, T, Hall, G, Kersey, PJ, Durbin, R, Jackman, SD, Chapman, JA, Huang, X, DeRisi, JL, Caccamo, M, Li, Y, Jaffe, DB, Green, RE, Haussler, D, Korf, I, and Paten, B
- Abstract
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype-aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark (1) it is possible to assemble the genome to a high level of coverage and accuracy, and (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies, is now public and freely available from http://www.assemblathon.org/. © 2011 by Cold Spring Harbor Laboratory Press.
- Published
- 2011
20. Optimal Control for Anti-Braking System.
- Author
-
Chikhi, R., El Hadri, A., and Cadiou, J.C.
- Published
- 2005
- Full Text
- View/download PDF
21. Comparative genome analysis using sample-specific string detection in accurate long reads
- Author
-
Rayan Chikhi, Luca Denti, Parsoa Khorsand, Fereydoun Hormozdiari, Paola Bonizzoni, Stamatakis, Alexandros, Khorsand, P, Denti, L, Bonizzoni, P, Chikhi, R, and Hormozdiari, F
- Subjects
education.field_of_study, business.industry, Computer science, Human Genome, String (computer science), Population, Bioengineering, Pattern recognition, Sample (statistics), Genomics, General Medicine, Computational biology, Sample (graphics), Genome, Genetics, Long reads, FM-index, structural variant, Artificial intelligence, Human Genome Structural Variant Consortium, business, education, FM-index, Biotechnology, Segmental duplication, Reference genome
- Abstract
Motivation: Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results: We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome (‘sample-specific’ strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach will give us the ability to perform comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data). Availability and implementation: Data, code and instructions for reproducing the results presented in this manuscript are publicly available at https://github.com/Parsoa/PingPong. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
- Published
- 2021
22. SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads
- Author
-
Luca Denti, Parsoa Khorsand, Paola Bonizzoni, Fereydoun Hormozdiari, Rayan Chikhi, Algorithmes pour les séquences biologiques - Sequence Bioinformatics, Institut Pasteur [Paris] (IP), University of California [Davis] (UC Davis), University of California (UC), Università degli Studi di Milano-Bicocca = University of Milano-Bicocca (UNIMIB), This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grants agreements No. 872539 and 956229 (P.B. and R.C.). This work has also been supported in part by NSF award DBI-2042518 to F.H. R.C. was supported by ANR Transipedia, SeqDigger, GenoPIM, Inception and PRAIRIE grants (ANR-18-CE45-0020, ANR-19-CE45-0008, ANR-21-CE46-0012, PIA/ANR16-CONV-0005, and ANR-19-P3IA-0001). This project has received funding from the European Union’s Horizon Europe programme for research and innovation under grant agreement No. 101047160., ANR-18-CE45-0020,Transipedia,Signatures transcriptionnelles pour une analyse RNA-seq globale(2018), ANR-19-CE45-0008,SeqDigger,Moteur de recherche de données de séquençage en génomique environnementale(2019), ANR-21-CE46-0012,GenoPIM,Processing-in-Memory pour la génomique(2021), ANR-16-CONV-0005,INCEPTION,Institut Convergences pour l'étude de l'Emergence des Pathologies au Travers des Individus et des populatiONs(2016), ANR-19-P3IA-0001,PRAIRIE,PaRis Artificial Intelligence Research InstitutE(2019), European Project: 872539,H2020-EU.1.3. - EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions, H2020-EU.1.3.3. - Stimulating innovation by means of cross-fertilisation of knowledge,H2020-MSCA-RISE-2019,PANGAIA(2020), European Project: 956229,H2020-EU.1.3. - EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions,ALPACA(2021), Denti, L, Khorsand, P, Bonizzoni, P, Hormozdiari, F, and Chikhi, R
- Subjects
[SDV]Life Sciences [q-bio], Bioinformatics, Sequence Analysis, Structural Variations, PacBio HiFi, Cell Biology, Molecular Biology, Biochemistry, Biotechnology
- Abstract
Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS), a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.
- Full Text
- View/download PDF
23. High-quality metagenome assembly from long accurate reads with metaMDBG.
- Author
-
Benoit G, Raguideau S, James R, Phillippy AM, Chikhi R, and Quince C
- Subjects
- Sequence Analysis, DNA methods, Algorithms, High-Throughput Nucleotide Sequencing methods, Software, Metagenome genetics, Metagenomics methods
- Abstract
We introduce metaMDBG, a metagenomics assembler for PacBio HiFi reads. MetaMDBG combines a de Bruijn graph assembly in a minimizer space with an iterative assembly over sequences of minimizers to address variations in genome coverage depth and an abundance-based filtering strategy to simplify strain complexity. For complex communities, we obtained up to twice as many high-quality circularized prokaryotic metagenome-assembled genomes as existing methods and had better recovery of viruses and plasmids. © 2024. The Author(s).
- Published
- 2024
- Full Text
- View/download PDF
24. Reference-free structural variant detection in microbiomes via long-read co-assembly graphs.
- Author
-
Curry KD, Yu FB, Vance SE, Segarra S, Bhaya D, Chikhi R, Rocha EPC, and Treangen TJ
- Subjects
- Metagenomics methods, Gene Transfer, Horizontal, Bacteria genetics, Algorithms, Microbiota genetics, Metagenome, Genome, Bacterial
- Abstract
Motivation: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. Results: We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. Availability and Implementation: rhea is open source and available at: https://github.com/treangenlab/rhea. © The Author(s) 2024. Published by Oxford University Press.
- Published
- 2024
- Full Text
- View/download PDF
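The coverage comparison described in the abstract above can be sketched as follows. This is a minimal illustration, not rhea's implementation: the function names, the use of base-2 logarithms, the pseudocount, and the threshold of 1.0 are all assumptions made for the example.

```python
import math


def log_fold_changes(coverage, pseudocount=1.0):
    """Log2 fold change in coverage between each pair of successive samples.

    `coverage` holds one graph node's coverage per sample, in series order.
    The pseudocount (an assumption here) avoids division by zero.
    """
    return [
        math.log2((after + pseudocount) / (before + pseudocount))
        for before, after in zip(coverage, coverage[1:])
    ]


def classify(lfc, threshold=1.0):
    """Label a change as thriving, declining, or stable (hypothetical threshold)."""
    if lfc >= threshold:
        return "thriving"
    if lfc <= -threshold:
        return "declining"
    return "stable"


# Coverage of one node across three successive samples.
changes = log_fold_changes([1, 3, 7])
labels = [classify(c) for c in changes]
```

With a pseudocount of 1, the series 1, 3, 7 doubles at each step, so both changes equal 1.0 and are labelled "thriving" under the assumed threshold.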
25. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression.
- Author
-
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, and Baym M
- Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
- Published
- 2024
- Full Text
- View/download PDF
26. Petabase-Scale Homology Search for Structure Prediction.
- Author
-
Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, Babaian A, Kryshtafovych A, and Steinegger M
- Subjects
- Sequence Alignment, Protein Conformation, Software, Algorithms, Sequence Analysis, Protein methods, Computational Biology methods, Proteins chemistry
- Abstract
The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction. Copyright © 2024 Cold Spring Harbor Laboratory Press; all rights reserved.
- Published
- 2024
- Full Text
- View/download PDF
27. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.
- Author
-
Lemane T, Lezzoche N, Lecubin J, Pelletier E, Lescot M, Chikhi R, and Peterlongo P
- Subjects
- Oceans and Seas, Metagenome genetics, Databases, Nucleic Acid, Genomics, Seawater
- Abstract
Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset. © 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.
- Published
- 2024
- Full Text
- View/download PDF
28. Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs.
- Author
-
Curry KD, Yu FB, Vance SE, Segarra S, Bhaya D, Chikhi R, Rocha EPC, and Treangen TJ
- Abstract
Bacterial genome dynamics are vital for understanding the mechanisms underlying microbial adaptation, growth, and their broader impact on host phenotype. Structural variants (SVs), genomic alterations of 10 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing a single metagenome coassembly graph constructed from all samples in a series. The log fold change in graph coverage between subsequent samples is then calculated to call SVs that are thriving or declining throughout the series. We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, which is particularly noticeable as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between subsequent time and temperature samples, suggesting host advantage. Our innovative approach leverages raw read patterns rather than references or MAGs to include all sequencing reads in analysis, and thus provides versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial genome dynamics. Competing Interests: No competing interest is declared.
- Published
- 2024
- Full Text
- View/download PDF
29. The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies.
- Author
-
Willink B, Tunström K, Nilén S, Chikhi R, Lemane T, Takahashi M, Takahashi Y, Svensson EI, and Wheat CW
- Subjects
- Animals, Female, Male, Polymorphism, Genetic, Genomics, Odonata genetics
- Abstract
Sex-limited morphs can provide profound insights into the evolution and genomic architecture of complex phenotypes. Inter-sexual mimicry is one particular type of sex-limited polymorphism in which a novel morph resembles the opposite sex. While inter-sexual mimics are known in both sexes and a diverse range of animals, their evolutionary origin is poorly understood. Here, we investigated the genomic basis of female-limited morphs and male mimicry in the common bluetail damselfly. Differential gene expression between morphs has been documented in damselflies, but no causal locus has been previously identified. We found that male mimicry originated in an ancestrally sexually dimorphic lineage in association with multiple structural changes, probably driven by transposable element activity. These changes resulted in ~900 kb of novel genomic content that is partly shared by male mimics in a close relative, indicating that male mimicry is a trans-species polymorphism. More recently, a third morph originated following the translocation of part of the male-mimicry sequence into a genomic position ~3.5 Mb apart. We provide evidence of balancing selection maintaining male mimicry, in line with previous field population studies. Our results underscore how structural variants affecting a handful of potentially regulatory genes and morph-specific genes can give rise to novel and complex phenotypic polymorphisms. © 2023. The Author(s).
- Published
- 2024
- Full Text
- View/download PDF
30. Comparing methods for constructing and representing human pangenome graphs.
- Author
-
Andreace F, Lechat P, Dufresne Y, and Chikhi R
- Subjects
- Humans, Sequence Analysis, DNA methods, Genomics methods, Genome, Algorithms, Software
- Abstract
Background: As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. Results: In this work, we collect all publicly available high-quality human haplotypes and construct the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. Conclusion: This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application. © 2023. The Author(s).
- Published
- 2023
- Full Text
- View/download PDF
31. decOM: similarity-based microbial source tracking of ancient oral samples using k-mer-based methods.
- Author
-
Duitama González C, Vicedomini R, Lemane T, Rascovan N, Richard H, and Chikhi R
- Subjects
- Animals, Humans, Metagenome, Metagenomics methods
- Abstract
Background: The analysis of ancient oral metagenomes from archaeological human and animal samples is largely confounded by contaminant DNA sequences from modern and environmental sources. Existing methods for Microbial Source Tracking (MST) estimate the proportions of environmental sources, but do not perform well on ancient metagenomes. We developed a novel method called decOM for Microbial Source Tracking and classification of ancient and modern metagenomic samples using k-mer matrices. Results: We analysed a collection of 360 ancient oral, modern oral, sediment/soil and skin metagenomes, using stratified five-fold cross-validation. decOM estimates the contributions of these source environments in ancient oral metagenomic samples with high accuracy, outperforming two state-of-the-art methods for source tracking, FEAST and mSourceTracker. Conclusions: decOM is a high-accuracy microbial source tracking method, suitable for ancient oral metagenomic data sets. The decOM method is generic and could also be adapted for MST of other ancient and modern types of metagenomes. We anticipate that decOM will be a valuable tool for MST of ancient metagenomic studies. © 2023. The Author(s).
- Published
- 2023
- Full Text
- View/download PDF
32. aKmerBroom: Ancient oral DNA decontamination using Bloom filters on k-mer sets.
- Author
-
Duitama González C, Rangavittal S, Vicedomini R, Chikhi R, and Richard H
- Abstract
Dental calculus samples are modeled as a mixture of DNA coming from dental plaque and contaminants. Current computational decontamination methods such as Recentrifuge and DeconSeq require either a reference database or sequenced negative controls, and therefore have limited use cases. We present aKmerBroom, a reference-free decontamination tool tailored for the removal of contaminant DNA from ancient oral samples. Our tool builds a Bloom filter of known ancient and modern oral k-mers, then scans an input set of ancient metagenomic reads using multiple passes to iteratively retain reads likely to be of oral origin. On synthetic data, aKmerBroom achieves over 89.53% sensitivity and 94.00% specificity. On real datasets, aKmerBroom shows higher read retention (+60% on average) than other methods. We anticipate aKmerBroom will be a valuable tool for the processing of ancient oral samples as it will prevent contaminated datasets from being completely discarded in downstream analyses. Competing Interests: The authors declare no competing interests. © 2023 The Authors.
- Published
- 2023
- Full Text
- View/download PDF
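The Bloom-filter approach described in the aKmerBroom abstract above can be illustrated with a small sketch. It is not the tool's actual code: the class and function names, the filter size, the k-mer length, the 50% retention threshold, and the choice to add retained reads' k-mers back into the filter between passes are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def retain_reads(reads, trusted_kmers, k=5, min_fraction=0.5, passes=2):
    """Keep reads whose k-mers mostly hit the filter; repeat for several passes.

    After a read is kept, its k-mers are added to the filter so later passes
    can rescue overlapping reads (an assumed reading of "iteratively retain").
    """
    bloom = BloomFilter()
    for kmer in trusted_kmers:
        bloom.add(kmer)
    kept = set()
    for _ in range(passes):
        for idx, read in enumerate(reads):
            if idx in kept:
                continue
            read_kmers = kmers(read, k)
            if not read_kmers:
                continue
            hits = sum(kmer in bloom for kmer in read_kmers)
            if hits / len(read_kmers) >= min_fraction:
                kept.add(idx)
                for kmer in read_kmers:
                    bloom.add(kmer)
    return [reads[i] for i in sorted(kept)]


trusted = kmers("ACGTACGTAC", 5)
kept_reads = retain_reads(["ACGTACGTAC", "TTTTTTTTTT"], trusted)
```

A Bloom filter never reports a stored k-mer as absent, but it can report an absent k-mer as present; the filter size and number of hash functions control how often that happens.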
33. Seedability: optimizing alignment parameters for sensitive sequence comparison.
- Author
-
Ayad LAK, Chikhi R, and Pissis SP
- Abstract
Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results: The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. Availability and Implementation: https://github.com/lorrainea/Seedability (distributed under GPL v3.0). Competing Interests: None declared. © The Author(s) 2023. Published by Oxford University Press.
- Published
- 2023
- Full Text
- View/download PDF
34. Petascale Homology Search for Structure Prediction.
- Author
-
Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, Babaian A, Kryshtafovych A, and Steinegger M
- Abstract
The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
- Published
- 2023
- Full Text
- View/download PDF
35. Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs.
- Author
-
Benoit G, Raguideau S, James R, Phillippy AM, Chikhi R, and Quince C
- Abstract
We introduce a novel metagenomics assembler for high-accuracy long reads. Our approach, implemented as metaMDBG, combines highly efficient de Bruijn graph assembly in minimizer space, with both a multi-k approach for dealing with variations in genome coverage depth and an abundance-based filtering strategy for simplifying strain complexity. The resulting algorithm is more efficient than the state-of-the-art but with better assembly results. metaMDBG was 1.5 to 12 times faster than competing assemblers and requires between one-tenth and one-thirtieth of the memory across a range of data sets. We obtained up to twice as many high-quality circularised prokaryotic metagenome assembled genomes (MAGs) on the most complex communities, and a better recovery of viruses and plasmids. metaMDBG performs particularly well for abundant organisms whilst being robust to the presence of strain diversity. The result is that for the first time it is possible to efficiently reconstruct the majority of complex communities by abundance as near-complete MAGs., Competing Interests: The authors declare no competing interests.
- Published
- 2023
- Full Text
- View/download PDF
36. Efficient mapping of accurate long reads in minimizer space with mapquik.
- Author
-
Ekim B, Sahlin K, Medvedev P, Berger B, and Chikhi R
- Subjects
- Humans, Algorithms, Sequence Analysis, DNA, Genome, Human, Software, High-Throughput Nucleotide Sequencing
- Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data., (© 2023 Ekim et al.; Published by Cold Spring Harbor Laboratory Press.)
- Published
- 2023
- Full Text
- View/download PDF
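The mapquik abstract above describes seeds made of k consecutively sampled minimizers (k-min-mers), with only those occurring once in the reference being indexed. A rough sketch of that idea follows. The sampling rule (keep an l-mer when its hash falls below a density threshold), the hash function, and every name here are assumptions for illustration, not mapquik's actual implementation.

```python
import hashlib
from collections import Counter


def lmer_hash(lmer: str) -> int:
    """Deterministic 64-bit hash of an l-mer (illustrative choice of hash)."""
    return int.from_bytes(hashlib.blake2b(lmer.encode(), digest_size=8).digest(), "big")


def sample_minimizers(seq: str, l: int, density: float) -> list[tuple[int, str]]:
    """Keep (position, l-mer) pairs whose hash is below density * 2**64."""
    threshold = int(density * 2**64)
    return [(i, seq[i:i + l]) for i in range(len(seq) - l + 1)
            if lmer_hash(seq[i:i + l]) < threshold]


def k_min_mers(minimizers: list[tuple[int, str]], k: int) -> list[tuple[str, ...]]:
    """Turn every run of k consecutive sampled minimizers into one seed."""
    return [tuple(m for _, m in minimizers[i:i + k])
            for i in range(len(minimizers) - k + 1)]


def unique_seeds(seeds: list[tuple[str, ...]]) -> set[tuple[str, ...]]:
    """Keep only the seeds that occur exactly once, as candidates for indexing."""
    counts = Counter(seeds)
    return {seed for seed, n in counts.items() if n == 1}


# density=1.0 keeps every l-mer, which makes this toy run easy to follow by hand.
seeds = k_min_mers(sample_minimizers("ACGTAC", l=2, density=1.0), k=2)
print(seeds)                # consecutive pairs of 2-mers
print(unique_seeds(seeds))  # all four pairs are unique in this toy reference
```

With a realistic density well below 1, each seed spans a long stretch of sequence while being compared as a short tuple of minimizers, which is the intuition behind longer yet still sensitive seeds.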
37. Hybrids of RNA viruses and viroid-like elements replicate in fungi.
- Author
-
Forgia M, Navarro B, Daghino S, Cervera A, Gisel A, Perotto S, Aghayeva DN, Akinyuwa MF, Gobbi E, Zheludev IN, Edgar RC, Chikhi R, Turina M, Babaian A, Di Serio F, and de la Peña M
- Subjects
- RNA, Viral genetics, Virus Replication genetics, RNA genetics, RNA-Dependent RNA Polymerase genetics, Fungi genetics, Viroids genetics, RNA, Catalytic genetics, RNA Viruses genetics
- Abstract
Earth's life may have originated as self-replicating RNA, and it has been argued that RNA viruses and viroid-like elements are remnants of such pre-cellular RNA world. RNA viruses are defined by linear RNA genomes encoding an RNA-dependent RNA polymerase (RdRp), whereas viroid-like elements consist of small, single-stranded, circular RNA genomes that, in some cases, encode paired self-cleaving ribozymes. Here we show that the number of candidate viroid-like elements occurring in geographically and ecologically diverse niches is much higher than previously thought. We report that, amongst these circular genomes, fungal ambiviruses are viroid-like elements that undergo rolling circle replication and encode their own viral RdRp. Thus, ambiviruses are distinct infectious RNAs showing hybrid features of viroid-like RNAs and viruses. We also detected similar circular RNAs, containing active ribozymes and encoding RdRps, related to mitochondrial-like fungal viruses, highlighting fungi as an evolutionary hub for RNA viruses and viroid-like elements. Our findings point to a deep co-evolutionary history between RNA viruses and subviral elements and offer new perspectives in the origin and evolution of primordial infectious agents, and RNA life., (© 2023. The Author(s).)
- Published
- 2023
- Full Text
- View/download PDF
38. SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads.
- Author
-
Denti L, Khorsand P, Bonizzoni P, Hormozdiari F, and Chikhi R
- Subjects
- Humans, Sequence Analysis, DNA methods, Genome, Human, Repetitive Sequences, Nucleic Acid, High-Throughput Nucleotide Sequencing methods, Genomics methods
- Abstract
Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS), a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome., (© 2022. The Author(s), under exclusive licence to Springer Nature America, Inc.)
- Published
- 2023
- Full Text
- View/download PDF
39. kmdiff, large-scale and user-friendly differential k-mer analyses.
- Author
-
Lemane T, Chikhi R, and Peterlongo P
- Subjects
- Sequence Analysis, DNA, Genome-Wide Association Study, Genotype, Software, Algorithms
- Abstract
Summary: Genome wide association studies elucidate links between genotypes and phenotypes. Recent studies point out the interest of conducting such experiments using k-mers as the base signal instead of single-nucleotide polymorphisms. We propose a tool, kmdiff, that performs differential k-mer analyses on large sequencing cohorts in an order of magnitude less time and memory than previously possible., Availability and Implementation: https://github.com/tlemane/kmdiff., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2022. Published by Oxford University Press.)
- Published
- 2022
- Full Text
- View/download PDF
40. Draft genome of the lowland anoa (Bubalus depressicornis) and comparison with buffalo genome assemblies (Bovidae, Bubalina).
- Author
-
Porrelli S, Gerbault-Seureau M, Rozzi R, Chikhi R, Curaudeau M, Ropiquet A, and Hassanin A
- Subjects
- Animals, Genomics, Base Sequence, Repetitive Sequences, Nucleic Acid, Buffaloes genetics, Genome
- Abstract
Genomic data for wild species of the genus Bubalus (Asian buffaloes) are still lacking while several whole genomes are currently available for domestic water buffaloes. To address this, we sequenced the genome of a wild endangered dwarf buffalo, the lowland anoa (Bubalus depressicornis), produced a draft genome assembly and made comparison to published buffalo genomes. The lowland anoa genome assembly was 2.56 Gbp long and contained 103,135 contigs, the longest contig being 337.39 kbp long. N50 and L50 values were 38.73 and 19.83 kbp, respectively, mean coverage was 44× and GC content was 41.74%. Two strategies were adopted to evaluate genome completeness: (1) determination of genomic features with de novo and homology-based predictions using annotations of chromosome-level genome assembly of the river buffalo and (2) employment of benchmarking against universal single-copy orthologs (BUSCO). Homology-based predictions identified 94.51% complete and 3.65% partial genomic features. De novo gene predictions identified 32,393 genes, representing 97.14% of the reference's annotated genes, whilst BUSCO search against the mammalian orthologs database identified 71.1% complete, 11.7% fragmented, and 17.2% missing orthologs, indicating a good level of completeness for downstream analyses. Repeat analyses indicated that the lowland anoa genome contains 42.12% of repetitive regions. The genome assembly of the lowland anoa is expected to contribute to comparative genome analyses among bovid species., (© The Author(s) 2022. Published by Oxford University Press on behalf of Genetics Society of America.)
- Published
- 2022
- Full Text
- View/download PDF
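The anoa assembly abstract above reports N50 and L50 contiguity statistics. As a reminder of how these are usually defined, here is a small sketch under the common convention that N50 is the length of the contig at which the running total of the longest contigs first reaches half the assembly size, and L50 is the number of contigs needed to get there. The function name and the toy contig lengths are illustrative and are not taken from the paper.

```python
def n50_l50(contig_lengths: list[int]) -> tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating their lengths.
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("no contigs given")


# Hypothetical contig lengths (in kbp) for a toy assembly of 290 kbp.
n50, l50 = n50_l50([80, 70, 50, 40, 30, 20])
print(n50, l50)
```

Here the two longest contigs (80 + 70 = 150 kbp) already cover more than half of the 290 kbp total, so N50 is 70 kbp and L50 is 2.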
41. Mapping-friendly sequence reductions: Going beyond homopolymer compression.
- Author
-
Blassel L, Medvedev P, and Chikhi R
- Abstract
Sequencing errors continue to pose algorithmic challenges to methods working with sequencing data. One of the simplest and most prevalent techniques for ameliorating the detrimental effects of homopolymer expansion/contraction errors present in long reads is homopolymer compression. It collapses runs of repeated nucleotides, to remove some sequencing errors and improve mapping sensitivity. Though our intuitive understanding justifies why homopolymer compression works, it in no way implies that it is the best transformation that can be done. In this paper, we explore if there are transformations that can be applied in the same pre-processing manner as homopolymer compression that would achieve better alignment sensitivity. We introduce a more general framework than homopolymer compression, called mapping-friendly sequence reductions. We transform the reference and the reads using these reductions and then apply an alignment algorithm. We demonstrate that some mapping-friendly sequence reductions lead to improved mapping accuracy, outperforming homopolymer compression., Competing Interests: The authors declare no competing interests., (© 2022 The Author(s).)
- Published
- 2022
- Full Text
- View/download PDF
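Homopolymer compression, the baseline transformation in the abstract above, collapses each run of identical nucleotides into a single character. A minimal sketch is given below; the function name is an illustrative choice, and the more general mapping-friendly sequence reductions studied in the paper are not reproduced here.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of repeated characters to one occurrence."""
    return "".join(base for base, _run in groupby(seq))


read = "AAACCGTTTT"
print(homopolymer_compress(read))  # runs AAA, CC, G, TTTT each become one base
```

Applying the same function to both the reference and the reads before alignment removes length differences inside homopolymer runs, which is why expansion or contraction errors in those runs stop affecting the mapping.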
42. The K-mer File Format: a standardized and compact disk representation of sets of k-mers.
- Author
-
Dufresne Y, Lemane T, Marijon P, Peterlongo P, Rahman A, Kokot M, Medvedev P, Deorowicz S, and Chikhi R
- Subjects
- Sequence Analysis, DNA, Compact Disks, Software, Algorithms
- Abstract
Summary: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools., Availability and Implementation: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2022. Published by Oxford University Press.)
- Published
- 2022
- Full Text
- View/download PDF
43. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections.
- Author
-
Lemane T, Medvedev P, Chikhi R, and Peterlongo P
- Abstract
Summary: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset., Availability and Implementation: https://github.com/tlemane/kmtricks., Supplementary Information: Supplementary data are available at Bioinformatics Advances online., (© The Author(s) 2022. Published by Oxford University Press.)
- Published
- 2022
- Full Text
- View/download PDF
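The kmtricks abstract above combines two ideas: one Bloom filter per sample, and a joint count across samples so that low-abundance k-mers seen in several samples are kept rather than discarded. The sketch below illustrates both in a heavily simplified form. The rescue rule, its parameter names (`min_abundance`, `min_samples`), and the Bloom filter class are assumptions made for illustration and do not reproduce the kmtricks implementation.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive n_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kept_kmers(sample_counts: list[Counter], min_abundance: int, min_samples: int) -> list[set[str]]:
    """Per sample, keep abundant k-mers plus rare ones shared by enough samples."""
    presence = Counter()
    for counts in sample_counts:
        presence.update(counts.keys())  # number of samples containing each k-mer
    return [{kmer for kmer, n in counts.items()
             if n >= min_abundance or presence[kmer] >= min_samples}
            for counts in sample_counts]


# Hypothetical per-sample k-mer counts.
samples = [Counter({"ACG": 5, "CGT": 1, "GTT": 1}), Counter({"CGT": 1, "TTA": 1})]
kept = kept_kmers(samples, min_abundance=2, min_samples=2)

filters = []
for kmer_set in kept:
    bf = BloomFilter(n_bits=1024, n_hashes=3)
    for kmer in kmer_set:
        bf.add(kmer)
    filters.append(bf)
print(kept)
```

In this toy input, "CGT" has abundance 1 in both samples and would be dropped by a plain per-sample threshold of 2, but it is kept because it appears in two samples, while the singletons "GTT" and "TTA" are still filtered out.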
44. Critical Assessment of Metagenome Interpretation: the second round of challenges.
- Author
-
Meyer F, Fritz A, Deng ZL, Koslicki D, Lesker TR, Gurevich A, Robertson G, Alser M, Antipov D, Beghini F, Bertrand D, Brito JJ, Brown CT, Buchmann J, Buluç A, Chen B, Chikhi R, Clausen PTLC, Cristian A, Dabrowski PW, Darling AE, Egan R, Eskin E, Georganas E, Goltsman E, Gray MA, Hansen LH, Hofmeyr S, Huang P, Irber L, Jia H, Jørgensen TS, Kieser SD, Klemetsen T, Kola A, Kolmogorov M, Korobeynikov A, Kwan J, LaPierre N, Lemaitre C, Li C, Limasset A, Malcher-Miranda F, Mangul S, Marcelino VR, Marchet C, Marijon P, Meleshko D, Mende DR, Milanese A, Nagarajan N, Nissen J, Nurk S, Oliker L, Paoli L, Peterlongo P, Piro VC, Porter JS, Rasmussen S, Rees ER, Reinert K, Renard B, Robertsen EM, Rosen GL, Ruscheweyh HJ, Sarwal V, Segata N, Seiler E, Shi L, Sun F, Sunagawa S, Sørensen SJ, Thomas A, Tong C, Trajkovski M, Tremblay J, Uritskiy G, Vicedomini R, Wang Z, Wang Z, Wang Z, Warren A, Willassen NP, Yelick K, You R, Zeller G, Zhao Z, Zhu S, Zhu J, Garrido-Oter R, Gastmeier P, Hacquard S, Häußler S, Khaledi A, Maechler F, Mesny F, Radutoiu S, Schulze-Lefert P, Smit N, Strowig T, Bremges A, Sczyrba A, and McHardy AC
- Subjects
- Archaea genetics, Reproducibility of Results, Sequence Analysis, DNA, Software, Metagenome, Metagenomics methods
- Abstract
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses., (© 2022. The Author(s).)
- Published
- 2022
- Full Text
- View/download PDF
45. Petabase-scale sequence alignment catalyses viral discovery.
- Author
-
Edgar RC, Taylor B, Lin V, Altman T, Barbera P, Meleshko D, Lohr D, Novakovsky G, Buchfink B, Al-Shayeb B, Banfield JF, de la Peña M, Korobeynikov A, Chikhi R, and Babaian A
- Subjects
- Animals, Archives, Bacteriophages enzymology, Bacteriophages genetics, Biodiversity, Coronavirus classification, Coronavirus enzymology, Coronavirus genetics, Evolution, Molecular, Hepatitis Delta Virus enzymology, Hepatitis Delta Virus genetics, Humans, Models, Molecular, RNA Viruses classification, RNA Viruses enzymology, RNA-Dependent RNA Polymerase chemistry, RNA-Dependent RNA Polymerase genetics, Software, Cloud Computing, Databases, Genetic, RNA Viruses genetics, RNA Viruses isolation & purification, Sequence Alignment methods, Virology methods, Virome genetics
- Abstract
Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics., (© 2022. The Author(s), under exclusive licence to Springer Nature Limited.)
- Published
- 2022
- Full Text
- View/download PDF
46. Recombination Marks the Evolutionary Dynamics of a Recently Endogenized Retrovirus.
- Author
-
Yang L, Malhotra R, Chikhi R, Elleder D, Kaiser T, Rong J, Medvedev P, and Poss M
- Subjects
- Animals, Biological Evolution, Evolution, Molecular, Phylogeny, Recombination, Genetic, Deer genetics, Endogenous Retroviruses genetics
- Abstract
All vertebrate genomes have been colonized by retroviruses along their evolutionary trajectory. Although endogenous retroviruses (ERVs) can contribute important physiological functions to contemporary hosts, such benefits are attributed to long-term coevolution of ERV and host because germline infections are rare and expansion is slow, and because the host effectively silences them. The genomes of several outbred species including mule deer (Odocoileus hemionus) are currently being colonized by ERVs, which provides an opportunity to study ERV dynamics at a time when few are fixed. We previously established the locus-specific distribution of cervid ERV (CrERV) in populations of mule deer. In this study, we determine the molecular evolutionary processes acting on CrERV at each locus in the context of phylogenetic origin, genome location, and population prevalence. A mule deer genome was de novo assembled from short- and long-insert mate pair reads and CrERV sequence generated at each locus. We report that CrERV composition and diversity have recently measurably increased by horizontal acquisition of a new retrovirus lineage. This new lineage has further expanded CrERV burden and CrERV genomic diversity by activating and recombining with existing CrERV. Resulting interlineage recombinants then endogenize and subsequently expand. CrERV loci are significantly closer to genes than expected if integration were random and gene proximity might explain the recent expansion of one recombinant CrERV lineage. Thus, in mule deer, retroviral colonization is a dynamic period in the molecular evolution of CrERV that also provides a burst of genomic diversity to the host population., (© The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.)
- Published
- 2021
- Full Text
- View/download PDF
47. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer.
- Author
-
Ekim B, Berger B, and Chikhi R
- Subjects
- Humans, Metagenomics, Microcomputers, Sequence Analysis, DNA methods, Algorithms, Genomics
- Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. Here, we define an algorithmic approach, mdBG, that makes use of minimizer-space de Bruijn graphs to enable long-read genome assembly. mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without compromising accuracy. A human genome is assembled in under 10 min using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 min using 1 GB RAM. In addition, we constructed a minimizer-space de Bruijn graph-based representation of 661,405 bacterial genomes, comprising 16 million nodes and 45 million edges, and successfully search it for anti-microbial resistance (AMR) genes in 12 min. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics, and pangenomics. Code for constructing mdBGs is freely available for download at https://github.com/ekimb/rust-mdbg/., Competing Interests: Declaration of interests The authors declare no competing interests., (Copyright © 2021 The Authors. Published by Elsevier Inc. All rights reserved.)
- Published
- 2021
- Full Text
- View/download PDF
48. STRONG: metagenomics strain resolution on assembly graphs.
- Author
-
Quince C, Nurk S, Raguideau S, James R, Soyer OS, Summers JK, Limasset A, Eren AM, Chikhi R, and Darling AE
- Subjects
- Bayes Theorem, Contig Mapping, Haplotypes, Metagenomics methods, Sequence Analysis, DNA, Algorithms, Genome, Bacterial, Metagenome, Microbial Consortia genetics, Software
- Abstract
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo, from multiple metagenome samples. STRONG performs coassembly, and binning into metagenome assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs and their unitig per-sample coverages, for individual single-copy core genes (SCGs) in each MAG, to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and abundances. STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
49. Strainberry: automated strain separation in low-complexity metagenomes using long reads.
- Author
-
Vicedomini R, Quince C, Darling AE, and Chikhi R
- Subjects
- Bacteria classification, Bacteria genetics, Genomics methods, High-Throughput Nucleotide Sequencing, Species Specificity, Computational Biology methods, Genome, Bacterial genetics, Metagenome genetics, Metagenomics methods
- Abstract
High-throughput short-read metagenomics has enabled large-scale species-level analysis and functional characterization of microbial communities. Microbiomes often contain multiple strains of the same species, and different strains have been shown to have important differences in their functional roles. Recent advances on long-read based methods enabled accurate assembly of bacterial genomes from complex microbiomes and an as-yet-unrealized opportunity to resolve strains. Here we present Strainberry, a metagenome assembly pipeline that performs strain separation in single-sample low-complexity metagenomes and that relies uniquely on long-read data. We benchmarked Strainberry on mock communities for which it produces strain-resolved assemblies with near-complete reference coverage and 99.9% base accuracy. We also applied Strainberry on real datasets for which it improved assemblies generating 20-118% additional genomic material than conventional metagenome assemblies on individual strain genomes. We show that Strainberry is also able to refine microbial diversity in a complex microbiome, with complete separation of strain genomes. We anticipate this work to be a starting point for further methodological improvements on strain-resolved metagenome assembly in environments of higher complexities., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
50. Disk compression of k-mer sets.
- Author
-
Rahman A, Chikhi R, and Medvedev P
- Abstract
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
- Published
- 2021
- Full Text
- View/download PDF