8 results on '"UClust"'
Search Results
2. Metabarcoding free‐living marine nematodes using curated 18S and CO1 reference sequence databases for species‐level taxonomic assignments
- Author
- Lara Macheriotou, Katja Guilini, Tania Nara Bezerra, Bjorn Tytgat, Dinh Tu Nguyen, Thi Xuan Phuong Nguyen, Febe Noppe, Maickel Armenteros, Fehmi Boufahja, Annelien Rigaux, Ann Vanreusel, and Sofie Derycke
- Subjects
- metabarcoding, mock community, Nematoda, reference sequence database, UClust, USearch9, Ecology, QH540-549.5
- Abstract
High‐throughput sequencing has the potential to describe biological communities with high efficiency yet comprehensive assessment of diversity with species‐level resolution remains one of the most challenging aspects of metabarcoding studies. We investigated the utility of curated ribosomal and mitochondrial nematode reference sequence databases for determining phylum‐specific species‐level clustering thresholds. We compiled 438 ribosomal and 290 mitochondrial sequences which identified 99% and 94% as the species delineation clustering threshold, respectively. These thresholds were evaluated in HTS data from mock communities containing 39 nematode species as well as environmental samples from Vietnam. We compared the taxonomic description of the mocks generated by two read‐merging and two clustering algorithms and the cluster‐free Dada2 pipeline. Taxonomic assignment with the RDP classifier was assessed under different training sets. Our results showed that 36/39 mock nematode species were identified across the molecular markers (18S: 32, JB2: 19, JB3: 21) in UClust_ref OTUs at their respective clustering thresholds, outperforming UParse_denovo and the commonly used 97% similarity. Dada2 generated the most realistic number of ASVs (18S: 83, JB2: 75, JB3: 82), collectively identifying 30/39 mock species. The ribosomal marker outperformed the mitochondrial markers in terms of species and genus‐level detections for both OTUs and ASVs. The number of taxonomic assignments of OTUs/ASVs was highest when the smallest reference database containing only nematode sequences was used and when sequences were truncated to the respective amplicon length. Overall, OTUs generated more species‐level detections, which were, however, associated with higher error rates compared to ASVs.
Genus‐level assignments using ASVs exhibited higher accuracy and lower error rates compared to species‐level assignments, suggesting that this is the most reliable pipeline for rapid assessment of alpha diversity from environmental samples.
- Published
- 2019
3. Metabarcoding free‐living marine nematodes using curated 18S and CO1 reference sequence databases for species‐level taxonomic assignments.
- Author
- Macheriotou, Lara; Guilini, Katja; Bezerra, Tania Nara; Tytgat, Bjorn; Nguyen, Dinh Tu; Phuong Nguyen, Thi Xuan; Noppe, Febe; Armenteros, Maickel; Boufahja, Fehmi; Rigaux, Annelien; Vanreusel, Ann; and Derycke, Sofie
- Subjects
- *TAXONOMY, *BIOTIC communities, *CHEMICAL reactions, *T cells, *LYMPHOCYTES
- Abstract
High‐throughput sequencing has the potential to describe biological communities with high efficiency yet comprehensive assessment of diversity with species‐level resolution remains one of the most challenging aspects of metabarcoding studies. We investigated the utility of curated ribosomal and mitochondrial nematode reference sequence databases for determining phylum‐specific species‐level clustering thresholds. We compiled 438 ribosomal and 290 mitochondrial sequences which identified 99% and 94% as the species delineation clustering threshold, respectively. These thresholds were evaluated in HTS data from mock communities containing 39 nematode species as well as environmental samples from Vietnam. We compared the taxonomic description of the mocks generated by two read‐merging and two clustering algorithms and the cluster‐free Dada2 pipeline. Taxonomic assignment with the RDP classifier was assessed under different training sets. Our results showed that 36/39 mock nematode species were identified across the molecular markers (18S: 32, JB2: 19, JB3: 21) in UClust_ref OTUs at their respective clustering thresholds, outperforming UParse_denovo and the commonly used 97% similarity. Dada2 generated the most realistic number of ASVs (18S: 83, JB2: 75, JB3: 82), collectively identifying 30/39 mock species. The ribosomal marker outperformed the mitochondrial markers in terms of species and genus‐level detections for both OTUs and ASVs. The number of taxonomic assignments of OTUs/ASVs was highest when the smallest reference database containing only nematode sequences was used and when sequences were truncated to the respective amplicon length. Overall, OTUs generated more species‐level detections, which were, however, associated with higher error rates compared to ASVs. 
Genus‐level assignments using ASVs exhibited higher accuracy and lower error rates compared to species‐level assignments, suggesting that this is the most reliable pipeline for rapid assessment of alpha diversity from environmental samples. Curated DNA barcode data were used to delineate nematode‐specific inter‐specific operational taxonomic unit (OTU) clustering thresholds, which were subsequently tested on an artificial nematode community and environmental samples. Open‐reference clustering (UClust_ref) at these empirically derived thresholds outperformed de novo (UParse) as well as cluster‐free approaches (DADA2), resulting in the most accurate description of the nematode community. [ABSTRACT FROM AUTHOR]
- Published
- 2019
4. Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads.
- Author
- Ma Liang, Castle Raley, Xin Zheng, Geetha Kutty, Gogineni, Emile, Sherman, Brad T., Qiang Sun, Xiongfong Chen, Skelly, Thomas, Jones, Kristine, Stephens, Robert, Bin Zhou, Lau, William, Johnson, Calvin, Tomozumi Imamichi, Minkang Jiang, Robin Dewar, Lempicki, Richard A., Bao Tran, and Kovacs, Joseph A.
- Subjects
- *GENES, *NUCLEOTIDE sequence, *GENETIC code, *BIOINFORMATICS, *GENOMES
- Abstract
Background: Gene isoforms are commonly found in both prokaryotes and eukaryotes. Since each isoform may perform a specific function in response to changing environmental conditions, studying the dynamics of gene isoforms is important in understanding biological processes and disease conditions. However, genome-wide identification of gene isoforms is technically challenging due to the high degree of sequence identity among isoforms. Traditional targeted sequencing approach, involving Sanger sequencing of plasmid-cloned PCR products, has low throughput and is very tedious and time-consuming. Next-generation sequencing technologies such as Illumina and 454 achieve high throughput but their short read lengths are a critical barrier to accurate assembly of highly similar gene isoforms, and may result in ambiguities and false joining during sequence assembly. More recently, the third generation sequencer represented by the PacBio platform offers sufficient throughput and long reads covering the full length of typical genes, thus providing a potential to reliably profile gene isoforms. However, the PacBio long reads are error-prone and cannot be effectively analyzed by traditional assembly programs. Results: We present a clustering-based analysis pipeline integrated with PacBio sequencing data for profiling highly similar gene isoforms. This approach was first evaluated in comparison to de novo assembly of 454 reads using a benchmark admixture containing 10 known, cloned msg genes encoding the major surface glycoprotein of Pneumocystis jirovecii. All 10 msg isoforms were successfully reconstructed with the expected length (~1.5 kb) and correct sequence by the new approach, while 454 reads could not be correctly assembled using various assembly programs. When using an additional benchmark admixture containing 22 known P. 
jirovecii msg isoforms, this approach accurately reconstructed all but 4 of these isoforms in their full length (~3 kb); these 4 isoforms were present in low concentrations in the admixture. Finally, when applied to the original clinical sample from which the 22 known msg isoforms were cloned, this approach successfully identified not only all known isoforms accurately (~3 kb each) but also 48 novel isoforms. Conclusions: PacBio sequencing integrated with the clustering-based analysis pipeline achieves high-throughput and high-resolution discrimination of highly similar sequences, and can serve as a new approach for genome-wide characterization of gene isoforms and other highly repetitive sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2016
5. Metabarcoding free‐living marine nematodes using curated 18S and CO1 reference sequence databases for species‐level taxonomic assignments
- Author
- Bjorn Tytgat, Maickel Armenteros, Sofie Derycke, Ann Vanreusel, Annelien Rigaux, Febe Noppe, Thi Xuan Phuong Nguyen, Katja Guilini, T.N. Bezerra, Fehmi Boufahja, Dinh Tu Nguyen, and Lara Macheriotou
- Subjects
- 0106 biological sciences, Nematoda, UClust, Biology, computer.software_genre, 010603 evolutionary biology, 01 natural sciences, mock community, 03 medical and health sciences, MAGNITUDE, Species level, lcsh:QH540-549.5, RDNA, reference sequence database, Cluster analysis, DEEP-SEA, Ecology, Evolution, Behavior and Systematics, Original Research, 030304 developmental biology, Nature and Landscape Conservation, 0303 health sciences, Ecology, Database, 16S RIBOSOMAL-RNA, Biology and Life Sciences, DNA, Ribosomal RNA, Amplicon, 16S ribosomal RNA, Rapid assessment, metabarcoding, PATTERNS, MORPHOLOGY, Alpha diversity, lcsh:Ecology, USearch9, computer, POPULATION GENETIC-STRUCTURE, COMMUNITY ANALYSIS, Reference genome
- Abstract
High‐throughput sequencing has the potential to describe biological communities with high efficiency yet comprehensive assessment of diversity with species‐level resolution remains one of the most challenging aspects of metabarcoding studies. We investigated the utility of curated ribosomal and mitochondrial nematode reference sequence databases for determining phylum‐specific species‐level clustering thresholds. We compiled 438 ribosomal and 290 mitochondrial sequences which identified 99% and 94% as the species delineation clustering threshold, respectively. These thresholds were evaluated in HTS data from mock communities containing 39 nematode species as well as environmental samples from Vietnam. We compared the taxonomic description of the mocks generated by two read‐merging and two clustering algorithms and the cluster‐free Dada2 pipeline. Taxonomic assignment with the RDP classifier was assessed under different training sets. Our results showed that 36/39 mock nematode species were identified across the molecular markers (18S: 32, JB2: 19, JB3: 21) in UClust_ref OTUs at their respective clustering thresholds, outperforming UParse_denovo and the commonly used 97% similarity. Dada2 generated the most realistic number of ASVs (18S: 83, JB2: 75, JB3: 82), collectively identifying 30/39 mock species. The ribosomal marker outperformed the mitochondrial markers in terms of species and genus‐level detections for both OTUs and ASVs. The number of taxonomic assignments of OTUs/ASVs was highest when the smallest reference database containing only nematode sequences was used and when sequences were truncated to the respective amplicon length. Overall, OTUs generated more species‐level detections, which were, however, associated with higher error rates compared to ASVs. 
Genus‐level assignments using ASVs exhibited higher accuracy and lower error rates compared to species‐level assignments, suggesting that this is the most reliable pipeline for rapid assessment of alpha diversity from environmental samples.
- Published
- 2019
6. TBC: A clustering algorithm based on prokaryotic taxonomy.
- Author
- Lee, Jae-Hak; Yi, Hana; Jeon, Yoon-Seong; Won, Sungho; and Chun, Jongsik
- Abstract
High-throughput DNA sequencing technologies have revolutionized the study of microbial ecology. Massive sequencing of PCR amplicons of the 16S rRNA gene has been widely used to understand the microbial community structure of a variety of environmental samples. The resulting sequencing reads are clustered into operational taxonomic units that are then used to calculate various statistical indices that represent the degree of species diversity in a given sample. Several algorithms have been developed to perform this task, but they tend to produce different outcomes. Herein, we propose a novel sequence clustering algorithm, namely Taxonomy-Based Clustering (TBC). This algorithm incorporates the basic concept of prokaryotic taxonomy in which only comparisons to the type strain are made and used to form species while omitting full-scale multiple sequence alignment. The clustering quality of the proposed method was compared with those of MOTHUR, BLASTClust, ESPRIT-Tree, CD-HIT, and UCLUST. A comprehensive comparison using three different experimental datasets produced by pyrosequencing demonstrated that the clustering obtained using TBC is comparable to those obtained using MOTHUR and ESPRIT-Tree and is computationally efficient. The program was written in JAVA and is available from . [ABSTRACT FROM AUTHOR]
- Published
- 2012
7. Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads.
- Author
- Liang M, Raley C, Zheng X, Kutty G, Gogineni E, Sherman BT, Sun Q, Chen X, Skelly T, Jones K, Stephens R, Zhou B, Lau W, Johnson C, Imamichi T, Jiang M, Dewar R, Lempicki RA, Tran B, Kovacs JA, and Huang DW
- Abstract
Background: Gene isoforms are commonly found in both prokaryotes and eukaryotes. Since each isoform may perform a specific function in response to changing environmental conditions, studying the dynamics of gene isoforms is important in understanding biological processes and disease conditions. However, genome-wide identification of gene isoforms is technically challenging due to the high degree of sequence identity among isoforms. Traditional targeted sequencing approach, involving Sanger sequencing of plasmid-cloned PCR products, has low throughput and is very tedious and time-consuming. Next-generation sequencing technologies such as Illumina and 454 achieve high throughput but their short read lengths are a critical barrier to accurate assembly of highly similar gene isoforms, and may result in ambiguities and false joining during sequence assembly. More recently, the third generation sequencer represented by the PacBio platform offers sufficient throughput and long reads covering the full length of typical genes, thus providing a potential to reliably profile gene isoforms. However, the PacBio long reads are error-prone and cannot be effectively analyzed by traditional assembly programs. Results: We present a clustering-based analysis pipeline integrated with PacBio sequencing data for profiling highly similar gene isoforms. This approach was first evaluated in comparison to de novo assembly of 454 reads using a benchmark admixture containing 10 known, cloned msg genes encoding the major surface glycoprotein of Pneumocystis jirovecii. All 10 msg isoforms were successfully reconstructed with the expected length (~1.5 kb) and correct sequence by the new approach, while 454 reads could not be correctly assembled using various assembly programs. When using an additional benchmark admixture containing 22 known P. jirovecii msg isoforms, this approach accurately reconstructed all but 4 of these isoforms in their full length (~3 kb); these 4 isoforms were present in low concentrations in the admixture. Finally, when applied to the original clinical sample from which the 22 known msg isoforms were cloned, this approach successfully identified not only all known isoforms accurately (~3 kb each) but also 48 novel isoforms. Conclusions: PacBio sequencing integrated with the clustering-based analysis pipeline achieves high-throughput and high-resolution discrimination of highly similar sequences, and can serve as a new approach for genome-wide characterization of gene isoforms and other highly repetitive sequences.
- Published
- 2016
8. Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads
- Author
- Bin Zhou, William W. Lau, Brad T. Sherman, Min-Kang Jiang, Ma Liang, Calvin A. Johnson, Bao Tran, Qiang Sun, Geetha Kutty, Richard A. Lempicki, Xin Zheng, Kristine Jones, Castle Raley, Xiongfong Chen, Thomas Skelly, Robin L. Dewar, Robert M. Stephens, Da-Wei Huang, Emile Gogineni, Tomozumi Imamichi, and Joseph A. Kovacs
- Subjects
- 0301 basic medicine, Gene isoform, Bioinformatics analysis, 030106 microbiology, Repetitive Sequences, Sequence assembly, Biology, Biochemistry, 03 medical and health sciences, symbols.namesake, Gene isoforms, Genetics, Repetitive sequences, Cluster analysis, Gene, Molecular Biology, PacBio, Sanger sequencing, Pneumocystis, Specific function, Methodology, Uclust, Computer Science Applications, Computational Mathematics, 030104 developmental biology, Computational Theory and Mathematics, NGS, Major surface glycoprotein, symbols
- Abstract
Background: Gene isoforms are commonly found in both prokaryotes and eukaryotes. Since each isoform may perform a specific function in response to changing environmental conditions, studying the dynamics of gene isoforms is important in understanding biological processes and disease conditions. However, genome-wide identification of gene isoforms is technically challenging due to the high degree of sequence identity among isoforms. Traditional targeted sequencing approach, involving Sanger sequencing of plasmid-cloned PCR products, has low throughput and is very tedious and time-consuming. Next-generation sequencing technologies such as Illumina and 454 achieve high throughput but their short read lengths are a critical barrier to accurate assembly of highly similar gene isoforms, and may result in ambiguities and false joining during sequence assembly. More recently, the third generation sequencer represented by the PacBio platform offers sufficient throughput and long reads covering the full length of typical genes, thus providing a potential to reliably profile gene isoforms. However, the PacBio long reads are error-prone and cannot be effectively analyzed by traditional assembly programs. Results: We present a clustering-based analysis pipeline integrated with PacBio sequencing data for profiling highly similar gene isoforms. This approach was first evaluated in comparison to de novo assembly of 454 reads using a benchmark admixture containing 10 known, cloned msg genes encoding the major surface glycoprotein of Pneumocystis jirovecii. All 10 msg isoforms were successfully reconstructed with the expected length (~1.5 kb) and correct sequence by the new approach, while 454 reads could not be correctly assembled using various assembly programs. When using an additional benchmark admixture containing 22 known P. jirovecii msg isoforms, this approach accurately reconstructed all but 4 of these isoforms in their full length (~3 kb); these 4 isoforms were present in low concentrations in the admixture. Finally, when applied to the original clinical sample from which the 22 known msg isoforms were cloned, this approach successfully identified not only all known isoforms accurately (~3 kb each) but also 48 novel isoforms. Conclusions: PacBio sequencing integrated with the clustering-based analysis pipeline achieves high-throughput and high-resolution discrimination of highly similar sequences, and can serve as a new approach for genome-wide characterization of gene isoforms and other highly repetitive sequences. Electronic supplementary material: The online version of this article (doi:10.1186/s13040-016-0090-8) contains supplementary material, which is available to authorized users.