562 results for "Computational Biology standards"
Search Results
2. Computed Tomography-Based Radiomics Analysis for Prediction of Response to Neoadjuvant Chemotherapy in Breast Cancer Patients.
- Author
-
Duan Y, Yang G, Miao W, Song B, Wang Y, Yan L, Wu F, Zhang R, Mao Y, and Wang Z
- Subjects
- Neoadjuvant Therapy, Predictive Value of Tests, Retrospective Studies, Models, Statistical, Humans, Female, Adult, Middle Aged, Reproducibility of Results, Computational Biology standards, Tomography, X-Ray Computed standards, Breast Neoplasms diagnostic imaging, Breast Neoplasms drug therapy
- Abstract
Purpose: Previous studies have pointed out that magnetic resonance- and fluorodeoxyglucose positron emission tomography-based radiomics have a high predictive value for the response to neoadjuvant chemotherapy (NAC) in breast cancer by characterizing tumor heterogeneity in relaxation time and glucose metabolism, respectively. However, it is unclear whether computed tomography (CT)-based radiomics, which captures density heterogeneity, can predict the response to NAC. This study aimed to develop and validate a CT-based radiomics nomogram to predict the response to NAC in breast cancer., Methods: A total of 162 breast cancer patients (110 in the training cohort and 52 in the validation cohort) who underwent CT scans before receiving NAC and had pathological response results were retrospectively enrolled. According to the Miller-Payne grading system, grades 4 to 5 cases were classified as response to NAC and grades 1 to 3 cases as nonresponse. Radiomics features were extracted, and the optimal radiomics features were selected to construct a radiomics signature. Multivariate logistic regression was used to develop the clinical prediction model and the radiomics nomogram that incorporated clinical characteristics and the radiomics score. We assessed the performance of the different models, including calibration and clinical usefulness., Results: Eight optimal radiomics features were obtained. Human epidermal growth factor receptor 2 status and molecular subtype showed statistical differences between the response group and the nonresponse group. The radiomics nomogram had more favorable predictive efficacy than the clinical prediction model (areas under the curve, 0.82 vs 0.70 in the training cohort; 0.79 vs 0.71 in the validation cohort). The Delong test showed a statistically significant difference between the clinical prediction model and the radiomics nomogram (z = 2.811, P = 0.005 in the training cohort). The decision curve analysis showed that the radiomics nomogram had a higher overall net benefit than the clinical prediction model., Conclusion: The radiomics nomogram based on the CT radiomics signature and clinical characteristics has favorable predictive efficacy for the response to NAC in breast cancer., Competing Interests: The authors declare no conflict of interest., (Copyright © 2023 Wolters Kluwer Health, Inc. All rights reserved.)
- Published
- 2023
- Full Text
- View/download PDF
3. AlphaFold's new rival? Meta AI predicts shape of 600 million proteins.
- Author
-
Callaway E
- Subjects
- Artificial Intelligence standards, Artificial Intelligence trends, Computational Biology methods, Computational Biology standards, Computational Biology trends, Proteins chemistry, Protein Folding, Software standards
- Published
- 2022
- Full Text
- View/download PDF
4. Adapt-Kcr: a novel deep learning framework for accurate prediction of lysine crotonylation sites based on learning embedding features and attention architecture.
- Author
-
Li Z, Fang J, Wang S, Zhang L, Chen Y, and Pian C
- Subjects
- Algorithms, Benchmarking, Databases, Factual, Neural Networks, Computer, Phosphorylation, ROC Curve, Reproducibility of Results, User-Computer Interface, Computational Biology methods, Computational Biology standards, Deep Learning, Lysine metabolism, Protein Processing, Post-Translational, Software
- Abstract
Protein lysine crotonylation (Kcr) is an important type of posttranslational modification that is associated with a wide range of biological processes. The identification of Kcr sites is critical to better understanding their functional mechanisms. However, the existing experimental techniques for detecting Kcr sites are cost-ineffective, leading to a great need for new computational methods to address this problem. We here describe Adapt-Kcr, an advanced deep learning model that utilizes adaptive embedding and is based on a convolutional neural network together with a bidirectional long short-term memory network and an attention architecture. On the independent testing set, Adapt-Kcr outperformed the current state-of-the-art Kcr prediction model, with an improvement of 3.2% in accuracy and 1.9% in the area under the receiver operating characteristic curve. Compared to other Kcr models, Adapt-Kcr additionally had a more robust ability to distinguish between crotonylation and other lysine modifications. Another model (Adapt-ST) was trained to predict phosphorylation sites in SARS-CoV-2 and outperformed the equivalent state-of-the-art phosphorylation site prediction model. These results indicate that self-adaptive embedding features perform better than handcrafted features in capturing discriminative information; when used in an attention architecture, they could be an effective way of identifying protein Kcr sites. Together, our Adapt framework (including learning embedding features and attention architecture) has strong potential for the prediction of other protein posttranslational modification sites., (© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2022
- Full Text
- View/download PDF
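The entry above describes a sequence model built from a learnable embedding, a convolutional layer, a bidirectional LSTM and an attention layer. The sketch below shows what that family of architectures can look like in PyTorch; it is not the authors' Adapt-Kcr code, and the layer sizes, the 29-residue window length and the single convolutional layer are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code) of the architecture family the
# Adapt-Kcr abstract describes: learnable embedding -> CNN -> BiLSTM -> attention.
# All layer sizes and the peptide window length are hypothetical choices.
import torch
import torch.nn as nn

class KcrLikeModel(nn.Module):
    def __init__(self, vocab_size=21, embed_dim=32, conv_channels=64, lstm_hidden=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # adaptive, learnable embedding
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * lstm_hidden, 1)                     # simple additive attention scores
        self.classifier = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, x):                       # x: (batch, seq_len) integer-encoded residues
        e = self.embedding(x)                   # (batch, seq_len, embed_dim)
        c = torch.relu(self.conv(e.transpose(1, 2))).transpose(1, 2)  # (batch, seq_len, channels)
        h, _ = self.lstm(c)                     # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over sequence positions
        ctx = (w * h).sum(dim=1)                # attention-weighted context vector
        return torch.sigmoid(self.classifier(ctx)).squeeze(-1)        # P(Kcr site)

model = KcrLikeModel()
dummy = torch.randint(0, 21, (8, 29))           # batch of 8 peptide windows of length 29
print(model(dummy).shape)                       # torch.Size([8])
```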
5. scMRMA: single cell multiresolution marker-based annotation.
- Author
-
Li J, Sheng Q, Shyr Y, and Liu Q
- Subjects
- Algorithms, Cluster Analysis, Computational Biology standards, Databases, Genetic, Gene Expression Profiling standards, Humans, Molecular Sequence Annotation, Reproducibility of Results, Sequence Analysis, RNA standards, Biomarkers, Computational Biology methods, Gene Expression Profiling methods, Sequence Analysis, RNA methods, Single-Cell Analysis methods, Software
- Abstract
Single-cell RNA sequencing has become a powerful tool for identifying and characterizing cellular heterogeneity. One essential step to understanding cellular heterogeneity is determining cell identities. The widely used strategy predicts identities by projecting cells or cell clusters unidirectionally against a reference to find the best match. Here, we develop a bidirectional method, scMRMA, where a hierarchical reference guides iterative clustering and deep annotation with enhanced resolutions. Taking full advantage of the reference, scMRMA greatly improves the annotation accuracy. scMRMA achieved better performance than existing methods in four benchmark datasets and successfully revealed the expansion of CD8 T cell populations in squamous cell carcinoma after anti-PD-1 treatment., (© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.)
- Published
- 2022
- Full Text
- View/download PDF
6. Assessment of Inter-Laboratory Differences in SARS-CoV-2 Consensus Genome Assemblies between Public Health Laboratories in Australia.
- Author
-
Foster CSP, Stelzer-Braid S, Deveson IW, Bull RA, Yeang M, Au JP, Ruiz Silva M, van Hal SJ, Rockett RJ, Sintchenko V, Kim KW, and Rawlinson WD
- Subjects
- Australia, Computational Biology methods, Humans, Phylogeny, SARS-CoV-2 classification, Whole Genome Sequencing, Computational Biology standards, Consensus, Genome, Viral, Laboratories standards, Public Health, SARS-CoV-2 genetics
- Abstract
Whole-genome sequencing of viral isolates is critical for informing transmission patterns and for tracking the ongoing evolution of pathogens, especially during a pandemic. However, when genomes have low variability in the early stages of a pandemic, the impact of technical and/or sequencing errors increases. We quantitatively assessed inter-laboratory differences in consensus genome assemblies of 72 matched SARS-CoV-2-positive specimens sequenced at different laboratories in Sydney, Australia. Raw sequence data were assembled using two different bioinformatics pipelines in parallel, and the resulting consensus genomes were compared to detect laboratory-specific differences. Matched genome sequences were predominantly concordant, with a median pairwise identity of 99.997%. Identified differences were predominantly driven by ambiguous site content. Ignoring these sites, differences were found in only 2.3% (5/216) of pairwise comparisons, each differing by a single nucleotide. Matched samples were assigned the same Pango lineage in 98.2% (212/216) of pairwise comparisons, and were mostly assigned to the same phylogenetic clade. However, epidemiological inference based only on single nucleotide variant distances may lead to significant differences in the number of defined clusters if variant allele frequency thresholds for consensus genome generation differ between laboratories. These results underscore the need for a unified, best-practices approach to bioinformatics between laboratories working on a common outbreak problem.
- Published
- 2022
- Full Text
- View/download PDF
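The comparison described in the entry above reduces to computing pairwise identity between matched consensus genomes while optionally ignoring ambiguous sites. A minimal sketch of that calculation, assuming the two consensus sequences are already aligned to the same coordinates; the function name and toy sequences are illustrative only.

```python
# Minimal sketch of pairwise identity between two aligned consensus genomes,
# with and without counting ambiguous bases (N and other IUPAC codes).
AMBIGUOUS = set("NRYSWKMBDHV-")

def pairwise_identity(seq_a, seq_b, ignore_ambiguous=True):
    assert len(seq_a) == len(seq_b), "sequences must be aligned to equal length"
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if ignore_ambiguous and (a in AMBIGUOUS or b in AMBIGUOUS):
            continue                      # skip ambiguous/masked positions
        compared += 1
        if a != b:
            differences += 1
    identity = 100.0 * (compared - differences) / compared if compared else float("nan")
    return identity, differences

# Toy example: one unambiguous mismatch, one position masked as N in the second genome
ident, diffs = pairwise_identity("ACGTACGTAC", "ACGTANGTAT")
print(f"{ident:.3f}% identity, {diffs} nucleotide difference(s)")
```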
7. Are we there yet? A systematic literature review of Open Educational Resources in Africa: A combined content and bibliometric analysis.
- Author
-
Tlili A, Altinay F, Huang R, Altinay Z, Olivier J, Mishra S, Jemni M, and Burgos D
- Subjects
- Africa, Bibliometrics, Humans, Research Personnel statistics & numerical data, Biological Science Disciplines education, Computational Biology standards, Education, Distance standards, Research Personnel education
- Abstract
Although several studies have been conducted to summarize the progress of open educational resources (OER) in specific regions, only a limited number of studies summarize OER in Africa. Therefore, this paper presents a systematic literature review to explore trends, themes, and patterns in this emerging area of study, using content and bibliometric analysis. Findings indicated three major strands of OER research in Africa: (1) OER adoption is only limited to specific African countries, calling for more research and collaboration between African countries in this field to ensure educational equity; (2) most of the OER initiatives in Africa have focused on the creation process and neglected other important perspectives, such as dissemination and open educational practices (OEP) using OER; and (3) on top of the typical challenges for OER adoption (e.g., infrastructure), other personal challenges were identified within the African context, including culture, language, and personality. The findings of this study suggest that more initiatives and cross-collaborations with African and non-African countries in the field of OER are needed to facilitate OER adoption in the region. Additionally, it is suggested that researchers and practitioners should consider individual differences, such as language, personality and culture, when promoting and designing OER for different African countries. Finally, the findings can promote social justice by providing insights and future research paths that different stakeholders (e.g., policy makers, educators, practitioners, etc.) should focus on to promote OER in Africa., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2022
- Full Text
- View/download PDF
8. iProX in 2021: connecting proteomics data sharing with big data.
- Author
-
Chen T, Ma J, Liu Y, Chen Z, Xiao N, Lu Y, Fu Y, Yang C, Li M, Wu S, Wang X, Li D, He F, Hermjakob H, and Zhu Y
- Subjects
- Big Data, Computational Biology standards, Information Dissemination, Databases, Protein, Proteome genetics, Proteomics, Software
- Abstract
The rapid development of proteomics studies has resulted in large volumes of experimental data. The emergence of big data platforms provides the opportunity to handle these large amounts of data. The integrated proteome resource, iProX (https://www.iprox.cn), which was initiated in 2017, has been greatly improved with an up-to-date big data platform implemented in 2021. Here, we describe the main iProX developments since its first publication in Nucleic Acids Research in 2019. First, a hyper-converged architecture with high scalability supports the submission process. A Hadoop cluster can store large amounts of proteomics datasets, and a distributed, RESTful-styled Elastic Search engine can query millions of records within one second. Also, several new features, including the Universal Spectrum Identifier (USI) mechanism proposed by ProteomeXchange, a RESTful Web Service API, and a high-efficiency reanalysis pipeline, have been added to iProX for better open data sharing. By the end of August 2021, 1526 datasets had been submitted to iProX, reaching a total data volume of 92.42 TB. With the implementation of the big data platform, iProX can support PB-level data storage, hundreds of billions of spectra records, and second-level latency service capabilities that meet the requirements of the fast-growing field of proteomics., (© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.)
- Published
- 2022
- Full Text
- View/download PDF
9. PmiREN2.0: from data annotation to functional exploration of plant microRNAs.
- Author
-
Guo Z, Kuang Z, Zhao Y, Deng Y, He H, Wan M, Tao Y, Wang D, Wei J, Li L, and Yang X
- Subjects
- Computational Biology standards, Gene Expression Regulation, Plant genetics, Genome, Plant genetics, MicroRNAs classification, Databases, Genetic, MicroRNAs genetics, Plants genetics, Software
- Abstract
Nearly 200 plant genomes have been sequenced over the last two years, and new functions of plant microRNAs (miRNAs) have been revealed. Therefore, a timely update of the plant miRNA databases, incorporating miRNAs from the newly sequenced species along with functional information, is required to provide a useful resource for advancing plant miRNA research. Here we report the update of PmiREN2.0 (https://pmiren.com/) with an addition of 19 363 miRNA entries from 91 plants, doubling the amount of data in the original version. Meanwhile, abundant regulatory information centred on miRNAs was added, including upstream transcription factors predicted through binding-motif scanning and elaborate annotation of miRNA targets. As an example, a genome-wide regulatory network centred on miRNAs was constructed for Arabidopsis. Furthermore, phylogenetic trees of conserved miRNA families were built to expand the understanding of miRNA evolution across the plant lineages. These data are helpful for deducing the regulatory relationships concerning miRNA functions in diverse plants. Besides the new data, a suite of design tools was incorporated to facilitate experimental practice. Finally, a forum named 'PmiREN Community' was added for discussion and for sharing resources and new discoveries. With these upgrades, PmiREN2.0 should serve the community better and accelerate miRNA research in plants., (© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.)
- Published
- 2022
- Full Text
- View/download PDF
10. ECO: the Evidence and Conclusion Ontology, an update for 2022.
- Author
-
Nadendla S, Jackson R, Munro J, Quaglia F, Mészáros B, Olley D, Hobbs ET, Goralski SM, Chibucos M, Mungall CJ, Tosatto SCE, Erill I, and Giglio MG
- Subjects
- Humans, Molecular Sequence Annotation, Computational Biology standards, Databases, Genetic, Gene Ontology, Software
- Abstract
The Evidence and Conclusion Ontology (ECO) is a community resource that provides an ontology of terms used to capture the type of evidence that supports biomedical annotations and assertions. Consistent capture of evidence information with ECO allows tracking of annotation provenance, establishment of quality control measures, and evidence-based data mining. ECO is in use by dozens of data repositories and resources with both specific and general areas of focus. ECO is continually being expanded and enhanced in response to user requests as well as our aim to adhere to community best-practices for ontology development. The ECO support team engages in multiple collaborations with other ontologies and annotating groups. Here we report on recent updates to the ECO ontology itself as well as associated resources that are available through this project. ECO project products are freely available for download from the project website (https://evidenceontology.org/) and GitHub (https://github.com/evidenceontology/evidenceontology). ECO is released into the public domain under a CC0 1.0 Universal license., (© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.)
- Published
- 2022
- Full Text
- View/download PDF
11. Population-tailored mock genome enables genomic studies in species without a reference genome.
- Author
-
Sabadin F, Carvalho HF, Galli G, and Fritsche-Neto R
- Subjects
- Animals, Chimera genetics, Datasets as Topic, Genome, Genome-Wide Association Study methods, Genome-Wide Association Study standards, Genomics methods, Genomics standards, Genotype, Phenotype, Reference Standards, Reproducibility of Results, Selection, Genetic, Species Specificity, Zea mays classification, Zea mays genetics, Computational Biology methods, Computational Biology standards, Genotyping Techniques methods, Genotyping Techniques standards, Models, Genetic
- Abstract
Based on molecular markers, genomic prediction enables us to speed up breeding schemes and increase the response to selection. There are several high-throughput genotyping platforms able to deliver thousands of molecular markers for genomic study purposes. However, even though genomic prediction is widely applied in plant breeding, species without a reference genome cannot fully benefit from genomic tools and modern breeding schemes. We used a method to assemble a population-tailored mock genome to call single-nucleotide polymorphism (SNP) markers without an available reference genome, and for the first time, we compared the results with standard genotyping platforms (array and genotyping-by-sequencing (GBS) using a reference genome) for performance in genomic prediction models. Our results indicate that using a population-tailored mock genome to call SNPs delivers reliable estimates of the genomic relationship between genotypes. Furthermore, genomic prediction estimates were comparable to standard approaches, especially when considering only additive effects. However, mock genomes were slightly worse than arrays at predicting traits influenced by dominance effects, but still performed as well as standard GBS methods that use a reference genome. Nevertheless, the array-based SNP marker methods achieved the best predictive ability and reliability in estimating variance components. Overall, mock genomes can be a worthy alternative for genomic selection studies, especially for those species where a reference genome is not available., (© 2021. The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.)
- Published
- 2022
- Full Text
- View/download PDF
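A central quantity in the study above is the genomic relationship between genotypes estimated from SNP markers. The sketch below computes a genomic relationship matrix with the widely used VanRaden method 1 from a 0/1/2-coded genotype matrix; the coding convention, toy data and function name are assumptions for illustration, not the authors' pipeline.

```python
# Sketch of a genomic relationship matrix (VanRaden method 1) from SNP genotypes.
import numpy as np

def vanraden_grm(genotypes):
    """genotypes: (n_individuals, n_markers) array coded 0, 1, 2 (allele counts)."""
    p = genotypes.mean(axis=0) / 2.0               # allele frequency per marker
    Z = genotypes - 2.0 * p                        # centre by twice the allele frequency
    denom = 2.0 * np.sum(p * (1.0 - p))            # VanRaden scaling factor
    return Z @ Z.T / denom

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(5, 200))           # 5 genotypes x 200 markers (toy data)
G = vanraden_grm(geno)
print(np.round(G, 2))                              # 5 x 5 genomic relationship matrix
```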
12. DEVELOPMENT OF A TRIPLEX REAL-TIME PCR ASSAY TO DETECT ECHINOCOCCUS SPECIES IN CANID FECAL SAMPLES.
- Author
-
Zhang X, Jian Y, Guo Z, Duo H, and Wei Y
- Subjects
- Animals, Computational Biology standards, DNA, Helminth isolation & purification, Dogs, Echinococcosis parasitology, Echinococcus classification, Echinococcus genetics, Foxes parasitology, Limit of Detection, Multiplex Polymerase Chain Reaction methods, Multiplex Polymerase Chain Reaction standards, Reproducibility of Results, Sensitivity and Specificity, Soil parasitology, Canidae parasitology, Dog Diseases parasitology, Echinococcosis veterinary, Echinococcus isolation & purification, Feces parasitology, Multiplex Polymerase Chain Reaction veterinary
- Abstract
Echinococcosis is a zoonotic disease with great significance to public health, and appropriate detection and control strategies should be adopted to mitigate its impact. Most cases of echinococcosis are believed to be transmitted by the consumption of food and/or water contaminated with canid stool containing Echinococcus spp. eggs. Studies assessing Echinococcus multilocularis, Echinococcus granulosus sensu stricto, and Echinococcus shiquicus coinfection in contaminated water-derived, soil-derived, and food-borne samples are scarce, which may be due to the lack of optimized laboratory detection methods. The present study aimed to develop and evaluate a novel triplex TaqMan-minor groove binder probe real-time polymerase chain reaction (rtPCR) assay to simultaneously detect the 3 Echinococcus spp. mentioned above in canid fecal samples from the Qinghai-Tibetan Plateau area (QTPA). The efficiency and linearity of each signal channel in the triplex rtPCR assay were within acceptable limits for the range of concentrations tested. Furthermore, the method was shown to have good repeatability (standard deviation ≤0.32 cycle threshold), and the limit of detection was estimated to be 10 plasmid copies/μl of reaction. In summary, the evaluation of the present method shows that the newly developed triplex rtPCR assay is a highly specific, precise, consistent, and stable method that could be used in epidemiological investigations of echinococcosis., (© American Society of Parasitologists 2022.)
- Published
- 2022
- Full Text
- View/download PDF
13. Predicting deleterious missense genetic variants via integrative supervised nonnegative matrix tri-factorization.
- Author
-
Arani AA, Sehhati M, and Tabatabaiefar MA
- Subjects
- Algorithms, Computational Biology standards, Humans, ROC Curve, Reproducibility of Results, Systems Biology methods, Computational Biology methods, Genetic Association Studies, Mutation, Missense, Supervised Machine Learning
- Abstract
Among the assortment of genetic variations, missense variants are a major class, a small subset of which may disrupt protein function and ultimately result in human disease. Various machine learning methods have been proposed to differentiate deleterious and benign missense variants by means of a large number of features, including structure, sequence, interaction networks, gene-disease associations and phenotypes. However, the development of a reliable and accurate algorithm for merging heterogeneous information is highly needed, as such an algorithm could capture all the information contained in the complex interaction networks that genes participate in. In this study we propose a new method based on the non-negative matrix tri-factorization clustering method. We outline two versions of the proposed method: two-source and three-source algorithms. The two-source algorithm aggregates individual deleteriousness prediction methods and a PPI network, and the three-source algorithm incorporates gene-disease associations in addition to the other sources already mentioned. Four benchmark datasets were employed for internal and external validation of both algorithms of our predictor. The results on all datasets confirmed that our method outperforms most state-of-the-art variant prediction tools. Two key features of our variant effect prediction method are worth mentioning. Firstly, although the incorporation of gene-disease information in the three-source algorithm can improve prediction performance compared with the two-source algorithm, our method is not hindered by type 2 circularity error, unlike some recent ensemble-based prediction methods. Type 2 circularity error occurs when the predictor annotates variants on the basis of the genes they are located on. Secondly, the performance of our predictor is superior to that of other ensemble-based methods for variants located in genes for which there is not enough information about pathogenicity., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
14. A global view of standards for open image data formats and repositories.
- Author
-
Swedlow JR, Kankaanpää P, Sarkans U, Goscinski W, Galloway G, Malacrida L, Sullivan RP, Härtel S, Brown CM, Wood C, Keppler A, Paina F, Loos B, Zullino S, Longo DL, Aime S, and Onami S
- Subjects
- Animals, Artificial Intelligence, Computational Biology instrumentation, Databases, Factual, Diagnostic Imaging instrumentation, Humans, Image Processing, Computer-Assisted methods, Information Storage and Retrieval methods, Microscopy methods, Proteomics standards, Societies, Scientific, Software, Spectrum Analysis, Raman, User-Computer Interface, Computational Biology methods, Computational Biology standards, Diagnostic Imaging methods, Diagnostic Imaging standards, Metadata standards
- Published
- 2021
- Full Text
- View/download PDF
15. High-accuracy protein structure prediction in CASP14.
- Author
-
Pereira J, Simpkin AJ, Hartmann MD, Rigden DJ, Keegan RM, and Lupas AN
- Subjects
- Computational Biology methods, Computational Biology standards, Databases, Protein, Reproducibility of Results, Sequence Analysis, Protein, Models, Molecular, Protein Conformation, Proteins chemistry, Proteins metabolism, Software
- Abstract
The application of state-of-the-art deep-learning approaches to the protein modeling problem has expanded the "high-accuracy" category in CASP14 to encompass all targets. Building on the metrics used for high-accuracy assessment in previous CASPs, we evaluated the performance of all groups that submitted models for at least 10 targets across all difficulty classes, and judged the usefulness of the models produced by AlphaFold2 (AF2) as molecular replacement search models with AMPLE. Driven by the qualitative diversity of the targets submitted to CASP, we also introduce DipDiff as a new measure of the improvement in backbone geometry provided by a model versus available templates. Although a large leap in high accuracy is seen due to AF2, the second-best method in CASP14 outperformed the best in CASP13, illustrating the role of community-based benchmarking in the development and evolution of the protein structure prediction field., (© 2021 The Authors. Proteins: Structure, Function, and Bioinformatics published by Wiley Periodicals LLC.)
- Published
- 2021
- Full Text
- View/download PDF
16. Integration of VarSome API in an existing bioinformatic pipeline for automated ACMG interpretation of clinical variants.
- Author
-
Sorrentino E, Cristofoli F, Modena C, Paolacci S, Bertelli M, and Marceddu G
- Subjects
- Computational Biology methods, Computational Biology standards, Genomics methods, High-Throughput Nucleotide Sequencing methods, Humans, Search Engine methods, Sequence Analysis, DNA methods, Sequence Analysis, DNA standards, Algorithms, Genetic Variation genetics, Genomics standards, High-Throughput Nucleotide Sequencing standards, Practice Guidelines as Topic standards, Search Engine standards
- Abstract
Objective: While the bioinformatic workflow, from quality control to annotation, is quite standardized, the interpretation of variants is still a challenge. The decreasing cost of massively parallel NGS has produced hundreds of variants per patient to analyze and interpret. The ACMG "Standards and guidelines for the interpretation of sequence variants", widely adopted in clinical settings, assume that the clinician has a comprehensive knowledge of the literature and the disease., Materials and Methods: To semi-automate the application of the guidelines, we developed an algorithm that exploits VarSome, a widely used platform that interprets variants on the basis of information from more than 70 genome databases., Results: Here we explain how we integrated the VarSome API into our existing clinical diagnostic pipeline for NGS data to obtain validated, reproducible results, as indicated by accuracy, sensitivity and specificity., Conclusions: We validated the automated pipeline to be sure that it was doing what we expected. We obtained 100% sensitivity, specificity and accuracy, confirming that it is suitable for use in a diagnostic setting.
- Published
- 2021
- Full Text
- View/download PDF
17. OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies.
- Author
-
Moore J, Allan C, Besson S, Burel JM, Diel E, Gault D, Kozlowski K, Lindner D, Linkert M, Manz T, Moore W, Pape C, Tischer C, and Swedlow JR
- Subjects
- Benchmarking, Computational Biology methods, Data Compression, Databases, Factual, Information Storage and Retrieval, Internet, Microscopy methods, Programming Languages, SARS-CoV-2, Computational Biology instrumentation, Computational Biology standards, Metadata, Microscopy instrumentation, Microscopy standards, Software
- Abstract
The rapid pace of innovation in biological imaging and the diversity of its applications have prevented the establishment of a community-agreed standardized data format. We propose that complementing established open formats such as OME-TIFF and HDF5 with a next-generation file format such as Zarr will satisfy the majority of use cases in bioimaging. Critically, a common metadata format used in all these vessels can deliver truly findable, accessible, interoperable and reusable bioimaging data., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
18. Variant Selection and Interpretation: An Example of Modified VarSome Classifier of ACMG Guidelines in the Diagnostic Setting.
- Author
-
Cristofoli F, Sorrentino E, Guerri G, Miotto R, Romanelli R, Zulian A, Cecchin S, Paolacci S, Miertus J, Bertelli M, Maltese PE, Chiurazzi P, Stuppia L, Castori M, and Marceddu G
- Subjects
- Computational Biology standards, Genetic Testing standards, Humans, Genetic Variation genetics, Genome, Human genetics, Genomics standards
- Abstract
Variant interpretation is challenging as it involves combining different levels of evidence in order to evaluate the role of a specific variant in the context of a patient's disease. Many in-depth refinements followed the original 2015 American College of Medical Genetics (ACMG) guidelines to overcome subjective interpretation of criteria and classification inconsistencies. Here, we developed an ACMG-based classifier that retrieves information for variant interpretation from the VarSome Stable-API environment and allows molecular geneticists involved in clinical reporting to introduce the necessary changes to criterion strength and to add or exclude criteria assigned automatically, ultimately leading to the final variant classification. We also developed a modified ACMG checklist to assist molecular geneticists in adjusting criterion strength and in adding literature-retrieved or patient-specific information, when available. The proposed classifier is an example of integration of automation and human expertise in variant curation, while maintaining the laboratory analytical workflow and the established bioinformatics pipeline.
- Published
- 2021
- Full Text
- View/download PDF
19. Genetic and Bioinformatic Strategies to Improve Diagnosis in Three Inherited Bleeding Disorders in Bogotá, Colombia.
- Author
-
Lago J, Groot H, Navas D, Lago P, Gamboa M, Calderón D, and Polanía-Villanueva DC
- Subjects
- Blood Coagulation Disorders, Inherited genetics, Colombia, Computational Biology economics, Computational Biology standards, Costs and Cost Analysis, Factor IX chemistry, Genetic Testing economics, Genetic Testing standards, Hemophilia A diagnosis, Humans, Protein Domains, Sensitivity and Specificity, von Willebrand Factor chemistry, Blood Coagulation Disorders, Inherited diagnosis, Computational Biology methods, Factor IX genetics, Genetic Testing methods, Hemophilia A genetics, von Willebrand Factor genetics
- Abstract
Inherited bleeding disorders (IBDs) are the most frequent congenital diseases in the Colombian population; three of them are hemophilia A (HA), hemophilia B (HB), and von Willebrand Disease (VWD). Currently, diagnosis relies on multiple clinical laboratory assays to assign a phenotype. Due to the lack of accessibility to these tests, patients can receive an incomplete diagnosis. In these cases, genetic studies reinforce the clinical diagnosis. The present study characterized the molecular genetic basis of 11 HA, three HB, and five VWD patients by sequencing the F8, F9, or VWF gene. Twelve variations were found in HA patients, four in HB patients, and 19 in VWD patients; of these, a total of 25 were novel. Disease-causing variations were used as positive controls for validation of the high-resolution melting (HRM) variant-scanning technique. This approach is a low-cost genetic diagnostic method proposed for incorporation in developing countries. For the data analysis, we developed accessible open-source code in Python that improves HRM data analysis, with a sensitivity of 95% and without bias when using different HRM equipment and software. Analysis of amplicons with a length greater than 300 bp can be performed by implementing an analysis by denaturation domains.
- Published
- 2021
- Full Text
- View/download PDF
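The entry above relies on high-resolution melting (HRM) curve analysis implemented in Python. The sketch below illustrates the two standard HRM steps, normalisation of each melt curve between pre- and post-melt temperature windows and a difference curve against a wild-type reference; it is not the authors' released code, and the temperature windows and synthetic curves are assumptions. In practice the windows are chosen per amplicon and the difference curves are clustered to flag putative variants.

```python
# Illustrative HRM variant-scanning sketch: normalise melt curves, then compare
# a sample against a wild-type reference via a difference curve.
import numpy as np

def normalise_curve(temps, fluorescence, pre_window, post_window):
    pre = fluorescence[(temps >= pre_window[0]) & (temps <= pre_window[1])].mean()
    post = fluorescence[(temps >= post_window[0]) & (temps <= post_window[1])].mean()
    return (fluorescence - post) / (pre - post) * 100.0   # 100% = fully double-stranded

def difference_curve(sample_norm, reference_norm):
    return sample_norm - reference_norm                    # large deviations suggest a variant

temps = np.linspace(75.0, 95.0, 201)
reference = 100.0 / (1.0 + np.exp((temps - 85.0) / 0.6))   # synthetic wild-type melt curve
sample = 100.0 / (1.0 + np.exp((temps - 84.6) / 0.6))      # synthetic, slightly shifted variant curve

ref_n = normalise_curve(temps, reference, (76.0, 78.0), (92.0, 94.0))
sam_n = normalise_curve(temps, sample, (76.0, 78.0), (92.0, 94.0))
print("max |difference| =", np.abs(difference_curve(sam_n, ref_n)).max())
```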
20. Predicting potential small molecule-miRNA associations based on bounded nuclear norm regularization.
- Author
-
Chen X, Zhou C, Wang CC, and Zhao Y
- Subjects
- Algorithms, Area Under Curve, Computational Biology standards, Drug Discovery standards, Humans, MicroRNAs genetics, ROC Curve, Reproducibility of Results, Small Molecule Libraries, Computational Biology methods, Drug Discovery methods, Ligands, MicroRNAs chemistry
- Abstract
Mounting evidence has demonstrated the significance of taking microRNAs (miRNAs) as the target of small molecule (SM) drugs for disease treatment. Given the fact that exploring new SM-miRNA associations through biological experiments is extremely expensive, several computational models have been constructed to reveal possible SM-miRNA associations. Here, we built a computational model of Bounded Nuclear Norm Regularization for SM-miRNA Association prediction (BNNRSMMA). Specifically, we first constructed a heterogeneous SM-miRNA network utilizing miRNA similarity, SM similarity, and confirmed SM-miRNA associations, and defined a matrix to represent the heterogeneous network. Then, we constructed a model to complete this matrix by minimizing its nuclear norm. The Alternating Direction Method of Multipliers was adopted to minimize the nuclear norm and obtain predicted scores. The main innovation lies in two aspects. During completion, we limited all elements of the matrix to the interval (0,1) to make sure they have practical significance. Besides, instead of strictly fitting all known elements, a regularization term was incorporated to tolerate the noise in integrated similarities. Furthermore, four kinds of cross-validations on two datasets and two types of case studies were performed to evaluate the predictive performance of BNNRSMMA. Finally, BNNRSMMA attained areas under the curve of 0.9822 (0.8433), 0.9793 (0.8852), 0.8253 (0.7350) and 0.9758 ± 0.0029 (0.8759 ± 0.0041) under global leave-one-out cross-validation (LOOCV), miRNA-fixed LOOCV, SM-fixed LOOCV and 5-fold cross-validation based on Dataset 1 (Dataset 2), respectively. With regard to case studies, plenty of the predicted associations have been verified by the experimental literature. All these results confirm that BNNRSMMA is a reliable tool for inferring associations., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
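The model above completes a partially observed SM-miRNA association matrix under a nuclear-norm (low-rank) objective with all entries bounded in (0,1). The sketch below conveys that idea with a generic singular-value-thresholding loop plus clipping; it is not the ADMM solver of BNNRSMMA, and unlike BNNRSMMA (which tolerates noise in the integrated similarities via a regularization term) it keeps the observed entries fixed. The threshold and iteration count are arbitrary.

```python
# Simplified sketch of bounded low-rank matrix completion: soft-threshold singular
# values (encouraging low rank), clip every element to [0, 1], and restore the
# observed entries on each iteration.
import numpy as np

def bounded_lowrank_complete(M, observed_mask, tau=1.0, n_iter=200):
    X = np.where(observed_mask, M, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - tau, 0.0)                 # soft-threshold singular values
        X = (U * s) @ Vt                             # low-rank surrogate
        X = np.clip(X, 0.0, 1.0)                     # keep scores in [0, 1]
        X[observed_mask] = M[observed_mask]          # keep known associations fixed
    return X

rng = np.random.default_rng(1)
truth = (rng.random((20, 15)) < 0.15).astype(float)  # toy SM-miRNA association matrix
mask = rng.random(truth.shape) < 0.7                 # 70% of entries treated as "known"
scores = bounded_lowrank_complete(truth, mask)
print(scores[~mask][:5])                             # predicted scores for unobserved pairs
```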
21. XOmiVAE: an interpretable deep learning model for cancer classification using high-dimensional omics data.
- Author
-
Withnell E, Zhang X, Sun K, and Guo Y
- Subjects
- Algorithms, Area Under Curve, Biomarkers, Tumor, Cluster Analysis, Computational Biology standards, Databases, Genetic, Female, Gene Expression Profiling methods, Gene Regulatory Networks, Genomics standards, Humans, Male, Neoplasms metabolism, ROC Curve, Reproducibility of Results, Signal Transduction, Computational Biology methods, Deep Learning, Genomics methods, Machine Learning, Neoplasms diagnosis, Neoplasms etiology
- Abstract
The lack of explainability is one of the most prominent disadvantages of deep learning applications in omics. This 'black box' problem can undermine the credibility and limit the practical implementation of biomedical deep learning models. Here we present XOmiVAE, a variational autoencoder (VAE)-based interpretable deep learning model for cancer classification using high-dimensional omics data. XOmiVAE is capable of revealing the contribution of each gene and latent dimension for each classification prediction and the correlation between each gene and each latent dimension. It is also demonstrated that XOmiVAE can explain not only the supervised classification but also the unsupervised clustering results from the deep learning network. To the best of our knowledge, XOmiVAE is one of the first activation level-based interpretable deep learning models explaining novel clusters generated by VAE. The explainable results generated by XOmiVAE were validated by both the performance of downstream tasks and the biomedical knowledge. In our experiments, XOmiVAE explanations of deep learning-based cancer classification and clustering aligned with current domain knowledge including biological annotation and academic literature, which shows great potential for novel biomedical knowledge discovery from deep learning models., (© The Author(s) 2021. Published by Oxford University Press.)
- Published
- 2021
- Full Text
- View/download PDF
22. A machine learning framework to predict antibiotic resistance traits and yet unknown genes underlying resistance to specific antibiotics in bacterial strains.
- Author
-
Sunuwar J and Azad RK
- Subjects
- Algorithms, Bacteria genetics, Computational Biology standards, Genotype, Phenotype, Reproducibility of Results, Anti-Bacterial Agents pharmacology, Bacteria drug effects, Computational Biology methods, Drug Resistance, Bacterial drug effects, Machine Learning
- Abstract
Recently, the frequency of observing bacterial strains without known genetic components underlying phenotypic resistance to antibiotics has increased. There are several strains of bacteria lacking known resistance genes that nevertheless demonstrate a resistance phenotype to drugs of that family. Although such strains are few compared to the overall population, they pose a grave emerging threat to an already heavily challenged area of antimicrobial resistance (AMR), where death tolls have reached ~700,000 per year and a grim projection of ~10 million deaths per year by 2050 looms. Considering that the development of novel antibiotics is not keeping pace with the emergence and dissemination of resistance, there is a pressing need to decipher yet unknown genetic mechanisms of resistance, which will enable developing strategies for the best use of available interventions and show the way for the development of new drugs. In this study, we present a machine learning framework to predict novel AMR factors that are potentially responsible for resistance to specific antimicrobial drugs. The machine learning framework utilizes whole-genome sequencing AMR genetic data and antimicrobial susceptibility testing phenotypic data to predict resistance phenotypes and rank AMR genes by their importance in discriminating the resistant from the susceptible phenotypes. In summary, we present here a bioinformatics framework for training machine learning models, evaluating their performances, selecting the best performing model(s) and finally predicting the most important AMR loci for the resistance involved., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
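The framework above trains classifiers on genotypic AMR features against susceptibility phenotypes and then ranks genes by their discriminative importance. A hedged sketch of that generic workflow with scikit-learn follows; the random-forest choice, the presence/absence encoding and the toy data are assumptions for illustration, not the authors' exact model selection.

```python
# Generic sketch: fit a classifier on gene presence/absence features against
# resistant/susceptible labels, then rank genes by feature importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_isolates, n_genes = 120, 50
X = rng.integers(0, 2, size=(n_isolates, n_genes))        # gene presence/absence matrix
y = (X[:, 3] | X[:, 17]).astype(int)                      # toy phenotype driven by two loci

model = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

model.fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
print("top candidate loci:", ranking[:5])                  # genes 3 and 17 should rank highly
```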
23. The accurate prediction and characterization of cancerlectin by a combined machine learning and GO analysis.
- Author
-
Tang F, Zhang L, Xu L, Zou Q, and Feng H
- Subjects
- Algorithms, Amino Acid Sequence, Computational Biology standards, Databases, Genetic, Disease Susceptibility, Lectins chemistry, Neoplasms diagnosis, Neoplasms etiology, Peptides chemistry, Peptides metabolism, Phylogeny, Protein Binding, Protein Interaction Domains and Motifs, Structure-Activity Relationship, Workflow, Computational Biology methods, Gene Ontology, Lectins metabolism, Machine Learning, Neoplasms metabolism
- Abstract
Cancerlectins, lectins linked to tumor progression, have become a focus of cancer therapy research for their carbohydrate-binding specificity. However, the specific characteristics of cancerlectins involved in tumor progression are still unclear. By taking advantage of the g-gap tripeptide and tetrapeptide composition feature descriptors, we increased the accuracy of the classification models for cancerlectins and lectins to 98.54% and 95.38%, respectively. A total of 36 cancerlectin and 135 lectin features were selected for functional characterization by the P/N feature ranking method, which preferentially selects features from positive samples. The specific protein domains of cancerlectins were found to be p-GalNAc-T, crystal and annexin by comparison with lectins through the exclusion method. Moreover, the combined GO analysis showed that the conserved cation-binding sites of cancerlectin-specific domains are covered by the selected feature peptides, suggesting that the capability of cation binding, critical for enzyme activity and stability, could be a key characteristic of cancerlectins in tumor progression. These results will help to identify potential cancerlectins and provide clues for mechanistic studies of cancerlectins in tumor progression., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
24. Matrix factorization-based data fusion for the prediction of RNA-binding proteins and alternative splicing event associations during epithelial-mesenchymal transition.
- Author
-
Qiu Y, Ching WK, and Zou Q
- Subjects
- Algorithms, Computational Biology standards, Humans, ROC Curve, Reproducibility of Results, Software, Alternative Splicing, Computational Biology methods, Epithelial-Mesenchymal Transition genetics, Gene Expression Regulation, Neoplastic, RNA-Binding Proteins metabolism
- Abstract
Motivation: The epithelial-mesenchymal transition (EMT) is a cellular-developmental process activated during tumor metastasis. Transcriptional regulatory networks controlling EMT are well studied; however, alternative RNA splicing also plays a critical regulatory role during this process. Unfortunately, a comprehensive understanding of alternative splicing (AS) and the RNA-binding proteins (RBPs) that regulate it during EMT is still largely lacking. Therefore, a great need exists to develop effective computational methods for predicting associations of RBPs and AS events. Dramatically increasing data sources that have direct and indirect information associated with RBPs and AS events have provided an ideal platform for inferring these associations., Results: In this study, we propose a novel method for RBP-AS target prediction based on weighted data fusion with sparse matrix tri-factorization (WDFSMF in short) that simultaneously decomposes heterogeneous data source matrices into low-rank matrices to reveal hidden associations. WDFSMF can select and integrate data sources by assigning different weights to those sources, and these weights can be assigned automatically. In addition, WDFSMF can identify significant RBP complexes regulating AS events and eliminate noise and outliers from the data. Our proposed method achieves an area under the receiver operating characteristic curve (AUC) of 90.78%, which shows that WDFSMF can effectively predict RBP-AS event associations with higher accuracy compared with previous methods. Furthermore, this study identifies significant RBPs as complexes for AS events during EMT and provides solid ground for further investigation into RNA regulation during EMT and metastasis. WDFSMF is a general data fusion framework, and as such it can also be adapted to predict associations between other biological entities., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
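The method above is built on sparse matrix tri-factorization, decomposing an association matrix as X ≈ F S Gᵀ with non-negative factors. The sketch below implements plain non-negative matrix tri-factorization with standard multiplicative updates on a single toy association matrix; the weighted fusion of multiple data sources and the sparsity constraints of WDFSMF are omitted, and all dimensions are arbitrary.

```python
# Minimal non-negative matrix tri-factorization (X ≈ F S Gᵀ) with multiplicative
# updates minimising the Frobenius reconstruction error under non-negativity.
import numpy as np

def nmtf(X, k1, k2, n_iter=500, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k1)); S = rng.random((k1, k2)); G = rng.random((m, k2))
    for _ in range(n_iter):
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

X = (np.random.default_rng(1).random((30, 20)) < 0.2).astype(float)  # toy RBP-AS matrix
F, S, G = nmtf(X, k1=5, k2=4)
reconstruction = F @ S @ G.T            # scores usable to rank unobserved RBP-AS pairs
print(np.linalg.norm(X - reconstruction))
```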
25. Cell fate conversion prediction by group sparse optimization method utilizing single-cell and bulk OMICs data.
- Author
-
Qin J, Hu Y, Yao JC, Leung RWT, Zhou Y, Qin Y, and Wang J
- Subjects
- Algorithms, Animals, Binding Sites, Cell Lineage genetics, Cell Physiological Phenomena genetics, Chromatin Immunoprecipitation Sequencing, Computational Biology standards, Embryonic Stem Cells cytology, Embryonic Stem Cells metabolism, Enhancer Elements, Genetic, Gene Expression Profiling, Gene Expression Regulation, Genomics standards, Humans, Mice, Promoter Regions, Genetic, Protein Binding, Single-Cell Analysis standards, Transcription Factors metabolism, Transcriptome, Workflow, Computational Biology methods, Genomics methods, Single-Cell Analysis methods
- Abstract
Cell fate conversion by overexpressing defined factors is a powerful tool in regenerative medicine. However, identifying key factors for cell fate conversion requires laborious experimental effort; thus, many such conversions have not been achieved yet. Moreover, the cell fate conversions reported in many published studies were incomplete, as the expression of important gene sets could not be manipulated thoroughly. Therefore, the identification of master transcription factors for complete and efficient conversion is crucial to render this technology more applicable clinically. In the past decade, systematic analyses of various single-cell and bulk OMICs data have uncovered numerous gene regulatory mechanisms and made it possible to predict master gene regulators during cell fate conversion. By virtue of the sparse structure of master transcription factors and the group structure of their simultaneous regulatory effects on the cell fate conversion process, this study introduces a novel computational method for predicting master transcription factors based on a group sparse optimization technique that integrates data from multiple OMICs levels, and which is applicable to both single-cell and bulk OMICs data with a high tolerance of data sparsity. When compared with current prediction methods by cross-referencing published and validated master transcription factors, it shows superior performance. In short, this method facilitates fast identification of key regulators, raises the possibility of a higher successful conversion rate, and holds promise for reducing experimental cost., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
26. Detecting methylation quantitative trait loci using a methylation random field method.
- Author
-
Lyu C, Huang M, Liu N, Chen Z, Lupo PJ, Tycko B, Witte JS, Hobbs CA, and Li M
- Subjects
- Algorithms, Alleles, Bayes Theorem, Computational Biology standards, CpG Islands, Data Analysis, Epigenomics standards, Genotype, Humans, Organ Specificity genetics, Polymorphism, Single Nucleotide, Computational Biology methods, DNA Methylation, Epigenesis, Genetic, Epigenomics methods, Quantitative Trait Loci
- Abstract
DNA methylation may be regulated by genetic variants within a genomic region, referred to as methylation quantitative trait loci (mQTLs). The changes of methylation levels can further lead to alterations of gene expression, and influence the risk of various complex human diseases. Detecting mQTLs may provide insights into the underlying mechanism of how genotypic variations may influence the disease risk. In this article, we propose a methylation random field (MRF) method to detect mQTLs by testing the association between the methylation level of a CpG site and a set of genetic variants within a genomic region. The proposed MRF has two major advantages over existing approaches. First, it uses a beta distribution to characterize the bimodal and interval properties of the methylation trait at a CpG site. Second, it considers multiple common and rare genetic variants within a genomic region to identify mQTLs. Through simulations, we demonstrated that the MRF had improved power over other existing methods in detecting rare variants of relatively large effect, especially when the sample size is small. We further applied our method to a study of congenital heart defects with 83 cardiac tissue samples and identified two mQTL regions, MRPS10 and PSORS1C1, which were colocalized with expression QTL in cardiac tissue. In conclusion, the proposed MRF is a useful tool to identify novel mQTLs, especially for studies with limited sample sizes., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
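A key modelling choice in the entry above is using a beta distribution to capture the bounded, often bimodal distribution of methylation levels at a CpG site. The short sketch below fits a beta distribution to toy methylation values with SciPy and compares its log-likelihood with a normal fit; it illustrates only this marginal modelling choice, not the region-based MRF association test itself, and the toy data are assumed.

```python
# Sketch: methylation beta-values live on (0, 1), so a beta distribution fitted by
# maximum likelihood typically describes them better than a normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
methylation = rng.beta(0.4, 0.7, size=500)     # toy U-shaped (bimodal), bounded values

a, b, loc, scale = stats.beta.fit(methylation, floc=0, fscale=1)   # fix support to (0, 1)
beta_ll = stats.beta.logpdf(methylation, a, b, loc, scale).sum()
norm_ll = stats.norm.logpdf(methylation, *stats.norm.fit(methylation)).sum()
print(f"beta log-likelihood {beta_ll:.1f} vs normal {norm_ll:.1f}")
```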
27. ARAMIS: From systematic errors of NGS long reads to accurate assemblies.
- Author
-
Sacristán-Horcajada E, González-de la Fuente S, Peiró-Pastor R, Carrasco-Ramiro F, Amils R, Requena JM, Berenguer J, and Aguado B
- Subjects
- Algorithms, Base Composition, Computational Biology standards, Genomics methods, Sequence Analysis, DNA methods, Sequence Analysis, DNA standards, Workflow, Computational Biology methods, High-Throughput Nucleotide Sequencing, INDEL Mutation, Software
- Abstract
NGS long-read sequencing technologies (or third-generation technologies) such as Pacific Biosciences (PacBio) have revolutionized the sequencing field over the last decade, improving multiple genomic applications such as de novo genome assemblies. However, their error rate, mostly involving insertions and deletions (indels), is currently an important concern that requires special attention. Multiple algorithms are available to fix these sequencing errors using short reads (such as Illumina), although they require long processing times and some errors may persist. Here, we present Accurate long-Reads Assembly correction Method for Indel errorS (ARAMIS), the first NGS long-read indel correction pipeline that combines several correction software tools in just one step using accurate short reads. As a proof of concept, six organisms were selected based on their different GC content, size and genome complexity, and their PacBio-assembled genomes were corrected thoroughly with this pipeline. We found that systematic sequencing errors in PacBio long reads affect homopolymeric regions, and that the type of indel error introduced during PacBio sequencing is related to the GC content of the organism. Lack of awareness of this fact has led to numerous published studies in which such errors remain and should be resolved, since they may convey incorrect biological information. ARAMIS yields better results with fewer computational resources than other correction tools and offers the possibility of detecting the nature of the indel errors found and their distribution along the genome. The source code of ARAMIS is available at https://github.com/genomics-ngsCBMSO/ARAMIS.git., (© The Author(s) 2021. Published by Oxford University Press.)
- Published
- 2021
- Full Text
- View/download PDF
28. Improved pathogenicity prediction for rare human missense variants.
- Author
-
Wu Y, Li R, Sun S, Weile J, and Roth FP
- Subjects
- Genetic Predisposition to Disease, Humans, Phenotype, Precision Medicine, Software, Algorithms, Computational Biology standards, Disease etiology, Mutation, Missense
- Abstract
The success of personalized genomic medicine depends on our ability to assess the pathogenicity of rare human variants, including the important class of missense variation. There are many challenges in training accurate computational systems, e.g., in finding the balance between quantity, quality, and bias in the variant sets used as training examples and avoiding predictive features that can accentuate the effects of bias. Here, we describe VARITY, which judiciously exploits a larger reservoir of training examples with uncertain accuracy and representativity. To limit circularity and bias, VARITY excludes features informed by variant annotation and protein identity. To provide a rationale for each prediction, we quantified the contribution of features and feature combinations to the pathogenicity inference of each variant. VARITY outperformed all previous computational methods evaluated, identifying at least 10% more pathogenic variants at thresholds achieving high (90% precision) stringency., Competing Interests: Declaration of interests F.P.R. is a scientific advisor holding shares in Constantiam Biosciences and BioSymetrics and a Ranomics shareholder. S.S. is currently employed by Sanofi Pasteur (Canada). The authors declare no other competing interests., (Copyright © 2021 The Authors. Published by Elsevier Inc. All rights reserved.)
- Published
- 2021
- Full Text
- View/download PDF
29. Keeping checks on machine learning.
- Subjects
- Big Data, Reproducibility of Results, Computational Biology methods, Computational Biology standards, Machine Learning standards, Research Design standards
- Published
- 2021
- Full Text
- View/download PDF
30. Reproducibility standards for machine learning in the life sciences.
- Author
-
Heil BJ, Hoffman MM, Markowetz F, Lee SI, Greene CS, and Hicks SC
- Subjects
- Reproducibility of Results, Software, Computational Biology methods, Computational Biology standards, Machine Learning standards
- Published
- 2021
- Full Text
- View/download PDF
31. DOME: recommendations for supervised machine learning validation in biology.
- Author
-
Walsh I, Fishman D, Garcia-Gasulla D, Titma T, Pollastri G, Harrow J, Psomopoulos FE, and Tosatto SCE
- Subjects
- Algorithms, Humans, Models, Biological, Computational Biology methods, Computational Biology standards, Guidelines as Topic, Research Design, Supervised Machine Learning
- Published
- 2021
- Full Text
- View/download PDF
32. Avoiding a replication crisis in deep-learning-based bioimage analysis.
- Author
-
Laine RF, Arganda-Carreras I, Henriques R, and Jacquemet G
- Subjects
- Microscopy methods, Microscopy standards, Biomedical Research methods, Biomedical Research standards, Computational Biology methods, Computational Biology standards, Deep Learning standards, Image Processing, Computer-Assisted standards
- Published
- 2021
- Full Text
- View/download PDF
33. PacBio sequencing output increased through uniform and directional fivefold concatenation.
- Author
-
Kanwar N, Blanco C, Chen IA, and Seelig B
- Subjects
- Animals, Base Sequence, Computational Biology standards, Gene Library, Mice, Molecular Sequence Annotation, Sequence Analysis, DNA standards, Sequence Analysis, Protein, Computational Biology methods, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods
- Abstract
Advances in sequencing technology have allowed researchers to sequence DNA with greater ease and at decreasing costs. Main developments have focused on sequencing either many short sequences or fewer large sequences. Methods for sequencing mid-sized sequences of 600-5,000 bp are currently less efficient. For example, the PacBio Sequel I system yields ~100,000-300,000 reads with a per-base-pair accuracy of 90-99%. We sought to sequence several DNA populations of ~870 bp in length with a sequencing accuracy of 99% and to the greatest depth possible. We optimised a simple, robust method to concatenate genes of ~870 bp five times and then sequenced the resulting DNA of ~5,000 bp by PacBio SMRT long-read sequencing. Our method improved upon previously published concatenation attempts, leading to a greater sequencing depth, high-quality reads and limited sample preparation at little expense. We applied this efficient concatenation protocol to sequence nine DNA populations from a protein engineering study. The improved method is accompanied by a simple and user-friendly analysis pipeline, DeCatCounter, to sequence medium-length sequences efficiently at one-fifth of the cost., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
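The protocol above concatenates ~870 bp genes fivefold before PacBio sequencing, so downstream analysis must split each long read back into its monomer units. The sketch below is a generic exact-match splitter at a hypothetical linker sequence, shown only to make the deconcatenation step concrete; it is not the DeCatCounter pipeline, which additionally has to tolerate sequencing errors in the junction sequence rather than rely on exact string matching.

```python
# Illustrative deconcatenation: split a concatemer read into monomer units at a
# known linker sequence (exact matching only; real data needs fuzzy matching).
def deconcatenate(read, linker):
    """Split a concatemer read at each exact linker occurrence and drop empty pieces."""
    return [unit for unit in read.split(linker) if unit]

linker = "GGATCC"                                    # hypothetical junction/linker sequence
monomer_a = "ACGT" * 10                              # stand-ins for ~870 bp gene units
monomer_b = "TTGCA" * 8
concatemer = linker.join([monomer_a, monomer_b, monomer_a, monomer_b, monomer_a])

units = deconcatenate(concatemer, linker)
print(len(units), "units recovered;", [len(u) for u in units])
```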
34. Metagenomics: a path to understanding the gut microbiome.
- Author
-
Yen S and Johnson JS
- Subjects
- Humans, Metagenome genetics, Microbiota genetics, RNA, Ribosomal, 16S genetics, Computational Biology standards, Gastrointestinal Microbiome genetics, High-Throughput Nucleotide Sequencing standards, Metagenomics
- Abstract
The gut microbiome is a major determinant of host health, yet it is only in the last 2 decades that the advent of next-generation sequencing has enabled it to be studied at a genomic level. Shotgun sequencing is beginning to provide insight into the prokaryotic as well as eukaryotic and viral components of the gut community, revealing not just their taxonomy, but also the functions encoded by their collective metagenome. This revolution in understanding is being driven by continued development of sequencing technologies and in consequence necessitates reciprocal development of computational approaches that can adapt to the evolving nature of sequence datasets. In this review, we provide an overview of current bioinformatic strategies for handling metagenomic sequence data and discuss their strengths and limitations. We then go on to discuss key technological developments that have the potential to once again revolutionise the way we are able to view and hence understand the microbiome., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
35. Highly accurate protein structure prediction for the human proteome.
- Author
-
Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, Bridgland A, Cowie A, Meyer C, Laydon A, Velankar S, Kleywegt GJ, Bateman A, Evans R, Pritzel A, Figurnov M, Ronneberger O, Bates R, Kohl SAA, Potapenko A, Ballard AJ, Romera-Paredes B, Nikolov S, Jain R, Clancy E, Reiman D, Petersen S, Senior AW, Kavukcuoglu K, Birney E, Kohli P, Jumper J, and Hassabis D
- Subjects
- Datasets as Topic standards, Diacylglycerol O-Acyltransferase chemistry, Glucose-6-Phosphatase chemistry, Humans, Membrane Proteins chemistry, Protein Folding, Reproducibility of Results, Computational Biology standards, Deep Learning standards, Models, Molecular, Protein Conformation, Proteome chemistry
- Abstract
Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure [1]. Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold [2], at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
36. Highly accurate protein structure prediction with AlphaFold.
- Author
-
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, and Hassabis D
- Subjects
- Amino Acid Sequence, Computational Biology methods, Computational Biology standards, Databases, Protein, Deep Learning standards, Models, Molecular, Reproducibility of Results, Sequence Alignment, Neural Networks, Computer, Protein Conformation, Protein Folding, Proteins chemistry
- Abstract
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1-4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence-the structure prediction component of the 'protein folding problem' [8]-has been an important open research problem for more than 50 years [9]. Despite recent progress [10-14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
37. Gene name errors: Lessons not learned.
- Author
-
Abeysooriya M, Soria M, Kasu MS, and Ziemann M
- Subjects
- Humans, PubMed, Software, Terminology as Topic, Computational Biology standards, Genes genetics, Molecular Sequence Annotation standards
- Abstract
Erroneous conversion of gene names into dates and other data types has been a frustration for computational biologists for years. We hypothesized that such errors in supplementary files might diminish after a report in 2016 highlighting the extent of the problem. To assess this, we performed a scan of supplementary files published in PubMed Central from 2014 to 2020. Overall, gene name errors continued to accumulate unabated in the period after 2016. Improved scanning software that we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists, a figure significantly higher than previously estimated. This is because gene names are converted not just to dates and floating-point numbers, but also to the internal date format (five-digit numbers). These findings further reinforce that spreadsheets are ill-suited to use with large genomic data., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2021
- Full Text
- View/download PDF
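The entry above (item 37) reports gene symbols in supplementary spreadsheets being silently converted to dates, floating-point numbers, or five-digit internal date serials. The following is a minimal sketch of the kind of screen such a scanner might apply to an exported gene-list column; the regular expressions, example symbols, and function name are illustrative assumptions, not the authors' published software.

```python
import re

# Patterns that commonly betray spreadsheet auto-conversion of gene symbols (illustrative)
DATE_LIKE = re.compile(r"^\d{1,2}-(Mar|Sep|Oct|Dec)$", re.IGNORECASE)  # e.g. MARCH1 -> 1-Mar
SERIAL_LIKE = re.compile(r"^\d{5}$")                                   # internal date serials, e.g. 43891
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E[+-]?\d+$", re.IGNORECASE)      # e.g. 2310009E13 -> 2.31E+22

def flag_converted(values):
    """Return list entries that look like mangled gene identifiers."""
    return [v for v in values
            if DATE_LIKE.match(v) or SERIAL_LIKE.match(v) or FLOAT_LIKE.match(v)]

print(flag_converted(["TP53", "1-Mar", "43891", "2.31E+22", "SEPT2"]))
# -> ['1-Mar', '43891', '2.31E+22']
```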
38. Performance and scaling behavior of bioinformatic applications in virtualization environments to create awareness for the efficient use of compute resources.
- Author
-
Hanussek M, Bartusch F, and Krüger J
- Subjects
- Algorithms, Benchmarking, Cloud Computing, Computational Biology standards, Computational Biology statistics & numerical data, Computers, Computing Methodologies, Data Interpretation, Statistical, Databases, Factual statistics & numerical data, High-Throughput Nucleotide Sequencing, Humans, Image Interpretation, Computer-Assisted, Machine Learning, Sequence Alignment, Software, User-Computer Interface, Computational Biology methods
- Abstract
The large amount of biological data available today makes it necessary to use tools and applications based on the sophisticated and efficient algorithms developed in bioinformatics. Furthermore, access to high-performance computing resources is necessary to achieve results in a reasonable time. To speed up applications and use available compute resources as efficiently as possible, software developers make use of parallelization mechanisms such as multithreading. Many of the available bioinformatics tools offer multithreading capabilities, but more compute power is not always helpful. In this study we investigated the behavior of well-known bioinformatics applications with respect to their scaling performance, different virtual environments, and different datasets, using our benchmarking tool suite BOOTABLE. The tool suite includes the tools BBMap, Bowtie2, BWA, Velvet, IDBA, SPAdes, Clustal Omega, MAFFT, SINA, and GROMACS. In addition, we added an application using the machine learning framework TensorFlow. Machine learning is not directly part of bioinformatics but is applied to many biological problems, especially in the context of medical images (X-ray photographs). The tools mentioned above were analyzed in two different virtual environments: a virtual machine environment based on the OpenStack cloud software and a Docker environment. The measured performance values were compared to a bare-metal setup and to each other. The study reveals that the virtual environments used produce an overhead in the range of seven to twenty-five percent compared with the bare-metal environment. The scaling measurements showed that some of the analyzed tools do not benefit from larger amounts of computing resources, whereas others showed almost linear scaling behavior. The findings of this study have been generalized as far as possible and should help users find the best amount of resources for their analyses. Furthermore, the results provide valuable information for resource providers to manage their resources as efficiently as possible and raise the user community's awareness of the efficient usage of computing resources., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2021
- Full Text
- View/download PDF
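The entry above (item 38) reports virtualization overhead of roughly 7-25% relative to bare metal and differing scaling behavior across tools. As a worked illustration of how such figures are derived, the sketch below computes relative overhead, strong-scaling speedup, and parallel efficiency from wall-clock times; the numbers are invented for illustration and are not BOOTABLE measurements.

```python
def overhead_pct(virtual_s: float, bare_metal_s: float) -> float:
    """Relative slowdown of a virtualized run versus the bare-metal run."""
    return 100.0 * (virtual_s - bare_metal_s) / bare_metal_s

def speedup_and_efficiency(t1: float, tn: float, n: int) -> tuple[float, float]:
    """Strong-scaling metrics: speedup = T1 / Tn, efficiency = speedup / n."""
    s = t1 / tn
    return s, s / n

# invented example numbers
print(f"{overhead_pct(125.0, 100.0):.0f}% overhead")            # -> 25% overhead
print(speedup_and_efficiency(t1=800.0, tn=130.0, n=8))           # -> ~6.2x speedup, ~0.77 efficiency
```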
39. Review of Methodological Approaches to Human Milk Small Extracellular Vesicle Proteomics.
- Author
-
Vahkal B, Kraft J, Ferretti E, Chung M, Beaulieu JF, and Altosaar I
- Subjects
- Computational Biology methods, Computational Biology standards, Exosomes chemistry, Humans, Mass Spectrometry methods, Mass Spectrometry standards, Proteomics standards, Reproducibility of Results, Ultracentrifugation methods, Ultracentrifugation standards, Extracellular Vesicles chemistry, Milk Proteins analysis, Milk, Human chemistry, Proteomics methods
- Abstract
Proteomics can map extracellular vesicles (EVs), including exosomes, across disease states between organisms and cell types. Due to the diverse origin and cargo of EVs, tailoring methodological and analytical techniques can support the reproducibility of results. Proteomics scans are sensitive to in-sample contaminants, which can be retained during EV isolation procedures. Contaminants can also arise from the biological origin of exosomes, such as the lipid-rich environment in human milk. Human milk (HM) EVs and exosomes are emerging as a research interest in health and disease, though the experimental characterization and functional assays remain varied. Past studies of HM EV proteomes have used data-dependent acquisition methods for protein detection; however, improvements in data-independent acquisition could allow previously undetected EV proteins to be identified by mass spectrometry. Depending on the research question, a specific population of proteins can be compared and measured using isotope and other labelling techniques. In this review, we summarize published HM EV proteomics protocols and suggest a methodological workflow with the end goal of effective and reproducible analysis of human milk EV proteomes.
- Published
- 2021
- Full Text
- View/download PDF
40. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic.
- Author
-
Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, Haussler D, and Corbett-Detig R
- Subjects
- Algorithms, Computational Biology standards, Databases, Genetic, Genome, Viral, Humans, Molecular Sequence Annotation, Mutation, Web Browser, COVID-19 epidemiology, COVID-19 virology, Computational Biology methods, Phylogeny, SARS-CoV-2 classification, SARS-CoV-2 genetics, Software
- Abstract
As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of 'genomic contact tracing'-that is, using viral genomes to trace local transmission dynamics. However, because the viral phylogeny is already so large-and will undoubtedly grow many fold-placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach greatly improves the speed of phylogenetic placement of new samples and data visualization, making it possible to complete the placements under the constraints of real-time contact tracing. Thus, our method addresses an important need for maintaining a fully updated reference phylogeny. We make these tools available to the research community through the University of California Santa Cruz SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for SARS-CoV-2 specifically for laboratories worldwide.
- Published
- 2021
- Full Text
- View/download PDF
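The entry above (item 40) encodes the SARS-CoV-2 phylogeny as a tree annotated with mutations so that a new genome can be placed where it implies the fewest additional changes. The following is a toy, hypothetical sketch of that parsimony-style placement idea; the flat dictionary "tree", the mutation labels, and the scoring are simplified assumptions for illustration and are not UShER's mutation-annotated tree implementation.

```python
# Toy model: each node carries the set of mutations on the path from the root.
TREE = {
    "root": set(),
    "cladeA": {"C241T", "A23403G"},
    "cladeA.1": {"C241T", "A23403G", "C3037T"},
    "cladeB": {"G11083T"},
}

def place(sample_mutations: set) -> str:
    """Attach the sample to the node minimizing the extra mutations implied."""
    def cost(node_muts: set) -> int:
        # mutations the sample has but the node lacks, plus reversions needed
        return len(sample_mutations ^ node_muts)
    return min(TREE, key=lambda node: cost(TREE[node]))

print(place({"C241T", "A23403G", "C3037T", "A99999T"}))  # -> 'cladeA.1' (1 extra mutation)
```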
41. nPhase: an accurate and contiguous phasing method for polyploids.
- Author
-
Abou Saada O, Tsouris A, Eberlein C, Friedrich A, and Schacherer J
- Subjects
- Algorithms, Computational Biology standards, Databases, Genetic, High-Throughput Nucleotide Sequencing, Humans, Reproducibility of Results, Sequence Analysis, DNA, Workflow, Computational Biology methods, Genomics methods, Polyploidy, Software
- Abstract
While genome sequencing and assembly are now routine, we do not have a full, precise picture of polyploid genomes. No existing polyploid phasing method provides accurate and contiguous haplotype predictions. We developed nPhase, a ploidy agnostic tool that leverages long reads and accurate short reads to solve alignment-based phasing for samples of unspecified ploidy ( https://github.com/OmarOakheart/nPhase ). nPhase is validated by tests on simulated and real polyploids. nPhase obtains on average over 95% accuracy and a contiguous 1.25 haplotigs per haplotype to cover more than 90% of each chromosome (heterozygosity rate ≥ 0.5%). nPhase allows population genomics and hybrid studies of polyploids.
- Published
- 2021
- Full Text
- View/download PDF
42. Complete vertebrate mitogenomes reveal widespread repeats and gene duplications.
- Author
-
Formenti G, Rhie A, Balacco J, Haase B, Mountcastle J, Fedrigo O, Brown S, Capodiferro MR, Al-Ajli FO, Ambrosini R, Houde P, Koren S, Oliver K, Smith M, Skelton J, Betteridge E, Dolucan J, Corton C, Bista I, Torrance J, Tracey A, Wood J, Uliano-Silva M, Howe K, McCarthy S, Winkler S, Kwak W, Korlach J, Fungtammasan A, Fordham D, Costa V, Mayes S, Chiara M, Horner DS, Myers E, Durbin R, Achilli A, Braun EL, Phillippy AM, and Jarvis ED
- Subjects
- Animals, Computational Biology methods, Computational Biology standards, Evolution, Molecular, High-Throughput Nucleotide Sequencing, Gene Duplication, Genome, Mitochondrial, Genomics methods, Repetitive Sequences, Nucleic Acid, Vertebrates genetics
- Abstract
Background: Modern sequencing technologies should make the assembly of the relatively small mitochondrial genomes an easy undertaking. However, few tools exist that address mitochondrial assembly directly., Results: As part of the Vertebrate Genomes Project (VGP) we develop mitoVGP, a fully automated pipeline for similarity-based identification of mitochondrial reads and de novo assembly of mitochondrial genomes that incorporates both long (> 10 kbp, PacBio or Nanopore) and short (100-300 bp, Illumina) reads. Our pipeline leads to successful complete mitogenome assemblies of 100 vertebrate species of the VGP. We observe that tissue type and library size selection have considerable impact on mitogenome sequencing and assembly. Comparing our assemblies to purportedly complete reference mitogenomes based on short-read sequencing, we identify errors, missing sequences, and incomplete genes in those references, particularly in repetitive regions. Our assemblies also identify novel gene region duplications. The presence of repeats and duplications in over half of the species herein assembled indicates that their occurrence is a principle of mitochondrial structure rather than an exception, shedding new light on mitochondrial genome evolution and organization., Conclusions: Our results indicate that even in the "simple" case of vertebrate mitogenomes the completeness of many currently available reference sequences can be further improved, and caution should be exercised before claiming the complete assembly of a mitogenome, particularly from short reads alone.
- Published
- 2021
- Full Text
- View/download PDF
43. Bias in RNA-seq Library Preparation: Current Challenges and Solutions.
- Author
-
Shi H, Zhou Y, Jia E, Pan M, Bai Y, and Ge Q
- Subjects
- Bias, Computational Biology standards, Humans, Workflow, Gene Library, RNA analysis, RNA genetics, RNA-Seq methods, RNA-Seq standards
- Abstract
Although RNA sequencing (RNA-seq) has become the most advanced technology for transcriptome analysis, it still faces various challenges. The RNA-seq workflow is extremely complicated, and bias is easily introduced at several steps. Such bias can degrade the quality of RNA-seq datasets and lead to incorrect interpretation of the sequencing results. A detailed understanding of the source and nature of these biases is therefore essential for interpreting RNA-seq data, for finding ways to improve the quality of RNA-seq experiments, and for developing bioinformatics tools that compensate for these biases. Here, we discuss the sources of experimental bias in RNA-seq and, for each type of bias, the methods available to mitigate it, with the aim of providing useful guidance for researchers performing RNA-seq experiments., Competing Interests: There are no conflicts of interest to declare., (Copyright © 2021 Huajuan Shi et al.)
- Published
- 2021
- Full Text
- View/download PDF
44. Personalized genome structure via single gamete sequencing.
- Author
-
Lyu R, Tsui V, McCarthy DJ, and Crismani W
- Subjects
- Animals, Chromosome Mapping, Computational Biology methods, Computational Biology standards, Data Interpretation, Statistical, Genetic Heterogeneity, Genomics standards, Humans, Precision Medicine standards, Reproducibility of Results, Sensitivity and Specificity, Single-Cell Analysis standards, Whole Genome Sequencing, Genome, Genomics methods, Germ Cells metabolism, High-Throughput Nucleotide Sequencing methods, Precision Medicine methods, Single-Cell Analysis methods
- Abstract
Genetic maps have been fundamental to building our understanding of disease genetics and evolutionary processes. The gametes of an individual contain all of the information required to perform a de novo chromosome-scale assembly of an individual's genome, which historically has been performed with populations and pedigrees. Here, we discuss how single-cell gamete sequencing offers the potential to merge the advantages of short-read sequencing with the ability to build personalized genetic maps and open up an entirely new space in personalized genetics.
- Published
- 2021
- Full Text
- View/download PDF
45. A benchmark for RNA-seq deconvolution analysis under dynamic testing environments.
- Author
-
Jin H and Liu Z
- Subjects
- Computational Biology standards, Gene Expression Profiling methods, RNA-Seq standards, Reproducibility of Results, Computational Biology methods, RNA-Seq methods, Software
- Abstract
Background: Deconvolution analyses have been widely used to track compositional alterations of cell types in gene expression data. Although a large number of novel methods have been developed, due to a lack of understanding of the effects of modeling assumptions and tuning parameters, it is challenging for researchers to select an optimal deconvolution method suitable for the targeted biological conditions., Results: To systematically reveal the pitfalls and challenges of deconvolution analyses, we investigate the impact of several technical and biological factors including simulation model, quantification unit, component number, weight matrix, and unknown content by constructing three benchmarking frameworks. These frameworks cover comparative analysis of 11 popular deconvolution methods under 1766 conditions., Conclusions: We provide new insights to researchers for future application, standardization, and development of deconvolution tools on RNA-seq data.
- Published
- 2021
- Full Text
- View/download PDF
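The entry above (item 45) benchmarks methods that model bulk RNA-seq expression as a weighted mixture of cell-type signature profiles. A minimal sketch of that underlying mixture model using non-negative least squares follows; the signature matrix and bulk profile are toy numbers, not data or methods from the benchmark itself.

```python
import numpy as np
from scipy.optimize import nnls

# Toy signature matrix: rows are genes, columns are cell types
S = np.array([[10.0, 1.0],
              [ 2.0, 8.0],
              [ 5.0, 5.0]])
true_props = np.array([0.7, 0.3])
bulk = S @ true_props                  # simulate a bulk expression profile

weights, _ = nnls(S, bulk)             # solve bulk ~= S @ w with w >= 0
proportions = weights / weights.sum()  # normalize to cell-type proportions
print(proportions)                     # -> approximately [0.7, 0.3]
```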
46. Homopolish: a method for the removal of systematic errors in nanopore sequencing by homologous polishing.
- Author
-
Huang YT, Liu PY, and Shih PW
- Subjects
- Bacteria genetics, Fungi genetics, Genomics standards, Metagenome, Metagenomics, Nanopore Sequencing methods, Viruses genetics, Computational Biology methods, Computational Biology standards, Genomics methods, Nanopore Sequencing standards, Sequence Analysis, DNA methods, Sequence Analysis, DNA standards
- Abstract
Nanopore sequencing has been widely used for the reconstruction of microbial genomes. Owing to higher error rates, errors on the genome are corrected via neural networks trained by Nanopore reads. However, the systematic errors usually remain uncorrected. This paper designs a model that is trained by homologous sequences for the correction of Nanopore systematic errors. The developed program, Homopolish, outperforms Medaka and HELEN in bacteria, viruses, fungi, and metagenomic datasets. When combined with Medaka/HELEN, the genome quality can exceed Q50 on R9.4 flow cells. We show that Nanopore-only sequencing can produce high-quality microbial genomes sufficient for downstream analysis.
- Published
- 2021
- Full Text
- View/download PDF
47. MTSplice predicts effects of genetic variants on tissue-specific splicing.
- Author
-
Cheng J, Çelik MH, Kundaje A, and Gagneur J
- Subjects
- Autistic Disorder genetics, Brain metabolism, Computational Biology standards, Exons, Gene Expression Profiling, Gene Expression Regulation, Humans, Introns, Organ Specificity, Alternative Splicing, Computational Biology methods, Genetic Variation, Software
- Abstract
We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.
- Published
- 2021
- Full Text
- View/download PDF
48. Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis.
- Author
-
Yi H, Lin Y, Lin C, and Jin W
- Subjects
- Algorithms, Bacteria genetics, Computational Biology standards, Databases, Genetic, Genome, Bacterial, High-Throughput Nucleotide Sequencing, Metagenomics standards, Sequence Analysis, DNA, Computational Biology methods, Metagenomics methods, Software
- Abstract
Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacteria whole genome sequencing (WGS) runs from NCBI Sequence Read Archive and find misidentification or contamination in 6164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
- Published
- 2021
- Full Text
- View/download PDF
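The entry above (item 48) sketches genomes by sampling a deterministic subspace of k-mers so that similarity between very large datasets can be estimated from small sketches. The toy sketch below illustrates the general idea (deterministically sample k-mers by hash value, then estimate Jaccard similarity from the sketches); it illustrates k-mer sketching in general and is not Kssd's specific substring-space decomposition.

```python
import hashlib

def kmers(seq: str, k: int = 8):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def sketch(seq: str, k: int = 8, keep_fraction: int = 4):
    """Keep a deterministic ~1/keep_fraction sample of k-mers by hash value."""
    def h(km: str) -> int:
        return int.from_bytes(hashlib.sha1(km.encode()).digest()[:4], "big")
    return {km for km in kmers(seq, k) if h(km) % keep_fraction == 0}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = sketch("ACGTACGTTACGGATCGATCGATTACGATCG")
s2 = sketch("ACGTACGTTACGGATCGATCGATTACGAGCG")  # one substitution
print(round(jaccard(s1, s2), 2))  # similarity estimated from the sketches alone
```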
49. seqQscorer: automated quality control of next-generation sequencing data using machine learning.
- Author
-
Albrecht S, Sprang M, Andrade-Navarro MA, and Fontaine JF
- Subjects
- Algorithms, Computational Biology standards, Databases, Genetic, Genomics methods, Genomics standards, ROC Curve, Reproducibility of Results, Workflow, Computational Biology methods, High-Throughput Nucleotide Sequencing methods, Machine Learning, Quality Control, Software
- Abstract
Controlling quality of next-generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterize common NGS quality features and develop a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal and external functional genomics datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at https://github.com/salbrec/seqQscorer .
- Published
- 2021
- Full Text
- View/download PDF
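The entry above (item 49) trains tree-based and deep learning classifiers on common NGS quality-control features to flag problematic samples automatically. The sketch below shows the tree-based flavor of that idea on made-up QC features; the feature names, simulated data, and model settings are illustrative assumptions and do not reproduce seqQscorer's trained models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy QC feature matrix: [duplication_rate, gc_deviation, adapter_content]
good = rng.normal([0.10, 0.02, 0.01], 0.02, size=(100, 3))
bad = rng.normal([0.45, 0.10, 0.08], 0.05, size=(100, 3))
X = np.vstack([good, bad])
y = np.array([0] * 100 + [1] * 100)        # 0 = pass, 1 = fail

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[0.40, 0.09, 0.07]]))   # likely flagged as low quality -> [1]
```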
50. VarBen: Generating in Silico Reference Data Sets for Clinical Next-Generation Sequencing Bioinformatics Pipeline Evaluation.
- Author
-
Li Z, Fang S, Zhang R, Yu L, Zhang J, Bu D, Sun L, Zhao Y, and Li J
- Subjects
- Computational Biology standards, Genetic Association Studies standards, Genome-Wide Association Study methods, Genome-Wide Association Study standards, Humans, INDEL Mutation, Mutation, Polymorphism, Single Nucleotide, Reproducibility of Results, Computational Biology methods, Genetic Association Studies methods, Genetic Predisposition to Disease, Genetic Variation, High-Throughput Nucleotide Sequencing, Software
- Abstract
Next-generation sequencing is increasingly being adopted as a valuable method for the detection of somatic variants in clinical oncology. However, it is still challenging to reach a satisfactory level of robustness and standardization in clinical practice when using the currently available bioinformatics pipelines to detect variants from raw sequencing data. Moreover, appropriate reference data sets are lacking for clinical bioinformatics pipeline development, validation, and proficiency testing. Herein, we developed the Variant Benchmark tool (VarBen), an open-source software for variant simulation to generate customized reference data sets by directly editing the original sequencing reads. VarBen can introduce a variety of variants, including single-nucleotide variants, small insertions and deletions, and large structural variants, into targeted, exome, or whole-genome sequencing data, and can handle sequencing data from both the Illumina and Ion Torrent sequencing platforms. To demonstrate the feasibility and robustness of VarBen, we performed variant simulation on different sequencing data sets and compared the simulated variants with real-world data. The validation study showed that the simulated data are highly comparable to real-world data and that VarBen is a reliable tool for variant simulation. In addition, our collaborative study of somatic variant calling in 20 laboratories emphasizes the need for laboratories to evaluate their bioinformatics pipelines with customized reference data sets. VarBen may help users develop and validate their bioinformatics pipelines using locally generated sequencing data., (Copyright © 2021 Association for Molecular Pathology and American Society for Investigative Pathology. Published by Elsevier Inc. All rights reserved.)
- Published
- 2021
- Full Text
- View/download PDF
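The entry above (item 50) generates reference datasets by editing variants directly into real sequencing reads. The following is a toy sketch of the core read-editing idea for a single SNV, with reads represented as plain strings at known reference offsets; real read editing operates on BAM alignments and must handle base qualities, read pairs, and CIGAR strings, all of which this illustration ignores.

```python
def spike_in_snv(reads, snv_pos, alt_base, allele_fraction=0.5):
    """Edit alt_base into a fraction of the reads overlapping snv_pos.

    reads: list of (start_offset, sequence) tuples on the reference (toy model).
    """
    overlapping = sum(1 for start, seq in reads if start <= snv_pos < start + len(seq))
    target = int(round(overlapping * allele_fraction))
    edited, n_edited = [], 0
    for start, seq in reads:
        if start <= snv_pos < start + len(seq) and n_edited < target:
            i = snv_pos - start
            seq = seq[:i] + alt_base + seq[i + 1:]   # substitute the alternate base
            n_edited += 1
        edited.append((start, seq))
    return edited

reads = [(0, "ACGTACGT"), (2, "GTACGTAC"), (4, "ACGTACGT")]
print(spike_in_snv(reads, snv_pos=5, alt_base="T", allele_fraction=0.5))
```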