Descriptor: "bioinformatics" / Journal: bioinformatics - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"bioinformatics"' showing total 6,287 results

1. GRIEVOUS: your command-line general for resolving cross-dataset genotype inconsistencies

Author: Talwar, James V, Klie, Adam, Pagadala, Meghana S, and Carter, Hannah
Subjects: Biological Sciences, Genetics, Networking and Information Technology R&D (NITRD), Human Genome, Polymorphism, Single Nucleotide, Software, Genotype, Genome-Wide Association Study, Humans, Databases, Genetic, Alleles, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: Harmonizing variant indexing and allele assignments across datasets is crucial for data integrity in cross-dataset studies such as multi-cohort genome-wide association studies, meta-analyses, and the development, validation, and application of polygenic risk scores. Ensuring this indexing and allele consistency is a laborious, time-consuming, and error-prone process requiring a certain degree of computational proficiency. Here, we introduce GRIEVOUS, a command-line tool for cross-dataset variant homogenization. By means of an internal database and a custom indexing methodology, GRIEVOUS identifies, formats, and aligns all biallelic single nucleotide polymorphisms (SNPs) across all summary statistic and genotype files of interest. Upon completion of dataset harmonization, GRIEVOUS can also be used to extract the maximal set of biallelic SNPs common to all datasets. Availability and implementation: GRIEVOUS and all supporting documentation and tutorials can be found at https://github.com/jvtalwar/GRIEVOUS. It is freely and publicly available under the MIT license and can be installed via pip.
Published: 2024
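
Illustration for entry 1: the kind of allele alignment the abstract describes can be sketched as follows. This is a minimal sketch under stated assumptions, not GRIEVOUS's own indexing method or API: the function name `harmonize`, its arguments, and the rule of discarding strand-ambiguous (palindromic) SNPs are all hypothetical choices made for illustration.

```python
# Hypothetical sketch: align one biallelic SNP record to a reference orientation.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def harmonize(ref, alt, a1, a2, effect):
    """Express a record (a1 = effect allele, a2 = other allele) relative to (ref, alt).

    Returns (effect_allele, other_allele, effect) with the effect allele equal
    to `alt`, or None when the record cannot be aligned unambiguously.
    """
    # A/T and C/G SNPs look identical on both strands, so the strand is ambiguous.
    if COMPLEMENT[ref] == alt:
        return None
    for flip in (False, True):
        # Second pass: try the reverse-strand representation of the record.
        x1, x2 = (COMPLEMENT[a1], COMPLEMENT[a2]) if flip else (a1, a2)
        if (x1, x2) == (alt, ref):
            return (alt, ref, effect)      # already in the reference orientation
        if (x1, x2) == (ref, alt):
            return (alt, ref, -effect)     # alleles swapped: negate the effect
    return None                            # alleles do not match the reference

print(harmonize("A", "G", "A", "G", 0.5))
print(harmonize("A", "G", "C", "T", 0.2))
```

The first call swaps the alleles and negates the effect size; the second resolves a record reported on the opposite strand.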

2. SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models

Author: Gjoni, Ketrin and Pollard, Katherine S
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Networking and Information Technology R&D (NITRD), Machine Learning and Artificial Intelligence, Bioengineering, Humans, Computational Biology, Mutagenesis, Computer Simulation, Software, Machine Learning, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: The increasing development of sequence-based machine learning models has raised the demand for manipulating sequences for this application. However, existing approaches to edit and evaluate genome sequences using models have limitations, such as incompatibility with structural variants, challenges in identifying responsible sequence perturbations, and the need for vcf file inputs and phased data. To address these bottlenecks, we present Sequence Mutator for Predictive Models (SuPreMo), a scalable and comprehensive tool for performing and supporting in silico mutagenesis experiments. We then demonstrate how pairs of reference and perturbed sequences can be used with machine learning models to prioritize pathogenic variants or discover new functional sequences. Availability and implementation: SuPreMo was written in Python, and can be run using only one line of code to generate both sequences and 3D genome disruption scores. The codebase, instructions for installation and use, and tutorials are on the GitHub page: https://github.com/ketringjoni/SuPreMo.
Published: 2024

3. Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning

Author: Caufield, J Harry, Hegde, Harshad, Emonet, Vincent, Harris, Nomi L, Joachimiak, Marcin P, Matentzoglu, Nicolas, Kim, HyeongSik, Moxon, Sierra, Reese, Justin T, Haendel, Melissa A, Robinson, Peter N, and Mungall, Christopher J
Subjects: Data Management and Data Science, Information and Computing Sciences, Networking and Information Technology R&D (NITRD), Generic health relevance, Semantics, Knowledge Bases, Databases, Factual, Mathematical Sciences, Biological Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. Results: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly-available databases and ontologies external to the LLM. Availability and implementation: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.
Published: 2024

4. Identifying nucleotide-binding leucine-rich repeat receptor and pathogen effector pairing using transfer-learning and bilinear attention network.

Author: Qiao, Baixue, Wang, Shuda, Hou, Mingjun, Chen, Haodi, Zhou, Zhengwenyang, Xie, Xueying, Pang, Shaozi, Yang, Chunxue, Yang, Fenglong, Zou, Quan, and Sun, Shanwen
Subjects: *MACHINE learning, *PLANT breeding, *BIOLOGY, *FORECASTING, *BIOINFORMATICS, *DEEP learning
Abstract: Motivation: Nucleotide-binding leucine-rich repeat (NLR) family is a class of immune receptors capable of detecting and defending against pathogen invasion. They have been widely used in crop breeding. Notably, the correspondence between NLRs and effectors (CNE) determines the applicability and effectiveness of NLRs. Unfortunately, CNE data is very scarce. In fact, we've found a substantial 91 291 NLRs confirmed via wet experiments and bioinformatics methods but only 387 CNEs are recognized, which greatly restricts the potential application of NLRs. Results: We propose a deep learning algorithm called ProNEP to identify NLR-effector pairs in a high-throughput manner. Specifically, we conceptualized the CNE prediction task as a protein–protein interaction (PPI) prediction task. Then, ProNEP predicts the interaction between NLRs and effectors by combining the transfer learning with a bilinear attention network. ProNEP achieves superior performance against state-of-the-art models designed for PPI predictions. Based on ProNEP, we conduct extensive identification of potential CNEs for 91 291 NLRs. With the rapid accumulation of genomic data, we expect that this tool will be widely used to predict CNEs in new species, advancing biology, immunology, and breeding. Availability and implementation: The ProNEP is available at http://nerrd.cn/#/prediction. The project code is available at https://github.com/QiaoYJYJ/ProNEP.
Published: 2024

5. dsRID: in silico identification of dsRNA regions using long-read RNA-seq data

Author: Yamamoto, Ryo, Liu, Zhiheng, Choudhury, Mudra, and Xiao, Xinshu
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Alzheimer's Disease including Alzheimer's Disease Related Dementias (AD/ADRD), Human Genome, Alzheimer's Disease, Aging, Brain Disorders, Dementia, Neurosciences, Neurodegenerative, Acquired Cognitive Impairment, 1.1 Normal biological development and functioning, Underpinning research, Humans, RNA, Double-Stranded, RNA-Seq, Sequence Analysis, RNA, Base Sequence, Genome, Software, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Double-stranded RNAs (dsRNAs) are potent triggers of innate immune responses upon recognition by cytosolic dsRNA sensor proteins. Identification of endogenous dsRNAs helps to better understand the dsRNAome and its relevance to innate immunity related to human diseases. Results: Here, we report dsRID (double-stranded RNA identifier), a machine-learning-based method to predict dsRNA regions in silico, leveraging the power of long-read RNA-sequencing (RNA-seq) and molecular traits of dsRNAs. Using models trained with PacBio long-read RNA-seq data derived from Alzheimer's disease (AD) brain, we show that our approach is highly accurate in predicting dsRNA regions in multiple datasets. Applied to an AD cohort sequenced by the ENCODE consortium, we characterize the global dsRNA profile with potentially distinct expression patterns between AD and controls. Together, we show that dsRID provides an effective approach to capture global dsRNA profiles using long-read RNA-seq data. Availability and implementation: Software implementation of dsRID, and genomic coordinates of regions predicted by dsRID in all samples are available at the GitHub repository: https://github.com/gxiaolab/dsRID.
Published: 2023

6. cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory

Author: Johnson, Brian, Shuai, Yubo, Schweinsberg, Jason, and Curtius, Kit
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Cancer, Good Health and Well Being, Humans, Software, Neoplasms, Sequence Analysis, DNA, Phylogeny, Clone Cells, Mutation, Clonal Evolution, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: While evolutionary approaches to medicine show promise, measuring evolution itself is difficult due to experimental constraints and the dynamic nature of body systems. In cancer evolution, continuous observation of clonal architecture is impossible, and longitudinal samples from multiple timepoints are rare. Increasingly available DNA sequencing datasets at single-cell resolution enable the reconstruction of past evolution using mutational history, allowing for a better understanding of dynamics prior to detectable disease. There is an unmet need for an accurate, fast, and easy-to-use method to quantify clone growth dynamics from these datasets. Results: We derived methods based on coalescent theory for estimating the net growth rate of clones using either reconstructed phylogenies or the number of shared mutations. We applied and validated our analytical methods for estimating the net growth rate of clones, eliminating the need for complex simulations used in previous methods. When applied to hematopoietic data, we show that our estimates may have broad applications to improve mechanistic understanding and prognostic ability. Compared to clones with a single or unknown driver mutation, clones with multiple drivers have significantly increased growth rates (median 0.94 versus 0.25 per year; P = 1.6×10^-6). Further, stratifying patients with a myeloproliferative neoplasm (MPN) by the growth rate of their fittest clone shows that higher growth rates are associated with shorter time to MPN diagnosis (median 13.9 versus 26.4 months; P = 0.0026). Availability and implementation: We developed a publicly available R package, cloneRate, to implement our methods (Package website: https://bdj34.github.io/cloneRate/). Source code: https://github.com/bdj34/cloneRate/.
Published: 2023

7. BioCoder: a benchmark for bioinformatics code generation with large language models.

Author: Tang, Xiangru, Qian, Bill, Gao, Rick, Chen, Jiakang, Chen, Xinyun, and Gerstein, Mark B
Subjects: *LANGUAGE models, *GENERATIVE pre-trained transformers, *MODELS & modelmaking, *BIOINFORMATICS, *STEVEDORES
Abstract: Summary: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%). Availability and implementation: All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
Published: 2024
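
Illustration for entry 7: the abstract reports results in terms of Pass@K without stating how the metric is computed. The sketch below shows the unbiased estimator commonly used for this metric in code-generation benchmarks, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. Whether BioCoder uses exactly this estimator is an assumption here, and the function name `pass_at_k` is hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for a problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    # 1 minus the fraction of k-subsets drawn entirely from failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(4, 1, 2))
```

A benchmark-level score would then be the mean of this value over all problems.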

8. Predicting protein functions using positive-unlabeled ranking with ontology-based priors.

Author: Zhapa-Camacho, Fernando, Tang, Zhenwei, Kulmanov, Maxat, and Hoehndorf, Robert
Subjects: *GENE ontology, *SET functions, *CLASSIFICATION, *ANNOTATIONS, *BIOINFORMATICS
Abstract: Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets. Availability and implementation: Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
Published: 2024

9. PyWGCNA: a Python package for weighted gene co-expression network analysis

Author: Rezaie, Narges, Reese, Farilie, and Mortazavi, Ali
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Neurosciences, Human Genome, Biotechnology, Gene Expression Profiling, RNA-Seq, Gene Regulatory Networks, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Weighted gene co-expression network analysis (WGCNA) is frequently used to identify modules of genes that are co-expressed across many RNA-seq samples. However, the current R implementation is slow, is not designed to compare modules between multiple WGCNA networks, and its results can be hard to interpret as well as to visualize. We introduce the PyWGCNA Python package, which is designed to identify co-expression modules from large RNA-seq datasets. PyWGCNA has a faster implementation than the R version of WGCNA and several additional downstream analysis modules for functional enrichment analysis using GO, KEGG, and REACTOME, inter-module analysis of protein-protein interactions, as well as comparison of multiple co-expression modules to each other and/or external lists of genes such as marker genes from single cell. Results: We apply PyWGCNA to two distinct datasets of brain bulk RNA-seq from MODEL-AD to identify modules associated with the genotypes. We compare the resulting modules to each other to find shared co-expression signatures in the form of modules with significant overlap across the datasets. Availability and implementation: The PyWGCNA library for Python 3 is available on PyPi at pypi.org/project/PyWGCNA and on GitHub at github.com/mortazavilab/PyWGCNA. The data underlying this article are available in GitHub at github.com/mortazavilab/PyWGCNA/tutorials/5xFAD_paper.
Published: 2023

10. KG-Hub—building and exchanging biological knowledge graphs

Author: Caufield, J Harry, Putman, Tim, Schaper, Kevin, Unni, Deepak R, Hegde, Harshad, Callahan, Tiffany J, Cappelletti, Luca, Moxon, Sierra AT, Ravanmehr, Vida, Carbon, Seth, Chan, Lauren E, Cortes, Katherina, Shefchek, Kent A, Elsarboukh, Glass, Balhoff, Jim, Fontana, Tommaso, Matentzoglu, Nicolas, Bruskiewich, Richard M, Thessen, Anne E, Harris, Nomi L, Munoz-Torres, Monica C, Haendel, Melissa A, Robinson, Peter N, Joachimiak, Marcin P, Mungall, Christopher J, and Reese, Justin T
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Networking and Information Technology R&D (NITRD), Machine Learning and Artificial Intelligence, Humans, Pattern Recognition, Automated, COVID-19, Biological Ontologies, Rare Diseases, Machine Learning, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. Results: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. Availability and implementation: https://kghub.org.
Published: 2023

11. Accelerating open modification spectral library searching on tensor core in high-dimensional space

Author: Kang, Jaeyoung, Xu, Weihong, Bittremieux, Wout, Moshiri, Niema, and Rosing, Tajana
Subjects: Information and Computing Sciences, Biotechnology, Bioengineering, Tandem Mass Spectrometry, Databases, Protein, Software, Peptides, Search Engine, Algorithms, Peptide Library, Mathematical Sciences, Biological Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Driven by technological advances, the throughput and cost of mass spectrometry (MS) proteomics experiments have improved by orders of magnitude in recent decades. Spectral library searching is a common approach to annotating experimental mass spectra by matching them against large libraries of reference spectra corresponding to known peptides. An important disadvantage, however, is that only peptides included in the spectral library can be found, whereas novel peptides, such as those with unexpected post-translational modifications (PTMs), will remain unknown. Open modification searching (OMS) is an increasingly popular approach to annotate modified peptides based on partial matches against their unmodified counterparts. Unfortunately, this leads to very large search spaces and excessive runtimes, which is especially problematic considering the continuously increasing sizes of MS proteomics datasets. Results: We propose an OMS algorithm, called HOMS-TC, that fully exploits parallelism in the entire pipeline of spectral library searching. We designed a new highly parallel encoding method based on the principle of hyperdimensional computing to encode mass spectral data to hypervectors while minimizing information loss. This process can be easily parallelized since each dimension is calculated independently. HOMS-TC processes two stages of existing cascade search in parallel and selects the most similar spectra while considering PTMs. We accelerate HOMS-TC on NVIDIA's tensor core units, which is emerging and readily available in the recent graphics processing unit (GPU). Our evaluation shows that HOMS-TC is 31× faster on average than alternative search engines and provides comparable accuracy to competing search tools. Availability and implementation: HOMS-TC is freely available under the Apache 2.0 license as an open-source software project at https://github.com/tycheyoung/homs-tc.
Published: 2023

12. A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing

Author: Prodanov, Timofey and Bansal, Vikas
Subjects: Human Genome, Genetics, 2.1 Biological and endogenous factors, Aetiology, Generic health relevance, Good Health and Well Being, Humans, Segmental Duplications, Genomic, DNA Copy Number Variations, Whole Genome Sequencing, Benchmarking, Genome, Human, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover > 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases. Methods: We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy. Results: On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the genome-in-a-bottle high-confidence variant calls for HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954 and recall = 0.822), GATK (precision = 0.888 and recall = 0.873) and DeepVariant (precision = 0.983 and recall = 0.861). ParascopyVC demonstrated a consistently higher accuracy (mean F1 = 0.947) than other callers (best F1 = 0.908) across seven human genomes. Availability and implementation: ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC.
Published: 2023
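
Illustration for entry 12: the abstract quotes precision and recall for HG002 and, separately, mean F1 across seven genomes. F1 is the harmonic mean of precision and recall, so the HG002 figures can be turned into F1 values with the short sketch below. The function name `f1_score` is a hypothetical helper; the input numbers are the HG002 values from the abstract, and the results are derived here rather than reported by the authors, so they need not match the seven-genome means.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# HG002 precision/recall pairs as quoted in the abstract.
hg002 = {
    "ParascopyVC": (0.991, 0.909),
    "FreeBayes": (0.954, 0.822),
    "GATK": (0.888, 0.873),
    "DeepVariant": (0.983, 0.861),
}

for caller, (p, r) in hg002.items():
    print(f"{caller}: F1 = {f1_score(p, r):.3f}")
```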

13. twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis

Author: Wang, Xinran, Lu, Zeyun, Bhattacharya, Arjun, Pasaniuc, Bogdan, and Mancuso, Nicholas
Subjects: Mathematical Sciences, Biological Sciences, Statistics, Genetics, Human Genome, Biotechnology, Prevention, Humans, Transcriptome, Genome-Wide Association Study, Gene Expression Profiling, Computer Simulation, Software, Polymorphism, Single Nucleotide, Genetic Predisposition to Disease, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: Genome-wide association studies (GWASs) have identified numerous genetic variants associated with complex disease risk; however, most of these associations are non-coding, complicating identifying their proximal target gene. Transcriptome-wide association studies (TWASs) have been proposed to mitigate this gap by integrating expression quantitative trait loci (eQTL) data with GWAS data. Numerous methodological advancements have been made for TWAS, yet each approach requires ad hoc simulations to demonstrate feasibility. Here, we present twas_sim, a computationally scalable and easily extendable tool for simplified performance evaluation and power analysis for TWAS methods. Availability and implementation: Software and documentation are available at https://github.com/mancusolab/twas_sim.
Published: 2023

14. ViralConsensus: a fast and memory-efficient tool for calling viral consensus genome sequences directly from read alignment data

Author: Moshiri, Niema
Subjects: Human Genome, Networking and Information Technology R&D (NITRD), Genetics, Infection, Sequence Analysis, DNA, Consensus, High-Throughput Nucleotide Sequencing, Software, Genome, Viral, Algorithms, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: In viral molecular epidemiology, reconstruction of consensus genomes from sequence data is critical for tracking mutations and variants of concern. However, as the number of samples that are sequenced grows rapidly, compute resources needed to reconstruct consensus genomes can become prohibitively large. Results: ViralConsensus is a fast and memory-efficient tool for calling viral consensus genome sequences directly from read alignment data. ViralConsensus is orders of magnitude faster and more memory-efficient than existing methods. Further, unlike existing methods, ViralConsensus can pipe data directly from a read mapper via standard input and performs viral consensus calling on-the-fly, making it an ideal tool for viral sequencing pipelines. Availability and implementation: ViralConsensus is freely available at https://github.com/niemasd/ViralConsensus as an open-source software project.
Published: 2023
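
Illustration for entry 14: the general idea of calling a consensus sequence from read alignments can be sketched as a per-position majority vote. This is a simplified sketch under stated assumptions, not ViralConsensus's algorithm or interface: reads are assumed to be gap-free alignments given as (start, sequence) pairs, and the function `call_consensus` and its `min_depth` and `min_freq` thresholds are hypothetical.

```python
from collections import Counter

def call_consensus(ref_len, alignments, min_depth=2, min_freq=0.5):
    """Majority-vote consensus over gap-free alignments.

    alignments: iterable of (start, sequence), start being 0-based.
    Positions with depth below min_depth, or where the top base does not
    exceed min_freq of the depth, are reported as 'N'.
    """
    counts = [Counter() for _ in range(ref_len)]
    # Single pass over the reads: each read only updates per-position counts,
    # so reads can be consumed one at a time without being stored.
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_len:
                counts[pos][base] += 1
    consensus = []
    for c in counts:
        depth = sum(c.values())
        if depth < min_depth:
            consensus.append("N")
            continue
        base, n = c.most_common(1)[0]
        consensus.append(base if n / depth > min_freq else "N")
    return "".join(consensus)

reads = [(0, "ACGT"), (0, "ACGA"), (2, "GTTC"), (1, "CGT")]
print(call_consensus(6, reads))
```

In this toy input the last two positions are covered by a single read each, so they fall below the depth threshold and are masked.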

15. kb_DRAM: annotation and metabolic profiling of genomes with DRAM in KBase

Author: Shaffer, Michael, Borton, Mikayla A, Bolduc, Ben, Faria, José P, Flynn, Rory M, Ghadermazi, Parsa, Edirisinghe, Janaka N, Wood-Charlson, Elisha M, Miller, Christopher S, Chan, Siu Hung Joshua, Sullivan, Matthew B, Henry, Christopher S, and Wrighton, Kelly C
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Human Genome, Networking and Information Technology R&D (NITRD), Genetics, Molecular Sequence Annotation, Software, Bacteria, Archaea, Metabolomics, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Microbial genome annotation is the process of identifying structural and functional elements in DNA sequences and subsequently attaching biological information to those elements. DRAM is a tool developed to annotate bacterial, archaeal, and viral genomes derived from pure cultures or metagenomes. DRAM goes beyond traditional annotation tools by distilling multiple gene annotations to genome level summaries of functional potential. Despite these benefits, a downside of DRAM is the requirement of large computational resources, which limits its accessibility. Further, it did not integrate with downstream metabolic modeling tools that require genome annotation. To alleviate these constraints, DRAM and the viral counterpart, DRAM-v, are now available and integrated with the freely accessible KBase cyberinfrastructure. With kb_DRAM users can generate DRAM annotations and functional summaries from microbial or viral genomes in a point-and-click interface, as well as generate genome-scale metabolic models from DRAM annotations. Availability and implementation: For kb_DRAM users, the kb_DRAM apps on KBase can be found in the catalog at https://narrative.kbase.us/#catalog/modules/kb_DRAM. For kb_DRAM users, a tutorial workflow with all documentation is available at https://narrative.kbase.us/narrative/129480. For kb_DRAM developers, software is available at https://github.com/shafferm/kb_DRAM.
Published: 2023

16. Multivariate genome-wide association analysis by iterative hard thresholding

Author: Chu, Benjamin B, Ko, Seyoon, Zhou, Jin J, Jensen, Aubrey, Zhou, Hua, Sinsheimer, Janet S, and Lange, Kenneth
Subjects: Mathematical Sciences, Biological Sciences, Statistics, Genetics, Human Genome, Genome-Wide Association Study, Software, Algorithms, Computer Simulation, Phenotype, Polymorphism, Single Nucleotide, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: In a genome-wide association study, analyzing multiple correlated traits simultaneously is potentially superior to analyzing the traits one by one. Standard methods for multivariate genome-wide association study operate marker-by-marker and are computationally intensive. Results: We present a sparsity constrained regression algorithm for multivariate genome-wide association study based on iterative hard thresholding and implement it in a convenient Julia package MendelIHT.jl. In simulation studies with up to 100 quantitative traits, iterative hard thresholding exhibits similar true positive rates, smaller false positive rates, and faster execution times than GEMMA's linear mixed models and mv-PLINK's canonical correlation analysis. On UK Biobank data with 470 228 variants, MendelIHT completed a three-trait joint analysis (n=185 656) in 20 h and an 18-trait joint analysis (n=104 264) in 53 h with an 80 GB memory footprint. In short, MendelIHT enables geneticists to fit a single regression model that simultaneously considers the effect of all SNPs and dozens of traits. Availability and implementation: Software, documentation, and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelIHT.jl.
Published: 2023
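
Illustrative sketch (not from the record): the abstract above names iterative hard thresholding without describing it. The Python below is a minimal single-trait, least-squares version on simulated data, meant only to show the core idea of a gradient step followed by keeping the k largest coefficients. It is not MendelIHT.jl's API or its multivariate algorithm; the function names, the step-size rule, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero the rest."""
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argsort(np.abs(beta))[-k:]
        out[keep] = beta[keep]
    return out

def iht(X, y, k, n_iter=200):
    """Least-squares IHT: gradient step, then projection onto k-sparse vectors."""
    p = X.shape[1]
    # Conservative fixed step size: 1 / (largest singular value of X)^2.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = hard_threshold(beta - step * grad, k)
    return beta

# Hypothetical toy data: 200 samples, 50 predictors, 3 truly non-zero effects.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
true_beta = np.zeros(50)
true_beta[[3, 17, 42]] = [2.0, -3.0, 1.5]
y = X @ true_beta  # noiseless response, for illustration only
est = iht(X, y, k=3)
```

The sparsity level k is fixed by hand here; in practice it would have to be chosen by some model-selection procedure such as cross-validation.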

17. NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange

Author: Pillich, Rudolf T, Chen, Jing, Churas, Christopher, Fong, Dylan, Gyori, Benjamin M, Ideker, Trey, Karis, Klas, Liu, Sophie N, Ono, Keiichiro, Pico, Alexander, and Pratt, Dexter
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Biotechnology, Networking and Information Technology R&D (NITRD), Underpinning research, 1.5 Resources and infrastructure (underpinning), Generic health relevance, Computational Biology, Software, Protein Interaction Maps, Publications, Databases, Factual, Internet, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: The investigation of sets of genes using biological pathways is a common task for researchers and is supported by a wide variety of software tools. This type of analysis generates hypotheses about the biological processes that are active or modulated in a specific experimental context. Results: The Network Data Exchange Integrated Query (NDEx IQuery) is a new tool for network and pathway-based gene set interpretation that complements or extends existing resources. It combines novel sources of pathways, integration with Cytoscape, and the ability to store and share analysis results. The NDEx IQuery web application performs multiple gene set analyses based on diverse pathways and networks stored in NDEx. These include curated pathways from WikiPathways and SIGNOR, published pathway figures from the last 27 years, machine-assembled networks using the INDRA system, and the new NCI-PID v2.0, an updated version of the popular NCI Pathway Interaction Database. NDEx IQuery's integration with MSigDB and cBioPortal now provides pathway analysis in the context of these two resources. Availability and implementation: NDEx IQuery is available at https://www.ndexbio.org/iquery and is implemented in JavaScript and Java.
Published: 2023

18. Haptools: a toolkit for admixture and haplotype analysis

Author: Massarat, Arya R, Lamkin, Michael, Reeve, Ciara, Williams, Amy L, D’Antonio, Matteo, and Gymrek, Melissa
Subjects: Mathematical Sciences, Biological Sciences, Statistics, Genetics, Human Genome, Haplotypes, Software, Genome-Wide Association Study, Genomics, Genome, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: Leveraging local ancestry and haplotype information in genome-wide association studies and downstream analyses can improve the utility of genomics for individuals from diverse and recently admixed ancestries. However, most existing simulation, visualization and variant analysis frameworks are based on variant-level analysis and do not automatically handle these features. We present haptools, an open-source toolkit for performing local ancestry aware and haplotype-based analysis of complex traits. Haptools supports fast simulation of admixed genomes, visualization of admixture tracks, simulation of haplotype- and local ancestry-specific phenotype effects and a variety of file operations and statistics computed in a haplotype-aware manner. Availability and implementation: Haptools is freely available at https://github.com/cast-genomics/haptools. Documentation: Detailed documentation is available at https://haptools.readthedocs.io. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2023

19. Optimal gap-affine alignment in O(s) space

Author: Marco-Sola, Santiago, Eizenga, Jordan M, Guarracino, Andrea, Paten, Benedict, Garrison, Erik, and Moreto, Miquel
Subjects: Information and Computing Sciences, Biological Sciences, Bioinformatics and Computational Biology, Mathematical Sciences, Biotechnology, Genetics, Human Genome, Networking and Information Technology R&D (NITRD), Algorithms, Genomics, Computational Biology, Genome, Sequence Analysis, DNA, Software, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA's O(s^2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. Results: In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA's time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. Availability and implementation: All code is publicly available at https://github.com/smarco/BiWFA-paper. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2023
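
Illustrative sketch (not from the record): to make "gap-affine" and the cost of classical dynamic programming concrete, the Python below computes the optimal gap-affine penalty score with the textbook quadratic-time recurrence, keeping only two rows in memory. It is the classical baseline the abstract contrasts against, not WFA or the bidirectional WFA. The function name and the penalty values are assumptions chosen for illustration, and a gap of length L is assumed to cost gap_open + L * gap_extend.

```python
def gap_affine_score(a, b, mismatch=4, gap_open=6, gap_extend=2):
    """Minimum gap-affine penalty for globally aligning strings a and b.

    Runs in O(len(a) * len(b)) time and O(len(b)) memory; returns the
    score only, not the alignment itself.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # best[j]: optimal score for a[:i] vs b[:j]
    # dele[j]: optimal score for a[:i] vs b[:j] ending with a[i-1] aligned to a gap
    prev_best = [0] + [gap_open + gap_extend * j for j in range(1, m + 1)]
    prev_dele = [INF] * (m + 1)
    for i in range(1, n + 1):
        cur_best = [0] * (m + 1)
        cur_dele = [INF] * (m + 1)
        cur_dele[0] = gap_open + gap_extend * i
        cur_best[0] = cur_dele[0]
        ins = INF  # optimal score ending with b[j-1] aligned to a gap
        for j in range(1, m + 1):
            cur_dele[j] = min(prev_best[j] + gap_open + gap_extend,
                              prev_dele[j] + gap_extend)
            ins = min(cur_best[j - 1] + gap_open + gap_extend,
                      ins + gap_extend)
            sub = prev_best[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cur_best[j] = min(sub, cur_dele[j], ins)
        prev_best, prev_dele = cur_best, cur_dele
    return prev_best[m]
```

Every cell of the n-by-m table is still visited, which is the quadratic time cost the abstract refers to; recovering the alignment itself, rather than only the score, needs either the full table or a divide-and-conquer scheme.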

20. The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information

Author: Morris, John H, Soman, Karthik, Akbas, Rabia E, Zhou, Xiaoyuan, Smith, Brett, Meng, Elaine C, Huang, Conrad C, Cerono, Gabriel, Schenk, Gundolf, Rizk-Jackson, Angela, Harroud, Adil, Sanders, Lauren, Costes, Sylvain V, Bharat, Krish, Chakraborty, Arjun, Pico, Alexander R, Mardirossian, Taline, Keiser, Michael, Tang, Alice, Hardi, Josef, Shi, Yongmei, Musen, Mark, Israni, Sharat, Huang, Sui, Rose, Peter W, Nelson, Charlotte A, and Baranzini, Sergio E
Subjects: Good Health and Well Being, Precision Medicine, Pattern Recognition, Automated, Databases, Factual, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KGs present a challenge due to the complexity, size and heterogeneity of the underlying information. Results: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by Python scripts which download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts. Availability and implementation: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2023

21. PrioriTree: a utility for improving phylodynamic analyses in BEAST

Author: Gao, Jiansi, May, Michael R, Rannala, Bruce, and Moore, Brian R
Subjects: Mathematical Sciences, Biological Sciences, Statistics, Good Health and Well Being, Software, Bayes Theorem, Disease Outbreaks, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: Phylodynamic methods are central to studies of the geographic and demographic history of disease outbreaks. Inference under discrete-geographic phylodynamic models, which involve many parameters that must be inferred from minimal information, is inherently sensitive to our prior beliefs about the model parameters. We present an interactive utility, PrioriTree, to help researchers identify and accommodate prior sensitivity in discrete-geographic inferences. Specifically, PrioriTree provides a suite of functions to generate input files for, and summarize output from, BEAST analyses for performing robust Bayesian inference, data-cloning analyses and assessing the relative and absolute fit of candidate discrete-geographic (prior) models to empirical datasets. Availability and implementation: PrioriTree is distributed as an R package available at https://github.com/jsigao/prioritree, with a comprehensive user manual provided at https://bookdown.org/jsigao/prioritree_manual/.
Published: 2023

22. MIDAS2: Metagenomic Intra-species Diversity Analysis System

Author: Zhao, Chunyu, Dimitrov, Boris, Goldman, Miriam, Nayfach, Stephen, and Pollard, Katherine S
Subjects: Human Genome, Genetics, Metagenome, Software, Metagenomics, Genotype, Databases, Factual, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Summary: The Metagenomic Intra-Species Diversity Analysis System (MIDAS) is a scalable metagenomic pipeline that identifies single nucleotide variants (SNVs) and gene copy number variants in microbial populations. Here, we present MIDAS2, which addresses the computational challenges presented by increasingly large reference genome databases, while adding functionality for building custom databases and leveraging paired-end reads to improve SNV accuracy. This fast and scalable reengineering of the MIDAS pipeline enables thousands of metagenomic samples to be efficiently genotyped. Availability and implementation: The source code is available at https://github.com/czbiohub/MIDAS2. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2023

23. Poisson hurdle model-based method for clustering microbiome features

Author: Qiao, Zhili, Barnes, Elle, Tringe, Susannah, Schachtman, Daniel P, and Liu, Peng
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Information and Computing Sciences, Statistics, Mathematical Sciences, Bioengineering, Microbiome, Algorithms, Computer Simulation, Microbiota, Cluster Analysis, High-Throughput Nucleotide Sequencing, Software, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: High-throughput sequencing technologies have greatly facilitated microbiome research and have generated a large volume of microbiome data with the potential to answer key questions regarding microbiome assembly, structure and function. Cluster analysis aims to group features that behave similarly across treatments, and such grouping helps to highlight the functional relationships among features and may provide biological insights into microbiome networks. However, clustering microbiome data is challenging due to the sparsity and high dimensionality. Results: We propose a model-based clustering method based on Poisson hurdle models for sparse microbiome count data. We describe an expectation-maximization algorithm and a modified version using simulated annealing to conduct the cluster analysis. Moreover, we provide algorithms for initialization and choosing the number of clusters. Simulation results demonstrate that our proposed methods provide better clustering results than alternative methods under a variety of settings. We also apply the proposed method to a sorghum rhizosphere microbiome dataset that results in interesting biological findings. Availability and implementation: R package is freely available for download at https://cran.r-project.org/package=PHclust. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2023
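
Illustrative sketch (not from the record): the abstract above relies on Poisson hurdle models for sparse counts. The Python below writes out the probability mass function under one common textbook parameterization, in which a zero occurs with probability pi_zero and positive counts follow a zero-truncated Poisson with rate lam. The parameter names and this parameterization are assumptions; the record does not state the exact form used in the paper, and this is not the PHclust API.

```python
import math

def poisson_hurdle_pmf(y, pi_zero, lam):
    """P(Y = y) under a Poisson hurdle model.

    pi_zero: probability of observing a zero (the hurdle is not crossed).
    lam:     rate of the zero-truncated Poisson that generates positive counts.
    """
    if y < 0:
        return 0.0
    if y == 0:
        return pi_zero
    # Zero-truncated Poisson: ordinary Poisson mass renormalized over y >= 1.
    truncated = (math.exp(-lam) * lam ** y / math.factorial(y)
                 / (1.0 - math.exp(-lam)))
    return (1.0 - pi_zero) * truncated

# Sanity check on hypothetical parameters: the masses should sum to about 1.
total = sum(poisson_hurdle_pmf(y, 0.7, 2.5) for y in range(60))
```

Because the zero probability is a free parameter rather than being tied to the Poisson rate, the model can accommodate far more zeros than a plain Poisson, which is what makes it a natural fit for sparse count tables.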

24. BTR: a bioinformatics tool recommendation system.

Author: Green, Ryan, Qu, Xufeng, Liu, Jinze, and Yu, Tingting
Subjects: *RECOMMENDER systems, *GRAPH neural networks, *NATURAL language processing, *DEEP learning, *BIOINFORMATICS, *SOURCE code
Abstract: Motivation: The rapid expansion of Bioinformatics research has led to a proliferation of computational tools for scientific analysis pipelines. However, constructing these pipelines is a demanding task, requiring extensive domain knowledge and careful consideration. As the Bioinformatics landscape evolves, researchers, both novice and expert, may feel overwhelmed in unfamiliar fields, potentially leading to the selection of unsuitable tools during workflow development. Results: In this article, we introduce the Bioinformatics Tool Recommendation system (BTR), a deep learning model designed to recommend suitable tools for a given workflow-in-progress. BTR leverages recent advances in graph neural network technology, representing the workflow as a graph to capture essential context. Natural language processing techniques enhance tool recommendations by analyzing associated tool descriptions. Experiments demonstrate that BTR outperforms the existing Galaxy tool recommendation system, showcasing its potential to streamline scientific workflow construction. Availability and implementation: The Python source code is available at https://github.com/ryangreenj/bioinformatics_tool_recommendation.
Published: 2024

25. GBZ file format for pangenome graphs

Author: Sirén, Jouni and Paten, Benedict
Subjects: Information and Computing Sciences, Biological Sciences, Bioinformatics and Computational Biology, High-Throughput Nucleotide Sequencing, Software, Data Compression, Libraries, Mathematical Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space efficiently. Results: We propose the GBZ file format based on data structures used in the Giraffe short-read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems. Availability and implementation: C++ and Rust implementations are available at https://github.com/jltsiren/gbwtgraph and https://github.com/jltsiren/gbwt-rs, respectively. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

26. FastMix: a versatile data integration pipeline for cell type-specific biomarker inference.

Author: Zhang, Yun, Sun, Hao, Mandava, Aishwarya, Aevermann, Brian D, Kollmann, Tobias R, Scheuermann, Richard H, Qiu, Xing, and Qian, Yu
Subjects: Human Genome, Biotechnology, Networking and Information Technology R&D (NITRD), Genetics, Biomarkers, Cross-Sectional Studies, Data Analysis, Software, Transcriptome, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Flow cytometry (FCM) and transcription profiling are two widely used assays in translational immunology research. However, there is no data integration pipeline for analyzing these two types of assays together with experiment variables for biomarker inference. Current FCM data analysis mainly relies on subjective manual gating analysis, which is difficult to integrate directly with other automated computational methods. Existing deconvolutional analysis of bulk transcriptomics relies on predefined marker genes in the transcriptomics data, which are unavailable for novel cell types, and does not utilize the FCM data that provide canonical phenotypic definitions of the cell types. Results: We developed a novel analytics pipeline, FastMix, for computational immunology, which integrates flow cytometry, bulk transcriptomics and clinical covariates for identifying cell type-specific gene expression signatures and biomarker genes. FastMix addresses the 'large p, small n' problem in the gene expression and flow cytometry integration analysis via a linear mixed effects model (LMER) for both cross-sectional and longitudinal studies. Its novel moment-based estimator not only reduces bias in parameter estimation but also is more efficient than iterative optimization. The FastMix pipeline also includes a cutting-edge flow cytometry data analysis method, DAFi, for identifying cell populations of interest and their characteristics. Simulation studies showed that FastMix produced smaller type I/II errors than competing methods. Validation using real data of two vaccine studies showed that FastMix identified a consistent set of signature genes as in independent single-cell RNA-seq analysis, producing additional interesting findings. Availability and implementation: Source code of FastMix is publicly available at https://github.com/terrysun0302/FastMix. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

27. Random field modeling of multi-trait multi-locus association for detecting methylation quantitative trait loci.

Author: Lyu, Chen, Huang, Manyan, Liu, Nianjun, Chen, Zhongxue, Lupo, Philip J, Tycko, Benjamin, Witte, John S, Hobbs, Charlotte A, and Li, Ming
Subjects: Mathematical Sciences, Biological Sciences, Statistics, Genetics, Cardiovascular, Biotechnology, Heart Disease, Human Genome, Quantitative Trait Loci, Genome-Wide Association Study, Methylation, Phenotype, Genomics, DNA Methylation, Polymorphism, Single Nucleotide, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: CpG sites within the same genomic region often share similar methylation patterns and tend to be co-regulated by multiple genetic variants that may interact with one another. Results: We propose a multi-trait methylation random field (multi-MRF) method to evaluate the joint association between a set of CpG sites and a set of genetic variants. The proposed method has several advantages. First, it is a multi-trait method that allows flexible correlation structures between neighboring CpG sites (e.g. distance-based correlation). Second, it is also a multi-locus method that integrates the effect of multiple common and rare genetic variants. Third, it models the methylation traits with a beta distribution to characterize their bimodal and interval properties. Through simulations, we demonstrated that the proposed method had improved power over some existing methods under various disease scenarios. We further illustrated the proposed method via an application to a study of congenital heart defects (CHDs) with 83 cardiac tissue samples. Our results suggested that gene BACE2, a methylation quantitative trait locus (QTL) candidate, colocalized with expression QTLs in artery tibial and harbored genetic variants with nominal significant associations in two genome-wide association studies of CHD. Availability and implementation: https://github.com/chenlyu2656/Multi-MRF. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

28. TIMSCONVERT: a workflow to convert trapped ion mobility data to open data formats.

Author: Luu, Gordon T, Freitas, Michael A, Lizama-Chamu, Itzel, McCaughey, Catherine S, Sanchez, Laura M, and Wang, Mingxun
Subjects: Bioengineering, Networking and Information Technology R&D (NITRD), Workflow, Ecosystem, Software, Mass Spectrometry, Data Analysis, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Advances in mass spectrometry have led to the development of mass spectrometers with ion mobility spectrometry capabilities and dual-source instrumentation; however, the current software ecosystem lacks interoperability with downstream data analysis using open-source software and pipelines. Results: Here, we present TIMSCONVERT, a high-throughput data conversion workflow from timsTOF Pro/fleX mass spectrometer raw data files to mzML and imzML formats that incorporates ion mobility data while maintaining compatibility with data analysis tools. We showcase several examples using data acquired across different experiments and acquisition modalities on the timsTOF fleX MS. Availability and implementation: TIMSCONVERT and its documentation can be found at https://github.com/gtluu/timsconvert. It is available as a standalone command-line interface tool for Windows and Linux, as a NextFlow workflow, and online in the Global Natural Products Social (GNPS) platform. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

29. matOptimize: a parallel tree optimization method enables online phylogenetics for SARS-CoV-2

Author: Ye, Cheng, Thornlow, Bryan, Hinrichs, Angie, Kramer, Alexander, Mirchandani, Cade, Torvi, Devika, Lanfear, Robert, Corbett-Detig, Russell, and Turakhia, Yatish
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Emerging Infectious Diseases, Coronaviruses, Infectious Diseases, Good Health and Well Being, Humans, Phylogeny, SARS-CoV-2, Pandemics, COVID-19, Software, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the coronavirus disease 2019 (COVID-19) pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but there are no previously existing approaches that can efficiently optimize this vast phylogeny under the time constraints of the pandemic. Results: Here, we present matOptimize, a fast and memory-efficient phylogenetic tree optimization tool based on parsimony that can be parallelized across multiple CPU threads and nodes, and provides orders of magnitude improvement in runtime and peak memory usage compared to existing state-of-the-art methods. We have developed this method particularly to address the pressing need during the COVID-19 pandemic for daily maintenance and optimization of a comprehensive SARS-CoV-2 phylogeny. matOptimize is currently helping refine on a daily basis possibly the largest-ever phylogenetic tree, containing millions of SARS-CoV-2 sequences. Availability and implementation: The matOptimize code is freely available as part of the UShER package (https://github.com/yatisht/usher) and can also be installed via bioconda (https://bioconda.github.io/recipes/usher/README.html). All scripts we used to perform the experiments in this manuscript are available at https://github.com/yceh/matOptimize-experiments. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

30. CALDERA: finding all significant de Bruijn subgraphs for bacterial GWAS

Author: Roux de Bézieux, Hector, Lima, Leandro, Perraudeau, Fanny, Mary, Arnaud, Dudoit, Sandrine, and Jacob, Laurent
Subjects: Biotechnology, Genetics, Human Genome, Algorithms, Bacteria, Genome-Wide Association Study, Sequence Analysis, DNA, Software, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Genome-wide association studies (GWAS), aiming to find genetic variants associated with a trait, have widely been used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single-nucleotide polymorphisms to mobile genetic elements. This approach does not require a reference genome, making it easier to account for accessory genes. However, the same gene can exist in slightly different versions across different strains, leading to diluted effects. Results: Here, we overcome this issue by testing covariates built from closed connected subgraphs (CCSs) of the de Bruijn graph defined over genomic k-mers. These covariates capture polymorphic genes as a single entity, improving k-mer-based GWAS both in terms of power and interpretability. However, a method naively testing all possible subgraphs would be powerless due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypothesis has successfully been used to address both problems in similar contexts. We leverage this concept to test all CCSs by proposing a novel enumeration scheme for these objects which fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. Our method integrates with existing visual tools to facilitate interpretation. Availability and implementation: We provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_ISMB. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022
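
Illustrative sketch (not from the record): the abstract above builds its covariates on a de Bruijn graph over genomic k-mers. The Python below constructs a small node-centric version (each node is a k-mer, with an edge wherever one k-mer directly follows another in an input sequence) and shows how a single-nucleotide difference between two toy "strains" produces a branch. This is not CALDERA's implementation and does not enumerate closed connected subgraphs; the function names, the sequences, and the choice k=3 are all hypothetical.

```python
from collections import defaultdict

def kmers(seq, k):
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_graph(sequences, k):
    """Map each k-mer to the set of k-mers that directly follow it in any sequence."""
    edges = defaultdict(set)
    for seq in sequences:
        ks = kmers(seq, k)
        for u, v in zip(ks, ks[1:]):
            edges[u].add(v)
        for km in ks:
            edges.setdefault(km, set())  # keep nodes that have no successor
    return dict(edges)

# Two hypothetical strains that differ at a single position (G vs C).
strain_a = "TTACGTAGG"
strain_b = "TTACCTAGG"
graph = de_bruijn_graph([strain_a, strain_b], k=3)
branch = graph["TAC"]  # the node where the two strains diverge
```

The two variant paths leave the shared node TAC and rejoin at TAG, so a k-mer-level test would see each variant k-mer in only one strain, while a subgraph-level covariate can treat the whole branch as one unit.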

31. FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets.

Author: Zhang, Pinglu, Liu, Huan, Wei, Yanming, Zhai, Yixiao, Tian, Qinzhong, and Zou, Quan
Subjects: *SEQUENCE alignment, *NUCLEOTIDE sequence, *SOURCE code, *RESEARCH personnel, *BIOINFORMATICS
Abstract: Motivation: In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly. Results: FMAlign2 leverages the suffix array to identify maximal exact matches, redefining the approach of FMAlign from searching for global chains to partial chains. By using a vertical division strategy, a large-scale problem is deconstructed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result seamlessly. Compared to FMAlign, FMAlign2 markedly augments the segmentation of sequences and significantly reduces the time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by conferring the capability to handle sequences reaching billions in length within an acceptable time frame. Availability and implementation: Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.
Published: 2024

32. scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data.

Author: Song, Dongyuan, Xi, Nan Miles, Li, Jingyi Jessica, and Wang, Lin
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Software, Transcriptome, Algorithms, Data Analysis, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: The number of cells measured in single-cell transcriptomic data has grown fast in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, the easiest random subsampling is not ideal from the perspective of preserving rare cell types. Therefore, diversity-preserving subsampling is required for fast exploration of cell types in a large-scale dataset. Here, we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. Availability and implementation: scSampler is implemented in Python and is published under the MIT source license. It can be installed by "pip install scsampler" and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler. An R interface is available at: https://github.com/SONGDONGYUAN1994/rscsampler. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

33. Phitest for analyzing the homogeneity of single-cell populations.

Author: Li, Wei Vivian
Subjects: Genetics, Human Genome, Generic health relevance, Animals, Software, Cluster Analysis, Single-Cell Analysis, Algorithms, Transcriptome, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Single-cell RNA sequencing technologies facilitate the characterization of transcriptomic landscapes in diverse species, tissues and cell types with unprecedented molecular resolution. In order to better understand animal development, physiology, and pathology, unsupervised clustering analysis is often used to identify relevant cell populations. Although considerable progress has been made in terms of clustering algorithms in recent years, it remains challenging to evaluate the quality of the inferred single-cell clusters, which can greatly impact downstream analysis and interpretation. Results: We propose a bioinformatics tool named Phitest to analyze the homogeneity of single-cell populations. Phitest is able to distinguish between homogeneous and heterogeneous cell populations, providing an objective and automatic method to optimize the performance of single-cell clustering analysis. Availability and implementation: The PhitestR package is freely available on both GitHub (https://github.com/Vivianstats/PhitestR) and the Comprehensive R Archive Network (CRAN). There is no new genomic data associated with this article. Published data used in the analysis are described in detail in the Supplementary Data. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

34. CellWalkR: an R package for integrating and visualizing single-cell and bulk data to resolve regulatory elements

Author: Przytycki, Pawel F and Pollard, Katherine S
Subjects: Mathematical Sciences, Biological Sciences, Statistics, Genetics, Software, Regulatory Sequences, Nucleic Acid, Data Interpretation, Statistical, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: CellWalkR is an R package that integrates single-cell open chromatin data with cell type labels and bulk epigenetic data to identify cell type-specific regulatory regions. A Graphics Processing Unit (GPU) implementation and downsampling strategies enable thousands of cells to be processed in seconds. CellWalkR's user-friendly interface provides interactive analysis and visualization of cell labels and regulatory region mappings. Availability and implementation: CellWalkR is freely available as an R package under a GNU GPL-2.0 License and can be accessed from https://github.com/PFPrzytycki/CellWalkR with an accompanying vignette. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

35. From viral evolution to spatial contagion: a biologically modulated Hawkes model.

Author: Holbrook, Andrew J, Ji, Xiang, and Suchard, Marc A
Subjects: Biodefense, Prevention, Bioengineering, Infectious Diseases, Emerging Infectious Diseases, Vaccine Related, Genetics, 2.5 Research design and methodologies (aetiology), Aetiology, Infection, Good Health and Well Being, Humans, Bayes Theorem, Phylogeny, Hemorrhagic Fever, Ebola, Disease Outbreaks, Genome, Viral, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Summary: Mutations sometimes increase contagiousness for evolving pathogens. During an epidemic, scientists use viral genome data to infer a shared evolutionary history and connect this history to geographic spread. We propose a model that directly relates a pathogen's evolution to its spatial contagion dynamics, effectively combining the two epidemiological paradigms of phylogenetic inference and self-exciting process modeling, and apply this phylogenetic Hawkes process to a Bayesian analysis of 23 421 viral cases from the 2014 to 2016 Ebola outbreak in West Africa. The proposed model is able to detect individual viruses with significantly elevated rates of spatiotemporal propagation for a subset of 1610 samples that provide genome data. Finally, to facilitate model application in big data settings, we develop massively parallel implementations for the gradient and Hessian of the log-likelihood and apply our high-performance computing framework within an adaptively pre-conditioned Hamiltonian Monte Carlo routine. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

36. SCONCE: a method for profiling copy number alterations in cancer evolution using single-cell whole genome sequencing

Author: Hui, Sandra and Nielsen, Rasmus
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Human Genome, Cancer, Biotechnology, Humans, DNA Copy Number Variations, Whole Genome Sequencing, Neoplasms, Algorithms, Genome, High-Throughput Nucleotide Sequencing, Software, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Copy number alterations (CNAs) are a significant driver in cancer growth and development, but remain poorly characterized on the single cell level. Although genome evolution in cancer cells is Markovian through evolutionary time, CNAs are not Markovian along the genome. However, existing methods call copy number profiles with Hidden Markov Models or change point detection algorithms based on changes in observed read depth, corrected by genome content and do not account for the stochastic evolutionary process. Results: We present a theoretical framework to use tumor evolutionary history to accurately call CNAs in a principled manner. To model the tumor evolutionary process and account for technical noise from low coverage single-cell whole genome sequencing data, we developed SCONCE, a method based on a Hidden Markov Model to analyze read depth data from tumor cells using matched normal cells as negative controls. Using a combination of public data sets and simulations, we show SCONCE accurately decodes copy number profiles, and provides a useful tool for understanding tumor evolution. Availability and implementation: SCONCE is implemented in C++11 and is freely available from https://github.com/NielsenBerkeleyLab/sconce. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

37. MIMOSA2: a metabolic network-based tool for inferring mechanism-supported relationships in microbiome-metabolome data.

Author: Noecker, Cecilia, Eng, Alexander, Muller, Efrat, and Borenstein, Elhanan
Subjects: Genetics, Nutrition, Bioengineering, Biotechnology, Human Genome, Animals, Humans, Metabolome, Metabolomics, Microbiota, Software, Metabolic Networks and Pathways, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Recent technological developments have facilitated an expansion of microbiome-metabolome studies, in which samples are assayed using both genomic and metabolomic technologies to characterize the abundances of microbial taxa and metabolites. A common goal of these studies is to identify microbial species or genes that contribute to differences in metabolite levels across samples. Previous work indicated that integrating these datasets with reference knowledge on microbial metabolic capacities may enable more precise and confident inference of microbe-metabolite links. Results: We present MIMOSA2, an R package and web application for model-based integrative analysis of microbiome-metabolome datasets. MIMOSA2 uses genomic and metabolic reference databases to construct a community metabolic model based on microbiome data and uses this model to predict differences in metabolite levels across samples. These predictions are compared with metabolomics data to identify putative microbiome-governed metabolites and taxonomic contributors to metabolite variation. MIMOSA2 supports various input data types and customization with user-defined metabolic pathways. We establish MIMOSA2's ability to identify ground truth microbial mechanisms in simulation datasets, compare its results with experimentally inferred mechanisms in honeybee microbiota, and demonstrate its application in two human studies of inflammatory bowel disease. Overall, MIMOSA2 combines reference databases, a validated statistical framework, and a user-friendly interface to facilitate modeling and evaluating relationships between members of the microbiota and their metabolic products. Availability and implementation: MIMOSA2 is implemented in R under the GNU General Public License v3.0 and is freely available as a web server at http://elbo-spice.cs.tau.ac.il/shiny/MIMOSA2shiny/ and as an R package from http://www.borensteinlab.com/software_MIMOSA2.html. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

38. EcoPLOT: dynamic analysis of biogeochemical data

Author: Sanchez, Christopher D, Brown, J Benjamin, Gal-Oz, Omree, and Singer, Esther
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Life Below Water, Software, Algorithms, Language, Data Interpretation, Statistical, Microbiota, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: We have created EcoPLOT (parameterized linkage of omics-driven technologies), a web-app for the dynamic, interactive analysis of biogeochemical datasets that combines state-of-the-art analysis tools to statistically and graphically explore environmental, geochemical and microbiome datasets. Using the iterative random forest, a machine learning algorithm, EcoPLOT allows for the de novo discovery of drivers which exhibit significant impact on plant, microbial or soil dynamics. Availability and implementation: EcoPLOT is built entirely within the R language. It can be accessed through any system where R is installed, including Windows, Mac and most Linux systems. EcoPLOT is free to use and can be accessed at https://github.com/cdsanchez18/EcoPLOT.
Published: 2022

39. AncestralClust: clustering of divergent nucleotide sequences by ancestral sequence reconstruction using phylogenetic trees

Author: Pipes, Lenore and Nielsen, Rasmus
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Software, Phylogeny, Base Sequence, Algorithms, Cluster Analysis, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Clustering is a fundamental task in the analysis of nucleotide sequences. Despite the exponential increase in the size of sequence databases of homologous genes, few methods exist to cluster divergent sequences. Traditional clustering methods have mostly focused on optimizing high speed clustering of highly similar sequences. We develop a phylogenetic clustering method which infers ancestral sequences for a set of initial clusters and then uses a greedy algorithm to cluster sequences. Results: We describe a clustering program AncestralClust, which is developed for clustering divergent sequences. We compare this method with other state-of-the-art clustering methods using datasets of homologous sequences from different species. We show that, in divergent datasets, AncestralClust has higher accuracy and more even cluster sizes than current popular methods. Availability and implementation: AncestralClust is an Open Source program available at https://github.com/lpipes/ancestralclust. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2022

40. A fast data-driven method for genotype imputation, phasing and local ancestry inference: MendelImpute.jl

Author: Chu, Benjamin B, Sobel, Eric M, Wasiolek, Rory, Ko, Seyoon, Sinsheimer, Janet S, Zhou, Hua, and Lange, Kenneth
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Mathematical Sciences, Statistics, Genotype, Software, Haplotypes, Polymorphism, Single Nucleotide, Data Compression, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Current methods for genotype imputation and phasing exploit the volume of data in haplotype reference panels and rely on hidden Markov models (HMMs). Existing programs all have essentially the same imputation accuracy, are computationally intensive and generally require prephasing the typed markers. Results: We introduce a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for HMM calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage and an order of magnitude or better run-times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs (single nucleotide polymorphisms). Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing. Availability and implementation: Software, documentation and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelImpute.jl. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

41. UCSC Cell Browser: visualize your single-cell data

Author: Speir, Matthew L, Bhaduri, Aparna, Markov, Nikolay S, Moreno, Pablo, Nowakowski, Tomasz J, Papatheodorou, Irene, Pollen, Alex A, Raney, Brian J, Seninge, Lucas, Kent, W James, and Haeussler, Maximilian
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Networking and Information Technology R&D (NITRD), Genetics, Biotechnology, Bioengineering, Generic health relevance, Genomics, Software, Databases, Genetic, Metadata, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: As the use of single-cell technologies has grown, so has the need for tools to explore these large, complicated datasets. The UCSC Cell Browser is a tool that allows scientists to visualize gene expression and metadata annotation distribution throughout a single-cell dataset or multiple datasets. Availability and implementation: We provide the UCSC Cell Browser as a free website where scientists can explore a growing collection of single-cell datasets and a freely available Python package for scientists to create stable, self-contained visualizations for their own single-cell datasets. Learn more at https://cells.ucsc.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

42. efam: an expanded, metaproteome-supported HMM profile database of viral protein families

Author: Zayed, Ahmed A, Lücking, Dominik, Mohssen, Mohamed, Cronin, Dylan, Bolduc, Ben, Gregory, Ann C, Hargreaves, Katherine R, Piehowski, Paul D, White III, Richard A, Huang, Eric L, Adkins, Joshua N, Roux, Simon, Moraru, Cristina, and Sullivan, Matthew B
Subjects: Genetics, Infection, Animals, Viral Proteins, Software, Metagenomics, Viruses, Microbiota, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Motivation: Viruses infect, reprogram and kill microbes, leading to profound ecosystem consequences, from elemental cycling in oceans and soils to microbiome-modulated diseases in plants and animals. Although metagenomic datasets are increasingly available, identifying viruses in them is challenging due to poor representation and annotation of viral sequences in databases. Results: Here, we establish efam, an expanded collection of Hidden Markov Model (HMM) profiles that represent viral protein families conservatively identified from the Global Ocean Virome 2.0 dataset. This resulted in 240 311 HMM profiles, each with at least 2 protein sequences, making efam >7-fold larger than the next largest, pan-ecosystem viral HMM profile database. Adjusting the criteria for viral contig confidence from 'conservative' to 'eXtremely Conservative' resulted in 37 841 HMM profiles in our efam-XC database. To assess the value of this resource, we integrated efam-XC into VirSorter viral discovery software to discover viruses from less-studied, ecologically distinct oxygen minimum zone (OMZ) marine habitats. This expanded database led to an increase in viruses recovered from every tested OMZ virome by ∼24% on average (up to ∼42%) and especially improved the recovery of often-missed shorter contigs (
Published: 2021

43. IntAct App: a Cytoscape application for molecular interaction network visualisation and analysis

Author: Ragueneau, Eliot, Shrivastava, Anjali, Morris, John H, del-Toro, Noemi, Hermjakob, Henning, and Porras, Pablo
Subjects: Networking and Information Technology R&D (NITRD), Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Summary: IntAct App is a Cytoscape 3 application that grants in-depth access to IntAct's molecular interaction data. It builds networks where nodes are interacting molecules (mainly proteins, but also genes, RNA, chemicals…) and edges represent evidence of interaction. Users can query a network by providing its molecules, identified by different fields, and optionally include all their interacting partners in the resulting network. The app offers three visualizations: one only displaying interactions, another representing every evidence and the last one emphasizing evidence where mutated versions of proteins were used. Users can also filter networks and click on nodes and edges to access all their related details. Finally, the application supports automation of its main features via Cytoscape commands. Availability and implementation: Implementation available at https://apps.cytoscape.org/apps/intactapp, while the source code is available at https://github.com/EBI-IntAct/IntactApp.
Published: 2021

44. Reactome and the Gene Ontology: digital convergence of data resources

Author: Good, Benjamin M, Van Auken, Kimberly, Hill, David P, Mi, Huaiyu, Carbon, Seth, Balhoff, James P, Albou, Laurent-Philippe, Thomas, Paul D, Mungall, Christopher J, Blake, Judith A, and D’Eustachio, Peter
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Gene Ontology Causal Activity Models (GO-CAMs) assemble individual associations of gene products with cellular components, molecular functions and biological processes into causally linked activity flow models. Pathway databases such as the Reactome Knowledgebase create detailed molecular process descriptions of reactions and assemble them, based on sharing of entities between individual reactions, into pathway descriptions. Results: To convert the rich content of Reactome into GO-CAMs, we have developed a software tool, Pathways2GO, to convert the entire set of normal human Reactome pathways into GO-CAMs. This conversion yields standard GO annotations from Reactome content and supports enhanced quality control for both Reactome and GO, yielding a nearly seamless conversion between these two resources for the bioinformatics community. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

45. PICS2: next-generation fine mapping via probabilistic identification of causal SNPs

Author: Taylor, Kimberly E, Ansel, K Mark, Marson, Alexander, Criswell, Lindsey A, and Farh, Kyle Kai-How
Subjects: Biological Sciences, Genetics, Human Genome, Generic health relevance, Polymorphism, Single Nucleotide, Genome-Wide Association Study, Genotype, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Summary: The Probabilistic Identification of Causal SNPs (PICS) algorithm and web application was developed as a fine-mapping tool to determine the likelihood that each single nucleotide polymorphism (SNP) in LD with a reported index SNP is a true causal polymorphism. PICS is notable for its ability to identify candidate causal SNPs within a locus using only the index SNPs, which are widely available from published GWAS, whereas other methods require full summary statistics or full genotype data. However, the original PICS web application operates on a single SNP at a time, with slow performance, severely limiting its usability. We have developed a next-generation PICS tool, PICS2, which enables performance of PICS analyses of large batches of index SNPs with much faster performance. Additional updates and extensions include use of LD reference data generated from 1000 Genomes phase 3; annotation of variant consequences; annotation of GTEx eQTL genes and downloadable PICS SNPs from GTEx eQTLs; the option of generating PICS probabilities from experimental summary statistics; and generation of PICS SNPs from all SNPs of the GWAS catalog, automatically updated weekly. These free and easy-to-use resources will enable efficient determination of candidate loci for biological studies to investigate the true causal variants underlying disease processes. Availability and implementation: PICS2 is available at https://pics2.ucsf.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

46. Epidemiological modeling in StochSS Live!

Author: Jiang, Richard, Jacob, Bruno, Geiger, Matthew, Matthew, Sean, Rumsey, Bryan, Singh, Prashant, Wrede, Fredrik, Yi, Tau-Mu, Drawert, Brian, Hellander, Andreas, and Petzold, Linda
Subjects: Networking and Information Technology R&D (NITRD), Bioengineering, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: Summary: We present StochSS Live!, a web-based service for modeling, simulation and analysis of a wide range of mathematical, biological and biochemical systems. Using an epidemiological model of COVID-19, we demonstrate the power of StochSS Live! to enable researchers to quickly develop a deterministic or a discrete stochastic model, infer its parameters and analyze the results. Availability and implementation: StochSS Live! is freely available at https://live.stochss.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

47. Active learning to classify macromolecular structures in situ for less supervision in cryo-electron tomography

Author: Du, Xuefeng, Wang, Haohan, Zhu, Zhenxi, Zeng, Xiangrui, Chang, Yi-Wei, Zhang, Jing, Xing, Eric, and Xu, Min
Subjects: Information and Computing Sciences, Machine Learning, Generic health relevance, Mathematical Sciences, Biological Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Cryo-Electron Tomography (cryo-ET) is a 3D bioimaging tool that visualizes the structural and spatial organization of macromolecules at a near-native state in single cells, which has broad applications in life science. However, the systematic structural recognition and recovery of macromolecules captured by cryo-ET are difficult due to high structural complexity and imaging limits. Deep learning-based subtomogram classification has played critical roles for such tasks. As supervised approaches, however, their performance relies on sufficient and laborious annotation on a large training dataset. Results: To alleviate this major labeling burden, we proposed a Hybrid Active Learning (HAL) framework for querying subtomograms for labeling from a large unlabeled subtomogram pool. Firstly, HAL adopts uncertainty sampling to select the subtomograms that have the most uncertain predictions. This strategy enforces the model to be aware of the inductive bias during classification and subtomogram selection, which satisfies the discriminativeness principle in AL literature. Moreover, to mitigate the sampling bias caused by such a strategy, a discriminator is introduced to judge if a certain subtomogram is labeled or unlabeled and subsequently the model queries the subtomograms that have higher probabilities of being unlabeled. Such a query strategy encourages matching the data distribution between the labeled and unlabeled subtomogram samples, which essentially encodes the representativeness criterion into the subtomogram selection process. Additionally, HAL introduces a subset sampling strategy to improve the diversity of the query set, so that the information overlap is decreased between the queried batches and the algorithmic efficiency is improved. Our experiments on subtomogram classification tasks using both simulated and real data demonstrate that we can achieve comparable testing performance (on average only 3% accuracy drop) by using less than 30% of the labeled subtomograms, which shows a very promising result for subtomogram classification tasks with limited labeling resources. Availability and implementation: https://github.com/xulabs/aitom. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

48. DECODE: a Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays

Author: Chen, Zhanlin, Zhang, Jing, Liu, Jason, Dai, Yi, Lee, Donghoon, Min, Martin Renqiang, Xu, Min, and Gerstein, Mark
Subjects: Information and Computing Sciences, Biological Sciences, Machine Learning, Genetics, Human Genome, Animals, Deep Learning, Enhancer Elements, Genetic, Genome-Wide Association Study, Mice, Neural Networks, Computer, Software, Mathematical Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Mapping distal regulatory elements, such as enhancers, is a cornerstone for elucidating how genetic variations may influence diseases. Previous enhancer-prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have implemented enhancer discovery as a binary classification problem without accurate boundary detection, producing low-resolution annotations with superfluous regions and reducing the statistical power for downstream analyses (e.g. causal variant mapping and functional validations). Here, we addressed these challenges via a two-step model called Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays (DECODE). First, we employed direct enhancer-activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution, we implemented a weakly supervised object detection framework for enhancer localization with precise boundary detection (to a 10 bp resolution) using Gradient-weighted Class Activation Mapping. Results: Our DECODE binary classifier outperformed a state-of-the-art enhancer prediction method by 24% in transgenic mouse validation. Furthermore, the object detection framework can condense enhancer annotations to only 13% of their original size, and these compact annotations have significantly higher conservation scores and genome-wide association study variant enrichments than the original predictions. Overall, DECODE is an effective tool for enhancer classification and precise localization. Availability and implementation: DECODE source code and pre-processing scripts are available at decode.gersteinlab.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

49. scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

Author: Song, Dongyuan, Li, Kexin, Hemminger, Zachary, Wollman, Roy, and Li, Jingyi Jessica
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Human Genome, Biotechnology, 1.4 Methodologies and measurements, Good Health and Well Being, Algorithms, Gene Expression Profiling, Sequence Analysis, RNA, Single-Cell Analysis, Software, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. Results: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. Availability and implementation: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021

50. JEDI: circular RNA prediction based on junction encoders and deep interaction among splice sites

Author: Jiang, Jyun-Yu, Ju, Chelsea J-T, Hao, Junheng, Chen, Muhao, and Wang, Wei
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Genetics, Networking and Information Technology R&D (NITRD), Human Genome, Machine Learning and Artificial Intelligence, Generic health relevance, Neural Networks, Computer, RNA, RNA Splice Sites, RNA Splicing, RNA, Circular, RNA, Long Noncoding, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: Motivation: Circular RNA (circRNA) is a novel class of long non-coding RNAs that have been broadly discovered in the eukaryotic transcriptome. The circular structure arises from a non-canonical splicing process, where the donor site is backspliced to an upstream acceptor site. These circRNA sequences are conserved across species. More importantly, rising evidence suggests their vital roles in gene regulation and association with diseases. As the fundamental effort toward elucidating their functions and mechanisms, several computational methods have been proposed to predict the circular structure from the primary sequence. Recently, advanced computational methods leverage deep learning to capture the relevant patterns from RNA sequences and model their interactions to facilitate the prediction. However, these methods fail to fully explore positional information of splice junctions and their deep interaction. Results: We present a robust end-to-end framework, Junction Encoder with Deep Interaction (JEDI), for circRNA prediction using only nucleotide sequences. JEDI first leverages the attention mechanism to encode each junction site based on deep bidirectional recurrent neural networks and then presents the novel cross-attention layer to model deep interaction among these sites for backsplicing. Finally, JEDI can not only predict circRNAs but also interpret relationships among splice sites to discover backsplicing hotspots within a gene region. Experiments demonstrate JEDI significantly outperforms state-of-the-art approaches in circRNA prediction on both isoform level and gene level. Moreover, JEDI also shows promising results on zero-shot backsplicing discovery, which none of the existing approaches can achieve. Availability and implementation: The implementation of our framework is available at https://github.com/hallogameboy/JEDI. Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2021
