215 results on '"FASTA format"'
Search Results
2. Processing Biological Sequences with MATLAB
- Author
-
Singh, Gautam B., Patnaik, Srikanta, Series editor, Sethi, Ishwar K., Series editor, and Li, Xiaolong, Series editor
- Published
- 2015
3. Sequences
- Author
-
Pazos, Florencio and Chagoyen, Mónica
- Published
- 2015
4. FexRNA: Exploratory Data Analysis and Feature Selection of Non-Coding RNA
- Author
-
Yi-Ping Phoebe Chen, Annette McGrath, and Noorul Amin
- Subjects
RNA, Untranslated ,Sequence Analysis, RNA ,Computer science ,Applied Mathematics ,Feature extraction ,Univariate ,FASTA format ,Computational Biology ,Feature selection ,computer.software_genre ,Machine Learning ,Set (abstract data type) ,Exploratory data analysis ,Identification (information) ,Genetics ,Data mining ,Databases, Nucleic Acid ,computer ,Algorithms ,Software ,Selection (genetic algorithm) ,Biotechnology - Abstract
Non-coding RNA (ncRNA) is involved in many biological processes and diseases in all species. Many ncRNA datasets exist that provide a sequential representation of data that best suits biomedical purposes. However, for ncRNA identification and analysis, statistical learning methods require hidden numerical features from the data. The extraction of hidden features, their analysis, and usage of a suitable set of features is crucial towards any statistical learning methods performance. Furthermore, a wealth of sequence intrinsic features has been proposed for ncRNA identification. Therefore, a systematic review and selection of these features are warranted. First, fasta format sequence datasets are generated from RNACentral representing many ncRNA types across a number of species. Next, a features dataset is created per fasta dataset consisting of 17 most frequently reported sequence intrinsic features. The features dataset is available from the FexRNA platform developed as part of this work. In addition, the features datasets are explored and analysed in terms of statistical information, univariate and bivariate analysis. For the feature selection (FS), a two-fold hierarchal FS framework based on majority voting and correlation is proposed and evaluated. Therefore, the FexRNA platform provides a useful platform for information about ncRNA features datasets, features analysis, and selection.
- Published
- 2021
5. Solutions to the Exercises
- Author
-
Selzer, Paul M., editor, Marhöfer, Richard J., editor, and Rohwer, Andreas, editor
- Published
- 2008
6. Sequence Comparisons and Sequence-Based Database Searches
- Author
-
Selzer, Paul M., editor, Marhöfer, Richard J., editor, and Rohwer, Andreas, editor
- Published
- 2008
7. E2EDNA: Simulation Protocol for DNA Aptamers with Ligands
- Author
-
Pengyu Ren, Michael Kilgour, Lena Simine, Brandon D. Walker, and Tao Liu
- Subjects
Analyte ,General Chemical Engineering ,Aptamer ,Library and Information Sciences ,Ligands ,01 natural sciences ,Article ,Force field (chemistry) ,Computational science ,03 medical and health sciences ,Computer Simulation ,A-DNA ,Protocol (object-oriented programming) ,030304 developmental biology ,0303 health sciences ,010405 organic chemistry ,SELEX Aptamer Technique ,Molecular biophysics ,FASTA format ,General Chemistry ,Folding (DSP implementation) ,Aptamers, Nucleotide ,0104 chemical sciences ,Computer Science Applications ,Biological system ,Systematic evolution of ligands by exponential enrichment - Abstract
We present the E2EDNA simulation protocol and accompanying code for the molecular biophysics and materials science communities. This protocol is both easy to use and sufficiently efficient to simulate single-stranded (ss)DNA and small analyte systems that are central to cellular processes and to nanotechnologies such as DNA aptamer-based sensors. Existing computational tools used for aptamer design focus on cost-effective secondary structure prediction and motif analysis in the large datasets produced by SELEX experiments. As a rule, they do not offer flexibility with respect to the choice of the theoretical engine or a direct access to the simulation. Practical aptamer optimization often requires higher accuracy predictions for only a small subset of sequences suggested e.g., by SELEX experiments, but in the absence of a streamlined procedure this task is extremely time and expertise intensive. We address this gap by introducing E2EDNA, a computational framework powered by Tinker, POLTYPE, MacroMoleculeBuilder and NUPACK that accepts a DNA sequence in the FASTA format and the chemical structures of the desired ligands in the sdf format as inputs and performs approximate folding followed by a refining step, docking, and molecular dynamics sampling at the desired level of accuracy. As a case study we simulate a DNA-UTP (Uridine TriPhosphate) complex in water using the state-of-the-art AMOEBA polarizable force field, beginning with identifying a representative 3D structure of the aptamer at experimental conditions, and then assessing the binding affinity of UTP.
- Published
- 2021
8. Introduction to Basic Local Alignment Search Tool
- Author
-
Bal, Harshawardhan and Hujol, Johnny
- Published
- 2007
9. Sequence Analysis with WinGene/WinPep
- Author
-
Hennig, Lars and Walker, John M., editor
- Published
- 2002
10. Processing GenBank Files
- Author
-
Xia, Xuhua
- Published
- 2000
11. Make every species count: fastachar software for rapid determination of molecular diagnostic characters to describe species
- Author
-
Lucas M. Merckelbach and Luisa M. S. Borges
- Subjects
Genetic Markers ,0106 biological sciences ,0301 basic medicine ,Species complex ,Genetic Speciation ,Biology ,Mega ,computer.software_genre ,010603 evolutionary biology ,01 natural sciences ,DNA barcoding ,03 medical and health sciences ,Software ,Genetics ,Phylogeny ,Ecology, Evolution, Behavior and Systematics ,Formal description ,Graphical user interface ,business.industry ,Programming language ,FASTA format ,DNA ,030104 developmental biology ,Taxon ,business ,computer ,Biotechnology - Abstract
Only a fraction of species found so far has been described, particularly cryptic species uncovered by molecular data. The latter might require the use of molecular data for its diagnosis, but it is important to make use of the diagnostic content of the molecular data itself. The molecular character‐based model provides discrete molecular diagnostic characters within DNA sequences that can be used in species descriptions fulfilling the requirement of most codes of nomenclature for a character‐based description of species. Here, we introduce fastachar, a software developed to extract molecular diagnostic characters from one or several taxonomically informative DNA markers of a selected taxon compared with those of other taxa in a single step. The input data consist of a single file with aligned sequences in the fasta format, which can be created using alignment software such as mega or geneious. fastachar is an easy‐to‐use software with a graphical interface. Thus, the software does not require the user to have any knowledge of the underlying programming environment (Python). We hope this software, based on the method proposed by Jörger and Schrödl (Frontiers in Zoology, 10, 59, 2013) to describe cryptic species, will encourage researchers to take the final step in taxonomy: the formal description of species. We propose the use of this method and fastachar also for the inclusion of molecular data in the description of any species. fastachar is released as open‐source software under GNU General Public License V3 and is freely available for all major operating systems from https://github.com/smerckel/FastaChar.
- Published
- 2020
12. Comparative modeling and mutual docking of structurally uncharacterized heat shock protein 70 and heat shock factor-1 proteins in water buffalo
- Author
-
Ravinder Singh, Ranjit S. Kataria, Ankita Behl, Vikash Kumar, S. K. Mishra, Saroj Rani, Ankita Gurao, and Changanamkandath Rajesh
- Subjects
homology modeling ,Veterinary medicine ,Computational biology ,PDBsum ,SF1-1100 ,03 medical and health sciences ,SF600-1100 ,heat shock protein 70 ,Homology modeling ,Protein secondary structure ,030304 developmental biology ,0303 health sciences ,Multiple sequence alignment ,General Veterinary ,bubalus bubalis ,Chemistry ,030302 biochemistry & molecular biology ,FASTA format ,ExPASy ,MODELLER ,heat shock factor-1 ,Animal culture ,Docking (molecular) ,docking ,heat shock proteins ,Research Article - Abstract
Aim: In this study, a wide range of in silico investigation of Bubalus bubalis (BB) heat shock protein 70 (HSP70) and heat shock factor-1 (HSF1) has been performed, ranging from sequence evaluation among species to homology modeling along with their docking studies to decipher the interacting residues of both molecules. Materials and Methods: Protein sequences of BB HSP70 and HSF1 were retrieved from NCBI database in FASTA format. Primary and secondary structure prediction were computed using Expasy ProtParam server and Phyre2 server, respectively. TMHMM server was used to identify the transmembrane regions in HSP70. Multiple sequence alignment and comparative analysis of the protein was carried out using MAFFT and visualization was created using ESPript 3.0. Phylogenetic analysis was accomplished by COBALT. Interactions of HSP70 with other proteins were studied using STRING database. Modeller 9.18, RaptorX, Swiss-Modeller, Phyre2, and I-TASSER were utilized to design the three-dimensional structure of these proteins followed by refinement; energy minimization was accomplished using ModRefiner and SPDBV program. Stereochemical quality along with the accuracy of the predicted models and their visualization was observed by PROCHECK program of PDBsum and UCSF Chimera, respectively. ClusPro 2.0 server was accessed for the docking of the receptor protein with the ligand. Results: The lower value of Grand Average of Hydropathy indicates the more hydrophilic nature of HSP70 protein. Value of the instability index (II) classified the protein as stable. No transmembrane region was reported for HSP70 by TMHMM server. Phylogenetic analysis based on multiple sequence alignments (MSAs) by COBALT indicated more evolutionarily closeness of Bos indicus (BI) with Bos taurus as compared to BI and BB. STRING database clearly indicates the HSF1 as one of the interacting molecules among 10 interacting partners with HSP 70. 
The best hit of 3D model of HSP70 protein and HSF1 was retrieved from I-TASSER and Phyre2, respectively. Interacting residues and type of bonding between both the molecules which were docked by ClusPro 2.0 were decoded by PIC server. Hydrophobic interactions, protein-protein main-chain-side-chain hydrogen bonds, and protein-protein side-chain-side-chain hydrogen bonds were delineated in this study. Conclusion: This is the first-ever study on in silico interaction of HSP70 and HSF1 proteins in BB. Several bioinformatics web tools were utilized to study secondary structure along with comparative modeling, physicochemical properties, and protein-protein interaction. The various interacting amino acid residues of both proteins have been indicated in this study.
- Published
- 2019
13. Simple and Rapid Detection of Burkholderia and Variolla Using Multiplex-PCR
- Author
-
Mohammad Javad Dehghan Esmatabadi, Mona Simkhah, Nafiseh Pourmahdi, and Mehdi Zeinoddini
- Subjects
variolla ,lcsh:R5-920 ,biology ,Hybrid vector ,FASTA format ,General Medicine ,Computational biology ,burkholderia ,biology.organism_classification ,Genome ,Burkholderia ,pcr ,Multiplex polymerase chain reaction ,hybrid construct ,Vector (molecular biology) ,lcsh:Medicine (General) ,Gene ,Sequence (medicine) ,positive control sample - Abstract
Background: Today, one of the most important problems in the detection of human pathogens is the lack of positive controls. The idea of using hybrid vectors, containing genes of different pathogens, can overcome this limitation. We can design specific primers for each region and use the hybrid vector as a positive control sample in PCR. In this research we designed a hybrid vector and relevant primers for detection of Variolla and Burkholderia. Materials and Methods: In this study, the 16S rRNA and HA genes were chosen to be located on the vector, to represent Burkholderia and Variolla, respectively. The sequences of these genes were obtained from NCBI in FASTA format and aligned in BioEdit software to find the conserved region of each gene; then some purposeful changes were applied to the sequence of each gene, the sequences were placed next to each other, and the construct was designed. Specific primers were designed for each region using the Oligo7, BioEdit, and GeneRunner software, the Oligo analyzer website, and the NCBI database. Finally, the construct was cloned in pUC57 in SnapGene and PCR was simulated on the hybrid vector using the designed primers. Results: Analysis confirmed that the conserved region of each gene is located on the hybrid vector for each pathogen, and simulation of PCR proved the accuracy of the designed primers. Conclusion: Hybrid vector designs contain sequences similar to the pathogens' genomes but are non-pathogenic. We can use these hybrid vectors as positive controls without any concern.
- Published
- 2019
14. Riboswitches in Archaea
- Author
-
Angela Gupta and D Swati
- Subjects
Riboswitch ,Aptamer ,Computational biology ,Biology ,01 natural sciences ,Genome ,Genes, Archaeal ,03 medical and health sciences ,Databases, Genetic ,Drug Discovery ,030304 developmental biology ,0303 health sciences ,010405 organic chemistry ,Organic Chemistry ,FASTA format ,RNA ,General Medicine ,Non-coding RNA ,biology.organism_classification ,Archaea ,0104 chemical sciences ,Computer Science Applications ,GenBank ,5' Untranslated Regions - Abstract
Background: Riboswitches are cis-acting, non-coding RNA elements found in the 5’UTR of bacterial mRNA and 3’ UTR of eukaryotic mRNA, that fold in a complex manner to act as receptors for specific metabolites hence altering their conformation in response to the change in concentrations of a ligand or metabolite. Riboswitches function as gene regulators in numerous bacteria, archaea, fungi, algae and plants. Aim and Objective: This study identifies different classes of riboswitches in the Archaeal domain of life. Previous studies have suggested that riboswitches carry a conserved aptameric domain in different domains of life. Since Archaea are considered to be the most idiosyncratic organisms it was interesting to look for the conservation pattern of riboswitches in these obviously strange microorganisms. Materials and Methods: Completely sequenced Archaeal Genomes present in the NCBI repository were used for studying riboswitches and other ncRNAs. The sequence files in FASTA format were downloaded from NCBI Genome database and information related to these genomes was retrieved from GenBank. Three bioinformatics approaches were used namely, ab initio, consensus structure prediction and statistical model-based prediction for identifying riboswitches. Results: Archaeal genomes have a sporadic distribution of putative riboswitches like the TPP, FMN, Guanidine, Lysine and c-di-AMP riboswitches, which are known to occur in bacteria. Also, a class of riboswitch sensing c-di-GMP, a second messenger, has been identified in a few Archaeal organisms. Conclusion: This study clearly reveals that bioinformatics methods are likely to play a major role in identifying conserved riboswitches and in establishing how widespread these classes are in all domains of life, even though the final confirmation may come from wet lab methods.
- Published
- 2019
15. Sweet google O’ mine—The importance of online search engines for MS-facilitated, database-independent identification of peptide-encoded book prefaces
- Author
-
Rosa Rakownikow Jersie-Christensen and Alexander Hogrebe
- Subjects
0303 health sciences ,Information retrieval ,lcsh:QH426-470 ,Computer science ,030302 biochemistry & molecular biology ,Magic (programming) ,FASTA format ,Biochemistry ,Market fragmentation ,03 medical and health sciences ,Identification (information) ,lcsh:Genetics ,Online search ,DECIPHER ,Database search engine ,Paragraph ,030304 developmental biology - Abstract
In the recent year, we felt like we were not truly showing our full potential in our PhD projects, and so we were very happy and excited when YPIC announced the ultimate proteomics challenge. This gave us the opportunity of showing off and procrastinating at the same time:) The challenge was to identify the amino acid sequence of 19 synthetic peptides made up from an English text and then find the book that it came from. For this task we chose to run on an Orbitrap Fusion™ Lumos™ Tribrid™ Mass Spectrometer with two different sensitive MS2 resolutions, each with both HCD and CID fragmentation consecutively. This strategy was chosen because we speculated that multiple MS2 scans at high quality would be beneficial over lower resolution, speed and quantity in the relatively sparse sample. The resulting chromatogram did not reveal 19 sharp distinct peaks and it was not clear to us where to start a manual spectra interpretation. We instead used the de novo option in the MaxQuant software and the resulting output gave us two phrases with words that were specific enough to be searched in the magic Google search engine. Google gave us the name of a very famous physicist, namely Sir Joseph John Thomson, and a reference to his book “Rays of positive electricity” from 1913. We then converted the paragraph we believed to be the right one into a FASTA format and used it with MaxQuant to do a database search. This resulted in 16 perfectly FASTA search-identified peptide sequences, one with a missing PTM and one found as a truncated version. The remaining one was identified within the MaxQuant de novo sequencing results. We thus show in this study that our workflow combining de novo spectra analysis algorithms with an online search engine is ideally suited for all applications where users want to decipher peptide-encoded prefaces of 20th century science books. Keywords: YPIC challenge, Peptide sentence, De novo
- Published
- 2019
16. A multiple genome alignment workflow shows the impact of repeat masking and parameter tuning on alignment of functional regions in plants
- Author
-
Yaoyao Wu, Armin Scheben, Lynn M. Johnson, Baoxing Song, Michelle C. Stitzer, Adam Siepel, Cinta Romay, and Edward S. Buckler
- Subjects
Set (abstract data type) ,Masking (art) ,Comparative genomics ,Workflow ,Multiple sequence alignment ,Computer science ,Pipeline (computing) ,FASTA format ,Data mining ,computer.software_genre ,Genome ,computer - Abstract
Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) based on the LAST aligner to allow practical and sensitive multiple alignment of diverged plant genomes with minimal user inputs. Our workflow only requires a set of genomes in FASTA format as input. The workflow outputs multiple alignments in MAF format, and includes utilities to help calculate genome-wide conservation scores. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the impact of different masking approaches and alignment parameters using genome assemblies of 33 grass species. Compared to conventional masking with RepeatMasker, a k-mer masking approach increased the alignment rate of CDS and non-coding functional regions by 25% and 14% respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for non-coding functional regions by over 52% compared to default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of non-coding sites that can be scored for conservation by over 76%.
- Published
- 2021
17. riboviz 2: A flexible and robust ribosome profiling data analysis and visualization workflow
- Author
-
MacKenzie E, Edward W. J. Wallace, Liana F. Lareau, John S. Favate, Kurowska A, Amanda Mok, Shivakumar, Xue S, Anderson F, Peter Tilton, Alexander L. Cope, Winterbourne Sm, Michael Jackson, Priyal Shah, and Kostas Kavoussanakis
- Subjects
Workflow ,Computer science ,business.industry ,Separation of concerns ,FASTA format ,Usability ,File format ,Software engineering ,business ,Pipeline (software) ,Workflow management system ,Visualization - Abstract
Motivation: Ribosome profiling, or Ribo-seq, is the state of the art method for quantifying protein synthesis in living cells. Computational analysis of Ribo-seq data remains challenging due to the complexity of the procedure, as well as variations introduced for specific organisms or specialized analyses. Many bioinformatic pipelines have been developed, but these pipelines have key limitations in terms of functionality or usability. Results: We present riboviz 2, an updated riboviz package, for the comprehensive transcript-centric analysis and visualization of Ribo-seq data. riboviz 2 includes an analysis workflow built on the Nextflow workflow management system, combining freely available software with custom code. The package is extensively documented and provides example configuration files for organisms spanning the domains of life. riboviz 2 is distinguished by clear separation of concerns between annotation and analysis: prior to a run, the user chooses a transcriptome in FASTA format, paired with annotation for the CDS locations in GFF3 format. The user is empowered to choose the relevant transcriptome for their biological question, or to run alternative analyses that address distinct questions. riboviz 2 has been extensively tested on various library preparation strategies, including multiplexed samples. riboviz 2 is flexible and uses open, documented file formats, allowing users to integrate new analyses with the pipeline. Availability: riboviz 2 is freely available at github.com/riboviz/riboviz.
- Published
- 2021
18. Qiime Artifact eXtractor (qax): A Fast and Versatile Tool to Interact with Qiime2 Archives
- Author
-
Andrea Telatin
- Subjects
FASTQ format ,Computer science ,Interface (computing) ,lcsh:Biotechnology ,Biomedical Engineering ,microbiome ,Bioengineering ,Artifact (software development) ,Applied Microbiology and Biotechnology ,Biochemistry ,03 medical and health sciences ,Software ,lcsh:TP248.13-248.65 ,Newick format ,Qiime2 ,030304 developmental biology ,0303 health sciences ,Information retrieval ,030306 microbiology ,business.industry ,metadata ,FASTA format ,bioinformatics ,Metadata ,Data exchange ,metabarcoding ,business ,Biotechnology - Abstract
Qiime2 is one of the most popular software tools used for analysis of output from metabarcoding experiments (e.g., sequencing of 16S, 18S, or ITS amplicons). Qiime2 introduced a novel and innovative data exchange format: the ‘Qiime2 artifact’. Qiime2 artifacts are structured compressed archives containing a dataset and its associated metadata. Examples of datasets are FASTQ reads, representative sequences in FASTA format, a phylogenetic tree in Newick format, while examples of metadata are the command that generated the artifact, information on the execution environment, citations on the used software, and all the metadata of the artifacts used to produce it. While artifacts can improve the shareability and reproducibility of Qiime2 workflows, they are less easily integrated with general bioinformatics pipelines. Accessing metadata in the artifacts also requires full Qiime2 installation. Qiime Artifact eXtractor (qax) allows users to easily interface with Qiime2 artifacts from the command line, without needing the full Qiime2 environment installed (or activated).
- Published
- 2021
19. Viromebrowser: A shiny app for browsing virome sequencing analysis results
- Author
-
Bas B. Oude Munnink, Marion Koopmans, and David F. Nieuwenhuijse
- Subjects
0301 basic medicine ,Computer science ,data analysis ,lcsh:QR1-502 ,Article ,lcsh:Microbiology ,03 medical and health sciences ,Annotation ,0302 clinical medicine ,SDG 3 - Good Health and Well-being ,Virology ,Human virome ,visualization ,metagenomics ,virome ,Information retrieval ,Base Sequence ,Contig ,NGS analysis ,shiny ,FASTA format ,Computational Biology ,High-Throughput Nucleotide Sequencing ,Molecular Sequence Annotation ,bioinformatics ,Metadata ,Identifier ,030104 developmental biology ,Infectious Diseases ,Metagenomics ,030220 oncology & carcinogenesis ,Software ,Reference genome - Abstract
Experiments in which complex virome sequencing data is generated remain difficult to explore and unpack for scientists without a background in data science. The processing of raw sequencing data by high throughput sequencing workflows usually results in contigs in FASTA format coupled to an annotation file linking the contigs to a reference sequence or taxonomic identifier. The next step is to compare the virome of different samples based on the metadata of the experimental setup and extract sequences of interest that can be used in subsequent analyses. The viromeBrowser is an application written in the opensource R shiny framework that was developed in collaboration with end-users and is focused on three common data analysis steps. First, the application allows interactive filtering of annotations by default or custom quality thresholds. Next, multiple samples can be visualized to facilitate comparison of contig annotations based on sample specific metadata values. Last, the application makes it easy for users to extract sequences of interest in FASTA format. With the interactive features in the viromeBrowser we aim to enable scientists without a data science background to compare and extract annotation data and sequences from virome sequencing analysis results.
- Published
- 2021
20. sangeranalyseR: Simple and Interactive Processing of Sanger Sequencing Data in R
- Author
-
Kuan-Hao Chao, Robert Lanfear, Sarah Palmer, and Kirston Barton
- Subjects
AcademicSubjects/SCI01140 ,Letter ,Web Browser ,Biology ,computer.software_genre ,Bioconductor ,User-Computer Interface ,03 medical and health sciences ,symbols.namesake ,0302 clinical medicine ,shiny application ,Genetics ,Interactive processing ,Phylogeny ,Ecology, Evolution, Behavior and Systematics ,030304 developmental biology ,Sanger sequencing ,0303 health sciences ,Database ,Contig ,SIMPLE (military communications protocol) ,AcademicSubjects/SCI01130 ,FASTA format ,Computational Biology ,alignment ,DNA ,Sequence Analysis, DNA ,bioconductor ,R package ,symbols ,chromatogram ,Sequence Alignment ,computer ,Software ,030217 neurology & neurosurgery - Abstract
sangeranalyseR is a feature-rich, free, and open-source R package for processing Sanger sequencing data. It allows users to go from loading reads to saving aligned contigs in a few lines of R code by using sensible defaults for most actions. It also provides complete flexibility for determining how individual reads and contigs are processed, both at the command-line in R and via interactive Shiny applications. sangeranalyseR provides a wide range of options for all steps in Sanger processing pipelines including trimming reads, detecting secondary peaks, viewing chromatograms, detecting indels and stop codons, aligning contigs, estimating phylogenetic trees, and more. Input data can be in either ABIF or FASTA format. sangeranalyseR comes with extensive online documentation and outputs aligned and unaligned reads and contigs in FASTA format, along with detailed interactive HTML reports. sangeranalyseR supports the use of colorblind-friendly palettes for viewing alignments and chromatograms. It is released under an MIT licence and available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR, last accessed February 22, 2021) and on Github (https://github.com/roblanf/sangeranalyseR, last accessed February 22, 2021).
- Published
- 2021
21. Performance Comparison for Data Retrieval from NoSQL and SQL Databases: A Case Study for COVID-19 Genome Sequence Dataset
- Author
-
Soarov Chakraborty, K. M. Azharul Hasan, and Shourav Paul
- Subjects
SQL ,Data retrieval ,Database ,Computer science ,Relational database ,Group method of data handling ,FASTA format ,NoSQL ,computer.software_genre ,Data structure ,computer ,JSON ,computer.programming_language - Abstract
NoSQL database management systems were introduced to tackle different sorts of challenges, including performing operations on unstructured, semi-structured, and structured data. NoSQL databases gained popularity because of their improved performance over SQL databases. We aim to investigate the performance of NoSQL systems, namely MongoDB and Cassandra, and an SQL database, namely MySQL, for DNA sequence data from the COVID-19 dataset. Studies of DNA sequences are essential for medical diagnosis and biotechnology. However, it is quite challenging to store these genomics data in a traditional RDBMS because of their unstructured nature. NoSQL is an efficient solution for textual data such as genomics data. We used around 3GB of human genome data from the COVID-19 dataset provided by NCBI. The original data was in the FASTA format, and we processed these data into JSON format. We also analyzed the different query syntax, data load time, and query performance time for the genomics data.
- Published
- 2021
22. In silico characterization and structural modeling of a homeobox protein MSX1 from Homo sapiens
- Author
-
Yogesh Mishra, Thakur Prasad Chaturvedi, Akanksha Srivastava, Subhankar Biswas, and Sneha Singh
- Subjects
0301 basic medicine ,chemistry.chemical_classification ,Homo sapiens ,Phylogenetic analysis ,Globular protein ,In silico ,FASTA format ,Health Informatics ,Computational biology ,Biology ,lcsh:Computer applications to medicine. Medical informatics ,Protein tertiary structure ,03 medical and health sciences ,030104 developmental biology ,0302 clinical medicine ,chemistry ,Docking (molecular) ,030220 oncology & carcinogenesis ,Molecular docking ,Homeobox ,Transcription regulator ,lcsh:R858-859.7 ,Gene ,MSX1 ,Synteny - Abstract
Introduction MSX1 protein, a homeobox transcriptional regulator plays a significant role in various developmental processes of the mammalian system such as limb-pattern formation, craniofacial development, in particular, odontogenesis, and tumor growth inhibition. Several studies have been performed on MSX1 at the genomic and transcriptomic levels. However, there is a lack of information on its structural and conformational aspects. Objective For better understanding of the molecular mechanism of MSX1, the present study aims to conduct a detailed in-silico analysis of this protein in terms of its physicochemical properties, secondary and tertiary structure predictions, interacting partners, and phylogenetic relationship with other orthologs. Methods The sequence of the MSX1 protein from Homo sapiens was retrieved in the FASTA format from the National Center for Biotechnology Information (NCBI). The standard bioinformatic tools were further used to characterize and model the structure of this protein. Results The in-silico characterization of MSX1 revealed that it is a basic, non-polar, and thermostable globular protein mainly localized in the nucleus. This protein is extremely rigid due to the presence of high proline content. The phylogenetic and synteny analysis revealed that the gene is highly conserved at the level of the amino acid sequences, but underwent several modifications at the genomic level in the course of evolution possibly to attain the diverse function. Major part of this protein is a random coil, making it suitable for interaction with other proteins. Subcellular localization and protein-protein interaction suggested that the protein may act as a secretory protein and play a crucial role in regulating several developmental processes. Docking analysis suggested that the MSX1 protein may interact with other proteins and form complexes to carry out its function. 
Conclusion The structural characterization of this protein will help to better understand its molecular mechanism of action. In addition, the predicted 3-D model would act as a base for further understanding of the protein's other functional potential.
- Published
- 2021
23. The Web Platform for Storing Biotechnologically Significant Properties of Bacterial Strains
- Author
-
Tatiana N. Lakhova, Pavel S. Demenkov, Sergey A. Lashin, Fedor V. Kazantsev, Alexandra I. Klimenko, and Aleksey M. Mukhin
- Subjects
World Wide Web ,Set (abstract data type) ,Data processing ,Process (engineering) ,business.industry ,FASTA format ,Database schema ,Information technology ,Table (database) ,business ,Data warehouse - Abstract
Current biology tasks are impracticable without bioinformatic data processing. Information technologies and the newest computers provide the ability to automatically execute algorithms on an extensive data set and store either strong- or weak-structured data. A well-designed architecture of such data warehouses increases the reproducibility of investigations. However, it is challenging to create a data schema that aids fast search of properties in such warehouses. This paper describes the method and its implementation for storing and processing microbiological and bioinformatical data. The web platform stores genomes in FASTA format, genome annotations in table files that indicate gene coordinates in the genomes, structural and mathematical models to compare different strains and predict new properties.
- Published
- 2021
- Full Text
- View/download PDF
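The storage layout described in this abstract (genomes in FASTA, annotations in table files holding gene coordinates) can be illustrated with a minimal sketch. It is not the platform's code: the four-column tab-separated layout (gene, contig, start, end), the 1-based inclusive coordinates, and the function names are all assumptions made for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record id -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def extract_genes(genome_fasta, annotation_tsv):
    """Return {gene: sequence} from a hypothetical gene/contig/start/end table.

    Coordinates are assumed to be 1-based and inclusive.
    """
    genomes = read_fasta(genome_fasta)
    genes = {}
    for row in annotation_tsv.strip().splitlines():
        gene, contig, start, end = row.split("\t")
        genes[gene] = genomes[contig][int(start) - 1:int(end)]
    return genes


# Toy data, invented for the example.
GENOME = ">strain1\nATGAAACCC\nGGGTTTTAA\n"
ANNOTATION = "geneA\tstrain1\t1\t6\ngeneB\tstrain1\t10\t18"
GENES = extract_genes(GENOME, ANNOTATION)
```

Keeping coordinates in a separate table means the genome file never has to be rewritten when an annotation changes.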
24. Aln2tbl: building a mitochondrial features table from a assembly alignment in fasta format
- Author
-
Joan Pons, Miquel Serra, Pere Bover, Francesco Nardi, Juan José Ensenyat, Ministerio de Ciencia, Innovación y Universidades (España), and Ministerio de Economía y Competitividad (España)
- Subjects
feature table ,GenBank submission ,gene annotation ,Mitochondrial genome ,Python ,Information retrieval ,Gene annotation ,FASTA format ,Gene Annotation ,Biology ,Python (programming language) ,computer.software_genre ,Genome ,Annotation ,Feature (computer vision) ,Scripting language ,Genetics ,Table (database) ,Molecular Biology ,computer ,Rapid Communication ,computer.programming_language ,Feature table ,Research Article - Abstract
The sequencing, annotation and analysis of complete mitochondrial genomes is an important research tool in phylogeny and evolution. Starting with the primary sequence, genes/features are generally annotated automatically to obtain preliminary annotations in the form of a feature table. Further manual curation in a graphic alignment editor is nevertheless necessary to revise annotations. As such, the automatically generated feature table is invalidated and has to be modified manually before submission to data banks. We developed aln2tbl.py, a Python script that recreates a feature table from a manually refined alignment of genes mapped on the mitochondrial genome in fasta format. The feature table is populated with notes and annotations specific to mitochondrial genomes. The table can be used to create a sqn file to be submitted directly to data banks. In summary, our script fills one gap in the available toolbox and, combined with other software, allows the automation of the entire process, from primary sequence to annotated genome submission, even if a manual curation step is conducted in a visual sequence editor. This work was supported by the Ministerio de Ciencia, Innovación y Universidades under Grant PID2019-107481GA-I00, and Ministerio de Economía y Competitividad under Grant CGL2016-76164-P.
- Published
- 2021
- Full Text
- View/download PDF
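A rough sketch of recovering feature coordinates from a gene-on-genome alignment in fasta format follows. It is not aln2tbl.py. It assumes the first record is the genome, that every later record is one gene padded with `-` to the alignment length, and that all features lie on the forward strand. The function names are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs in file order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def feature_rows(alignment_text, gap="-"):
    """Map each aligned gene row to 1-based coordinates on the genome row."""
    records = parse_fasta(alignment_text)
    _, genome = records[0]
    # Column index -> 1-based genome position (None where the genome has a gap).
    positions, pos = [], 0
    for base in genome:
        if base != gap:
            pos += 1
            positions.append(pos)
        else:
            positions.append(None)
    rows = []
    for name, seq in records[1:]:
        cols = [i for i, base in enumerate(seq)
                if base != gap and positions[i] is not None]
        if cols:
            rows.append((positions[cols[0]], positions[cols[-1]], name))
    return rows


# Toy alignment, invented for the example.
ALIGNMENT = """>genome
ATGCCGTAAGGCTTA
>cox1
ATGCCGTAA------
>trnL
---------GGCTT-
"""
ROWS = feature_rows(ALIGNMENT)
```

Each tuple (start, end, name) holds the core of one feature-table row. Strand, feature type and notes would have to come from elsewhere.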
25. A software tool for extraction of annotation data from a PDB file.
- Author
-
Mandal, Amit Kr., Indra Gopal Das, and Bhattacharjee, Debotosh
- Abstract
The rapid expansion in the amount of biological data being generated worldwide is exceeding efforts to manage analysis of the data. Annotation can be specified as any piece of information associated with an amino acid sequence. Annotation is a process of relating additional information with a particular point in a piece of information. The present task is completely based on the biological database Protein Data Bank (PDB). The main outcome of the work is to extract annotation field information from pdb files in a suitable way and accumulate it in a condensed manner. During the course of work, the FASTA sequence representation is made and this sequence is used to generate the annotation id, which is one of the major annotation fields. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
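One plausible way to build a FASTA sequence representation from a PDB file is to read its SEQRES records, as sketched below. This is not the tool described in the abstract. It relies on the fixed-column SEQRES layout (chain identifier in column 12, residue names from column 20), maps only the 20 standard residues, and writes `X` for anything else. The `demo` entry id is a placeholder.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_fasta(pdb_text, entry_id="demo"):
    """Build one FASTA record per chain from SEQRES lines."""
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES") and len(line) > 19:
            chain = line[11]              # column 12: chain identifier
            residues = line[19:].split()  # columns 20 onward: residue names
            chains.setdefault(chain, []).extend(
                THREE_TO_ONE.get(res, "X") for res in residues
            )
    return "".join(
        ">%s_%s\n%s\n" % (entry_id, chain, "".join(seq))
        for chain, seq in chains.items()
    )


# Two invented SEQRES lines laid out in the fixed-column format.
PDB_TEXT = "\n".join([
    "SEQRES   1 A    5  MET LYS ALA GLY TRP",
    "SEQRES   1 B    3  CYS HIS VAL",
])
FASTA = seqres_to_fasta(PDB_TEXT)
```

SEQRES gives the full deposited sequence. Residues missing from the ATOM records still appear, which is usually what a sequence-based annotation id needs.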
26. TEfinder: A Bioinformatics Pipeline for Detecting New Transposable Element Insertion Events in Next-Generation Sequencing Data
- Author
-
Vista Sohrab, Li-Jun Ma, Dilay Hazal Ayhan, Cristina López-Díaz, and Antonio Di Pietro
- Subjects
0301 basic medicine ,Transposable element ,Genome evolution ,lcsh:QH426-470 ,Computer science ,genome evolution ,computer.software_genre ,Article ,03 medical and health sciences ,0302 clinical medicine ,Software ,Genetics ,biochemistry ,Animals ,Software requirements ,Adaptation (computer science) ,next-generation sequencing (NGS) ,Genetics (clinical) ,business.industry ,FASTA format ,Computational Biology ,High-Throughput Nucleotide Sequencing ,Sequence Analysis, DNA ,Pipeline (software) ,mobile element insertion events ,lcsh:Genetics ,030104 developmental biology ,030220 oncology & carcinogenesis ,DNA Transposable Elements ,Data mining ,Mobile genetic elements ,transposable elements ,business ,computer - Abstract
Transposable elements (TEs) are mobile genetic elements capable of rapidly altering the genome through their movements. The importance of TE activity has been documented in many biological processes, such as introducing genetic instability, altering patterns of gene expression, and accelerating genome evolution. Increasing appreciation of TEs results in the growing number of bioinformatics software to identify insertion events. However, the application of existing TE finding tools is limited by either narrow-focused design of the package, too many dependencies on other tools, or prior knowledge required as input files that may not be readily available to all users. Here, we report a simple pipeline, TEfinder, developed for the detection of new TE insertions with minimal software dependencies using four inputs that can be easily generated with popular variant calling pipelines. The external software requirements are BEDTools, SAMtools, and Picard. Necessary inputs include TEs present in the reference genome, binary paired-end alignment, reference genome index, and a list of TE names. We tested TEfinder pipeline among several evolving populations of Fusarium oxysporum generated through a short-term adaptation study. Our results demonstrate that this easy-to-use tool can effectively detect new TE insertion events, making it accessible and practical for TE analysis.
- Published
- 2020
27. Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes
- Author
-
Koen Deforche, Vagner Fonseca, Wim Dumon, Luiz Carlos Junior Alcantara, Wasim Abdool Karim, Sara Cleemput, Marta Giovanetti, and Tulio de Oliveira
- Subjects
Statistics and Probability ,China ,viruses ,Pneumonia, Viral ,coronavirus ,Genome, Viral ,Computational biology ,Biology ,medicine.disease_cause ,Biochemistry ,Genome ,DNA sequencing ,Article ,Betacoronavirus ,03 medical and health sciences ,0302 clinical medicine ,Phylogenetics ,medicine ,Humans ,030212 general & internal medicine ,Typing ,Pandemics ,Molecular Biology ,Phylogeny ,030304 developmental biology ,Coronavirus ,SARS ,Whole genome sequencing ,Internet ,0303 health sciences ,Whole Genome Sequencing ,Phylogenetic tree ,SARS-CoV-2 ,Viral Epidemiology ,FASTA format ,COVID-19 ,mutations ,viral classification ,3. Good health ,Computer Science Applications ,Computational Mathematics ,Applications Note ,Computational Theory and Mathematics ,2019-nCoV ,Identification (biology) ,phylogenetic ,Coronavirus Infections ,Software - Abstract
Summary Genome Detective is a web-based, user-friendly software application to quickly and accurately assemble all known virus genomes from next-generation sequencing datasets. This application allows the identification of phylogenetic clusters and genotypes from assembled genomes in FASTA format. Since its release in 2019, we have produced a number of typing tools for emergent viruses that have caused large outbreaks, such as Zika and Yellow Fever Virus in Brazil. Here, we present the Genome Detective Coronavirus Typing Tool that can accurately identify the novel severe acute respiratory syndrome (SARS)-related coronavirus (SARS-CoV-2) sequences isolated in China and around the world. The tool can accept up to 2000 sequences per submission and the analysis of a new whole-genome sequence will take approximately 1 min. The tool has been tested and validated with hundreds of whole genomes from 10 coronavirus species, and correctly classified all of the SARS-related coronavirus (SARSr-CoV) and all of the available public data for SARS-CoV-2. The tool also allows tracking of new viral mutations as the outbreak expands globally, which may help to accelerate the development of novel diagnostics, drugs and vaccines to stop the COVID-19 disease. Availability and implementation https://www.genomedetective.com/app/typingtool/cov Contact koen@emweb.be or deoliveira@ukzn.ac.za Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2020
28. Extending Comet for global amino acid variant and post-translational modification analysis using the PSI extended FASTA format (PEFF)
- Author
-
Jimmy K. Eng and Eric W. Deutsch
- Subjects
Proteomics ,0303 health sciences ,NeXtProt ,Sequence database ,Proteomics Standards Initiative ,Proteome ,Computer science ,030302 biochemistry & molecular biology ,FASTA format ,Computational biology ,Biochemistry ,Post Translational Modification Analysis ,Article ,03 medical and health sciences ,HEK293 Cells ,Human proteome project ,Comet (programming) ,Humans ,Amino Acids ,Databases, Protein ,Molecular Biology ,Protein Processing, Post-Translational ,Software ,030304 developmental biology - Abstract
Protein identification by tandem mass spectrometry sequence database searching is a standard practice in many proteomics laboratories. The de facto standard for the representation of sequence databases used as input to sequence database search tools is the FASTA format. The Human Proteome Organization's Proteomics Standards Initiative has developed an extension to the FASTA format termed the proteomics standards initiative extended FASTA format or PSI extended FASTA format (PEFF) where additional information such as structural annotations are encoded in the protein description lines. Comet has been extended to automatically analyze the post translational modifications and amino acid substitutions encoded in PEFF databases. Comet's PEFF implementation and example analysis results searching a HEK293 dataset against the neXtProt PEFF database are presented.
- Published
- 2020
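The idea of encoding annotations in the protein description line can be sketched as a small parser. It assumes the PEFF convention of backslash-prefixed `\Key=value` pairs after the identifier, and it assumes that simple variants are written as `(position|residue)` groups. The header below is invented, and the code is not Comet's implementation.

```python
import re

# A key starts with a backslash; its value runs until the next " \Key=" or line end.
KEY_VALUE = re.compile(r"\\(\w+)=(.*?)(?=\s+\\\w+=|$)")


def parse_peff_header(line):
    """Split a PEFF-style description line into (identifier, {key: value})."""
    identifier, _, rest = line.lstrip(">").partition(" ")
    return identifier, {key: value.strip() for key, value in KEY_VALUE.findall(rest)}


def simple_variants(value):
    """Read '(position|residue)' groups into (int, str) tuples."""
    return [(int(pos), aa) for pos, aa in re.findall(r"\((\d+)\|([A-Z*])", value)]


# Hypothetical header with made-up identifier and values.
HEADER = r">db:P00001 \PName=Example protein \Length=12 \VariantSimple=(3|A)(7|K)"
IDENT, FIELDS = parse_peff_header(HEADER)
VARIANTS = simple_variants(FIELDS["VariantSimple"])
```

A search tool would then expand each (position, residue) pair into an alternative peptide candidate at search time instead of storing every variant sequence explicitly.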
29. DNA Barcode: The Genetic Blueprint for Identity and Diversity of Phyllanthus amarus Schum. et. Thonn
- Author
-
M. Vijayalakshmi, M. Ushakiranmayi, and P. Sudhakar
- Subjects
genomic DNA ,chemistry.chemical_compound ,Multiple sequence alignment ,chemistry ,Phylogenetic tree ,GenBank ,FASTA format ,food and beverages ,Computational biology ,Biology ,Mega ,DNA barcoding ,DNA - Abstract
Since the identification of plant species is important, the present work was carried out to identify the medicinal plant Phyllanthus amarus, taking the genetic variability among the species into consideration. The chloroplast genomic DNA was isolated and tested for quantity and quality. 2 μl of the template DNA, master mix, molecular-grade water and the forward and reverse primers were added and the final volume was made up to 50 μl, and the amplification was performed in a gradient thermo-cycler and tested for amplification. The product was purified and sequenced and the sequence was submitted to GenBank. The sequence was subjected to multiple sequence alignment, and the phylogenetic tree was constructed using MEGA 4 bioinformatics tools. The genomic DNA of the chloroplast was isolated, the highly conserved non-coding intron region of t-RNA L of the chloroplast was amplified, the amplified fragment was sequenced and the sequence in FASTA format was subjected to Multiple Sequence Alignment. The sequence showed no homology with any other sequence reported in GenBank, and this was the first highly conserved non-coding intron sequence of the plant Phyllanthus amarus that was deposited, and the phylogenetic tree was constructed. The identification of plant species based on morphology is acceptable to an extent, but the genetic blueprint gives the exact information for the identification of the plant species. DNA barcode is a standardized and cost-effective molecular identification system used for plant identification. Hence, medicinally valuable plants need accurate identification among the species belonging to the same genus.
- Published
- 2020
- Full Text
- View/download PDF
30. Global cataloguing of variations in untranslated regions of viral genome and prediction of key host RNA binding protein-microRNA interactions modulating genome stability in SARS-CoV-2
- Author
-
Srikanta Goswami and Moumita Mukherjee
- Subjects
0301 basic medicine ,Untranslated region ,RNA viruses ,Coronaviruses ,RNA-binding protein ,RNA-binding proteins ,Squamous Cell Lung Carcinoma ,Genome ,Biochemistry ,Lung and Intrathoracic Tumors ,0302 clinical medicine ,Untranslated Regions ,Coding region ,3' Untranslated Regions ,Immune Response ,Pathology and laboratory medicine ,Genetics ,Viral Genomics ,Multidisciplinary ,Adenocarcinoma of the Lung ,Messenger RNA ,FASTA format ,Genomics ,Medical microbiology ,Nucleic acids ,Oncology ,030220 oncology & carcinogenesis ,Host-Pathogen Interactions ,Viruses ,Viral Genome ,RNA, Viral ,Medicine ,SARS CoV 2 ,Pathogens ,Coronavirus Infections ,Protein Binding ,Research Article ,SARS coronavirus ,Science ,Pneumonia, Viral ,Immunology ,Genome, Viral ,Microbial Genomics ,Biology ,Adenocarcinoma ,Microbiology ,Genomic Instability ,03 medical and health sciences ,Betacoronavirus ,Open Reading Frames ,Signs and Symptoms ,Virology ,microRNA ,Humans ,Squamous Cell Carcinoma ,Non-coding RNA ,Pandemics ,Whole genome sequencing ,Medicine and health sciences ,Inflammation ,Natural antisense transcripts ,Binding Sites ,Base Sequence ,Biology and life sciences ,SARS-CoV-2 ,Carcinoma ,Organisms ,Viral pathogens ,RNA ,COVID-19 ,Cancers and Neoplasms ,Proteins ,Gene regulation ,Microbial pathogens ,MicroRNAs ,030104 developmental biology ,Gene expression ,Clinical Medicine ,5' Untranslated Regions - Abstract
Background The world is going through the critical phase of the COVID-19 pandemic, caused by the human coronavirus SARS-CoV2. Worldwide concerted effort to identify viral genomic changes across different sub-types has identified several strong changes in the coding region. However, there have not been many studies focusing on the variations in the 5’ and 3’ untranslated regions and their consequences. Considering the possible importance of these regions in host mediated regulation of the viral RNA genome, we wanted to explore the phenomenon. Methods To have an idea of the global changes in 5’ and 3’-UTR sequences, we downloaded 8595 complete and high-coverage SARS-CoV2 genome sequences from human hosts in FASTA format from the Global Initiative on Sharing All Influenza Data (GISAID) from 15 different geographical regions. Next, we aligned them using Clustal Omega software and investigated the UTR variants. We also looked at the putative host RNA binding protein (RBP) and microRNA binding sites in these regions by ‘RBPmap’ and ‘RNA22 v2’ respectively. Expression status of selected RBPs and microRNAs was checked in lung tissue. Results We identified 28 unique variants in the SARS-CoV2 UTR regions based on a minimum variant percentage cut-off of 0.5. Along with the 241C>T change, the important 5’-UTR change identified was 187A>G, while 29734G>C, 29742G>A/T and 29774C>T were the most familiar variants of the 3’UTR among most of the continents. Furthermore, we found that despite the variations in the UTR regions, binding of host RBP to them remains mostly unaltered, which further influenced the functioning of specific miRNAs. Conclusion Our results show, for the first time in SARS-CoV2 infection, a possible cross-talk between host RBPs-miRNAs and viral UTR variants, which ultimately could explain the mechanism of escaping host RNA decay machinery by the virus. The knowledge might be helpful in developing anti-viral compounds in future.
- Published
- 2020
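The "minimum variant percentage cut-off" step can be pictured as a per-column tally over the aligned sequences. The sketch below is an assumption about how such a filter might look, not the authors' pipeline. It expects sequences already aligned to an ungapped reference of equal length, and it ignores gaps and `N` calls.

```python
from collections import Counter


def column_variants(reference, aligned, cutoff_percent=0.5):
    """Return {(position, ref_base, alt_base): percent} at or above the cutoff."""
    total = len(aligned)
    variants = {}
    for i, ref_base in enumerate(reference):
        counts = Counter(seq[i] for seq in aligned)
        for base, n in counts.items():
            if base in (ref_base, "-", "N"):
                continue  # skip reference matches, gaps and ambiguous calls
            percent = 100.0 * n / total
            if percent >= cutoff_percent:
                variants[(i + 1, ref_base, base)] = percent
    return variants


# Toy alignment of ten sequences, invented for the example.
REFERENCE = "ACGTAC"
ALIGNED = ["ACGTAC"] * 6 + ["ATGTAC"] * 3 + ["ACGTAT"]
VARIANTS = column_variants(REFERENCE, ALIGNED, cutoff_percent=20.0)
```

With the toy data, the change at position 2 is seen in 3 of 10 sequences and is kept, while the change at position 6, seen once, falls below the 20% cut-off used in the example.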
31. Reconstruction of Phylogenetic Tree for COX with DNA Sequences
- Author
-
V. JaswanthSai, N. V. Krishna Rao, D. Anurag, R. AnanthSai, P. Akash, and Chukka Santhaiah
- Subjects
Tree (data structure) ,Phylogenetic tree ,Similarity (network science) ,FASTA format ,Computational biology ,Biology ,Gene ,Function (biology) ,DNA sequencing - Abstract
The phylogenetic tree plays a major role in identifying the evolution of a species or evolutionary relationships. The aim of this study was to reconstruct the phylogenetic tree for the cyclooxygenase (COX) gene with DNA sequences. The COX gene (P22437.1) data is taken from the NCBI (National Center for Biotechnology Information) repository. During the process of reconstruction, the data is always considered in FASTA format. The Blast function is used on this data, which helps in calculating optimal alignments quickly and also searches for similarity scores in known DNA sequences or other organisms. Then, the ClustalW approach is used for reconstruction of the tree as it uses progressive alignment methods and also helps in reducing duplicate sequences.
- Published
- 2020
- Full Text
- View/download PDF
32. In silico identification and construction of microbial gene clusters associated with biodegradation of xenobiotic compounds
- Author
-
Aditya B. Pant, Prachi Srivastava, Anjani Kumari, and Garima Awasthi
- Subjects
0301 basic medicine ,In silico ,Microbial Consortia ,Computational biology ,Biology ,Microbiology ,Xenobiotics ,Skeletonema costatum ,03 medical and health sciences ,chemistry.chemical_compound ,Gene ,Phylogeny ,Bacteria ,Fungi ,FASTA format ,Biodegradation ,biology.organism_classification ,Pseudomonas putida ,Biodegradation, Environmental ,030104 developmental biology ,Infectious Diseases ,chemistry ,Multigene Family ,Identification (biology) ,Xenobiotic ,Genes, Microbial - Abstract
Chemical substances that play no role in biological systems and cause serious health hazards may be designated as xenobiotic compounds. Elimination or degradation of these unwanted substances is a major concern of current research. Biodegradation is a very important aspect of this research, as discussed in the current manuscript. The current study focuses on detailed data mining for the construction of microbial consortia for a wide range of xenobiotic compounds. An intensive literature search was done for the construction of this library. The desired data were retrieved from NCBI in FASTA format and analysed through homology approaches using BLAST. This homology-based search showed that not only bacterial populations but also many other cheap and potential sources are available for the degradation of different xenobiotics. Although bacteria account for the major part of biodegradation, about 90.6%, algae and fungi also show a promising future in the degradation of some important xenobiotic compounds. Analysis of the data reveals that Pseudomonas putida has the potential to degrade the largest number of compounds. Establishment of correlation through cluster analysis signifies that Pseudomonas putida, Aspergillus niger and Skeletonema costatum can have combined traits that can be used in finding out the actual evolutionary relationship between these species. These findings may also give a new outcome in terms of a much cheaper and eco-friendly source in the area of biodegradation of specified xenobiotic compounds.
- Published
- 2018
- Full Text
- View/download PDF
33. JCAST: Sample-specific protein isoform databases for mass spectrometry-based proteomics experiments
- Author
-
Edward Lau and R.W. Ludwig
- Subjects
Protein isoform ,Database ,Computer science ,In silico ,Alternative splicing ,FASTA format ,Python (programming language) ,Proteogenomics ,computer.software_genre ,Proteomics ,ComputingMethodologies_PATTERNRECOGNITION ,Protein sequencing ,computer ,Software ,computer.programming_language - Abstract
JCAST is an open-source Python software tool that allows users to easily create custom protein sequence databases for proteogenomic applications. JCAST takes in RNA sequencing data containing alternative splicing junctions as input, models the likely translatable protein isoform sequences within a particular sample, performs in silico translation using annotated open reading frames, and outputs sample-specific protein sequence databases in FASTA format to support downstream mass spectrometry data analysis of protein isoforms. This article describes the functionality and usage of the JCAST software and documents a stable code repository for user access.
- Published
- 2021
- Full Text
- View/download PDF
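The final steps named in the JCAST abstract, in silico translation followed by FASTA output, can be sketched in a few lines. This is not JCAST code. It uses the standard genetic code, translates a single given frame until the first stop codon, and the record name is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, frame=0):
    """Translate from `frame` until the first stop codon or the sequence end."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(name, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical junction sequence and record name.
RECORD = to_fasta("junction_1|frame0", translate("ATGGCCAAGTGGTAAGGC"))
```

A real tool would pick the frame from the annotated open reading frame rather than take it as a parameter, as the abstract indicates.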
34. An Efficient Approach to Explore and Discriminate Anomalous Regions in Bacterial Genomes Based on Maximum Entropy
- Author
-
Fabrício Martins Lopes, Gesiele Almeida Barros-Carvalho, and Marie-Anne Van Sluys
- Subjects
DNA, Bacterial ,0301 basic medicine ,Whole genome sequencing ,Xanthomonas ,biology ,Sequence analysis ,Entropy ,Principle of maximum entropy ,FASTA format ,Genomics ,Bacterial genome size ,Computational biology ,biology.organism_classification ,Genome ,Xanthomonas campestris ,03 medical and health sciences ,Computational Mathematics ,Annotation ,030104 developmental biology ,Computational Theory and Mathematics ,Modeling and Simulation ,Botany ,Genetics ,Molecular Biology ,Genome, Bacterial - Abstract
Recently, there has been an increase in the number of whole bacterial genomes sequenced, mainly due to the advance of next-generation sequencing technologies. In the face of this, there is a need to provide new analytical alternatives that can follow this advance. Given our current knowledge about the genomic plasticity of bacteria and that those genomic regions can uncover important features about this microorganism, our goal was to develop a fast methodology based on maximum entropy (ME) to guide the researcher to regions that could be prioritized during the analysis. This methodology was compared with other available methods. In addition, ME was applied to eight different bacterial genera. The methodology consists of two main steps: processing the nucleotide sequence and ME calculation. We applied ME to Xanthomonas axonopodis pv. citri 306 (XAC) and Xanthomonas campestris pv. campestris ATCC 33913 (XCC), both of which have their anomalous regions well documented. We then compared our results against those from Alien Hunter, HGT-DB, Islander, IslandPath, and SIGI-HMM. ME was shown to be superior in terms of efficiency and analysis duration. Besides, ME only needs the genome sequence in FASTA format as input. The proposed strategy based on ME is able to help in bacterial genome exploration. This is a simple and fast strategy for individual genomes in comparison with other available methods, without relying on previous annotation and alignments. This methodology can also be a new option in the early stages of analysis of newly sequenced bacterial genomes.
- Published
- 2017
- Full Text
- View/download PDF
35. WGSSAT: A High-Throughput Computational Pipeline for Mining and Annotation of SSR Markers From Whole Genomes
- Author
-
Shreya Srivastava, Ravindra Kumar, Suyash Agarwal, Manmohan Pandey, Prachi Srivastava, Naresh Sahebrao Nagpure, Basdeo Kushwaha, and J. K. Jena
- Subjects
Genetic Markers ,0301 basic medicine ,In silico ,Computational biology ,Biology ,Bioinformatics ,Genome ,03 medical and health sciences ,Annotation ,0302 clinical medicine ,Genetics ,Animals ,Coding region ,Molecular Biology ,Gene ,Genetics (clinical) ,Whole genome sequencing ,Fugu ,FASTA format ,Computational Biology ,food and beverages ,Genomics ,Takifugu ,030104 developmental biology ,030220 oncology & carcinogenesis ,Software ,Microsatellite Repeats ,Biotechnology - Abstract
Mining and characterization of Simple Sequence Repeat (SSR) markers from whole genomes provide valuable information about the biological significance of SSR distribution and also facilitate development of markers for genetic analysis. Whole genome sequencing (WGS)-SSR Annotation Tool (WGSSAT) is a graphical user interface pipeline developed using Java Netbeans and Perl scripts which simplifies the process of SSR mining and characterization. WGSSAT takes input in FASTA format and automates the prediction of genes, noncoding RNA (ncRNA), core genes, repeats and SSRs from whole genomes followed by mapping of the predicted SSRs onto a genome (classified according to genes, ncRNA, repeats, exonic, intronic, and core gene region) along with primer identification and mining of cross-species markers. The program also generates a detailed statistical report along with visualization of mapped SSRs, genes, core genes, and RNAs. The features of WGSSAT were demonstrated using Takifugu rubripes data. This yielded a total of 139 057 SSRs, out of which 113 703 SSR primer pairs were uniquely amplified in silico onto a T. rubripes (fugu) genome. Out of 113 703 mined SSRs, 81 463 were from coding regions (including 4286 exonic and 77 177 intronic), 7 from RNA, 267 from core genes of fugu, whereas 105 641 SSR and 601 SSR primer pairs were uniquely mapped onto the medaka genome. WGSSAT is tested under Ubuntu Linux. The source code, documentation, user manual, example dataset and scripts are available online at https://sourceforge.net/projects/wgssat-nbfgr.
- Published
- 2017
- Full Text
- View/download PDF
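The core SSR-mining step can be approximated with a back-referencing regular expression. The sketch below is a simplification, not WGSSAT's Perl implementation. The unit-length and repeat-count thresholds are arbitrary defaults. Because the shortest unit considered is two bases, a homopolymer run would be reported as a dinucleotide repeat.

```python
import re


def find_ssrs(sequence, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, end, unit, repeats) tuples with 1-based inclusive coordinates."""
    # Lazy unit match so the shortest repeating unit is tried first.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        repeats = len(match.group(0)) // len(unit)
        hits.append((match.start() + 1, match.end(), unit, repeats))
    return hits


# Toy sequence: (AC)5 followed by (GAT)4, flanked by non-repetitive bases.
SEQUENCE = "GGTT" + "AC" * 5 + "GAT" * 4 + "GCC"
HITS = find_ssrs(SEQUENCE)
```

Mapping the resulting coordinates onto gene, exon and intron intervals would then give the kind of classification the abstract reports.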
36. VCFtoTree: a user-friendly tool to construct locus-specific alignments and phylogenies from thousands of anthropologically relevant genome sequences
- Author
-
Duo Xu, Yousef Jaber, Pavlos Pavlidis, and Omer Gokcumen
- Subjects
0301 basic medicine ,Primates ,Next generation sequencing data ,Computational biology ,Biology ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Genome ,DNA sequencing ,03 medical and health sciences ,User-Computer Interface ,1000Genomes ,INDEL Mutation ,Structural Biology ,Newick format ,Animals ,Humans ,1000 Genomes Project ,FASTA ,Molecular Biology ,lcsh:QH301-705.5 ,Phylogeny ,Graphical user interface ,Parsing ,Anthropological genetics ,Base Sequence ,business.industry ,Genome, Human ,Applied Mathematics ,FASTA format ,Sequence Analysis, DNA ,Computer Science Applications ,030104 developmental biology ,ComputingMethodologies_PATTERNRECOGNITION ,lcsh:Biology (General) ,Genetic Loci ,VCF ,lcsh:R858-859.7 ,Human genome ,business ,computer ,Sequence Alignment ,Algorithms ,Software - Abstract
Background Constructing alignments and phylogenies for a given locus from large genome sequencing studies with relevant outgroups allows novel evolutionary and anthropological insights. However, no user-friendly tool has been developed to integrate thousands of recently available and anthropologically relevant genome sequences to construct complete sequence alignments and phylogenies. Results Here, we provide VCFtoTree, a user-friendly tool with a graphical user interface that directly accesses online databases to download, parse and analyze genome variation data for regions of interest. Our pipeline combines popular sequence datasets and tree building algorithms with custom data parsing to generate accurate alignments and phylogenies using all the individuals from the 1000 Genomes Project, Neanderthal and Denisovan genomes, as well as reference genomes of Chimpanzee and Rhesus Macaque. It can also be applied to other phased human genomes, as well as genomes from other species. The output of our pipeline includes an alignment in FASTA format and a tree file in newick format. Conclusion VCFtoTree fulfills the increasing demand for constructing alignments and phylogenies for a given locus from thousands of available genomes. Our software provides a user-friendly interface for a wider audience without prerequisite knowledge in programming. VCFtoTree can be accessed from https://github.com/duoduoo/VCFtoTree_3.0.0. Electronic supplementary material The online version of this article (10.1186/s12859-017-1844-0) contains supplementary material, which is available to authorized users.
- Published
- 2017
- Full Text
- View/download PDF
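Turning phased variant calls into a FASTA alignment can be illustrated by writing each sample's two haplotypes over a copy of the reference. This sketch is not VCFtoTree's code. It handles biallelic single-nucleotide variants with phased `0|1`-style genotypes only, skips indels, and ignores unphased calls. The sample names and coordinates are invented.

```python
def haplotype_alignment(reference, vcf_text, region_start=1):
    """Return FASTA text with the reference plus two haplotypes per sample."""
    samples, rows = [], []
    for line in vcf_text.strip().splitlines():
        if line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
        else:
            rows.append(fields)
    haplotypes = {
        "%s_%d" % (sample, h + 1): list(reference)
        for sample in samples for h in (0, 1)
    }
    for fields in rows:
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        if len(ref) != 1 or len(alt) != 1:
            continue  # indels and multi-allelic sites are skipped in this sketch
        for sample, call in zip(samples, fields[9:]):
            alleles = call.split(":")[0].split("|")
            for h, allele in enumerate(alleles[:2]):
                if allele == "1":
                    haplotypes["%s_%d" % (sample, h + 1)][pos - region_start] = alt
    records = [">reference\n" + reference + "\n"]
    records += [">%s\n%s\n" % (name, "".join(seq)) for name, seq in haplotypes.items()]
    return "".join(records)


# Toy region chr1:101-106 with two samples, invented for the example.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2"]
VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(HEADER),
    "\t".join(["chr1", "102", ".", "C", "T", ".", "PASS", ".", "GT", "0|1", "1|1"]),
    "\t".join(["chr1", "105", ".", "A", "G", ".", "PASS", ".", "GT", "0|0", "1|0"]),
])
ALIGNMENT = haplotype_alignment("ACGTAA", VCF, region_start=101)
```

Because only substitutions are applied, every haplotype keeps the reference length, so the output is already a valid alignment for tree building.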
37. Broom: application for non-redundant storage of high throughput sequencing data
- Author
-
George Golovko, Yuriy Fofanov, Levent Albayrak, and Kamil Khanipov
- Subjects
Statistics and Probability ,FASTQ format ,Sequence analysis ,Computer science ,Test data generation ,computer.software_genre ,Biochemistry ,Genome ,DNA sequencing ,03 medical and health sciences ,chemistry.chemical_compound ,Software ,Transfer (computing) ,Code (cryptography) ,Molecular Biology ,030304 developmental biology ,0303 health sciences ,business.industry ,030302 biochemistry & molecular biology ,FASTA format ,Computational Biology ,High-Throughput Nucleotide Sequencing ,Sequence Analysis, DNA ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,chemistry ,Computer data storage ,Nucleic acid ,Operating system ,Databases, Nucleic Acid ,business ,computer ,Algorithms ,DNA - Abstract
Motivation The data generation capabilities of high throughput sequencing (HTS) instruments have exponentially increased over the last few years, while the cost of sequencing has dramatically decreased allowing this technology to become widely used in biomedical studies. For small labs and individual researchers, however, storage and transfer of large amounts of HTS data present a significant challenge. The recent trends in increased sequencing quality and genome coverage can be used to reconsider HTS data storage strategies. Results We present Broom, a stand-alone application designed to select and store only high-quality sequencing reads at extremely high compression rates. Written in C++, the application accepts single and paired-end reads in FASTQ and FASTA formats and decompresses data in FASTA format. Availability and implementation C++ code available at https://scsb.utmb.edu/labgroups/fofanov/broom.asp. Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2018
- Full Text
- View/download PDF
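Keeping only high-quality reads and emitting them as FASTA can be sketched as a mean-quality filter. This is not Broom's selection or compression algorithm, which the abstract does not describe. The sketch assumes four-line FASTQ records and Phred+33 quality encoding, and its threshold is arbitrary.

```python
def fastq_records(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def high_quality_fasta(fastq_text, min_mean_quality=30, offset=33):
    """Return FASTA text for reads whose mean Phred score meets the threshold."""
    out = []
    for name, seq, qual in fastq_records(fastq_text):
        scores = [ord(char) - offset for char in qual]
        if scores and sum(scores) / len(scores) >= min_mean_quality:
            out.append(">%s\n%s\n" % (name, seq))
    return "".join(out)


# Two invented reads: one with uniformly high quality, one with quality zero.
FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!!!\n"
FASTA = high_quality_fasta(FASTQ)
```

Dropping the quality line alone roughly halves the stored text per read, before any compression is applied.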
38. PSLCNN: Protein Subcellular Localization Prediction for Eukaryotes and Prokaryotes Using Deep Learning
- Author
-
Che-Yu Chang, Tz-Wei Hsu, and Jia-Ming Chang
- Subjects
Feature engineering ,business.industry ,Computer science ,Deep learning ,FASTA format ,Computational biology ,Subcellular localization ,Convolutional neural network ,Protein subcellular localization prediction ,Annotation ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial intelligence ,UniProt ,business - Abstract
Many machine learning methods have been used to predict prokaryotic and eukaryotic protein subcellular localization. As most algorithms involve specific feature engineering, we carry out prediction using the feature-free property of deep learning methods. We present PSLCNN, a model using deep neural networks to predict protein subcellular localization for eukaryotes and prokaryotes. Only sequence information is needed (FASTA format). The model uses 1D convolution and predicts where the query localizes. It was trained and tested on a non-redundant dataset from the latest UniProt release, using only data with experimental annotation. Compared with the state-of-the-art tools, PSLCNN achieves the best performance for prokaryotes and is comparable for eukaryotes. We have also implemented a free PSLCNN web service available at https://github.com/changlabtw/PSLCNN.
- Published
- 2019
- Full Text
- View/download PDF
39. ProphET, prophage estimation tool: A stand-alone prophage sequence prediction tool with self-updating reference database
- Author
-
Ashlee M. Earl, Daniella Castanheira Bartholomeu, Gustavo C. Cerqueira, João Luís Reis-Cunha, and Abigail L. Manson
- Subjects
Computer science ,medicine.medical_treatment ,Prophages ,Web Browser ,Genome ,Bacteriophage ,Database and Informatics Methods ,Gene expression ,Bacteriophages ,Database Searching ,Genetics ,0303 health sciences ,Viral Genomics ,Multidisciplinary ,Bacterial Genomics ,Bacterial genomics ,Repertoire ,FASTA format ,Microbial Genetics ,Genome project ,Genomics ,Genomic Databases ,Molecular Sequence Annotation ,Viruses ,Medicine ,Databases, Nucleic Acid ,Sequence Analysis ,Research Article ,Phage therapy ,Bioinformatics ,Science ,Virulence ,Sequence Databases ,Computational biology ,Bacterial genome size ,Microbial Genomics ,Genome, Viral ,Biology ,Research and Analysis Methods ,Genomic databases ,Microbiology ,Sensitivity and Specificity ,03 medical and health sciences ,Antibiotic resistance ,Sequence prediction ,Virology ,medicine ,Bacterial Genetics ,Sequence Similarity Searching ,Gene ,Prophage ,030304 developmental biology ,030306 microbiology ,Organisms ,Biology and Life Sciences ,Computational Biology ,Bacteriology ,Gene Annotation ,biology.organism_classification ,Genome Analysis ,Biological Databases ,Nucleic acid ,Reference database ,Software - Abstract
Background Prophages play a significant role in prokaryotic evolution, often altering the function of the cell that they infect via transfer of new genes (e.g., virulence or antibiotic resistance factors), inactivation of existing genes, or by modifying gene expression. Recently, phage therapy has gathered renewed interest as a promising alternative to control bacterial infections. Cataloging the repertoire of prophages in large collections of species' genomes is an important initial step in understanding their evolution and potential therapeutic utility. However, current widely-used tools for identifying prophages within bacterial genome sequences are mainly web-based, can have long response times, and do not scale to keep pace with the many thousands of genomes currently being sequenced routinely. Methodology In this work, we present ProphET, an easy-to-install prophage predictor for the Linux operating system, without the constraints associated with a web-based tool. ProphET predictions rely on similarity searches against a database of prophage genes, taking as input a bacterial genome sequence in FASTA format and its corresponding gene annotation in GFF. ProphET identifies prophages in three steps: similarity search, calculation of the density of prophage genes, and edge refinement. ProphET performance was evaluated and compared with other phage predictors based on a set of 54 bacterial genomes containing 267 manually annotated prophages. Findings and conclusions ProphET identifies prophages in bacterial genomes with high precision and offers a fast, highly scalable alternative to widely-used web-based applications for prophage detection.
- Published
- 2019
40. XMAn v2—a database of Homo sapiens mutated peptides
- Author
-
Iulia M. Lazar and Marcela Aguilera Flores
- Subjects
Statistics and Probability ,Databases, Factual ,Nonsense mutation ,Biology ,computer.software_genre ,Tandem mass spectrometry ,Biochemistry ,03 medical and health sciences ,0302 clinical medicine ,Protein sequencing ,Tandem Mass Spectrometry ,Missense mutation ,Humans ,Amino Acid Sequence ,Databases, Protein ,Molecular Biology ,Peptide sequence ,030304 developmental biology ,0303 health sciences ,Database ,FASTA format ,Proteins ,Applications Notes ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,Homo sapiens ,Mutation testing ,Peptides ,computer ,030217 neurology & neurosurgery - Abstract
Summary The ‘Unknown Mutation Analysis (XMAn)’ database is a compilation of Homo sapiens mutated peptides in FASTA format that was constructed to facilitate the identification of protein sequence alterations by tandem mass spectrometry detection. The database comprises 2 539 031 non-redundant mutated entries from 17 599 proteins, of which 2 377 103 are missense and 161 928 are nonsense mutations. It can be used in conjunction with search engines that seek the identification of peptide amino acid sequences by matching experimental tandem mass spectrometry data to theoretical sequences from a database. Availability and implementation XMAn v2 can be accessed from github.com/lazarlab/XMAnv2. Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2019
41. siRNA-Finder (si-Fi) Software for RNAi-Target Design and Off-Target Prediction
- Author
-
Stefanie Lück, Tino Kreszies, Patrick Schweizer, Markus Kuhlmann, Marc Strickert, and Dimitar Douchkov
- Subjects
0106 biological sciences ,0301 basic medicine ,Computer science ,Transgene ,Plant Science ,Computational biology ,lcsh:Plant culture ,01 natural sciences ,posttranscriptional gene silencing ,03 medical and health sciences ,Software ,RNA interference ,Methods ,Gene silencing ,lcsh:SB1-1110 ,Gene ,si-Fi ,Mechanism (biology) ,business.industry ,fungi ,FASTA format ,030104 developmental biology ,RNAi design ,RNAi efficiency prediction ,business ,RNA interface ,off-target ,Function (biology) ,010606 plant biology & botany - Abstract
RNA interference (RNAi) is a technique used for transgene-mediated gene silencing based on the mechanism of posttranscriptional gene silencing (PTGS). PTGS is a ubiquitous basic biological phenomenon involved in the regulation of transcript abundance and plants’ immune response to viruses. PTGS also mediates genomic stability by silencing retroelements. RNAi has become an important research tool for studying gene function by strong and selective suppression of target genes. Here, we present si-Fi, a software tool for design optimization of the RNAi constructs necessary for specific target gene knock-down. It offers efficiency prediction of RNAi sequences and off-target search, both required for the practical application of RNAi. si-Fi is open-source (CC BY-SA license) desktop software that runs in the Microsoft Windows environment and can use custom sequence databases in standard FASTA format.
- Published
- 2019
- Full Text
- View/download PDF
42. ZooPhy: A bioinformatics pipeline for virus phylogeography and surveillance
- Author
-
Matteo Valente, Matthew Scotch, and Arjun Magge
- Subjects
0303 health sciences ,education.field_of_study ,Sequence database ,Computer science ,Population ,FASTA format ,Context (language use) ,Bioinformatics ,computer.software_genre ,Metadata ,03 medical and health sciences ,Tree (data structure) ,0302 clinical medicine ,Viral phylodynamics ,General Earth and Planetary Sciences ,030212 general & internal medicine ,education ,computer ,Abstract ,030304 developmental biology ,General Environmental Science ,Data integration - Abstract
Objective We will describe the ZooPhy system for virus phylogeography and public health surveillance [1]. ZooPhy is designed for public health personnel that do not have expertise in bioinformatics or phylogeography. We will show its functionality by performing case studies of different viruses of public health concern including influenza and rabies virus. We will also provide its URL for user feedback by ISDS delegates. Introduction Sequence-informed surveillance is now recognized as an important extension to the monitoring of rapidly evolving pathogens [2]. This includes phylogeography, a field that studies the geographical lineages of species including viruses [3] by using sequence data (and relevant metadata such as sampling location). This work relies on bioinformatics knowledge. For example, the user first needs to find a relevant sequence database, navigate through it, and use proper search parameters to obtain the desired data. They also must ensure that there is sufficient metadata such as collection date and sampling location. They then need to align the sequences and integrate everything into specific software for phylogeography. For example, BEAST [4] is a popular tool for discrete phylogeography. For proper use, the software requires knowledge of phylogenetics and utilization of BEAUti, its XML processing software. The user then needs to use other software, like TreeAnnotator [4], to produce a single (“representative”) maximum clade credibility (MCC) tree. Even then, the evolutionary spread of the virus can be difficult to interpret via a simple tree viewer. There is software (such as SpreaD3 [5]) for visualizing a tree within a geographic context, yet for novice users, it might not be easy to use. Currently, there are only a few systems designed to automate these types of tasks for virus surveillance and phylogeography. Methods We have developed ZooPhy, a pipeline for sequence-informed surveillance and phylogeography [1]. 
It is designed for health agency personnel that do not have expertise in bioinformatics or phylogeography. We created a large database of all virus sequences and metadata from GenBank [6] as well as a smaller database for selected viruses perceived to be of great interest for health agencies including: influenza (A, B, and C), Ebola, rabies, West Nile virus, and Zika virus. In Figure 1A, we show our front-end architecture, created in the style of the influenza research database [7], that enables the user to search by: virus, gene name, host, time-frame, and geography. We also allow users to upload their own list of GenBank accessions or unpublished sequences. Hitting “Search” produces a Results tab which includes the metadata of the sequences. We provide a feature to randomly down-sample by a specified percentage or number. We also allow the user to download the metadata in CSV format or the unaligned sequences in FASTA format. The final tab, "Run", includes a text box for specifying an email in order to send job updates and final results on virus spread. We also enable the user to study the influence of predictors on virus spread (via a generalized linear model). Currently, we have predictors such as temperature, great circle distance, population, and sample size for selected countries. We also offer experts the ability to specify advanced modeling parameters including the molecular clock type (strict vs. relaxed), coalescent tree prior, and chain length and sampling frequency for the Markov-chain Monte Carlo. When the user selects “Start ZooPhy”, a pre-processor eliminates incomplete or non-disjoint record locations and sends the rest for analysis. Results When initiated, the ZooPhy pipeline includes sequence alignment via Mafft [8] and creation of an XML template via BEASTGen for input into BEAST for discrete phylogeography. It then uses TreeAnnotator [4] to create an MCC tree from the posterior distribution of sampled trees. 
ZooPhy uses the MCC as input into SpreaD3 for a recreation of the time-estimated migration via a map. If the user selects the GLM option, the system runs an R script to calculate the Bayes factor of the inclusion probability for each predictor and draws a plot including the regression coefficient and its 95% Bayesian credible interval. We are currently working on new visualization techniques such as those demonstrated by Dudas et al. that combine time-oriented spread via a map and evolution on a phylogenetic tree annotated by discrete locations [9]. Conclusions Recent advances in phylodynamics, bioinformatics, and visualization have demonstrated the potential of pipelines to support surveillance. One example is NextStrain which can perform real-time virus phylodynamics [10]. The system has recently been added as an app to the Global Initiative on Sharing Avian Influenza Data (GISAID) database for influenza tracking using DNA sequences [11]. This presentation will highlight a pipeline for virus phylogeography designed for epidemiologists who are not experts in bioinformatics but wish to leverage virus sequence data as part of routine surveillance. We will describe the development and implementation of our system, ZooPhy, and use real-world case studies to demonstrate its functionality. We invite ISDS delegates to use the system via our web portal, https://zodo.asu.edu/zoophy/ and provide feedback on system utilization. References 1. Scotch, M., et al., At the intersection of public-health informatics and bioinformatics: using advanced Web technologies for phylogeography. Epidemiology, 2010. 21(6), 764-768. 2. Gardy, J.L. and N.J. Loman, Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet, 2018. 19: p. 9-20. 3. Avise, J.C., Phylogeography : the history and formation of species. 2000, Cambridge, Mass.: Harvard University Press. 4. Suchard, M.A., et al., Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. 
Virus Evol, 2018. 4. 5. Bielejec, F., et al., SpreaD3: Interactive Visualization of Spatiotemporal History and Trait Evolutionary Processes. Mol Biol Evol, 2016. 33(8): p. 2167-9. 6. Benson, D.A., et al., GenBank. Nucleic Acids Res, 2018. 46: p. D41-D47. 7. Zhang, Y., et al., Influenza Research Database: An integrated bioinformatics resource for influenza virus research. Nucleic Acids Res, 2017. 45: p. D466-D474. 8. Katoh, K. and D.M. Standley, MAFFT: iterative refinement and additional methods. Methods Mol Biol, 2014. 1079: p. 131-46. 9. Dudas, G., et al., Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature, 2017. 544(7650): p. 309-315. 10. Hadfield, J., et al., Nextstrain: real-time tracking of pathogen evolution. Bioinformatics, 2018. 11. NextFlu. 2018; Available from: https://www.gisaid.org/epiflu-applications/nextflu-app/.
- Published
- 2019
- Full Text
- View/download PDF
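The ZooPhy abstract above mentions a feature to randomly down-sample the search results by a specified percentage or number. A minimal Python sketch of such a step is given here, assuming uniform sampling without replacement; the function name `down_sample` and its parameters are hypothetical and do not describe ZooPhy's actual implementation.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def down_sample(
    records: Sequence[T],
    *,
    percent: Optional[float] = None,
    count: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[T]:
    """Return a uniform random subset chosen by percentage OR by absolute number."""
    if (percent is None) == (count is None):
        raise ValueError("specify exactly one of percent or count")
    if percent is not None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        count = round(len(records) * percent / 100)
    # Clamp so that an over-large request simply returns every record.
    count = max(0, min(count, len(records)))
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(list(records), count)
```

For example, `down_sample(accessions, percent=50, seed=1)` would keep half of a list of accessions, while `down_sample(accessions, count=100)` would keep at most 100 of them.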
43. Proteomics Standards Initiative Extended FASTA Format (PEFF)
- Author
-
Peter R. Baker, Harald Barsnes, Jimmy K. Eng, Luis Mendoza, Yasset Perez-Riverol, Martin Eisenacher, Tim Van Den Bossche, Robert J. Chalkley, Jim Shofstahl, Sean L. Seymour, Luis Francisco Hernández Sánchez, Juan Antonio Vizcaíno, Gerhard Mayer, Pierre-Alain Binz, Andrew Collins, Eric W. Deutsch, Emanuele Alpi, Lydie Lane, Eugene A. Kapp, Gerben Menschaert, and Karl R. Clauser
- Subjects
0303 health sciences ,Information retrieval ,Proteomics Standards Initiative ,NeXtProt ,Computer science ,030302 biochemistry & molecular biology ,FASTA format ,Proteomics ,File format ,Proteogenomics ,Mass spectrometry ,Metadata ,03 medical and health sciences ,Validator ,UniProt ,030304 developmental biology - Abstract
Mass spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs), in biological samples. However, most workflows require that such variations be included in the search space used to analyze the data, and doing so remains challenging with most analysis tools. In order to facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI Extended FASTA Format (PEFF). PEFF is based on the very popular FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backwards compatible, and as such, existing FASTA parsers will require little or no changes to be able to read PEFF files as FASTA files, although without supporting any of the extra capabilities of PEFF. PEFF is defined by a full specification document, controlled vocabulary terms, a set of example files, software libraries, and a file validator. Popular software and resources are starting to support PEFF, including the sequence search engine Comet and the knowledge bases neXtProt and UniProtKB. Widespread implementation of PEFF is expected to further enable proteogenomics and top-down proteomics applications by providing a standardized mechanism for encoding protein sequences and their known variations. All the related documentation, including the detailed file format specification and example files, are available at http://www.psidev.info/peff.
- Published
- 2019
- Full Text
- View/download PDF
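The PEFF abstract above states that existing FASTA parsers need little or no change to read PEFF files as plain FASTA. The Python sketch below shows the kind of plain FASTA reader that statement refers to. Skipping lines that begin with `#` is an assumption added here so that file-level comment lines are not mistaken for sequence data; the PEFF-specific metadata in each header is not interpreted, and the specification cited in the abstract remains the authority on the format. The function name `read_fasta` is hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str], comment_prefix: str = "#") -> Dict[str, str]:
    """Map each identifier (first token after '>') to its concatenated sequence."""
    chunks: Dict[str, List[str]] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(comment_prefix):
            continue  # blank line or (assumed) file-level comment line
        if line.startswith(">"):
            # Everything after the first whitespace is treated as opaque metadata.
            fields = line[1:].split(None, 1)
            current_id = fields[0] if fields else ""
            if current_id in chunks:
                raise ValueError(f"duplicate identifier: {current_id!r}")
            chunks[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first header line")
        else:
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}
```

Because the reader keeps only the first token of each header, any extra per-entry annotation is ignored rather than rejected, which matches the abstract's caveat that a plain parser gains none of the extended capabilities.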
44. MyDGR: a server for identification and characterization of diversity-generating retroelements
- Author
-
Fatemeh Sharifi and Yuzhen Ye
- Subjects
Web server ,Information Storage and Retrieval ,Sequence alignment ,Bacterial genome size ,Computational biology ,Biology ,computer.software_genre ,Genome ,03 medical and health sciences ,Annotation ,0302 clinical medicine ,Genetics ,Humans ,Bacteriophages ,030304 developmental biology ,0303 health sciences ,Internet ,Bacteria ,Base Sequence ,Microbiota ,FASTA format ,Molecular Sequence Annotation ,RNA-Directed DNA Polymerase ,Archaea ,Identification (information) ,Genetic Loci ,Web Server Issue ,Candidate Disease Gene ,computer ,Sequence Alignment ,030217 neurology & neurosurgery ,Software - Abstract
MyDGR is a web server providing integrated prediction and visualization of Diversity-Generating Retroelements (DGR) systems in query nucleotide sequences. It is built upon an enhanced version of DGRscan, a tool we previously developed for identification of DGR systems. DGR systems are remarkable genetic elements that use error-prone reverse transcriptases to generate vast sequence variants in specific target genes, which have been shown to benefit their hosts (bacteria, archaea or phages). As the first web server for annotation of DGR systems, myDGR is freely available on the web at http://omics.informatics.indiana.edu/myDGR with all major browsers supported. MyDGR accepts query nucleotide sequences in FASTA format, and outputs all the important features of a predicted DGR system, including a reverse transcriptase, a template repeat and one (or more) variable repeats and their alignment featuring A-to-N (N can be C, T or G) substitutions, and VR-containing target gene(s). In addition to providing the results as text files for download, myDGR generates a visual summary of the results for users to explore the predicted DGR systems. Users can also directly access pre-calculated, putative DGR systems identified in currently available reference bacterial genomes and a few other collections of sequences (including human microbiomes).
- Published
- 2019
45. VaxiJen Dataset of Bacterial Immunogens: An Update
- Author
-
Irini Doytchinova, Nevena Zaharieva, Ivawn Dimitrov, and Darren R. Flower
- Subjects
Datasets as Topic ,Computational biology ,Biology ,01 natural sciences ,Epitope ,Set (abstract data type) ,Epitopes ,Bacterial Proteins ,Drug Discovery ,Data Mining ,Humans ,Total protein ,Antigens, Bacterial ,Training set ,Bacteria ,010405 organic chemistry ,Immunogenicity ,FASTA format ,Bacterial Infections ,General Medicine ,Fusion protein ,0104 chemical sciences ,010404 medicinal & biomolecular chemistry ,Bacterial Vaccines ,Molecular Medicine ,UniProt - Abstract
Background Identifying immunogenic proteins is the first stage in vaccine design and development. VaxiJen is the most widely used and highly cited server for immunogenicity prediction. As the developers of VaxiJen, we are obliged to update and improve it regularly. Here, we present an updated dataset of bacterial immunogens containing 317 experimentally proven immunogenic proteins of bacterial origin, of which 60% have been reported during the last 10 years. Methods PubMed was searched for papers containing data for novel immunogenic proteins tested on humans till March 2017. Corresponding protein sequences were collected from NCBI and UniProtKB. The set was curated manually for multiple protein fragments, isoforms, and duplicates. Results The final curated dataset consists of 306 immunogenic proteins tested on humans derived from 47 bacterial microorganisms. Certain proteins have several isoforms. All were considered, so the set contains 317 protein sequences in total. The updated set contains 206 new immunogens, compared to the previous VaxiJen bacterial dataset. The average number of immunogens per species is 6.7. The set also contains 12 fusion proteins and 41 peptide fragments and epitopes. The dataset includes the names of bacterial microorganisms, protein names, and protein sequences in FASTA format. Conclusion Currently, the updated VaxiJen bacterial dataset is the best known manually-curated compilation of bacterial immunogens. It is freely available at http://www.ddg-pharmfac.net/vaxijen/dataset. It can easily be downloaded, searched, and processed. When combined with an appropriate negative dataset, this update could also serve as a training set, allowing enhanced prediction of the potential immunogenicity of unknown protein sequences.
- Published
- 2019
- Full Text
- View/download PDF
46. Proteomics Standards Initiative Extended FASTA Format
- Author
-
Peter R. Baker, Martin Eisenacher, Eric W. Deutsch, Tim Van Den Bossche, Robert J. Chalkley, Jim Shofstahl, Juan Antonio Vizcaíno, Luis Francisco Hernández Sánchez, Karl R. Clauser, Lydie Lane, Andrew Collins, Eugene A. Kapp, Sean L. Seymour, Gerhard Mayer, Pierre-Alain Binz, Luis Mendoza, Jimmy K. Eng, Gerben Menschaert, Yasset Perez-Riverol, Harald Barsnes, and Emanuele Alpi
- Subjects
0301 basic medicine ,Proteomics ,Biochemistry & Molecular Biology ,Computer science ,Information Storage and Retrieval ,PEFF ,Biochemistry ,Mass Spectrometry ,Article ,Proteomics Standards Initiative ,03 medical and health sciences ,Controlled vocabulary ,Humans ,PSI ,FASTA ,ddc:616 ,Information retrieval ,030102 biochemistry & molecular biology ,NeXtProt ,file formats ,FASTA format ,General Chemistry ,Biological Sciences ,File format ,Metadata ,PASTA ,030104 developmental biology ,Validator ,proteogenomics ,Chemical Sciences ,standards ,Generic health relevance ,UniProt ,Software ,Biotechnology - Abstract
Mass-spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs) in biological samples. However, most workflows require that such variations be included in the search space used to analyze the data, and doing so remains challenging with most analysis tools. In order to facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI extended FASTA format (PEFF). PEFF is based on the very popular FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backward compatible, and as such, existing FASTA parsers will require little or no changes to be able to read PEFF files as FASTA files, although without supporting any of the extra capabilities of PEFF. PEFF is defined by a full specification document, controlled vocabulary terms, a set of example files, software libraries, and a file validator. Popular software and resources are starting to support PEFF, including the sequence search engine Comet and the knowledge bases neXtProt and UniProtKB. Widespread implementation of PEFF is expected to further enable proteogenomics and top-down proteomics applications by providing a standardized mechanism for encoding protein sequences and their known variations. All the related documentation, including the detailed file format specification and example files, are available at http://www.psidev.info/peff.
- Published
- 2019
47. Predicting Secondary Structure for Human Proteins Based on Chou-Fasman Method
- Author
-
Gerasimos Vonitsanos, Fotios Kounelis, Ioannis E. Livieris, Panagiotis Pintelas, and Andreas Kanavos
- Subjects
chemistry.chemical_classification ,0303 health sciences ,Computer science ,FASTA format ,Folding (DSP implementation) ,Protein structure prediction ,Amino acid ,03 medical and health sciences ,0302 clinical medicine ,Protein structure ,chemistry ,Chou–Fasman method ,Protein folding ,Protein secondary structure ,Algorithm ,030217 neurology & neurosurgery ,030304 developmental biology - Abstract
Proteins are built from varying numbers of amino acids and therefore differ in structure and folding, depending on chemical reactions and other factors. Protein folding prediction can help in many healthcare scenarios to foretell and prevent diseases. The different elements that form a protein give the secondary structure. One of the most common algorithms used for secondary structure prediction is the Chou-Fasman method. This technique analyses each amino acid with respect to three structural elements, α-helices, β-sheets and turns, based on already known protein structures. Its aim is to predict the probability with which each of these elements will be formed. In this paper, we have used the Chou-Fasman algorithm to extract these probabilities for a series of amino acids in FASTA format. We analyse all probabilities for a human protein of any length, without the restrictions of other existing tools.
- Published
- 2019
- Full Text
- View/download PDF
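The Chou-Fasman abstract above describes scoring each amino acid for helices, sheets and turns. One ingredient of that family of methods, averaging per-residue propensities over a sliding window, can be sketched in Python as follows. This is a simplified illustration, not the authors' program and not the full method (nucleation and extension rules are omitted). The propensity table is passed in by the caller, and the names `window_propensities` and `best_state` are hypothetical.

```python
from typing import Dict, List, Tuple

# residue -> (helix, sheet, turn) propensity; values must be supplied by the caller
PropensityTable = Dict[str, Tuple[float, float, float]]
STATES = ("helix", "sheet", "turn")


def window_propensities(
    sequence: str, table: PropensityTable, window: int = 4
) -> List[Tuple[float, float, float]]:
    """Average the three propensities over every window of the given length."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    averages = []
    for start in range(len(sequence) - window + 1):
        residues = sequence[start:start + window]
        averages.append(
            tuple(sum(table[r][k] for r in residues) / window for k in range(3))
        )
    return averages


def best_state(averages: Tuple[float, float, float]) -> str:
    """Name of the state with the highest average (ties go to the earlier state)."""
    return STATES[max(range(3), key=lambda k: averages[k])]
```

A caller would load the sequence from a FASTA file, supply a published propensity table, and then apply `best_state` to each window; residues missing from the table raise `KeyError` rather than being silently skipped.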
48. Jasmine: a Java pipeline for isomiR characterization in miRNA-Seq data
- Author
-
Xiangfu Zhong, Simon Rayner, and Albert Pla
- Subjects
Statistics and Probability ,Gene isoform ,FASTQ format ,Java ,Computer science ,FASTA format ,Computational biology ,Biochemistry ,Pipeline (software) ,Computer Science Applications ,Applications Note ,Computational Mathematics ,Identification (information) ,IsomiR ,Computational Theory and Mathematics ,microRNA ,Sequence Analysis ,Molecular Biology ,computer ,computer.programming_language - Abstract
Motivation The existence of complex subpopulations of miRNA isoforms, or isomiRs, is well established. While many tools exist for investigating isomiR populations, they differ in how they characterize an isomiR, making it difficult to compare results across different tools. Thus, there is a need for a more comprehensive and systematic standard for defining isomiRs. Such a standard would allow investigation of isomiR population structure in progressively more refined sub-populations, permitting the identification of more subtle changes between conditions and leading to an improved understanding of the processes that generate these differences. Results We developed Jasmine, a software tool that incorporates a hierarchal framework for characterizing isomiR populations. Jasmine is a Java application that can process raw read data in fastq/fasta format, or mapped reads in SAM format to produce a detailed characterization of isomiR populations. Thus, Jasmine can reveal structure not apparent in a standard miRNA-Seq analysis pipeline. Availability and implementation Jasmine is implemented in Java and R and freely available at bitbucket https://bitbucket.org/bipous/jasmine/src/master/. Contact simon.rayner@medisin.uio.no Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2019
49. An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema
- Author
-
Gunter Saake, Roman Zoun, Xiao Chen, Dirk Benndorf, David Broneske, Robert Heyer, Ivayla Trifonova, and Kay Schallert
- Subjects
radix tree ,lcsh:T55.4-60.8 ,Computer science ,Radix tree ,sequence data ,computer.software_genre ,01 natural sciences ,Column (database) ,lcsh:QA75.5-76.95 ,Theoretical Computer Science ,03 medical and health sciences ,proteomics ,Trie ,lcsh:Industrial engineering. Management engineering ,mass spectrometry ,030304 developmental biology ,Database engine ,0303 health sciences ,Numerical Analysis ,Database ,010401 analytical chemistry ,FASTA format ,trie ,Data structure ,0104 chemical sciences ,Schema (genetic algorithms) ,Computational Mathematics ,Workflow ,Computational Theory and Mathematics ,storage system ,lcsh:Electronic computers. Computer science ,computer - Abstract
Mass spectrometers enable the identification of proteins in biological samples, leading to biomarkers for biological process parameters and diseases. However, bioinformatic evaluation of the mass spectrometer data needs a standardized workflow and a system that stores the protein sequences. Due to their standardization and maturity, relational systems are a great fit for storing protein sequences. Hence, in this work, we present a schema for distributed column-based database management systems using a column-oriented index to store sequence data. In order to achieve a high storage performance, it was necessary to choose a well-performing strategy for transforming the protein sequence data from the FASTA format to the new schema. Therefore, we applied an in-memory map, HDDmap, database engine, and extended radix tree and evaluated their performance. The results show that our proposed extended radix tree performs best regarding memory consumption and runtime. Hence, the radix tree is a suitable data structure for transforming protein sequences into the indexed schema.
- Published
- 2021
- Full Text
- View/download PDF
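The abstract above reports that an extended radix tree performed best for transforming protein sequences. For orientation, a plain radix tree (a prefix tree whose edges carry whole substrings) can be sketched in Python as below. The abstract does not describe the authors' extension, so none is attempted here; the class names `RadixNode` and `RadixTree` are hypothetical.

```python
from typing import Dict


class RadixNode:
    def __init__(self) -> None:
        self.children: Dict[str, "RadixNode"] = {}  # edge label -> child node
        self.terminal = False  # True if a stored sequence ends at this node


def _common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class RadixTree:
    def __init__(self) -> None:
        self.root = RadixNode()

    def insert(self, word: str) -> None:
        node = self.root
        while True:
            if not word:
                node.terminal = True
                return
            for label in list(node.children):
                k = _common_prefix_len(label, word)
                if k == 0:
                    continue
                child = node.children[label]
                if k < len(label):
                    # Split the edge at the point where label and word diverge.
                    mid = RadixNode()
                    mid.children[label[k:]] = child
                    del node.children[label]
                    node.children[label[:k]] = mid
                    child = mid
                node, word = child, word[k:]
                break
            else:
                # No edge shares a prefix with the rest of the word: add a leaf.
                leaf = RadixNode()
                leaf.terminal = True
                node.children[word] = leaf
                return

    def __contains__(self, word: str) -> bool:
        node = self.root
        while word:
            for label, child in node.children.items():
                if word.startswith(label):
                    node, word = child, word[len(label):]
                    break
            else:
                return False
        return node.terminal
```

Sequences that share a prefix share the corresponding edges, so the common prefix is stored once in the tree.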
50. Digital data for quick response (QR) codes of alkalophilic Bacillus pumilus to identify and to compare bacilli isolated from Lonar Crator Lake, India
- Author
-
Bhagwan N. Rekadwad and Chandrahasya N. Khobragade
- Subjects
0106 biological sciences ,0301 basic medicine ,Bacilli ,Computational biology ,lcsh:Computer applications to medicine. Medical informatics ,010603 evolutionary biology ,01 natural sciences ,03 medical and health sciences ,Alkalophiles ,Gene bank ,Alkaline environment ,Bacillus signatures ,lcsh:Science (General) ,Lonar Crator Lake ,Data Article ,Multidisciplinary ,biology ,Bacillus pumilus ,business.industry ,fungi ,FASTA format ,biology.organism_classification ,16S ribosomal RNA ,Soda Lake ,Biotechnology ,030104 developmental biology ,lcsh:R858-859.7 ,business ,lcsh:Q1-390 - Abstract
Microbiologists are routinely engaged in the isolation, identification and comparison of isolated bacteria for their novelty. 16S rRNA sequences of Bacillus pumilus were retrieved from the NCBI repository, and QR codes were generated for the sequences (FASTA format and full Gene Bank information). 16S rRNA sequences were used to generate quick response (QR) codes of Bacillus pumilus isolated from Lonar Crator Lake (19° 58′ N; 76° 31′ E), India. Bacillus pumilus 16S rRNA gene sequences were also used to generate CGR, FCGR and PCA, which can be used for visual comparison and evaluation, respectively. The hyperlinked QR codes, CGR, FCGR and PCA of all the isolates are made available to users on a portal, https://sites.google.com/site/bhagwanrekadwad/. The generated digital data help to evaluate and compare any Bacillus pumilus strain, minimize laboratory effort and avoid misinterpretation of the species. Keywords: Alkalophiles, Alkaline environment, Bacillus signatures, Lonar Crator Lake, Soda Lake
- Published
- 2016
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library