685 results on '"Sequence logo"'
Search Results
52. Evolutionary and experimental analyses of inorganic phosphate transporter PiT family reveals two related signature sequences harboring highly conserved aspartic acids critical for sodium-dependent phosphate transport function of human PiT2.
- Author
- Bøttger, Pernille and Pedersen, Lene
- Subjects
- *PROTEINS, *BIOLOGICAL transport, *PHOSPHATES, *ASPARTIC acid, *SODIUM, *CATIONS
- Abstract
The mammalian members of the inorganic phosphate (Pi) transporter (PiT) family, the type III sodium-dependent phosphate (NaPi) transporters PiT1 and PiT2, have been assigned housekeeping Pi transport functions and are suggested to be involved in chondroblastic and osteoblastic mineralization and ectopic calcification. The PiT family members are conserved throughout all kingdoms and use either sodium (Na+) or proton (H+) gradients to transport Pi. Sequence logo analyses revealed that independent of their cation dependency these proteins harbor conserved signature sequences in their N- and C-terminal ends with the common core consensus sequence GANDVANA. With the exception of 10 proteins from extremophiles all 109 proteins analyzed carry an aspartic acid in one or both of the signature sequences. We changed either of the highly conserved aspartates, Asp28 and Asp506, in the N- and C-terminal signature sequences, respectively, of human PiT2 to asparagine and analyzed Pi uptake function in Xenopus laevis oocytes. Both mutant proteins were expressed at the cell surface of the oocytes but exhibited knocked out NaPi transport function. Human PiT2 is also a retroviral receptor and we have previously shown that this function can be exploited as a control for proper processing and folding of mutant proteins. Both mutant transporters displayed wild-type receptor functions implying that their overall architecture is undisturbed. Thus the presence of an aspartic acid in either of the PiT family signature sequences is critical for the Na+-dependent Pi transport function of human PiT2. The conservation of the aspartates among proteins using either Na+- or H+-gradients for Pi transport suggests that they are involved in H+-dependent Pi transport as well. 
Current results favor a membrane topology model in which the N- and C-terminal PiT family signature sequences are positioned in intra- and extracellular loops, respectively, suggesting that they are involved in related functions on either side of the membrane. The present data are in agreement with a possible role of the signature sequences in translocation of cations. [ABSTRACT FROM AUTHOR]
- Published
- 2005
53. Dynalogo: an interactive sequence logo with dynamic thresholding of matched quantitative proteomic data
- Author
- Adam T. Lafontaine, Kazuya Machida, and Bruce J. Mayer
- Subjects
- Statistics and Probability, Proteomics, Source code, Computer science, media_common.quotation_subject, Quantitative proteomics, computer.software_genre, Biochemistry, Data file, Position-Specific Scoring Matrices, Molecular Biology, media_common, chemistry.chemical_classification, Generator (computer programming), Computers, Thresholding, Computer Science Applications, Amino acid, Visualization, Computational Mathematics, Sequence logo, Applications Note, ComputingMethodologies_PATTERNRECOGNITION, Computational Theory and Mathematics, chemistry, Programming Languages, Data mining, computer, Software
- Abstract
Summary: Current web-based sequence logo analyses for studying domain–peptide interactions are often conducted only on high affinity binders due to conservative data thresholding. We have developed Dynalogo, a combination of threshold varying tool and sequence logo generator written in the R statistical programming language, which allows on-the-fly visualization of binding specificity over a wide range of affinity interactions. Hence researchers can easily explore their dataset without the constraint of an arbitrary threshold. After importing quantitative data files, there are various data filtering and visualizing features available. Using a threshold control, users can easily track the dynamic change of enrichment and depletion of amino acid characters in the sequence logo panel. The built-in export function allows downloading filtered data and graphical outputs for further analyses. Dynalogo is optimized for analysis of modular domain–peptide binding experiments but the platform offers a broader application including quantitative proteomics. Availability and implementation: Dynalogo application, user manual and sample data files are available at https://dynalogo.cam.uchc.edu. The source code is available at https://github.com/lafontaine-uchc/dynalogo. Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2019
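The thresholding idea described in the Dynalogo abstract (entry 53) can be illustrated with a minimal Python sketch. This is not Dynalogo's code (which is written in R); the function names, the toy peptides and the "score >= threshold" rule are assumptions made purely for illustration. It filters scored peptides by a threshold and tabulates the per-position residue frequencies from which a logo would be drawn.

```python
from collections import Counter


def position_frequencies(peptides):
    """Return one {residue: frequency} dict per alignment column."""
    if not peptides:
        return []
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    columns = []
    for i in range(length):
        counts = Counter(p[i] for p in peptides)
        total = sum(counts.values())
        columns.append({aa: n / total for aa, n in counts.items()})
    return columns


def logo_at_threshold(scored_peptides, threshold):
    """Keep peptides with score >= threshold (assumed rule), then tabulate."""
    kept = [pep for pep, score in scored_peptides if score >= threshold]
    return position_frequencies(kept)


# Hypothetical toy data: (peptide, quantitative binding signal).
data = [("PYEL", 0.9), ("PYDL", 0.7), ("AYEV", 0.4), ("PFEL", 0.2)]
strict = logo_at_threshold(data, 0.6)  # high-affinity binders only
loose = logo_at_threshold(data, 0.1)   # all binders
```

Lowering the threshold from 0.6 to 0.1 admits the weaker binders, so the first column is no longer exclusively P; this is the kind of change a threshold control would let a user follow.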
54. Motif and Regulatory Sequence Analysis
- Author
- Ju Han Kim
- Subjects
- Regulation of gene expression, Sequence logo, Phylogenetic tree, Regulatory sequence, Sequence alignment, Computational biology, Genome browser, Biology, Sequence motif, Transcription factor
- Abstract
In this chapter, we will learn and practice (1) types and methods of sequence alignment; (2) sequence motif searching, sequence logo creation, and phylogenetic tree analysis; (3) prediction of transcription factor and microRNA (miRNA) binding sites involved in gene regulation; and (4) visualization and exploration of sequence annotations using a genome browser.
- Published
- 2019
55. DUC-Curve, a highly compact 2D graphical representation of DNA sequences and its application in sequence alignment
- Author
- Qian Liu, Xiaoqi Zheng, and Yushuang Li
- Subjects
- 0301 basic medicine, Statistics and Probability, chemistry.chemical_classification, Multiple sequence alignment, Computer science, Sequence analysis, Sequence alignment, Computational biology, Condensed Matter Physics, DNA sequencing, Combinatorics, 03 medical and health sciences, Sequence logo, Exon, 030104 developmental biology, Linguistic sequence complexity, chemistry, k-mer, Consensus sequence, Microsatellite, Nucleotide, Alignment-free sequence analysis, Sequence (medicine)
- Abstract
A highly compact and simple 2D graphical representation of DNA sequences, named DUC-Curve, is constructed through mapping four nucleotides to a unit circle with a cyclic order. DUC-Curve could directly detect nucleotide, di-nucleotide compositions and microsatellite structure from DNA sequences. Moreover, it also could be used for DNA sequence alignment. Taking geometric center vectors of DUC-Curves as sequence descriptor, we perform similarity analysis on the first exons of β-globin genes of 11 species, oncogene TP53 of 27 species and twenty-four Influenza A viruses, respectively. The obtained reasonable results illustrate that the proposed method is very effective in sequence comparison problems, and will at least play a complementary role in classification and clustering problems.
- Published
- 2016
56. Dali server update
- Author
- Liisa Holm and Laura M. Laakso (Biosciences; Computational genomics; Genetics; Institute of Biotechnology; Bioinformatics)
- Subjects
- 0301 basic medicine, Models, Molecular, ENZYME, SCOP, Structural alignment, Sequence alignment, Biology, Bioinformatics, CLASSIFICATION, Protein Structure, Secondary, Amidohydrolases, Set (abstract data type), 03 medical and health sciences, User-Computer Interface, Imaging, Three-Dimensional, HOMOLOGY, Similarity (network science), Protein Domains, Sequence Analysis, Protein, Databases, Genetic, Genetics, Computer Graphics, TOOL, Web Server issue, Humans, Amino Acid Sequence, Phylogeny, Internet, Information retrieval, 030102 biochemistry & molecular biology, SUPERFAMILY, computer.file_format, Protein Data Bank, Sequence logo, Projection (relational algebra), 030104 developmental biology, Structural Homology, Protein, 1182 Biochemistry, cell and molecular biology, UniProt, computer, Sequence Alignment, FUNCTIONAL ANNOTATION, Algorithms, PROTEIN-STRUCTURE
- Abstract
The Dali server (http://ekhidna2.biocenter.helsinki.fi/dali) is a network service for comparing protein structures in 3D. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. The Dali server has been running in various places for over 20 years and is used routinely by crystallographers on newly solved structures. The latest update of the server provides enhanced analytics for the study of sequence and structure conservation. The server performs three types of structure comparisons: (i) Protein Data Bank (PDB) search compares one query structure against those in the PDB and returns a list of similar structures; (ii) pairwise comparison compares one query structure against a list of structures specified by the user; and (iii) all against all structure comparison returns a structural similarity matrix, a dendrogram and a multidimensional scaling projection of a set of structures specified by the user. Structural superimpositions are visualized using the Java-free WebGL viewer PV. The structural alignment view is enhanced by sequence similarity searches against Uniprot. The combined structure-sequence alignment information is compressed to a stack of aligned sequence logos. In the stack, each structure is structurally aligned to the query protein and represented by a sequence logo.
- Published
- 2016
57. Structural analysis of key gap junction domains—Lessons from genome data and disease-linked mutants
- Author
- Donglin Bai
- Subjects
- 0301 basic medicine, Mutant, Connexin, Biology, Random hexamer, Genome, Connexins, Homology (biology), 03 medical and health sciences, 0302 clinical medicine, Protein Domains, otorhinolaryngologic diseases, Animals, Humans, Disease, Amino Acid Sequence, Gene, Genetics, Gap junction, Gap Junctions, Cell Biology, Sequence logo, 030104 developmental biology, Mutation, sense organs, 030217 neurology & neurosurgery, Developmental Biology
- Abstract
A gap junction (GJ) channel is formed by docking of two GJ hemichannels and each of these hemichannels is a hexamer of connexins. All connexin genes have been identified in human, mouse, and rat genomes and their homologous genes in many other vertebrates are available in public databases. The protein sequences of these connexins align well with high sequence identity in the same connexin across different species. Domains in closely related connexins and several residues in all known connexins are also well-conserved. These conserved residues form signatures (also known as sequence logos) in these domains and are likely to play important biological functions. In this review, the sequence logos of individual connexins, groups of connexins with common ancestors, and all connexins are analyzed to visualize natural evolutionary variations and the hot spots for human disease-linked mutations. Several gap junction domains are homologous, likely forming similar structures essential for their function. The availability of a high resolution Cx26 GJ structure and the subsequently-derived homology structure models for other connexin GJ channels elevated our understanding of sequence logos at the three-dimensional GJ structure level, thus facilitating the understanding of how disease-linked connexin mutants might impair GJ structure and function. This knowledge will enable the design of complementary variants to rescue disease-linked mutants.
- Published
- 2016
58. Simple yet functional phosphate-loop proteins
- Author
- Yu-Ru Lin, Alon Wellner, Dan S. Tawfik, Fanindra Kumar-Deshmukh, Igor N. Berezovsky, Gabriele Varani, David Baker, Alexander Goncearenco, Fan Yang, Maria Luisa Romero Romero, Wen Yang, Agnes Toth-Petroczy, and Michal Sharon
- Subjects
- Models, Molecular, 0301 basic medicine, Protein Conformation, Polynucleotides, Sequence alignment, RNA-binding protein, Plasma protein binding, 010402 general chemistry, 01 natural sciences, Phosphates, Evolution, Molecular, 03 medical and health sciences, Adenosine Triphosphate, Protein structure, Catalytic Domain, Magnesium, Protein Interaction Domains and Motifs, Amino Acid Sequence, Binding site, Peptide sequence, Phylogeny, Binding Sites, Multidisciplinary, Sequence Homology, Amino Acid, Chemistry, Walker motifs, Proteins, RNA-Binding Proteins, DNA, Nucleoside-Triphosphatase, 0104 chemical sciences, Sequence logo, 030104 developmental biology, PNAS Plus, Biochemistry, Mutation, RNA, Sequence Alignment, Protein Binding
- Abstract
Abundant and essential motifs, such as phosphate-binding loops (P-loops), are presumed to be the seeds of modern enzymes. The Walker-A P-loop is absolutely essential in modern NTPase enzymes, in mediating binding, and transfer of the terminal phosphate groups of NTPs. However, NTPase function depends on many additional active-site residues placed throughout the protein’s scaffold. Can motifs such as P-loops confer function in a simpler context? We applied a phylogenetic analysis that yielded a sequence logo of the putative ancestral Walker-A P-loop element: a β-strand connected to an α-helix via the P-loop. Computational design incorporated this element into de novo designed β-α repeat proteins with relatively few sequence modifications. We obtained soluble, stable proteins that unlike modern P-loop NTPases bound ATP in a magnesium-independent manner. Foremost, these simple P-loop proteins avidly bound polynucleotides, RNA, and single-strand DNA, and mutations in the P-loop’s key residues abolished binding. Binding appears to be facilitated by the structural plasticity of these proteins, including quaternary structure polymorphism that promotes a combined action of multiple P-loops. Accordingly, oligomerization enabled a 55-aa protein carrying a single P-loop to confer avid polynucleotide binding. Overall, our results show that the P-loop Walker-A motif can be implemented in small and simple β-α repeat proteins, primarily as a polynucleotide binding motif.
- Published
- 2018
59. MISTIC2: Comprehensive server to study coevolution in protein families
- Author
- Franco L. Simonetti, Javier Iserte, Cristina Marino-Buslje, and Eloy A Colell
- Subjects
- 0301 basic medicine, Protein family, Protein Conformation, Protein-protein interactions, Biology, computer.software_genre, purl.org/becyt/ford/1 [https], 03 medical and health sciences, Software, User experience design, Sequence Analysis, Protein, Genetics, Covariation, Coevolution, Internet, Information retrieval, business.industry, Computational Biology, Proteins, purl.org/becyt/ford/1.2 [https], Visualization, Sequence logo, 030104 developmental biology, Ciencias de la Computación e Información, Mutation, Web Server Issue, The Internet, Web service, business, Sequence Alignment, computer, Ciencias de la Información y Bioinformática, CIENCIAS NATURALES Y EXACTAS
- Abstract
Correlated mutations between residue pairs in evolutionarily related proteins arise from constraints needed to maintain a functional and stable protein. Identifying these inter-related positions narrows down the search for structurally or functionally important sites. MISTIC is a server designed to help users calculate covariation in protein families and to provide them with an interactive tool to visualize the results. Here, we present MISTIC2, an update to the previous server, which allows users to calculate covariation with four methods (MIp, mfDCA, plmDCA and gaussianDCA). The results visualization framework has been reworked for improved performance, compatibility and user experience. It includes a circos representation of the information contained in the alignment, an interactive covariation network, a 3D structure viewer and a sequence logo. Other components provide additional information such as residue annotations, a ROC curve for assessing contact prediction, data tables and different ways of filtering the data and exporting figures. Different methods are easily compared, and scores can also be combined. A newly implemented web service allows users to access MISTIC2 programmatically using an API to calculate covariation and retrieve results. MISTIC2 is available at: https://mistic2.leloir.org.ar.
Affiliation (Colell, Eloy A.; Iserte, Javier Alonso; Simonetti, Franco Lucio; Marino Buslje, Cristina): Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina.
- Published
- 2018
60. A new sequence logo plot to highlight enrichment and depletion
- Author
- Kushal K. Dey, Dongyue Xie, and Matthew Stephens
- Subjects
- 0301 basic medicine, genetic structures, Computer science, EDLogo, medicine.disease_cause, Biochemistry, chemistry.chemical_compound, 0302 clinical medicine, Protein sequencing, Structural Biology, lcsh:QH301-705.5, chemistry.chemical_classification, 0303 health sciences, Mutation, Applied Mathematics, Computer Science Applications, Amino acid, lcsh:R858-859.7, DNA microarray, Sequence motif, Logo plots, Research Article, Logo, Enrichment depletion, lcsh:Computer applications to medicine. Medical informatics, DNA sequencing, Plot (graphics), 03 medical and health sciences, medicine, Humans, Amino Acid Sequence, Molecular Biology, 030304 developmental biology, String symbols, Base Sequence, business.industry, RNA, Pattern recognition, Bayes Theorem, DNA, DNA binding site, Sequence logo, 030104 developmental biology, lcsh:Biology (General), chemistry, Artificial intelligence, business, Sequence Alignment, 030217 neurology & neurosurgery, Software
- Abstract
Background: Sequence logo plots have become a standard graphical tool for visualizing sequence motifs in DNA, RNA or protein sequences. However standard logo plots primarily highlight enrichment of symbols, and may fail to highlight interesting depletions. Current alternatives that try to highlight depletion often produce visually cluttered logos. Results: We introduce a new sequence logo plot, the EDLogo plot, that highlights both enrichment and depletion, while minimizing visual clutter. We provide an easy-to-use and highly customizable R package Logolas to produce a range of logo plots, including EDLogo plots. This software also allows elements in the logo plot to be strings of characters, rather than a single character, extending the range of applications beyond the usual DNA, RNA or protein sequences. And the software includes new Empirical Bayes methods to stabilize estimates of enrichment and depletion, and thus better highlight the most significant patterns in data. We illustrate our methods and software on applications to transcription factor binding site motifs, protein sequence alignments and cancer mutation signature profiles. Conclusions: Our new EDLogo plots and flexible software implementation can help data analysts visualize both enrichment and depletion of characters (DNA sequence bases, amino acids, etc.) across a wide range of applications. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2489-3) contains supplementary material, which is available to authorized users.
- Published
- 2018
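The abstract of entry 60 does not give the EDLogo scoring formula, so the following Python sketch shows only one simple way to obtain signed enrichment/depletion scores for a single logo position: a log-ratio of observed to background frequency, centred on the median across symbols. The function name, the pseudocount and the median centring are assumptions for illustration and are not claimed to match what the Logolas package computes.

```python
import math
from statistics import median


def enrichment_scores(freqs, background, pseudo=1e-3):
    """Signed score per symbol: > 0 enriched, < 0 depleted (assumed scheme)."""
    ratios = {
        s: math.log2((freqs.get(s, 0.0) + pseudo) / (background[s] + pseudo))
        for s in background
    }
    center = median(ratios.values())  # centring keeps typical symbols near 0
    return {s: r - center for s, r in ratios.items()}


# Hypothetical column in which T is strongly depleted.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
column = {"A": 0.32, "C": 0.32, "G": 0.32, "T": 0.04}
scores = enrichment_scores(column, uniform)
```

In this toy column A, C and G score approximately zero while T receives a clearly negative score, so a plot drawn from these values would show the depletion that a purely enrichment-based logo tends to hide.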
61. Evolutionary analysis of the protein-DNA interaction
- Author
- Aptekmann, Ariel Alejandro and Nadra, Alejandro Daniel
- Subjects
- PROTEIN, DNA, SEQUENCE LOGO, EXTREMOPHILES, MOTIFS, INFORMATION CONTENT, REGULAR EXPRESSIONS, ARCHAEA
- Abstract
At the functional entity "protein-DNA interface", protein and DNA condition each other in a coevolutionary process. Because multiple molecular recognizers must coexist in the same genome and remain functional without interfering excessively with one another, each recognizer is required to have specificity. They must also maintain their functions over a wide range of physicochemical conditions. In addition, once a regulatory network has been established, there is an inertial effect that restricts its capacity to be modified, since more interacting parts would have to change. In this work we study the conditions, mechanisms and processes that determine this phenomenon. This thesis can be divided into seven stages, each corresponding to a chapter. In the first chapter we begin by systematically collecting information on the living conditions of extremophiles, especially Archaea. In the second chapter, for those organisms with a sequenced and annotated genome, we calculate the composition of different regions of the genome, showing that the apparent correlation between G+C content and optimal growth temperature is due to a bias in the data used historically. In the third chapter we study the information content of the promoters in different genomes, finding a deviation from what is expected according to molecular information theory, and propose possible explanations for this deviation. In the fourth chapter a similar analysis is applied to the comparison of promoters of nuclear and mitochondrial genomes, between which there is, in a sustained manner and in the same organism, a temperature difference. The fifth chapter identifies and characterizes functional motifs at the genome level related to the regulation of a transcription factor involved in root growth in plants (RSL4). In the sixth chapter, using regular expressions as models of binding sites, we study the coevolution of motifs in sequence space, showing how the size of the alphabet affects the number of optimal discriminating positions and how natural systems tend to be optimized under the influence of this parameter. In summary, through bibliographic analysis and the use of modern bioinformatics tools, we studied the "protein-DNA interaction" system, considering biophysical and evolutionary restrictions. This analysis has allowed us to reinforce previous hypotheses as well as to find novel results. This thesis has required the application of theorems and the development of algorithms, which are presented as an appendix. Affiliation: Aptekmann, Ariel Alejandro. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales; Argentina.
- Published
- 2018
62. Structure memes: Intuitive visualization of sequence logo and subfamily logo information in a 3D protein-structural context.
- Author
- Beitz E
- Subjects
- Amino Acid Sequence, Sequence Alignment, Proteins chemistry, Software
- Abstract
The number of available protein sequences covering virtually all known species is tremendous and ever growing due to the feasibility of the underlying nucleotide sequencing. The speed at which protein structures are being determined is increasing, and as a result of refined cryo-electron microscopy the proportion of solved membrane protein folds is expanding. Sequence data are used to illustrate evolution and to group proteins into families with various levels of subfamilies. Structure data of prototypical proteins provide insight into function brought about by an interplay of specific amino acid residues that are dispersed throughout the sequence. Visually combining rich sequence information with structure data in an intuitively comprehensible way would enhance the process of elucidating key protein aspects regarding evolution, sequence relations, and function. Here, a method is described that projects the information contained in sequence logos and subfamily logos onto protein structures. The amino acid composition at a site is encoded by a mix color in the red-yellow-blue space and the information content is presented by the radius of a sphere at the α-carbon position. The resulting display is termed "structure meme." The underlying sequence and atom coordinate data are retained in the file for simple retrieval on demand using a molecular structure visualization program. Structure memes are recognizable and convey extensive information in a human-discernable way that requires little training. (© 2021 The Author. Proteins: Structure, Function, and Bioinformatics published by Wiley Periodicals LLC.)
- Published
- 2021
63. Promoter Sequence Analysis through No Gap Multiple Sequence Alignment of Motif Pairs
- Author
- Kouser and Lalitha Rangarajan
- Subjects
- Multiple sequence alignment, Sequence analysis, Computer science, Alignment score, Structural alignment, Promoter sequences, Sequence comparison, Sequence alignment, Computational biology, computer.software_genre, Similarity, Homology (biology), Sequence logo, Consensus sequence, General Earth and Planetary Sciences, Data mining, Sequence motif, computer, Alignment-free sequence analysis, General Environmental Science
- Abstract
Advances in sequencing technology have led to an exponential increase in biological sequence data. Methods and techniques that analyze these data are therefore in demand. A central task in the analysis of these data is sequence alignment. In this work, we present a new multiple sequence alignment method for analyzing the similarity/homology existing in promoter sequences. We extract the motif pair feature from the binarized position specific motif matrix (PSMM) of each promoter pair or set. We then compare the count of motif pairs between the promoter sequences to find the similarity. The efficacy of the proposed method is tested on two datasets obtained from NCBI. The results obtained agree with our understanding of organism similarity.
- Published
- 2015
64. HMM Logos for visualization of protein families
- Author
- Schultz Jörg, Schuster-Böckler Benjamin, and Rahmann Sven
- Subjects
- Hidden Markov Model, Sequence Logo, HMM Logo, profile, information content, hitting probability, dynamic programming, small GTPases, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background: Profile Hidden Markov Models (pHMMs) are a widely used tool for protein family research. Up to now, however, there exists no method to visualize all of their central aspects graphically in an intuitively understandable way. Results: We present a visualization method that incorporates both emission and transition probabilities of the pHMM, thus extending sequence logos introduced by Schneider and Stephens. For each emitting state of the pHMM, we display a stack of letters. The stack height is determined by the deviation of the position's letter emission frequencies from the background frequencies. The stack width visualizes both the probability of reaching the state (the hitting probability) and the expected number of letters the state emits during a pass through the model (the state's expected contribution). A web interface offering online creation of HMM Logos and the corresponding source code can be found at the Logos web server of the Max Planck Institute for Molecular Genetics http://logos.molgen.mpg.de. Conclusions: We demonstrate that HMM Logos can be a useful tool for the biologist: We use them to highlight differences between two homologous subfamilies of GTPases, Rab and Ras, and we show that they are able to indicate structural elements of Ras.
- Published
- 2004
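The abstract of entry 64 states that stack height reflects the deviation of emission frequencies from background frequencies. A common way to quantify such a deviation is the relative entropy (Kullback-Leibler divergence) in bits; the Python sketch below assumes that measure, and also assumes that each letter's height is its emission probability times the stack height, as in classic sequence logos. Neither choice is confirmed by the abstract, and the function names are hypothetical.

```python
import math


def stack_height(emission, background):
    """Relative entropy, in bits, of emission probabilities vs. background."""
    return sum(
        p * math.log2(p / background[s])
        for s, p in emission.items()
        if p > 0  # zero-probability letters contribute nothing
    )


def letter_heights(emission, background):
    """Split the stack among letters in proportion to emission probability."""
    h = stack_height(emission, background)
    return {s: p * h for s, p in emission.items()}


# Hypothetical DNA example with a uniform background.
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = stack_height({"A": 1.0}, bg)             # fully conserved state
split = letter_heights({"A": 0.5, "C": 0.5}, bg)     # two equally likely letters
```

A state whose emissions equal the background gets a height of zero, so only positions that deviate from background stand out in the plot.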
65. The Menticide Sequence
- Author
- Andrew Hammerand and Bucky Miller
- Subjects
- Sequence logo, Multiple sequence alignment, Computer science, Consensus sequence, Computational biology, Sequence motif, Sequence (medicine)
- Abstract
This chapter consists of a photo-essay comprising 13 photographs exploring surveillance and paranoia in everyday life.
- Published
- 2017
66. An alignment method for nucleic acid sequences against annotated genomes
- Author
-
Deforche K
- Subjects
0303 health sciences ,Theoretical computer science ,Multiple sequence alignment ,Sequence database ,030306 microbiology ,Sequence analysis ,Structural alignment ,Sequence alignment ,Computational biology ,Biology ,03 medical and health sciences ,Sequence logo ,Consensus sequence ,Alignment-free sequence analysis ,030304 developmental biology - Abstract
Motivation: Alignment of biological sequences is fundamental to their further interpretation. Current alignment algorithms typically align either nucleic acid or amino acid sequences. Using only nucleic acid sequence similarity, divergent sequences cannot be aligned reliably because of the limited alphabet and genetic saturation. To align divergent coding nucleic acid sequences, one can align using the translated amino acid sequences. This requires the detection of the correct open reading frame, is prone to frame shift errors, and typically requires the treatment of genes separately. It was our motivation to design a nucleic acid sequence alignment algorithm to align a nucleic acid sequence against a (reference) genome sequence, that works equally well for similar and divergent sequences, and produces an optimal alignment considering simultaneously the alignment of all annotated coding sequences. Results: We define a genome alignment score for evaluating the quality of an alignment of a nucleic acid query sequence against a reference genome sequence, for which coding sequence features have been annotated (for example in a GenBank record). The genome alignment score combines the affine gap score for the nucleic acid sequence with an affine gap score for all amino acid alignments resulting from coding sequences in open reading frames contained within the query sequence. We present a Dynamic Programming algorithm to compute the optimal global or local alignment using this genomic alignment score and provide a formal proof of correctness. This algorithm allows the alignment of nucleic acid sequences from closely related and highly divergent sequences within the same software and using the same parameters, automatically correcting any frame shift errors, and produces at the same time the aligned translated amino acid sequences of all relevant coding sequence features. Availability: The software is available as a web application at http://www.genomedetective.com/app/aga and as command-line application at https://github.com/emweb/aga
- Published
- 2017
- Full Text
- View/download PDF
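The affine gap score named in the abstract of entry 66 can be illustrated with a minimal sketch. It scores an already-aligned pair of strings, charging a gap-open penalty for the first column of each gap run and a gap-extend penalty for every further column, so a gap of length L costs gap_open + (L - 1) * gap_extend. The function name and all parameter values are illustrative assumptions, not the scoring scheme of the AGA software, and the amino acid term of the genome alignment score is not modeled here.

```python
def affine_alignment_score(a, b, match=2, mismatch=-2, gap_open=-10, gap_extend=-1):
    """Score two equal-length aligned strings ('-' marks a gap) with affine gaps.

    All default scores are hypothetical values chosen for illustration.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_in = None  # sequence holding the current gap run: 'a', 'b', or None
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both sequences")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # First column of a run pays gap_open; later columns pay gap_extend.
            score += gap_extend if gap_in == side else gap_open
            gap_in = side
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score
```

With these assumed defaults, one gap of length two is cheaper than two separate gaps of length one, which is the property that makes affine penalties favor contiguous insertions and deletions.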
67. Deconvolving sequence features that discriminate between overlapping regulatory annotations
- Author
-
Akshay Kakumanu, Silvia Velasco, Esteban Mazzoni, and Shaun Mahony
- Subjects
0301 basic medicine ,Regulatory Sequences, Nucleic Acid ,0302 clinical medicine ,Discriminative model ,Biology (General) ,Promoter Regions, Genetic ,Genetics ,0303 health sciences ,Ecology ,High-Throughput Nucleotide Sequencing ,Genome project ,Chromatin ,Molecular Sequence Annotation ,Computational Theory and Mathematics ,Modeling and Simulation ,Research Article ,Cell type ,Sequence analysis ,QH301-705.5 ,Computational biology ,Biology ,ENCODE ,Cell Line ,Cellular and Molecular Neuroscience ,03 medical and health sciences ,Annotation ,Consensus sequence ,Deoxyribonuclease I ,Humans ,Cell Lineage ,Molecular Biology ,Transcription factor ,Ecology, Evolution, Behavior and Systematics ,Embryonic Stem Cells ,030304 developmental biology ,Sequence (medicine) ,Binding Sites ,Computational Biology ,Promoter ,DNA binding site ,Sequence logo ,030104 developmental biology ,Gene Expression Regulation ,Genetic Loci ,030217 neurology & neurosurgery ,Software ,Transcription Factors - Abstract
Genomic loci with regulatory potential can be annotated with various properties. For example, genomic sites bound by a given transcription factor (TF) can be divided according to whether they are proximal or distal to known promoters. Sites can be further labeled according to the cell types and conditions in which they are active. Given such a collection of labeled sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by overlaps between the labels; e.g. if regulatory sites specific to a given cell type are also more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that set of sites are associated with the cell type or associated with promoters. In order to meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, SeqUnwinder is able to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we characterize distinct sequence properties of multi-condition and cell-specific TF binding sites after controlling for uneven associations with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines. Author summary: Transcription factor proteins control gene expression by recognizing and interacting with short DNA sequence patterns in regulatory regions on the genome. 
Current genomics experiments allow us to find regulatory regions associated with a particular biochemical activity over the entire genome; for example, all regions where a particular transcription factor interacts with the genome in a given cell type. Given a collection of regulatory regions, we often aim to discover short DNA sequence patterns that are more common in the collection than in other regions. Performing such “DNA motif-finding” analysis can give us hints about the patterns that determine gene regulation in the analyzed cell type. Here we describe a new method for DNA motif-finding called SeqUnwinder. Our approach analyzes collections of regulatory regions where each has been labeled according to various biological properties. For example, the labels could correspond to various cell types in which the regulatory region is active. SeqUnwinder then performs machine-learning analysis to unravel DNA sequence features that are characteristic of each label (e.g. features that distinguish regulatory regions in each cell type from other cell types). SeqUnwinder is the first method to enable analysis of regulatory region collections that contain several overlapping labels.
- Published
- 2017
68. The Four-Stage Sequence
- Author
-
William A. Edmundson
- Subjects
Sequence logo ,Multiple sequence alignment ,Stage (stratigraphy) ,Consensus sequence ,Computational biology ,Biology ,Sequence motif ,Sequence (medicine) - Published
- 2017
- Full Text
- View/download PDF
69. The sequence preference of DNA cleavage by T4 endonuclease VII
- Author
-
Megan E. Hardie and Vincent Murray
- Subjects
0301 basic medicine ,Cleavage factor ,Base pair ,Cleavage and polyadenylation specificity factor ,Cleavage (embryo) ,Biochemistry ,AP endonuclease ,Substrate Specificity ,03 medical and health sciences ,chemistry.chemical_compound ,0302 clinical medicine ,Bacteriophage T4 ,DNA Cleavage ,Binding Sites ,Endodeoxyribonucleases ,biology ,Base Sequence ,General Medicine ,DNA ,Molecular biology ,DNA binding site ,Sequence logo ,030104 developmental biology ,chemistry ,030220 oncology & carcinogenesis ,biology.protein - Abstract
The enzyme T4 endonuclease VII is a resolvase that acts on branched DNA intermediates during genetic recombination, by cleaving DNA with staggered cuts approximately 3–6 bp apart. In this paper, we investigated the sequence preference of this cleavage reaction utilising two different DNA sequences. For the first time, the DNA sequence preference of T4 endonuclease VII cleavage sites has been examined without the presence of a known DNA substrate to mask any inherent nucleotide preference. The use of the ABI3730 platform enables the cleavage site to be determined at nucleotide resolution. We found that T4 endonuclease VII cleaves DNA with a sequence preference. We calculated the frequency of nucleotides surrounding the cleavage sites and found that the following nucleotides had the highest incidence: AWTAN*STC, where N* indicates the cleavage site between positions 0 and 1, N is any base, W is A or T, and S is G or C. An A at position −1 and a T at position +2 were the most predominant nucleotides at the cleavage site. Using a Sequence Logo method, the sequence TATTAN*CT was derived at the cleavage site. Note that A and T nucleotides were highly preferred 5′ to the cleavage sites in both methods of analysis. It was proposed that the enzyme recognises the narrower minor groove of these consecutive AT base pairs and cleaves DNA 3′ to this feature.
- Published
- 2017
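The frequency calculation described in the abstract of entry 69 (counting nucleotides at each position around a set of cleavage sites) can be sketched as follows. The function names and the toy windows are hypothetical and are not data from the paper; each window is assumed to be a fixed-length stretch of sequence aligned on the cleavage site and to contain only A, C, G, and T.

```python
from collections import Counter

def position_frequencies(windows):
    """Return one dict per column mapping each nucleotide to its relative frequency."""
    if not windows:
        raise ValueError("no windows given")
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    freqs = []
    for column in zip(*windows):
        counts = Counter(column)
        freqs.append({base: counts[base] / len(column) for base in "ACGT"})
    return freqs

def most_frequent(freqs):
    """Pick the highest-frequency nucleotide per column (ties resolved in A, C, G, T order)."""
    return "".join(max("ACGT", key=lambda base: col[base]) for col in freqs)

# Hypothetical windows, invented purely for illustration.
toy_windows = ["ATTAC", "ATTAG", "TTTAC", "AATAC"]
toy_freqs = position_frequencies(toy_windows)
toy_consensus = most_frequent(toy_freqs)
```

A per-column maximum like this discards ties and near-ties, which is one reason a frequency-based consensus and a sequence logo of the same sites can differ.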
70. Correction: Corrigendum: The driving force of prophages and CRISPR-Cas system in the evolution of Cronobacter sakazakii
- Author
-
Haiyan Zeng, Chensi Li, Na Ling, Jumei Zhang, Tengfei Xie, Yingwang Ye, and Qingping Wu
- Subjects
0301 basic medicine ,Genome ,Multidisciplinary ,Prophages ,Computational biology ,Orientation (graph theory) ,Biology ,biology.organism_classification ,Corrigenda ,Cronobacter sakazakii ,Evolution, Molecular ,03 medical and health sciences ,Sequence logo ,030104 developmental biology ,CRISPR ,Clustered Regularly Interspaced Short Palindromic Repeats ,CRISPR-Cas Systems ,Genome, Bacterial ,Prophage ,Hypothetical gene - Abstract
Cronobacter sakazakii is an important foodborne pathogen causing rare but life-threatening diseases in neonates and infants. The CRISPR-Cas system is a prokaryotic defense system that provides adaptive immunity against phages, which play a vital role in the evolution and pathogenicity of host bacteria. In this study, we found that genome sizes of C. sakazakii strains had a significant positive correlation with total genome sizes of prophages. Prophages contributed to 16.57% of the genetic diversity (pan genome) of C. sakazakii, some of which may be potential virulence factors. A subtype I-E CRISPR-Cas system and five types of CRISPR arrays were found in the conserved site of C. sakazakii strains. CRISPR1 and CRISPR2 loci with highly variable spacers were active and showed potential protection against phage attacks. The number of spacers from the two active CRISPR loci in clinical strains was significantly lower than that in foodborne strains, which may be a reason why clinical strains were found to have more prophages than foodborne strains. The frequent gain and loss of prophages and of spacers in CRISPR loci is likely to drive the rapid evolution of C. sakazakii. Our study provides new insight into the co-evolution of phages and C. sakazakii.
- Published
- 2017
- Full Text
- View/download PDF
71. CircularLogo: A lightweight web application to visualize intra-motif dependencies
- Author
-
Michael T Kalmbach, Liguo Wang, Zhenqing Ye, Jean Pierre A. Kocher, Tao Ma, and Surendra Dasari
- Subjects
0301 basic medicine ,Source code ,Theoretical computer science ,Computer science ,Intra-motif dependency ,Biomolecular structure ,computer.software_genre ,Biochemistry ,chemistry.chemical_compound ,Interactive ,RNA, Transfer ,Structural Biology ,CircularLogo ,Nucleotide ,lcsh:QH301-705.5 ,Visualization ,computer.programming_language ,media_common ,chemistry.chemical_classification ,Web application framework ,Applied Mathematics ,Computer Science Applications ,Eukaryotic Cells ,Transfer RNA ,lcsh:R858-859.7 ,Motif (music) ,DNA microarray ,Web server ,media_common.quotation_subject ,lcsh:Computer applications to medicine. Medical informatics ,JavaScript ,03 medical and health sciences ,Web application ,Nucleotide Motifs ,Binding site ,Molecular Biology ,Internet ,Binding Sites ,Information retrieval ,Base Sequence ,business.industry ,Intron ,RNA ,Sequence Analysis, DNA ,Python (programming language) ,Introns ,Sequence logo ,030104 developmental biology ,ComputingMethodologies_PATTERNRECOGNITION ,lcsh:Biology (General) ,chemistry ,Nucleic Acid Conformation ,RNA Splice Sites ,business ,computer ,Software ,DNA - Abstract
Background: The sequence logo has been widely used to represent DNA or RNA motifs for more than three decades. Despite its intelligibility and intuitiveness, the traditional sequence logo is unable to display the intra-motif dependencies and therefore is insufficient to fully characterize nucleotide motifs. Many methods have been developed to quantify the intra-motif dependencies, but fewer tools are available for visualization. Result: We developed CircularLogo, a web-based interactive application, which is able to not only visualize the position-specific nucleotide consensus and diversity but also display the intra-motif dependencies. Applying CircularLogo to HNF6 binding sites and tRNA sequences demonstrated its ability to show intra-motif dependencies and intuitively reveal biomolecular structure. CircularLogo is implemented in JavaScript and Python based on the Django web framework. The program’s source code and user’s manual are freely available at http://circularlogo.sourceforge.net. CircularLogo web server can be accessed from http://bioinformaticstools.mayo.edu/circularlogo/index.html. Conclusion: CircularLogo is an innovative web application that is specifically designed to visualize and interactively explore intra-motif dependencies.
- Published
- 2017
- Full Text
- View/download PDF
72. Protein Sequence Analysis
- Author
-
Probodh Borah, Ravi Prakash Yadav, Guruswami Gurusubramanian, Shunmugiah Karutha Pandian, Nachimuthu Senthil Kumar, Zothansanga, Kalibulla Syed Ibrahim, and Surender Mohan
- Subjects
Sequence logo ,ComputingMethodologies_PATTERNRECOGNITION ,Amino acid sequence analysis ,Computer science ,Data_FILES ,Consensus sequence ,ExPASy ,Protein sequence analysis ,Protein function prediction ,Computational biology ,Peptide sequence - Abstract
ExPASy (Expert Protein Analysis System) is a bioinformatics resource portal from the Swiss Institute of Bioinformatics that offers bioinformatics support, such as access to scientific databases and software tools, for research in the life sciences.
- Published
- 2017
- Full Text
- View/download PDF
73. ArrayPitope: Automated Analysis of Amino Acid Substitutions for Peptide Microarray-Based Antibody Epitope Mapping
- Author
-
Ole Lund, Paolo Marcatili, Christian Skjødt Hansen, Morten Nielsen, Thomas Osterbye, and Søren Buus
- Subjects
0301 basic medicine ,Proteomics ,Microarrays ,Ciencias de la Salud ,lcsh:Medicine ,Biochemistry ,Epitope ,Epitopes ,0302 clinical medicine ,Protein sequencing ,Antibody Specificity ,Amino Acids ,Post-Translational Modification ,lcsh:Science ,Peptide array ,Multidisciplinary ,Alanine ,biology ,Chemistry ,Organic Compounds ,Bioassays and Physiological Analysis ,Physical Sciences ,Amino Acid Analysis ,purl.org/becyt/ford/3 [https] ,Peptide microarray ,Signal Peptides ,Research Article ,CIENCIAS MÉDICAS Y DE LA SALUD ,Protein Array Analysis ,Computational biology ,Research and Analysis Methods ,Peptide Mapping ,Antibodies ,purl.org/becyt/ford/3.3 [https] ,03 medical and health sciences ,Antigen ,Albumins ,Humans ,Molecular Biology Techniques ,Molecular Biology ,Antibody ,Molecular Biology Assays and Analysis Techniques ,Salud Ocupacional ,Linear epitope ,lcsh:R ,Organic Chemistry ,Gene Mapping ,Chemical Compounds ,Biology and Life Sciences ,Proteins ,Peptide Fragments ,Sequence logo ,030104 developmental biology ,Epitope mapping ,Amino Acid Substitution ,Aliphatic Amino Acids ,Polyclonal antibodies ,biology.protein ,Motif ,lcsh:Q ,Peptides ,Epitope Mapping ,030215 immunology - Abstract
Identification of epitopes targeted by antibodies (B cell epitopes) is of critical importance for the development of many diagnostic and therapeutic tools. For clinical usage, such epitopes must be extensively characterized in order to validate specificity and to document potential cross-reactivity. B cell epitopes are typically classified as either linear epitopes, i.e. short consecutive segments from the protein sequence, or conformational epitopes adapted through native protein folding. Recent advances in high-density peptide microarrays enable high-throughput, high-resolution identification and characterization of linear B cell epitopes. Using exhaustive amino acid substitution analysis of peptides originating from target antigens, these microarrays can be used to address the specificity of polyclonal antibodies raised against such antigens containing hundreds of epitopes. However, the interpretation of the data provided in such large-scale screenings is far from trivial and in most cases it requires advanced computational and statistical skills. Here, we present an online application for automated identification of linear B cell epitopes, allowing the non-expert user to analyse peptide microarray data. The application takes as input quantitative peptide data of fully or partially substituted overlapping peptides from a given antigen sequence, identifies epitope residues (residues that are significantly affected by substitutions), and visualizes the selectivity towards each residue by sequence logo plots. Demonstrating utility, the application was used to identify and address the antibody specificity of 18 linear epitope regions in Human Serum Albumin (HSA), using peptide microarray data consisting of fully substituted peptides spanning the entire sequence of HSA and incubated with polyclonal rabbit anti-HSA (and mouse anti-rabbit-Cy3). The application is made available at: www.cbs.dtu.dk/services/ArrayPitope. Fil: Hansen, Christian Skjødt. 
Technical University of Denmark; Dinamarca Fil: Østerbye, Thomas. Universidad de Copenhagen; Dinamarca Fil: Marcatili, Paolo. Technical University of Denmark; Dinamarca Fil: Lund, Ole. Technical University of Denmark; Dinamarca Fil: Buus, Søren. Universidad de Copenhagen; Dinamarca Fil: Nielsen, Morten. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Investigaciones Biotecnológicas. Universidad Nacional de San Martín. Instituto de Investigaciones Biotecnológicas; Argentina
- Published
- 2017
- Full Text
- View/download PDF
74. Quantitative network mapping of the human kinome interactome reveals new clues for rational kinase inhibitor discovery and individualized cancer therapy
- Author
-
Zhongming Zhao, Feixiong Cheng, Quan Wang, and Peilin Jia
- Subjects
kinase-substrate interaction ,Systems biology ,interactome ,Computational biology ,Biology ,Interactome ,resistance ,Human interactome ,Interaction network ,Neoplasms ,Drug Discovery ,Humans ,Kinome ,Precision Medicine ,Protein Kinase Inhibitors ,Genetics ,phosphorylation ,Drug discovery ,Kinase ,Chromosome Mapping ,systems biology ,Sequence logo ,Oncology ,Protein Kinases ,Metabolic Networks and Pathways ,Research Paper ,Signal Transduction - Abstract
The human kinome is gaining importance through its promising cancer therapeutic targets, yet no general model to address kinase inhibitor resistance has emerged. Here, we constructed a systems biology-based framework to catalogue the human kinome, including 538 kinase genes, in the broader context of the human interactome. Specifically, we constructed three networks: a kinase-substrate interaction network containing 7,346 pairs connecting 379 kinases to 36,576 phosphorylation sites in 1,961 substrates, a protein-protein interaction network (PPIN) containing 92,699 pairs, and an atomic resolution PPIN containing 4,278 pairs. We identified the conserved regulatory phosphorylation motifs (e.g., Ser/Thr-Pro) using a sequence logo analysis. We found that the typical anticancer target selection strategy, which uses network hubs as drug targets, might lead to a high adverse drug reaction risk. Furthermore, we found that the distinct network centrality of kinases creates a high anticancer drug resistance risk by feedback or crosstalk mechanisms within cellular networks. This notion is supported by the systematic network and pathway analyses, which show that anticancer drug resistance genes are significantly enriched as hubs and heavily participate in multiple signaling pathways. Collectively, this comprehensive human kinome interactome map sheds light on anticancer drug resistance mechanisms and provides an innovative resource for rational kinase inhibitor design.
- Published
- 2014
- Full Text
- View/download PDF
75. Graphic Mapping of Protein-Coding DNA Sequence in Four-Dimensional Space and its Application
- Author
-
Xiao-Hong Li, Zhao-Hui Qi, and Xiao-Qin Qi
- Subjects
Genetics ,Multiple sequence alignment ,Sequence analysis ,Computer science ,Sequence assembly ,Sequence alignment ,General Chemistry ,Computational biology ,Condensed Matter Physics ,Sequence-tagged site ,Computational Mathematics ,Sequence logo ,Consensus sequence ,General Materials Science ,Electrical and Electronic Engineering ,Alignment-free sequence analysis - Published
- 2014
- Full Text
- View/download PDF
76. Principal components analysis of protein sequence clusters
- Author
-
Bo Wang and Michael A. Kennedy
- Subjects
Principal Component Analysis ,Multiple sequence alignment ,Protein Conformation ,Sequence analysis ,business.industry ,Proteins ,Pattern recognition ,General Medicine ,Biology ,Biochemistry ,Article ,Sequence logo ,Sequence Analysis, Protein ,Structural Biology ,Genetics ,Loop modeling ,Artificial intelligence ,Sequence space (evolution) ,Amino Acids ,business ,Sequence Alignment ,Peptide sequence ,Algorithms ,Alignment-free sequence analysis ,Sequence (medicine) - Abstract
Sequence analysis of large protein families can produce sub-clusters even within the same family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving separation into sub-clusters. In large protein families composed of large proteins, it can be quite challenging to assign the relative importance to specific amino acid positions. Principal components analysis (PCA) is ideal for such a task, since the problem is posed in a large variable space, i.e. the number of amino acids that make up the protein sequence, and PCA is powerful at reducing the dimensionality of complex problems by projecting the data into an eigen-space that represents the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single letter alphabetic codes, whereas PCA of protein sequence families requires conversion of sequence information into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated using a small artificial dataset to illustrate the characteristics and performance of the algorithm, as well as a small protein sequence family consisting of nine members, COG2263, and finally with a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.
- Published
- 2014
- Full Text
- View/download PDF
77. Transference of Wheat Expressed Sequence Tag-Simple Sequence Repeats to Paspalum Species and Cross-Species Amplification of Paspalum notatum Simple Sequence Repeats: Potential Use in Phylogenetic Analysis and Mapping
- Author
-
Maria Esperanza Sartor, Lorena Adelina Siena, Francisco Espinoza, Juan Pablo A. Ortiz, and Camilo Luis Quarin
- Subjects
COMPARATIVE MAPPING ,Genetics ,Expressed sequence tag ,Phylogenetic tree ,biology ,WHEAT EST-SSR ,biology.organism_classification ,Sequence logo ,Tandem repeat ,APOSPORY ,CIENCIAS AGRÍCOLAS ,PASPALUM SPP ,Agronomía, reproducción y protección de plantas ,Direct repeat ,Microsatellite ,Agricultura, Silvicultura y Pesca ,Agronomy and Crop Science ,Paspalum notatum ,Paspalum - Abstract
The genus Paspalum includes numerous species of agronomic importance. The objectives of this work were to evaluate the transferability and polymorphism of publicly available wheat EST-SSR (wEST-SSR) markers to Paspalum spp., assess the cross-species amplification of Paspalum notatum genomic-SSRs (PnSSR) within the genus, and evaluate both types of markers for phylogenetic analyses and mapping. Thirty-two accessions, including 11 species, were used. Moreover, 65 F1 hybrids of P. notatum were employed for mapping. Transferability ratio of wEST-SSRs was 72.72%. On average 19.25 bands per primer combination were obtained. Cross-species amplification of PnSSRs was 55.37%, with an average of 9.45 fragments per primer pair. Both types of markers differed in the amplification capacity between primer pairs and species. Clustering analysis with wEST-SSRs data discriminated accessions by species and taxonomic groups. Genomic relationships were in agreement with previous works, indicating that wEST-SSRs are adequate for phylogenetic surveys in the genus. Mapping experiments showed that both wEST-SSRs and PnSSRs mapped scattered across the genome. One hundred and two new markers were added to the existing P. notatum linkage groups. Primer pairs ksum206, ksum219 and PN03-F2 generated markers that mapped linked to apospory. Sequences of EST-SSRs experimentally mapped in P. notatum showed 46.23% to 55.27% similarity with the original wheat EST. A preliminary comparative mapping analysis was carried out combining experimental and in silico mapping results. Fil: Siena, Lorena Adelina. Universidad Nacional de Rosario. Facultad de Ciencias Agrarias. Laboratorio de Biología Molecular; Argentina Fil: Sartor, Maria Esperanza. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Nordeste. Instituto de Botánica del Nordeste (i); Argentina Fil: Quarin, Camilo Luis. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Nordeste. 
Instituto de Botánica del Nordeste (i); Argentina Fil: Espinoza, Francisco. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Nordeste. Instituto de Botánica del Nordeste (i); Argentina Fil: Ortiz, Juan Pablo Amelio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Nordeste. Instituto de Botánica del Nordeste (i); Argentina. Universidad Nacional de Rosario. Facultad de Ciencias Agrarias. Laboratorio de Biología Molecular; Argentina
- Published
- 2014
- Full Text
- View/download PDF
78. Sequence determinants of protein architecture
- Author
-
S. Rackovsky
- Subjects
Sequence analysis ,Computational biology ,Biology ,Bioinformatics ,Biochemistry ,Small set ,Sequence logo ,Structural Biology ,Sequence organization ,Encoding (memory) ,Protein folding ,Molecular Biology ,Alignment-free sequence analysis ,Sequence (medicine) - Abstract
Delineation of the relationship between sequence and structure in proteins has proven elusive. Most studies of this problem use alignment methods and other approaches based on the characteristics of individual residues. It is demonstrated herein that the sequence-structure relationship is determined in significant part by global characteristics of sequence organization. Information encoded in complete sequences is required to distinguish proteins in different architectural groups. It is found that the statistically significant differences between sequences encoding different architectures are encoded in a surprisingly small set of low-wave-number sequence periodicities. It would therefore appear that unexpected simplicity in an appropriately defined Fourier space may be an inherent characteristic of the sequences of folded proteins. Proteins 2013; 81:1681–1685. © 2013 Wiley Periodicals, Inc.
- Published
- 2013
- Full Text
- View/download PDF
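The idea in the abstract of entry 78, that differences between sequences can show up in a few low-wave-number periodicities, can be illustrated with a small discrete Fourier transform over a numerically encoded sequence. This is only a sketch under stated assumptions: the binary hydrophobic/polar encoding and the names `HYDROPHOBIC`, `encode`, and `power_spectrum` are hypothetical and are not the property scales or the analysis used in the paper.

```python
import cmath

# Hypothetical binary property scale, chosen only for illustration.
HYDROPHOBIC = set("AVLIMFWC")

def encode(sequence):
    """Map each residue to 1.0 (in the assumed hydrophobic set) or 0.0 (otherwise)."""
    return [1.0 if aa in HYDROPHOBIC else 0.0 for aa in sequence]

def power_spectrum(values, max_k):
    """Power of the mean-centered signal at wave numbers k = 1..max_k."""
    n = len(values)
    if n == 0:
        raise ValueError("empty signal")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    spectrum = []
    for k in range(1, max_k + 1):
        # Direct DFT coefficient at wave number k.
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, v in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum
```

For example, `power_spectrum(encode(seq), 5)` returns the power at the five lowest non-zero wave numbers of a sequence `seq`; comparing such short vectors between groups of proteins is the kind of analysis the abstract alludes to.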
79. SigniSite: Identification of residue-level genotype-phenotype correlations in protein multiple sequence alignments
- Author
-
Ilka Hoof, Leon Eyrich Jessen, Morten Nielsen, and Ole Lund
- Subjects
Genotype ,Protein family ,Sequence analysis ,Sequence alignment ,Biology ,03 medical and health sciences ,0302 clinical medicine ,SDG 3 - Good Health and Well-being ,HIV Protease ,Sequence Analysis, Protein ,Drug Resistance, Viral ,Genetics ,Position-Specific Scoring Matrices ,Genetic Association Studies ,030304 developmental biology ,Internet ,0303 health sciences ,Articles ,HIV Protease Inhibitors ,Resistance mutation ,Phenotype ,3. Good health ,Sequence logo ,Mutation ,HIV-1 ,Sequence Alignment ,Software ,030217 neurology & neurosurgery - Abstract
Identifying which mutation(s) within a given genotype is responsible for an observable phenotype is important in many aspects of molecular biology. Here, we present SigniSite, an online application for subgroup-free residue-level genotype–phenotype correlation. In contrast to similar methods, SigniSite does not require any pre-definition of subgroups or binary classification. Input is a set of protein sequences where each sequence has an associated real number, quantifying a given phenotype. SigniSite will then identify which amino acid residues are significantly associated with the data set phenotype. As output, SigniSite displays a sequence logo, depicting the strength of the phenotype association of each residue and a heat-map identifying ‘hot’ or ‘cold’ regions. SigniSite was benchmarked against SPEER, a state-of-the-art method for the prediction of specificity determining positions (SDP) using a set of human immunodeficiency virus protease-inhibitor genotype–phenotype data and corresponding resistance mutation scores from the Stanford University HIV Drug Resistance Database, and a data set of protein families with experimentally annotated SDPs. For both data sets, SigniSite was found to outperform SPEER. SigniSite is available at: http://www.cbs.dtu.dk/services/SigniSite/.
- Published
- 2013
- Full Text
- View/download PDF
80. IgBLAST: an immunoglobulin variable domain sequence analysis tool
- Author
-
Jian Ye, Thomas L. Madden, James Ostell, and Ning Ma
- Subjects
Genetics ,Internet ,Sequence analysis ,V(D)J recombination ,Immunoglobulin Variable Region ,Context (language use) ,Articles ,Sequence Analysis, DNA ,Complementarity determining region ,Biology ,V(D)J Recombination ,Sequence logo ,Sequence Analysis, Protein ,Consensus sequence ,Humans ,Sequence Alignment ,Gene ,Software ,Sequence (medicine) - Abstract
The variable domain of an immunoglobulin (IG) sequence is encoded by multiple genes, including the variable (V) gene, the diversity (D) gene and the joining (J) gene. Analysis of IG sequences typically requires identification of each gene, as well as a comparison of sequence variations in the context of defined regions. General purpose tools, such as the BLAST program, have only limited use for such tasks, as the rearranged nature of an IG sequence and the variable length of each gene requires multiple rounds of BLAST searches for a single IG sequence. Additionally, manual assembly of different genes is difficult and error-prone. To address these issues and to facilitate other common tasks in analysing IG sequences, we have developed the sequence analysis tool IgBLAST (http://www.ncbi.nlm.nih.gov/igblast/). With this tool, users can view the matches to the germline V, D and J genes, details at rearrangement junctions, the delineation of IG V domain framework regions and complementarity determining regions. IgBLAST has the capability to analyse nucleotide and protein sequences and can process sequences in batches. Furthermore, IgBLAST allows searches against the germline gene databases and other sequence databases simultaneously to minimize the chance of missing possibly the best matching germline V gene.
- Published
- 2013
- Full Text
- View/download PDF
81. Defining the Bacteroides Ribosomal Binding Site
- Author
-
Udo Wegmann, Simon R. Carding, and Nikki Horn
- Subjects
Genetics, Microbial ,Untranslated region ,Genetic Vectors ,Gene Expression ,Genetics and Molecular Biology ,Applied Microbiology and Biotechnology ,Ribosome ,Eukaryotic translation ,medicine ,Bacteroides ,RNA, Messenger ,Molecular Biology ,Genetics ,Binding Sites ,Expression vector ,Ecology ,biology ,Human gastrointestinal tract ,food and beverages ,biology.organism_classification ,Ribosomal binding site ,Sequence logo ,medicine.anatomical_structure ,RNA, Ribosomal ,Protein Biosynthesis ,5' Untranslated Regions ,Ribosomes ,Food Science ,Biotechnology - Abstract
The human gastrointestinal tract, in particular the colon, hosts a vast number of commensal microorganisms. Representatives of the genus Bacteroides are among the most abundant bacterial species in the human colon. Bacteroidetes diverged from the common line of eubacterial descent before other eubacterial groups. As a result, they employ unique transcription initiation signals and, because of this uniqueness, they require specific genetic tools. Although some tools exist, they are not optimal for studying the roles and functions of these bacteria in the human gastrointestinal tract. Focusing on translation initiation signals in Bacteroides, we created a series of expression vectors allowing for different levels of protein expression in this genus, and we describe the use of pepI from Lactobacillus delbrueckii subsp. lactis as a novel reporter gene for Bacteroides. Furthermore, we report the identification of the 3′ end of the 16S rRNA of Bacteroides ovatus and analyze in detail its ribosomal binding site, thus defining a core region necessary for efficient translation, which we have incorporated into the design of our expression vectors. Based on the sequence logo information from the 5′ untranslated region of other Bacteroidales ribosomal protein genes, we conclude that our findings are relevant to all members of this order.
- Published
- 2013
- Full Text
- View/download PDF
82. DNAlogo: a smart mini application for generating DNA sequence logos
- Author
-
Yabin Guo
- Subjects
Sequence ,Generator (computer programming) ,Programming language ,Computer science ,business.industry ,computer.software_genre ,DNA sequencing ,Logo (programming language) ,World Wide Web ,Sequence logo ,ComputingMethodologies_PATTERNRECOGNITION ,Consensus sequence ,Nucleic acid ,The Internet ,business ,computer ,computer.programming_language - Abstract
The sequence logo is a powerful tool for presenting consensus sequences or motifs of nucleic acids and proteins (Schneider and Stephens 1990). WebLogo, a web-based sequence logo generator hosted by the University of California, Berkeley, is the most popular logo generator so far (Crooks et al. 2004). WebLogo has a graphical interface and is convenient and highly configurable. However, its use is occasionally restricted by internet speed, especially in developing countries. Moreover, when the number of sequences exceeds 10,000, a command line interface has to be used instead of the graphical interface, but many users in the biological sciences find it difficult to install and configure WebLogo and GhostScript (for vector map output) because they lack the relevant knowledge. Here I made an application, DNAlogo, which creates DNA sequence logos in Windows with a graphical interface. Operating DNAlogo does not require any knowledge of programming or bioinformatics.
- Published
- 2016
- Full Text
- View/download PDF
83. Sequence Alignment and Homology Search
- Author
-
Jui-Hung Hung and Zhiping Weng
- Subjects
0301 basic medicine ,Biological data ,Multiple sequence alignment ,Structural alignment ,Computational Biology ,Sequence Homology ,Sequence alignment ,Computational biology ,Biology ,01 natural sciences ,Genome ,General Biochemistry, Genetics and Molecular Biology ,010101 applied mathematics ,03 medical and health sciences ,Sequence logo ,030104 developmental biology ,Consensus sequence ,Human genome ,0101 mathematics ,Sequence Alignment - Abstract
Bioinformatics was brought into the spotlight in the late 1990s through the Human Genome Project. With the rapid accumulation of completed genomes, it was soon realized that for the vast majority of the newly identified genes and other functional regions of the genomes there were no other biological data. One way of inferring biological function is through homology: Because homologous genes have a common evolutionary descent, they are likely to have the same biological function. A large number of bioinformatics tools have been designed for rapidly and accurately comparing sequences of genes or proteins, comparing gene sequences with genomes, and comparing genomes. Two widely used tools for sequence alignment and homology searches, BLAST and ClustalW, are introduced here.
- Published
- 2016
84. Statistical methods for identifying sequence motifs affecting point mutations
- Author
-
Von Bing Yap, Gavin A. Huttley, Teresa Neeman, and Yicheng Zhu
- Subjects
0301 basic medicine ,Genetics ,Mutation rate ,Mutation Spectra ,Point mutation ,Sequence Analysis, DNA ,Biology ,Investigations ,03 medical and health sciences ,Sequence logo ,030104 developmental biology ,0302 clinical medicine ,Germline mutation ,Data Interpretation, Statistical ,Dynamic mutation ,Animals ,Humans ,Point Mutation ,Point accepted mutation ,CpG Islands ,Nucleotide Motifs ,030217 neurology & neurosurgery ,Software ,Suppressor mutation - Abstract
Mutation processes differ between types of point mutation, genomic locations, cells, and biological species. For some point mutations, specific neighboring bases are known to be mechanistically influential. Beyond these cases, numerous questions remain unresolved, including: what are the sequence motifs that affect point mutations? How large are the motifs? Are they strand symmetric? And, do they vary between samples? We present new log-linear models that allow explicit examination of these questions, along with sequence logo style visualization to enable identifying specific motifs. We demonstrate the performance of these methods by analyzing mutation processes in human germline and malignant melanoma. We recapitulate the known CpG effect, and identify novel motifs, including a highly significant motif associated with A→G mutations. We show that major effects of neighbors on germline mutation lie within ±2 of the mutating base. Models are also presented for contrasting the entire mutation spectra (the distribution of the different point mutations). We show the spectra vary significantly between autosomes and X-chromosome, with a difference in T→C transition dominating. Analyses of malignant melanoma confirmed reported characteristic features of this cancer, including statistically significant strand asymmetry, and markedly different neighboring influences. The methods we present are made freely available as a Python library https://bitbucket.org/pycogent3/mutationmotif.
- Published
- 2016
- Full Text
- View/download PDF
85. Perception Enhancement Using Visual Attributes in Sequence Motif Visualization
- Author
-
Bee Oy, Weiying K, and Lee Nung Kion
- Subjects
World Wide Web ,chemistry.chemical_classification ,Sequence logo ,chemistry ,Human–computer interaction ,Computer science ,Perception ,media_common.quotation_subject ,Representation (systemics) ,Nucleotide ,Sequence motif ,media_common ,Visualization - Abstract
Human factor theories are often neglected, especially in the design of biological tools. This problem was found in the sequence logo, which is used to visualize the conservation characteristics of biological sequence motifs. Previous studies have found some limitations in the graphical representation which cause bias and misinterpretation of the results in a sequence logo. Therefore, the aim of this study is to investigate how well the visual attributes help viewers to perceive and interpret the information, based on the preattentive theories and Gestalt principles of perception. A survey was carried out to gather users' opinions. The results showed some limitations in the use of colour, negative space, size and arrangement of the nucleotides, and a lack of information and interactivity in the sequence logo. Therefore, improvements in standardizing the colour, the graphical representation of the nucleotides and the interactivity of the tool are needed to solve the problems of bias and misinterpretation of the results in sequence logo visualization.
- Published
- 2016
- Full Text
- View/download PDF
86. Multiple Sequence for Next-Generation Sequences
- Author
-
Ken Nguyen, Yi Pan, and Xuan Guo
- Subjects
Sequence logo ,Multiple sequence alignment ,Sequence database ,Computer science ,Sequence analysis ,Consensus sequence ,Sequence alignment ,Computational biology ,DNA sequencing ,Sequence (medicine) - Published
- 2016
- Full Text
- View/download PDF
87. Gene Slider: sequence logo interactive data-visualization for education and research
- Author
-
Ting Ting Wang, Asher Pasha, Anna van Weringh, David S. Guttman, Nicholas J. Provart, and Jamie Waese
- Subjects
0106 biological sciences ,0301 basic medicine ,Statistics and Probability ,Source code ,Computer science ,media_common.quotation_subject ,JavaScript ,01 natural sciences ,Biochemistry ,03 medical and health sciences ,chemistry.chemical_compound ,Data visualization ,Sequence Analysis, Protein ,Computer graphics (images) ,Databases, Genetic ,Position-Specific Scoring Matrices ,Promoter Regions, Genetic ,Molecular Biology ,Gene ,Conserved Sequence ,computer.programming_language ,media_common ,Internet ,Information retrieval ,Binding Sites ,business.industry ,Computational Biology ,Sequence Analysis, DNA ,Multiple species ,Computer Science Applications ,DNA binding site ,Computational Mathematics ,Sequence logo ,030104 developmental biology ,Computational Theory and Mathematics ,chemistry ,Brassicaceae ,business ,computer ,DNA ,Software ,010606 plant biology & botany ,Transcription Factors - Abstract
Summary: Gene Slider helps visualize the conservation and entropy of orthologous DNA and protein sequences by presenting them as one long sequence logo that can be zoomed in and out of, from an overview of the entire sequence down to just a few residues at a time. A search function enables users to find motifs such as cis-elements in promoter regions by simply ‘drawing’ a sequence logo representation of the desired motif as a query. In addition to displaying user-supplied FASTA files, our demonstration version of Gene Slider loads and displays a rich database of 90 000+ conserved non-coding regions across the Brassicaceae indexed to the TAIR10 Col-0 Arabidopsis thaliana sequence. It also displays transcription factor binding sites, enabling easy identification of regions that are both conserved across multiple species and may contain transcription factor binding sites. Availability and Implementation: Freely available on the web at: http://www.bar.utoronto.ca/GeneSlider and also as an app on http://araport.org. Website implemented in JavaScript and Processing.js with all major browsers supported. Source code available under GNU GPLv2 at SourceForge: https://sourceforge.net/projects/geneslider/. Contact: nicholas.provart@utoronto.ca
- Published
- 2016
88. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals
- Author
-
Aliaksandr Damienikan and Yevgeny Nikolaichik
- Subjects
0301 basic medicine ,Bioinformatics ,030106 microbiology ,Genome browser ,lcsh:Medicine ,Bacterial genome size ,Biology ,Genome ,Microbiology ,Sequence logo ,General Biochemistry, Genetics and Molecular Biology ,03 medical and health sciences ,Sigma factor ,Transcription factor binding site ,Terminator ,Gene ,Pectobacterium atrosepticum ,Genetics ,General Neuroscience ,lcsh:R ,Promoter ,General Medicine ,Genome project ,Genomics ,DNA binding site ,ComputingMethodologies_PATTERNRECOGNITION ,General Agricultural and Biological Sciences ,Genome annotation - Abstract
The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: soft rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci.
- Published
- 2016
89. Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words
- Author
-
Giovanni Felici, Daniele Santoni, and Davide Vergni
- Subjects
0301 basic medicine ,Statistics and Probability ,Combinatorics of words ,Computational biology ,Biology ,General Biochemistry, Genetics and Molecular Biology ,Conserved sequence ,Evolution, Molecular ,03 medical and health sciences ,Sequence hypothesis ,0302 clinical medicine ,Protein sequencing ,Random sequence ,Amino Acid Sequence ,Peptide sequence ,chemistry.chemical_classification ,Genetics ,Models, Genetic ,General Immunology and Microbiology ,Applied Mathematics ,Proteins ,Amino acid association ,General Medicine ,Amino acid ,Sequence logo ,030104 developmental biology ,chemistry ,Modeling and Simulation ,Sequence space (evolution) ,General Agricultural and Biological Sciences ,Protein sequence ,030217 neurology & neurosurgery - Abstract
Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe in nature today. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences, while, for a protein to function correctly, one expects selection mechanisms to impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that a well-tuned amino acid sequence could reasonably be indistinguishable from a random one. In order to study the possibility of discriminating between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1,047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.
- Published
- 2016
- Full Text
- View/download PDF
90. SANS: high-throughput retrieval of protein sequences allowing 50% mismatches
- Author
-
Liisa Holm and J. Patrik Koskinen
- Subjects
Statistics and Probability ,Protein structure database ,Sequence analysis ,Sequence alignment ,Biology ,computer.software_genre ,Biochemistry ,law.invention ,03 medical and health sciences ,Bacterial Proteins ,law ,Sequence Analysis, Protein ,Databases, Protein ,Molecular Biology ,Alignment-free sequence analysis ,030304 developmental biology ,Sequence (medicine) ,0303 health sciences ,Sequence profiling tool ,Genome ,Sequence Homology, Amino Acid ,030302 biochemistry & molecular biology ,Suffix array ,Original Papers ,Computer Science Applications ,Computational Mathematics ,Sequence logo ,Computational Theory and Mathematics ,Macromolecular Structure, Dynamics and Function ,Metagenome ,Data mining ,computer ,Sequence Alignment ,Algorithms ,Software - Abstract
Motivation: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects. Results: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50–100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases. Availability and implementation: The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans. Contact: liisa.holm@helsinki.fi.
- Published
- 2012
91. Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion
- Author
-
Morten Nielsen and Martin Christen Frølund Thomsen
- Subjects
chemistry.chemical_classification ,Internet ,Binding Sites ,Multiple sequence alignment ,Amino Acid Motifs ,Sequence alignment ,Articles ,Computational biology ,Biology ,Bioinformatics ,Amino acid ,User-Computer Interface ,Sequence logo ,chemistry ,Sequence Analysis, Protein ,Computer Graphics ,Genetics ,Consensus sequence ,Position-Specific Scoring Matrices ,Amino acid binding ,Sequence Alignment ,Software ,Sequence (medicine) - Abstract
Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2Logo aims at resolving these issues, allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for low number of observations and different logotype representations each capturing different aspects related to amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSA, Seq2Logo accepts input as Blast sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed).
- Published
- 2012
- Full Text
- View/download PDF
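As a rough illustration of the quantity a sequence logo displays (entry 91), the Python sketch below computes per-column information content in bits for a toy alignment, as the maximum entropy for the alphabet minus the observed Shannon entropy. The function name, the toy peptides and the uniform 20-letter background are assumptions made for illustration; the sequence weighting, pseudo counts and enrichment/depletion logotypes that Seq2Logo offers are not reproduced here.

```python
import math
from collections import Counter

def column_information(sequences, alphabet_size=20):
    """Return the information content (bits) of each alignment column.

    Uses R = log2(alphabet_size) - H, where H is the Shannon entropy of
    the residue frequencies observed in the column. No sequence weighting,
    pseudo counts or small-sample correction are applied.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    info = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info

# Hypothetical toy alignment of four peptides.
peptides = ["GANDV", "GANDI", "GSNDV", "GANDL"]
bits = column_information(peptides)
```

A fully conserved column reaches the maximum of log2(20) bits, while variable columns score lower; in a logo these values set the height of each letter stack.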
92. Application of 2D graphic representation of protein sequence based on Huffman tree method
- Author
-
Ling Li, Xiao-Qin Qi, Zhao-Hui Qi, and Jun Feng
- Subjects
chemistry.chemical_classification ,Sequence ,Computer science ,Sequence analysis ,Molecular Sequence Data ,Representation (systemics) ,Proteins ,Health Informatics ,Huffman coding ,Computer Science Applications ,Amino acid ,Sequence logo ,symbols.namesake ,Protein sequencing ,chemistry ,Computer Graphics ,symbols ,Computer-Aided Design ,Amino Acid Sequence ,Peptide sequence ,Algorithm - Abstract
Based on the Huffman tree method, we propose a new 2D graphic representation of protein sequence. This representation can completely avoid loss of information in the transfer of data from a protein sequence to its graphic representation. The method consists of two parts. One concerns the 0-1 codes of the 20 amino acids, obtained from a Huffman tree built on amino acid frequency. The amino acid frequency is defined as the number of occurrences of an amino acid in the analyzed protein sequences. The other concerns the 2D graphic representation of protein sequence based on the 0-1 codes. Then the applications of the method to ten ND5 genes and seven Escherichia coli strains are presented in detail. The results show that the proposed model may provide new insights into the evolution patterns determined from protein sequences and complete genomes.
- Published
- 2012
- Full Text
- View/download PDF
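The first step described in entry 92, deriving 0-1 codes for amino acids from their frequencies with a Huffman tree, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: frequencies are counted from a single hypothetical sequence rather than from the authors' data set, the tie-breaking order is arbitrary, and the 2D plotting step is not attempted.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build 0-1 Huffman codes for the symbols in `text` from their counts."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single symbol still needs a one-bit code.
        return {next(iter(counts)): "0"}
    # Heap items: (count, unique tie-breaker, {symbol: code so far}).
    heap = [(n, i, {sym: ""})
            for i, (sym, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)
        n1, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n0 + n1, next_id, merged))
        next_id += 1
    return heap[0][2]

protein = "MKVLAAGLLKAAL"  # hypothetical sequence
codes = huffman_codes(protein)
encoded = "".join(codes[aa] for aa in protein)
```

Because Huffman codes are prefix-free, the bit string can be decoded back to the original sequence, which is the property behind the abstract's claim that no information is lost.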
93. Identical sequence patterns in the ends of exons and introns of human protein-coding genes
- Author
-
Raphael Tavares, Paulo S. Oliveira, Carlos Gil Ferreira, Gabriel Renaud, Emmanuel Dias-Neto, and Fabio Passetti
- Subjects
Genetics ,Splice site mutation ,Base Sequence ,Sequence analysis ,Organic Chemistry ,Intron ,Proteins ,Exons ,Biology ,Biochemistry ,Introns ,Computational Mathematics ,Sequence logo ,Exon ,Structural Biology ,RNA splicing ,Consensus sequence ,Humans ,Gene - Abstract
Intron splicing is one of the most important steps involved in the maturation process of a pre-mRNA. Although the sequence profiles around the splice sites have been studied extensively, the levels of sequence identity between the exonic sequences preceding the donor sites and the intronic sequences preceding the acceptor sites have not been examined as thoroughly. In this study we investigated identity patterns between the last 15 nucleotides of the exonic sequence preceding the 5' splice site and the intronic sequence preceding the 3' splice site in a set of human protein-coding genes that do not exhibit intron retention. We found that almost 60% of consecutive exons and introns in human protein-coding genes share at least two identical nucleotides at their 3' ends and, on average, the sequence identity length is 2.47 nucleotides. Based on our findings we conclude that the 3' ends of exons and introns tend to have longer identical sequences within a gene than when being taken from different genes. Our results hold even if the pairs are non-consecutive in the transcription order.
- Published
- 2012
- Full Text
- View/download PDF
94. Numerical Characterization of DNA Sequence Based on Dinucleotides
- Author
-
Cun-Quan Zhang, Xingqin Qi, Qin Wu, and Edgar Fuller
- Subjects
Article Subject ,Sequence analysis ,Molecular Sequence Data ,lcsh:Medicine ,beta-Globins ,Biology ,lcsh:Technology ,General Biochemistry, Genetics and Molecular Biology ,DNA sequencing ,Consensus sequence ,Animals ,Humans ,lcsh:Science ,Alignment-free sequence analysis ,General Environmental Science ,Sequence (medicine) ,Genetics ,Base Sequence ,lcsh:T ,Nucleotides ,Euclidean space ,lcsh:R ,Exons ,Sequence Analysis, DNA ,General Medicine ,Composition (combinatorics) ,Sequence logo ,lcsh:Q ,Algorithm ,Research Article - Abstract
Sequence comparison is a primary technique for the analysis of DNA sequences. In order to make quantitative comparisons, one devises mathematical descriptors that capture the essence of the base composition and distribution of the sequence. Alignment methods and graphical techniques (where each sequence is represented by a curve in high-dimension Euclidean space) have been used popularly for a long time. In this contribution we will introduce a new nongraphical and nonalignment approach based on the frequencies of the dinucleotide XY in DNA sequences. The most important feature of this method is that it not only identifies adjacent XY pairs but also nonadjacent XY ones where X and Y are separated by some number of nucleotides. This methodology preserves information in DNA sequence that is ignored by other methods. We test our method on the coding regions of exon-1 of β-globin for 11 species, and the utility of this new method is demonstrated.
- Published
- 2012
- Full Text
- View/download PDF
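The idea in entry 94 of counting XY pairs whose members are either adjacent or separated by a fixed number of nucleotides can be sketched in Python as below. The function name, the `gap` parameter and the example fragment are hypothetical, and the sketch covers only the counting step, not the authors' full numerical descriptor or distance measure.

```python
from itertools import product

def gapped_dinucleotide_freqs(seq, gap=0):
    """Frequencies of ordered pairs XY where Y lies gap + 1 positions after X.

    gap=0 gives ordinary adjacent dinucleotides; gap=1 skips one base, etc.
    """
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=2)}
    step = gap + 1
    total = 0
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in counts:  # skip pairs containing ambiguous bases such as N
            counts[pair] += 1
            total += 1
    return {p: (c / total if total else 0.0) for p, c in counts.items()}

dna = "ATGGTGCACCTGACT"  # hypothetical fragment
adjacent = gapped_dinucleotide_freqs(dna, gap=0)
one_apart = gapped_dinucleotide_freqs(dna, gap=1)
```

Concatenating the 16-value vectors for several gap sizes yields a fixed-length numerical profile that two sequences can be compared on without aligning them.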
95. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse
- Author
-
Bin Zhang, Sasha Tkachev, Vaughan Latham, Jon M. Kornhauser, Beth L. Murray, Peter Hornbeck, Elzbieta Skrzypek, and Michael Sullivan
- Subjects
Computational biology ,Biology ,computer.software_genre ,Mice ,03 medical and health sciences ,0302 clinical medicine ,Genetics ,Animals ,Humans ,BioPAX : Biological Pathways Exchange ,Plug-in ,Amino Acid Sequence ,Phosphorylation ,Databases, Protein ,030304 developmental biology ,0303 health sciences ,Ubiquitination ,Proteins ,Acetylation ,Articles ,Rats ,PhosphoSitePlus ,Structure and function ,Visualization ,Sequence logo ,Scripting language ,030220 oncology & carcinogenesis ,Posttranslational modification ,Cattle ,Protein Processing, Post-Translational ,computer - Abstract
PhosphoSitePlus (http://www.phosphosite.org) is an open, comprehensive, manually curated and interactive resource for studying experimentally observed post-translational modifications, primarily of human and mouse proteins. It encompasses 130,000 non-redundant modification sites, primarily phosphorylation, ubiquitinylation and acetylation. The interface is designed for clarity and ease of navigation. From the home page, users can launch simple or complex searches and browse high-throughput data sets by disease, tissue or cell line. Searches can be restricted by specific treatments, protein types, domains, cellular components, disease, cell types, cell lines, tissue and sequences or motifs. A few clicks of the mouse will take users to substrate pages or protein pages with sites, sequences, domain diagrams and molecular visualization of side-chains known to be modified; to site pages with information about how the modified site relates to the functions of specific proteins and cellular processes and to curated information pages summarizing the details from one record. PyMOL and Chimera scripts that colorize reactive groups on residues that are modified can be downloaded. Features designed to facilitate proteomic analyses include downloads of modification sites, kinase-substrate data sets, sequence logo generators, a Cytoscape plugin and BioPAX download to enable pathway visualization of the kinase-substrate interactions in PhosphoSitePlus®.
- Published
- 2011
- Full Text
- View/download PDF
96. The analysis of core promoter sequences based on their chemical features
- Author
-
Xiao Hui Wang, Xiao Yan Huang, Hong Lin Zhai, and Zhi Jie Shan
- Subjects
Sequence analysis ,Chemistry ,Process Chemistry and Technology ,Promoter ,Computational biology ,Molecular biology ,DNA sequencing ,Computer Science Applications ,Analytical Chemistry ,Conserved sequence ,Sequence logo ,Consensus sequence ,RNA polymerase II holoenzyme ,Spectroscopy ,Software ,Sequence (medicine) - Abstract
The biochemical behaviors of promoter sequences are closely associated with their chemical properties and structures. In this study, an approach to the analysis of promoter sequences was developed based on the chemical features in DNA sequences. Utilizing the chemical parameters of nucleotides, a character sequence was translated into numerical sequences, so that the profiles of chemical properties along the sequence can be observed, which helps in better understanding the behavior of the sequence. The proposed approach was applied to the analysis of core promoter sequences of Escherichia coli K-12. Apart from the validation of the motifs at the -35 and -10 regions, several possible functional sites were observed, and the interaction mechanism between promoter sequence and RNA polymerase holoenzyme was explored. The obtained results indicate that the consensus of important chemical features is higher than that of the characters in sequences, and our study could provide biologists with some valuable hints.
- Published
- 2011
- Full Text
- View/download PDF
97. Strategy of Repeats in DNA Sequence Assembly
- Author
-
Min Feng Xue, Lin Li, Yan An Zhang, and Yue Qi Han
- Subjects
Sequence logo ,Basis (linear algebra) ,Computer science ,Consensus sequence ,Direct repeat ,Sequence assembly ,General Medicine ,Computational biology ,Bioinformatics ,DNA sequencing - Abstract
This study concentrates on strategies for handling repeats in DNA sequence assembly. Dealing with sequences that contain repeats is a difficult problem in DNA sequence assembly. Building on existing strategies for masking repeats, this study proposes a method based on mate-pair analysis and a development strategy on the Amos platform.
- Published
- 2011
- Full Text
- View/download PDF
98. A reduced computational load protein coding predictor using equivalent amino acid sequence of DNA string with period-3 based time and frequency domain analysis
- Author
-
Gananath Dash, Pramod Kumar Meher, Mukesh Kumar Raval, and Jayakishan Meher
- Subjects
Genetics ,chemistry.chemical_classification ,Gene prediction ,String (computer science) ,Genomics ,Computational biology ,Biology ,Amino acid ,Psychiatry and Mental health ,Sequence logo ,chemistry ,Coding region ,Peptide sequence ,Sequence (medicine) - Abstract
Development of efficient gene prediction algorithms is one of the fundamental efforts in gene prediction study in the area of genomics. In genomic signal processing the basic step of the identification of protein coding regions in DNA sequences is based on the period-3 property exhibited by nucleotides in exons. Several approaches based on signal processing tools and numerical representations have been applied to solve this problem, trying to achieve more accurate predictions. This paper presents a new indicator sequence based on the amino acid sequence, called the amino acid indicator sequence, derived from the DNA string. It uses the existing signal-processing-based time-domain and frequency-domain methods to predict these regions within the billions-long DNA sequence of eukaryotic cells, and reduces the computational load by one-third. It is known that each triplet of bases, called a codon, instructs the cell machinery to synthesize an amino acid. The codon sequence therefore uniquely identifies an amino acid sequence which defines a protein. Thus the protein coding region is characterized by the codons in the amino acid sequence. This property is used for detection of period-3 regions using the amino acid sequence. Physico-chemical properties of amino acids are used for numerical representation. Various accuracy measures such as exonic peaks, discriminating factor, sensitivity, specificity, miss rate, wrong rate and approximate correlation are used to demonstrate the efficacy of the proposed predictor. The proposed method is validated on various organisms using the standard data sets HMR195, Burset and Guigo, and KEGG. The simulation results show that the proposed method is an effective approach for protein coding prediction.
- Published
- 2011
- Full Text
- View/download PDF
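The period-3 property mentioned in entry 98 is commonly measured with the classic nucleotide-level approach: build a binary indicator sequence for each base and evaluate its discrete Fourier transform at frequency 1/3. The Python sketch below shows that baseline only; it is not the amino acid indicator sequence the paper proposes, and the function name and the normalization by sequence length are assumptions for illustration.

```python
import cmath

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base indicators.

    For each base, the binary indicator sequence is 1 where that base occurs
    and 0 elsewhere; its DFT coefficient at frequency 1/3 is the sum of
    exp(-2*pi*j*i/3) over the positions i of that base.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    power = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, b in enumerate(seq) if b == base)
        power += abs(coeff) ** 2
    return power / n  # length normalization is an illustrative choice

strongly_periodic = period3_power("ATG" * 10)
no_periodicity = period3_power("A" * 30)
```

In practice the measure is computed in a sliding window along the genome, and windows with a pronounced peak are flagged as candidate coding regions.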
99. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein–DNA interactions
- Author
-
Martha L. Bulyk, Kimberly Robasky, and Harvard University--MIT Division of Health Sciences and Technology
- Subjects
Sequence analysis ,Protein Array Analysis ,Computational biology ,Biology ,Bioinformatics ,03 medical and health sciences ,0302 clinical medicine ,Protein sequencing ,Sequence Analysis, Protein ,Genetics ,Animals ,Humans ,Databases, Protein ,Peptide sequence ,030304 developmental biology ,0303 health sciences ,Internet ,Binding Sites ,Nucleic acid sequence ,Online database ,Articles ,DNA ,Position weight matrix ,3. Good health ,DNA-Binding Proteins ,Sequence logo ,030217 neurology & neurosurgery ,Software - Abstract
The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) database is a centralized repository of information on the DNA-binding preferences of proteins as determined by universal protein-binding microarray (PBM) technology. Each entry for a protein (or protein complex) in UniPROBE provides the quantitative preferences for all possible nucleotide sequence variants (‘words’) of length k (‘k-mers’), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In this update, we describe >130% expansion of the database content, incorporation of a protein BLAST (blastp) tool for finding protein sequence matches in UniPROBE, the introduction of UniPROBE accession numbers and additional database enhancements. The UniPROBE database is available at http://uniprobe.org. Funding: National Institutes of Health (U.S.) (grant number R01 HG003985).
- Published
- 2010
100. A new model of amino acids evolution, evolution index of amino acids and its application in graphical representation of protein sequences
- Author
-
Yi Zhang
- Subjects
chemistry.chemical_classification ,Sequence ,General Physics and Astronomy ,Computational biology ,Amino acid ,Sequence hypothesis ,Sequence logo ,Biochemistry ,Similarity (network science) ,chemistry ,Antifreeze protein ,Physical and Theoretical Chemistry ,Invariant (mathematics) ,Representation (mathematics) ,Mathematics - Abstract
In this paper, we give a novel evolution model of amino acids and propose an ‘evolution index’ of the 20 amino acids grounded on a process of stepwise subdividing the ‘synonymous codon domain’. Meanwhile, the rationale is given as to how the ‘evolution index’ of amino acids can be used for predictive science, by comparing our result with that of co-evolution theory. Then, by reducing an amino acid sequence to a single numerical sequence, we bring forward a new graphical representation scheme and analyze the similarity between 10 species with only one invariant. Furthermore, as an extended study, the relationship between 27 antifreeze proteins is obtained.
- Published
- 2010
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library