Homology-Based Annotation of Large Protein Datasets
- Author
Marco Punta, Jaina Mistry; Laboratory of Computational and Quantitative Biology (LCQB), Institut de Biologie Paris Seine (IBPS), Université Pierre et Marie Curie - Paris 6 (UPMC), Institut National de la Santé et de la Recherche Médicale (INSERM), Centre National de la Recherche Scientifique (CNRS); European Bioinformatics Institute [Hinxton] (EMBL-EBI), EMBL Heidelberg; Carugo, O and Eisenhaber
- Subjects
0301 basic medicine, Protein family, Computer science, In silico, [SDV]Life Sciences [q-bio], Sequence clustering, Computational biology, Protein family databases, Homology (biology), DNA sequencing, Protein annotation, Homology, 03 medical and health sciences, Annotation, 030104 developmental biology, Protein sequencing, ComputingMethodologies_PATTERNRECOGNITION, Profile-hidden Markov models
- Abstract
Advances in DNA sequencing technologies have led to an increasing amount of protein sequence data being generated. Only a small fraction of these sequences will have experimental annotation associated with them. Here, we describe a protocol for in silico homology-based annotation of large protein datasets that makes extensive use of manually curated collections of protein families. We focus on annotations provided by the Pfam database and suggest ways to identify family outliers and family variations. This protocol may be useful to people who are new to protein data analysis or who are unfamiliar with the computational tools currently available.
- Published
- 2016