RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

Authors :: Li W
O'Neill KR
Haft DH
DiCuccio M
Chetvernin V
Badretdin A
Coulouris G
Chitsaz F
Derbyshire MK
Durkin AS
Gonzales NR
Gwadz M
Lanczycki CJ
Song JS
Thanki N
Wang J
Yamashita RA
Yang M
Zheng C
Marchler-Bauer A
Thibaud-Nissen F
Source :: Nucleic acids research [Nucleic Acids Res] 2021 Jan 08; Vol. 49 (D1), pp. D1020-D1028.
Publication Year :: 2021
Abstract: The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. (Published by Oxford University Press on behalf of Nucleic Acids Research 2020.)

Subjects :: Data Curation methods
Data Mining methods
Genomics methods
Internet
Proteins classification
User-Computer Interface
Computational Biology methods
Databases, Genetic
Genome, Archaeal genetics
Genome, Bacterial genetics
Molecular Sequence Annotation methods
Proteins genetics

Details

Language :: English
ISSN :: 1362-4962
Volume :: 49
Issue :: D1
Database :: MEDLINE
Journal :: Nucleic acids research
Publication Type :: Academic Journal
Accession number :: 33270901
Full Text :: https://doi.org/10.1093/nar/gkaa1105

