102 results for "Adam Frankish"
Search Results
2. A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain
- Author
Anoushka Joglekar, Andrey Prjibelski, Ahmed Mahfouz, Paul Collier, Susan Lin, Anna Katharina Schlusche, Jordan Marrocco, Stephen R. Williams, Bettina Haase, Ashley Hayes, Jennifer G. Chew, Neil I. Weisenfeld, Man Ying Wong, Alexander N. Stein, Simon A. Hardwick, Toby Hunt, Qi Wang, Christoph Dieterich, Zachary Bent, Olivier Fedrigo, Steven A. Sloan, Davide Risso, Erich D. Jarvis, Paul Flicek, Wenjie Luo, Geoffrey S. Pitt, Adam Frankish, August B. Smit, M. Elizabeth Ross, and Hagen U. Tilgner
- Subjects
Science
- Abstract
Alternative RNA splicing varies across the brain. Its mapping at single cell resolution is unclear. Here, the authors provide a spatial and single-cell splicing atlas reporting brain region- and cell type-specific expression of different isoforms in the postnatal mouse brain.
- Published
- 2021
- Full Text
- View/download PDF
3. Transcriptional activity and strain-specific history of mouse pseudogenes
- Author
Cristina Sisu, Paul Muir, Adam Frankish, Ian Fiddes, Mark Diekhans, David Thybert, Duncan T. Odom, Paul Flicek, Thomas M. Keane, Tim Hubbard, Jennifer Harrow, and Mark Gerstein
- Subjects
Science
- Abstract
Pseudogenes are key markers of genome remodelling processes. Here the authors present genome-wide annotation of the pseudogenes in the mouse reference genome and 18 inbred mouse strains, update human pseudogene annotations, and characterise the transcription and evolution of mouse pseudogenes.
- Published
- 2020
- Full Text
- View/download PDF
4. Expert curation of the human and mouse olfactory receptor gene repertoires identifies conserved coding regions split across two exons
- Author
If H. A. Barnes, Ximena Ibarra-Soria, Stephen Fitzgerald, Jose M. Gonzalez, Claire Davidson, Matthew P. Hardy, Deepa Manthravadi, Laura Van Gerven, Mark Jorissen, Zhen Zeng, Mona Khan, Peter Mombaerts, Jennifer Harrow, Darren W. Logan, and Adam Frankish
- Subjects
Olfactory receptor gene; Annotation; Curation; Mouse; Human; Biotechnology; TP248.13-248.65; Genetics; QH426-470
- Abstract
Background: Olfactory receptor (OR) genes are the largest multi-gene family in the mammalian genome, with 874 in human and 1483 loci in mouse (including pseudogenes). The expansion of the OR gene repertoire has occurred through numerous duplication events followed by diversification, resulting in a large number of highly similar paralogous genes. These characteristics have made the annotation of the complete OR gene repertoire a complex task. Most OR genes have been predicted in silico and are typically annotated as intronless coding sequences. Results: Here we have developed an expert curation pipeline to analyse and annotate every OR gene in the human and mouse reference genomes. By combining evidence from structural features, evolutionary conservation and experimental data, we have unified the annotation of these gene families, and have systematically determined the protein-coding potential of each locus. We have defined the non-coding regions of many OR genes, enabling us to generate full-length transcript models. We found that 13 human and 41 mouse OR loci have coding sequences that are split across two exons. These split OR genes are conserved across mammals, and are expressed at the same level as protein-coding OR genes with an intronless coding region. Our findings challenge the long-standing and widespread notion that the coding region of a vertebrate OR gene is contained within a single exon. Conclusions: This work provides the most comprehensive curation effort of the human and mouse OR gene repertoires to date. The complete annotation has been integrated into the GENCODE reference gene set, for immediate availability to the research community.
- Published
- 2020
- Full Text
- View/download PDF
5. The value of primary transcripts to the clinical and non‐clinical genomics community: Survey results and roadmap for improvements
- Author
Joannella Morales, Aoife C. McMahon, Jane Loveland, Emily Perry, Adam Frankish, Sarah Hunt, Irina M. Armean, Paul Flicek, and Fiona Cunningham
- Subjects
default transcript; survey; transcript annotation; variant interpretation; Genetics; QH426-470
- Abstract
Background: Variant interpretation is dependent on transcript annotation and remains time consuming and challenging. There are major obstacles for historical data reuse and for interpretation of new variants. First, both RefSeq and Ensembl/GENCODE produce transcript sets in common use, but there is currently no easy way to translate between the two. Second, the resources often used for variant interpretation (e.g. ClinVar, gnomAD, UniProt) do not use the same transcript set, nor default transcript or protein sequence. Method: Ensembl ran a survey in 2018 to sample attitudes to choosing one default transcript per locus, and to gather data on reference sequences used by the scientific community. This was publicised on the Ensembl and UCSC genome browsers, by email and on social media. Results: The survey had 788 responses from 32 different countries, the results of which we report here. Conclusions: We present our roadmap to create an effective default set of transcripts for resources, and for reporting interpretation of clinical variants.
- Published
- 2021
- Full Text
- View/download PDF
6. Cell type-specific novel long non-coding RNA and circular RNA in the BLUEPRINT hematopoietic transcriptomes atlas
- Author
Luigi Grassi, Osagie G. Izuogu, Natasha A.N. Jorge, Denis Seyres, Mariona Bustamante, Frances Burden, Samantha Farrow, Neda Farahi, Fergal J. Martin, Adam Frankish, Jonathan M. Mudge, Myrto Kostadima, Romina Petersen, John J. Lambourne, Sophia Rowlston, Enca Martin-Rendon, Laura Clarke, Kate Downes, Xavier Estivill, Paul Flicek, Joost H.A. Martens, Marie-Laure Yaspo, Hendrik G. Stunnenberg, Willem H. Ouwehand, Fabio Passetti, Ernest Turro, and Mattia Frontini
- Subjects
Diseases of the blood and blood-forming organs; RC633-647.5
- Abstract
Transcriptional profiling of hematopoietic cell subpopulations has helped to characterize the developmental stages of the hematopoietic system and the molecular bases of malignant and non-malignant blood diseases. Previously, only the genes targeted by expression microarrays could be profiled genome-wide. High-throughput RNA sequencing, however, encompasses a broader repertoire of RNA molecules, without restriction to previously annotated genes. We analyzed the BLUEPRINT consortium RNA-sequencing data for mature hematopoietic cell types. The data comprised 90 total RNA-sequencing samples, each composed of one of 27 cell types, and 32 small RNA-sequencing samples, each composed of one of 11 cell types. We estimated gene and isoform expression levels for each cell type using existing annotations from Ensembl. We then used guided transcriptome assembly to discover unannotated transcripts. We identified hundreds of novel non-coding RNA genes and showed that the majority have cell type-dependent expression. We also characterized the expression of circular RNA and found that these are also cell type-specific. These analyses refine the active transcriptional landscape of mature hematopoietic cells, highlight abundant genes and transcriptional isoforms for each blood cell type, and provide a valuable resource for researchers of hematologic development and diseases. Finally, we made the data accessible via a web-based interface: https://blueprint.haem.cam.ac.uk/bloodatlas/.
- Published
- 2020
- Full Text
- View/download PDF
7. Genome annotation for clinical genomic diagnostics: strengths and weaknesses
- Author
Charles A. Steward, Alasdair P. J. Parker, Berge A. Minassian, Sanjay M. Sisodiya, Adam Frankish, and Jennifer Harrow
- Subjects
Medicine; Genetics; QH426-470
- Abstract
The Human Genome Project and advances in DNA sequencing technologies have revolutionized the identification of genetic disorders through the use of clinical exome sequencing. However, in a considerable number of patients, the genetic basis remains unclear. As clinicians begin to consider whole-genome sequencing, an understanding of the processes and tools involved and the factors to consider in the annotation of the structure and function of genomic elements that might influence variant identification is crucial. Here, we discuss and illustrate the strengths and weaknesses of approaches for the annotation and classification of important elements of protein-coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation, as a practical guide for clinicians when considering pathogenic sequence variation. Complete and accurate annotation of structure and function of genome features has the potential to reduce both false-negative (from missing annotation) and false-positive (from incorrect annotation) errors in causal variant identification in exome and genome sequences. Re-analysis of unsolved cases will be necessary as newer technology improves genome annotation, potentially improving the rate of diagnosis.
- Published
- 2017
- Full Text
- View/download PDF
8. Getting the Entire Message: Progress in Isoform Sequencing
- Author
Simon A. Hardwick, Anoushka Joglekar, Paul Flicek, Adam Frankish, and Hagen U. Tilgner
- Subjects
RNA; isoforms; long-read; splicing; epitranscriptome; Genetics; QH426-470
- Abstract
The advent of second-generation sequencing and its application to RNA sequencing have revolutionized the field of genomics by allowing quantification of gene expression, as well as the definition of transcription start/end sites, exons, splice sites and RNA editing sites. However, due to the sequencing of fragments of cDNAs, these methods have not given a reliable picture of complete RNA isoforms. Third-generation sequencing has filled this gap and allows end-to-end sequencing of entire RNA/cDNA molecules. This approach to transcriptomics has been a “niche” technology for a couple of years but now is becoming mainstream with many different applications. Here, we review the background and progress made to date in this rapidly growing field. We start by reviewing the progressive realization that alternative splicing is omnipresent. We then focus on long-noncoding RNA isoforms and the distinct combination patterns of exons in noncoding and coding genes. We consider the implications of the recent technologies of direct RNA sequencing and single-cell isoform RNA sequencing. Finally, we discuss the parameters that define the success of long-read RNA sequencing experiments and strategies commonly used to make the most of such data.
- Published
- 2019
- Full Text
- View/download PDF
9. Evidence for transcript networks composed of chimeric RNAs in human cells.
- Author
Sarah Djebali, Julien Lagarde, Philipp Kapranov, Vincent Lacroix, Christelle Borel, Jonathan M Mudge, Cédric Howald, Sylvain Foissac, Catherine Ucla, Jacqueline Chrast, Paolo Ribeca, David Martin, Ryan R Murray, Xinping Yang, Lila Ghamsari, Chenwei Lin, Ian Bell, Erica Dumais, Jorg Drenkow, Michael L Tress, Josep Lluís Gelpí, Modesto Orozco, Alfonso Valencia, Nynke L van Berkum, Bryan R Lajoie, Marc Vidal, John Stamatoyannopoulos, Philippe Batut, Alex Dobin, Jennifer Harrow, Tim Hubbard, Job Dekker, Adam Frankish, Kourosh Salehi-Ashtiani, Alexandre Reymond, Stylianos E Antonarakis, Roderic Guigó, and Thomas R Gingeras
- Subjects
Medicine; Science
- Abstract
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5' and 3' transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three-dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
- Published
- 2012
- Full Text
- View/download PDF
10. A draft human pangenome reference
- Author
Wen-Wei Liao, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness, Glenn Hickey, Shuangjia Lu, Julian K. Lucas, Jean Monlong, Haley J. Abel, Silvia Buonaiuto, Xian H. Chang, Haoyu Cheng, Justin Chu, Vincenza Colonna, Jordan M. Eizenga, Xiaowen Feng, Christian Fischer, Robert S. Fulton, Shilpa Garg, Cristian Groza, Andrea Guarracino, William T. Harvey, Simon Heumos, Kerstin Howe, Miten Jain, Tsung-Yu Lu, Charles Markello, Fergal J. Martin, Matthew W. Mitchell, Katherine M. Munson, Moses Njagi Mwaniki, Adam M. Novak, Hugh E. Olsen, Trevor Pesout, David Porubsky, Pjotr Prins, Jonas A. Sibbesen, Jouni Sirén, Chad Tomlinson, Flavia Villani, Mitchell R. Vollger, Lucinda L. Antonacci-Fulton, Gunjan Baid, Carl A. Baker, Anastasiya Belyaeva, Konstantinos Billis, Andrew Carroll, Pi-Chuan Chang, Sarah Cody, Daniel E. Cook, Robert M. Cook-Deegan, Omar E. Cornejo, Mark Diekhans, Peter Ebert, Susan Fairley, Olivier Fedrigo, Adam L. Felsenfeld, Giulio Formenti, Adam Frankish, Yan Gao, Nanibaa’ A. Garrison, Carlos Garcia Giron, Richard E. Green, Leanne Haggerty, Kendra Hoekzema, Thibaut Hourlier, Hanlee P. Ji, Eimear E. Kenny, Barbara A. Koenig, Alexey Kolesnikov, Jan O. Korbel, Jennifer Kordosky, Sergey Koren, HoJoon Lee, Alexandra P. Lewis, Hugo Magalhães, Santiago Marco-Sola, Pierre Marijon, Ann McCartney, Jennifer McDaniel, Jacquelyn Mountcastle, Maria Nattestad, Sergey Nurk, Nathan D. Olson, Alice B. Popejoy, Daniela Puiu, Mikko Rautiainen, Allison A. Regier, Arang Rhie, Samuel Sacco, Ashley D. Sanders, Valerie A. Schneider, Baergen I. Schultz, Kishwar Shafin, Michael W. Smith, Heidi J. Sofia, Ahmad N. Abou Tayoun, Françoise Thibaud-Nissen, Francesca Floriana Tricomi, Justin Wagner, Brian Walenz, Jonathan M. D. Wood, Aleksey V. Zimin, Guillaume Bourque, Mark J. P. Chaisson, Paul Flicek, Adam M. Phillippy, Justin M. Zook, Evan E. Eichler, David Haussler, Ting Wang, Erich D. Jarvis, Karen H. Miga, Erik Garrison, Tobias Marschall, Ira M. Hall, Heng Li, and Benedict Paten
- Subjects
Cancer Research; Multidisciplinary
- Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals [1]. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
- Published
- 2023
- Full Text
- View/download PDF
11. SCN1A: bioinformatically informed revised boundaries for promoter and enhancer regions
- Author
Susanna Pagni, Helena Martins Custodio, Adam Frankish, Jonathan M Mudge, James D Mills, and Sanjay M Sisodiya
- Subjects
Genetics; General Medicine; Molecular Biology; Genetics (clinical)
- Abstract
Pathogenic variations in the sodium voltage-gated channel alpha subunit 1 (SCN1A) gene are responsible for multiple epilepsy phenotypes, including Dravet syndrome, febrile seizures (FS) and genetic epilepsy with FS plus. Phenotypic heterogeneity is a hallmark of SCN1A-related epilepsies, the causes of which are yet to be clarified. Genetic variation in the non-coding regulatory regions of SCN1A could be one potential causal factor. However, a comprehensive understanding of the SCN1A regulatory landscape is currently lacking. Here, we summarized the current state of knowledge of SCN1A regulation, providing details on its promoter and enhancer regions. We then integrated currently available data on SCN1A promoters by extracting information related to the SCN1A locus from genome-wide repositories and clearly defined the promoter and enhancer regions of SCN1A. Further, we explored the cellular specificity of differential SCN1A promoter usage. We also reviewed and integrated the available human brain-derived enhancer databases and mouse-derived data to provide a comprehensive computationally developed summary of SCN1A brain-active enhancers. By querying genome-wide data repositories, extracting SCN1A-specific data and integrating the different types of independent evidence, we created a comprehensive catalogue that better defines the regulatory landscape of SCN1A, which could be used to explore the role of SCN1A regulatory regions in disease.
- Published
- 2023
- Full Text
- View/download PDF
12. Recombination between heterologous human acrocentric chromosomes
- Author
Barcelona Supercomputing Center, Human Pangenome Reference Consortium: Haley J. Abel, Lucinda L. Antonacci-Fulton, Mobin Asri, Gunjan Baid, Carl A. Baker, Anastasiya Belyaeva, Konstantinos Billis, Guillaume Bourque, Silvia Buonaiuto, Andrew Carroll, Mark J. P. Chaisson, Pi-Chuan Chang, Xian H. Chang, Haoyu Cheng, Justin Chu, Sarah Cody, Vincenza Colonna, Daniel E. Cook, Robert M. Cook-Deegan, Omar E. Cornejo, Mark Diekhans, Daniel Doerr, Peter Ebert, Jana Ebler, Evan E. Eichler, Jordan M. Eizenga, Susan Fairley, Olivier Fedrigo, Adam L. Felsenfeld, Xiaowen Feng, Christian Fischer, Paul Flicek, Giulio Formenti, Adam Frankish, Robert S. Fulton, Yan Gao, Shilpa Garg, Erik Garrison, Nanibaa’ A. Garrison, Carlos Garcia Giron, Richard E. Green, Cristian Groza, Andrea Guarracino, Leanne Haggerty, Ira Hall, William T. Harvey, Marina Haukness, David Haussler, Simon Heumos, Glenn Hickey, Kendra Hoekzema, Thibaut Hourlier, Kerstin Howe, Miten Jain, Erich D. Jarvis, Hanlee P. Ji, Eimear E. Kenny, Barbara A. Koenig, Alexey Kolesnikov, Jan O. Korbel, Jennifer Kordosky, Sergey Koren, HoJoon Lee, Alexandra P. Lewis, Heng Li, Wen-Wei Liao, Shuangjia Lu, Tsung-Yu Lu, Julian K. Lucas, Hugo Magalhães, Santiago Marco-Sola, Pierre Marijon, Charles Markello, Tobias Marschall, Fergal J. Martin, Ann McCartney, Jennifer McDaniel, Karen H. Miga, Matthew W. Mitchell, Jean Monlong, Jacquelyn Mountcastle, Katherine M. Munson, Moses Njagi Mwaniki, Maria Nattestad, Adam M. Novak, Sergey Nurk, Hugh E. Olsen, Nathan D. Olson, Benedict Paten, Trevor Pesout, Adam M. Phillippy, Alice B. Popejoy, David Porubsky, Pjotr Prins, Daniela Puiu, Mikko Rautiainen, Allison A. Regier, Arang Rhie, Samuel Sacco, Ashley D. Sanders, Valerie A. Schneider, Baergen I. Schultz, Kishwar Shafin, Jonas A. Sibbesen, Jouni Sirén, Michael W. Smith, Heidi J. Sofia, Ahmad N. Abou Tayoun, Françoise Thibaud-Nissen, Chad Tomlinson, Francesca Floriana Tricomi, Flavia Villani, Mitchell R. Vollger, Justin Wagner, Brian Walenz, Ting Wang, Jonathan M. D. Wood, Aleksey; Guarracino, Andrea; Buonaiuto, Silvia; Gomes de Lima, Leonardo; Potapova, Tamara; Rhie, Arang; and Marco, Santiago
- Abstract
The short arms of the human acrocentric chromosomes 13, 14, 15, 21 and 22 (SAACs) share large homologous regions, including ribosomal DNA repeats and extended segmental duplications [1,2]. Although the resolution of these regions in the first complete assembly of a human genome—the Telomere-to-Telomere Consortium’s CHM13 assembly (T2T-CHM13)—provided a model of their homology [3], it remained unclear whether these patterns were ancestral or maintained by ongoing recombination exchange. Here we show that acrocentric chromosomes contain pseudo-homologous regions (PHRs) indicative of recombination between non-homologous sequences. Utilizing an all-to-all comparison of the human pangenome from the Human Pangenome Reference Consortium [4] (HPRC), we find that contigs from all of the SAACs form a community. A variation graph [5] constructed from centromere-spanning acrocentric contigs indicates the presence of regions in which most contigs appear nearly identical between heterologous acrocentric chromosomes in T2T-CHM13. Except on chromosome 15, we observe faster decay of linkage disequilibrium in the pseudo-homologous regions than in the corresponding short and long arms, indicating higher rates of recombination [6,7]. The pseudo-homologous regions include sequences that have previously been shown to lie at the breakpoint of Robertsonian translocations [8], and their arrangement is compatible with crossover in inverted duplications on chromosomes 13, 14 and 21. The ubiquity of signals of recombination between heterologous acrocentric chromosomes seen in the HPRC draft pangenome suggests that these shared sequences form the basis for recurrent Robertsonian translocations, providing sequence and population-based confirmation of hypotheses first developed from cytogenetic studies 50 years ago [9].
Our work depends on the HPRC draft human pangenome resource established in the accompanying Article [4], and we thank the production and assembly groups for their efforts in establishing this resource. This work used the computational resources of the UTHSC Octopus cluster and NIH HPC Biowulf cluster. We acknowledge support in maintaining these systems that was critical to our analyses. The authors thank M. Miller for the development of a graphical synopsis of our study (Fig. 5); and R. Williams and N. Soranzo for support and guidance in the design and discussion of our work. This work was supported, in part, by National Institutes of Health/NIDA U01DA047638 (E.G.), National Institutes of Health/NIGMS R01GM123489 (E.G.), NSF PPoSS Award no. 2118709 (E.G. and C.F.), the Tennessee Governor’s Chairs programme (C.F. and E.G.), National Institutes of Health/NCI R01CA266339 (T.P., L.G.d.L. and J.L.G.), and the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (A.R., S.K. and A.M.P.). We acknowledge support from Human Technopole (A.G.), Consiglio Nazionale delle Ricerche, Italy (S.B. and V.C.), and Stowers Institute for Medical Research (T.P., L.G.d.L., B.R. and J.L.G.).
Peer reviewed. Article signed by 13 authors: Andrea Guarracino, Silvia Buonaiuto, Leonardo Gomes de Lima, Tamara Potapova, Arang Rhie, Sergey Koren, Boris Rubinstein, Christian Fischer, Human Pangenome Reference Consortium, Jennifer L. Gerton, Adam M. Phillippy, Vincenza Colonna and Erik Garrison. Postprint (published version).
- Published
- 2023
13. SCN1A overexpression, associated with a genomic region marked by a risk variant for a common epilepsy, raises seizure susceptibility
- Author
Katri Silvennoinen, Kinga Gawel, Despina Tsortouktzidis, Julika Pitsch, Saud Alhusaini, Karen M. J. van Loo, Richard Picardo, Zuzanna Michalak, Susanna Pagni, Helena Martins Custodio, James Mills, Christopher D. Whelan, Greig I. de Zubicaray, Katie L. McMahon, Wietske van der Ent, Karolina J. Kirstein-Smardzewska, Ettore Tiraboschi, Jonathan M. Mudge, Adam Frankish, Maria Thom, Margaret J. Wright, Paul M. Thompson, Susanne Schoch, Albert J. Becker, Camila V. Esguerra, and Sanjay M. Sisodiya
- Subjects
Epilepsy; Sclerosis; Genomics; Zebrafish Proteins; Hippocampus; Seizures, Febrile; Pathology and Forensic Medicine; NAV1.1 Voltage-Gated Sodium Channel; Cellular and Molecular Neuroscience; Epilepsy, Temporal Lobe; Animals; Humans; Gliosis; Neurology (clinical); Zebrafish
- Abstract
Acta Neuropathol (Berl) 144(1), 107-127 (2022). doi:10.1007/s00401-022-02429-0. Published by Springer, Berlin; Heidelberg
- Published
- 2022
- Full Text
- View/download PDF
14. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research
- Author
Joannella Morales, Shashikant Pujar, Jane E. Loveland, Alex Astashyn, Ruth Bennett, Andrew Berry, Eric Cox, Claire Davidson, Olga Ermolaeva, Catherine M. Farrell, Reham Fatima, Laurent Gil, Tamara Goldfarb, Jose M. Gonzalez, Diana Haddad, Matthew Hardy, Toby Hunt, John Jackson, Vinita S. Joardar, Michael Kay, Vamsi K. Kodali, Kelly M. McGarvey, Aoife McMahon, Jonathan M. Mudge, Daniel N. Murphy, Michael R. Murphy, Bhanu Rajput, Sanjida H. Rangwala, Lillian D. Riddick, Françoise Thibaud-Nissen, Glen Threadgold, Anjana R. Vatsan, Craig Wallin, David Webb, Paul Flicek, Ewan Birney, Kim D. Pruitt, Adam Frankish, Fiona Cunningham, and Terence D. Murphy
- Subjects
Genome; Multidisciplinary; National Library of Medicine (U.S.); Information Dissemination; Databases, Genetic; Computational Biology; Humans; Molecular Sequence Annotation; Genomics; United States
- Abstract
Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE [1] and RefSeq [2] launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 [3] genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.
- Published
- 2022
- Full Text
- View/download PDF
15. Ensembl 2022
- Author
Fiona Cunningham, James E Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Ruth Bennett, Andrew Berry, Jyothish Bhai, Alexandra Bignell, Konstantinos Billis, Sanjay Boddu, Lucy Brooks, Mehrnaz Charkhchi, Carla Cummins, Luca Da Rin Fioretto, Claire Davidson, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Jose Gonzalez Martinez, Cristina Guijarro-Clarke, Arthur Gymer, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Thomas Juettemann, Vinay Kaikala, Mike Kay, Ilias Lavidas, Tuan Le, Diana Lemos, José Carlos Marugán, Shamika Mohanan, Aleena Mushtaq, Marc Naven, Denye N Ogeh, Anne Parker, Andrew Parton, Malcolm Perry, Ivana Piližota, Irina Prosovetskaia, Manoj Pandian Sakthivel, Ahamed Imran Abdul Salam, Bianca M Schmitt, Helen Schuilenburg, Dan Sheppard, José G Pérez-Silva, William Stark, Emily Steed, Kyösti Sutinen, Ranjit Sukumaran, Dulika Sumathipala, Marie-Marthe Suner, Michal Szpak, Anja Thormann, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Brandon Walts, Natalie Willhoft, Andrea Winterbottom, Elizabeth Wass, Marc Chakiachvili, Bethany Flint, Adam Frankish, Stefano Giorgetti, Leanne Haggerty, Sarah E Hunt, Garth R Ilsley, Jane E Loveland, Fergal J Martin, Benjamin Moore, Jonathan M Mudge, Matthieu Muffato, Emily Perry, Magali Ruffier, John Tate, David Thybert, Stephen J Trevanion, Sarah Dyer, Peter W Harrison, Kevin L Howe, Andrew D Yates, Daniel R Zerbino, and Paul Flicek
- Subjects
Genome; ComputingMethodologies_PATTERNRECOGNITION; AcademicSubjects/SCI00010; Databases, Genetic; Genetics; Database Issue; Animals; Computational Biology; Humans; Molecular Sequence Annotation; GeneralLiterature_REFERENCE (e.g., dictionaries, encyclopedias, glossaries); Software
- Abstract
Ensembl (https://www.ensembl.org) is unique in its flexible infrastructure for access to genomic data and annotation. It has been designed to efficiently deliver annotation at scale for all eukaryotic life, and it also provides deep comprehensive annotation for key species. Genomes representing a greater diversity of species are increasingly being sequenced. In response, we have focussed our recent efforts on expediting the annotation of new assemblies. Here, we report the release of the greatest annual number of newly annotated genomes in the history of Ensembl via our dedicated Ensembl Rapid Release platform (http://rapid.ensembl.org). We have also developed a new method to generate comparative analyses at scale for these assemblies and, for the first time, we have annotated non-vertebrate eukaryotes. Meanwhile, we continually improve, extend and update the annotation for our high-value reference vertebrate genomes and report the details here. We have a range of specific software tools for specific tasks, such as the Ensembl Variant Effect Predictor (VEP) and the newly developed interface for the Variant Recoder. All Ensembl data, software and tools are freely available for download and are accessible programmatically.
- Published
- 2021
16. Ensembl 2023
- Author
-
Fergal J Martin, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, Simarpreet Kaur Bhurji, Alexandra Bignell, Sanjay Boddu, Paulo R Branco Lins, Lucy Brooks, Shashank Budhanuru Ramaraju, Mehrnaz Charkhchi, Alexander Cockburn, Luca Da Rin Fioretto, Claire Davidson, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Gurpreet S Ghattaoraya, Jose Gonzalez Martinez, Cristi Guijarro, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Mike Kay, Vinay Kaykala, Tuan Le, Diana Lemos, Diego Marques-Coelho, José Carlos Marugán, Gabriela Alejandra Merino, Louisse Paola Mirabueno, Aleena Mushtaq, Syed Nakib Hossain, Denye N Ogeh, Manoj Pandian Sakthivel, Anne Parker, Malcolm Perry, Ivana Piližota, Irina Prosovetskaia, José G Pérez-Silva, Ahamed Imran Abdul Salam, Nuno Saraiva-Agostinho, Helen Schuilenburg, Dan Sheppard, Swati Sinha, Botond Sipos, William Stark, Emily Steed, Ranjit Sukumaran, Dulika Sumathipala, Marie-Marthe Suner, Likhitha Surapaneni, Kyösti Sutinen, Michal Szpak, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Brandon Walts, Elizabeth Wass, Natalie Willhoft, Jamie Allen, Jorge Alvarez-Jarreta, Marc Chakiachvili, Bethany Flint, Stefano Giorgetti, Leanne Haggerty, Garth R Ilsley, Jane E Loveland, Benjamin Moore, Jonathan M Mudge, John Tate, David Thybert, Stephen J Trevanion, Andrea Winterbottom, Adam Frankish, Sarah E Hunt, Magali Ruffier, Fiona Cunningham, Sarah Dyer, Robert D Finn, Kevin L Howe, Peter W Harrison, Andrew D Yates, and Paul Flicek
- Abstract
Ensembl (https://www.ensembl.org) has produced high-quality genomic resources for vertebrates and model organisms for more than twenty years. During that time, our resources, services and tools have continually evolved in line with both the publicly available genome data and the downstream research and applications that utilise the Ensembl platform. In recent years we have witnessed a dramatic shift in the genomic landscape. There has been a large increase in the number of high-quality reference genomes through global biodiversity initiatives. In parallel, there have been major advances towards pangenome representations of higher species, where many alternative genome assemblies representing different breeds, cultivars, strains and haplotypes are now available. In order to support these efforts and accelerate downstream research, it is our goal at Ensembl to create high-quality annotations, tools and services for species across the tree of life. Here, we report our resources for popular reference genomes, the dramatic growth of our annotations (including haplotypes from the first human pangenome graphs), updates to the Ensembl Variant Effect Predictor (VEP), interactive protein structure predictions from AlphaFold DB, and the beta release of our new website.
- Published
- 2022
17. Ensembl 2021
- Author
-
Kevin L Howe, Premanand Achuthan, James Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Andrey G Azov, Ruth Bennett, Jyothish Bhai, Konstantinos Billis, Sanjay Boddu, Mehrnaz Charkhchi, Carla Cummins, Luca Da Rin Fioretto, Claire Davidson, Kamalkumar Dodiya, Bilal El Houdaigui, Reham Fatima, Astrid Gall, Carlos Garcia Giron, Tiago Grego, Cristina Guijarro-Clarke, Leanne Haggerty, Anmol Hemrom, Thibaut Hourlier, Osagie G Izuogu, Thomas Juettemann, Vinay Kaikala, Mike Kay, Ilias Lavidas, Tuan Le, Diana Lemos, Jose Gonzalez Martinez, José Carlos Marugán, Thomas Maurel, Aoife C McMahon, Shamika Mohanan, Benjamin Moore, Matthieu Muffato, Denye N Ogeh, Dimitrios Paraschas, Anne Parker, Andrew Parton, Irina Prosovetskaia, Manoj P Sakthivel, Ahamed I Abdul Salam, Bianca M Schmitt, Helen Schuilenburg, Dan Sheppard, Emily Steed, Michal Szpak, Marek Szuba, Kieron Taylor, Anja Thormann, Glen Threadgold, Brandon Walts, Andrea Winterbottom, Marc Chakiachvili, Ameya Chaubal, Nishadi De Silva, Bethany Flint, Adam Frankish, Sarah E Hunt, Garth R Ilsley, Nick Langridge, Jane E Loveland, Fergal J Martin, Jonathan M Mudge, Joanella Morales, Emily Perry, Magali Ruffier, John Tate, David Thybert, Stephen J Trevanion, Fiona Cunningham, Andrew D Yates, Daniel R Zerbino, and Paul Flicek
- Subjects
Internet; 0303 health sciences; SARS-CoV-2; AcademicSubjects/SCI00010; 030302 biochemistry & molecular biology; COVID-19; Computational Biology; Molecular Sequence Annotation; Genomics; 03 medical and health sciences; ComputingMethodologies_PATTERNRECOGNITION; Vertebrates; Genetics; Animals; Humans; Database Issue; Databases, Nucleic Acid; Pandemics; 030304 developmental biology
- Abstract
The Ensembl project (https://www.ensembl.org) annotates genomes and disseminates genomic data for vertebrate species. We create detailed and comprehensive annotation of gene structures, regulatory elements and variants, and enable comparative genomics by inferring the evolutionary history of genes and genomes. Our integrated genomic data are made available in a variety of ways, including genome browsers, search interfaces, specialist tools such as the Ensembl Variant Effect Predictor, download files and programmatic interfaces. Here, we present recent Ensembl developments including two new website portals. Ensembl Rapid Release (http://rapid.ensembl.org) is designed to provide core tools and services for genomes as soon as possible and has been deployed to support large biodiversity sequencing projects. Our SARS-CoV-2 genome browser (https://covid-19.ensembl.org) integrates our own annotation with publicly available genomic data from numerous sources to facilitate the use of genomics in the international scientific response to the COVID-19 pandemic. We also report on other updates to our annotation resources, tools and services. All Ensembl data and software are freely available without restriction.
- Published
- 2020
18. Progress, Challenges, and Surprises in Annotating the Human Genome
- Author
-
Paul Flicek, Daniel R. Zerbino, and Adam Frankish
- Subjects
Computer science; Computational biology; Genome; Article; 03 medical and health sciences; Annotation; 0302 clinical medicine; Genetics; Humans; human; genes; genome; Molecular Biology; Gene; Genetics (clinical); 030304 developmental biology; Sequence (medicine); variants; 0303 health sciences; Genome, Human; business.industry; Molecular Sequence Annotation; annotation; Knowledge base; Human genome; regulatory elements; Transcription (software); business; 030217 neurology & neurosurgery; Reference genome
- Abstract
Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.
- Published
- 2020
19. GENCODE: reference annotation for the human and mouse genomes in 2023
- Author
-
Adam Frankish, Sílvia Carbonell-Sala, Mark Diekhans, Irwin Jungreis, Jane E Loveland, Jonathan M Mudge, Cristina Sisu, James C Wright, Carme Arnan, If Barnes, Abhimanyu Banerjee, Ruth Bennett, Andrew Berry, Alexandra Bignell, Carles Boix, Ferriol Calvet, Daniel Cerdán-Vélez, Fiona Cunningham, Claire Davidson, Sarah Donaldson, Cagatay Dursun, Reham Fatima, Stefano Giorgetti, Carlos García Giron, Jose Manuel Gonzalez, Matthew Hardy, Peter W Harrison, Thibaut Hourlier, Zoe Hollis, Toby Hunt, Benjamin James, Yunzhe Jiang, Rory Johnson, Mike Kay, Julien Lagarde, Fergal J Martin, Laura Martínez Gómez, Surag Nair, Pengyu Ni, Fernando Pozo, Vivek Ramalingam, Magali Ruffier, Bianca M Schmitt, Jacob M Schreiber, Emily Steed, Marie-Marthe Suner, Dulika Sumathipala, Irina Sycheva, Barbara Uszczynska-Ratajczak, Elizabeth Wass, Yucheng T Yang, Andrew Yates, Zahoor Zafrulla, Jyoti S Choudhary, Mark Gerstein, Roderic Guigo, Tim J P Hubbard, Manolis Kellis, Anshul Kundaje, Benedict Paten, Michael L Tress, and Paul Flicek
- Subjects
Genetics; 610 Medicine & health
- Abstract
GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org. Funding: National Human Genome Research Institute of the National Institutes of Health [U41HG007234, R01HG004037]; Wellcome Trust [WT222155/Z/20/Z]; European Molecular Biology Laboratory. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Funding for open access charge: National Institutes of Health. Data availability: No new data were generated or analysed in support of this research. Copyright © The Author(s) 2022.
- Published
- 2022
20. Ensembl 2023
- Author
-
Fergal J Martin, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, Simarpreet Kaur Bhurji, Alexandra Bignell, Sanjay Boddu, Paulo R Branco Lins, Lucy Brooks, Shashank Budhanuru Ramaraju, Mehrnaz Charkhchi, Alexander Cockburn, Luca Da Rin Fioretto, Claire Davidson, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Gurpreet S Ghattaoraya, Jose Gonzalez Martinez, Cristi Guijarro, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Mike Kay, Vinay Kaykala, Tuan Le, Diana Lemos, Diego Marques-Coelho, José Carlos Marugán, Gabriela Alejandra Merino, Louisse Paola Mirabueno, Aleena Mushtaq, Syed Nakib Hossain, Denye N Ogeh, Manoj Pandian Sakthivel, Anne Parker, Malcolm Perry, Ivana Piližota, Irina Prosovetskaia, José G Pérez-Silva, Ahamed Imran Abdul Salam, Nuno Saraiva-Agostinho, Helen Schuilenburg, Dan Sheppard, Swati Sinha, Botond Sipos, William Stark, Emily Steed, Ranjit Sukumaran, Dulika Sumathipala, Marie-Marthe Suner, Likhitha Surapaneni, Kyösti Sutinen, Michal Szpak, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Brandon Walts, Elizabeth Wass, Natalie Willhoft, Jamie Allen, Jorge Alvarez-Jarreta, Marc Chakiachvili, Bethany Flint, Stefano Giorgetti, Leanne Haggerty, Garth R Ilsley, Jane E Loveland, Benjamin Moore, Jonathan M Mudge, John Tate, David Thybert, Stephen J Trevanion, Andrea Winterbottom, Adam Frankish, Sarah E Hunt, Magali Ruffier, Fiona Cunningham, Sarah Dyer, Robert D Finn, Kevin L Howe, Peter W Harrison, Andrew D Yates, and Paul Flicek
- Subjects
Genetics
- Abstract
Ensembl (https://www.ensembl.org) has produced high-quality genomic resources for vertebrates and model organisms for more than twenty years. During that time, our resources, services and tools have continually evolved in line with both the publicly available genome data and the downstream research and applications that utilise the Ensembl platform. In recent years we have witnessed a dramatic shift in the genomic landscape. There has been a large increase in the number of high-quality reference genomes through global biodiversity initiatives. In parallel, there have been major advances towards pangenome representations of higher species, where many alternative genome assemblies representing different breeds, cultivars, strains and haplotypes are now available. In order to support these efforts and accelerate downstream research, it is our goal at Ensembl to create high-quality annotations, tools and services for species across the tree of life. Here, we report our resources for popular reference genomes, the dramatic growth of our annotations (including haplotypes from the first human pangenome graphs), updates to the Ensembl Variant Effect Predictor (VEP), interactive protein structure predictions from AlphaFold DB, and the beta release of our new website.
- Published
- 2022
21. Ensembl 2022
- Author
-
Fiona Cunningham, James E Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Ruth Bennett, Andrew Berry, Jyothish Bhai, Alexandra Bignell, Konstantinos Billis, Sanjay Boddu, Lucy Brooks, Mehrnaz Charkhchi, Carla Cummins, Luca Da Rin Fioretto, Claire Davidson, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Jose Gonzalez Martinez, Cristina Guijarro-Clarke, Arthur Gymer, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Thomas Juettemann, Vinay Kaikala, Mike Kay, Ilias Lavidas, Tuan Le, Diana Lemos, José Carlos Marugán, Shamika Mohanan, Aleena Mushtaq, Marc Naven, Denye N Ogeh, Anne Parker, Andrew Parton, Malcolm Perry, Ivana Piližota, Irina Prosovetskaia, Manoj Pandian Sakthivel, Ahamed Imran Abdul Salam, Bianca M Schmitt, Helen Schuilenburg, Dan Sheppard, José G Pérez-Silva, William Stark, Emily Steed, Kyösti Sutinen, Ranjit Sukumaran, Dulika Sumathipala, Marie-Marthe Suner, Michal Szpak, Anja Thormann, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Brandon Walts, Natalie Willhoft, Andrea Winterbottom, Elizabeth Wass, Marc Chakiachvili, Bethany Flint, Adam Frankish, Stefano Giorgetti, Leanne Haggerty, Sarah E Hunt, Garth R Ilsley, Jane E Loveland, Fergal J Martin, Benjamin Moore, Jonathan M Mudge, Matthieu Muffato, Emily Perry, Magali Ruffier, John Tate, David Thybert, Stephen J Trevanion, Sarah Dyer, Peter W Harrison, Kevin L Howe, Andrew D Yates, Daniel R Zerbino, and Paul Flicek
- Abstract
Ensembl (https://www.ensembl.org) is unique in its flexible infrastructure for access to genomic data and annotation. It has been designed to efficiently deliver annotation at scale for all eukaryotic life, and it also provides deep comprehensive annotation for key species. Genomes representing a greater diversity of species are increasingly being sequenced. In response, we have focussed our recent efforts on expediting the annotation of new assemblies. Here, we report the release of the greatest annual number of newly annotated genomes in the history of Ensembl via our dedicated Ensembl Rapid Release platform (http://rapid.ensembl.org). We have also developed a new method to generate comparative analyses at scale for these assemblies and, for the first time, we have annotated non-vertebrate eukaryotes. Meanwhile, we continually improve, extend and update the annotation for our high-value reference vertebrate genomes and report the details here. We have a range of specific software tools for specific tasks, such as the Ensembl Variant Effect Predictor (VEP) and the newly developed interface for the Variant Recoder. All Ensembl data, software and tools are freely available for download and are accessible programmatically.
- Published
- 2021
22. Non-coding regulatory elements: Potential roles in disease and the case of epilepsy
- Author
-
Susanna Pagni, James D Mills, Sanjay M. Sisodiya, Adam Frankish, and Jonathan M. Mudge
- Subjects
Regulation of gene expression; Histology; Epilepsy; Genome; RNA; Computational biology; Biology; Noncoding DNA; Article; Pathology and Forensic Medicine; MicroRNAs; Neurology; Regulatory sequence; Physiology (medical); microRNA; Humans; Human genome; RNA, Long Noncoding; Neurology (clinical); Enhancer
- Abstract
Non-coding DNA (ncDNA) refers to the portion of the genome that does not code for proteins and accounts for the greatest physical proportion of the human genome. ncDNA includes sequences that are transcribed into RNA molecules, such as ribosomal RNAs (rRNAs), microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and un-transcribed sequences that have regulatory functions, including gene promoters and enhancers. Variation in non-coding regions of the genome has an established role in human disease, with growing evidence from many areas, including several cancers, Parkinson's disease and autism. Here, we review the features and functions of the regulatory elements that are present in the non-coding genome and the role that these regions have in human disease. We then review the existing research in epilepsy and emphasise the potential value of further exploring non-coding regulatory elements in epilepsy. In addition, we outline the most widely used techniques for recognising regulatory elements throughout the genome, current methodologies for investigating variation and the main challenges associated with research in the field of non-coding DNA.
- Published
- 2021
23. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification
- Author
-
Hazuki Takahashi, Christopher Vollmers, Stefan Goetz, Margaret E. Hunter, Gabriela Balderrama-Gutierrez, David Moraga, Maite De María, Mark Diekhans, Francisco Jose Pardo-Palacios, Piero Carninci, Jonathan M. Mudge, Roderic Guigó, Gloria M. Sheynkman, Ingrid Youngworth, Kin Fai Au, Matthew Adams, Muhammed Hasan Çelik, Haoran Li, Alison D. Tang, Ana Conesa, Silvia Carbonell-Sala, Barbara J. Wold, Angela N. Brooks, Brian Williams, Julien Lagarde, Cindy Liang, Fairlie Reese, Ali Mortazavi, Andrey D. Prjibelski, Natàlia Garcia-Reyero, Jane E. Loveland, Nancy D. Denslow, Dingjie Wang, Amit Behera, Adam Frankish, Hagen Tilgner, and Carlos Menor
- Subjects
Identification (biology); RNA-Seq; Computational biology; Biology
- Abstract
With increased usage of long-read sequencing technologies to perform transcriptome analyses, there is a growing need to evaluate different methodologies, including library preparation, sequencing platform, and computational analysis tools. Here, we report the study design of a community effort called the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, whose goals are to characterize the strengths and remaining challenges in using long-read approaches to identify and quantify the transcriptomes of both model and non-model organisms. The LRGASP organizers have generated cDNA and direct RNA datasets in human, mouse, and manatee samples using different protocols followed by sequencing on Illumina, Pacific Biosciences, and Oxford Nanopore Technologies platforms. Participants will use the provided data to submit predictions for three challenges: transcript isoform detection with a high-quality genome, transcript isoform quantification, and de novo transcript isoform identification. Evaluators from different institutions will determine which pipelines have the highest accuracy for a variety of metrics using benchmarks that include spike-in synthetic transcripts, simulated data, and a set of undisclosed, manually curated transcripts by GENCODE. We also describe plans for experimental validation of predictions that are platform-specific and computational tool-specific. We believe that a community effort to evaluate long-read RNA-seq methods will help move the field toward a better consensus on the best approaches to use for transcriptome analyses.
- Published
- 2021
24. The value of primary transcripts to the clinical and non‐clinical genomics community: Survey results and roadmap for improvements
- Author
-
Jane E. Loveland, Adam Frankish, Sarah E. Hunt, Fiona Cunningham, Joannella Morales, Aoife McMahon, Emily Perry, Irina M. Armean, and Paul Flicek
- Subjects
Computer science; Genomics; Computational biology; QH426-470; Web Browser; Genome; Set (abstract data type); Annotation; transcript annotation; Databases, Genetic; Genetics; RefSeq; Ensembl; Animals; Humans; survey; RNA, Messenger; Molecular Biology; Genetics (clinical); default transcript; GENCODE; variant interpretation; Computational Biology; Original Articles; Original Article; UniProt; Biomarkers; Software
- Abstract
Background: Variant interpretation is dependent on transcript annotation and remains time consuming and challenging. There are major obstacles for historical data reuse and for interpretation of new variants. First, both RefSeq and Ensembl/GENCODE produce transcript sets in common use, but there is currently no easy way to translate between the two. Second, the resources often used for variant interpretation (e.g. ClinVar, gnomAD, UniProt) do not use the same transcript set, nor default transcript or protein sequence. Method: Ensembl ran a survey in 2018 to sample attitudes to choosing one default transcript per locus, and to gather data on reference sequences used by the scientific community. This was publicised on the Ensembl and UCSC genome browsers, by email and on social media. Results: The survey had 788 responses from 32 different countries, the results of which we report here. Conclusions: We present our roadmap to create an effective default set of transcripts for resources, and for reporting interpretation of clinical variants.
After decades of avoiding the demand to highlight one transcript per locus in Ensembl, we ran a survey in 2018 to assay opinions across the scientific community. Ignoring the problem of ‘one transcript’ was not making the issue go away; many important genomic resources had instead adopted their own methods of selecting one transcript (e.g. HGMD, Ensembl, gnomAD, UniProt, ClinVar, etc.). Here we report our results and roadmap to create an effective default set of transcripts for resources, and for reporting interpretation of clinical variants.
- Published
- 2021
25. A community-driven roadmap to advance research on translated open reading frames detected by Ribo-seq
- Author
-
Jose Manuel Gonzalez, Pavel V. Baranov, Juan Pablo Couso, Jonathan M. Mudge, Ariel A. Bazzini, Xavier Roucou, Mark Gerstein, Maria Jesus Martin, Uwe Ohler, Jian Chen, Nicholas T. Ingolia, Thomas F. Martinez, Yuchen Yang, Jonathan S. Weissman, Norbert Hubner, John R. Prensner, Michele Magrane, Paul Flicek, Jorge Ruiz-Orera, Alan Saghatelian, Jana Felicitas Schulz, Brunet, Elspeth A. Bruford, Gerben Menschaert, M. Mar Albà, Adam Frankish, S. van Heesch, and Anne-Ruxandra Carvunis
- Subjects
animal structures; GENCODE; Computer science; HUGO Gene Nomenclature Committee; Biological database; Ensembl; natural sciences; Human genome; Computational biology; Ribosome profiling; ORFS; UniProt
- Abstract
Ribosome profiling (Ribo-seq) has catalyzed a paradigm shift in our understanding of the translational ‘vocabulary’ of the human genome, discovering thousands of translated open reading frames (ORFs) within long non-coding RNAs and presumed untranslated regions of protein-coding genes. However, reference gene annotation projects have been circumspect in their incorporation of these ORFs due to uncertainties about their experimental reproducibility and physiological roles. Yet, it is indisputable that certain Ribo-seq ORFs make stable proteins, others mediate gene regulation, and many have medical implications. Ultimately, the absence of standardized ORF annotation has created a circular problem: while Ribo-seq ORFs remain unannotated by reference biological databases, this lack of characterisation will thwart research efforts examining their roles. Here, we outline the initial stages of a community-led effort supported by GENCODE / Ensembl, HGNC and UniProt to produce a consolidated catalog of human Ribo-seq ORFs.
- Published
- 2021
26. The value of primary transcripts to the clinical and non-clinical genomics community: Survey results and roadmap for improvements
- Author
-
Fiona Cunningham, Joannella Morales, Aoife C Mcmahon, Jane Loveland, Emily Perry, Adam Frankish, Sarah Hunt, Irina M Armean, and Paul Flicek
- Published
- 2021
27. A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain
- Author
-
Hagen Tilgner, Simon A. Hardwick, Geoffrey S. Pitt, Man Ying Wong, Andrey D. Prjibelski, Wenjie Luo, Ahmed Mahfouz, Stephen R. Williams, Zachary Bent, Susan Lin, Qi Wang, Adam Frankish, Alexander N. Stein, Bettina Haase, Paul Flicek, Olivier Fedrigo, Christoph Dieterich, Jordan Marrocco, Toby Hunt, Davide Risso, Erich D. Jarvis, August B. Smit, Paul Collier, Jennifer Chew, Steven A. Sloan, Ashley Hayes, Anoushka Joglekar, M. Elizabeth Ross, Neil I. Weisenfeld, Anna Katharina Schlusche, Center for Neurogenomics and Cognitive Research, Amsterdam Neuroscience - Cellular & Molecular Mechanisms, and Amsterdam Neuroscience - Neurodegeneration
- Subjects
0301 basic medicine; Gene isoform; Cell type; RNA splicing; Science; General Physics and Astronomy; Prefrontal Cortex; Biology; Hippocampus; General Biochemistry, Genetics and Molecular Biology; Article; Transcriptome; 03 medical and health sciences; Exon; Mice; 0302 clinical medicine; Animals; Protein Isoforms; Prefrontal cortex; Transcriptomics; Gene; Regulation of gene expression; Spatial Analysis; Multidisciplinary; Computational Biology; Gene Expression Regulation, Developmental; Development of the nervous system; General Chemistry; Cell biology; Alternative Splicing; 030104 developmental biology; Animals, Newborn; Computational neuroscience; Models, Animal; Female; Single-Cell Analysis; 030217 neurology & neurosurgery
- Abstract
Splicing varies across brain regions, but the single-cell resolution of regional variation is unclear. We present a single-cell investigation of differential isoform expression (DIE) between brain regions using single-cell long-read sequencing in mouse hippocampus and prefrontal cortex in 45 cell types at postnatal day 7 (www.isoformAtlas.com). Isoform tests for DIE show better performance than exon tests. We detect hundreds of DIE events traceable to cell types, often corresponding to functionally distinct protein isoforms. Mostly, one cell type is responsible for brain-region specific DIE. However, for fewer genes, multiple cell types influence DIE. Thus, regional identity can, although rarely, override cell-type specificity. Cell types indigenous to one anatomic structure display distinctive DIE, e.g. the choroid plexus epithelium manifests distinct transcription-start-site usage. Spatial transcriptomics and long-read sequencing yield a spatially resolved splicing map. Our methods quantify isoform expression with cell-type and spatial resolution and contribute to furthering our understanding of how the brain integrates molecular and cellular complexity.
Alternative RNA splicing varies across the brain. Its mapping at single cell resolution is unclear. Here, the authors provide a spatial and single-cell splicing atlas reporting brain region- and cell type-specific expression of different isoforms in the postnatal mouse brain.
- Published
- 2021
28. RNAcentral 2021: Secondary structure integration, improved sequence search and new member databases
- Author
-
Alex Bateman, Dimitra Karagkouni, Robin R. Gutell, Lina Ma, Ruth C. Lovering, Prita Mani, Artemis G. Hatzigeorgiou, Pieter-Jan Volders, Elspeth A. Bruford, Simon Kay, Kevin J. Peterson, Lauren M. Lui, Steven J Marygold, Todd M. Lowe, Jamie J. Cannone, Anton S. Petrov, Patricia P. Chan, Robert D. Finn, Adam Frankish, Stefan E. Seemann, David Hoksza, Bastian Fromm, Ioanna Kalvari, Maciej Szymanski, Ruth L. Seal, Ruth Barshir, Pieter Mestdagh, Simona Panni, Carlos Eduardo Ribas, Michelle S. Scott, Pablo Porras, Simon Fishilevich, Anton I. Petrov, Sam Griffiths-Jones, Blake A. Sweeney, Zhang Zhang, Jonathan M. Mudge, Zasha Weinberg, Sridhar Ramachandran, Jan Gorodkin, Shuai Weng, Eric P. Nawrocki, Wojciech M. Karlowski, Barbara Kramarz, Philia Bouchard-Bourelle, and Gil dos Santos
- Subjects
RNA, Untranslated; Interface (Java); AcademicSubjects/SCI00010; CURATION; Rfam; Biology; computer.software_genre; ANNOTATION; MiRBase; 03 medical and health sciences; Annotation; Betacoronavirus; 0302 clinical medicine; Genetics; Medicine and Health Sciences; Database Issue; Animals; Humans; Sequence Ontology; Gene; 030304 developmental biology; 0303 health sciences; Internet; Database; Base Sequence; Sequence Analysis, RNA; Fungi; RNA; Molecular Sequence Annotation; Non-coding RNA; GENE; Gene Ontology; Nucleic Acid Conformation; Databases, Nucleic Acid; computer; Apicomplexa; 030217 neurology & neurosurgery; Software
- Abstract
RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences that provides a single access point to 44 RNA resources and >18 million ncRNA sequences from a wide range of organisms and RNA types. RNAcentral now also includes secondary (2D) structure information for >13 million sequences, making RNAcentral the world’s largest RNA 2D structure database. The 2D diagrams are displayed using R2DT, a new 2D structure visualization method that uses consistent, reproducible and recognizable layouts for related RNAs. The sequence similarity search has been updated with a faster interface featuring facets for filtering search results by RNA type, organism, source database or any keyword. This sequence search tool is available as a reusable web component, and has been integrated into several RNAcentral member databases, including Rfam, miRBase and snoDB. To allow for a more fine-grained assignment of RNA types and subtypes, all RNAcentral sequences have been annotated with Sequence Ontology terms. The RNAcentral database continues to grow and provide a central data resource for the RNA community. RNAcentral is freely available at https://rnacentral.org.
- Published
- 2021
- Full Text
- View/download PDF
29. The Ensembl COVID-19 resource: Ongoing integration of public SARS-CoV-2 data
- Author
-
Carla Cummins, Andrew D. Yates, Astrid Gall, Daniel R. Zerbino, Anne Parker, Denye Ogeh, David Thybert, Adam Frankish, Bruno Contreras-Moreira, Jyothish Bhai, Fergal J. Martin, Andrea Winterbottom, Benjamin Moore, Magali Ruffier, Robert D. Finn, John Tate, Thiago A. L. Genez, Kevin L. Howe, Stephen J. Trevanion, Dan Sheppard, Sarah E. Hunt, Paul Flicek, Manoj Pandian Sakthivel, Andrew Parton, Anja Thormann, Marc Chakiachvili, and Nishadi De Silva
- Subjects
Potential impact ,Coronavirus disease 2019 (COVID-19) ,Coronaviridae ,SARS-CoV-2 ,AcademicSubjects/SCI00010 ,Computer science ,Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) ,COVID-19 ,Genetic Variation ,Molecular Sequence Annotation ,Genome, Viral ,Computational biology ,Web Browser ,Biology ,Data science ,Annotation ,Identification (information) ,Workflow ,Resource (project management) ,Databases, Genetic ,Genetics ,Humans ,Database Issue ,Ensembl ,Gene - Abstract
The COVID-19 pandemic has seen unprecedented use of SARS-CoV-2 genome sequencing for epidemiological tracking and identification of emerging variants. Understanding the potential impact of these variants on the infectivity of the virus and the efficacy of emerging therapeutics and vaccines has become a cornerstone of the fight against the disease. To support the maximal use of genomic information for SARS-CoV-2 research, we launched the Ensembl COVID-19 browser, incorporating a new Ensembl gene set, multiple variant sets (including novel variation calls), and annotation from several relevant resources integrated into the reference SARS-CoV-2 assembly. This work included key adaptations of existing Ensembl genome annotation methods to model ribosomal slippage, stringent filters to elucidate the highest confidence variants and utilisation of our comparative genomics pipelines on viruses for the first time. Since May 2020, the content has been regularly updated and tools such as the Ensembl Variant Effect Predictor have been integrated. The Ensembl COVID-19 browser is freely available at https://covid-19.ensembl.org.
- Published
- 2020
- Full Text
- View/download PDF
30. Author Correction: Perspectives on ENCODE
- Author
-
Jane Loveland, Axel Visel, Michael Snyder, Adam Frankish, Giovanni Quinones-Valdez, J. Michael Cherry, Eugene Yeo, Daniel Barrell, Jonathan Mudge, and Anshul Kundaje
- Subjects
Multidisciplinary - Published
- 2022
- Full Text
- View/download PDF
31. Ensembl 2021
- Author
-
Kevin L Howe, Premanand Achuthan, James Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Andrey G Azov, Ruth Bennett, Jyothish Bhai, Konstantinos Billis, Sanjay Boddu, Mehrnaz Charkhchi, Carla Cummins, Luca Da Rin Fioretto, Claire Davidson, Kamalkumar Dodiya, Bilal El Houdaigui, Reham Fatima, Astrid Gall, Carlos Garcia Giron, Tiago Grego, Cristina Guijarro-Clarke, Leanne Haggerty, Anmol Hemrom, Thibaut Hourlier, Osagie G Izuogu, Thomas Juettemann, Vinay Kaikala, Mike Kay, Ilias Lavidas, Tuan Le, Diana Lemos, Jose Gonzalez Martinez, José Carlos Marugán, Thomas Maurel, Aoife C McMahon, Shamika Mohanan, Benjamin Moore, Matthieu Muffato, Denye N Ogeh, Dimitrios Paraschas, Anne Parker, Andrew Parton, Irina Prosovetskaia, Manoj P Sakthivel, Ahamed I Abdul Salam, Bianca M Schmitt, Helen Schuilenburg, Dan Sheppard, Emily Steed, Michal Szpak, Marek Szuba, Kieron Taylor, Anja Thormann, Glen Threadgold, Brandon Walts, Andrea Winterbottom, Marc Chakiachvili, Ameya Chaubal, Nishadi De Silva, Bethany Flint, Adam Frankish, Sarah E Hunt, Garth R Ilsley, Nick Langridge, Jane E Loveland, Fergal J Martin, Jonathan M Mudge, Joanella Morales, Emily Perry, Magali Ruffier, John Tate, David Thybert, Stephen J Trevanion, Fiona Cunningham, Andrew D Yates, Daniel R Zerbino, and Paul Flicek
- Abstract
The Ensembl project (https://www.ensembl.org) annotates genomes and disseminates genomic data for vertebrate species. We create detailed and comprehensive annotation of gene structures, regulatory elements and variants, and enable comparative genomics by inferring the evolutionary history of genes and genomes. Our integrated genomic data are made available in a variety of ways, including genome browsers, search interfaces, specialist tools such as the Ensembl Variant Effect Predictor, download files and programmatic interfaces. Here, we present recent Ensembl developments including two new website portals. Ensembl Rapid Release (http://rapid.ensembl.org) is designed to provide core tools and services for genomes as soon as possible and has been deployed to support large biodiversity sequencing projects. Our SARS-CoV-2 genome browser (https://covid-19.ensembl.org) integrates our own annotation with publicly available genomic data from numerous sources to facilitate the use of genomics in the international scientific response to the COVID-19 pandemic. We also report on other updates to our annotation resources, tools and services. All Ensembl data and software are freely available without restriction.
- Published
- 2020
32. Cell-type, single-cell, and spatial signatures of brain-region specific splicing in postnatal development
- Author
-
Andrey D Przhibelskiy, Toby Hunt, Adam Frankish, Man Ying Wong, Geoffrey S. Pitt, Jennifer Chew, Hagen Tilgner, Davide Risso, Alexander N. Stein, Olivier Fedrigo, Paul Flicek, Ahmed Mahfouz, Steven A. Sloan, Zachary Bent, August B. Smit, Paul Collier, Jordan Marrocco, Wenjie Luo, Erich D. Jarvis, Anoushka Joglekar, Margaret Elizabeth Ross, Susan Lin, Anna Katharina Schlusche, Neil I. Weisenfeld, Simon A. Hardwick, Bettina Haase, Ashley Hayes, and Stephen R. Williams
- Subjects
Gene isoform ,Transcriptome ,Exon ,Cell type ,medicine.anatomical_structure ,Cell ,Alternative splicing ,RNA splicing ,medicine ,Computational biology ,Biology ,Gene - Abstract
Alternative RNA splicing varies across brain regions, but the single-cell resolution of such regional variation is unknown. Here we present the first single-cell investigation of differential isoform expression (DIE) between brain regions, by performing single cell long-read transcriptome sequencing in the mouse hippocampus and prefrontal cortex in 45 cell types at postnatal day 7 (www.isoformAtlas.com). Using isoform tests for brain-region specific DIE, which outperform exon-based tests, we detect hundreds of brain-region specific DIE events traceable to specific cell-types. Many DIE events correspond to functionally distinct protein isoforms, some with just a 6-nucleotide exon variant. In most instances, one cell type is responsible for brain-region specific DIE. Cell types indigenous to only one anatomic structure display distinctive DIE, where, for example, the choroid plexus epithelium manifests unique transcription start sites. However, for some genes, multiple cell-types are responsible for DIE in bulk data, indicating that regional identity can, although less frequently, override cell-type specificity. We validated our findings with spatial transcriptomics and long-read sequencing, yielding the first spatially resolved splicing map in the postnatal mouse brain (www.isoformAtlas.com). Our methods are highly generalizable. They provide a robust means of quantifying isoform expression with cell-type and spatial resolution, and reveal how the brain integrates molecular and cellular complexity to serve function.
- Published
- 2020
- Full Text
- View/download PDF
33. Transcriptional activity and strain-specific history of mouse pseudogenes
- Author
-
Jennifer Harrow, Duncan T. Odom, David Thybert, Paul Flicek, Thomas M. Keane, Adam Frankish, Ian T. Fiddes, Cristina Sisu, Mark Gerstein, Paul R. Muir, Tim Hubbard, Mark Diekhans, Sisu, Cristina [0000-0001-9371-0797], Diekhans, Mark [0000-0002-0430-0989], Odom, Duncan T. [0000-0001-6201-5599], Flicek, Paul [0000-0002-3897-7955], Keane, Thomas M. [0000-0001-7532-6898], Hubbard, Tim [0000-0002-1767-9318], Gerstein, Mark [0000-0002-9746-3719], and Apollo - University of Cambridge Repository
- Subjects
0301 basic medicine ,Mouse ,Transcription, Genetic ,Pseudogene ,Science ,General Physics and Astronomy ,Biology ,631/208/212/2304 ,Genome informatics ,Genome ,General Biochemistry, Genetics and Molecular Biology ,Evolution, Molecular ,03 medical and health sciences ,0302 clinical medicine ,Species Specificity ,Ribosomal protein ,631/136/334/1874/345 ,Animals ,Humans ,lcsh:Science ,Gene ,Conserved Sequence ,Genetics ,45/91 ,Transcriptional activity ,Multidisciplinary ,Repertoire ,Strain (biology) ,article ,Molecular Sequence Annotation ,General Chemistry ,Genome evolution ,Mice, Inbred C57BL ,030104 developmental biology ,Gene Ontology ,631/114/2401 ,631/114/2785 ,lcsh:Q ,Data integration ,64/60 ,030217 neurology & neurosurgery ,Pseudogenes ,Reference genome - Abstract
Pseudogenes are ideal markers of genome remodelling. In turn, the mouse is an ideal platform for studying them, particularly with the recent availability of strain-sequencing and transcriptional data. Here, combining both manual curation and automatic pipelines, we present a genome-wide annotation of the pseudogenes in the mouse reference genome and 18 inbred mouse strains (available via the mouse.pseudogene.org resource). We also annotate 165 unitary pseudogenes in mouse and 303 in human. The overall pseudogene repertoire in mouse is similar to that in human in terms of size, biotype distribution, and family composition (e.g. with GAPDH and ribosomal proteins being the largest families). Notable differences arise in the pseudogene age distribution, with multiple retro-transpositional bursts in mouse evolutionary history and only one in human. Furthermore, in each strain about a fifth of all pseudogenes are unique, reflecting strain-specific evolution. Finally, we find that ~15% of the mouse pseudogenes are transcribed, and that highly transcribed parent genes tend to give rise to many processed pseudogenes. Pseudogenes are key markers of genome remodelling processes. Here the authors present genome-wide annotation of the pseudogenes in the mouse reference genome and 18 inbred mouse strains, update human pseudogene annotations, and characterise the transcription and evolution of mouse pseudogenes.
- Published
- 2020
- Full Text
- View/download PDF
34. Cell type-specific novel long non-coding RNA and circular RNA in the BLUEPRINT hematopoietic transcriptomes atlas
- Author
-
Natasha Andressa Nogueira Jorge, Jonathan M. Mudge, Mattia Frontini, Myrto Kostadima, Osagie G. Izuogu, Enca Martin-Rendon, Hendrik G. Stunnenberg, Romina Petersen, Denis Seyres, Xavier Estivill, Paul Flicek, Luigi Grassi, Samantha Farrow, Fabio Passetti, John J. Lambourne, Willem H. Ouwehand, Neda Farahi, Frances Burden, Adam Frankish, Sophia Rowlston, Joost H.A. Martens, Marie-Laure Yaspo, Kate Downes, Mariona Bustamante, Laura Clarke, Ernest Turro, Fergal J. Martin, Ouwehand, Willem [0000-0002-7744-1790], and Apollo - University of Cambridge Repository
- Subjects
Cell type ,Cell ,Sang -- Malalties ,Computational biology ,Biology ,Article ,Transcriptome ,03 medical and health sciences ,0302 clinical medicine ,Circular RNA ,medicine ,Gene ,Sequence Analysis, RNA ,Gene Expression Profiling ,RNA ,High-Throughput Nucleotide Sequencing ,Hematology ,RNA, Circular ,Long non-coding RNA ,medicine.anatomical_structure ,RNA, Long Noncoding ,DNA microarray ,Genètica ,030215 immunology - Abstract
Transcriptional profiling of hematopoietic cell subpopulations has helped to characterize the developmental stages of the hematopoietic system and the molecular bases of malignant and non-malignant blood diseases. Previously, only the genes targeted by expression microarrays could be profiled genome-wide. High-throughput RNA sequencing, however, encompasses a broader repertoire of RNA molecules, without restriction to previously annotated genes. We analyzed the BLUEPRINT consortium RNA-sequencing data for mature hematopoietic cell types. The data comprised 90 total RNA-sequencing samples, each composed of one of 27 cell types, and 32 small RNA-sequencing samples, each composed of one of 11 cell types. We estimated gene and isoform expression levels for each cell type using existing annotations from Ensembl. We then used guided transcriptome assembly to discover unannotated transcripts. We identified hundreds of novel non-coding RNA genes and showed that the majority have cell type-dependent expression. We also characterized the expression of circular RNA and found that these are also cell type-specific. These analyses refine the active transcriptional landscape of mature hematopoietic cells, highlight abundant genes and transcriptional isoforms for each blood cell type, and provide a valuable resource for researchers of hematologic development and diseases. Finally, we made the data accessible via a web-based interface: https://blueprint.haem.cam.ac.uk/bloodatlas/. The work was funded by a grant from the European Commission 7th Framework Program (FP7/2007–2013, grant 282510, BLUEPRINT) to XE, PF, JHAM, MY, HGS and WHO. WHO is an NIHR senior investigator and receives funding from Bristol-Myers Squibb, the British Heart Foundation, the Medical Research Council and the NIHR. 
OGI, FJM, AF, JMM, LC and PF are funded by the Wellcome Trust (WT108749/Z/15/Z) with additional funding for specific project components such as GENCODE from the National Human Genome Research Institute of the National Institutes of Health (2U41HG007234). FP is supported by the Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ; E-26/203.229/2016). NANJ is a recipient of a scholarship from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES; Finance Code 001). MF is supported by the British Heart Foundation (FS/18/53/33863).
- Published
- 2020
- Full Text
- View/download PDF
35. Expert curation of the human and mouse olfactory receptor gene repertoires identifies conserved coding regions split across two exons
- Author
-
Mona Khan, If H. A. Barnes, Darren W. Logan, Matthew P. Hardy, Claire Davidson, Zhen Zeng, Mark Jorissen, Jose Manuel Gonzalez, Peter Mombaerts, Adam Frankish, Ximena Ibarra-Soria, Jennifer Harrow, Deepa Manthravadi, Stephen Fitzgerald, Laura Van Gerven, Barnes, If H. A. [0000-0001-9303-4610], and Apollo - University of Cambridge Repository
- Subjects
Mouse ,Receptors, Odorant ,Genome ,Conserved sequence ,Human and rodent genomics ,Mice ,0302 clinical medicine ,Databases, Genetic ,Gene duplication ,TOPOLOGY ,Coding region ,Conserved Sequence ,Data Curation ,Genetics & Heredity ,0303 health sciences ,EXPANSION ,Exons ,FAMILY ,Curation ,ALIGNMENT ,medicine.anatomical_structure ,Life Sciences & Biomedicine ,Pseudogenes ,Research Article ,Human ,Biotechnology ,lcsh:QH426-470 ,DATABASE ,Annotation ,lcsh:Biotechnology ,Pseudogene ,Quantitative Trait Loci ,Locus (genetics) ,Computational biology ,Biology ,Olfactory receptor gene ,03 medical and health sciences ,lcsh:TP248.13-248.65 ,medicine ,Genetics ,Animals ,Humans ,Gene family ,Gene ,030304 developmental biology ,TOOLS ,Science & Technology ,Olfactory receptor ,Genome, Human ,GENCODE ,MODEL ,lcsh:Genetics ,Biotechnology & Applied Microbiology ,Genetic Loci ,030217 neurology & neurosurgery - Abstract
Background: Olfactory receptor (OR) genes are the largest multi-gene family in the mammalian genome, with 874 loci in human and 1483 in mouse (including pseudogenes). The expansion of the OR gene repertoire has occurred through numerous duplication events followed by diversification, resulting in a large number of highly similar paralogous genes. These characteristics have made the annotation of the complete OR gene repertoire a complex task. Most OR genes have been predicted in silico and are typically annotated as intronless coding sequences. Results: Here we have developed an expert curation pipeline to analyse and annotate every OR gene in the human and mouse reference genomes. By combining evidence from structural features, evolutionary conservation and experimental data, we have unified the annotation of these gene families, and have systematically determined the protein-coding potential of each locus. We have defined the non-coding regions of many OR genes, enabling us to generate full-length transcript models. We found that 13 human and 41 mouse OR loci have coding sequences that are split across two exons. These split OR genes are conserved across mammals, and are expressed at the same level as protein-coding OR genes with an intronless coding region. Our findings challenge the long-standing and widespread notion that the coding region of a vertebrate OR gene is contained within a single exon. Conclusions: This work provides the most comprehensive curation effort of the human and mouse OR gene repertoires to date. The complete annotation has been integrated into the GENCODE reference gene set, for immediate availability to the research community.
- Published
- 2020
- Full Text
- View/download PDF
36. Ensembl 2019
- Author
-
Fiona Cunningham, Premanand Achuthan, Wasiu Akanni, James Allen, M Ridwan Amode, Irina M Armean, Ruth Bennett, Jyothish Bhai, Konstantinos Billis, Sanjay Boddu, Carla Cummins, Claire Davidson, Kamalkumar Jayantilal Dodiya, Astrid Gall, Carlos García Girón, Laurent Gil, Tiago Grego, Leanne Haggerty, Erin Haskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, Thomas Juettemann, Mike Kay, Matthew R Laird, Ilias Lavidas, Zhicheng Liu, Jane E Loveland, José C Marugán, Thomas Maurel, Aoife C McMahon, Benjamin Moore, Joannella Morales, Jonathan M Mudge, Michael Nuhn, Denye Ogeh, Anne Parker, Andrew Parton, Mateus Patricio, Ahamed Imran Abdul Salam, Bianca M Schmitt, Helen Schuilenburg, Dan Sheppard, Helen Sparrow, Eloise Stapleton, Marek Szuba, Kieron Taylor, Glen Threadgold, Anja Thormann, Alessandro Vullo, Brandon Walts, Andrea Winterbottom, Amonida Zadissa, Marc Chakiachvili, Adam Frankish, Sarah E Hunt, Myrto Kostadima, Nick Langridge, Fergal J Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Daniel M Staines, Stephen J Trevanion, Bronwen L Aken, Andrew D Yates, Daniel R Zerbino, and Paul Flicek
- Subjects
0301 basic medicine ,Genome ,Computational Biology ,Molecular Sequence Annotation ,Genomics ,03 medical and health sciences ,Mice ,030104 developmental biology ,0302 clinical medicine ,ComputingMethodologies_PATTERNRECOGNITION ,Databases, Genetic ,Vertebrates ,Genetics ,Database Issue ,Animals ,Humans ,030217 neurology & neurosurgery ,Software - Abstract
The Ensembl project (https://www.ensembl.org) makes key genomic data sets available to the entire scientific community without restrictions. Ensembl seeks to be a fundamental resource driving scientific progress by creating, maintaining and updating reference genome annotation and comparative genomics resources. This year we describe our new and expanded gene, variant and comparative annotation capabilities, which led to a 50% increase in the number of vertebrate genomes we support. We have also doubled the number of available human variants and added regulatory regions for many mouse cell types and developmental stages. Our data sets and tools are available via the Ensembl website as well as through a RESTful web service, a Perl application programming interface and data files for download.
- Published
- 2018
37. GENCODE reference annotation for the human and mouse genomes
- Author
-
Rory Johnson, Fernando Pozo, Andrew D. Yates, Magali Ruffier, Alexandra Bignell, Anne Parker, Adam Frankish, Osagie G. Izuogu, If Barnes, Jose Manuel Gonzalez, Tomás Di Domenico, Irwin Jungreis, Cristina Sisu, Fergal J. Martin, Anne-Maud Ferreira, Paul R. Muir, Benedict Paten, Matthew P. Hardy, Manolis Kellis, Jacqueline Chrast, Jane E. Loveland, Yan Zhang, Michael L. Tress, Fabio C. P. Navarro, Barbara Uszczynska-Ratajczak, Jonathan M. Mudge, Mark Diekhans, Shamika Mohanan, James C. Wright, Fiona Cunningham, Toby Hunt, Baikang Pei, Jinuri Xu, Tim Hubbard, Julien Lagarde, Roderic Guigó, Alexandre Reymond, Silvia Carbonell Sala, Bianca M. Schmitt, Sarah Donaldson, Mark Gerstein, Jyoti S. Choudhary, Eloise Stapleton, Bronwen Aken, Laura Martinez, Tiago Grego, Ian T. Fiddes, Joel Armstrong, Carlos García Girón, Marie-Marthe Suner, Paul Flicek, Andrew Berry, Irina Sycheva, Thibaut Hourlier, Daniel R. Zerbino, Wellcome Trust, European Molecular Biology Organization, NIH - National Human Genome Research Institute (NHGRI) (Estados Unidos), and Swiss National Science Foundation
- Subjects
DNA ELEMENTS ,GENES ,PROTEINS ,DATABASE ,610 Medicine & health ,Genomics ,Computational biology ,Biology ,LONG NONCODING RNAS ,Genome ,Mice ,03 medical and health sciences ,Annotation ,0302 clinical medicine ,Databases, Genetic ,PROGRAM ,Genetics ,Database Issue ,Animals ,Humans ,Ensembl ,030304 developmental biology ,Internet ,0303 health sciences ,IDENTIFICATION ,Genome, Human ,GENCODE ,Computational Biology ,Genome, Human/genetics ,Molecular Sequence Annotation ,Pseudogenes/genetics ,Software ,Gene Annotation ,PRINCIPAL ,ENCYCLOPEDIA ,CATALOG ,3. Good health ,ComputingMethodologies_PATTERNRECOGNITION ,Human genome ,Pseudogenes ,030217 neurology & neurosurgery - Abstract
The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org. 
We thank Tim Hubbard and Jennifer Harrow for their leadership in the GENCODE project from 2003–2016 as well as all groups and group members involved in the GENCODE project since its inception including the HAVANA manual annotation group formerly at Wellcome Sanger Institute now at EMBL-EBI (founder), the Guigo group at Centre for Genomic Regulation (founder), the Gerstein group at Yale (founder), the Center for Biomolecular Science & Engineering at UCSC (founder), the Ensembl team at EMBL-EBI (joined 2007), the Kellis group at MIT (joined 2007), the Tress group at CNIO (joined 2007), the Choudhary group formerly at Wellcome Sanger Institute now at Institute of Cancer Research (joined 2012), the Reymond group at University of Lausanne (2003–2017), the Antonarakis group at University of Geneva (2003–2007), the Wei group at Genome Institute of Singapore (2003–2007), the Gingeras group at Affymetrix Ltd (2003–2007) and the Brent group at Washington University in St. Louis (2007–2012).
- Published
- 2018
- Full Text
- View/download PDF
38. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci
- Author
-
Jyoti S. Choudhary, Liang He, Jose Manuel Gonzalez, Susan Tweedie, Jonathan M. Mudge, Manolis Kellis, Toby Hunt, Elspeth A. Bruford, Claire Davidson, Ruth L. Seal, Yue Li, Robert M. Waterhouse, James C. Wright, M. Kay, Adam Frankish, Irwin Jungreis, and Stephen Fitzgerald
- Subjects
Resource ,Sequence analysis ,Pseudogene ,Computational biology ,Biology ,ENCODE ,Genome ,Animals ,Exons ,Genome, Human ,Genome-Wide Association Study ,High-Throughput Nucleotide Sequencing ,Humans ,Open Reading Frames ,Pseudogenes ,Sequence Analysis, DNA ,03 medical and health sciences ,0302 clinical medicine ,Genetics ,Coding region ,Gene ,Genetics (clinical) ,030304 developmental biology ,0303 health sciences ,GENCODE ,Human genome ,030217 neurology & neurosurgery - Abstract
The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.
- Published
- 2019
39. Towards a complete map of the human long non-coding RNA transcriptome
- Author
-
Julien Lagarde, Rory Johnson, Barbara Uszczynska-Ratajczak, Roderic Guigó, and Adam Frankish
- Subjects
0301 basic medicine ,medicine.medical_specialty ,Computational biology ,Biology ,Genome ,Article ,Transcriptome ,03 medical and health sciences ,Annotation ,Genetics ,medicine ,Humans ,610 Medicine & health ,Molecular Biology ,Gene ,Genetics (clinical) ,Gene map ,Genome, Human ,Gene Expression Profiling ,Chromosome Mapping ,Long non-coding RNA ,Gene expression profiling ,030104 developmental biology ,Medical genetics ,RNA, Long Noncoding ,Genome-Wide Association Study - Abstract
Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.
- Published
- 2018
- Full Text
- View/download PDF
40. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing
- Author
-
Rory Johnson, Thomas R. Gingeras, Carrie A. Davis, Sílvia Pérez-Lluch, Adam Frankish, Silvia Carbonell, Roderic Guigó, Julien Lagarde, Amaya Abad, Barbara Uszczynska-Ratajczak, and Jennifer Harrow
- Subjects
0301 basic medicine ,02 engineering and technology ,Bioinformatics ,Transcriptome ,Mice ,transcriptomics ,lncRNA ,Intergenic region ,610 Medicine & health ,PacBio ,0303 health sciences ,education.field_of_study ,long read sequencing ,High-Throughput Nucleotide Sequencing ,RNA sequencing ,Genomics ,Long non-coding RNA ,annotation ,Sequence annotation ,lincRNA ,CaptureSeq ,RNA, Long Noncoding ,third generation sequencing ,0206 medical engineering ,Population ,Computational biology ,Biology ,Article ,Open Reading Frames ,03 medical and health sciences ,Annotation ,Gene expression analysis ,Genetics ,Animals ,Humans ,education ,Gene ,030304 developmental biology ,KANTR ,GENCODE ,Gene Expression Profiling ,Computational Biology ,Reproducibility of Results ,RNA ,Molecular Sequence Annotation ,RNA probes ,030104 developmental biology ,020602 bioinformatics ,Long noncoding RNA - Abstract
Accurate annotation of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, while thousands more remain uncatalogued, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), combining targeted RNA capture with third-generation long-read sequencing. We present an experimental re-annotation of the GENCODE intergenic lncRNA population in matched human and mouse tissues, resulting in novel transcript models for 3574 and 561 gene loci, respectively. CLS approximately doubles the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enable us to definitively characterize the genomic features of lncRNAs, including promoter- and gene-structure, and protein-coding potential. Thus CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales. Abbreviations: bp, base pair; FL, full length; nt, nucleotide; ROI, read of insert, i.e. PacBio read; SJ, splice junction; SMRT, single-molecule real-time; TM, transcript model.
- Published
- 2017
- Full Text
- View/download PDF
41. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation
- Author
-
Claire Davidson, Jonathan M. Mudge, Marie-Marthe Suner, Vamsi K. Kodali, Ruth L. Seal, Ruth Bennett, Shashikant Pujar, Fergal J. Martin, Carol J. Bult, M. Kay, Catherine M. Farrell, Nuala A. O'Leary, Craig Wallin, Carlos García Girón, Andrew Berry, If Barnes, Tamara Goldfarb, Adam Frankish, Kelly M. McGarvey, Bronwen Aken, Kim D. Pruitt, Mark Diekhans, Sophia Zhu, Terence Murphy, Michael R. Murphy, Elspeth A. Bruford, Monica S. McAndrews, John D. Jackson, Lillian D. Riddick, Bhanu Rajput, David Webb, Sanjida H. Rangwala, Eric Cox, Jose Manuel Gonzalez, Jane E. Loveland, Toby Hunt, and Vinita Joardar
- Subjects
0301 basic medicine ,Guidelines as Topic ,Mouse Genome Informatics ,Biology ,Bioinformatics ,03 medical and health sciences ,Mice ,Open Reading Frames ,User-Computer Interface ,0302 clinical medicine ,Web page ,Consensus Sequence ,Databases, Genetic ,Genetics ,Ensembl ,Database Issue ,Animals ,Humans ,Data Curation ,Information retrieval ,National Library of Medicine (U.S.) ,Molecular Sequence Annotation ,United States ,Identifier ,Gene nomenclature ,030104 developmental biology ,User interface ,030217 neurology & neurosurgery ,Reference genome - Abstract
The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
- Published
- 2017
42. Cell type specific novel lincRNAs and circRNAs in the BLUEPRINT haematopoietic transcriptomes atlas
- Author
-
Luigi Grassi, N. A. Jeorge, Kate Downes, Frances Burden, Paul Flicek, Xavier Estivill, John J. Lambourne, Neda Farahi, Henk Stunnenberg, Willem H. Ouwehand, Romina Petersen, Denis Seyres, Fergal J. Martin, Fabio Passetti, Adam Frankish, Sophia Rowlston, Laura Clarke, Joost H.A. Martens, S. Farrow, Myrto Kostadima, Osagie G. Izuogu, Ernest Turro, Marie-Laure Yaspo, Jonathan M. Mudge, Mariona Bustamante, Mattia Frontini, and Enca Martin-Rendon
- Subjects
0303 health sciences ,Small RNA ,Cell type ,RNA ,Computational biology ,Gene Annotation ,Biology ,Genome ,Transcriptome ,03 medical and health sciences ,0302 clinical medicine ,Gene expression ,Gene ,030217 neurology & neurosurgery ,030304 developmental biology - Abstract
Transcriptional profiling of hematopoietic cell subpopulations has helped characterize the developmental stages of the hematopoietic system and the molecular basis of malignant and non-malignant blood diseases for the past three decades. The introduction of high-throughput RNA sequencing has increased knowledge of the full repertoire of RNA molecules in hematopoietic cells of different types, without relying on prior gene annotation. Here, we introduce the analysis of the BLUEPRINT consortium gene expression data for mature hematopoietic cells, comprising 90 total RNA and 32 small RNA sequencing experiments, from 27 different cell types. We used these data to describe the transcriptional profile of each cell type. We used guided transcriptome assembly to extend the annotation of the transcribed genome, which led to the identification of hundreds of novel non-coding RNA genes, which display a high degree of cell type specificity. We also characterized the expression of circular RNAs and found that these are also highly cell type specific. This resource refines the active transcriptional landscape of mature hematopoietic cells, highlights abundant genes and transcriptional isoforms for each cell type, and provides valuable data and visualisation tools for the scientific community working on hematological development and diseases.
- Published
- 2019
43. Systematic re-annotation of 191 genes associated with early-onset epilepsy unmasks de novo variants linked to Dravet syndrome in novel SCN1A exons
- Author
-
Stephen Abbs, Caroline Nava, Sanjay M. Sisodiya, Jyoti S. Choudhary, Dimitrios Vitsios, Dmitri D. Pervouchine, Electra Tapanari, Fadi F. Hamdan, Hannah Stamberger, Berten Ceulemans, Detelina Grozeva, Jennifer Harrow, Patricia Leroy, Marie Marthe Suner, Charles A. Steward, Gianpiero L. Cavalleri, Margarida Viola, Anne Fabienne Lepine, José M. González, Mark Diekhans, Barbara Uszczynska-Ratajczak, Paul Flicek, Nicholas Lench, F. Lucy Raymond, Adam Frankish, Robert Petryszak, Stephen Fitzgerald, Sarah Weckhuysen, Alba Sanchis-Juan, James C. Wright, Peter De Jonghe, Roderic Guigó, Anthony Rogers, Slavé Petrovski, Don Keiller, Jonathan M. Mudge, Jolien Roovers, and Berge A. Minassian
- Subjects
Genetics ,0303 health sciences ,GENCODE ,Gene Annotation ,Biology ,medicine.disease ,Genome ,3. Good health ,03 medical and health sciences ,Exon ,0302 clinical medicine ,Dravet syndrome ,Genotype ,medicine ,Exome ,Gene ,030217 neurology & neurosurgery ,030304 developmental biology - Abstract
The early infantile epileptic encephalopathies (EIEE) are a group of rare, severe neurodevelopmental disorders, where even the most thorough sequencing studies leave 60-65% of patients without a molecular diagnosis. Here, we explore the incompleteness of transcript models used for exome and genome analysis as one potential explanation for lack of current diagnoses. Therefore, we have updated the GENCODE gene annotation for 191 epilepsy-associated genes, using human brain-derived transcriptomic libraries and other data to build 3,550 novel putative transcript models. The extended transcriptional footprint of these genes allowed for 294 intronic or intergenic variants, found in human mutation databases, to be reclassified as exonic, while a further 70 intronic variants were reclassified as splice-site proximal. Using SCN1A as a case study due to its close phenotype/genotype correlation with Dravet syndrome, we screened 122 people with Dravet syndrome, or a similar phenotype, with a panel of novel exon sequences representing eight established genes and identified two de novo SCN1A variants that now, through improved gene annotation, can be ascribed to residing among novel exons. These two (from 122 screened patients, 1.6%) new molecular diagnoses carry significant clinical implications. Furthermore, we identified a previously-classified SCN1A intronic Dravet-associated variant that now lies within a deeply conserved novel exon. Our findings illustrate the potential gains of thorough gene annotation in improving diagnostic yields for genetic disorders. We would expect to find new molecular diagnoses in our 191 genes, which were originally suspected by clinicians, for patients with a negative diagnosis.
- Published
- 2019
44. Integrative transcriptomic analysis suggests new autoregulatory splicing events coupled with nonsense-mediated mRNA decay
- Author
-
Dmitri D. Pervouchine, Beatrice Borsari, Yaroslav Popov, Andrew Berry, Roderic Guigó, and Adam Frankish
- Subjects
Spliceosome ,RNA Splicing ,Nonsense-mediated decay ,Biology ,03 medical and health sciences ,Exon ,0302 clinical medicine ,Genetics ,RNA Precursors ,RNA and RNA-protein complexes ,Humans ,RNA, Messenger ,RNA, Small Interfering ,Frameshift Mutation ,Gene ,3' Untranslated Regions ,030304 developmental biology ,U2AF2 ,0303 health sciences ,Serine-Arginine Splicing Factors ,Alternative splicing ,Computational Biology ,Nuclear Proteins ,RNA-Binding Proteins ,Exons ,Splicing Factor U2AF ,mRNA surveillance ,Cell biology ,Heterogeneous-Nuclear Ribonucleoprotein Group M ,Nonsense Mediated mRNA Decay ,Alternative Splicing ,Codon, Nonsense ,RNA splicing ,Spliceosomes ,Transcriptome ,030217 neurology & neurosurgery - Abstract
Nonsense-mediated decay (NMD) is a eukaryotic mRNA surveillance system that selectively degrades transcripts with premature termination codons (PTC). Many RNA-binding proteins (RBP) regulate their expression levels by a negative feedback loop, in which RBP binds its own pre-mRNA and causes alternative splicing to introduce a PTC. We present a bioinformatic analysis integrating three data sources, eCLIP assays for a large RBP panel, shRNA inactivation of NMD pathway, and shRNA-depletion of RBPs followed by RNA-seq, to identify novel such autoregulatory feedback loops. We show that RBPs frequently bind their own pre-mRNAs, their exons respond prominently to NMD pathway disruption, and that the responding exons are enriched with nearby eCLIP peaks. We confirm previously proposed models of autoregulation in SRSF7 and U2AF1 genes and present two novel models, in which (i) SFPQ binds its mRNA and promotes switching to an alternative distal 3′-UTR that is targeted by NMD, and (ii) RPS3 binding activates a poison 5′-splice site in its pre-mRNA that leads to a frame shift and degradation by NMD. We also suggest specific splicing events that could be implicated in autoregulatory feedback loops in RBM39, HNRNPM, and U2AF2 genes. The results are available through a UCSC Genome Browser track hub.
- Published
- 2019
45. Re-annotation of 191 developmental and epileptic encephalopathy-associated genes unmasks de novo variants in SCN1A
- Author
-
Stephen Abbs, Dimitrios Vitsios, Dmitri D. Pervouchine, Paul Flicek, Hannah Stamberger, Roderic Guigó, Barbara Uszczynska-Ratajczak, F. Lucy Raymond, Margarida Viola, Jennifer Harrow, Adam Frankish, Robert Petryszak, Sarah Weckhuysen, Alba Sanchis-Juan, Caroline Nava, Electra Tapanari, José M. González, Anthony Rogers, Slavé Petrovski, Anne Fabienne Lepine, Patricia Leroy, Detelina Grozeva, Marie Marthe Suner, Mark Diekhans, Gianpiero L. Cavalleri, Don Keiller, Berten Ceulemans, Nicholas Lench, Jonathan M. Mudge, Jolien Roovers, Stephen Fitzgerald, Berge A. Minassian, Charles A. Steward, Peter De Jonghe, Sanjay M. Sisodiya, and Fadi F. Hamdan
- Subjects
0301 basic medicine ,lcsh:QH426-470 ,lcsh:Medicine ,Biology ,Genome ,03 medical and health sciences ,Exon ,0302 clinical medicine ,Dravet syndrome ,Genetics ,medicine ,Molecular Biology ,Gene ,Exome ,Genetics (clinical) ,631/61/212/2301 ,Molecular medicine ,GENCODE ,lcsh:R ,article ,Gene Annotation ,medicine.disease ,Phenotype ,692/4017 ,3. Good health ,lcsh:Genetics ,030104 developmental biology ,Human medicine ,Medical genomics ,030217 neurology & neurosurgery - Abstract
The developmental and epileptic encephalopathies (DEE) are a group of rare, severe neurodevelopmental disorders, where even the most thorough sequencing studies leave 60–65% of patients without a molecular diagnosis. Here, we explore the incompleteness of transcript models used for exome and genome analysis as one potential explanation for a lack of current diagnoses. Therefore, we have updated the GENCODE gene annotation for 191 epilepsy-associated genes, using human brain-derived transcriptomic libraries and other data to build 3,550 putative transcript models. Our annotations increase the transcriptional ‘footprint’ of these genes by over 674 kb. Using SCN1A as a case study, due to its close phenotype/genotype correlation with Dravet syndrome, we screened 122 people with Dravet syndrome or a similar phenotype with a panel of exon sequences representing eight established genes and identified two de novo SCN1A variants that now - through improved gene annotation - are ascribed to residing among our exons. These two (from 122 screened people, 1.6%) molecular diagnoses carry significant clinical implications. Furthermore, we identified a previously classified SCN1A intronic Dravet syndrome-associated variant that now lies within a deeply conserved exon. Our findings illustrate the potential gains of thorough gene annotation in improving diagnostic yields for genetic disorders.
- Published
- 2019
46. Novel autoregulatory cases of alternative splicing coupled with nonsense-mediated mRNA decay
- Author
-
Yaroslav Popov, Roderic Guigó, Andrew Berry, Beatrice Borsari, Adam Frankish, and Dmitri D. Pervouchine
- Subjects
0303 health sciences ,Messenger RNA ,Nonsense-mediated decay ,Alternative splicing ,Biology ,mRNA surveillance ,Cell biology ,03 medical and health sciences ,Exon ,0302 clinical medicine ,RNA splicing ,Gene expression ,Gene ,030217 neurology & neurosurgery ,030304 developmental biology - Abstract
Nonsense-mediated decay (NMD) is a eukaryotic mRNA surveillance system that selectively degrades transcripts with premature termination codons (PTC). Many RNA-binding proteins (RBP) regulate their expression levels by a negative feedback loop, in which RBP binds its own pre-mRNA and causes alternative splicing to introduce a PTC. We present a bioinformatic framework to identify novel such autoregulatory feedback loops by combining eCLIP assays for a large panel of RBPs with the data on shRNA inactivation of NMD pathway, and shRNA-depletion of RBPs followed by RNA-seq. We show that RBPs frequently bind their own pre-mRNAs and respond prominently to NMD pathway disruption. Poison and essential exons, i.e., exons that trigger NMD when included in the mRNA or skipped, respectively, respond oppositely to the inactivation of NMD pathway and to the depletion of their host genes, which allows identification of novel autoregulatory mechanisms for a number of human RBPs. For example, SRSF7 binds its own pre-mRNA and facilitates the inclusion of two poison exons; SFPQ binding promotes switching to an alternative distal 3’-UTR that is targeted by NMD; RPS3 activates a poison 5’-splice site in its pre-mRNA that leads to a frame shift; U2AF1 binding activates one of its two mutually exclusive exons, leading to NMD; TBRG4 is regulated by cluster splicing of its two essential exons. Our results indicate that autoregulatory negative feedback loop of alternative splicing and NMD is a generic form of post-transcriptional control of gene expression.
- Published
- 2018
47. Pseudogenes in the mouse lineage: transcriptional activity and strain-specific history
- Author
-
Cristina Sisu, Ian T. Fiddes, Duncan T. Odom, David Thybert, Adam Frankish, Mark Diekhans, Mark Gerstein, Paul R. Muir, Paul Flicek, Jennifer Harrow, Thomas M. Keane, and Tim Hubbard
- Subjects
Genetics ,Annotation ,Transcription (biology) ,Pseudogene ,Genomics ,Ribosomal RNA ,Biology ,Gene ,Genome ,Reference genome - Abstract
Pseudogenes are ideal markers of genome remodeling. In turn, the mouse is an ideal platform for studying them, particularly with the availability of developmental transcriptional data and the sequencing of 18 strains. Here, we present a comprehensive genome-wide annotation of the pseudogenes in the mouse reference genome and associated strains. We compiled this by combining manual curation of over 10,000 pseudogenes with results from automatic annotation pipelines. Also, by comparing the human and mouse, we annotated 165 unitary pseudogenes in mouse, and 303 unitaries in human. We make all our annotation available through mouse.pseudogene.org. The overall mouse pseudogene repertoire (in the reference and strains) is similar to human in terms of overall size, biotype distribution (~80% processed/~20% duplicated) and top family composition (with many GAPDH and ribosomal pseudogenes). However, notable differences arise in the pseudogene age distribution, with multiple retro-transpositional bursts in mouse evolutionary history and only one in human. Furthermore, in each strain about a fifth of the pseudogenes are unique, reflecting strain-specific functions and evolution. Additionally, we find that ~15% of the pseudogenes are transcribed, a fraction similar to that for human, and that pseudogene transcription exhibits greater tissue and strain specificity compared to protein-coding genes. Finally, we show that highly transcribed parent genes tend to give rise to processed pseudogenes.
- Published
- 2018
48. Nearly all new protein-coding predictions in the CHESS database are not protein-coding
- Author
-
Mark Gerstein, Rory Johnson, Toby Hunt, Manolis Kellis, Cristina Sisu, Adam Frankish, James C. Wright, Julien Lagarde, Paul R. Muir, Michael L. Tress, Barbara Uszczynska-Ratajczak, Paul Flicek, Irwin Jungreis, Jonathan M. Mudge, and Roderic Guigó
- Subjects
0303 health sciences ,Database ,Protein domain ,Genomics ,Gene Annotation ,Biology ,computer.software_genre ,ENCODE ,Homology (biology) ,Conserved sequence ,03 medical and health sciences ,0302 clinical medicine ,False positive paradox ,computer ,Gene ,030217 neurology & neurosurgery ,030304 developmental biology - Abstract
In a 2018 paper posted to bioRxiv, Pertea et al. presented the CHESS database, a new catalog of human gene annotations that includes 1,178 new protein-coding predictions. These are based on evidence of transcription in human tissues and homology to earlier annotations in human and other mammals. Here, we reanalyze the evidence used by CHESS, and find that nearly all protein-coding predictions are false positives. We find that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false positive protein-coding predictions. More than half are homologous to only nine Alu-derived primate sequences corresponding to an erroneous and previously withdrawn Pfam protein domain. The entire set shows poor evolutionary conservation and PhyloCSF protein-coding evolutionary signatures indistinguishable from noncoding RNAs, indicating lack of protein-coding constraint. Only four predictions are supported by mass spectrometry evidence, and even those matches are inconclusive. Overall, the new protein-coding predictions are unsupported by any credible experimental or evolutionary evidence of function, result primarily from homology to genes incorrectly classified as protein-coding, and are unlikely to encode functional proteins.
- Published
- 2018
49. Multiple laboratory mouse reference genomes define strain specific haplotypes and novel functional loci
- Author
-
Dent Earl, Monica Abrudan, Benedict Paten, Adam Frankish, Lesley Shirley, Cristina Sisu, Ian T. Fiddes, Mark Gerstein, James G. R. Gilbert, Mark G. Thomas, Anne Czechanski, Duncan T. Odom, William Chow, Stephan C. Collins, Clayton E. Mathews, Mark Diekhans, Ruth Bennett, Jane E. Loveland, David J. Adams, Jingtao Lilue, Beiyuan Fu, Dirk-Dominic Dolle, Fengtang Yang, Laura G. Reinholdt, Glen Threadgold, Anne C. Ferguson-Smith, Jonathan Wood, Kim Wong, Leo Goodstadt, Paul R. Muir, Thomas M. Keane, Phan Sk, Jonathan Flint, Naomi R Park, Richard Mott, Joel Armstrong, Thybert D, Jen Harrow, Petr Danecek, Marcela K. Sjoberg-Herrera, Sarah Pelan, Anthony G. Doran, Kerstin Howe, Charles A. Steward, Mario Stanke, Binnaz Yalcin, Joanna Collins, Lelliott C, Matthew Dunn, Fabio C. P. Navarro, Michael A. Quail, Paul Flicek, James Torrance, Richard Durbin, Köenig S, Lars Romoth, and Mikhail Kolmogorov
- Subjects
Genetics ,0303 health sciences ,Strain (biology) ,Haplotype ,Genomics ,Retrotransposon ,Biology ,Genome ,03 medical and health sciences ,0302 clinical medicine ,Genome Reference Consortium ,Gene ,030217 neurology & neurosurgery ,030304 developmental biology ,Reference genome - Abstract
The most commonly employed mammalian model organism is the laboratory mouse. A wide variety of genetically diverse inbred mouse strains, representing distinct physiological states, disease susceptibilities, and biological mechanisms have been developed over the last century. We report full length draft de novo genome assemblies for 16 of the most widely used inbred strains and reveal for the first time extensive strain-specific haplotype variation. We identify and characterise 2,567 regions on the current Genome Reference Consortium mouse reference genome exhibiting the greatest sequence diversity between strains. These regions are enriched for genes involved in defence and immunity, and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. Several immune related loci, some in previously identified QTLs for disease response have novel haplotypes not present in the reference that may explain the phenotype. We used these genomes to improve the mouse reference genome resulting in the completion of 10 new gene structures, and 62 new coding loci were added to the reference genome annotation. Notably this high quality collection of genomes revealed a previously unannotated gene (Efcab3-like) encoding 5,874 amino acids, one of the largest known in the rodent lineage. Interestingly, Efcab3-like−/− mice exhibit severe size anomalies in four regions of the brain suggesting a mechanism of Efcab3-like regulating brain development.
- Published
- 2018
50. Sixteen diverse laboratory mouse reference genomes define strain specific haplotypes and novel functional loci
- Author
-
Leo Goodstadt, Mark Gerstein, Mark G. Thomas, Jingtao Lilue, Glen Threadgold, Fengtang Yang, Sarah Pelan, Jane E. Loveland, Kim Wong, Fabio C. P. Navarro, Jennifer Harrow, Ruth Bennett, Richard Durbin, Dent Earl, Monica Abrudan, Mario Stanke, David J. Adams, Adam Frankish, Son Pham, Anne Czechanski, Charles A. Steward, Jonathan Flint, Beiyuan Fu, Ian T. Fiddes, William Chow, Duncan T. Odom, Marcela K. Sjoberg-Herrera, Naomi Park, Paul Flicek, Anne C. Ferguson-Smith, James G. R. Gilbert, Chris J. Lelliott, Mikhail Kolmogorov, Mark Diekhans, Laura G. Reinholdt, Stefanie Nachtweide, Cristina Sisu, Thomas M. Keane, James Torrance, Richard Mott, Benedict Paten, Petr Danecek, Dirk-Dominik Dolle, Paul R. Muir, Ximena Ibarra-Soria, Stephan C. Collins, Binnaz Yalcin, Darren W. Logan, Lars Romoth, Matthew Dunn, Lesley Shirley, Kerstin Howe, David Thybert, Michael A. Quail, Clayton E. Mathews, Jonathan Wood, Anthony G. Doran, Joanna Collins, and Joel Armstrong
- Subjects
0301 basic medicine ,Transposable element ,[SDV.IMM] Life Sciences [q-bio]/Immunology ,Retrotransposon ,Mice, Inbred Strains ,[SDV.GEN.GA] Life Sciences [q-bio]/Genetics/Animal genetics ,Biology ,de novo assembly ,Genome ,Polymorphism, Single Nucleotide ,Article ,03 medical and health sciences ,Mice ,0302 clinical medicine ,Species Specificity ,Mice, Inbred NOD ,Animals, Laboratory ,Genetics ,Animals ,Gene ,mouse ,Phylogeny ,Mice, Inbred BALB C ,Mice, Inbred C3H ,Strain (biology) ,Haplotype ,Laboratory mouse ,allele ,Chromosome Mapping ,Molecular Sequence Annotation ,Mice, Inbred C57BL ,[SDV.GEN.GA]Life Sciences [q-bio]/Genetics/Animal genetics ,030104 developmental biology ,Haplotypes ,Genetic Loci ,Mice, Inbred DBA ,Mice, Inbred CBA ,[SDV.IMM]Life Sciences [q-bio]/Immunology ,subspecies ,030217 neurology & neurosurgery ,Reference genome - Abstract
We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development. Medical Research Council and the Wellcome Trust
- Published
- 2018