Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny.

Authors :: Hunt M
Hinrichs AS
Anderson D
Karim L
Dearlove BL
Knaggs J
Constantinides B
Fowler PW
Rodger G
Street T
Lumley S
Webster H
Sanderson T
Ruis C
Kotzen B
de Maio N
Amenga-Etego LN
Amuzu DSY
Avaro M
Awandare GA
Ayivor-Djanie R
Barkham T
Bashton M
Batty EM
Bediako Y
Belder D
Benedetti E
Bergthaler A
Boers SA
Campos J
Carr RAA
Chen YYC
Cuba F
Dattero ME
Dejnirattisai W
Dilthey A
Duedu KO
Endler L
Engelmann I
Francisco NM
Fuchs J
Gnimpieba EZ
Groc S
Gyamfi J
Heemskerk D
Houwaart T
Hsiao NY
Huska M
Hölzer M
Iranzadeh A
Jarva H
Jeewandara C
Jolly B
Joseph R
Kant R
Ki KKK
Kurkela S
Lappalainen M
Lataretu M
Lemieux J
Liu C
Malavige GN
Mashe T
Mongkolsapaya J
Montes B
Mora JAM
Morang'a CM
Mvula B
Nagarajan N
Nelson A
Ngoi JM
da Paixão JP
Panning M
Poklepovich T
Quashie PK
Ranasinghe D
Russo M
San JE
Sanderson ND
Scaria V
Screaton G
Sessions OM
Sironen T
Sisay A
Smith D
Smura T
Supasa P
Suphavilai C
Swann J
Tegally H
Tegomoh B
Vapalahti O
Walker A
Wilkinson RJ
Williamson C
Zair X
de Oliveira T
Peto TE
Crook D
Corbett-Detig R
Iqbal Z
Source :: BioRxiv : the preprint server for biology [bioRxiv] 2024 Nov 05. Date of Electronic Publication: 2024 Nov 05.
Publication Year :: 2024
Abstract: The SARS-CoV-2 genome occupies a unique place in infection biology - it is the most highly sequenced genome on earth (making up over 20% of public sequencing datasets) with fine-scale information on sampling date and geography, and has been subject to unprecedentedly intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data was sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and the lack of consistent quality control (QC) processes, the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and over the last few years many thousands of hours of researchers' time have been spent "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner, and build a cleaner phylogeny. Here we provide a global tree of 4,471,579 samples, built from a consistently assembled set of high-quality consensus sequences from all available public data as of June 2024, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminate artefactual errors, and mask the genome at low-quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers.
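The abstract mentions masking the genome at low-quality positions. As a rough illustration of that general idea only, the Python sketch below replaces consensus bases with "N" wherever per-position read depth falls below a threshold. This is not Viridian's actual algorithm or API: the function name `mask_low_depth`, the `min_depth` parameter, the use of read depth as the quality signal, and the example values are all hypothetical assumptions made for illustration.

```python
def mask_low_depth(consensus: str, depths: list[int], min_depth: int = 10) -> str:
    """Return the consensus with low-depth positions replaced by 'N'.

    Hypothetical helper for illustration; not part of Viridian.
    `depths[i]` is assumed to be the read depth supporting `consensus[i]`.
    """
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    # Keep a base only when enough reads support it; otherwise mask it.
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(consensus, depths)
    )


# Toy example with made-up depths: three positions fall below the threshold.
masked = mask_low_depth("ACGTACGT", [30, 25, 3, 40, 0, 12, 9, 50], min_depth=10)
print(masked)  # ACNTNCNT
```

Masking a position rather than calling a base there means an uncertain site is treated as missing data instead of contributing a possibly spurious mutation to the tree.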