Shortcomings of SARS-CoV-2 genomic metadata
- Publication Year :
- 2020
- Publisher :
- Center for Open Science, 2020.
Abstract
- Metadata is integral to data-driven association studies relevant to epidemiology, viral population dynamics and public health response. However, SARS-CoV-2 metadata quality remains inadequate. Here I exemplify this through a brief analysis of two metadata categories in the GISAID SARS-CoV-2 genomic database: “originating lab” and “submitting lab”. My analysis reveals a startling prevalence of spelling errors and inconsistent naming conventions, which together occur in an estimated ~9.8% of “originating labs” and ~11.6% of “submitting labs”. In addition, I find numerous ambiguous lab names, such as “Biology Dpt” and “Hospital”, which provide very little information about the actual source of a sample and could easily be associated with multiple sources worldwide. Importantly, all of these issues can undermine association studies and reduce their accuracy by making a group of samples appear to come from multiple sources when they in fact all come from one source, or vice versa. GISAID’s “originating lab” and “submitting lab” categories are specifically relevant to identifying problematic sites in SARS-CoV-2 genomic data through lab association. Thus, I advocate that both data submitters and maintainers strive for a higher standard of metadata quality now and in the future.
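To illustrate how spelling errors and inconsistent naming can split one source into several, the following is a minimal sketch in Python. It is not the method used in the analysis above. The lab names are invented examples, and the helper names `normalize` and `group_variants` and the 0.9 similarity threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def group_variants(names, threshold=0.9):
    """Greedily cluster names whose normalized forms are near-identical.

    Each name is compared against the first member of every existing
    cluster; it joins the first cluster whose similarity ratio meets the
    (hypothetical) threshold, otherwise it starts a new cluster.
    """
    groups = []  # list of (normalized representative, [original names])
    for name in names:
        norm = normalize(name)
        for rep, members in groups:
            if SequenceMatcher(None, rep, norm).ratio() >= threshold:
                members.append(name)
                break
        else:
            groups.append((norm, [name]))
    return [members for _, members in groups]


# Invented example values, not taken from GISAID.
labs = [
    "National Institute of Virology",
    "National Institute of Virology.",   # trailing punctuation
    "National Insitute of Virology",     # spelling error
    "national institute  of virology",   # case and spacing differences
    "Hospital",                          # ambiguous name
]

clusters = group_variants(labs)
for members in clusters:
    print(len(members), members)
```

Under these assumptions, the four variants of the first name fall into one cluster, whereas a naive exact-match count would treat them as four distinct labs. String similarity cannot resolve an ambiguous name such as "Hospital", which carries too little information to be assigned to any particular source.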
Details
- Database :
- OpenAIRE
- Accession number :
- edsair.doi...........aeedddfe29d5c9db239c988354c6a89a