Accessible data curation and analytics for international-scale citizen science datasets

Authors :
Benjamin Murray
Eric Kerfoot
Liyuan Chen
Jie Deng
Mark S. Graham
Carole H. Sudre
Erika Molteni
Liane S. Canas
Michela Antonelli
Kerstin Klaser
Alessia Visconti
Alexander Hammers
Andrew T. Chan
Paul W. Franks
Richard Davies
Jonathan Wolf
Tim D. Spector
Claire J. Steves
Marc Modat
Sebastien Ourselin
Source :
Scientific Data, Vol 8, Iss 1, Pp 1-17 (2021)
Publication Year :
2021
Publisher :
Nature Portfolio, 2021.

Abstract

The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.

Subjects

Subjects :
Science

Details

Language :
English
ISSN :
2052-4463
Volume :
8
Issue :
1
Database :
Directory of Open Access Journals
Journal :
Scientific Data
Publication Type :
Academic Journal
Accession number :
edsdoj.4657a31169ad4964afaba25bc27b394f
Document Type :
article
Full Text :
https://doi.org/10.1038/s41597-021-01071-x