11 results on "Yannick Marcon"
Search Results
2. Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD.
- Author
-
Yannick Marcon, Tom Bishop, Demetris Avraam, Xavier Escriba-Montagut, Patricia Ryser-Welch, Stuart Wheater, Paul Burton, and Juan R González
- Subjects
Biology (General) ,QH301-705.5 - Abstract
Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers' ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. 
To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
- Published
- 2021
3. Fostering population-based cohort data discovery: The Maelstrom Research cataloguing toolkit.
- Author
-
Julie Bergeron, Dany Doiron, Yannick Marcon, Vincent Ferretti, and Isabel Fortier
- Subjects
Medicine ,Science - Abstract
BACKGROUND: The lack of accessible and structured documentation creates major barriers for investigators interested in understanding, properly interpreting and analyzing cohort data and biological samples. Providing the scientific community with open information is essential to optimize usage of these resources. A cataloguing toolkit is proposed by Maelstrom Research to answer these needs and support the creation of comprehensive and user-friendly study- and network-specific web-based metadata catalogues. METHODS: Development of the Maelstrom Research cataloguing toolkit was initiated in 2004. It was supported by the exploration of existing catalogues and standards, and guided by input from partner initiatives having used or pilot tested incremental versions of the toolkit. RESULTS: The cataloguing toolkit is built upon two main components: a metadata model and a suite of open-source software applications. The model sets out specific fields to describe study profiles; characteristics of the subpopulations of participants; timing and design of data collection events; and datasets/variables collected at each data collection event. It also includes the possibility to annotate variables with different classification schemes. When combined, the model and software support implementation of study and variable catalogues and provide a powerful search engine to facilitate data discovery. CONCLUSIONS: The Maelstrom Research cataloguing toolkit already serves several national and international initiatives and the suite of software is available to new initiatives through the Maelstrom Research website. With the support of new and existing partners, we hope to ensure regular improvements of the toolkit.
- Published
- 2018
4. dsMTL: a computational framework for privacy-preserving, distributed multi-task machine learning
- Author
-
Han Cao, Youcheng Zhang, Jan Baumbach, Paul R Burton, Dominic Dwyer, Nikolaos Koutsouleris, Julian Matschinske, Yannick Marcon, Sivanesan Rajan, Thilo Rieg, Patricia Ryser-Welch, Julian Späth, Carl Herrmann, and Emanuel Schwarz
- Subjects
Machine Learning ,Statistics and Probability ,Computational Mathematics ,Computational Theory and Mathematics ,Privacy ,Humans ,Programming Languages ,Molecular Biology ,Biochemistry ,Software ,Algorithms ,Computer Science Applications - Abstract
Motivation In multi-cohort machine learning studies, it is critical to differentiate between effects that are reproducible across cohorts and those that are cohort-specific. Multi-task learning (MTL) is a machine learning approach that facilitates this differentiation through the simultaneous learning of prediction tasks across cohorts. Since multi-cohort data can often not be combined into a single storage solution, an MTL application for geographically distributed data sources would be of substantial utility. Results Here, we describe the development of ‘dsMTL’, a computational framework for privacy-preserving, distributed multi-task machine learning that includes three supervised and one unsupervised algorithms. First, we derive the theoretical properties of these methods and the relevant machine learning workflows to ensure the validity of the software implementation. Second, we implement dsMTL as a library for the R programming language, building on the DataSHIELD platform that supports the federated analysis of sensitive individual-level data. Third, we demonstrate the applicability of dsMTL for comorbidity modeling in distributed data. We show that comorbidity modeling using dsMTL outperformed conventional, federated machine learning, as well as the aggregation of multiple models built on the distributed datasets individually. The application of dsMTL was computationally efficient and highly scalable when applied to moderate-size (n Availability and implementation dsMTL is freely available at https://github.com/transbioZI/dsMTLBase (server-side package) and https://github.com/transbioZI/dsMTLClient (client-side package). Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2022
5. Software Application Profile: ShinyDataSHIELD—an R Shiny application to perform federated non-disclosive data analysis in multicohort studies
- Author
-
Xavier Escribà-Montagut, Yannick Marcon, Demetris Avraam, Soumya Banerjee, Tom R P Bishop, Paul Burton, and Juan R González
- Subjects
Epidemiology ,R Shiny ,DataSHIELD ,Genetic epidemiology ,General Medicine ,Multicohort studies ,Non-disclosive analysis ,Federated analysis - Abstract
Motivation DataSHIELD is an open-source software infrastructure enabling the analysis of data distributed across multiple databases (federated data) without leaking individuals’ information (non-disclosive). It has applications in many scientific domains, ranging from biosciences to social sciences and including high-throughput genomic studies. R is the language used to interact with (and build) DataSHIELD. This creates difficulties for researchers who do not have experience writing R code or lack the time to learn how to use the DataSHIELD functions. To help new researchers use the DataSHIELD infrastructure and to improve the user-friendliness for experienced researchers, we present ShinyDataSHIELD. Implementation ShinyDataSHIELD is a web application with an R backend that serves as a graphical user interface (GUI) to the DataSHIELD infrastructure. General features The version of the application presented here includes modules to perform: (i) exploratory analysis through descriptive summary statistics and graphical representations (scatter plots, histograms, heatmaps and boxplots); (ii) statistical modelling (generalized linear fixed and mixed-effects models, survival analysis through Cox regression); (iii) genome-wide association studies (GWAS); and (iv) omic analysis (transcriptomics, epigenomics and multi-omic integration). Availability ShinyDataSHIELD is publicly hosted online [https://datashield-demo.obiba.org/], the source code and user guide are deposited on Zenodo DOI 10.5281/zenodo.6500323, freely available to non-commercial users under ‘Commons Clause’ License Condition v1.0. Docker images are also available [https://hub.docker.com/r/brgelab/shiny-data-shield].
- Published
- 2023
6. Life course of retrospective harmonization initiatives: key elements to consider
- Author
-
Isabel Fortier, Tina W. Wey, Julie Bergeron, Angela Pinot de Moira, Anne-Marie Nybo-Andersen, Tom Bishop, Madeleine J. Murtagh, Milica Miočević, Morris A. Swertz, Esther van Enckevort, Yannick Marcon, Michaela Th. Mayrhofer, José Pedro Ornelas, Sylvain Sebert, Ana Cristina Santos, Artur Rocha, Rebecca C. Wilson, Lauren E. Griffith, and Paul Burton
- Subjects
longitudinal data ,PREGNANCY ,cohort studies ,POOLED ANALYSES ,Developmental Origins of Health and Disease (DOHAD) ,Medicine (miscellaneous) ,Data harmonization ,HEALTH ,data processing ,DEVELOPMENTAL ORIGINS - Abstract
Optimizing research on the developmental origins of health and disease (DOHaD) involves implementing initiatives maximizing the use of the available cohort study data; achieving sufficient statistical power to support subgroup analysis; and using participant data presenting adequate follow-up and exposure heterogeneity. It also involves being able to undertake comparison, cross-validation, or replication across data sets. To answer these requirements, cohort study data need to be findable, accessible, interoperable, and reusable (FAIR), and more particularly, it often needs to be harmonized. Harmonization is required to achieve or improve comparability of the putatively equivalent measures collected by different studies on different individuals. Although the characteristics of the research initiatives generating and using harmonized data vary extensively, all are confronted by similar issues. Having to collate, understand, process, host, and co-analyze data from individual cohort studies is particularly challenging. The scientific success and timely management of projects can be facilitated by an ensemble of factors. The current document provides an overview of the ‘life course’ of research projects requiring harmonization of existing data and highlights key elements to be considered from the inception to the end of the project.
- Published
- 2023
7. Towards an Interoperable Ecosystem of Research Cohort and Real-world Data Catalogues Enabling Multi-center Studies
- Author
-
Morris Swertz, Esther van Enckevort, José Luis Oliveira, Isabel Fortier, Julie Bergeron, Nicolas H. Thurin, Eleanor Hyde, Alexander Kellmann, Romin Pahoueshnja, Miriam Sturkenboom, Marianne Cunnington, Anne-Marie Nybo Andersen, Yannick Marcon, Gonçalo Gonçalves, Rosa Gini, and Groningen Institute for Gastro Intestinal Genetics and Immunology (3GI)
- Subjects
Cohort Studies ,Common Data Elements ,Information Dissemination ,Humans ,General Medicine ,Ecosystem - Abstract
OBJECTIVES: Existing individual-level human data cover large populations on many dimensions such as lifestyle, demography, laboratory measures, clinical parameters, etc. Recent years have seen large investments in data catalogues to FAIRify data descriptions to capitalise on this great promise, i.e. make catalogue contents more Findable, Accessible, Interoperable and Reusable. However, their valuable diversity also created heterogeneity, which poses challenges to optimally exploit their richness. METHODS: In this opinion review, we analyse catalogues for human subject research ranging from cohort studies to surveillance, administrative and healthcare records. RESULTS: We observe that while these catalogues are heterogeneous, have various scopes, and use different terminologies, still the underlying concepts seem potentially harmonizable. We propose a unified framework to enable catalogue data sharing, with catalogues of multi-center cohorts nested as a special case in catalogues of real-world data sources. Moreover, we list recommendations to create an integrated community of metadata catalogues and an open catalogue ecosystem to sustain these efforts and maximise impact. CONCLUSIONS: We propose to embrace the autonomy of motivated catalogue teams and invest in their collaboration via minimal standardisation efforts such as clear data licensing, persistent identifiers for linking same records between catalogues, minimal metadata 'common data elements' using shared ontologies, symmetric architectures for data sharing (push/pull) with clear provenance tracks to process updates and acknowledge original contributors. And most importantly, we encourage the creation of environments for collaboration and resource sharing between catalogue developers, building on international networks such as OpenAIRE and research data alliance, as well as domain specific ESFRIs such as BBMRI and ELIXIR.
- Published
- 2022
8. dsMTL - a computational framework for privacy-preserving, distributed multi-task machine learning
- Author
-
Julian Matschinske, Patricia Ryser-Welch, Yannick Marcon, Sivanesan Rajan, Nikolaos Koutsouleris, Dominic B. Dwyer, Carl Herrmann, Paul Burton, Youcheng Zhang, Thilo Rieg, Julian Späth, Han Cao, Jan Baumbach, and Emanuel Schwarz
- Subjects
Computer science ,R Programming Language ,Multi-task learning ,Machine learning ,Data sharing ,Privacy preserving ,Task (computing) ,Identification (information) ,Data Protection Act 1998 ,Limit (mathematics) ,Artificial intelligence - Abstract
Multitask learning allows the simultaneous learning of multiple ‘communicating’ algorithms. It is increasingly adopted for biomedical applications, such as the modeling of disease progression. As data protection regulations limit data sharing for such analyses, an implementation of multitask learning on geographically distributed data sources would be highly desirable. Here, we describe the development of dsMTL, a computational framework for privacy-preserving, distributed multi-task machine learning that includes three supervised and one unsupervised algorithms. dsMTL is implemented as a library for the R programming language and builds on the DataSHIELD platform that supports the federated analysis of sensitive individual-level data. We provide a comparative evaluation of dsMTL for the identification of biological signatures in distributed datasets using two case studies, and evaluate the computational performance of the supervised and unsupervised algorithms. dsMTL provides an easy-to-use framework for privacy-preserving, federated analysis of geographically distributed datasets, and has several application areas, including comorbidity modeling and translational research focused on the simultaneous prediction of different outcomes across datasets. dsMTL is available at https://github.com/transbioZI/dsMTLBase (server-side package) and https://github.com/transbioZI/dsMTLClient (client-side package).
- Published
- 2021
9. Software Application Profile: Opal and Mica: open-source software solutions for epidemiological data management, harmonization and dissemination
- Author
-
Yannick Marcon, Isabel Fortier, Paul Burton, Vincent Ferretti, and Dany Doiron
- Subjects
Canada ,Software Application Profile ,Epidemiology ,Computer science ,Data management ,Interoperability ,JavaScript ,World Wide Web ,03 medical and health sciences ,0302 clinical medicine ,Server ,Web application ,Humans ,030212 general & internal medicine ,Internet ,Information Dissemination ,General Medicine ,Application profile ,Metadata ,Epidemiologic Studies ,Database Management Systems ,Web service ,030217 neurology & neurosurgery ,Software - Abstract
Motivation Improving the dissemination of information on existing epidemiological studies and facilitating the interoperability of study databases are essential to maximizing the use of resources and accelerating improvements in health. To address this, Maelstrom Research proposes Opal and Mica, two inter-operable open-source software packages providing out-of-the-box solutions for epidemiological data management, harmonization and dissemination. Implementation Opal and Mica are two standalone but inter-operable web applications written in Java, JavaScript and PHP. They provide web services and modern user interfaces to access them. General features Opal allows users to import, manage, annotate and harmonize study data. Mica is used to build searchable web portals disseminating study and variable metadata. When used conjointly, Mica users can securely query and retrieve summary statistics on geographically dispersed Opal servers in real-time. Integration with the DataSHIELD approach allows conducting more complex federated analyses involving statistical models. Availability Opal and Mica are open-source and freely available at [www.obiba.org] under a General Public License (GPL) version 3, and the metadata models and taxonomies that accompany them are available under a Creative Commons licence.
- Published
- 2017
10. DataSHIELD: Taking the analysis to the data, not the data to the analysis
- Author
-
Kirsti Kvaløy, Joel T. Minion, Chris Dibben, Isabel Fortier, Kim W. Carter, Gillian M. Raab, Ronald P. Stolk, Paul Burton, Mathieu Boniol, Susan E. Wallace, Catherine M. Phillips, Kristian Hveem, Chris Newby, Elinor Jones, Ivan J. Perry, Maria Bota, Richard W. Francis, Seán R. Millar, Oliver Butters, Julia Isaeva, Paolo Boffetta, Nuala A. Sheehan, Andrew Turner, Isabelle Budin-Ljøsne, Lisette Giepmans, Frank Popham, Andy Boyd, John Macleod, Bruce H R Wolffenbuttel, Ipek Demir, Bartha Maria Knoppers, Carsten Oliver Schmidt, Eva Reischl, Barnaby Murtagh, Vincent Ferretti, Marja-Liisa Nuotio, Melanie Waldenberger, Philippe Laflamme, Yannick Marcon, Markus Perola, Edwin R. van den Heuvel, Jennifer R. Harris, Madeleine J Murtagh, Tero Hiekkalinna, N. Deklerk, Annette Peters, Amadou Gaye, Rebecca Wilson, Dany Doiron, Institute for Molecular Medicine Finland, Quantitative Genetics, Life Course Epidemiology (LCE), Lifestyle Medicine (LM), School of Social and Community Medicine [Bristol], University of Bristol [Bristol], McGill University Health Center [Montreal] (MUHC), Norwegian Institute of Public Health [Oslo] (NIPH), Department of Statistical Science, University College of London, University College of London [London] (UCL), Department of Infection, Immunity and Inflammation, Health Sciences, University of Leicester, Institute for Molecular Medicine Finland (FIMM), Department of Chronic Disease Prevention, Unit of Public Health Genomics, National Institute for Health and Welfare [Helsinki], Department of Health Sciences [Leicester], University of Leicester, Department of Sociology [Leicester], Department of Epidemiology, University Medical Center Groningen, University of Groningen [Groningen], Greifswald University Hospital, International Prevention Research Institute (IPRI), The Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai [New York] (MSSM), The University of Western Australia (UWA), School of Geosciences [Edinburgh], University of Edinburgh, Norwegian University of Science and Technology [Trondheim] (NTNU), Norwegian University of Science and Technology (NTNU), HRB Centre for Diet and Health Research, Department of Epidemiology and Public Health, University College Cork, Research Unit of Molecular Epidemiology, Research Center for Environmental Health, MRC/CSO Social and Public Health Sciences Unit, University of Glasgow, University of Tartu, University Medical Center Groningen, Medical Statistic, Centre of Genomics and Policy [Montréal] (CGP), McGill University = Université McGill [Montréal, Canada], University Medical Center Groningen, LifeLines Cohort Study, Department of Endocrinology, University Medical Center Groningen, Ontario Institute for Cancer Research [Canada] (OICR), and Ontario Institute for Cancer Research
- Subjects
Biomedical Research ,DataSHIELD ,ELSI ,Bioinformatics ,Confidentiality ,Disclosure ,Distributed Computing ,Intellectual Property ,Pooled Analysis ,Privacy ,Databases, Factual ,Epidemiology ,Pooling ,Datasets as Topic ,Information Storage and Retrieval ,DISEASE ,Firewall (construction) ,Medicine ,Data processing ,General Medicine ,3142 Public health care science, environmental and occupational health ,3. Good health ,INDIVIDUAL-LEVEL ,Data Matters ,MODELS ,EPIDEMIOLOGIC RESEARCH ,RS ,Humans ,QUALITY ,GENOME-WIDE ASSOCIATION ,Computer Security ,METAANALYSIS ,Data collection ,DATASHAPER APPROACH ,Computational Biology ,Great Britain ,Data science ,United Kingdom ,Data set ,Data access ,COHORT PROFILE ,[SDV.SPEE]Life Sciences [q-bio]/Santé publique et épidémiologie - Abstract
Int J Epidemiol. 2014 Dec;43(6):1929-44. doi: 10.1093/ije/dyu188. Epub 2014 Sep 26. Funding: MC_UP_A540_1021/Medical Research Council/United Kingdom; MR/K006525/1/Medical Research Council/United Kingdom; Wellcome Trust/United Kingdom. BACKGROUND: Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK's proposed 'care.data' initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data.
METHODS: Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC. RESULTS: Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach. CONCLUSIONS: DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a research group (e.g. in a developing nation) is particularly vulnerable to loss of intellectual property-the researchers want to fully share the information held in their data with national and international collaborators, but do not wish to hand over the physical data themselves; and (iii) a data set is to be included in an individual-level co-analysis but the physical size of the data precludes direct transfer to a new site for analysis.
- Published
- 2014
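The METHODS summary in the DataSHIELD abstract above (item 10) describes an analysis computer that sends commands to several data computers and receives only non-disclosive summary statistics in return. As a rough illustration of that idea, and not the DataSHIELD implementation itself (which the abstract says is built on R and Opal), the Python sketch below pools a mean and variance from per-server aggregates. The class name, function names and the `MIN_COUNT` threshold are all assumptions invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical disclosure threshold: a server refuses to release
# aggregates computed on fewer observations than this.
MIN_COUNT = 5


@dataclass
class DataComputer:
    """Stands in for one study server; raw values never leave this object."""
    name: str
    values: List[float]

    def summary(self) -> Optional[Tuple[int, float, float]]:
        """Return (n, sum, sum of squares), or None if n is below MIN_COUNT."""
        n = len(self.values)
        if n < MIN_COUNT:
            return None
        return n, sum(self.values), sum(v * v for v in self.values)


def pooled_mean_and_variance(servers: List[DataComputer]) -> Tuple[float, float]:
    """Run on the analysis computer: combine released aggregates only."""
    n, total, total_sq = 0, 0.0, 0.0
    for server in servers:
        released = server.summary()
        if released is None:
            continue  # this server declined to release anything
        count, s, sq = released
        n += count
        total += s
        total_sq += sq
    if n < 2:
        raise ValueError("not enough released observations")
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, variance


if __name__ == "__main__":
    servers = [
        DataComputer("study_a", [1, 2, 3, 4, 5]),
        DataComputer("study_b", [6, 7, 8, 9, 10]),
        DataComputer("study_c", [100, 200]),  # too small, contributes nothing
    ]
    print(pooled_mean_and_variance(servers))
```

The pooled result matches what a single analysis of the combined released data would give, because counts, sums and sums of squares are additive across servers; the individual-level values stay where they are.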
11. Data harmonization and federated analysis of population-based studies
- Author
-
Ronald P. Stolk, Anne-Marie Tassé, Rolf Holle, Amadou Gaye, Luisa Foco, Yannick Marcon, Melanie Waldenberger, Dany Doiron, Markus Perola, Bruce H. R. Wolffenbuttel, Cosetta Minelli, Vincent Ferretti, Kirsti Kvaløy, Hans L. Hillege, Paul Burton, Isabel Fortier, Life Course Epidemiology (LCE), Lifestyle Medicine (LM), Cardiovascular Centre (CVC), Groningen Kidney Center (GKC), and Center for Liver, Digestive and Metabolic Diseases (CLDM)
- Subjects
Computer science ,Epidemiology ,Population ,Harmonization ,Public Health And Health Services ,03 medical and health sciences ,0302 clinical medicine ,Excellence ,030212 general & internal medicine ,European union ,030304 developmental biology ,0303 health sciences ,Data dictionary ,Analytic Perspective ,Biobank ,Data science ,The Internet ,Data integration - Abstract
Background Individual-level data pooling of large population-based studies across research centres in international research projects faces many hurdles. The BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) project aims to address these issues by building a collaborative group of investigators and developing tools for data harmonization, database integration and federated data analyses. Methods Eight population-based studies in six European countries were recruited to participate in the BioSHaRE project. Through workshops, teleconferences and electronic communications, participating investigators identified a set of 96 variables targeted for harmonization to answer research questions of interest. Using each study’s questionnaires, standard operating procedures, and data dictionaries, harmonization potential was assessed. Whenever harmonization was deemed possible, processing algorithms were developed and implemented in an open-source software infrastructure to transform study-specific data into the target (i.e. harmonized) format. Harmonized datasets located on servers in each research centre across Europe were interconnected through a federated database system to perform statistical analysis. Results Retrospective harmonization led to the generation of common format variables for 73% of matches considered (96 targeted variables across 8 studies). Authenticated investigators can now perform complex statistical analyses of harmonized datasets stored on distributed servers without actually sharing individual-level data using the DataSHIELD method. Conclusion New Internet-based networking technologies and database management systems are providing the means to support collaborative, multi-center research in an efficient and secure manner.
The results from this pilot project show that, given a strong collaborative relationship between participating studies, it is possible to seamlessly co-analyse internationally harmonized research databases while allowing each study to retain full control over individual-level data. We encourage additional collaborative research networks in epidemiology, public health, and the social sciences to make use of the open source tools presented herein. © Doiron et al.; licensee BioMed Central Ltd. 2013 This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
- Published
- 2013
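The Methods summary in item 11 above describes assessing, study by study, whether each target variable can be harmonized, and then applying processing algorithms that transform study-specific data into the target format. The Python sketch below is one possible way to picture that step; it is not the BioSHaRE software. The study names, variable names and rules are hypothetical, and a rule of `None` marks a study/variable pair judged not harmonizable.

```python
from typing import Any, Callable, Dict, Optional

Record = Dict[str, Any]
Rule = Optional[Callable[[Record], Any]]

# Hypothetical harmonization rules: target variable -> processing function.
# None means the study did not collect anything that maps to the target.
RULES: Dict[str, Dict[str, Rule]] = {
    "study_a": {
        "height_cm": lambda r: r["height_m"] * 100,     # metres to centimetres
        "current_smoker": lambda r: r["smoke"] == "yes",
    },
    "study_b": {
        "height_cm": lambda r: r["hgt_cm"],             # already in centimetres
        "current_smoker": None,                         # not harmonizable
    },
}


def harmonize(study: str, record: Record) -> Record:
    """Apply a study's rules to one record, skipping non-harmonizable targets."""
    return {
        target: rule(record)
        for target, rule in RULES[study].items()
        if rule is not None
    }


def harmonization_rate(rules: Dict[str, Dict[str, Rule]]) -> float:
    """Share of study/variable pairs for which a processing rule exists."""
    pairs = [rule for per_study in rules.values() for rule in per_study.values()]
    return sum(rule is not None for rule in pairs) / len(pairs)


if __name__ == "__main__":
    print(harmonize("study_a", {"height_m": 1.5, "smoke": "no"}))
    print(harmonize("study_b", {"hgt_cm": 172}))
    print(harmonization_rate(RULES))
```

A rate computed this way over study/variable pairs is the same kind of figure as the percentage of matches reported in the abstract, although the toy rules here say nothing about the actual BioSHaRE variables.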
Discovery Service for Jio Institute Digital Library