Optimizing the synthesis of clinical trial data using sequential trees
- Source :
- Journal of the American Medical Informatics Association : JAMIA
- Publication Year :
- 2020
- Publisher :
- Oxford University Press (OUP), 2020.
- Abstract :
- Objective: With the growing demand for sharing clinical trial data, scalable methods to enable privacy-protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data is dependent on the variable order. No assessments of the impact of variable order on synthesized clinical trial data have been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as variable order is randomly shuffled and implement an optimization algorithm to find a good order if variability is too high.
  Materials and Methods: Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm was implemented to optimize variable order, and was compared with a curriculum learning approach to ordering variables.
  Results: As the number of variables in a clinical trial dataset increases, there is a pattern of a marked increase in variability of data utility with order. Particle swarm with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting, which can create a privacy problem. This was superior to curriculum learning in terms of utility.
  Conclusions: The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.
- Subjects :
- AcademicSubjects/SCI01060
data synthesis
Computer science
data sharing
Datasets as Topic
Health Informatics
Overfitting
Research and Applications
computer.software_genre
01 natural sciences
Synthetic data
010104 statistics & probability
03 medical and health sciences
0302 clinical medicine
Data Anonymization
Hinge loss
Humans
030212 general & internal medicine
0101 mathematics
AcademicSubjects/MED00580
Analysis of Variance
Clinical Trials as Topic
Information Dissemination
secondary use
Univariate
Particle swarm optimization
privacy enhancing technologies
Featured
Data sharing
Variable (computer science)
clinical trial transparency
Metric (mathematics)
Data mining
AcademicSubjects/SCI01530
computer
Algorithms
Confidentiality
Details
- ISSN :
- 1527-974X
- Volume :
- 28
- Database :
- OpenAIRE
- Journal :
- Journal of the American Medical Informatics Association
- Accession number :
- edsair.doi.dedup.....47898bdf8dc18a706b8bc1ead99bd9fc
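
Illustration of the "distinguishability hinge loss" named in the abstract. This is a minimal sketch and not the authors' implementation: the abstract does not define the distinguishability metric or the threshold value. The sketch assumes distinguishability is measured as the mean squared deviation of classifier propensity scores from 0.5 (a common choice for real-versus-synthetic comparisons), and the names `propensity_mse`, `hinge_loss`, and the threshold `0.05` are hypothetical.

```python
from statistics import mean


def propensity_mse(propensities):
    """Mean squared deviation of propensity scores from 0.5.

    Assumed metric (not taken from the record): each score is a
    classifier's estimated probability that a record is synthetic.
    0.0 means real and synthetic records are indistinguishable;
    0.25 is the maximum, reached when every score is exactly 0 or 1.
    """
    return mean((p - 0.5) ** 2 for p in propensities)


def hinge_loss(distinguishability, threshold):
    """Zero at or below the threshold, growing linearly above it."""
    return max(0.0, distinguishability - threshold)


# Hypothetical propensity scores for five records.
scores = [0.5, 0.6, 0.4, 0.7, 0.3]
d = propensity_mse(scores)

# Hypothetical threshold; the value used in the study is not given here.
loss = hinge_loss(d, threshold=0.05)
print(f"distinguishability={d:.4f}, hinge loss={loss:.4f}")
```

Under these assumptions, an optimizer minimizing the hinge loss gains nothing from pushing distinguishability below the threshold, which matches the abstract's stated reason for the threshold: avoiding synthetic data that overfits the real data.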