Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV

Authors :
Kuo, Nicholas I-Hsien
Garcia, Federico
Sönnerborg, Anders
Zazzi, Maurizio
Böhm, Michael
Kaiser, Rolf
Polizzotto, Mark
Jorm, Louisa
Barbieri, Sebastiano
Publication Year :
2022

Abstract

Clinical data usually cannot be freely distributed due to their highly confidential nature, and this hampers the development of machine learning in the healthcare domain. One way to mitigate this problem is to generate realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse, and thus produce outputs of low diversity. This lowers the quality of synthetic healthcare data and may cause it to omit patients of minority demographics or neglect less common clinical practices. In this paper, we extend the classic GAN setup with an additional variational autoencoder (VAE) and include an external memory to replay latent features observed from the real samples to the GAN generator. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup overcomes mode collapse and generates a synthetic dataset that accurately describes the severely imbalanced class distributions commonly found in real-world clinical variables. In addition, we demonstrate that our synthetic dataset is associated with a very low patient disclosure risk, and that it retains a high level of utility from the ground truth dataset to support the development of downstream machine learning algorithms.

Comment: In the near future, we will make our code and synthetic datasets publicly available to facilitate future research. Follow us on https://healthgym.ai/

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2208.08655
Document Type :
Working Paper