51. Handling incomplete heterogeneous data using VAEs
- Author
Pablo M. Olmos, Isabel Valera, Zoubin Ghahramani (ORCID: 0000-0002-7464-6475), and Alfredo Nazábal; contributors: Comunidad de Madrid, Ministerio de Ciencia e Innovación (España), European Commission, and Apollo - University of Cambridge Repository
- Subjects
FOS: Computer and information sciences; Computer Science - Machine Learning; Computer science; Computer Science - Artificial Intelligence; Machine Learning (stat.ML); 02 engineering and technology; Machine learning; 01 natural sciences; Machine Learning (cs.LG); Artificial Intelligence; Statistics - Machine Learning; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Incomplete heterogeneous data; Imputation (statistics); 010306 general physics; Categorical variable; Telecomunicaciones; Generative models; Missing data; Artificial Intelligence (cs.AI); Signal Processing; 020201 artificial intelligence & image processing; Computer Vision and Pattern Recognition; Variational autoencoders; Software; Count data
- Abstract
Variational autoencoders (VAEs), as well as other generative models, have been shown to be efficient and accurate for capturing the latent structure of vast amounts of complex high-dimensional data. However, existing VAEs still cannot directly handle data that are heterogeneous (mixed continuous and discrete) or incomplete (with data missing at random), both of which are common in real-world applications. In this paper, we propose a general framework to design VAEs suitable for fitting incomplete heterogeneous data. The proposed HI-VAE includes likelihood models for real-valued, positive real-valued, interval, categorical, ordinal and count data, and allows accurate estimation (and potentially imputation) of missing data. Furthermore, HI-VAE presents competitive predictive performance in supervised tasks, outperforming supervised models when trained on incomplete data.
The authors wish to thank Christopher K. I. Williams for fruitful discussions and helpful comments on the manuscript. Alfredo Nazábal would like to acknowledge the funding provided by the UK Government's Defence & Security Programme in support of the Alan Turing Institute, EPSRC Grant EP/N510129/1. The work of Pablo M. Olmos is supported by the Spanish government MCI under grant RTI2018-099655-B-100, by Comunidad de Madrid under grants IND2017/TIC-7618, IND2018/TIC-9649, and Y2018/TCS-4705, by the BBVA Foundation under the Deep-DARWiN project, and by the European Union (FEDER and the European Research Council (ERC) through the European Union's Horizon 2020 research and innovation program under Grant 714161). Zoubin Ghahramani acknowledges support from the Alan Turing Institute (EPSRC Grant EP/N510129/1) and EPSRC Grant EP/N014162/1, and donations from Google and Microsoft Research. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
- Published
2020