
A multi-source heterogeneous medical data enhancement framework based on lakehouse.

Authors :
Sheng M
Wang S
Zhang Y
Hao R
Liang Y
Luo Y
Yang W
Wang J
Li Y
Zheng W
Li W
Source :
Health information science and systems [Health Inf Sci Syst] 2024 Jul 05; Vol. 12 (1), pp. 37. Date of Electronic Publication: 2024 Jul 05 (Print Publication: 2024).
Publication Year :
2024

Abstract

Obtaining high-quality datasets from raw data is a key step before data exploration and analysis. In the medical domain, a large amount of data needs quality improvement before it can be used to analyze patients' health conditions. Data extraction, data cleaning, and data imputation have each been studied extensively. However, few frameworks integrate all three techniques, so the resulting datasets suffer in accuracy, consistency, and integrity. In this paper, a multi-source heterogeneous data enhancement framework based on a lakehouse MHDP is proposed, which comprises three steps: data extraction, data cleaning, and data imputation. In the data extraction step, a data fusion technique handles multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the state-of-the-art (SOTA) algorithm SAITS are applied for different situations. We evaluate our framework on three tasks: clustering, classification, and strategy prediction. The experimental results demonstrate the effectiveness of our data enhancement framework.

Competing Interests: All authors declare that there is no conflict of interest.

(© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2024. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.)
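
The three steps named in the abstract (extraction, cleaning, imputation) can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the sample records, and the plausibility range are all hypothetical, the cleaning rule is a simple range check rather than HoloCleanX, and the imputation is a deliberately simplified form of multiple imputation (draws from a normal distribution fitted to the observed values, with the per-dataset means pooled by averaging) rather than the MI or SAITS configurations used in the paper.

```python
import random
import statistics


def fuse_sources(*sources):
    """Extraction step (toy): merge per-patient fields from several sources by id."""
    fused = {}
    for source in sources:
        for patient_id, fields in source.items():
            fused.setdefault(patient_id, {}).update(fields)
    return fused


def clean(records, field, low, high):
    """Cleaning step (toy): mark absent or out-of-range values as missing (None)."""
    cleaned = {}
    for patient_id, fields in records.items():
        row = dict(fields)
        value = row.get(field)
        if value is not None and not (low <= value <= high):
            row[field] = None
        row.setdefault(field, None)
        cleaned[patient_id] = row
    return cleaned


def multiply_impute_mean(records, field, m=20, seed=0):
    """Imputation step (simplified MI): build m completed datasets, pool their means."""
    rng = random.Random(seed)
    observed = [r[field] for r in records.values() if r[field] is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    estimates = []
    for _ in range(m):
        # Each pass fills every missing value with a fresh random draw.
        completed = [
            r[field] if r[field] is not None else rng.gauss(mu, sigma)
            for r in records.values()
        ]
        estimates.append(statistics.mean(completed))
    # The pooled point estimate is the average of the per-dataset estimates.
    return statistics.mean(estimates)


# Hypothetical records from two sources, keyed by patient id.
lab = {"p1": {"glucose": 5.4}, "p2": {"glucose": 99.0}, "p3": {"glucose": 6.1}}
ward = {"p1": {"age": 61}, "p2": {"age": 47}, "p4": {"age": 70}}

fused = fuse_sources(lab, ward)
cleaned = clean(fused, "glucose", 2.0, 30.0)  # assumed plausible range
pooled = multiply_impute_mean(cleaned, "glucose")
```

Here the implausible reading for `p2` and the absent reading for `p4` are both treated as missing and then imputed. A full multiple-imputation procedure would also propagate uncertainty in the fitted parameters and pool variances, which this sketch omits.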

Details

Language :
English
ISSN :
2047-2501
Volume :
12
Issue :
1
Database :
MEDLINE
Journal :
Health information science and systems
Publication Type :
Academic Journal
Accession number :
38974364
Full Text :
https://doi.org/10.1007/s13755-024-00295-6