1. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark
- Author
- Marie-Dominique Devignes, Faiez Zannad, Patrick Rossignol, Kevin Dalleau, Gregoire Preud'homme, João Pedro Ferreira, Olivier Huttin, Emmanuel Bresso, Kevin Duarte, Masatake Kobayashi, Miguel Couceiro, Malika Smaïl-Tabbone, Claire Lacomblez, Nicolas Girerd, Erwan Bozec
- Affiliations
- Combattre l'insuffisance cardiaque - FIGHT-HF2015 - ANR-15-RHUS-0004 - RHUS - VALID; Défaillance Cardiovasculaire Aiguë et Chronique (DCAC), Centre Hospitalier Régional Universitaire de Nancy (CHRU Nancy)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Université de Lorraine (UL); Centre d'investigation clinique plurithématique Pierre Drouin [Nancy] (CIC-P), Centre d'investigation clinique [Nancy] (CIC), CHRU Nancy-INSERM-UL; Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy] (INI-CRCT), Institut Lorrain du Coeur et des Vaisseaux Louis Mathieu [Nancy], French-Clinical Research Infrastructure Network - F-CRIN [Paris] (Cardiovascular & Renal Clinical Trialists - CRCT); Computational Algorithms for Protein Structures and Interactions (CAPSID), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Complex Systems, Artificial Intelligence & Robotics (LORIA - AIS), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Inria-UL-Centre National de la Recherche Scientifique (CNRS); Knowledge representation, reasoning (ORPAILLEUR), Inria-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), LORIA, CNRS-UL-Inria
- Funding
- This work and the publication of this article were funded by the Agence Nationale de la Recherche (grant number ANR-15-RHUS-0004: RHU FIGHT-HF) and by the CPER IT2MP (Contrat Plan État Région, Innovations Technologiques, Modélisation & Médecine Personnalisée) and FEDER (Fonds Européen de Développement Régional). Kevin Dalleau was recipient of a RHU-Region Lorraine doctoral fellowship.
- Subjects
0301 basic medicine, Computer science, Clustering method, Science, Rand index, Population, Merge mode, 030204 cardiovascular system & hematology, Article, Functional clustering, 03 medical and health sciences, 0302 clinical medicine, Distance or transformation, [SDV.MHEP.CSC] Life Sciences [q-bio]/Human health and pathology/Cardiology and cardiovascular system, Machine learning, Optimization algorithm, Cluster analysis, Categorical variable, Multidisciplinary, Pattern recognition, Latent class model, Medoid, Computational biology and bioinformatics, Hierarchical clustering, 030104 developmental biology, ComputingMethodologies_PATTERNRECOGNITION, Benchmark (computing), Medicine, Artificial intelligence
- Abstract
Choosing the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations of mixed variables, using 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to data from the EPHESUS randomized clinical trial (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI than model-based methods in all scenarios. When the clustering methods were applied to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes performed best in the setting of heterogeneous data.
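The abstract scores each clustering against the known simulated cluster labels with the Adjusted Rand Index. As a rough illustration only, and not the authors' R code, the pair-counting (Hubert and Arabie) form of the ARI can be sketched in Python. The function name and the toy label vectors below are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two partitions of the same items.

    Illustrative sketch; the name and signature are hypothetical.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two items: no pairs to compare, treat as agreement.
        return 1.0

    # Pairs of items placed together in both partitions (contingency cells).
    pairs_both = sum(
        comb(count, 2)
        for count in Counter(zip(labels_true, labels_pred)).values()
    )
    # Pairs placed together within each partition taken on its own.
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = pairs_true * pairs_pred / total_pairs
    maximum = (pairs_true + pairs_pred) / 2

    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Toy data (made up): six individuals in two simulated clusters.
truth = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same grouping, different label names
one_error = [0, 0, 1, 1, 1, 1]    # one individual assigned to the wrong cluster

print(adjusted_rand_index(truth, relabelled))
print(adjusted_rand_index(truth, one_error))
```

Because the index counts pairs rather than comparing label names, a clustering that recovers the true groups under different labels still scores 1, while chance-level agreement scores about 0. This is what makes it suitable for comparing methods that number their clusters arbitrarily.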
- Published
- 2021