
An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria

Authors :
Alioune Badara Ly
Richard Paul
Aliou Diop
Jean-François Trape
Jean-François Bureau
Gaoussou Diakhaby
Anavaj Sakuntabhai
Cheikh Loucoubar
Fatoumata Diene Sarr
Avner Bar-Hen
Cheikh Sokhna
Adama Tall
Abdoulaye Badiane
Augustin Huret
Joseph Faye
Pathogénie Virale
Institut Pasteur [Paris] (IP)-Institut National de la Santé et de la Recherche Médicale (INSERM)
Mathématiques Appliquées Paris 5 (MAP5 - UMR 8145)
Université Paris Descartes - Paris 5 (UPD5)-Institut National des Sciences Mathématiques et de leurs Interactions (INSMI)-Centre National de la Recherche Scientifique (CNRS)
Institut Pasteur de Dakar
Réseau International des Instituts Pasteur (RIIP)
École des Hautes Études en Santé Publique [EHESP] (EHESP)
Institute of Health and Science [Paris, France]
Paludologie afrotropicale
Institut de recherche pour le développement [Dakar, Sénégal] (IRD Hann Maristes)
Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE)
Institut de Recherche pour le Développement (IRD)-Aix Marseille Université (AMU)-Institut National de la Santé et de la Recherche Médicale (INSERM)-IFR48
Institut des sciences biologiques (INSB-CNRS)-Institut des sciences biologiques (INSB-CNRS)-Centre National de la Recherche Scientifique (CNRS)
Université Gaston Berger de Saint-Louis Sénégal (UGB)
Center of Excellence for Vectors and Vector-Borne Diseases (CVVD)
Mahidol University [Bangkok]
Funding was provided by Institut Pasteur and the Ecole des Hautes Etudes en Santé Publique.
Institut Pasteur [Paris]-Institut National de la Santé et de la Recherche Médicale (INSERM)
INSB-INSB-Centre National de la Recherche Scientifique (CNRS)
Lassailly-Bondaz, Anne
Source :
PLoS ONE, Public Library of Science, 2011, 6 (9), pp.e24085. ⟨10.1371/journal.pone.0024085⟩
Publication Year :
2011
Publisher :
HAL CCSD, 2011.

Abstract

Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-Euclidean data mining tool, HyperCubeH, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, or rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992-2003, aged 1-5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCubeH rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCubeH efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.
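
To make the idea of a rule and its Relative Risk concrete, the following is a minimal Python sketch, not the HyperCubeH algorithm itself. It assumes that the Relative Risk of a rule is the event rate among observations satisfying the rule divided by the event rate in the whole data set; the abstract does not spell out the exact definition. The `Observation` fields, the thresholds in `example_rule`, and the toy records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Observation:
    """One hypothetical person-period record (field names are assumptions)."""
    age_years: float
    hemoglobin: str
    prior_pm_infections: int
    episode: bool  # True if a clinical malaria episode occurred


def rule_relative_risk(
    observations: Iterable[Observation],
    rule: Callable[[Observation], bool],
) -> float:
    """Event rate inside the rule divided by the overall event rate.

    This ratio is an assumed definition of Relative Risk, used here
    only to illustrate 'over density' of events inside a rule.
    """
    obs = list(observations)
    inside = [o for o in obs if rule(o)]
    if not obs or not inside:
        raise ValueError("need at least one observation inside the rule")
    overall_rate = sum(o.episode for o in obs) / len(obs)
    if overall_rate == 0:
        raise ValueError("no events in the data set")
    rule_rate = sum(o.episode for o in inside) / len(inside)
    return rule_rate / overall_rate


def example_rule(o: Observation) -> bool:
    # Illustrative thresholds only; a rule is a conjunction of
    # conditions, each restricting one explanatory variable.
    return (
        1 <= o.age_years <= 5
        and o.hemoglobin == "AA"
        and o.prior_pm_infections <= 10
    )


# Toy data, invented for this sketch (not from the study).
toy = [
    Observation(3, "AA", 2, True),
    Observation(4, "AA", 0, True),
    Observation(2, "AA", 5, False),
    Observation(2, "AS", 1, False),
    Observation(30, "AA", 3, False),
    Observation(12, "AA", 12, True),
    Observation(45, "AS", 0, False),
    Observation(4, "AA", 15, False),
]

rr = rule_relative_risk(toy, example_rule)
print(f"Relative Risk of the example rule: {rr:.2f}")
```

An exhaustive search in this spirit would evaluate many candidate rules (different variables and different cut points for quantitative variables) and keep those with high Relative Risk and enough observations; the sketch above scores a single rule only.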

Subjects

Subjects :
Male
[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS]
MESH: Logistic Models
Plasmodium malariae
Space (commercial competition)
Protozoology
computer.software_genre
MESH: Risk Assessment
0302 clinical medicine
MESH: Child
lcsh:Science
Child
0303 health sciences
education.field_of_study
Statistics
Prognosis
MESH: Infant
Outcome (probability)
3. Good health
MESH: Reproducibility of Results
Child, Preschool
Medicine
Plasmodium falciparum
MESH: Glucosephosphate Dehydrogenase
MESH: Algorithms
Biostatistics
Risk Assessment
Microbiology
MESH: Prognosis
ABO Blood-Group System
03 medical and health sciences
MESH: Polymorphism, Genetic
Humans
Hemoglobin
education
Biology
Data mining
Polymorphism, Genetic
MESH: Humans
MESH: Data Mining
lcsh:R
MESH: Child, Preschool
Nonparametric statistics
Infant
Computational Biology
Logistic Models
Mutation
Computer Science
Parasitic Protozoans
lcsh:Q
MESH: Female
Mathematics
Plasmodium
Multivariate analysis
Statistical methods
lcsh:Medicine
MESH: ABO Blood-Group System
Computer Applications
Engineering
Risk Factors
MESH: Risk Factors
MESH: Plasmodium malariae
MESH: Plasmodium falciparum
Multidisciplinary
Parasitic diseases
Infectious Diseases
Web-Based Applications
Female
[SDV.MP.PAR] Life Sciences [q-bio]/Microbiology and Parasitology/Parasitology
Algorithms
Research Article
MESH: Mutation
Clinical Research Design
030231 tropical medicine
Population
MESH: Malaria
Decision tree
Glucosephosphate Dehydrogenase
MESH: Multivariate Analysis
Malarial parasites
030304 developmental biology
Reproducibility of Results
Tropical Diseases (Non-Neglected)
biology.organism_classification
MESH: Male
Malaria
Data set
Medical risk factors
[SDV.SPEE] Life Sciences [q-bio]/Santé publique et épidémiologie
Multivariate Analysis
Signal Processing
computer

Details

Language :
English
ISSN :
19326203
Database :
OpenAIRE
Journal :
PLoS ONE, Public Library of Science, 2011, 6 (9), pp.e24085. ⟨10.1371/journal.pone.0024085⟩
Accession number :
edsair.doi.dedup.....f96c9d0c1559648463f28fe78ce3d1d5