1. A combined test for feature selection on sparse metaproteomics data - alternative to missing value imputation
- Author
- Sandra Plancade, Ariane Bassignani, Blein Nicolas M, Catherine Juste, Olivier Langella, Berland M, Unité de Mathématiques et Informatique Appliquées de Toulouse (MIAT INRA), Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE), MetaGenoPolis (MGP (US 1367)), Génétique Quantitative et Evolution - Le Moulon (Génétique Végétale) (GQE-Le Moulon), AgroParisTech-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE), MICrobiologie de l'ALImentation au Service de la Santé (MICALIS), and AgroParisTech-Université Paris-Saclay-Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE)
- Subjects
0303 health sciences, Computer science, General Neuroscience, Univariate, Experimental data, Context (language use), Feature selection, General Medicine, Missing data, computer.software_genre, 01 natural sciences, General Biochemistry, Genetics and Molecular Biology, [STAT]Statistics [stat], 010104 statistics & probability, 03 medical and health sciences, [SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN], Metaproteomics, Combined test, Statistics::Methodology, Imputation (statistics), Data mining, 0101 mathematics, General Agricultural and Biological Sciences, computer, 030304 developmental biology
- Abstract
One of the difficulties encountered in the statistical analysis of metaproteomics data is the high proportion of missing values, which are usually treated by imputation. However, imputation methods rely on restrictive assumptions about the missingness mechanism, namely that values are missing “at random” or “not at random”. To circumvent these limitations in the context of feature selection in a multi-class comparison, we propose a univariate selection method that combines a test of association between missingness and classes with a test for difference of observed intensities between classes. This approach implicitly handles both missingness mechanisms. We performed a quantitative and qualitative comparison of our procedure with imputation-based feature selection methods on two experimental data sets, as well as on simulated data with various scenarios regarding the missingness mechanisms and the nature of the difference of expression (differential intensity or differential missingness). Whereas prediction performance was similar on the experimental data sets, the feature rankings and selections from the various imputation-based methods diverged strongly. We showed that the combined test reaches a compromise by correlating reasonably with the other methods, and that it remains efficient in all simulated scenarios, unlike imputation-based feature selection methods.
- Published
- 2021
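
The idea summarised in the abstract above, combining a missingness test with an observed-intensity test for one feature, can be illustrated with a minimal Python sketch. The abstract does not say which individual tests or which combination rule the authors use, so everything specific here is an assumption made for illustration: a chi-square test on the missing/observed contingency table, a Kruskal-Wallis test on the observed intensities, and Fisher's method (which presumes the two p-values are independent) to combine them. The function name `combined_test` is hypothetical.

```python
import numpy as np
from scipy import stats


def combined_test(intensities, classes):
    """Return (p_missing, p_intensity, p_combined) for a single feature.

    intensities: 1-D array of intensities, with np.nan marking missing values.
    classes: 1-D array of class labels, same length.
    Illustrative sketch only; the test choices are assumptions, not the
    published procedure.
    """
    intensities = np.asarray(intensities, dtype=float)
    classes = np.asarray(classes)
    labels = np.unique(classes)
    missing = np.isnan(intensities)

    # Test 1 (assumed): chi-square test of association between
    # missingness and class, on a (classes x {missing, observed}) table.
    table = np.array([
        [np.sum(missing & (classes == c)), np.sum(~missing & (classes == c))]
        for c in labels
    ])
    if table[:, 0].sum() == 0 or table[:, 1].sum() == 0:
        # Feature never or always missing: this test carries no information.
        p_missing = 1.0
    else:
        p_missing = stats.chi2_contingency(table)[1]

    # Test 2 (assumed): Kruskal-Wallis test on observed intensities only.
    groups = [intensities[~missing & (classes == c)] for c in labels]
    groups = [g for g in groups if g.size > 0]
    if len(groups) < 2 or np.unique(np.concatenate(groups)).size < 2:
        # Fewer than two classes observed, or all observed values identical.
        p_intensity = 1.0
    else:
        p_intensity = stats.kruskal(*groups)[1]

    # Combination (assumed): Fisher's method; clip to avoid log(0).
    pvals = np.clip([p_missing, p_intensity], 1e-300, 1.0)
    _, p_combined = stats.combine_pvalues(pvals, method="fisher")
    return p_missing, p_intensity, p_combined


# Toy feature: fully observed in class A, mostly missing and low in class B.
classes = np.array(["A"] * 10 + ["B"] * 10)
values = np.array([float(v) for v in range(10, 20)] + [np.nan] * 8 + [1.0, 2.0])
p_missing, p_intensity, p_combined = combined_test(values, classes)
print(p_missing, p_intensity, p_combined)
```

In this toy feature both signals point the same way, so the combined p-value is smaller than either component. A feature that differs only in missingness, or only in observed intensity, would still be picked up through the corresponding component, which is the behaviour the abstract attributes to the combined test.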