48 results on "Catherine Matias"
Search Results
2. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph.
- Author
Guillaume Gautreau, Adelme Bazin, Mathieu Gachet, Rémi Planel, Laura Burlot, Mathieu Dubois, Amandine Perrin, Claudine Médigue, Alexandra Calteau, Stéphane Cruveiller, Catherine Matias, Christophe Ambroise, Eduardo P C Rocha, and David Vallenet
- Subjects
Biology (General); QH301-705.5
- Abstract
The use of comparative genomics for functional, evolutionary, and epidemiological studies requires methods to classify gene families in terms of occurrence in a given species. These methods usually lack multivariate statistical models to infer the partitions and the optimal number of classes and don't account for genome organization. We introduce a graph structure to model pangenomes in which nodes represent gene families and edges represent genomic neighborhood. Our method, named PPanGGOLiN, partitions nodes using an Expectation-Maximization algorithm based on multivariate Bernoulli Mixture Model coupled with a Markov Random Field. This approach takes into account the topology of the graph and the presence/absence of genes in pangenomes to classify gene families into persistent, cloud, and one or several shell partitions. By analyzing the partitioned pangenome graphs of isolate genomes from 439 species and metagenome-assembled genomes from 78 species, we demonstrate that our method is effective in estimating the persistent genome. Interestingly, it shows that the shell genome is a key element to understand genome dynamics, presumably because it reflects how genes present at intermediate frequencies drive adaptation of species, and its proportion in genomes is independent of genome size. The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics. The software is freely available at https://github.com/labgem/PPanGGOLiN.
- Published
- 2020
3. Nine quick tips for analyzing network data.
- Author
Vincent Miele, Catherine Matias, Stéphane Robin, and Stéphane Dray
- Subjects
Biology (General); QH301-705.5
- Published
- 2019
4. Revealing the hidden structure of dynamic ecological networks
- Author
Vincent Miele and Catherine Matias
- Subjects
dynamic networks; network clustering; animal contact network; trophic network; stochastic block model; Science
- Abstract
In ecology, recent technological advances and long-term data studies now provide longitudinal interaction data (e.g. between individuals or species). Most often, time is the parameter along which interactions evolve but any other one-dimensional gradient (temperature, altitude, depth, humidity, etc.) can be considered. These data can be modelled through a sequence of different snapshots of an evolving ecological network, i.e. a dynamic network. Here, we present how the dynamic stochastic block model approach developed by Matias & Miele (Matias & Miele In press J. R. Stat. Soc. B (doi:10.1111/rssb.12200)) can capture the complexity and dynamics of these networks. First, we analyse a dynamic contact network of ants and we observe a clear high-level assembly with some variations in time at the individual level. Second, we explore the structure of a food web evolving during a year and we detect a stable predator–prey organization but also seasonal differences in the prey assemblage. Our approach, based on a rigorous statistical method implemented in the R package dynsbm, can pave the way for exploration of evolving ecological networks.
- Published
- 2017
5. Comparison of modularity-based approaches for nodes clustering in binary hypergraphs.
- Author
Veronica Poda and Catherine Matias
- Published
- 2024
6. Properties of the stochastic approximation EM algorithm with mini-batch sampling.
- Author
Estelle Kuhn, Catherine Matias, and Tabea Rebafka
- Published
- 2020
7. Spectral density of random graphs: convergence properties and application in model fitting.
- Author
Suzana de Siqueira Santos, André Fujita, and Catherine Matias
- Published
- 2021
8. Exploring the Robustness of the Parsimonious Reconciliation Method in Host-Symbiont Cophylogeny.
- Author
Laura Urbini, Blerina Sinaimeri, Catherine Matias, and Marie-France Sagot
- Published
- 2019
9. Robustness of the Parsimonious Reconciliation Method in Cophylogeny.
- Author
Laura Urbini, Blerina Sinaimeri, Catherine Matias, and Marie-France Sagot
- Published
- 2016
10. A stochastic block model for hypergraphs
- Author
Luca Brusa and Catherine Matias
- Abstract
Over the past few decades a broad variety of models has been developed for graphs. However, modern applications in various fields highlighted the need to account for higher-order interactions, to include the information deriving from groups of three or more nodes. Simple examples include group interactions in social networks, scientific co-authorship, interactions between more than two species in ecological models or high-order correlations between neurons in brain networks. Hypergraphs provide the most general formalization of higher-order interactions: similarly to a graph, a hypergraph is defined as a set of nodes and a set of hyperedges, the latter specifying nodes taking part in each interaction. We propose a stochastic block model for hypergraphs to perform model-based clustering, capturing the information deriving from higher-order interactions. The formulation is sufficiently flexible to account for possible simplified latent structures. A variational expectation-maximization algorithm is developed to perform parameter estimation and model selection is explored using the ICL criterion. The model is applied to both simulated and real data, and the performance of the proposal is assessed in terms of parameter estimation and ability to recover the clusters. The estimation algorithm was implemented in C++ language and it was made available for the R software.
- Published
- 2022
11. A stochastic block model for hypergraphs
- Author
Luca Brusa and Catherine Matias
- Subjects
variational EM algorithm; network; latent variable model; model-based clustering
- Abstract
Over the past few decades a broad variety of models has been developed for graphs. However, modern applications in various fields highlighted the need to account for higher-order interactions, to include the information deriving from groups of three or more nodes. Simple examples include group interactions in social networks, scientific co-authorship, interactions between more than two species in ecological models or high-order correlations between neurons in brain networks. Hypergraphs provide the most general formalization of higher-order interactions: similarly to a graph, a hypergraph is defined as a set of nodes and a set of hyperedges, the latter specifying nodes taking part in each interaction. We propose a stochastic block model for hypergraphs to perform model-based clustering, capturing the information deriving from higher-order interactions. The formulation is sufficiently flexible to account for possible simplified latent structures. A variational expectation-maximization algorithm is developed to perform parameter estimation and model selection is explored using the ICL criterion. The model is applied to both simulated and real data, and the performance of the proposal is assessed in terms of parameter estimation and ability to recover the clusters. The estimation algorithm was implemented in C++ language and it was made available for the R software.
- Published
- 2022
12. An appraisal of graph embeddings for comparing trophic network architectures
- Author
Stéphane Dray, Wilfried Thuiller, Catherine Matias, Vincent Miele, Christophe Botella, Laboratoire d'Ecologie Alpine (LECA ), Université Savoie Mont Blanc (USMB [Université de Savoie] [Université de Chambéry])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA), Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Probabilités, Statistiques et Modélisations (LPSM (UMR_8001)), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP), Laboratoire de Probabilités, Statistique et Modélisation (LPSM (UMR_8001)), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université Paris Cité (UPCité), ANR-18-CE02-0010,EcoNet,Modèles statistiques avancés pour les réseaux écologiques(2018), and ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019)
- Subjects
0106 biological sciences; food-webs; Property (programming); Computer science; Graph embedding; dimension reduction; Space (commercial competition); Machine learning; computer.software_genre; 010603 evolutionary biology; 01 natural sciences; 03 medical and health sciences; trophic networks; [STAT.ML]Statistics [stat]/Machine Learning [stat.ML]; Robustness (computer science); trophic groups; [MATH]Mathematics [math]; Ecology, Evolution, Behavior and Systematics; 030304 developmental biology; graph embedding; 0303 health sciences; Network architecture; species interactions; [STAT.AP]Statistics [stat]/Applications [stat.AP]; evaluation; business.industry; Ecological Modeling; Dimensionality reduction; ecological interaction networks; [STAT]Statistics [stat]; [SDE]Environmental Sciences; Graph (abstract data type); Embedding; Artificial intelligence; [SDE.BE]Environmental Sciences/Biodiversity and Ecology; business; computer; [STAT.ME]Statistics [stat]/Methodology [stat.ME]
- Abstract
Comparing the architecture of interaction networks in space or time is essential for understanding the assembly, trajectory, functioning and persistence of species communities. Graph embedding methods, which position networks into a vector space where nearby networks have similar architectures, could be ideal tools for this purpose. Here, we evaluated the ability of seven graph embedding methods to disentangle architectural similarities of interaction networks for supervised and unsupervised posterior analytic tasks. The evaluation was carried out over a large number of simulated trophic networks representing variations around six ecological properties and size. We did not find an overall best method and instead showed that the performance of the methods depended on the targeted ecological properties and thus on the research questions. We also highlighted the importance of normalizing the embedding for network sizes for meaningful posterior unsupervised analyses. We concluded by orientating potential users to the most suited methods given the question, the targeted network ecological property, and outlined links between those ecological properties and three ecological processes: robustness to extinction, community persistence and ecosystem functioning. We hope this study will stimulate the appropriation of graph embedding methods by ecologists.
- Published
- 2021
13. Spectral density of random graphs: convergence properties and application in model fitting
- Author
André Fujita, Suzana de Siqueira Santos, Catherine Matias, Universidade de São Paulo (USP), Fundacao Getulio Vargas [Rio de Janeiro] (FGV), Laboratoire de Probabilités, Statistiques et Modélisations (LPSM (UMR_8001)), and Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP)
- Subjects
model selection; Control and Optimization; Computer Networks and Communications; Computer science; Mathematics - Statistics Theory; TEORIA DOS GRAFOS; Statistics Theory (math.ST); Management Science and Operations Research; 01 natural sciences; 010104 statistics & probability; 03 medical and health sciences; Convergence (routing); FOS: Mathematics; Applied mathematics; Adjacency matrix; 0101 mathematics; Eigenvalues and eigenvectors; random graphs; 030304 developmental biology; Random graph; 0303 health sciences; convergence; Applied Mathematics; Cumulative distribution function; Model selection; [STAT.TH]Statistics [stat]/Statistics Theory [stat.TH]; Computational Mathematics; spectral density; Graph (abstract data type); model fitting; Random matrix
- Abstract
Random graph models are used to describe the complex structure of real-world networks in diverse fields of knowledge. Studying their behaviour and fitting properties are still critical challenges that, in general, require model-specific techniques. An important line of research is to develop generic methods able to fit and select the best model among a collection. Approaches based on spectral density (i.e. distribution of the graph adjacency matrix eigenvalues) appeal to that purpose: they apply to different random graph models. Also, they can benefit from the theoretical background of random matrix theory. This work investigates the convergence properties of model fitting procedures based on the graph spectral density and the corresponding cumulative distribution function. We also review the convergence of the spectral density for the most widely used random graph models. Moreover, we explore through simulations the limits of these graph spectral density convergence results, particularly in the case of the block model, where only partial results have been established.
- Published
- 2021
14. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph
- Author
Adelme Bazin, Stéphane Cruveiller, Claudine Médigue, David Vallenet, Guillaume Gautreau, Rémi Planel, Eduardo P. C. Rocha, Mathieu Gachet, Amandine Perrin, Mathieu Dubois, Christophe Ambroise, Alexandra Calteau, Catherine Matias, Laura Burlot, Analyse Bio-Informatique pour la Génomique et le Métabolisme (LABGeM), Génomique métabolique (UMR 8030), Genoscope - Centre national de séquençage [Evry] (GENOSCOPE), Université Paris-Saclay-Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Génomique évolutive des Microbes / Microbial Evolutionary Genomics, Institut Pasteur [Paris] (IP)-Centre National de la Recherche Scientifique (CNRS), Collège Doctoral, Sorbonne Université (SU), ED 515 - Complexité du vivant, Laboratoire de Probabilités, Statistique et Modélisation (LPSM (UMR_8001)), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université Paris Cité (UPCité), Laboratoire de Mathématiques et Modélisation d'Evry (LaMME), Université d'Évry-Val-d'Essonne (UEVE)-ENSIIE-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE), and ANR-11-INBS-0013,IFB (ex Renabi-IFB),Institut français de bioinformatique(2011)
This research was supported in part by the IRTELIS and Phare PhD programs of the French Alternative Energies and Atomic Energy Commission (CEA) for GG and AB respectively, the French Government 'Investissements d’Avenir' programs (namely FRANCE GENOMIQUE [ANR-10-INBS-09-08], the INSTITUT FRANÇAIS DE BIOINFORMATIQUE [ANR-11-INBS-0013], and the Agence Nationale de la Recherche [Projet ANR-16-CE12-29 for EPCR]).
We acknowledge Alexandre Renaux and Jonathan Mercier for their preliminary insights on pangenome graphs. We thank Mélanie Buy for drawing the PPanGGOLiN logo. Finally, we thank Guilhem Royer, Valentin Sabatet, Johan Rollin, Mohammed-Amin Madoui, Tom Delmont, Nicolas Pons and Pierre Peterlongo for all their advice along this work.
- Subjects
Computer science; [SDV]Life Sciences [q-bio]; 01 natural sciences; Genome; 010104 statistics & probability; Database and Informatics Methods; Biology (General); Genome Evolution; Genomic organization; Data Management; 0303 health sciences; Markov random field; Applied Mathematics; Simulation and Modeling; Genomics; Genomic Databases; Physical Sciences; Engineering and Technology; Synthetic Biology; Algorithms; Research Article; Computer and Information Sciences; QH301-705.5; Computational biology; Research and Analysis Methods; Molecular Evolution; 03 medical and health sciences; [SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN]; Genetics; Gene family; 0101 mathematics; Gene; Genome size; 030304 developmental biology; Taxonomy; Comparative genomics; Evolutionary Biology; Bacteria; Correction; Biology and Life Sciences; Computational Biology; 15. Life on land; Comparative Genomics; Synthetic Genomics; Genome Analysis; Biological Databases; Multivariate Analysis; Genome dynamics; Genome, Bacterial; Software; Mathematics
- Abstract
The use of comparative genomics for functional, evolutionary, and epidemiological studies requires methods to classify gene families in terms of occurrence in a given species. These methods usually lack multivariate statistical models to infer the partitions and the optimal number of classes and don’t account for genome organization. We introduce a graph structure to model pangenomes in which nodes represent gene families and edges represent genomic neighborhood. Our method, named PPanGGOLiN, partitions nodes using an Expectation-Maximization algorithm based on multivariate Bernoulli Mixture Model coupled with a Markov Random Field. This approach takes into account the topology of the graph and the presence/absence of genes in pangenomes to classify gene families into persistent, cloud, and one or several shell partitions. By analyzing the partitioned pangenome graphs of isolate genomes from 439 species and metagenome-assembled genomes from 78 species, we demonstrate that our method is effective in estimating the persistent genome. Interestingly, it shows that the shell genome is a key element to understand genome dynamics, presumably because it reflects how genes present at intermediate frequencies drive adaptation of species, and its proportion in genomes is independent of genome size. The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics. The software is freely available at https://github.com/labgem/PPanGGOLiN.
Author summary
Microorganisms have the greatest biodiversity and evolutionary history on earth. At the genomic level, it is reflected by a highly variable gene content even among organisms from the same species which explains the ability of microbes to be pathogenic or to grow in specific environments. We developed a new method called PPanGGOLiN which accurately represents the genomic diversity of a species (i.e. its pangenome) using a compact graph structure. Based on this pangenome graph, we classify genes by a statistical method according to their occurrence in the genomes. This method allowed us to build pangenomes even for uncultivated species at an unprecedented scale. We applied our method on all available genomes in databanks in order to depict the overall diversity of hundreds of species. Overall, our work enables microbiologists to explore and visualize pangenomes like a subway map.
- Published
- 2020
15. A semiparametric extension of the stochastic block model for longitudinal networks
- Author
Catherine Matias, Tabea Rebafka, and Fanny Villers
- Subjects
FOS: Computer and information sciences; Statistics and Probability; General Mathematics; 02 engineering and technology; 01 natural sciences; Methodology (stat.ME); 010104 statistics & probability; Stochastic block model; Histogram; Expectation–maximization algorithm; 0202 electrical engineering, electronic engineering, information engineering; 0101 mathematics; Statistics - Methodology; Mathematics; Applied Mathematics; Nonparametric statistics; Estimator; Extension (predicate logic); Agricultural and Biological Sciences (miscellaneous); Semiparametric model; ComputingMethodologies_PATTERNRECOGNITION; Kernel (statistics); 020201 artificial intelligence & image processing; Statistics, Probability and Uncertainty; General Agricultural and Biological Sciences; Algorithm
- Abstract
To model recurrent interaction events in continuous time, an extension of the stochastic block model is proposed where every individual belongs to a latent group and interactions between two individuals follow a conditional inhomogeneous Poisson process with intensity driven by the individuals' latent groups. The model is shown to be identifiable and its estimation is based on a semiparametric variational expectation-maximization algorithm. Two versions of the method are developed, using either a nonparametric histogram approach (with an adaptive choice of the partition size) or kernel intensity estimators. The number of latent groups can be selected by an integrated classification likelihood criterion. Finally, we demonstrate the performance of our procedure on synthetic experiments, analyse two datasets to illustrate the utility of our approach and comment on competing methods.
- Published
- 2018
16. SIMoNe: Statistical Inference for MOdular NEtworks.
- Author
Julien Chiquet, Alexander Smith, Gilles Grasseau, Catherine Matias, and Christophe Ambroise
- Published
- 2009
17. Consistency of the maximum likelihood and variational estimators in a dynamic stochastic block model
- Author
Léa Longepierre, Catherine Matias, Laboratoire de Probabilités, Statistiques et Modélisations (LPSM (UMR_8001)), Université Paris Diderot - Paris 7 (UPD7)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), ANR-18-CE02-0010,EcoNet,Modèles statistiques avancés pour les réseaux écologiques(2018), Laboratoire de Probabilités, Statistique et Modélisation (LPSM (UMR_8001)), and ANR-18-CE02-0010,ECONET,ADVANCED STATISTICAL MODELLING OF ECOLOGICAL NETWORKS(2018)
- Subjects
Statistics and Probability; Dynamic network analysis; Mathematics - Statistics Theory; 02 engineering and technology; 01 natural sciences; 010104 statistics & probability; [MATH.MATH-ST]Mathematics [math]/Statistics [math.ST]; Stochastic block model; Consistency (statistics); Convergence (routing); 0202 electrical engineering, electronic engineering, information engineering; Applied mathematics; 0101 mathematics; Hidden Markov model; Mathematics; dynamic stochastic block model; temporal network; Estimator; 020206 networking & telecommunications; Maximum likelihood estimation; Connection (mathematics); Bernoulli distribution; dynamic network; variational estimation; Statistics, Probability and Uncertainty; 62F12
- Abstract
We consider a dynamic version of the stochastic block model, in which the nodes are partitioned into latent classes and the connection between two nodes is drawn from a Bernoulli distribution depending on the classes of these two nodes. The temporal evolution is modeled through a hidden Markov chain on the nodes memberships. We prove the consistency (as the number of nodes and time steps increase) of the maximum likelihood and variational estimators of the model parameters, and obtain upper bounds on the rates of convergence of these estimators. We also explore the particular case where the number of time steps is fixed and connectivity parameters are allowed to vary.
- Published
- 2019
18. Exploring the Robustness of the Parsimonious Reconciliation Method in Host-Symbiont Cophylogeny
- Author
Blerina Sinaimeri, Laura Urbini, Marie-France Sagot, Catherine Matias, Equipe de recherche européenne en algorithmique et biologie formelle et expérimentale (ERABLE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria), Baobab, Département PEGASE [LBBE] (PEGASE), Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Probabilités, Statistique et Modélisation (LPSM (UMR_8001)), and Université Paris Diderot - Paris 7 (UPD7)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Scheme (programming language); Root (linguistics); Computer science; [SDV]Life Sciences [q-bio]; 0206 medical engineering; Context (language use); 02 engineering and technology; robustness; Machine learning; computer.software_genre; cophylogeny; Measure (mathematics); parsimony; Robustness (computer science); Genetics; measure for tree reconciliation comparison; event-based methods; computer.programming_language; business.industry; Applied Mathematics; Tree (data structure); Artificial intelligence; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; business; computer; Host (network); 020602 bioinformatics; Biotechnology; Cophylogeny
- Abstract
The aim of this paper is to explore the robustness of the parsimonious host-symbiont tree reconciliation method under editing or small perturbations of the input. The editing involves making different choices of unique symbiont mapping to a host in the case where multiple associations exist. This is made necessary by the fact that the tree reconciliation model is currently unable to handle such associations. The analysis performed could however also address the problem of errors. The perturbations are re-rootings of the symbiont tree to deal with a possibly wrong placement of the root, especially in the case of fast-evolving species. In order to do this robustness analysis, we introduce a simulation scheme specifically designed for the host-symbiont cophylogeny context, as well as a measure to compare sets of tree reconciliations, both of which are of interest by themselves.
- Published
- 2019
19. Statistical clustering of temporal networks through a dynamic stochastic block model
- Author
Vincent Miele, Catherine Matias, Laboratoire de Probabilités et Modèles Aléatoires (LPMA), Université Pierre et Marie Curie - Paris 6 (UPMC)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), and Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
FOS: Computer and information sciences; Statistics and Probability; Mathematical optimization; Theoretical computer science; 02 engineering and technology; 01 natural sciences; variational expectation maximization; Methodology (stat.ME); 010104 statistics & probability; Frequentist inference; Stochastic block model; 0202 electrical engineering, electronic engineering, information engineering; 0101 mathematics; [MATH]Mathematics [math]; Cluster analysis; contact network; Statistics - Methodology; Clustering coefficient; Mathematics; dynamic random graph; Markov chain; Model selection; stochastic block model; graph clustering; [STAT]Statistics [stat]; Discrete time and continuous time; Identifiability; 020201 artificial intelligence & image processing; Statistics, Probability and Uncertainty
- Abstract
Statistical node clustering in discrete time dynamic networks is an emerging field that raises many challenges. Here, we explore statistical properties and frequentist inference in a model that combines a stochastic block model for its static part with independent Markov chains for the evolution of the nodes groups through time. We model binary data as well as weighted dynamic random graphs (with discrete or continuous edges values). Our approach, motivated by the importance of controlling for label switching issues across the different time steps, focuses on detecting groups characterized by a stable within-group connectivity behaviour. We study identifiability of the model parameters and propose an inference procedure based on a variational expectation–maximization algorithm as well as a model selection criterion to select the number of groups. We carefully discuss our initialization strategy which plays an important role in the method and we compare our procedure with existing procedures on synthetic data sets. We also illustrate our approach on dynamic contact networks: one of encounters between high school students and two others on animal interactions. An implementation of the method is available as an R package called dynsbm.
- Published
- 2017
- Full Text
- View/download PDF
20. A time warping approach to multiple sequence alignment
- Author
-
Catherine Matias, Ana Arribas-Gil, Departamento de Estadistica, Universidad Carlos III de Madrid [Madrid] (UC3M), Laboratoire de Probabilités et Modèles Aléatoires (LPMA), and Université Pierre et Marie Curie - Paris 6 (UPMC)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
0301 basic medicine ,Statistics and Probability ,FOS: Computer and information sciences ,Warping ,Dynamic time warping ,Computer science ,Context (language use) ,Statistics - Applications ,Quantitative Biology - Quantitative Methods ,03 medical and health sciences ,Synchronization (computer science) ,Genetics ,Computer Simulation ,Applications (stat.AP) ,Image warping ,Molecular Biology ,Quantitative Methods (q-bio.QM) ,Alignment ,[STAT.AP]Statistics [stat]/Applications [stat.AP] ,Multiple sequence alignment ,Base Sequence ,Functional data analysis ,Quantitative Biology::Genomics ,[SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM] ,Computational Mathematics ,Quantitative Biology::Quantitative Methods ,030104 developmental biology ,FOS: Biological sciences ,Path (graph theory) ,Pairwise comparison ,Algorithm ,Sequence Alignment ,Algorithms ,Software - Abstract
We propose an approach for multiple sequence alignment (MSA) derived from the dynamic time warping viewpoint and recent techniques of curve synchronization developed in the context of functional data analysis. Starting from pairwise alignments of all the sequences (viewed as paths in a certain space), we construct a median path that represents the MSA we are looking for. We establish a proof of concept that our method could be an interesting ingredient to include in refined MSA techniques. We present a simple synthetic experiment as well as the study of a benchmark dataset, together with comparisons with two widely used MSA software packages.
- Published
- 2017
- Full Text
- View/download PDF
21. Modeling heterogeneity in random graphs through latent space models: a selective review
- Author
-
Stéphane Robin, Catherine Matias, Laboratoire de Mathématiques et Modélisation d'Evry (LaMME), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-ENSIIE-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Probabilités et Modèles Aléatoires (LPMA), Université Pierre et Marie Curie - Paris 6 (UPMC)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), Mathématiques et Informatique Appliquées (MIA-Paris), AgroParisTech-Institut National de la Recherche Agronomique (INRA), Laboratoire de Mathématiques et Modélisation d'Evry, Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Institut National de la Recherche Agronomique (INRA) - Université d'Evry-Val d'Essonne - ENSIIE - Centre National de la Recherche Scientifique (CNRS), Université Pierre et Marie Curie - Paris 6 (UPMC) - Université Paris Diderot - Paris 7 (UP7) - Centre National de la Recherche Scientifique (CNRS), Institut National de la Recherche Agronomique (INRA) - AgroParisTech, INRA, Institut National de la Recherche Agronomique (INRA), and Institut National de la Recherche Agronomique (INRA)-AgroParisTech
- Subjects
FOS: Computer and information sciences ,Méthodologie ,Theoretical computer science ,Computer science ,Stochastic block model ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,Space (mathematics) ,Methodology (stat.ME) ,Graph clustering ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,FOS: Mathematics ,QA1-939 ,Statistiques (Mathématiques) ,[MATH.MATH-ST] Mathematics [math]/Statistics [math.ST] ,Statistics - Methodology ,Random graphs ,Clustering coefficient ,Block (data storage) ,Random graph ,T57-57.97 ,[STAT.ME] Statistics [stat]/Methodology [stat.ME] ,Applied mathematics. Quantitative methods ,[STAT.TH] Statistics [stat]/Statistics Theory [stat.TH] ,Methodology ,Probabilistic logic ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,[STAT.ME]Statistics [stat]/Methodology [stat.ME] ,Mathematics - Abstract
We present a selective review on probabilistic modeling of heterogeneity in random graphs. We focus on latent space models and more particularly on stochastic block models and their extensions that have undergone major developments in the last five years.
- Published
- 2014
- Full Text
- View/download PDF
22. On Efficient Estimators of the Proportion of True Null Hypotheses in a Multiple Testing Setup
- Author
-
Van Hanh Nguyen and Catherine Matias
- Subjects
Statistics and Probability ,0303 health sciences ,Uniform distribution (continuous) ,Lebesgue measure ,Null (mathematics) ,Nonparametric statistics ,Estimator ,01 natural sciences ,Semiparametric model ,010104 statistics & probability ,03 medical and health sciences ,Delta method ,Statistics ,Applied mathematics ,0101 mathematics ,Statistics, Probability and Uncertainty ,030304 developmental biology ,Parametric statistics ,Mathematics - Abstract
We consider the problem of estimating the proportion $\theta$ of true null hypotheses in a multiple testing context. The setup is classically modeled through a semiparametric mixture with two components: a uniform distribution on the interval $[0,1]$ with prior probability $\theta$ and a nonparametric density $f$. We discuss asymptotic efficiency results and establish that two different cases occur, depending on whether or not $f$ vanishes on a set with non-null Lebesgue measure. In the first case, we exhibit estimators converging at the parametric rate, compute the optimal asymptotic variance and conjecture that no estimator is asymptotically efficient (i.e. attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at the parametric rate. We illustrate those results on simulated data.
- Published
- 2014
- Full Text
- View/download PDF
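The two-component mixture described in the abstract of entry 22 can be written out explicitly. The block below is only a sketch in the abstract's own notation ($\theta$, $f$); the symbol $g$ for the marginal density of the p-values is introduced here for illustration and is not taken from the paper.

```latex
% g: illustrative name (assumption) for the marginal density of the p-values.
% theta: proportion of true null hypotheses; f: nonparametric alternative density.
g(x) \;=\; \theta \,\mathbf{1}_{[0,1]}(x) \;+\; (1-\theta)\, f(x),
\qquad x \in [0,1].
```

Under this form, if $f$ vanishes on a set $A$ of positive Lebesgue measure, then $g(x) = \theta$ for $x \in A$, which is one plausible intuition for why the two cases in the abstract behave differently.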
23. Parameter Estimation in Pair-hidden Markov Models
- Author
-
Catherine Matias, Ana Arribas-Gil, and Elisabeth Gassiat
- Subjects
Statistics and Probability ,Formalism (philosophy of mathematics) ,Kullback–Leibler divergence ,Estimation theory ,FOS: Mathematics ,Applied mathematics ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,62F12, 62M09, 62B10 ,Statistics, Probability and Uncertainty ,Parameter space ,Hidden Markov model ,Mathematics - Abstract
This paper deals with parameter estimation in pair hidden Markov models (pair-HMMs). We first provide a rigorous formalism for these models and discuss possible definitions of likelihoods. The model being biologically motivated, some restrictions with respect to the full parameter space naturally occur. The existence of two different information divergence rates is established, and the divergence property (namely positivity at values different from the true one) is shown under additional assumptions. This yields consistency for the parameter in parametrization schemes for which the divergence property holds. Simulations illustrate different cases which are not covered by our results.
- Published
- 2006
- Full Text
- View/download PDF
24. Hidden Markov model for parameter estimation of a random walk in a Markov environment
- Author
-
Catherine Matias, Pierre Andreoletti, Dasha Loukianova, Mathématiques - Analyse, Probabilités, Modélisation - Orléans (MAPMO), Université d'Orléans (UO)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Mathématiques et Modélisation d'Evry (LaMME), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-ENSIIE-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Probabilités et Modèles Aléatoires (LPMA), Université Pierre et Marie Curie - Paris 6 (UPMC)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS) - Université d'Orléans (UO), Institut National de la Recherche Agronomique (INRA) - Université d'Evry-Val d'Essonne - ENSIIE - Centre National de la Recherche Scientifique (CNRS), Université Pierre et Marie Curie - Paris 6 (UPMC) - Université Paris Diderot - Paris 7 (UP7) - Centre National de la Recherche Scientifique (CNRS), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS)-Université d'Orléans (UO), Laboratoire de Mathématiques et Modélisation d'Evry, and Centre National de la Recherche Scientifique (CNRS)-Université Paris Diderot - Paris 7 (UPD7)-Université Pierre et Marie Curie - Paris 6 (UPMC)
- Subjects
Statistics and Probability ,Hidden Markov model ,Random walk in random environment ,Markov kernel ,Markov chain ,62M05, 62F12, 60J25 ,[STAT.TH] Statistics [stat]/Statistics Theory [stat.TH] ,Variable-order Markov model ,[MATH] Mathematics [math] ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Markov model ,Maximum likelihood estimation ,Markov renewal process ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Statistics ,Markov property ,Hidden semi-Markov model ,[MATH]Mathematics [math] ,Algorithm ,[MATH.MATH-ST] Mathematics [math]/Statistics [math.ST] ,Mathematics - Abstract
We focus on the parametric estimation of the distribution of a Markov environment from the observation of a single trajectory of a one-dimensional nearest-neighbor path evolving in this random environment. In the ballistic case, as the length of the path increases, we prove consistency, asymptotic normality and efficiency of the maximum likelihood estimator. Our contribution is two-fold: we cast the problem as one of parameter estimation in a hidden Markov model (HMM) and establish that the bivariate Markov chain underlying this HMM is positive Harris recurrent. We provide different examples of setups in which our results apply, in particular that of the DNA unzipping model, and we give a simple synthetic experiment to illustrate those results.
- Published
- 2015
- Full Text
- View/download PDF
25. Semiparametric deconvolution with unknown noise variance
- Author
-
Catherine Matias
- Subjects
Statistics and Probability ,Pointwise ,Noise (signal processing) ,Estimator ,Density estimation ,Convolution ,Moment (mathematics) ,symbols.namesake ,Rate of convergence ,Gaussian noise ,Statistics ,symbols ,Applied mathematics ,Mathematics - Abstract
This paper deals with semiparametric convolution models, where the noise sequence has a centered Gaussian distribution with unknown variance. Non-parametric convolution models are concerned with the case of an entirely known distribution for the noise sequence, and they have been widely studied in the past decade. The main property of those models is the following: the more regular the distribution of the noise is, the worse the rate of convergence for the estimation of the signal's density g is [5]. Nevertheless, regularity assumptions on the signal density g improve those rates of convergence [15]. In this paper, we show that when the noise (assumed to be centered Gaussian) has a variance σ² that is unknown (which is always the case in practical applications), the rates of convergence for the estimation of g are seriously deteriorated, whatever regularity is assumed for g. More precisely, the minimax risk for the pointwise estimation of g over a class of regular densities is lower bounded by a constant over log n. We construct two estimators of σ², and more particularly, an estimator which is consistent as soon as the signal has a finite first order moment. We also mention as a consequence the deterioration of the rate of convergence in the estimation of the parameters in the nonlinear errors-in-variables model.
- Published
- 2002
- Full Text
- View/download PDF
26. Propriétés asymptotiques de l'estimateur de maximum de vraisemblance pour des modèles de Markov cachés généraux
- Author
-
Randal Douc and Catherine Matias
- Subjects
Estimation theory ,Maximum likelihood ,Ergodicity ,Asymptotic distribution ,Applied mathematics ,Probability distribution ,General Medicine ,Statistical theory ,Markov model ,Likelihood function ,Mathematics - Abstract
We establish the consistency and asymptotic normality of the maximum likelihood estimator in a non-stationary hidden Markov model, when the state space of the chain is separable and compact. We prove these convergence results under weak regularity and identifiability assumptions on the model, using the exponential forgetting of the prediction filter and the geometric ergodicity of a well-chosen extended chain.
- Published
- 2000
- Full Text
- View/download PDF
27. Maximum likelihood estimator consistency for a ballistic random walk in a parametric random environment
- Author
-
Oleg Loukianov, Mikael Falconnet, Francis Comets, Catherine Matias, Dasha Loukianova, Université Paris Diderot - Paris 7 (UPD7), Université d'Évry-Val-d'Essonne (UEVE), Laboratoire Analyse et Probabilités (LAE), Université Paris-Est (COMUE), Partenaires INRAE, Institut National de la Recherche Agronomique (INRA), Laboratoire Analyse et Probabilités, Université d'Évry-Val-d'Essonne (UEVE)-PRES Universud Paris-Fédération de Mathématiques d'Evry Val d'Essonne, Laboratoire de Probabilités et Modèles Aléatoires (LPMA), Université Pierre et Marie Curie - Paris 6 (UPMC)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Statistique et Génome (SG), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Analyse et de Mathématiques Appliquées (LAMA), Université Paris-Est Marne-la-Vallée (UPEM)-Fédération de Recherche Bézout-Université Paris-Est Créteil Val-de-Marne - Paris 12 (UPEC UP12)-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS)-Université Paris Diderot - Paris 7 (UPD7)-Université Pierre et Marie Curie - Paris 6 (UPMC), and Centre National de la Recherche Scientifique (CNRS)-Université Paris-Est Créteil Val-de-Marne - Paris 12 (UPEC UP12)-Fédération de Recherche Bézout-Université Paris-Est Marne-la-Vallée (UPEM)
- Subjects
Statistics and Probability ,Mathematical optimization ,Multivariate random variable ,AMS 2000 subject classification: Primary 62M05, 62F12 ,secondary 60J25 ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,Ballistic regime ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,FOS: Mathematics ,Statistical physics ,Mathematics ,multitype branching-processe ,Random walk in random environment ,Heterogeneous random walk in one dimension ,Random field ,Estimation theory ,Applied Mathematics ,Loop-erased random walk ,Random function ,random walk in random environment ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Branching process in random environment ,Random walk ,Maximum likelihood estimation ,Random variate ,Modeling and Simulation ,immigration - Abstract
We consider a one-dimensional ballistic random walk evolving in an i.i.d. parametric random environment. We provide a maximum likelihood estimation procedure of the parameters based on a single observation of the path until the time it reaches a distant site, and prove that the estimator is consistent as the distant site tends to infinity. Our main tool consists in using the link between random walks and branching processes in random environments and explicitly characterising the limiting distribution of the process that arises. We also explore the numerical performance of our estimation procedure.
- Published
- 2014
- Full Text
- View/download PDF
28. Asymptotic normality and efficiency of the maximum likelihood estimator for the parameter of a ballistic random walk in a random environment
- Author
-
Dasha Loukianova, Catherine Matias, Mikael Falconnet, Laboratoire Statistique et Génome (SG), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Analyse et Probabilités (LAE), Université d'Évry-Val-d'Essonne (UEVE), Laboratoire Analyse et Probabilités, and Université d'Évry-Val-d'Essonne (UEVE)-PRES Universud Paris-Fédération de Mathématiques d'Evry Val d'Essonne
- Subjects
Statistics and Probability ,Independent and identically distributed random variables ,Ballistic random walk ,media_common.quotation_subject ,Maximum likelihood ,Asymptotic distribution ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,Cramér-Rao efficiency ,Random walk in random environment ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Consistent estimator ,FOS: Mathematics ,Asymptotic normality ,Applied mathematics ,Parametric statistics ,Mathematics ,media_common ,Confidence regions ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Maximum likelihood estimation ,Random walk ,Infinity ,MSC 2000 : Primary 62M05, 62F12 ,secondary 60J25 ,Path (graph theory) ,Statistics, Probability and Uncertainty - Abstract
We consider a one-dimensional ballistic random walk evolving in a parametric independent and identically distributed random environment. We study the asymptotic properties of the maximum likelihood estimator of the parameter based on a single observation of the path until the time it reaches a distant site. We prove an asymptotic normality result for this consistent estimator as the distant site tends to infinity and establish that it achieves the Cramér-Rao bound. We also explore in a simulation setting the numerical behaviour of asymptotic confidence regions for the parameter value.
- Published
- 2013
- Full Text
- View/download PDF
29. Nonparametric estimation of the density of the alternative hypothesis in a multiple testing setup. Application to local false discovery rate estimation
- Author
-
Catherine Matias, Van Hanh Nguyen, Université Paris-Sud - Paris 11 (UP11), Laboratoire de Mathématiques et Modélisation d'Evry, Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Statistique et Génome, Université d'Évry-Val-d'Essonne (UEVE), Laboratoire Statistique et Génome (SG), Laboratoire de Mathématiques d'Orsay (LM-Orsay), Université Paris-Sud - Paris 11 (UP11)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Mathématiques et Modélisation d'Evry (LaMME), and Nguyen, Van Hanh
- Subjects
FOS: Computer and information sciences ,Statistics and Probability ,False discovery rate ,[SDV]Life Sciences [q-bio] ,Kernel density estimation ,$p$-values ,local false discovery rate ,01 natural sciences ,Statistics - Applications ,010104 statistics & probability ,03 medical and health sciences ,62G07 ,kernel estimation ,Applied mathematics ,Applications (stat.AP) ,maximum smoothed likelihood ,0101 mathematics ,030304 developmental biology ,Mathematics ,Pointwise ,semiparametric mixture model ,[STAT.AP]Statistics [stat]/Applications [stat.AP] ,0303 health sciences ,multiple testing ,Nonparametric statistics ,Estimator ,p-values ,false discovery rate ,Mixture model ,Rate of convergence ,Multiple comparisons problem - Abstract
In a multiple testing context, we consider a semiparametric mixture model with two components where one component is known and corresponds to the distribution of p-values under the null hypothesis and the other component f is nonparametric and stands for the distribution under the alternative hypothesis. Motivated by the issue of local false discovery rate estimation, we focus here on the estimation of the nonparametric unknown component f in the mixture, relying on a preliminary estimator of the unknown proportion of true null hypotheses. We propose and study the asymptotic properties of two different estimators for this unknown component. The first estimator is a randomly weighted kernel estimator. We establish an upper bound for its pointwise quadratic risk, exhibiting the classical nonparametric rate of convergence over a class of Hölder densities. To our knowledge, this is the first result establishing convergence as well as the corresponding rate for the estimation of the unknown component in this nonparametric mixture. The second estimator is a maximum smoothed likelihood estimator. It is computed through an iterative algorithm, for which we establish a descent property. In addition, these estimators are used in a multiple testing procedure in order to estimate the local false discovery rate. Their respective performances are then compared on synthetic data.
- Published
- 2012
- Full Text
- View/download PDF
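For readers unfamiliar with the quantity targeted in entry 29, the local false discovery rate can be sketched as below. This uses the standard textbook definition with uniform null p-values; the abstract itself does not give the formula, so the symbols $\theta$ (proportion of true nulls, as in entry 22) and $\ell\mathrm{FDR}$ are assumptions made for illustration.

```latex
% Standard definition, assumed here; not quoted from the paper.
% theta: proportion of true null hypotheses (null p-values are uniform, density 1).
% f: density of the p-values under the alternative hypothesis.
\ell\mathrm{FDR}(x)
  \;=\; \mathbb{P}(H_0 \text{ true} \mid P = x)
  \;=\; \frac{\theta}{\theta + (1-\theta)\, f(x)},
\qquad x \in [0,1].
```

Plugging estimators of $\theta$ and $f$ into this ratio gives an estimate of the local false discovery rate, which is how an estimator of $f$ would enter a multiple testing procedure under this reading.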
30. A context dependent pair hidden Markov model for statistical alignment
- Author
-
Ana Arribas-Gil, Catherine Matias, Departamento de Estadistica, Universidad Carlos III de Madrid [Madrid], Laboratoire Statistique et Génome (SG), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Universidad Carlos III de Madrid [Madrid] (UC3M), Spanish Ministry of Science and Innovation [ECO2008-05080, HI2008-0069, JC2010-0057], and Region of Madrid, Spain [CCG10-UC3M/HUM-5114]
- Subjects
Statistics and Probability ,Mutation rate ,Computer science ,Mathematics - Statistics Theory ,Context (language use) ,Sequence alignment ,Statistics Theory (math.ST) ,Stochastic approximation ,DNA sequence alignment ,Quantitative Biology - Quantitative Methods ,Stochastic expectation maximization algorithm ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Probabilistic alignment ,Expectation–maximization algorithm ,Insertion deletion model ,FOS: Mathematics ,Genetics ,Hidden Markov model ,EM algorithm ,Molecular Biology ,Quantitative Methods (q-bio.QM) ,Models, Statistical ,Contextual alignment ,Pair hidden Markov model ,Statistical alignment ,Base Sequence ,Markov chain ,Comparative genomics ,Sequence evolution ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Process substitution ,Quantitative Biology::Genomics ,Markov Chains ,Computational Mathematics ,FOS: Biological sciences ,Sequence Alignment ,Algorithm - Abstract
This article proposes a novel approach to statistical alignment of nucleotide sequences by introducing a context-dependent structure on the substitution process in the underlying evolutionary model. We propose to estimate alignments and context-dependent mutation rates relying on the observation of two homologous sequences. The procedure is based on a generalized pair-hidden Markov structure, where conditional on the alignment path, the nucleotide sequences follow a Markov distribution. We use a stochastic approximation expectation maximization (SAEM) algorithm to give accurate estimators of parameters and alignments. We provide results both on simulated data and vertebrate genomes, which are known to have a high mutation rate from the CG dinucleotide. In particular, we establish that the method improves the accuracy of the alignment of a human pseudogene and its functional gene.
- Published
- 2012
- Full Text
- View/download PDF
31. New consistent and asymptotically normal parameter estimates for random-graph mixture models
- Author
-
Catherine Matias, Christophe Ambroise, Laboratoire Statistique et Génome (SG), and Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistics and Probability ,Random graph ,Mixture model ,Mathematical optimization ,Estimation theory ,Stochastic block model ,Estimator ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Composite likelihood ,Moment (mathematics) ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Expectation–maximization algorithm ,Maximum a posteriori estimation ,Applied mathematics ,Statistics, Probability and Uncertainty ,Mathematics - Abstract
Random-graph mixture models are very popular for modelling real data networks. Parameter estimation procedures usually rely on variational approximations, either combined with the expectation–maximization (EM) algorithm or with Bayesian approaches. Despite good results on synthetic data, the validity of the variational approximation is, however, not established. Moreover, these variational approaches aim at approximating the maximum likelihood or the maximum a posteriori estimators, whose behaviour in an asymptotic framework (as the sample size increases to ∞) remains unknown for these models. In this work, we show that, in many different affiliation contexts (for binary or weighted graphs), parameter estimators based either on moment equations or on the maximization of some composite likelihood are strongly consistent and √n convergent, when the number n of nodes increases to ∞. As a consequence, our result establishes that the overall structure of an affiliation model can be (asymptotically) caught by the description of the network in terms of its number of triads (order 3 structures) and edges (order 2 structures). Moreover, these parameter estimates are either explicit (as for the moment estimators) or may be approximated by using a simple EM algorithm, whose convergence properties are known. We illustrate the efficiency of our method on simulated data and compare its performance with other existing procedures. A data set of cross-citations among economics journals is also analysed.
- Published
- 2012
- Full Text
- View/download PDF
32. Parameter identifiability in a class of random graph mixture models
- Author
-
Elizabeth S. Allman, Catherine Matias, John A. Rhodes, Department of Mathematics and Statistics [Fairbanks], University of Alaska [Fairbanks] (UAF), Laboratoire Statistique et Génome (SG), and Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistics and Probability ,050402 sociology ,62E10, 62F99 ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,Random graph ,01 natural sciences ,Unobservable ,010104 statistics & probability ,0504 sociology ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,FOS: Mathematics ,Identifiability ,0101 mathematics ,Mathematics ,Parametric statistics ,Discrete mathematics ,Mixture model ,Estimation theory ,Applied Mathematics ,05 social sciences ,Stochastic blockmodel ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,16. Peace & justice ,Graph (abstract data type) ,Statistics, Probability and Uncertainty ,Random variable - Abstract
We prove identifiability of parameters for a broad class of random graph mixture models. These models are characterized by a partition of the set of graph nodes into latent (unobservable) groups. The connectivities between nodes are independent random variables when conditioned on the groups of the nodes being connected. In the binary random graph case, in which edges are either present or absent, these models are known as stochastic blockmodels and have been widely used in the social sciences and, more recently, in biology. Their generalizations to weighted random graphs, either in parametric or non-parametric form, are also of interest in many areas. Despite a broad range of applications, the parameter identifiability issue for such models is involved, and previously has only been touched upon in the literature. We give here a thorough investigation of this problem. Our work also has consequences for parameter estimation. In particular, the estimation procedure proposed by Frank and Harary for binary affiliation models is revisited in this article.
- Published
- 2011
- Full Text
- View/download PDF
33. SIMoNe: Statistical Inference for MOdular NEtworks
- Author
-
Adam Alexander T. Smith, Julien Chiquet, G. Grasseau, Catherine Matias, Christophe Ambroise, Laboratoire Statistique et Génome (SG), and Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistics and Probability ,Theoretical computer science ,Computer science ,Gaussian ,Gene regulatory network ,Inference ,MESH: Algorithms ,computer.software_genre ,01 natural sciences ,Biochemistry ,010104 statistics & probability ,03 medical and health sciences ,symbols.namesake ,Matrix (mathematics) ,Databases ,MESH: Gene Expression Profiling ,MESH: Software ,Genetic ,MESH: Computer Simulation ,[SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN] ,Databases, Genetic ,Statistical inference ,Gene Regulatory Networks ,Computer Simulation ,Graphical model ,0101 mathematics ,Molecular Biology ,Partial correlation ,MESH: Databases, Genetic ,030304 developmental biology ,MESH: Gene Regulatory Networks ,0303 health sciences ,[STAT.AP]Statistics [stat]/Applications [stat.AP] ,business.industry ,Gene Expression Profiling ,Modular design ,[SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM] ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,symbols ,Data mining ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,business ,computer ,Algorithms ,Software - Abstract
Summary: The R package SIMoNe (Statistical Inference for MOdular NEtworks) enables inference of gene-regulatory networks based on partial correlation coefficients from microarray experiments. Modelling gene expression data with a Gaussian graphical model (hereafter GGM), the algorithm estimates non-zero entries of the concentration matrix, in a sparse and possibly high-dimensional setting. Its originality lies in the fact that it searches for a latent modular structure to drive the inference procedure through adaptive penalization of the concentration matrix. Availability: Under the GNU General Public Licence at http://cran.r-project.org/web/packages/simone/ Contact: julien.chiquet@genopole.cnrs.fr
- Published
- 2009
- Full Text
- View/download PDF
34. Inferring sparse Gaussian graphical models with latent structure
- Author
-
Julien Chiquet, Christophe Ambroise, Catherine Matias, Laboratoire Statistique et Génome (SG), and Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
FOS: Computer and information sciences ,Statistics and Probability ,Méthodologie ,Computer science ,Gaussian ,Inference ,Topology (electrical circuits) ,Model selection ,Statistics - Applications ,01 natural sciences ,Methodology (stat.ME) ,010104 statistics & probability ,03 medical and health sciences ,symbols.namesake ,Matrix (mathematics) ,62J07 ,62H20, 62J07, 62H30 ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Applications (stat.AP) ,Graphical model ,0101 mathematics ,Latent structure ,EM algorithm ,Statistics - Methodology ,62H20 ,030304 developmental biology ,Mixture model ,[STAT.AP]Statistics [stat]/Applications [stat.AP] ,0303 health sciences ,Methodology ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Graph ,$\ell_1$-penalization ,ComputingMethodologies_PATTERNRECOGNITION ,Gaussian graphical model ,ℓ_1-penalization ,Hidden variable theory ,Applications ,symbols ,Graph (abstract data type) ,Statistics, Probability and Uncertainty ,Variational inference ,[STAT.ME]Statistics [stat]/Methodology [stat.ME] ,Algorithm ,62H30 - Abstract
Our concern is selecting the concentration matrix's nonzero coefficients for a sparse Gaussian graphical model in a high-dimensional setting. This corresponds to estimating the graph of conditional dependencies between the variables. We describe a novel framework taking into account a latent structure on the concentration matrix. This latent structure is used to drive a penalty matrix and thus to recover a graphical model with a constrained topology. Our method uses an $\ell_1$-penalized likelihood criterion. Inference of the graph of conditional dependencies between the variates and of the hidden variables is performed simultaneously in an iterative EM-like algorithm. The performance of our method is illustrated on synthetic as well as real data, the latter concerning breast cancer. Comment: 35 pages, 15 figures
- Published
- 2009
- Full Text
- View/download PDF
35. Identifiability of parameters in latent structure models with many observed variables
- Author
-
Catherine Matias, Elizabeth S. Allman, John A. Rhodes, Department of Mathematics and Statistics [Fairbanks], University of Alaska [Fairbanks] (UAF), Laboratoire Statistique et Génome (SG), and Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistics and Probability ,Contingency table ,62E10 ,Finite mixture ,62F99, 62G99 ,62E10 (Primary), 62F99, 62G99 (Secondary) ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,Markov model ,01 natural sciences ,010104 statistics & probability ,Latent structure ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Conditional independence ,FOS: Mathematics ,State space ,Applied mathematics ,Identifiability ,0101 mathematics ,Parametric statistics ,Mathematics ,Random graph ,Algebraic statistics ,010102 general mathematics ,Nonparametric statistics ,Nonparametric mixture ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Mixture model ,Multivariate Bernoulli mixture ,Statistics, Probability and Uncertainty ,62G99 ,62F99 - Abstract
While hidden class models of various types arise in many statistical applications, it is often difficult to establish the identifiability of their parameters. Focusing on models in which there is some structure of independence of some of the observed variables conditioned on hidden ones, we demonstrate a general approach for establishing identifiability utilizing algebraic arguments. A theorem of J. Kruskal for a simple latent-class model with finite state space lies at the core of our results, though we apply it to a diverse set of models. These include mixtures of both finite and nonparametric product distributions, hidden Markov models and random graph mixture models, and lead to a number of new results and improvements to old ones. In the parametric setting, this approach indicates that for such models, the classical definition of identifiability is typically too strong. Instead generic identifiability holds, which implies that the set of nonidentifiable parameters has measure zero, so that parameter inference is still meaningful. In particular, this sheds light on the properties of finite mixtures of Bernoulli products, which have been used for decades despite being known to have nonidentifiable parameters. In the nonparametric setting, we again obtain identifiability only when certain restrictions are placed on the distributions that are mixed, but we explicitly describe the conditions. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/09-AOS689
- Published
- 2009
- Full Text
- View/download PDF
36. Adaptive goodness-of-fit testing from indirect observations
- Author
-
Christophe Pouet, Cristina Butucea, Catherine Matias, Laboratoire Paul Painlevé - UMR 8524 (LPP), Université de Lille-Centre National de la Recherche Scientifique (CNRS), Laboratoire Statistique et Génome (SG), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Analyse, Topologie, Probabilités (LATP), Université Paul Cézanne - Aix-Marseille 3-Université de Provence - Aix-Marseille 1-Centre National de la Recherche Scientifique (CNRS), and Laboratoire Paul Painlevé (LPP)
- Subjects
Statistics and Probability ,62F12, 62G05, 62G10, 62G20 ,Context (language use) ,Type (model theory) ,Quadratic functional estimation ,01 natural sciences ,Convolution ,Combinatorics ,010104 statistics & probability ,Probability theory ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,0502 economics and business ,Calculus ,62G05 ,0101 mathematics ,62G20 ,Partially known noise ,050205 econometrics ,Mathematics ,Goodness-of-fit tests ,Infinitely differentiable functions ,Smoothness (probability theory) ,Sobolev classes ,05 social sciences ,Null (mathematics) ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Adaptive nonparametric tests ,Distribution (mathematics) ,Stable laws ,Statistics, Probability and Uncertainty ,62F12 ,Random variable ,Convolution model ,62G10 - Abstract
In a convolution model, we observe random variables whose distribution is the convolution of some unknown density $f$ and some known noise density $g$. We assume that $g$ is polynomially smooth. We provide goodness-of-fit testing procedures for the test $H_0: f=f_0$, where the alternative $H_1$ is expressed with respect to $\mathbb{L}_{2}$-norm (i.e. has the form $\psi_{n}^{-2}\|f-f_{0}\|_{2}^{2}\ge \mathcal{C}$). Our procedure is adaptive with respect to the unknown smoothness parameter $\tau$ of $f$. Different testing rates $(\psi_n)$ are obtained according to whether $f_0$ is polynomially or exponentially smooth. A price for adaptation is noted and for computing this, we provide a non-uniform Berry–Esseen type theorem for degenerate U-statistics. In the case of polynomially smooth $f_0$, we prove that the price for adaptation is optimal. We emphasise the fact that the alternative may contain functions smoother than the null density to be tested, which is new in the context of goodness-of-fit tests.
- Published
- 2009
- Full Text
- View/download PDF
37. Number of hidden states and memory: a joint order estimation problem for Markov chains with Markov regime
- Author
-
Catherine Matias, Antoine Chambaz, Mathématiques Appliquées Paris 5 (MAP5 - UMR 8145), Université Paris Descartes - Paris 5 (UPD5)-Institut National des Sciences Mathématiques et de leurs Interactions (INSMI)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Statistique et Génome (SG), and Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistics and Probability ,Markov kernel ,Markov chain ,Variable-order Markov model ,Markov process ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,Markov model ,[ STAT.TH ] Statistics [stat]/Statistics Theory [stat.TH] ,symbols.namesake ,Markov renewal process ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Econometrics ,symbols ,Applied mathematics ,STAT:TH ,Markov property ,Hidden semi-Markov model ,[ MATH.MATH-ST ] Mathematics [math]/Statistics [math.ST] ,Statistiques (Mathématiques) ,Mathematics - Abstract
This paper deals with order identification for Markov chains with Markov regime (MCMR) in the context of finite alphabets. We define the joint order of an MCMR process in terms of the number k of states of the hidden Markov chain and the memory m of the conditional Markov chain. We study the properties of penalized maximum likelihood estimators for the unknown order (k, m) of an observed MCMR process, relying on information theoretic arguments. The novelty of our work lies in the joint estimation of two structural parameters. Furthermore, the different models in competition are not nested. In an asymptotic framework, we prove that a penalized maximum likelihood estimator is strongly consistent without prior bounds on k and m. We complement our theoretical work with a simulation study of its behaviour. We also study numerically the behaviour of the BIC criterion. A theoretical proof of its consistency seems to us presently out of reach for MCMR, as such a result does not yet exist in the simpler case where m = 0 (that is, for hidden Markov models).
- Published
- 2009
- Full Text
- View/download PDF
38. Adaptivity in convolution models with partially known noise distribution
- Author
-
Catherine Matias, Cristina Butucea, Christophe Pouet, Laboratoire Paul Painlevé - UMR 8524 (LPP), Université de Lille-Centre National de la Recherche Scientifique (CNRS), Laboratoire Statistique et Génome (SG), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Analyse, Topologie, Probabilités (LATP), Université Paul Cézanne - Aix-Marseille 3-Université de Provence - Aix-Marseille 1-Centre National de la Recherche Scientifique (CNRS), and Laboratoire Paul Painlevé (LPP)
- Subjects
62F12, 62G05 (Primary) 62G10, 62G20 (Secondary) ,Statistics and Probability ,quadratic functional estimation ,Mathematics - Statistics Theory ,Context (language use) ,Adaptive nonparametric tests ,Convolution model ,Goodness-of-fit tests ,Infinitely differentiable functions ,Partially known noise ,Quadratic functional estimation ,Sobolev classes ,Stable laws ,Statistics Theory (math.ST) ,01 natural sciences ,Convolution ,Combinatorics ,010104 statistics & probability ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,0502 economics and business ,FOS: Mathematics ,convolution model ,partially known noise ,62G05 ,Statistiques (Mathématiques) ,0101 mathematics ,62G20 ,050205 econometrics ,Mathematics ,goodness-of-fit tests ,Smoothness (probability theory) ,05 social sciences ,stable laws ,Order (ring theory) ,Estimator ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,infinitely differentiable functions ,Primary Subjects: 62F12, 62G05 Secondary Subjects: 62G10, 62G20 ,Distribution (mathematics) ,STAT:TH ,Identifiability ,Statistics, Probability and Uncertainty ,62F12 ,Random variable ,62G10 - Abstract
We consider a semiparametric convolution model. We observe random variables having a distribution given by the convolution of some unknown density $f$ and some partially known noise density $g$. In this work, $g$ is assumed exponentially smooth with stable law having unknown self-similarity index $s$. In order to ensure identifiability of the model, we restrict our attention to polynomially smooth, Sobolev-type densities $f$, with smoothness parameter $\beta$. In this context, we first provide a consistent estimation procedure for $s$. This estimator is then plugged into three different procedures: estimation of the unknown density $f$, of the functional $\int f^2$ and goodness-of-fit test of the hypothesis $H_0:f=f_0$, where the alternative $H_1$ is expressed with respect to $\mathbb {L}_2$-norm (i.e. has the form $\psi_n^{-2}\|f-f_0\|_2^2\ge \mathcal{C}$). These procedures are adaptive with respect to both $s$ and $\beta$ and attain the rates which are known to be optimal for known values of $s$ and $\beta$. As a by-product, when the noise density is known and exponentially smooth, our testing procedure is optimal adaptive for testing Sobolev-type densities. The estimating procedure of $s$ is illustrated on synthetic data. Comment: Published in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/08-EJS225
- Published
- 2008
- Full Text
- View/download PDF
39. Minimax estimation of the noise level and of the deconvolution density in a semiparametric convolution model
- Author
-
Catherine Matias, Cristina Butucea, Laboratoire de Probabilités et Modèles Aléatoires (LPMA), Université Pierre et Marie Curie - Paris 6 (UPMC)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), and Benassù, Serena
- Subjects
Statistics and Probability ,Mathematical optimization ,[MATH.MATH-PR] Mathematics [math]/Probability [math.PR] ,analytic densities ,Kernel density estimation ,noise level ,deconvolution ,01 natural sciences ,Convolution ,semiparametric model ,010104 statistics & probability ,symbols.namesake ,0502 economics and business ,Consistent estimator ,Applied mathematics ,0101 mathematics ,L_2 risk ,050205 econometrics ,Mathematics ,Smoothness (probability theory) ,Sobolev classes ,05 social sciences ,supersmooth densities ,Semiparametric model ,[MATH.MATH-PR]Mathematics [math]/Probability [math.PR] ,Noise ,Fourier transform ,symbols ,pointwise risk ,Deconvolution ,minimax estimation - Abstract
We consider a semiparametric convolution model where the noise has known Fourier transform which decays asymptotically as an exponential with unknown scale parameter; the deconvolution density is less smooth than the noise in the sense that the tails of the Fourier transform decay more slowly, ensuring the identifiability of the model. We construct a consistent estimation procedure for the noise level and prove that its rate is optimal in the minimax sense. Two convergence rates are distinguished according to different smoothness properties for the unknown density. If the tail of its Fourier transform does not decay faster than exponentially, the asymptotic optimal rate and exact constant are evaluated, while if it does not decay faster than polynomially, this rate is evaluated up to a constant. Moreover, we construct a consistent estimator of the unknown density, by using a plug-in method in the classical kernel estimation procedure. We establish that the rates of estimation of the deconvolution density are slower than in the case of an entirely known noise distribution. In fact, nonparametric rates of convergence are equal to the rate of estimation of the noise level, and we prove that these rates are minimax. In a few specific cases the plug-in method converges at even slower rates.
- Published
- 2005
40. Asymptotics of the maximum likelihood estimator for general hidden Markov models
- Author
-
Catherine Matias and Randal Douc
- Subjects
Statistics and Probability ,Mathematical optimization ,Markov kernel ,Markov chain ,consistency ,hidden Markov models ,Maximum-entropy Markov model ,Variable-order Markov model ,asymptotic normality ,geometric ergodicity ,maximum likelihood estimation ,Markov model ,identifiability ,Expectation–maximization algorithm ,Applied mathematics ,Markov property ,Hidden semi-Markov model ,Mathematics - Abstract
In this paper, we consider the consistency and asymptotic normality of the maximum likelihood estimator for a possibly non-stationary hidden Markov model where the hidden state space is a separable and compact space not necessarily finite, and both the transition kernel of the hidden chain and the conditional distribution of the observations depend on a parameter θ. For identifiable models, consistency and asymptotic normality of the maximum likelihood estimator are shown to follow from exponential memorylessness properties of the state prediction filter and geometric ergodicity of suitably extended Markov chains.
- Published
- 2001
41. Estimation par maximum de vraisemblance dans des modèles à blocs stochastiques dynamiques ou spatiaux
- Author
-
Longepierre, Léa, Laboratoire de Probabilités, Statistiques et Modélisations (LPSM (UMR_8001)), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP), Sorbonne Université, and Catherine Matias
- Subjects
Variational estimation ,Dynamic network ,Markov random field ,Temporal network ,Stochastic block model ,Champs aléatoires de Markov ,Graphes dynamiques ,Maximum likelihood estimation ,Modèle à blocs stochastiques ,Consistance ,Spatial network ,Estimateur variationnel ,Graphes spatiaux ,Estimation par maximum de vraisemblance ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Potts model - Abstract
This thesis deals with maximum likelihood estimation in dynamic and spatial extensions of the stochastic block model (SBM), based respectively on hidden Markov chains and fields. First, we consider a dynamic version of the stochastic block model, suited for the observation of networks at multiple time steps. In this dynamic SBM, the nodes are partitioned into latent classes and the connection between two nodes is drawn from a Bernoulli distribution depending on the classes of these two nodes. The temporal evolution of the nodes memberships is modeled by a hidden Markov chain. We prove the consistency (as the numbers of nodes and time steps increase) of the maximum likelihood and variational estimators of the parameters and obtain upper bounds on their rates of convergence. We also explore the case where the number of time steps is fixed and the connectivity parameters are allowed to vary. Besides, we obtain some results regarding parameter identifiability. Second, we introduce a spatial version of the SBM, suited for the observation of networks at different spatial locations. As before, the nodes are partitioned into latent classes and the connection is drawn from a Bernoulli distribution depending on the classes of these two nodes. The spatial evolution of the nodes memberships is modeled through hidden Markov random fields. We first prove that the parameter is generically identifiable under certain conditions. For the estimation of the parameters, we propose an algorithm based on the simulated field Expectation-Maximisation (EM) algorithm, a variation of the EM algorithm relying on a mean field like approximation based on the simulation of latent configurations.; Cette thèse porte sur le maximum de vraisemblance dans des versions dynamiques et spatiales du modèle à blocs stochastiques (SBM) fondées respectivement sur des chaînes et champs de Markov cachés. 
D’abord, on considère une version dynamique du SBM adaptée à l’observation de réseaux à différents pas de temps. Dans ce modèle, les nœuds sont répartis dans des groupes latents et la connexion entre deux nœuds suit une loi de Bernoulli dont le paramètre dépend du groupe de ces nœuds. L’évolution temporelle des appartenances aux groupes est modélisée par une chaîne de Markov cachée. On prouve la consistance (lorsque les nombres de nœuds et pas de temps augmentent) des estimateurs du maximum de vraisemblance et variationnels des paramètres, et on obtient des bornes supérieures pour leur taux de convergence. On explore aussi le cas où le nombre de pas de temps est fixé et les probabilités de connexion varient dans le temps. On obtient également des résultats concernant l’identifiabilité des paramètres. Ensuite, on introduit une version spatiale du SBM adaptée à l’observation de réseaux à différentes localisations. Les nœuds sont répartis dans des groupes latents et la connexion entre deux nœuds suit une loi de Bernoulli dont le paramètre dépend du groupe de ces nœuds. L’évolution spatiale des appartenances aux groupes des nœuds est modélisée par des champs de Markov cachés. On montre que le paramètre est génériquement identifiable sous certaines conditions. Pour l’estimation des paramètres, on propose d’adapter à notre modèle une variante de l’algorithme Espérance-Maximisation (EM) reposant sur une approximation de type champ moyen grâce à la simulation de configurations latentes.
- Published
- 2020
42. Distribution of the local score to highlight atypical segments inside sequences
- Author
-
Mercier, Sabine, Institut de Mathématiques de Toulouse UMR5219 (IMT), Université Toulouse Capitole (UT Capitole), Université de Toulouse (UT)-Université de Toulouse (UT)-Institut National des Sciences Appliquées - Toulouse (INSA Toulouse), Institut National des Sciences Appliquées (INSA)-Université de Toulouse (UT)-Institut National des Sciences Appliquées (INSA)-Université Toulouse - Jean Jaurès (UT2J), Université de Toulouse (UT)-Université Toulouse III - Paul Sabatier (UT3), Université de Toulouse (UT)-Centre National de la Recherche Scientifique (CNRS), Université Toulouse III Paul Sabatier (UT3 Paul Sabatier), Catherine Matias, Institut National des Sciences Appliquées - Toulouse (INSA Toulouse), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université Toulouse 1 Capitole (UT1), Université Fédérale Toulouse Midi-Pyrénées-Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse - Jean Jaurès (UT2J)-Université Toulouse III - Paul Sabatier (UT3), and Université Fédérale Toulouse Midi-Pyrénées-Centre National de la Recherche Scientifique (CNRS)
- Subjects
[MATH.MATH-PR]Mathematics [math]/Probability [math.PR] ,Analyse de séquences ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,p-valeur ,Sequence analysis ,p-value ,Local score ,Score local - Published
- 2018
43. Distribution du score local pour la détection de régions atypiques au sein de séquences
- Author
-
Mercier, Sabine, Institut de Mathématiques de Toulouse UMR5219 (IMT), Institut National des Sciences Appliquées - Toulouse (INSA Toulouse), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université Toulouse 1 Capitole (UT1), Université Fédérale Toulouse Midi-Pyrénées-Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse - Jean Jaurès (UT2J)-Université Toulouse III - Paul Sabatier (UT3), Université Fédérale Toulouse Midi-Pyrénées-Centre National de la Recherche Scientifique (CNRS), Université Toulouse III Paul Sabatier (UT3 Paul Sabatier), and Catherine Matias
- Subjects
[MATH.MATH-PR]Mathematics [math]/Probability [math.PR] ,Analyse de séquences ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,p-valeur ,Sequence analysis ,p-value ,Local score ,Score local - Published
- 2018
44. Contribution à l'analyse Bayésienne de modèles à variables latentes
- Author
-
Latouche, Pierre, Statistique, Analyse et Modélisation Multidisciplinaire (SAmos-Marin Mersenne) (SAMM), Université Paris 1 Panthéon-Sorbonne (UP1), Université Paris 1 - Panthéon Sorbonne, Catherine Matias, and Latouche, Pierre
- Subjects
optimisation ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Bayesian inference ,sélection de modèles ,Model selection ,[MATH.MATH-ST] Mathematics [math]/Statistics [math.ST] ,Inférence ,méthodes variationnelles - Published
- 2017
45. Models and algorithms to study the common evolutionary history of hosts and symbionts
- Author
-
Urbini, Laura, Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS), Université de Lyon, Marie-France Sagot, Catherine Matias, and Blerina Sinaimeri
- Subjects
Event-based methods ,Robustness ,Cophylogeny ,Spread ,Cophylogénie ,Parcimonie ,Systèmes hôtes/symbiotes ,Calcul approximatif Bayésien ,Robustesse ,Méthodes basées sur des événements ,Host/symbiont system ,Approximate Bayesian computation ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,Parsimony ,Measure for tree reconciliation comparison ,Mesure pour la comparaison de réconciliations entre arbres - Abstract
In this Ph.D. work, we proposed models and algorithms to study the common evolutionary history of hosts and symbionts. The first goal was to analyse the robustness of phylogenetic tree reconciliation methods, which are a common way of performing such studies. This involves mapping one tree, most often the symbiont’s, to the other using a so-called event-based model. The events generally considered are cospeciation, duplication, host switch, and loss. The host and symbiont phylogenies are usually considered as given and free of errors. The objective here was to understand the strengths and weaknesses of the parsimonious model used in such mappings of one tree to another, and how the final results may be influenced when small errors are present in, or introduced into, the input datasets. Such errors may correspond either to a wrong choice of present-day symbiont-host associations in the case where multiple ones exist, or to a wrong rooting of the symbiont tree. Our results show that the choice of leaf associations and of root placement may have a strong impact on the variability of the reconciliation output. We also noticed that the host switch event plays an important role, in particular for the rooting problem. The second goal of this Ph.D. was to introduce some events that are little or not formally considered in the literature. One of them is the spread, which corresponds to the invasion of different hosts by the same symbiont. In this case, as when spreads are not considered, the optimal reconciliations obtained will depend on the choice made for the costs of the events. The need to develop statistical methods to assign the most appropriate costs therefore remains relevant. Two types of spread are introduced: vertical and horizontal. 
The first case corresponds to what could be called also a freeze in the sense that the evolution of the symbiont “freezes” while the symbiont continues to be associated with a host and with the new species that descend from this host. The second includes both an invasion, of the symbiont which remains with the initial host but at the same time gets associated with (“invades”) another one incomparable with the first, and a freeze, actually a double freeze as the evolution of the symbiont “freezes” in relation to the evolution of the host to which it was initially associated and in relation to the evolution of the second one it “invaded”. Our results show that the introduction of these events makes the model more realistic, but also that it is now possible to directly use datasets with a symbiont that is associated with more than one host at the same time, which was not feasible before; Lors de cette thèse, je me suis intéressée aux modèles et aux algorithmes pour étudier l'histoire évolutive commune des hôtes et des symbiotes. Le premier objectif était d'analyser la robustesse des méthodes de réconciliation des arbres phylogénétiques, qui sont très utilisées dans ce type d'étude. Celles-ci associent (ou lient) un arbre, d'habitude celui des symbiotes, à l'autre, en utilisant un modèle dit basé sur des évènements. Les évènements les plus utilisés sont la cospéciation, la duplication, le saut et la perte. Les phylogénies des hôtes et des symbiotes sont généralement considérés comme donnés, et sans aucune erreur. L'objectif était de comprendre les forces et les faiblesses du modèle parcimonieux utilisé et comprendre comment les résultats finaux peuvent être influencés en présence de petites perturbations ou d'erreurs dans les données en entrée. 
Ici deux cas sont considérés, le premier est le choix erroné d'une association entre les feuilles des hôtes et des symbiotes dans le cas où plusieurs existent, le deuxième est lié au mauvais choix de l'enracinement de l'arbre des symbiotes. Nos résultats montrent que le choix des associations entre feuilles et le choix de l'enracinement peuvent avoir un fort impact sur la variabilité de la réconciliation obtenue. Nous avons également remarqué que l'evènement appelé “saut” joue un rôle important dans l'étude de la robustesse, surtout pour le problème de l'enracinement. Le deuxième objectif de cette thèse était d'introduire certains evènements peu ou pas formellement considérés dans la littérature. L'un d'entre eux est la “propagation”, qui correspond à l'invasion de différents hôtes par un même symbiote. Dans ce cas, lorsque les propagations ne sont pas considérés, les réconciliations optimales sont obtenues en tenant compte seulement des coûts des évènements classiques (cospeciation, duplication, saut, perte). La nécessité de développer des méthodes statistiques pour assigner les coûts les plus appropriés est toujours d'actualité. Deux types de propagations sont introduites : verticaux et horizontaux. Le premier type correspond à ce qu'on pourrait appeler aussi un gel, à savoir que l'évolution du symbiote s'arrête et “gèle” alors que le symbiote continue d'être associé à un hôte et aux nouvelles espèces qui descendent de cet hôte. Le second comprend à la fois une invasion, du symbiote qui reste associé à l'hôte initial, mais qui en même temps s'associe (“envahit”) un autre hôte incomparable avec le premier, et un gel par rapport à l'évolution des deux l'hôtes, celui auquel il était associé au début et celui qu'il a envahi. 
Nos résultats montrent que l'introduction de ces evènements rend le modèle plus réaliste, mais aussi que désormais il est possible d'utiliser directement des jeux de données avec un symbiote qui est associé plusieurs hôtes au même temps, ce qui n'était pas faisable auparavant
- Published
- 2017
46. Modélisation semi-markovienne de la perte d'autonomie chez les personnes âgées : application à l'assurance dépendance
- Author
-
Biessy, Guillaume, Laboratoire de Mathématiques et Modélisation d'Evry (LaMME), Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-ENSIIE-Centre National de la Recherche Scientifique (CNRS), Thèse CIFRE, Université Paris-Saclay, Université d'Evry Val d'Essonne, Catherine Matias, Laboratoire de Mathématiques et Modélisation d'Evry, Institut National de la Recherche Agronomique (INRA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), and Biessy, Guillaume
- Subjects
mixture model ,[STAT.AP]Statistics [stat]/Applications [stat.AP] ,[STAT.AP] Statistics [stat]/Applications [stat.AP] ,Processus semi-markovien ,local likelihood ,censored data ,données censurées ,vraisemblance locale ,modèle de mélange ,semi-Markov process ,assurance dépendance ,Long-Term-Care Insurance - Abstract
Alongside the increase in life expectancy observed in developed countries since the beginning of the 20th century, numerous challenges arise for modern societies. Among them is the loss of autonomy in elderly people, also known as Long-Term Care (LTC). Long-term care may be defined as a state of incapacity to perform autonomously part of the Activities of Daily Living (ADL). In most cases, long-term care is caused by one or several pathologies linked to aging. Disabled people therefore require help provided by a relative or professional caregiver or may even need to enter a nursing home. In France, a public aid called the Allocation Personnalisée pour l’Autonomie (APA), literally customized aid for autonomy, aims at covering the expenses caused by long-term care. Nevertheless, the amount of benefit is relatively small compared with those expenses. Therefore, many insurers have designed products dedicated to complementing the public aid. In order to price those products and monitor the risk, insurers need to model the long-term care process. In most cases, one relies on multi-state modeling with states autonomy, death and one or several levels of LTC. To predict the risk one has to assess the transition probabilities between states. Under the Markov assumption, those probabilities are considered to depend only on the current state. As regards the study of LTC, this assumption may be seen as too restrictive to account for the complexity of the underlying risk. In a semi-Markov framework, those probabilities may also depend on the time spent in the current state. In this thesis, we emphasize the necessity of semi-Markov modeling. We demonstrate the impact of time spent in LTC on death probabilities. 
Besides, we exhibit that taking into account the diversityinduced by pathologies leads to sizable improvements in the fit of the model to experience data.Furthermore, we highlight that the peculiar shape taken by death probabilities as a function oftime spent in LTC may be explained by the mixture of pathology groups among the disabledpopulation.The first chapter of this thesis provides an introduction of the long-term care risk anddifferent tools to quantify it. The second chapter focuses on death probabilities among disabled,using the APA database. In the third chapter, we introduce a fully parametric approach toestimate transition probabilities in a model with a single state of LTC, relying on data froman insurance portfolio. Lastly, the fourth chapter study transition probabilities for 4 distinctgroups of pathologies: cancer, neurological diseases, dementia and other causes. This validatesthe empirical results obtained in the previous chapters., L’allongement de l’espérance de vie observé depuis le début du 20esiècle dans les paysindustrialisés pose un certain nombre de défis aux sociétés modernes. Parmi eux celui de laperte d’autonomie chez les personnes âgées, connue également sous le nom de dépendance. Ladépendance des personnes âgées se définit comme un état d’incapacité à effectuer seul toutou partie des Actes de la Vie Quotidienne (AVQ). La dépendance apparaît dans la grandemajorité des cas sous l’effet d’une ou plusieurs pathologies chroniques liées au vieillissement.Les personnes concernées ont alors besoin de l’assistance d’une tierce personne, un proche ou unaidant professionnel ou même d’intégrer un Établissement d’Hébergement pour Personnes ÂgéesDépendants (EHPAD) dans les cas les plus sévères. 
En France, une aide publique, l’AllocationPersonnalisée pour l’Autonomie (APA), est destinée à couvrir les frais liés à la dépendance.Cependant, le montant des prestations accordées demeure faible devant les dépenses engendrées.Aussi, de nombreux assureurs ont développé des produits spécifiques destinés à compléter l’aidepublique fournie par l’APA.Afin de fixer les prix de ces produits et d’assurer le suivi du risque, les assureurs ont besoinde modéliser le processus de dépendance. Cette modélisation passe dans la majorité des cas parune représentation multi-états du processus dont les états sont l’autonomie, le décès ainsi qu’unou plusieurs niveaux de dépendance. Pour prédire le risque il est alors nécessaire d’estimer lesprobabilités de transition entre ces états. Sous l’hypothèse de Markov, on considère que cesprobabilités de transition dépendent uniquement de l’état actuel. En ce qui concerne l’étude durisque de dépendance, cette hypothèse peut paraître trop restrictive pour rendre compte de lacomplexité du phénomène étudié. Dans le cadre semi-markovien, plus général, les probabilitésde transition peuvent également dépendre du temps passé dans l’état actuel. Au cours decette thèse, nous nous attachons à montrer la nécessité d’une modélisation semi-markoviennedu processus. Nous mettons ainsi en évidence l’impact du temps passé en dépendance surles probabilités de décès. Nous montrons par ailleurs que la prise en compte de la diversitéinduite par les pathologies permet d’améliorer sensiblement l’adéquation du modèle proposéaux données étudiées. Plus encore, nous établissons que la forme particulière de la probabilitéde décès en fonction du temps passé en dépendance peut être expliquée par le mélange desgroupes de pathologies qui constituent la population des individus dépendants.Le premier chapitre de la thèse propose une introduction du sujet et des principales méthodesutilisées pour sa quantification. 
Le deuxième chapitre est consacré à l’étude des probabilitésde transitions pour les individus déjà dépendants sur la base de données publiques de l’APA.Dans le troisième chapitre, nous introduisons une démarche entièrement paramétrique pourl’estimation des probabilités de transition dans un modèle avec un seul niveau de dépendancesur la base de données de portefeuilles. Nous prenons notamment en compte le rôle du mélangedes groupes de pathologies, quand bien même celles-ci ne sont pas observées. Enfin, lequatrième chapitre est consacré à l’étude des probabilités de transition associées à 4 groupesde pathologies : cancer, maladies neurologiques, démence et autres causes. Cette étude permetainsi de valider les résultats empiriques établis au cours des chapitres précédents.
- Published
- 2016
47. Modèles de mélange semi-paramétriques et applications aux tests multiples
- Author
-
Nguyen, Van Hanh, Laboratoire de Mathématiques d'Orsay (LM-Orsay), Université Paris-Sud - Paris 11 (UP11)-Centre National de la Recherche Scientifique (CNRS), Université Paris Sud - Paris XI, Elisabeth Gassiat, and Catherine Matias
- Subjects
Semi-parametric ,Estimateurs par histogramme ,Modèles de mélange ,[MATH.MATH-GM]Mathematics [math]/General Mathematics [math.GM] ,Tests multiple ,False discovery rate ,Histogram based estimators ,Multiple testing ,Kernel estimators ,Estimateurs à noyau ,Semi-paramétrique ,Mixture models - Abstract
In a multiple testing context, we consider a semiparametric mixture model with two components. One component is assumed to be known and corresponds to the distribution of p-values under the null hypothesis, with prior probability p. The other component f is nonparametric and stands for the distribution of p-values under the alternative hypothesis. The problem of estimating the parameters p and f of the model arises in false discovery rate (FDR) control procedures. In the first part of this dissertation, we study the estimation of the proportion p. We discuss asymptotic efficiency results and establish that two different cases occur, depending on whether or not f vanishes on a non-empty interval. In the first case, where f vanishes on an interval, we exhibit estimators converging at the parametric rate, compute the optimal asymptotic variance and conjecture that no estimator is asymptotically efficient (i.e. attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at the parametric rate. In the second part of the dissertation, we focus on the estimation of the unknown nonparametric component f in the mixture, relying on a preliminary estimator of p. We propose two different estimators for this unknown component and study their asymptotic properties. The first estimator is a randomly weighted kernel estimator. We establish an upper bound for its pointwise quadratic risk, exhibiting the classical nonparametric rate of convergence over a class of Hölder densities. The second estimator is a maximum smoothed likelihood estimator. It is computed through an iterative algorithm, for which we establish a descent property. In addition, these estimators are used in a multiple testing procedure in order to estimate the local false discovery rate (lfdr).
- Published
- 2013
48. Semi-parametric mixture models and applications to multiple testing
- Author
-
Nguyen, Van Hanh, STAR, ABES, Laboratoire de Mathématiques d'Orsay (LM-Orsay), Université Paris-Sud - Paris 11 (UP11)-Centre National de la Recherche Scientifique (CNRS), Université Paris Sud - Paris XI, Elisabeth Gassiat, and Catherine Matias
- Subjects
Semi-parametric ,Modèles de mélange ,Estimateurs par histogramme ,[MATH.MATH-GM]Mathematics [math]/General Mathematics [math.GM] ,Tests multiple ,False discovery rate ,Histogram based estimators ,Multiple testing ,Kernel estimators ,Estimateurs à noyau ,Semi-paramétrique ,Mixture models - Abstract
In a multiple testing context, we consider a semiparametric mixture model with two components. One component is assumed to be known and corresponds to the distribution of p-values under the null hypothesis, with prior probability p. The other component f is nonparametric and stands for the distribution of p-values under the alternative hypothesis. The problem of estimating the parameters p and f of the model arises in false discovery rate (FDR) control procedures. In the first part of this dissertation, we study the estimation of the proportion p. We discuss asymptotic efficiency results and establish that two different cases occur, depending on whether or not f vanishes on a non-empty interval. In the first case, where f vanishes on an interval, we exhibit estimators converging at the parametric rate, compute the optimal asymptotic variance and conjecture that no estimator is asymptotically efficient (i.e. attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at the parametric rate. In the second part of the dissertation, we focus on the estimation of the unknown nonparametric component f in the mixture, relying on a preliminary estimator of p. We propose two different estimators for this unknown component and study their asymptotic properties. The first estimator is a randomly weighted kernel estimator. We establish an upper bound for its pointwise quadratic risk, exhibiting the classical nonparametric rate of convergence over a class of Hölder densities. The second estimator is a maximum smoothed likelihood estimator. It is computed through an iterative algorithm, for which we establish a descent property. In addition, these estimators are used in a multiple testing procedure in order to estimate the local false discovery rate (lfdr).
- Published
- 2013