76 results for '"Camerlenghi, F."'
Search Results
2. Contaminated Gibbs-Type Priors
- Author
-
Camerlenghi, Federico, Corradin, Riccardo, and Ongaro, Andrea
- Abstract
Gibbs-type priors are combinatorial processes widely used as key components in several Bayesian nonparametric models. By virtue of their flexibility and mathematical tractability, they have become the predominant priors in species sampling problems and mixture modeling. We introduce a new family of processes which extends the Gibbs-type one, by including a contaminant component in the model to account for an excess of observations with frequency one. We first investigate the induced random partition, the associated predictive distribution, and the asymptotic behavior of the total number of blocks and of the number of blocks with a given frequency: all the results we obtain are in closed form and easily interpretable. A remarkable aspect of contaminated Gibbs-type priors lies in their predictive structure, compared with that of the standard Gibbs-type family: it depends on the additional sampling information given by the number of observations with frequency one in the observed sample. As a noteworthy example we focus on the contaminated version of the Pitman-Yor process, which turns out to be analytically tractable and computationally feasible. Finally we pinpoint the advantage of our construction in different applications: we show how it helps to improve predictive inference in a species-related dataset exhibiting a high number of species with frequency one; we also discuss the use of the proposed construction in mixture models to perform density estimation and outlier detection.
- Published
- 2023
3. A Common Atoms Model for the Bayesian Nonparametric Analysis of Nested Data
- Author
-
Denti, Francesco, Camerlenghi, Federico, Guindani, Michele, and Mira, Antonietta
- Abstract
The use of large datasets for targeted therapeutic interventions requires new ways to characterize the heterogeneity observed across subgroups of a specific population. In particular, models for partially exchangeable data are needed for inference on nested datasets, where the observations are assumed to be organized in different units and some sharing of information is required to learn distinctive features of the units. In this manuscript, we propose a nested common atoms model (CAM) that is particularly suited for the analysis of nested datasets where the distributions of the units are expected to differ only over a small fraction of the observations sampled from each unit. The proposed CAM allows a two-layered clustering at the distributional and observational level and is amenable to scalable posterior inference through the use of a computationally efficient nested slice sampler algorithm. We further discuss how to extend the proposed modeling framework to handle discrete measurements, and we conduct posterior inference on a real microbiome dataset from a diet swap study to investigate how the alterations in intestinal microbiota composition are associated with different eating habits. We further investigate the performance of our model in capturing true distributional structures in the population by means of a simulation study.
- Published
- 2023
4. On the estimation of the mean density of random closed sets
- Author
-
Camerlenghi, F., Capasso, V., and Villa, E.
- Published
- 2014
- Full Text
- View/download PDF
5. On Johnson’s “sufficientness” postulates for feature-sampling models
- Author
-
Camerlenghi, F. and Favaro, S.
- Abstract
In the 1920s, the English philosopher W.E. Johnson introduced a characterization of the symmetric Dirichlet prior distribution in terms of its predictive distribution. This is typically referred to as Johnson’s “sufficientness” postulate, and it has been the subject of many contributions in Bayesian statistics, leading to predictive characterization for infinite-dimensional generalizations of the Dirichlet distribution, i.e., species-sampling models. In this paper, we review “sufficientness” postulates for species-sampling models, and then investigate analogous predictive characterizations for the more general feature-sampling models. In particular, we present a “sufficientness” postulate for a class of feature-sampling models referred to as Scaled Processes (SPs), and then discuss analogous characterizations in the general setup of feature-sampling models.
- Published
- 2021
6. Nonparametric Bayesian multiarmed bandits for single-cell experiment design
- Author
-
Camerlenghi, F., Dumitrascu, B., Ferrari, F., Engelhardt, B. E., and Favaro, S.
- Abstract
The problem of maximizing cell type discovery under budget constraints is a fundamental challenge for the collection and analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper we introduce a simple, computationally efficient and scalable Bayesian nonparametric sequential approach to optimize the budget allocation when designing a large-scale experiment for the collection of scRNA-seq data for the purpose of, but not limited to, creating cell atlases. Our approach relies on the following tools: (i) a hierarchical Pitman–Yor prior that recapitulates biological assumptions regarding cellular differentiation, and (ii) a Thompson sampling multiarmed bandit strategy that balances exploitation and exploration to prioritize experiments across a sequence of trials. Posterior inference is performed by using a sequential Monte Carlo approach which allows us to fully exploit the sequential nature of our species sampling problem. We empirically show that our approach outperforms state-of-the-art methods and achieves near-oracle performance on simulated and scRNA-seq data alike.
- Published
- 2020
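The Thompson sampling strategy mentioned in the abstract above can be illustrated with a minimal sketch. This toy version uses independent Beta-Bernoulli arms rather than the paper's hierarchical Pitman–Yor model, and the function and parameter names are illustrative, not the authors' own:

```python
import random

def thompson_sampling(arms, n_trials, seed=0):
    """Toy Thompson sampling with Beta-Bernoulli arms.

    `arms` lists the (unknown) success probabilities; each round we draw
    from every arm's Beta posterior and pull the arm whose draw is
    largest, which balances exploration and exploitation.
    """
    rng = random.Random(seed)
    # Beta(1, 1) priors: one (successes + 1, failures + 1) pair per arm
    params = [[1, 1] for _ in arms]
    rewards = 0
    for _ in range(n_trials):
        draws = [rng.betavariate(a, b) for a, b in params]
        k = draws.index(max(draws))             # arm with largest posterior draw
        r = 1 if rng.random() < arms[k] else 0  # observe a Bernoulli reward
        params[k][0] += r
        params[k][1] += 1 - r
        rewards += r
    return rewards
```

Over many trials the posterior of the better arm concentrates, so pulls increasingly favor it; the paper replaces the independent Beta posteriors with a hierarchical species-sampling model and sequential Monte Carlo inference.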
7. An information theoretic approach to post randomization methods under differential privacy
- Author
-
Ayed, F., Battiston, M., and Camerlenghi, F.
- Abstract
Post randomization methods are among the most popular disclosure limitation techniques for both categorical and continuous data. In the categorical case, given a stochastic matrix M and a specified variable, an individual belonging to category i is changed to category j with probability M_{i,j}. Every approach to choosing the randomization matrix M has to balance two desiderata: (1) preserving as much statistical information from the raw data as possible; (2) guaranteeing the privacy of individuals in the dataset. This trade-off has generally been shown to be very challenging to solve. In this work, we use recent tools from the computer science literature and propose to choose M as the solution of a constrained maximization problem: we maximize the mutual information between raw and transformed data, subject to the constraint that the transformation satisfies differential privacy. For the general categorical model, we show that this maximization problem reduces to a linear program and can therefore be solved with known optimization algorithms.
- Published
- 2020
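The post-randomization mechanism the abstract describes (a record in category i is moved to category j with probability M_{i,j}) can be sketched as follows. This is only a toy illustration of applying a given matrix M, not the paper's mutual-information-optimal choice of M; the function name is made up:

```python
import random

def pram(categories, M, seed=0):
    """Toy post-randomization (PRAM): a record with category i is moved
    to category j with probability M[i][j]. Each row of M must sum to 1."""
    rng = random.Random(seed)
    out = []
    for i in categories:
        u, acc = rng.random(), 0.0
        for j, p in enumerate(M[i]):
            acc += p
            if u < acc:
                out.append(j)
                break
        else:
            out.append(len(M[i]) - 1)  # guard against floating-point round-off
    return out
```

With M equal to the identity matrix the data pass through unchanged; off-diagonal mass injects the noise that provides the privacy guarantee.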
8. Scaled Process Priors for Bayesian Nonparametric Estimation of the Unseen Genetic Variation
- Author
-
Camerlenghi, F., Favaro, S., Masoero, L., and Broderick, T.
- Abstract
There is a growing interest in the estimation of the number of unseen features, mostly driven by biological applications. A recent work brought out a peculiar property of the popular completely random measures (CRMs) as prior models in Bayesian nonparametric (BNP) inference for the unseen-features problem: for fixed prior parameters, they all lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. CRMs are thus not a flexible prior model for the unseen-features problem and, while the Poisson posterior distribution may be appealing for analytical tractability and ease of interpretability, its independence from the sampling information makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of the unknown prior parameters. In this article, we introduce the stable-Beta scaled process (SB-SP) prior, and we show that it allows us to enrich the posterior distribution of the number of unseen features arising under CRM priors, while maintaining its analytical tractability and interpretability. That is, the SB-SP prior leads to a negative binomial posterior distribution, which depends on the sampling information through the sample size and the number of distinct features, with corresponding estimates being simple, linear in the sampling information and computationally efficient. We apply our BNP approach to synthetic data and to real cancer genomic data, showing that: (i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; (ii) it provides improved coverage for the estimation with respect to a BNP approach under CRM priors. Supplementary materials for this article are available online.
- Published
- 2022
9. Clustering artists based on the energy distributions of their songs on Spotify via the Common Atoms Model
- Author
-
Balzanella, A., Bini, M., Cavicchia, C., Verde, R., Denti, Francesco, Camerlenghi, Federico, Guindani, Michele, and Mira, Antonietta
- Abstract
Partially exchangeable datasets are characterized by observations grouped into known, heterogeneous units. The recently developed Common Atoms Model (CAM) is a Bayesian nonparametric technique suited for analyzing this type of data. CAM induces a two-layered clustering structure: one across observations and another across units. In particular, the units are clustered according to their distributional similarities. In this article, we illustrate the versatility of CAM with an application to an openly available Spotify dataset. The dataset contains quantitative audio features for a large number of songs grouped by artists. After describing the data preprocessing steps, we employ CAM to group the Spotify artists according to the distributions of the energy of their songs.
- Published
- 2022
10. More for less: predicting and maximizing genomic variant discovery via Bayesian nonparametrics
- Author
-
Masoero, Lorenzo, Camerlenghi, Federico, Favaro, Stefano, and Broderick, Tamara
- Abstract
While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains nontrivial. Under a fixed budget, scientists face a natural trade-off between quantity and quality: spending resources to sequence a greater number of genomes or spending resources to sequence genomes with increased accuracy. Our goal is to find the optimal allocation of resources between quantity and quality. Optimizing resource allocation promises to reveal as many new variations in the genome as possible. We introduce a Bayesian nonparametric methodology to predict the number of new variants in a follow-up study based on a pilot study. When experimental conditions are kept constant between the pilot and follow-up, we find that our prediction is competitive with the best existing methods. Unlike current methods, though, our new method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for more realistic predictions and for optimal allocation of a fixed budget between quality and quantity. We validate our method on cancer and human genomics data.
- Published
- 2022
11. On the convex combination of a Dirichlet process with a diffuse probability measure
- Author
-
Camerlenghi, F., Corradin, R., Ongaro, A., Perna, C., Salvati, N., and Schirripa Spagnolo, F.
- Subjects
Dirichlet process, Bayesian nonparametrics, Hapax legomena, Unique values, Convex combination, SECS-S/01 - STATISTICA
- Published
- 2021
12. Bayesian nonparametric prediction: from species to features
- Author
-
Masoero, L., Camerlenghi, F., Favaro, S., Broderick, T., Perna, C., Salvati, N., and Schirripa Spagnolo, F.
- Subjects
SECS-S/01 - STATISTICA, Prediction, exchangeability, species models, feature models, Indian Buffet process
- Published
- 2021
13. On the convex combination of a Dirichlet process with a diffuse probability measure
- Author
-
Perna, C., Salvati, N., Schirripa Spagnolo, F., Camerlenghi, F., Corradin, R., and Ongaro, A.
- Published
- 2021
14. Consistent estimation of small masses in feature sampling
- Author
-
Ayed, F., Battiston, M., Camerlenghi, F., and Favaro, S.
- Abstract
Consider an (observable) random sample of size n from an infinite population of individuals, each individual being endowed with a finite set of "features" from a collection of features (F_j)_{j>=1} with unknown probabilities (p_j)_{j>=1}, i.e., p_j is the probability that an individual displays feature F_j. Under this feature sampling framework, in recent years there has been a growing interest in estimating the sum of the probability masses p_j of features observed with frequency r >= 0 in the sample, here denoted by M_{n,r}. This is the natural feature sampling counterpart of the classical problem of estimating small probabilities in the species sampling framework, where each individual is endowed with only one feature (or "species"). In this paper we study the problem of consistent estimation of the small mass M_{n,r}. We first show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass M_{n,0}. Then, we introduce an estimator of M_{n,r} and identify sufficient conditions under which the estimator is consistent. In particular, we propose a nonparametric estimator M̂_{n,r} of M_{n,r} which has the same analytic form as the celebrated Good-Turing estimator for small probabilities, with the sole difference that the two estimators have different ranges (supports). Then, we show that M̂_{n,r} is strongly consistent, in the multiplicative sense, under the assumption that (p_j)_{j>=1} has regularly varying heavy tails.
- Published
- 2021
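The abstract notes that the proposed estimator shares the analytic form of the classical Good-Turing estimator. As a rough illustration of that classical form only (not the paper's feature-sampling version), the mass of items seen exactly r times is estimated from the frequency-of-frequencies:

```python
from collections import Counter

def good_turing_mass(frequencies, r):
    """Classical Good-Turing estimate of the total probability mass of
    items observed exactly r times: (r + 1) * Y_{r+1} / n, where Y_{r+1}
    is the number of items with frequency r + 1 and n is the sample size."""
    n = sum(frequencies)
    y = Counter(frequencies)  # frequency-of-frequencies: y[k] = # items seen k times
    return (r + 1) * y[r + 1] / n
```

For per-item counts [1, 1, 1, 2, 3] (n = 8), the estimated missing mass (r = 0) is 1 * 3 / 8 = 0.375, since three items were seen exactly once.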
15. Survival analysis via hierarchically dependent mixture hazards
- Author
-
Camerlenghi, F., Lijoi, A., and Pruenster, I.
- Abstract
Hierarchical nonparametric processes are popular tools for defining priors on collections of probability distributions, which induce dependence across multiple samples. In survival analysis problems, one is typically interested in modeling the hazard rates, rather than the probability distributions themselves, and the currently available methodologies are not applicable. Here, we fill this gap by introducing a novel, and analytically tractable, class of multivariate mixtures whose distribution acts as a prior for the vector of sample-specific baseline hazard rates. The dependence is induced through a hierarchical specification of the mixing random measures that ultimately corresponds to a composition of random discrete combinatorial structures. Our theoretical results allow us to develop a full Bayesian analysis for this class of models, which can also account for right-censored survival data and covariates, and we also show posterior consistency. In particular, we emphasize that the posterior characterization we achieve is the key for devising both marginal and conditional algorithms for evaluating Bayesian inferences of interest. The effectiveness of our proposal is illustrated through some synthetic and real data examples.
- Published
- 2021
16. Bayesian nonparametric prediction: from species to features
- Author
-
Perna, C., Salvati, N., Schirripa Spagnolo, F., Masoero, L., Camerlenghi, F., Favaro, S., and Broderick, T.
- Published
- 2021
17. Issues on Bayesian nonparametric measures of disclosure risk
- Author
-
Camerlenghi, F., Carota, C., and Favaro, S.
- Subjects
Dirichlet process prior, Bayesian nonparametrics, disclosure risk, empirical Bayes, exchangeable random partition, identifying and sensitive information
- Published
- 2019
18. Hierarchical and nested random probability measures with statistical applications
- Author
-
Camerlenghi, F.
- Subjects
Bayesian nonparametrics, nested processes, partial exchangeability, completely random measures, hierarchical processes
- Published
- 2019
19. A formal approach to data swapping and disclosure limitation techniques
- Author
-
Ayed, F., Battiston, M., and Camerlenghi, F.
- Subjects
Data swapping, Disclosure risk, Mutual Information, Differential Privacy, Multinomial Models
- Published
- 2019
20. Hierarchies of nonparametric priors
- Author
-
Camerlenghi, F., Favaro, S., and Masoero, L.
- Subjects
Hierarchical processes, species models, feature models, completely random measures, partial exchangeability
- Published
- 2019
21. Issues with Nonparametric Disclosure Risk Assessment
- Author
-
Camerlenghi, F., Favaro, S., Naulet, Z., and Panero, F.
- Subjects
disclosure risk assessment, microdata sample, parametric inference, nonparametric inference
- Published
- 2019
22. Bayesian nonparametric prediction with multi-sample data
- Author
-
La Rocca, M., Liseo, B., Salmaso, L., Camerlenghi, F., Lijoi, A., and Pruenster, I.
- Abstract
In the present paper, we address the problem of prediction within the setting of species sampling models. We consider d populations composed of different species with unknown proportions. Our goal is to predict specific features of additional and unobserved samples from the d populations by adopting a Bayesian nonparametric model. We focus on a broad class of hierarchical priors. These were introduced and investigated in [1], where an algorithm for drawing predictions is also devised, although without any specific numerical illustration. The aim of this paper is twofold: on the one hand, we provide an illustration with an actual implementation of the algorithm of [1]; on the other hand, we discuss its relevance with respect to complex prediction problems with species sampling data.
- Published
- 2020
23. Estimating the number of unseen species under heavy tails
- Author
-
Battiston, M., Camerlenghi, F., Dolera, E., Favaro, S., Abbruzzo, A., Brentari, E., Chiodi, M., and Piacentino, D.
- Subjects
Good-Toulmin type estimators, regular variation, species estimation, two parameter Poisson-Dirichlet prior
- Published
- 2018
24. Distribution theory for hierarchical processes
- Author
-
Camerlenghi, F., Lijoi, A., Orbanz, P., and Pruenster, I.
- Published
- 2017
25. Good-Toulmin type estimators for the number of unseen species
- Author
-
Camerlenghi, F., Dolera, E., and Favaro, S.
- Published
- 2017
26. Distribution theory for hierarchical processes
- Author
-
Camerlenghi, Federico, Lijoi, Antonio, Orbanz, Peter, and Prünster, Igor
- Abstract
Hierarchies of discrete probability measures are remarkably popular as nonparametric priors in applications, arguably due to two key properties: (i) they naturally represent multiple heterogeneous populations; (ii) they produce ties across populations, resulting in a shrinkage property often described as “sharing of information.” In this paper, we establish a distribution theory for hierarchical random measures that are generated via normalization, thus encompassing both the hierarchical Dirichlet and hierarchical Pitman–Yor processes. These results provide a probabilistic characterization of the induced (partially exchangeable) partition structure, including the distribution and the asymptotics of the number of partition sets, and a complete posterior characterization. They are obtained by representing hierarchical processes in terms of completely random measures, and by applying a novel technique for deriving the associated distributions. Moreover, they also serve as building blocks for new simulation algorithms, and we derive marginal and conditional algorithms for Bayesian inference.
- Published
- 2019
27. Latent nested nonparametric priors (with discussion)
- Author
-
Camerlenghi, F., Dunson, D. B., Lijoi, A., Pruenster, I., and Rodriguez, A.
- Abstract
Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and, then, normalizing to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov Chain Monte Carlo sampler for Bayesian inferences. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
- Published
- 2019
28. Density Estimation via Hierarchies of Nonparametric Priors
- Author
-
Camerlenghi, F., Lijoi, A., and Pruenster, I.
- Abstract
In Bayesian nonparametrics, partial exchangeability is a useful assumption tailored for heterogeneous, though related, groups of observations. Recent contributions in the Bayesian literature have focused on the construction of dependent nonparametric priors to accommodate partially exchangeable sequences of observations. In the present paper we concentrate on vectors of hierarchical Pitman-Yor processes, in which the dependence is created by choosing a common random base measure for each group of observations. These hierarchical processes are then used to define dependent hierarchical mixtures. We finally apply the model to estimate densities arising from multiple groups of observations by means of a suitable Gibbs sampling algorithm.
- Published
- 2018
29. Estimating the number of unseen species under heavy tails
- Author
-
Abbruzzo, A., Brentari, E., Chiodi, M., Piacentino, D., Battiston, M., Camerlenghi, F., Dolera, E., and Favaro, S.
- Published
- 2018
30. Bayesian nonparametric inference beyond the Gibbs-type framework
- Author
-
Camerlenghi, Federico, Lijoi, Antonio, and Prünster, Igor
- Abstract
The definition and investigation of general classes of nonparametric priors has recently been an active research line in Bayesian statistics. Among the various proposals, the Gibbs-type family, which includes the Dirichlet process as a special case, stands out as the most tractable class of nonparametric priors for exchangeable sequences of observations. This is the consequence of a key simplifying assumption on the learning mechanism, which, however, has no justification other than ensuring mathematical tractability. In this paper, we remove such an assumption and investigate a general class of random probability measures going beyond the Gibbs-type framework. More specifically, we present a nonparametric hierarchical structure based on transformations of completely random measures, which extends the popular hierarchical Dirichlet process. This class of priors preserves a good degree of tractability, given that we are able to determine the fundamental quantities for Bayesian inference. In particular, we derive the induced partition structure and the prediction rules and characterize the posterior distribution. These theoretical results are also crucial to devise both a marginal and a conditional algorithm for posterior inference. An illustration concerning prediction in genomic sequencing is also provided.
- Published
- 2018
31. Large and moderate deviations for kernel–type estimators of the mean density of Boolean models
- Author
-
Camerlenghi, F. and Villa, E.
- Abstract
The mean density of a random closed set with integer Hausdorff dimension is a crucial notion in stochastic geometry; indeed, it is a fundamental tool in a large variety of applied problems, such as image analysis, medicine, computer vision, etc. Hence the estimation of the mean density is a problem of interest from both a theoretical and a computational standpoint. Nowadays different kinds of estimators are available in the literature; in particular, here we focus on a kernel-type estimator, which may be considered as a generalization of the traditional kernel density estimator of random variables to the case of random closed sets. The aim of the present paper is to provide asymptotic properties of such an estimator in the context of Boolean models, which are a broad class of random closed sets. More precisely we are able to prove large and moderate deviation principles, which allow us to derive the strong consistency of the estimator of the mean density as well as asymptotic confidence intervals. Finally we underline the connection of our theoretical findings with the classical literature concerning density estimation of random variables.
- Published
- 2018
32. Nonparametric hierarchical models based on completely random measures
- Author
-
Camerlenghi, F., Lijoi, A., and Pruenster, I.
- Abstract
A very active line of research in Bayesian statistics has aimed at defining and investigating general classes of nonparametric priors. A notable example, which includes the Dirichlet process, is obtained through normalization or transformation of completely random measures. These have been extensively studied for the exchangeable setting. However, in a large variety of applied problems data are heterogeneous, being generated by different, though related, experiments; in such situations partial exchangeability is a more appropriate assumption. In this spirit we propose a nonparametric hierarchical model based on transformations of completely random measures, which extends the hierarchical Dirichlet process. The model allows us to handle related groups of observations, creating a borrowing of strength between them. From the theoretical viewpoint, we analyze the induced partition structure, which plays a pivotal role in a very large number of inferential problems. The resulting partition probability function has a feasible expression, suitable to address prediction in its generality, as suggested by de Finetti. Finally we propose a set of applications which include inference on genomic and survival data.
- Published
- 2015
33. Nonparametric hierarchical models based on completely random measures
- Author
-
Camerlenghi, F., Lijoi, A., and Pruenster, I.
- Subjects
Bayesian nonparametrics, SECS-S/01 - STATISTICA
- Abstract
A very active line of research in Bayesian statistics has aimed at defining and investigating general classes of nonparametric priors. A notable example, which includes the Dirichlet process, is obtained through normalization or transformation of completely random measures. These have been extensively studied for the exchangeable setting. However, in a large variety of applied problems data are heterogeneous, being generated by different, though related, experiments; in such situations partial exchangeability is a more appropriate assumption. In this spirit we propose a nonparametric hierarchical model based on transformations of completely random measures, which extends the hierarchical Dirichlet process. The model allows us to handle related groups of observations, creating a borrowing of strength between them. From the theoretical viewpoint, we analyze the induced partition structure, which plays a pivotal role in a very large number of inferential problems. The resulting partition probability function has a feasible expression, suitable to address prediction in its generality, as suggested by de Finetti. Finally we propose a set of applications which include inference on genomic and survival data.
- Published
- 2015
34. On some distributional properties of hierarchical processes
- Author
-
Camerlenghi, F., Lijoi, A., and Pruenster, I.
- Abstract
Vectors of hierarchical random probability measures are popular tools in Bayesian nonparametrics. They may be used as priors whenever partial exchangeability is assumed at the level of either the observations or some latent variables involved in the model. The first contribution in this direction can be found in Teh et al. (2006), who introduced the hierarchical Dirichlet process. Recently, Camerlenghi et al. (2018) have developed a general distribution theory for hierarchical processes, which includes the derivation of the partition structure, the posterior distribution and the prediction rules. The present paper is a review of these theoretical findings for vectors of hierarchies of Pitman–Yor processes.
- Published
- 2017
35. Bayesian prediction with multiple-samples information
- Author
-
Camerlenghi, F., Lijoi, A., and Pruenster, I.
- Abstract
The prediction of future outcomes of a random phenomenon is typically based on a certain number of “analogous” observations from the past. When observations are generated by multiple samples, a natural notion of analogy is partial exchangeability and the problem of prediction can be effectively addressed in a Bayesian nonparametric setting. Instead of confining ourselves to the prediction of a single future experimental outcome, as in most treatments of the subject, we aim at predicting features of an unobserved additional sample of any size. We first provide a structural property of prediction rules induced by partially exchangeable arrays, without assuming any specific nonparametric prior. Then we focus on a general class of hierarchical random probability measures and devise a simulation algorithm to forecast the outcome of m future observations, for any m≥1. The theoretical result and the algorithm are illustrated by means of a real dataset, which also highlights the “borrowing strength” behavior across samples induced by the hierarchical specification.
- Published
- 2017
36. Asymptotic results for multivariate estimators of the mean density of random closed sets
- Author
-
Camerlenghi, F, Macci, C, and Villa, E
- Abstract
The problem of the evaluation and estimation of the mean density of random closed sets in Rd with integer Hausdorff dimension 0 < n < d is of great interest in many different scientific and technological fields. Among the estimators of the mean density available in the literature, the so-called "Minkowski content"-based estimator reveals its benefits in applications to non-stationary cases. We introduce here a multivariate version of this estimator, and we study its asymptotic properties by means of large and moderate deviation results. In particular, we prove that the estimator is strongly consistent and asymptotically normal. Furthermore, we also provide confidence regions for the mean density of the involved random closed set at m ≥ 1 distinct points x1, …, xm ∈ Rd.
- Published
- 2016
37. Nested processes based on completely random measures
- Author
-
Camerlenghi, F, Dunson, DB, Lijoi, A, Pruenster, I, and Rodriguez, A
- Published
- 2016
38. On time-dependent Gibbs-type random probability measures
- Author
-
Camerlenghi, F, Pruenster, I, and Ruggiero, M
- Abstract
We review some recent constructions of time-dependent random probability measures of Gibbs type which exhibit diffusive behaviour. The characterization of the dynamics of the type frequencies, and those of the type heterogeneity in the underlying population, also known as alpha diversity, allows for a qualitative classification of these models according to whether the heterogeneity is driven by state-dependent
- Published
- 2016
39. Optimal Bandwidth of the “Minkowski Content”-Based Estimator of the Mean Density of Random Closed Sets: Theoretical Results and Numerical Experiments
- Author
-
Camerlenghi, F, and Villa, E
- Abstract
The estimation of the mean density of random closed sets in $\mathbb{R}^d$ with integer Hausdorff dimension $n < d$
- Published
- 2015
40. Numerical experiments for the estimation of mean densities of random sets
- Author
-
Camerlenghi, F, Capasso, V, and Villa, E
- Abstract
Many real phenomena may be modelled as random closed sets in Rd, of different Hausdorff dimensions. The problem of the estimation of pointwise mean densities of absolutely continuous, and spatially inhomogeneous, random sets with Hausdorff dimension n < d has been the subject of extended mathematical analysis by the authors. In particular, two different kinds of estimators have been recently proposed: the first one is based on the notion of Minkowski content, the second one is a kernel-type estimator generalizing the well-known kernel density estimator for random variables. The specific aim of the present paper is to validate the theoretical results on statistical properties of those estimators by numerical experiments. We provide a set of simulations which illustrates their valuable properties via typical examples of lower dimensional random sets.
- Published
- 2014
41. On the estimation of the mean density of random closed sets
- Author
-
Camerlenghi, F, Capasso, V, and Villa, E
- Abstract
Many real phenomena may be modeled as random closed sets in Rd, of different Hausdorff dimensions. Of particular interest are cases in which their Hausdorff dimension, say n, is strictly less than d, such as fiber processes, boundaries of germ-grain models, and n-facets of random tessellations. A crucial problem is the estimation of pointwise mean densities of absolutely continuous, and spatially inhomogeneous random sets, as defined by the authors in a series of recent papers. While the case n = 0 (random vectors, point processes, etc.) has been, and still is, the subject of extensive literature, in this paper we face the general case of any n < d; pointwise density estimators which extend the notion of kernel density estimators for random vectors are analyzed, together with a previously proposed estimator based on the notion of Minkowski content. In a series of papers, the authors have established the mathematical framework for obtaining suitable approximations of such mean densities. Here we study the unbiasedness and consistency properties, and identify optimal bandwidths for all proposed estimators, under sufficient regularity conditions. We show how some known results in the literature follow as particular cases. A series of examples throughout the paper, both non-stationary and stationary, are provided to illustrate various relevant situations.
- Published
- 2014
42. Scaled Process Priors for Bayesian Nonparametric Estimation of the Unseen Genetic Variation
- Author
-
Federico Camerlenghi, Stefano Favaro, Lorenzo Masoero, and Tamara Broderick
- Subjects
Statistics and Probability, Bayesian nonparametrics, Scaled process prior, Beta process prior, Completely random measure, Stable process, Genetic variation, Predictive distribution, Unseen-features problem, SECS-S/01 - STATISTICA, Statistics - Methodology - Abstract
There is a growing interest in the estimation of the number of unseen features, mostly driven by biological applications. A recent work brought out a peculiar property of the popular completely random measures (CRMs) as prior models in Bayesian nonparametric (BNP) inference for the unseen-features problem: for fixed prior’s parameters, they all lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. CRMs are thus not a flexible prior model for the unseen-features problem and, while the Poisson posterior distribution may be appealing for analytical tractability and ease of interpretability, its independence from the sampling information makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior’s parameters. In this article, we introduce the stable-Beta scaled process (SB-SP) prior, and we show that it allows one to enrich the posterior distribution of the number of unseen features arising under CRM priors, while maintaining its analytical tractability and interpretability. That is, the SB-SP prior leads to a negative Binomial posterior distribution, which depends on the sampling information through the sample size and the number of distinct features, with corresponding estimates being simple, linear in the sampling information and computationally efficient. We apply our BNP approach to synthetic data and to real cancer genomic data, showing that: (i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; (ii) it provides improved coverage for the estimation with respect to a BNP approach under CRM priors. Supplementary materials for this article are available online.
- Published
- 2022
43. Clustering artists based on the energy distributions of their songs on Spotify via the Common Atoms Model
- Author
-
Francesco Denti, Federico Camerlenghi, Michele Guindani, Antonietta Mira, Balzanella, A, Bini, M, Cavicchia, C, and Verde, R
- Subjects
SECS-S/01 - STATISTICA ,Common Atoms Model, partially exchangeable data, nested data, Spotify dataset, Kaggle, energy - Abstract
Partially exchangeable datasets are characterized by observations grouped into known, heterogeneous units. The recently developed Common Atoms Model (CAM) is a Bayesian nonparametric technique suited for analyzing this type of data. CAM induces a two-layered clustering structure: one across observations and another across units. In particular, the units are clustered according to their distributional similarities. In this article, we illustrate the versatility of CAM with an application to an openly available Spotify dataset. The dataset contains quantitative audio features for a large number of songs grouped by artists. After describing the data preprocessing steps, we employ CAM to group the Spotify artists according to the distributions of the energy of their songs.
- Published
- 2022
44. Contaminated Gibbs-type priors
- Author
-
Federico Camerlenghi, Riccardo Corradin, and Andrea Ongaro
- Subjects
Statistics and Probability, Applied Mathematics, Bayesian nonparametrics, Gibbs-type priors, Mixture models, Random partitions, Species sampling models, Mathematics - Statistics Theory, SECS-S/01 - STATISTICA, Statistics - Methodology - Abstract
Gibbs-type priors are widely used as key components in several Bayesian nonparametric models. By virtue of their flexibility and mathematical tractability, they turn out to be predominant priors in species sampling problems, clustering and mixture modelling. We introduce a new family of processes which extend the Gibbs-type one, by including a contaminant component in the model to account for the presence of anomalies (outliers) or an excess of observations with frequency one. We first investigate the induced random partition, the associated predictive distribution and we characterize the asymptotic behaviour of the number of clusters. All the results we obtain are in closed form and easily interpretable; as a noteworthy example we focus on the contaminated version of the Pitman-Yor process. Finally we pinpoint the advantage of our construction in different applied problems: we show how the contaminant component helps to perform outlier detection for an astronomical clustering problem and to improve predictive inference in a species-related dataset, exhibiting a high number of species with frequency one.
- Published
- 2021
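The Pitman-Yor process highlighted in the abstract above admits a sequential urn (Chinese-restaurant-type) predictive scheme, which the contaminated construction extends with an extra component for frequency-one observations. As a minimal sketch of the standard, uncontaminated scheme only (parameter values and the function name are illustrative, not taken from the paper):

```python
import random

def pitman_yor_partition(n, sigma=0.5, theta=1.0, seed=0):
    """Simulate a random partition of n items from a Pitman-Yor(sigma, theta)
    process via its sequential predictive (urn) scheme.

    With i items already seated in k blocks of sizes n_1, ..., n_k:
      - a new block opens with probability (theta + sigma * k) / (theta + i);
      - block j is joined with probability (n_j - sigma) / (theta + i).
    Returns the list of block sizes."""
    rng = random.Random(seed)
    counts = []  # counts[j] = size of block j
    for i in range(n):
        k = len(counts)
        if rng.random() * (theta + i) < theta + sigma * k:
            counts.append(1)  # open a new block
        else:
            j = rng.choices(range(k), weights=[c - sigma for c in counts])[0]
            counts[j] += 1
    return counts

blocks = pitman_yor_partition(1000, sigma=0.5, theta=1.0, seed=0)
print(len(blocks), sum(1 for c in blocks if c == 1))  # blocks, singletons
```

For sigma > 0 the number of blocks grows like n^sigma and a large share of blocks are singletons, which is precisely the regime the contaminant component is designed to model separately.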
45. Survival analysis via hierarchically dependent mixture hazards
- Author
-
Federico Camerlenghi, Antonio Lijoi, and Igor Prünster
- Subjects
Statistics and Probability, Bayesian nonparametrics, Completely random measures, Generalized gamma processes, Hazard rate mixtures, Hierarchical processes, Meta-analysis, Partial exchangeability, SECS-S/01 - STATISTICA - Abstract
Hierarchical nonparametric processes are popular tools for defining priors on collections of probability distributions, which induce dependence across multiple samples. In survival analysis problems, one is typically interested in modeling the hazard rates, rather than the probability distributions themselves, and the currently available methodologies are not applicable. Here, we fill this gap by introducing a novel, and analytically tractable, class of multivariate mixtures whose distribution acts as a prior for the vector of sample-specific baseline hazard rates. The dependence is induced through a hierarchical specification of the mixing random measures that ultimately corresponds to a composition of random discrete combinatorial structures. Our theoretical results allow us to develop a full Bayesian analysis for this class of models, which can also account for right-censored survival data and covariates; we also show posterior consistency. In particular, we emphasize that the posterior characterization we achieve is the key for devising both marginal and conditional algorithms for evaluating Bayesian inferences of interest. The effectiveness of our proposal is illustrated through some synthetic and real data examples.
- Published
- 2021
46. A Common Atoms Model for the Bayesian Nonparametric Analysis of Nested Data
- Author
-
Antonietta Mira, Federico Camerlenghi, Michele Guindani, and Francesco Denti
- Subjects
Statistics and Probability, Bayesian nonparametrics, Common atoms model, Microbiome abundance analysis, Nested data, Nested Dirichlet process, Partially exchangeable data, SECS-S/01 - STATISTICA - Abstract
The use of large datasets for targeted therapeutic interventions requires new ways to characterize the heterogeneity observed across subgroups of a specific population. In particular, models for partially exchangeable data are needed for inference on nested datasets, where the observations are assumed to be organized in different units and some sharing of information is required to learn distinctive features of the units. In this manuscript, we propose a nested common atoms model (CAM) that is particularly suited for the analysis of nested datasets where the distributions of the units are expected to differ only over a small fraction of the observations sampled from each unit. The proposed CAM allows a two-layered clustering at the distributional and observational level and is amenable to scalable posterior inference through the use of a computationally efficient nested slice sampler algorithm. We further discuss how to extend the proposed modeling framework to handle discrete measurements, and we conduct posterior inference on a real microbiome dataset from a diet swap study to investigate how the alterations in intestinal microbiota composition are associated with different eating habits. We further investigate the performance of our model in capturing true distributional structures in the population by means of a simulation study.
- Published
- 2021
47. Consistent estimation of small masses in feature sampling
- Author
-
Marco Battiston, Fadhel Ayed, Federico Camerlenghi, and Stefano Favaro
- Subjects
Good-Turing estimator, multiplicative consistency, nonparametric inference, missing mass, feature sampling, regularly varying heavy-tailed distributions, species sampling, SECS-S/01 - STATISTICA - Abstract
Consider an (observable) random sample of size $n$ from an infinite population of individuals, each individual being endowed with a finite set of "features" from a collection of features $(F_j)_{j\geq 1}$ with unknown probabilities $(p_j)_{j\geq 1}$, i.e., $p_j$ is the probability that an individual displays feature $F_j$. Under this feature sampling framework, in recent years there has been a growing interest in estimating the sum of the probability masses $p_j$ of features observed with frequency $r \geq 0$ in the sample, here denoted by $M_{n,r}$. This is the natural feature sampling counterpart of the classical problem of estimating small probabilities in the species sampling framework, where each individual is endowed with only one feature (or "species"). In this paper we study the problem of consistent estimation of the small mass $M_{n,r}$. We first show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_{n,0}$. Then, we introduce an estimator of $M_{n,r}$ and identify sufficient conditions under which the estimator is consistent. In particular, we propose a nonparametric estimator $\hat{M}_{n,r}$ of $M_{n,r}$ which has the same analytic form of the celebrated Good-Turing estimator for small probabilities, with the sole difference that the two estimators have different ranges (supports). Then, we show that $\hat{M}_{n,r}$ is strongly consistent, in the multiplicative sense, under the assumption that $(p_j)_{j\geq 1}$ has regularly varying heavy tails.
- Published
- 2021
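The abstract above notes that the proposed estimator shares the analytic form of the Good-Turing estimator. As a concrete reference point, here is a sketch of that classical species-sampling form computed from frequency counts (the paper's feature-sampling estimator differs in its range/support):

```python
from collections import Counter

def good_turing_mass(sample, r):
    """Classical Good-Turing estimate of the total probability mass of
    species observed exactly r times: (r + 1) * f_{r+1} / n,
    where f_j is the number of species appearing j times in the sample."""
    n = len(sample)
    freq_of_freqs = Counter(Counter(sample).values())  # f_j
    return (r + 1) * freq_of_freqs.get(r + 1, 0) / n

# Frequencies: a:2, b:1, c:3, d:1 -> f_1 = 2, f_2 = 1, f_3 = 1, n = 7
sample = ["a", "a", "b", "c", "c", "c", "d"]
print(good_turing_mass(sample, 0))  # missing mass f_1 / n = 2/7
```

The r = 0 case (the "missing mass") is exactly the quantity whose universally consistent estimation the paper shows to be impossible in the multiplicative sense.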
48. More for less: predicting and maximizing genomic variant discovery via Bayesian nonparametrics
- Author
-
Tamara Broderick, Stefano Favaro, Lorenzo Masoero, and Federico Camerlenghi
- Subjects
Statistics and Probability, Bayesian nonparametric inference, New genomic variants discovery, Optimal experimental design, Resource allocation, SECS-S/01 - STATISTICA, Statistics - Methodology - Abstract
While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains non-trivial. Under a fixed budget, then, scientists face a natural trade-off between quantity and quality; they can spend resources to sequence a greater number of genomes (quantity) or spend resources to sequence genomes with increased accuracy (quality). Our goal is to find the optimal allocation of resources between quantity and quality. Optimizing resource allocation promises to reveal as many new variations in the genome as possible, and thus as many new scientific insights as possible. In this paper, we consider the common setting where scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. We introduce a Bayesian nonparametric methodology to predict the number of new variants in the follow-up study based on the pilot study. When experimental conditions are kept constant between the pilot and follow-up, we demonstrate on real data from the gnomAD project that our prediction is more accurate than three recent proposals, and competitive with a more classic proposal. Unlike existing methods, though, our method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for (i) more realistic predictions and (ii) optimal allocation of a fixed budget between quality and quantity.
- Published
- 2021
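The quantity being predicted above, the number of new variants a follow-up study reveals beyond a pilot, can be illustrated with a toy feature-sampling simulation. The sketch below uses a plain Indian buffet process, not the paper's methodology; alpha, the seed, and the function name are illustrative assumptions:

```python
import math
import random

def ibp_new_features(n_pilot, n_follow, alpha=5.0, seed=1):
    """Toy Indian buffet process over n_pilot + n_follow individuals:
    individual i (1-indexed) carries existing feature k with probability
    m_k / i (m_k = carriers so far) and brings Poisson(alpha / i) brand-new
    features. Returns how many distinct features first appear after the
    pilot, i.e. the follow-up's count of newly discovered 'variants'."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's multiplication method; adequate for small lam
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    m = []  # m[k] = number of individuals carrying feature k
    pilot_features = 0
    for i in range(1, n_pilot + n_follow + 1):
        for k in range(len(m)):
            if rng.random() < m[k] / i:
                m[k] += 1
        m.extend([1] * poisson(alpha / i))  # new features enter with 1 carrier
        if i == n_pilot:
            pilot_features = len(m)
    return len(m) - pilot_features

print(ibp_new_features(50, 50, alpha=5.0))
```

Since the expected feature count grows roughly like alpha times the harmonic number of the sample size, discovery slows as the pilot grows, which is the diminishing-returns trade-off the paper's budget-allocation analysis formalizes.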
49. Optimal disclosure risk assessment
- Author
-
Francesca Panero, Zacharie Naulet, Federico Camerlenghi, and Stefano Favaro
- Subjects
Statistics and Probability, Disclosure risk assessment, Poisson abundance model, Nonparametric inference, Microdata sample, Sampling fraction, Optimal minimax procedure, Polynomial approximation, 62G05, 62C20, SECS-S/01 - STATISTICA - Abstract
Protection against disclosure is a legal and ethical obligation for agencies releasing microdata files for public use. Consider a microdata sample of size $n$ from a finite population of size $\bar{n}=n+\lambda n$, with $\lambda>0$, such that each record contains two disjoint types of information: identifying categorical information and sensitive information. Any decision about releasing data is supported by the estimation of measures of disclosure risk, which are functionals of the number of sample records with a unique combination of values of identifying variables. The most common measure is arguably the number $\tau_{1}$ of sample unique records that are population uniques. In this paper, we first study nonparametric estimation of $\tau_{1}$ under the Poisson abundance model for sample records. We introduce a class of linear estimators of $\tau_{1}$ that are simple, computationally efficient and scalable to massive datasets, and we give uniform theoretical guarantees for them. In particular, we show that they provably estimate $\tau_{1}$ all of the way up to the sampling fraction $(\lambda+1)^{-1}\propto (\log n)^{-1}$, with vanishing normalized mean-square error (NMSE) for large $n$. We then establish a lower bound for the minimax NMSE for the estimation of $\tau_{1}$, which allows us to show that: i) $(\lambda+1)^{-1}\propto (\log n)^{-1}$ is the smallest possible sampling fraction; ii) estimators' NMSE is near optimal, in the sense of matching the minimax lower bound, for large $n$. This is the main result of our paper, and it provides a precise answer to an open question about the feasibility of nonparametric estimation of $\tau_{1}$ under the Poisson abundance model and for a sampling fraction $(\lambda+1)^{-1}
- Published
- 2021
50. On Johnson’s 'Sufficientness' Postulates for Feature-Sampling Models
- Author
-
Stefano Favaro and Federico Camerlenghi
- Subjects
General Mathematics, Bayesian nonparametrics, Johnson’s “sufficientness” postulate, Species-sampling model, Feature-sampling model, Scaled process prior, Predictive distribution, Exchangeability, De Finetti theorem, Dirichlet distribution, SECS-S/01 - STATISTICA - Abstract
In the 1920s, the English philosopher W.E. Johnson introduced a characterization of the symmetric Dirichlet prior distribution in terms of its predictive distribution. This is typically referred to as Johnson’s “sufficientness” postulate, and it has been the subject of many contributions in Bayesian statistics, leading to predictive characterization for infinite-dimensional generalizations of the Dirichlet distribution, i.e., species-sampling models. In this paper, we review “sufficientness” postulates for species-sampling models, and then investigate analogous predictive characterizations for the more general feature-sampling models. In particular, we present a “sufficientness” postulate for a class of feature-sampling models referred to as Scaled Processes (SPs), and then discuss analogous characterizations in the general setup of feature-sampling models.
- Published
- 2021
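Johnson’s “sufficientness” postulate, reviewed above, singles out the symmetric Dirichlet prior by requiring that the predictive probability of a category depend on the sample only through that category’s count and the sample size. A minimal sketch of the resulting predictive rule (the parameter a is illustrative):

```python
def dirichlet_predictive(counts, a=1.0):
    """Predictive probabilities of the next observation under a symmetric
    Dirichlet(a, ..., a) prior over k categories:
        P(next = j | counts) = (n_j + a) / (n + k * a).
    Sufficientness: the probability for category j depends on the counts
    only through n_j and the total n, and is linear in n_j."""
    n, k = sum(counts), len(counts)
    return [(c + a) / (n + k * a) for c in counts]

# n = 4, k = 3, denominator 4 + 3*1 = 7
print(dirichlet_predictive([3, 1, 0], a=1.0))  # [4/7, 2/7, 1/7]
```

The species- and feature-sampling characterizations surveyed in the paper generalize exactly this linearity property to infinite-dimensional priors.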