Lyons, Mitchell B., Keith, David A., Warton, David I., Somerville, Michael, Kingsford, Richard T., and De Cáceres, Miquel
Aim If a classification is to serve as an informative surrogate for biodiversity, it should provide information about the composition and abundance of the species within communities. A natural way to formalize this is with a predictive model, where group membership (clusters) is the predictor and the multivariate species data (site-by-species matrix) are the response. In this study, we aimed to develop a predictive model-based framework for evaluating the predictive performance of alternative classifications of vegetation communities, and to apply it to make objective and automated decisions about classification structure.
Methods We used GLMs fit to multivariate species data to predict the occurrence of individual species from site groupings. We used AIC to estimate the predictive performance of alternative models in order to: (1) identify the optimal partitioning of sites among multiple competing flexible-β clustering solutions; (2) identify the species that contribute most to compositional differences between clusters (i.e. characteristic species); and (3) automatically merge clusters to maximize expected predictive performance using an iterative pruning approach. Using field data from southeastern Australia, and simulated data, we demonstrate our approach for common ecological data types (presence/absence, counts, cover-abundance scores, percentage cover). We supply all code and data required for these analyses.
Results AIC was a useful metric for assessing competing classification solutions. Our method produced outputs that were simple to interpret and required few subjective choices by the user, while performing similarly to the popular OptimClass assessment methodology. Characteristic species defined by predictive performance were consistent between data types, and agreed well in general with existing methods for defining characteristic species. Using model performance to iteratively refine clustering produced classifications with better than expected predictive performance compared to the dendrogram hierarchy, although the flexible-β hierarchy did a reasonable job of improving predictive performance.
Conclusions Appropriately specified models are a natural way to maximize the predictive performance of a classification and its associated diagnostics. We show that a model-based assessment provides a clear decision framework based on data type, offering an objective pathway to make classification assessment decisions, as well as to evaluate methodological choice and performance. [ABSTRACT FROM AUTHOR]
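The AIC comparison described under Methods can be illustrated with a minimal sketch for presence/absence data. This is not the authors' supplied code. It assumes a separate Bernoulli GLM per species with cluster membership as the only predictor, in which case the maximum-likelihood fitted probabilities are simply the within-cluster occurrence proportions, so the log-likelihood can be computed directly. It also assumes that per-species AIC values are summed to score a whole classification. The function names and the toy data are hypothetical.

```python
import numpy as np

def _xlogp(x, p):
    # Convention 0 * log(0) = 0, needed when a species is always
    # present or always absent within a cluster.
    return 0.0 if x == 0 else float(x * np.log(p))

def bernoulli_glm_aic(y, groups):
    """AIC of a Bernoulli GLM for one species, cluster as the only predictor.

    With a single categorical predictor the fitted probabilities equal
    the per-cluster occurrence proportions (an assumption of this sketch).
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    loglik = 0.0
    for g in labels:
        yg = y[groups == g]
        n, k = len(yg), yg.sum()
        p = k / n
        loglik += _xlogp(k, p) + _xlogp(n - k, 1.0 - p)
    # One parameter per cluster.
    return 2 * len(labels) - 2 * loglik

def classification_aic(Y, groups):
    """Sum of per-species AIC over a site-by-species presence/absence matrix."""
    Y = np.asarray(Y)
    return sum(bernoulli_glm_aic(Y[:, j], groups) for j in range(Y.shape[1]))

# Toy site-by-species matrix: 6 sites (rows), 4 species (columns).
Y = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])
good = [0, 0, 0, 1, 1, 1]   # grouping that matches the species turnover
poor = [0, 1, 0, 1, 0, 1]   # grouping that cuts across it

print(classification_aic(Y, good), classification_aic(Y, poor))
```

Lower summed AIC indicates the grouping that better predicts species occurrence. Passing a single label for every site gives the intercept-only model, and the per-species difference from that baseline is one possible way to rank candidate characteristic species under the same assumptions.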