Hierarchical Clustering: Objective Functions and Algorithms
- Source :
- SODA 2018 - Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, Jan 2018, New Orleans, LA, United States, pp. 378-397, ⟨10.1137/1.9781611975031.26⟩. Mallmann-Trenn, Frederik
- Publication Year :
- 2018
- Publisher :
- HAL CCSD, 2018.
-
Abstract
- Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering has focused on providing algorithms rather than on optimizing a specific objective, Dasgupta (2016) framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a `good' hierarchical clustering is one that minimizes a particular cost function. He showed that this cost function has certain desirable properties: to achieve optimal cost, disconnected components must be separated first, and in `structureless' graphs, i.e., cliques, all clusterings achieve the same cost. We take an axiomatic approach to defining `good' objective functions for both similarity- and dissimilarity-based hierarchical clustering. We characterize a set of admissible objective functions (including the one introduced by Dasgupta) with the property that whenever the input admits a `natural' ground-truth hierarchical clustering, that ground-truth clustering has optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms and develop better and faster algorithms for hierarchical clustering. For similarity-based hierarchical clustering, Dasgupta (2016) showed that a simple recursive sparsest-cut-based approach achieves an O(log^{3/2} n)-approximation on worst-case inputs. We give a more refined analysis of the algorithm and show that it in fact achieves an O(√(log n))-approximation, improving upon the LP-based O(log n)-approximation of Roy and Pokutta (2016). For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor-2 approximation, and we provide a simple and better algorithm that gives a factor-3/2 approximation; this helps explain the success of these heuristics in practice.
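To make the objective concrete, here is a minimal sketch of Dasgupta's similarity-based cost function: cost(T) equals the sum, over internal nodes of the tree, of the number of leaves under that node times the total similarity of pairs first separated there. The tree encoding (nested tuples), the toy weights, and the helper names are illustrative assumptions, not the paper's notation.

```python
# Sketch of Dasgupta's (2016) cost for similarity-based hierarchical
# clustering. A leaf is an int; an internal node is a pair (left, right).

def leaves(tree):
    """Return the set of leaf labels under a node."""
    if isinstance(tree, int):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def dasgupta_cost(tree, w):
    """Sum, over each internal split, of (leaves under the split) times
    the total similarity of the pairs separated at that split."""
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    n_here = len(leaves(tree))
    cross = sum(w.get((min(i, j), max(i, j)), 0.0)
                for i in leaves(left) for j in leaves(right))
    return n_here * cross + dasgupta_cost(left, w) + dasgupta_cost(right, w)

# Toy input: two tight pairs {0,1} and {2,3}, weakly connected otherwise.
w = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 0.1, (0, 3): 0.1,
     (1, 2): 0.1, (1, 3): 0.1}
good_cost = dasgupta_cost(((0, 1), (2, 3)), w)  # cuts the weak edges first
bad_cost = dasgupta_cost((((0, 2), 1), 3), w)   # mixes the tight pairs
print(good_cost, bad_cost)
```

On this toy instance the `natural' hierarchy (splitting the weak edges first) gets a strictly lower cost than a tree that mixes the tight pairs, matching the intuition the abstract describes.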
Finally, we consider a `beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function also has desirable properties for these inputs, and we provide a simple algorithm that, for graphs generated according to this model, yields a (1 + o(1))-factor approximation.
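For reference, the classic average-linkage heuristic mentioned in the abstract can be sketched as follows: repeatedly merge the two clusters whose average pairwise dissimilarity is smallest, recording the merges as a binary tree. This is the textbook agglomerative procedure, not the paper's analysis; the quadratic pairwise scan and the tuple tree encoding are simplifying assumptions.

```python
# Minimal average-linkage agglomerative clustering on a dissimilarity
# dictionary d[(i, j)] with i < j; returns the merge tree as nested tuples.

def average_linkage(n, d):
    def diss(i, j):
        return d[(min(i, j), max(i, j))]

    # Each cluster is a pair (tree, frozenset of member points).
    clusters = [(i, frozenset([i])) for i in range(n)]
    while len(clusters) > 1:
        best = None  # (average dissimilarity, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sa, sb = clusters[a][1], clusters[b][1]
                avg = (sum(diss(i, j) for i in sa for j in sb)
                       / (len(sa) * len(sb)))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        (ta, sa), (tb, sb) = clusters[a], clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(((ta, tb), sa | sb))
    return clusters[0][0]

# Toy input: points 0,1 and 2,3 are close; everything else is far apart.
d = {(0, 1): 0.2, (2, 3): 0.2, (0, 2): 1.0, (0, 3): 1.0,
     (1, 2): 1.0, (1, 3): 1.0}
tree = average_linkage(4, d)
print(tree)
```

On this toy input the heuristic first merges the two close pairs and then joins them at the root, recovering the natural two-level hierarchy.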
- Subjects :
- FOS: Computer and information sciences
Computer science
Correlation clustering
[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS]
0102 computer and information sciences
02 engineering and technology
01 natural sciences
Machine Learning (cs.LG)
Set (abstract data type)
Similarity (network science)
CURE data clustering algorithm
Artificial Intelligence
Stochastic block model
020204 information systems
Computer Science - Data Structures and Algorithms
0202 electrical engineering, electronic engineering, information engineering
Data Structures and Algorithms (cs.DS)
Cluster analysis
Mathematics
Brown clustering
Hierarchy (mathematics)
Constrained clustering
Axiomatic system
Function (mathematics)
Hierarchical clustering
Computer Science - Learning
010201 computation theory & mathematics
Hardware and Architecture
Control and Systems Engineering
Canopy clustering algorithm
020201 artificial intelligence & image processing
Hierarchical clustering of networks
Algorithm
Software
Information Systems
Details
- Language :
- English
- Database :
- OpenAIRE
- Journal :
- SODA 2018 - Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, Jan 2018, New Orleans, LA, United States, pp. 378-397, ⟨10.1137/1.9781611975031.26⟩. Mallmann-Trenn, Frederik
- Accession number :
- edsair.doi.dedup.....79bdf3a838f67ba6de6e0266a0324f4f
- Full Text :
- https://doi.org/10.1137/1.9781611975031.26