Back to Search Start Over

GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Authors :
Fahrbach, Matthew
Ramalingam, Srikumar
Zadimoghaddam, Morteza
Ahmadian, Sara
Citovsky, Gui
DeSalvo, Giulia
Publication Year :
2024

Abstract

We propose a novel subset selection task called min-distance diverse data summarization ($\textsf{MDDS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the constraint $|S| \le k$. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the $\texttt{GIST}$ algorithm, which achieves a $\frac{2}{3}$-approximation guarantee for $\textsf{MDDS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary $(\frac{2}{3}+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$. Finally, we provide an empirical study that demonstrates $\texttt{GIST}$ outperforms existing methods for $\textsf{MDDS}$ on synthetic data, and also for a real-world image classification experiment the studies single-shot subset selection for ImageNet.<br />Comment: 15 pages, 1 figure

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2405.18754
Document Type :
Working Paper