Identifying frequent items in distributed data sets.

Authors :: Sacha, Jan
Montresor, Alberto
Source :: Computing. Apr2013, Vol. 95 Issue 4, p289-307. 19p.
Publication Year :: 2013
Abstract: Many practical problems in computer science require the knowledge of the most frequently occurring items in a data set. Current state-of-the-art algorithms for frequent items discovery are either fully centralized or rely on node hierarchies which are inflexible and prone to failures in massively distributed systems. In this paper we describe a family of gossip-based algorithms that efficiently approximate the most frequent items in large-scale distributed datasets. We show, both analytically and using real-world datasets, that our algorithms are fast, highly scalable, and resilient to node failures. [ABSTRACT FROM AUTHOR]