Back to Search
Start Over
HyperMinHash: MinHash in LogLog space.
- Source :
-
IEEE transactions on knowledge and data engineering [IEEE Trans Knowl Data Eng] 2022 Jan; Vol. 34 (1), pp. 328-339. Date of Electronic Publication: 2020 Mar 17. - Publication Year :
- 2022
-
Abstract
- In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O ( l o g n ) to buckets of size O ( l o g l o g n ) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ϵ on a Jaccard index t , given a random oracle, HyperMinHash needs O ( ϵ - 2 ( l o g l o g n + l o g 1 ϵ ) ) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10 19 with relative error of around 10% using 2MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10 10 with the same memory consumption.
Details
- Language :
- English
- ISSN :
- 1041-4347
- Volume :
- 34
- Issue :
- 1
- Database :
- MEDLINE
- Journal :
- IEEE transactions on knowledge and data engineering
- Publication Type :
- Academic Journal
- Accession number :
- 38288326
- Full Text :
- https://doi.org/10.1109/tkde.2020.2981311