Bottom-k and priority sampling, set similarity and subset sums with minimal independence

15 Citations (Scopus)

Abstract

We consider bottom-k sampling for a set X, picking a sample S k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the frequency f = |Y |/|X| of any subset Y as |Sk(X) ∩ Y |/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ≊ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as Sk(A ≊ B) = Sk(S k(A) ≊ Sk(B)), and then the similarity is estimated as |Sk(A ≊ B) ∩ Sk((A) ∩ Sk(B)|/k. We show here that even if the hash function is only 2- independent, the expected relative error is O(1/ √ fk). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of repeated min-wise hashing, where we use k independent hash functions h1, ..., hk, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simply union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.

Original languageEnglish
Title of host publicationSTOC '13 : Proceedings of the 45th Annual ACM Symposium on Symposium on Theory of Computing
Number of pages10
PublisherAssociation for Computing Machinery
Publication date2013
Pages371-380
ISBN (Print)978-1-4503-2029-0
DOIs
Publication statusPublished - 2013
EventAnnual ACM Symposium on Theory of Computing - Palo Alto, CA, United States
Duration: 1 Jun 20134 Jun 2013
Conference number: 45

Conference

ConferenceAnnual ACM Symposium on Theory of Computing
Number45
Country/TerritoryUnited States
CityPalo Alto, CA
Period01/06/201304/06/2013

Fingerprint

Dive into the research topics of 'Bottom-k and priority sampling, set similarity and subset sums with minimal independence'. Together they form a unique fingerprint.

Cite this