Abstract
We consider bottom-k sampling for a set X, picking a sample S k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the frequency f = |Y |/|X| of any subset Y as |Sk(X) ∩ Y |/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ≊ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as Sk(A ≊ B) = Sk(S k(A) ≊ Sk(B)), and then the similarity is estimated as |Sk(A ≊ B) ∩ Sk((A) ∩ Sk(B)|/k. We show here that even if the hash function is only 2- independent, the expected relative error is O(1/ √ fk). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of repeated min-wise hashing, where we use k independent hash functions h1, ..., hk, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simply union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.
Original language | English |
---|---|
Title of host publication | STOC '13 : Proceedings of the 45th Annual ACM Symposium on Symposium on Theory of Computing |
Number of pages | 10 |
Publisher | Association for Computing Machinery |
Publication date | 2013 |
Pages | 371-380 |
ISBN (Print) | 978-1-4503-2029-0 |
DOIs | |
Publication status | Published - 2013 |
Event | Annual ACM Symposium on Theory of Computing - Palo Alto, CA, United States Duration: 1 Jun 2013 → 4 Jun 2013 Conference number: 45 |
Conference
Conference | Annual ACM Symposium on Theory of Computing |
---|---|
Number | 45 |
Country/Territory | United States |
City | Palo Alto, CA |
Period | 01/06/2013 → 04/06/2013 |