Abstract
We consider bottom-k sampling for a set X, picking a sample S k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the frequency f = |Y |/|X| of any subset Y as |Sk(X) ∩ Y |/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ≊ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as Sk(A ≊ B) = Sk(S k(A) ≊ Sk(B)), and then the similarity is estimated as |Sk(A ≊ B) ∩ Sk((A) ∩ Sk(B)|/k. We show here that even if the hash function is only 2- independent, the expected relative error is O(1/ √ fk). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of repeated min-wise hashing, where we use k independent hash functions h1, ..., hk, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simply union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.
Originalsprog | Engelsk |
---|---|
Titel | STOC '13 : Proceedings of the 45th Annual ACM Symposium on Symposium on Theory of Computing |
Antal sider | 10 |
Forlag | Association for Computing Machinery |
Publikationsdato | 2013 |
Sider | 371-380 |
ISBN (Trykt) | 978-1-4503-2029-0 |
DOI | |
Status | Udgivet - 2013 |
Begivenhed | Annual ACM Symposium on Theory of Computing - Palo Alto, CA, USA Varighed: 1 jun. 2013 → 4 jun. 2013 Konferencens nummer: 45 |
Konference
Konference | Annual ACM Symposium on Theory of Computing |
---|---|
Nummer | 45 |
Land/Område | USA |
By | Palo Alto, CA |
Periode | 01/06/2013 → 04/06/2013 |
Emneord
- estimation, independence, sampling