Bottom-k and priority sampling, set similarity and subset sums with minimal independence

Mikkel Thorup

doi:10.1145/2488608.2488655

Department of Computer Science

15 citations (Scopus)

Abstract

We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the frequency f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of repeated min-wise hashing, where we use k independent hash functions h_1, ..., h_k, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.
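The bottom-k estimators described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper. The hash family h(x) = (a·x + b) mod P over a Mersenne prime is one standard 2-independent choice, assumed here for illustration, and the names `make_hash`, `bottom_k`, `estimate_frequency`, and `estimate_jaccard` are hypothetical. Keys are assumed to be non-negative integers below P, and the union is assumed to hold at least k elements.

```python
import random

# Assumption: keys are integers in [0, P). P is the Mersenne prime 2^61 - 1.
P = (1 << 61) - 1


def make_hash(rng):
    """Draw h(x) = (a*x + b) mod P with a, b uniform in [0, P).

    This is a standard 2-independent family, chosen here for illustration.
    """
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P


def bottom_k(items, h, k):
    """Return the k elements with the smallest hash values.

    Ties are broken by the key itself so the order is always total.
    """
    return set(sorted(set(items), key=lambda x: (h(x), x))[:k])


def estimate_frequency(sample, subset, k):
    """Estimate f = |Y|/|X| as |S_k(X) ∩ Y| / k."""
    return len(sample & subset) / k


def estimate_jaccard(sample_a, sample_b, h, k):
    """Estimate |A ∩ B|/|A ∪ B| from the two bottom-k samples alone.

    Uses S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), then counts the union-sample
    elements that appear in both input samples.
    """
    union_sample = bottom_k(sample_a | sample_b, h, k)
    return len(union_sample & sample_a & sample_b) / k


# Example usage with an arbitrary seed and arbitrary sets.
rng = random.Random(2013)
h = make_hash(rng)
A = set(range(0, 6000))
B = set(range(3000, 9000))
k = 500
estimate = estimate_jaccard(bottom_k(A, h, k), bottom_k(B, h, k), h, k)
print("estimated:", estimate, "exact:", len(A & B) / len(A | B))
```

Both samples must come from the same hash function h. Otherwise the identity for the union sample does not hold and the estimate is meaningless.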

Original language	English
Title	STOC '13: Proceedings of the 45th Annual ACM Symposium on Theory of Computing
Number of pages	10
Publisher	Association for Computing Machinery
Publication date	2013
Pages	371-380
ISBN (Print)	978-1-4503-2029-0
DOI	https://doi.org/10.1145/2488608.2488655
Status	Published - 2013
Event	Annual ACM Symposium on Theory of Computing - Palo Alto, CA, USA Duration: 1 Jun 2013 → 4 Jun 2013 Conference number: 45

Conference

Conference	Annual ACM Symposium on Theory of Computing
Number	45
Country/Territory	USA
City	Palo Alto, CA
Period	01/06/2013 → 04/06/2013

Keywords

estimation, independence, sampling

Access to document

10.1145/2488608.2488655

Citation formats

@inproceedings{0ee435f84cc34b6b930138d296c57de3,

title = "Bottom-k and priority sampling, set similarity and subset sums with minimal independence",

abstract = "We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the frequency f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of repeated min-wise hashing, where we use k independent hash functions h_1, ..., h_k, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.",

keywords = "estimation, independence, sampling",

author = "Mikkel Thorup",

year = "2013",

doi = "10.1145/2488608.2488655",

language = "English",

isbn = "978-1-4503-2029-0",

pages = "371--380",

booktitle = "STOC '13",

publisher = "Association for Computing Machinery",

note = "Annual ACM Symposium on Theory of Computing ; Conference date: 01-06-2013 Through 04-06-2013",

}

TY - GEN

T1 - Bottom-k and priority sampling, set similarity and subset sums with minimal independence

AU - Thorup, Mikkel

N1 - Conference code: 45

PY - 2013

Y1 - 2013

N2 - We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the frequency f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of repeated min-wise hashing, where we use k independent hash functions h_1, ..., h_k, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.

AB - We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the frequency f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of repeated min-wise hashing, where we use k independent hash functions h_1, ..., h_k, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.

KW - estimation, independence, sampling

U2 - 10.1145/2488608.2488655

DO - 10.1145/2488608.2488655

M3 - Article in proceedings

SN - 978-1-4503-2029-0

SP - 371

EP - 380

BT - STOC '13

PB - Association for Computing Machinery

T2 - Annual ACM Symposium on Theory of Computing

Y2 - 1 June 2013 through 4 June 2013

ER -
