Hashing for statistics over k-partitions

Søren Dahlgaard; Mathias Bæk Tejs Knudsen; Eva Rotenberg; Mikkel Thorup

doi:10.1109/FOCS.2015.83

Datalogisk Institut

13 citations (Scopus)

Abstract

In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue with k-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning, similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g., a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.
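
To make the k-partitioning idea in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's construction: `hash64` uses BLAKE2b as a stand-in for mixed tabulation, and the names `hash64`, `k_partition_sketch`, and `similarity_estimate` are hypothetical. Each element is hashed once, the hash selects one of k bins, and each bin keeps the minimum value it has seen; two sketches are then compared bin by bin, in the style of minwise similarity estimation.

```python
import hashlib


def hash64(x, seed=0):
    """Stand-in 64-bit hash (BLAKE2b). Assumption: NOT mixed tabulation."""
    digest = hashlib.blake2b(
        repr(x).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def k_partition_sketch(items, k, seed=0):
    """Hash each element once: h % k picks the bin, h // k is the value.

    Each bin stores the minimum value seen, or None if it stays empty.
    """
    bins = [None] * k
    for x in items:
        h = hash64(x, seed)
        b, v = h % k, h // k
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    return bins


def similarity_estimate(sketch_a, sketch_b):
    """Fraction of bins, non-empty in at least one sketch, whose minima agree."""
    used = 0
    matches = 0
    for a, b in zip(sketch_a, sketch_b):
        if a is None and b is None:
            continue  # jointly empty bins carry no information
        used += 1
        if a == b:
            matches += 1
    return matches / used if used else 0.0


if __name__ == "__main__":
    a = k_partition_sketch(range(0, 1000), k=64)
    b = k_partition_sketch(range(500, 1500), k=64)
    print(similarity_estimate(a, b))
```

The point of the design is that each element costs one hash evaluation instead of k, which is the time saving the abstract refers to. The price is that bins are filled by the same hash function, so their contents are not independent.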

Original language	English
Title	Proceedings. 56th Annual Symposium on Foundations of Computer Science
Number of pages	19
Publisher	IEEE
Publication date	11 Dec 2015
Pages	1292-1310
ISBN (Electronic)	978-1-4673-8191-8
DOI	https://doi.org/10.1109/FOCS.2015.83
Status	Published - 11 Dec 2015
Event	The Annual Symposium on Foundations of Computer Science - DoubleTree Hotel, Berkeley, California, USA. Duration: 18 Oct 2015 → 20 Oct 2015. Conference number: 56

Conference

Conference	The Annual Symposium on Foundations of Computer Science
Number	56
Location	DoubleTree Hotel
Country/Territory	USA
City	Berkeley, California
Period	18/10/2015 → 20/10/2015

Access to document

10.1109/FOCS.2015.83

http://arxiv.org/pdf/1411.7191v3.pdf
License: Other

Citation formats

@inproceedings{c6fe98db37a34a4b931281570a06b1f3,

title = "Hashing for statistics over k-partitions",

abstract = "In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue with k-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning, similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g., a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.",

author = "S{\o}ren Dahlgaard and Knudsen, {Mathias B{\ae}k Tejs} and Eva Rotenberg and Mikkel Thorup",

year = "2015",

month = dec,

day = "11",

doi = "10.1109/FOCS.2015.83",

language = "English",

pages = "1292--1310",

booktitle = "Proceedings. 56th Annual Symposium on Foundations of Computer Science",

publisher = "IEEE",

note = "The Annual Symposium on Foundations of Computer Science ; Conference date: 18-10-2015 Through 20-10-2015",

}

TY - GEN

T1 - Hashing for statistics over k-partitions

AU - Dahlgaard, Søren

AU - Knudsen, Mathias Bæk Tejs

AU - Rotenberg, Eva

AU - Thorup, Mikkel

N1 - Conference code: 56

PY - 2015/12/11

Y1 - 2015/12/11

AB - In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue with k-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning, similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g., a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.

U2 - 10.1109/FOCS.2015.83

DO - 10.1109/FOCS.2015.83

M3 - Article in proceedings

SP - 1292

EP - 1310

BT - Proceedings. 56th Annual Symposium on Foundations of Computer Science

PB - IEEE

T2 - The Annual Symposium on Foundations of Computer Science

Y2 - 18 October 2015 through 20 October 2015

ER -
