Hashing for statistics over k-partitions

Søren Dahlgaard; Mathias Bæk Tejs Knudsen; Eva Rotenberg; Mikkel Thorup

doi:10.1109/FOCS.2015.83

Department of Computer Science

13 Citations (Scopus)

Abstract

In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue with k-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning, similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g., a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.
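As a rough illustration of the k-partitioning idea described in the abstract, the sketch below hashes each element once, uses the hash to choose one of k bins, and keeps the minimum value seen in each bin. Two such sketches are then compared bin by bin to estimate set similarity. This is a minimal sketch under stated assumptions, not the paper's construction: the function names `hash64`, `k_partition_sketch`, and `estimate_jaccard` are hypothetical, and BLAKE2b from Python's `hashlib` stands in for a truly random hash function rather than mixed tabulation.

```python
import hashlib


def hash64(x, seed=0):
    """Stand-in 64-bit hash (assumption: BLAKE2b, not mixed tabulation)."""
    data = repr(x).encode("utf-8")
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def k_partition_sketch(items, k, seed=0):
    """Hash each element once; the hash picks a bin and a value.

    Each bin stores the smallest value that landed in it, or None if empty.
    """
    sketch = [None] * k
    for x in items:
        h = hash64(x, seed)
        bin_index = h % k   # which of the k bins the element falls into
        value = h // k      # value compared within that bin
        if sketch[bin_index] is None or value < sketch[bin_index]:
            sketch[bin_index] = value
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of bins with equal minima, ignoring bins empty in both sketches."""
    matches = 0
    used = 0
    for a, b in zip(sketch_a, sketch_b):
        if a is None and b is None:
            continue  # bin is empty for both sets, so it carries no information
        used += 1
        if a == b:
            matches += 1
    return matches / used if used else 0.0


if __name__ == "__main__":
    set_a = set(range(1000))
    set_b = set(range(500, 1500))  # true Jaccard similarity is 500/1500
    sa = k_partition_sketch(set_a, k=256)
    sb = k_partition_sketch(set_b, k=256)
    print(estimate_jaccard(sa, sb))
```

Each element costs one hash evaluation regardless of k, which is the time saving the abstract refers to. The price is that the bins are no longer independent, so the quality of the estimate depends on the hash function, which is the question the paper addresses.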

Original language	English
Title of host publication	Proceedings. 56th Annual Symposium on Foundations of Computer Science
Number of pages	19
Publisher	IEEE
Publication date	11 Dec 2015
Pages	1292-1310
ISBN (Electronic)	978-1-4673-8191-8
DOIs	https://doi.org/10.1109/FOCS.2015.83
Publication status	Published - 11 Dec 2015
Event	The Annual Symposium on Foundations of Computer Science - DoubleTree Hotel, Berkeley, California, United States; Duration: 18 Oct 2015 → 20 Oct 2015; Conference number: 56

Conference

Conference	The Annual Symposium on Foundations of Computer Science
Number	56
Location	DoubleTree Hotel
Country/Territory	United States
City	Berkeley, California
Period	18/10/2015 → 20/10/2015

Access to Document

10.1109/FOCS.2015.83

http://arxiv.org/pdf/1411.7191v3.pdf
Licence: Other

Cite this

@inproceedings{c6fe98db37a34a4b931281570a06b1f3,

title = "Hashing for statistics over k-partitions",

abstract = "In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue with k-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning, similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g., a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.",

author = "S{\o}ren Dahlgaard and Knudsen, {Mathias B{\ae}k Tejs} and Eva Rotenberg and Mikkel Thorup",

year = "2015",

month = dec,

day = "11",

doi = "10.1109/FOCS.2015.83",

language = "English",

pages = "1292--1310",

booktitle = "Proceedings. 56th Annual Symposium on Foundations of Computer Science",

publisher = "IEEE",

note = "The Annual Symposium on Foundations of Computer Science ; Conference date: 18-10-2015 Through 20-10-2015",

}

TY - GEN

T1 - Hashing for statistics over k-partitions

AU - Dahlgaard, Søren

AU - Knudsen, Mathias Bæk Tejs

AU - Rotenberg, Eva

AU - Thorup, Mikkel

N1 - Conference code: 56

PY - 2015/12/11

Y1 - 2015/12/11

N2 - In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue with k-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning, similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g., a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.

AB - In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue with k-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning, similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g., a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.

U2 - 10.1109/FOCS.2015.83

DO - 10.1109/FOCS.2015.83

M3 - Article in proceedings

SP - 1292

EP - 1310

BT - Proceedings. 56th Annual Symposium on Foundations of Computer Science

PB - IEEE

T2 - The Annual Symposium on Foundations of Computer Science

Y2 - 18 October 2015 through 20 October 2015

ER -
