Hashing for statistics over k-partitions

Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Eva Rotenberg, Mikkel Thorup

13 Citationer (Scopus)

Abstract

In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used Hyper Log Log algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue of k-partition, is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. A simple and efficient construction for invertible bloom filters and uniform hashing on a given set.

OriginalsprogEngelsk
TitelProceedings. 56th Annual Symposium on Foundations of Computer Science
Antal sider19
ForlagIEEE
Publikationsdato11 dec. 2015
Sider1292-1310
ISBN (Elektronisk)978-1-4673-8191-8
DOI
StatusUdgivet - 11 dec. 2015
BegivenhedThe Annual Symposium on Foundations of Computer Science - DoubleTree Hotel, Berkeley, California, USA
Varighed: 18 okt. 201520 okt. 2015
Konferencens nummer: 56

Konference

KonferenceThe Annual Symposium on Foundations of Computer Science
Nummer56
LokationDoubleTree Hotel
Land/OmrådeUSA
ByBerkeley, California
Periode18/10/201520/10/2015

Fingeraftryk

Dyk ned i forskningsemnerne om 'Hashing for statistics over k-partitions'. Sammen danner de et unikt fingeraftryk.

Citationsformater