Abstract
The past decade has brought with it an immense amount of data from
large volumes of text to network traffic data.Working with such largescale
data has become an increasingly important topic, giving rise
to many important problems and influential solutions. One common
denominator between many popular algorithms and data structures
for tackling these problems is randomization implemented with hash
functions.
A common practice in the analysis of such randomized algorithms,
is to work under the abstract assumption that truly random unit cost
hash functions are freely available without concern for which concrete
hash function to employ. However, the choice of hash function is of
huge importance, as the theoretical guarantees of a randomized algorithm
rely crucially on this choice, and the analysis breaks down
completely when too simple hash functions are used. Furthermore,
hashing is often employed as an “inner-loop” operation and evaluation
time is thus of utmost importance.
This thesis seeks to bridge this gap in the theory by providing efficient
families of hash functions with strong theoretical guarantees
for several influential problems in the large-scale data regime. This is
done by studying families of tabulation-based hashing – a method for
constructing very efficient hashing schemes based on table lookups
and word parallelism.
We provide a new fundamental understanding of the dependencies
between hash values of keys when using tabulation-based hashing,
leading to the first “practical” hash functions with strong theoretical
guarantees for many popular algorithms and techniques. This
includes statistics based on k-partitioning which is employed in the
popular HyperLogLog counters as well as the one permutation hashing
sketch for similarity estimation. Furthermore, our techniques lead to
the most efficient known hashing schemes for the power of two choices,
approximately minwise independence, constant moment bounds, and more.
From an algorithmic point of view, we present a new similarity
sketch with properties similar to the seminal MinHash sketch, but
with much faster running time. This problem has previously been
considered from a practical perspective, but the previously proposed
solutions fail to give strong concentration bounds.
We complement our theoretical results with experiments demonstrating
that tabulation hashing systematically outperforms simpler
hashing schemes for similarity estimation and feature hashing on both
synthetic and real-world data.
large volumes of text to network traffic data.Working with such largescale
data has become an increasingly important topic, giving rise
to many important problems and influential solutions. One common
denominator between many popular algorithms and data structures
for tackling these problems is randomization implemented with hash
functions.
A common practice in the analysis of such randomized algorithms,
is to work under the abstract assumption that truly random unit cost
hash functions are freely available without concern for which concrete
hash function to employ. However, the choice of hash function is of
huge importance, as the theoretical guarantees of a randomized algorithm
rely crucially on this choice, and the analysis breaks down
completely when too simple hash functions are used. Furthermore,
hashing is often employed as an “inner-loop” operation and evaluation
time is thus of utmost importance.
This thesis seeks to bridge this gap in the theory by providing efficient
families of hash functions with strong theoretical guarantees
for several influential problems in the large-scale data regime. This is
done by studying families of tabulation-based hashing – a method for
constructing very efficient hashing schemes based on table lookups
and word parallelism.
We provide a new fundamental understanding of the dependencies
between hash values of keys when using tabulation-based hashing,
leading to the first “practical” hash functions with strong theoretical
guarantees for many popular algorithms and techniques. This
includes statistics based on k-partitioning which is employed in the
popular HyperLogLog counters as well as the one permutation hashing
sketch for similarity estimation. Furthermore, our techniques lead to
the most efficient known hashing schemes for the power of two choices,
approximately minwise independence, constant moment bounds, and more.
From an algorithmic point of view, we present a new similarity
sketch with properties similar to the seminal MinHash sketch, but
with much faster running time. This problem has previously been
considered from a practical perspective, but the previously proposed
solutions fail to give strong concentration bounds.
We complement our theoretical results with experiments demonstrating
that tabulation hashing systematically outperforms simpler
hashing schemes for similarity estimation and feature hashing on both
synthetic and real-world data.
Originalsprog | Engelsk |
---|
Forlag | Department of Computer Science, Faculty of Science, University of Copenhagen |
---|---|
Status | Udgivet - 2017 |