Tabulation Hashing for Large-Scale Data Processing

Søren Dahlgaard

SCIENCE PhD theses

Abstract

The past decade has brought with it an immense amount of data, from
large volumes of text to network traffic data. Working with such large-scale
data has become an increasingly important topic, giving rise
to many important problems and influential solutions. One common
denominator between many popular algorithms and data structures
for tackling these problems is randomization implemented with hash
functions.
A common practice in the analysis of such randomized algorithms
is to work under the abstract assumption that truly random unit-cost
hash functions are freely available without concern for which concrete
hash function to employ. However, the choice of hash function is of
huge importance, as the theoretical guarantees of a randomized algorithm
rely crucially on this choice, and the analysis breaks down
completely when overly simple hash functions are used. Furthermore,
hashing is often employed as an “inner-loop” operation and evaluation
time is thus of utmost importance.
This thesis seeks to bridge this gap in the theory by providing efficient
families of hash functions with strong theoretical guarantees
for several influential problems in the large-scale data regime. This is
done by studying families of tabulation-based hashing – a method for
constructing very efficient hashing schemes based on table lookups
and word parallelism.
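As an illustration of the table-lookup idea, here is a minimal sketch of simple tabulation hashing, the most basic member of this family. It is not taken from the thesis: the function name `make_simple_tabulation`, the 32-bit key split into four 8-bit characters, and the use of Python's `random` module to fill the tables are all assumptions made for this example.

```python
import random


def make_simple_tabulation(num_chars=4, char_bits=8, out_bits=32, seed=None):
    """Return a simple tabulation hash function (illustrative sketch only).

    A key is viewed as num_chars characters of char_bits bits each.
    Each character position has its own table of random out_bits-bit
    values; the hash is the XOR of one lookup per position.
    """
    rng = random.Random(seed)
    tables = [
        [rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
        for _ in range(num_chars)
    ]
    mask = (1 << char_bits) - 1

    def h(key):
        value = 0
        for i in range(num_chars):
            # Extract the i-th character of the key and look it up.
            char = (key >> (i * char_bits)) & mask
            value ^= tables[i][char]
        return value

    return h


hash32 = make_simple_tabulation(seed=1)
bucket = hash32(0xCAFEBABE) % 1024  # e.g. map a 32-bit key to 1024 buckets
```

Each evaluation costs one lookup and one XOR per character. The hash values are not fully independent: for instance, the four keys 0, 1, 256 and 257 always XOR to zero under this scheme, which is the kind of dependency between keys that an analysis of tabulation hashing has to account for.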
We provide a new fundamental understanding of the dependencies
between hash values of keys when using tabulation-based hashing,
leading to the first “practical” hash functions with strong theoretical
guarantees for many popular algorithms and techniques. This
includes statistics based on k-partitioning, which is employed in the
popular HyperLogLog counters as well as in the one permutation hashing
sketch for similarity estimation. Furthermore, our techniques lead to
the most efficient known hashing schemes for the power of two choices,
approximate minwise independence, constant moment bounds, and more.
From an algorithmic point of view, we present a new similarity
sketch with properties similar to the seminal MinHash sketch, but
with much faster running time. This problem has previously been
considered from a practical perspective, but the previously proposed
solutions fail to give strong concentration bounds.
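For context, the classic MinHash baseline mentioned above can be sketched as follows. This is the textbook scheme, not the new sketch presented in the thesis. The helper names (`_h`, `minhash_sketch`, `estimate_jaccard`) are hypothetical, and salted BLAKE2b from Python's `hashlib` is used only as a convenient stand-in for k independent random hash functions.

```python
import hashlib


def _h(i, item):
    """i-th hash function: salted BLAKE2b (a stand-in, chosen for this example)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, k=128):
    """Classic MinHash: for each of k hash functions, keep the minimum hash value.

    Every element is hashed k times, so building the sketch takes
    k * len(items) hash evaluations. The set must be non-empty.
    """
    return [min(_h(i, x) for x in items) for i in range(k)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching coordinates."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


a = {"the", "quick", "brown", "fox"}
b = {"the", "quick", "red", "fox"}
similarity = estimate_jaccard(minhash_sketch(a), minhash_sketch(b))
```

The cost of k hash evaluations per element is the running time that a faster similarity sketch aims to avoid.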
We complement our theoretical results with experiments demonstrating
that tabulation hashing systematically outperforms simpler
hashing schemes for similarity estimation and feature hashing on both
synthetic and real-world data.

Original language	English

Publisher	Department of Computer Science, Faculty of Science, University of Copenhagen
Status	Published - 2017

Access to document

PHD-Dahlgaard (publisher's published version, 2.23 MB)

https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01010652008&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK


Citation formats

@phdthesis{8f94cff2ff2844289d69bba76034dfc9,

title = "Tabulation Hashing for Large-Scale Data Processing",

author = "S{\o}ren Dahlgaard",

year = "2017",

language = "English",

publisher = "Department of Computer Science, Faculty of Science, University of Copenhagen",

}

TY - BOOK

T1 - Tabulation Hashing for Large-Scale Data Processing

AU - Dahlgaard, Søren

PY - 2017

Y1 - 2017

UR - https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01010652008&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK

M3 - Ph.D. thesis

BT - Tabulation Hashing for Large-Scale Data Processing

PB - Department of Computer Science, Faculty of Science, University of Copenhagen

ER -
