Fast similarity sketching

Søren Dahlgaard; Mathias Bæk Tejs Knudsen; Mikkel Thorup

doi:10.1109/FOCS.2017.67

Fast similarity sketching

Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Mikkel Thorup

Department of Computer Science

8 Citations (Scopus)

Abstract

We consider the Similarity Sketching problem: Given a universe [u] = {0,..., u-1} we want a random function S mapping subsets A of [u] into vectors S(A) of size t, such that similarity is preserved. More precisely: Given subsets A,B of [u], define X-i = [S(A)[i] = S(B)[i]] and X = sum-{i in [t]} X-i. We want to have E[X] = t∗J(A,B), where J(A,B) = |A intersect B|/|A union B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for X. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A) are also called sketches. The seminal t x MinHash algorithm uses t random hash functions h-1,..., h-t, and stores (min-{a in A} h-1(A),..., min-{a in A} h-t(A)) as the sketch of A. The main drawback of MinHash is, however, its O(t∗|A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS12] introduced one permutation hashing (OPH), which creates a sketch of size t in O(t + |A|) time, but with the drawback that possibly some of the t entries are empty when |A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML14]. However, these densification schemes fail to provide good concentration bounds exactly in the case |A| = O(t), where they are needed. In this paper we present a new sketch which obtains essentially the best of both worlds. That is, a fast O(t log t + |A|) expected running time while getting the same strong concentration bounds as MinHash. Our new sketch can be seen as a mix between sampling with replacement and sampling without replacement. We demonstrate the power of our new sketch by considering popular applications in large-scale classification with linear SVM as introduced by Li et al. [NIPS11] as well as approximate similarity search using the LSH framework of Indyk and Motwani [STOC98]. In particular, for the j-1, j-2-approximate similarity search problem on a collection of n sets we obtain a data-structure with space usage O(n^{1+rho} + sum-{A in C} |A|) and O(n^rho ∗ log n + |Q|) expected time for querying a set Q compared to a O(n^rho ∗ log n ∗ |Q|) expected query time of the classic result of Indyk and Motwani.

Original language	English
Title of host publication	2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS)
Number of pages	9
Publisher	IEEE
Publication date	10 Nov 2017
Pages	663-671
ISBN (Electronic)	978-1-5386-3464-6
DOIs	https://doi.org/10.1109/FOCS.2017.67
Publication status	Published - 10 Nov 2017
Event	58th Annual IEEE Symposium on Foundations of Computer Science - Berkeley, United States Duration: 15 Oct 2017 → 17 Oct 2017 Conference number: 58

Conference

Conference	58th Annual IEEE Symposium on Foundations of Computer Science
Number	58
Country/Territory	United States
City	Berkeley
Period	15/10/2017 → 17/10/2017

Access to Document

10.1109/FOCS.2017.67

Cite this

@inproceedings{3501e7d7eac34d6e9120c975733e1954,

title = "Fast similarity sketching",

abstract = "We consider the Similarity Sketching problem: Given a universe [u] = {0,..., u-1} we want a random function S mapping subsets A of [u] into vectors S(A) of size t, such that similarity is preserved. More precisely: Given subsets A,B of [u], define X-i = [S(A)[i] = S(B)[i]] and X = sum-{i in [t]} X-i. We want to have E[X] = t∗J(A,B), where J(A,B) = |A intersect B|/|A union B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for X. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A) are also called sketches. The seminal t x MinHash algorithm uses t random hash functions h-1,..., h-t, and stores (min-{a in A} h-1(A),..., min-{a in A} h-t(A)) as the sketch of A. The main drawback of MinHash is, however, its O(t∗|A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS12] introduced one permutation hashing (OPH), which creates a sketch of size t in O(t + |A|) time, but with the drawback that possibly some of the t entries are empty when |A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML14]. However, these densification schemes fail to provide good concentration bounds exactly in the case |A| = O(t), where they are needed. In this paper we present a new sketch which obtains essentially the best of both worlds. That is, a fast O(t log t + |A|) expected running time while getting the same strong concentration bounds as MinHash. Our new sketch can be seen as a mix between sampling with replacement and sampling without replacement. We demonstrate the power of our new sketch by considering popular applications in large-scale classification with linear SVM as introduced by Li et al. [NIPS11] as well as approximate similarity search using the LSH framework of Indyk and Motwani [STOC98]. In particular, for the j-1, j-2-approximate similarity search problem on a collection of n sets we obtain a data-structure with space usage O(n^{1+rho} + sum-{A in C} |A|) and O(n^rho ∗ log n + |Q|) expected time for querying a set Q compared to a O(n^rho ∗ log n ∗ |Q|) expected query time of the classic result of Indyk and Motwani.",

author = "S{\o}ren Dahlgaard and Knudsen, {Mathias B{\ae}k Tejs} and Mikkel Thorup",

year = "2017",

month = nov,

day = "10",

doi = "10.1109/FOCS.2017.67",

language = "English",

pages = "663--671",

booktitle = "2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS)",

publisher = "IEEE",

note = "58th Annual IEEE Symposium on Foundations of Computer Science ; Conference date: 15-10-2017 Through 17-10-2017",

}

TY - GEN

T1 - Fast similarity sketching

AU - Dahlgaard, Søren

AU - Knudsen, Mathias Bæk Tejs

AU - Thorup, Mikkel

N1 - Conference code: 58

PY - 2017/11/10

Y1 - 2017/11/10

N2 - We consider the Similarity Sketching problem: Given a universe [u] = {0,..., u-1} we want a random function S mapping subsets A of [u] into vectors S(A) of size t, such that similarity is preserved. More precisely: Given subsets A,B of [u], define X-i = [S(A)[i] = S(B)[i]] and X = sum-{i in [t]} X-i. We want to have E[X] = t∗J(A,B), where J(A,B) = |A intersect B|/|A union B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for X. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A) are also called sketches. The seminal t x MinHash algorithm uses t random hash functions h-1,..., h-t, and stores (min-{a in A} h-1(A),..., min-{a in A} h-t(A)) as the sketch of A. The main drawback of MinHash is, however, its O(t∗|A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS12] introduced one permutation hashing (OPH), which creates a sketch of size t in O(t + |A|) time, but with the drawback that possibly some of the t entries are empty when |A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML14]. However, these densification schemes fail to provide good concentration bounds exactly in the case |A| = O(t), where they are needed. In this paper we present a new sketch which obtains essentially the best of both worlds. That is, a fast O(t log t + |A|) expected running time while getting the same strong concentration bounds as MinHash. Our new sketch can be seen as a mix between sampling with replacement and sampling without replacement. We demonstrate the power of our new sketch by considering popular applications in large-scale classification with linear SVM as introduced by Li et al. [NIPS11] as well as approximate similarity search using the LSH framework of Indyk and Motwani [STOC98]. In particular, for the j-1, j-2-approximate similarity search problem on a collection of n sets we obtain a data-structure with space usage O(n^{1+rho} + sum-{A in C} |A|) and O(n^rho ∗ log n + |Q|) expected time for querying a set Q compared to a O(n^rho ∗ log n ∗ |Q|) expected query time of the classic result of Indyk and Motwani.

AB - We consider the Similarity Sketching problem: Given a universe [u] = {0,..., u-1} we want a random function S mapping subsets A of [u] into vectors S(A) of size t, such that similarity is preserved. More precisely: Given subsets A,B of [u], define X-i = [S(A)[i] = S(B)[i]] and X = sum-{i in [t]} X-i. We want to have E[X] = t∗J(A,B), where J(A,B) = |A intersect B|/|A union B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for X. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A) are also called sketches. The seminal t x MinHash algorithm uses t random hash functions h-1,..., h-t, and stores (min-{a in A} h-1(A),..., min-{a in A} h-t(A)) as the sketch of A. The main drawback of MinHash is, however, its O(t∗|A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS12] introduced one permutation hashing (OPH), which creates a sketch of size t in O(t + |A|) time, but with the drawback that possibly some of the t entries are empty when |A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML14]. However, these densification schemes fail to provide good concentration bounds exactly in the case |A| = O(t), where they are needed. In this paper we present a new sketch which obtains essentially the best of both worlds. That is, a fast O(t log t + |A|) expected running time while getting the same strong concentration bounds as MinHash. Our new sketch can be seen as a mix between sampling with replacement and sampling without replacement. We demonstrate the power of our new sketch by considering popular applications in large-scale classification with linear SVM as introduced by Li et al. [NIPS11] as well as approximate similarity search using the LSH framework of Indyk and Motwani [STOC98]. In particular, for the j-1, j-2-approximate similarity search problem on a collection of n sets we obtain a data-structure with space usage O(n^{1+rho} + sum-{A in C} |A|) and O(n^rho ∗ log n + |Q|) expected time for querying a set Q compared to a O(n^rho ∗ log n ∗ |Q|) expected query time of the classic result of Indyk and Motwani.

U2 - 10.1109/FOCS.2017.67

DO - 10.1109/FOCS.2017.67

M3 - Article in proceedings

SP - 663

EP - 671

BT - 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS)

PB - IEEE

T2 - 58th Annual IEEE Symposium on Foundations of Computer Science

Y2 - 15 October 2017 through 17 October 2017

ER -

Fast similarity sketching

Abstract

Conference

Access to Document

Fingerprint

Cite this