Fast similarity sketching

Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Mikkel Thorup

8 Citations (Scopus)

Abstract

We consider the Similarity Sketching problem: Given a universe [u] = {0,..., u-1} we want a random function S mapping subsets A of [u] into vectors S(A) of size t, such that similarity is preserved. More precisely: Given subsets A,B of [u], define X-i = [S(A)[i] = S(B)[i]] and X = sum-{i in [t]} X-i. We want to have E[X] = t∗J(A,B), where J(A,B) = |A intersect B|/|A union B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for X. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A) are also called sketches. The seminal t x MinHash algorithm uses t random hash functions h-1,..., h-t, and stores (min-{a in A} h-1(A),..., min-{a in A} h-t(A)) as the sketch of A. The main drawback of MinHash is, however, its O(t∗|A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS12] introduced one permutation hashing (OPH), which creates a sketch of size t in O(t + |A|) time, but with the drawback that possibly some of the t entries are empty when |A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML14]. However, these densification schemes fail to provide good concentration bounds exactly in the case |A| = O(t), where they are needed. In this paper we present a new sketch which obtains essentially the best of both worlds. That is, a fast O(t log t + |A|) expected running time while getting the same strong concentration bounds as MinHash. Our new sketch can be seen as a mix between sampling with replacement and sampling without replacement. We demonstrate the power of our new sketch by considering popular applications in large-scale classification with linear SVM as introduced by Li et al. [NIPS11] as well as approximate similarity search using the LSH framework of Indyk and Motwani [STOC98]. In particular, for the j-1, j-2-approximate similarity search problem on a collection of n sets we obtain a data-structure with space usage O(n^{1+rho} + sum-{A in C} |A|) and O(n^rho ∗ log n + |Q|) expected time for querying a set Q compared to a O(n^rho ∗ log n ∗ |Q|) expected query time of the classic result of Indyk and Motwani.

Original languageEnglish
Title of host publication2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS)
Number of pages9
PublisherIEEE
Publication date10 Nov 2017
Pages663-671
ISBN (Electronic)978-1-5386-3464-6
DOIs
Publication statusPublished - 10 Nov 2017
Event58th Annual IEEE Symposium on Foundations of Computer Science - Berkeley, United States
Duration: 15 Oct 201717 Oct 2017
Conference number: 58

Conference

Conference58th Annual IEEE Symposium on Foundations of Computer Science
Number58
Country/TerritoryUnited States
CityBerkeley
Period15/10/201717/10/2017

Fingerprint

Dive into the research topics of 'Fast similarity sketching'. Together they form a unique fingerprint.

Cite this