Abstract
We consider the Similarity Sketching problem: Given a universe [u] = {0,..., u-1} we want a random function S mapping subsets A of [u] into vectors S(A) of size t, such that similarity is preserved. More precisely: Given subsets A,B of [u], define X-i = [S(A)[i] = S(B)[i]] and X = sum-{i in [t]} X-i. We want to have E[X] = t∗J(A,B), where J(A,B) = |A intersect B|/|A union B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for X. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A) are also called sketches. The seminal t x MinHash algorithm uses t random hash functions h-1,..., h-t, and stores (min-{a in A} h-1(A),..., min-{a in A} h-t(A)) as the sketch of A. The main drawback of MinHash is, however, its O(t∗|A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS12] introduced one permutation hashing (OPH), which creates a sketch of size t in O(t + |A|) time, but with the drawback that possibly some of the t entries are empty when |A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML14]. However, these densification schemes fail to provide good concentration bounds exactly in the case |A| = O(t), where they are needed. In this paper we present a new sketch which obtains essentially the best of both worlds. That is, a fast O(t log t + |A|) expected running time while getting the same strong concentration bounds as MinHash. Our new sketch can be seen as a mix between sampling with replacement and sampling without replacement. We demonstrate the power of our new sketch by considering popular applications in large-scale classification with linear SVM as introduced by Li et al. [NIPS11] as well as approximate similarity search using the LSH framework of Indyk and Motwani [STOC98]. In particular, for the j-1, j-2-approximate similarity search problem on a collection of n sets we obtain a data-structure with space usage O(n^{1+rho} + sum-{A in C} |A|) and O(n^rho ∗ log n + |Q|) expected time for querying a set Q compared to a O(n^rho ∗ log n ∗ |Q|) expected query time of the classic result of Indyk and Motwani.
Originalsprog | Engelsk |
---|---|
Titel | 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) |
Antal sider | 9 |
Forlag | IEEE |
Publikationsdato | 10 nov. 2017 |
Sider | 663-671 |
ISBN (Elektronisk) | 978-1-5386-3464-6 |
DOI | |
Status | Udgivet - 10 nov. 2017 |
Begivenhed | 58th Annual IEEE Symposium on Foundations of Computer Science - Berkeley, USA Varighed: 15 okt. 2017 → 17 okt. 2017 Konferencens nummer: 58 |
Konference
Konference | 58th Annual IEEE Symposium on Foundations of Computer Science |
---|---|
Nummer | 58 |
Land/Område | USA |
By | Berkeley |
Periode | 15/10/2017 → 17/10/2017 |