I/O-Efficient Similarity Join

Rasmus Pagh; Ninh Pham; Francesco Silvestri; Morten Stöckel

doi:10.1007/s00453-017-0285-5

Department of Computer Science

Abstract

We present an I/O-efficient algorithm for computing similarity joins based
on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly
suggested, our method has provable sub-quadratic dependency on the data size. Further,
in contrast to straightforward implementations of known LSH-based algorithms on
external memory, our approach is able to take significant advantage of the available
internal memory: whereas the time complexity of classical algorithms includes a factor
of N^ρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithm
merely includes a factor (N/M)^ρ, where N is the data size and M is the size of
internal memory. Our algorithm is randomized and outputs the correct result with
high probability. It is a simple, recursive, cache-oblivious procedure, and we believe
that it will also be useful in other computational settings such as parallel computation.
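
To get a feel for the gap between the two factors named in the abstract, the following minimal sketch evaluates N^ρ and (N/M)^ρ for sample values. The function name `lsh_factors` and the chosen values of N, M and ρ are illustrative assumptions, not taken from the paper, and the sketch compares only these two factors, not the full complexity bounds. With N = 2^30, M = 2^20 and ρ = 1/2, the factors are 2^15 and 2^5.

```python
def lsh_factors(n, m, rho):
    """Return the pair (N^rho, (N/M)^rho).

    Illustration only: these are the two factors mentioned in the abstract,
    not complete running-time or I/O bounds. The parameter ranges checked
    below are assumptions made for this sketch.
    """
    if n <= 0 or m <= 0:
        raise ValueError("n and m must be positive")
    if not 0 < rho < 1:
        raise ValueError("rho is assumed to lie strictly between 0 and 1")
    return n ** rho, (n / m) ** rho


# Hypothetical sizes: 2^30 data items, internal memory for 2^20 items.
classical, io_aware = lsh_factors(n=2 ** 30, m=2 ** 20, rho=0.5)

# The two factors differ by a multiplicative M^rho.
ratio = classical / io_aware
print(classical, io_aware, ratio)
```

The ratio between the two factors is M^ρ, which is why a larger internal memory shrinks the second factor in this comparison.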

Original language	English
Journal	Algorithmica
Volume	78
Issue number	4
Pages (from-to)	1263-1283
ISSN	0178-4617
DOIs	https://doi.org/10.1007/s00453-017-0285-5
Publication status	Published - 1 Aug 2017

Access to Document

10.1007/s00453-017-0285-5

http://arxiv.org/pdf/1507.00552 (Licence: Other)

Cite this

@article{d4414b0dbde74eab9f132b3ef0c3eb6c,

title = "I/O-Efficient Similarity Join",

abstract = "We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested, our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: whereas the time complexity of classical algorithms includes a factor of $N^{\rho}$, where $\rho$ is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor $(N/M)^{\rho}$, where $N$ is the data size and $M$ is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will also be useful in other computational settings such as parallel computation.",

author = "Rasmus Pagh and Ninh Pham and Francesco Silvestri and Morten St{\"o}ckel",

year = "2017",

month = aug,

day = "1",

doi = "10.1007/s00453-017-0285-5",

language = "English",

volume = "78",

pages = "1263--1283",

journal = "Algorithmica",

issn = "0178-4617",

publisher = "Springer",

number = "4",

}

TY - JOUR

T1 - I/O-Efficient Similarity Join

AU - Pagh, Rasmus

AU - Pham, Ninh

AU - Silvestri, Francesco

AU - Stöckel, Morten

PY - 2017/8/1

Y1 - 2017/8/1

N2 - We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested, our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: whereas the time complexity of classical algorithms includes a factor of N^ρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor (N/M)^ρ, where N is the data size and M is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will also be useful in other computational settings such as parallel computation.

AB - We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested, our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: whereas the time complexity of classical algorithms includes a factor of N^ρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor (N/M)^ρ, where N is the data size and M is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will also be useful in other computational settings such as parallel computation.

U2 - 10.1007/s00453-017-0285-5

DO - 10.1007/s00453-017-0285-5

M3 - Journal article

SN - 0178-4617

VL - 78

SP - 1263

EP - 1283

JO - Algorithmica

JF - Algorithmica

IS - 4

ER -