Nearest neighbor classification using bottom-k sketches

Søren Dahlgaard; Christian Igel; Mikkel Thorup

doi:10.1109/BigData.2013.6691730

Abstract

Bottom-k sketches are an alternative to k×minwise sketches when using hashing to estimate the similarity of documents represented by shingles (or set similarity in general) in large-scale machine learning. They are faster to compute and have nicer theoretical properties. In the case of k×minwise hashing, the bias introduced by hash functions that are not truly random is independent of the number k of hashes, while this bias decreases with increasing k when employing bottom-k. In practice, bottom-k sketches can expedite classification systems if the trained classifiers are applied to many data points with a lot of features (i.e., to many documents encoded by a large number of shingles on average). An advantage of b-bit k×minwise hashing is that it can be efficiently incorporated into machine learning methods relying on scalar products, such as support vector machines (SVMs). Still, experimental results indicate that a nearest neighbors classifier with bottom-k sketches can be preferable to using a linear SVM and b-bit k×minwise hashing if the amount of training data is low or the number of features is high.
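
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hash function (a truncated BLAKE2b digest), the word-level shingling, and all function names are assumptions chosen for illustration. A bottom-k sketch keeps the k smallest hash values of a set under a single hash function, the Jaccard similarity is estimated from the k smallest values of the merged sketches, and a 1-nearest-neighbor classifier picks the label of the most similar training sketch.

```python
import hashlib


def h(item: str) -> int:
    """Hypothetical hash: 8-byte BLAKE2b digest read as a 64-bit integer."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingles(text: str, w: int = 3) -> set:
    """Word-level shingles of width w (an assumed document encoding)."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}


def bottom_k(items, k: int) -> list:
    """Bottom-k sketch: the k smallest distinct hash values, sorted."""
    return sorted({h(x) for x in items})[:k]


def jaccard_estimate(sketch_a, sketch_b, k: int) -> float:
    """Estimate Jaccard similarity from two bottom-k sketches.

    The k smallest values of the merged sketches form the bottom-k sketch
    of the union; the estimate is the fraction of those values that occur
    in both input sketches.
    """
    a, b = set(sketch_a), set(sketch_b)
    union_sketch = sorted(a | b)[:k]
    if not union_sketch:
        return 0.0
    shared = sum(1 for v in union_sketch if v in a and v in b)
    return shared / len(union_sketch)


def nearest_neighbor(query_sketch, training, k: int):
    """Return the label of the training sketch most similar to the query.

    `training` is a list of (sketch, label) pairs.
    """
    best = max(training, key=lambda pair: jaccard_estimate(query_sketch, pair[0], k))
    return best[1]
```

For example, `nearest_neighbor(bottom_k(shingles(doc), 64), training, 64)` would classify `doc` against a list of `(sketch, label)` pairs built with the same `k`. When a set has fewer than `k` distinct hash values, the sketch holds the whole set and the estimate equals the exact Jaccard similarity (barring hash collisions).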

Original language	English
Title of host publication	2013 IEEE International Conference on Big Data : proceedings
Number of pages	7
Publisher	IEEE
Publication date	2013
Pages	28-34
ISBN (Electronic)	978-1-4799-1293-3
DOIs	https://doi.org/10.1109/BigData.2013.6691730
Publication status	Published - 2013
Event	IEEE International Conference on Big Data 2013: BigData Congress 2013 - Santa Clara, CA, United States; Duration: 28 Jun 2013 → 3 Jul 2013

Conference

Conference	IEEE International Conference on Big Data 2013
Country/Territory	United States
City	Santa Clara, CA
Period	28/06/2013 → 03/07/2013

Access to Document

10.1109/BigData.2013.6691730

Nearest Neighbor Classification Using Bottom-k Sketches (Final published version, 773 KB)

Cite this

@inproceedings{d34f1d346baa4d11bf0b45555e3df245,

title = "Nearest neighbor classification using bottom-k sketches",

abstract = "Bottom-k sketches are an alternative to k×minwise sketches when using hashing to estimate the similarity of documents represented by shingles (or set similarity in general) in large-scale machine learning. They are faster to compute and have nicer theoretical properties. In the case of k×minwise hashing, the bias introduced by hash functions that are not truly random is independent of the number k of hashes, while this bias decreases with increasing k when employing bottom-k. In practice, bottom-k sketches can expedite classification systems if the trained classifiers are applied to many data points with a lot of features (i.e., to many documents encoded by a large number of shingles on average). An advantage of b-bit k×minwise hashing is that it can be efficiently incorporated into machine learning methods relying on scalar products, such as support vector machines (SVMs). Still, experimental results indicate that a nearest neighbors classifier with bottom-k sketches can be preferable to using a linear SVM and b-bit k×minwise hashing if the amount of training data is low or the number of features is high.",

author = "S{\o}ren Dahlgaard and Christian Igel and Mikkel Thorup",

year = "2013",

doi = "10.1109/BigData.2013.6691730",

language = "English",

pages = "28--34",

booktitle = "2013 IEEE International Conference on Big Data",

publisher = "IEEE",

note = "IEEE International Conference on Big Data 2013 ; Conference date: 28-06-2013 Through 03-07-2013",

}

TY - GEN

T1 - Nearest neighbor classification using bottom-k sketches

AU - Dahlgaard, Søren

AU - Igel, Christian

AU - Thorup, Mikkel

PY - 2013

Y1 - 2013

AB - Bottom-k sketches are an alternative to k×minwise sketches when using hashing to estimate the similarity of documents represented by shingles (or set similarity in general) in large-scale machine learning. They are faster to compute and have nicer theoretical properties. In the case of k×minwise hashing, the bias introduced by hash functions that are not truly random is independent of the number k of hashes, while this bias decreases with increasing k when employing bottom-k. In practice, bottom-k sketches can expedite classification systems if the trained classifiers are applied to many data points with a lot of features (i.e., to many documents encoded by a large number of shingles on average). An advantage of b-bit k×minwise hashing is that it can be efficiently incorporated into machine learning methods relying on scalar products, such as support vector machines (SVMs). Still, experimental results indicate that a nearest neighbors classifier with bottom-k sketches can be preferable to using a linear SVM and b-bit k×minwise hashing if the amount of training data is low or the number of features is high.

U2 - 10.1109/BigData.2013.6691730

DO - 10.1109/BigData.2013.6691730

M3 - Article in proceedings

SP - 28

EP - 34

BT - 2013 IEEE International Conference on Big Data

PB - IEEE

T2 - IEEE International Conference on Big Data 2013

Y2 - 28 June 2013 through 3 July 2013

ER -
