Nearest neighbor classification using bottom-k sketches

Søren Dahlgaard; Christian Igel; Mikkel Thorup

doi:10.1109/BigData.2013.6691730

1 citation (Scopus)

Abstract

Bottom-k sketches are an alternative to k×minwise sketches when using hashing to estimate the similarity of documents represented by shingles (or set similarity in general) in large-scale machine learning. They are faster to compute and have nicer theoretical properties. In the case of k×minwise hashing, the bias introduced by hash functions that are not truly random is independent of the number k of hashes, while this bias decreases with increasing k when employing bottom-k. In practice, bottom-k sketches can expedite classification systems if the trained classifiers are applied to many data points with many features (i.e., to many documents encoded by a large number of shingles on average). An advantage of b-bit k×minwise hashing is that it can be efficiently incorporated into machine learning methods relying on scalar products, such as support vector machines (SVMs). Still, experimental results indicate that a nearest neighbor classifier with bottom-k sketches can be preferable to using a linear SVM and b-bit k×minwise hashing if the amount of training data is low or the number of features is high.
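The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the BLAKE2b hash, the word-level shingle width of 3, and the 1-nearest-neighbor rule are all assumptions made for illustration. A bottom-k sketch hashes each shingle once with a single hash function and keeps the k smallest values, whereas k×minwise hashing evaluates k hash functions per shingle. The Jaccard similarity of two sets is then estimated from the k smallest values of their merged sketches.

```python
import hashlib
import heapq
from typing import Iterable, List, Sequence, Set, Tuple


def hash64(item: str) -> int:
    """Map a string to a 64-bit integer (BLAKE2b is an assumed stand-in hash)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingles(text: str, width: int = 3) -> Set[str]:
    """Return the set of word-level shingles (the width is an illustrative choice)."""
    words = text.split()
    count = max(1, len(words) - width + 1)
    return {" ".join(words[i:i + width]) for i in range(count)}


def bottom_k_sketch(items: Iterable[str], k: int) -> List[int]:
    """Hash every item once and keep the k smallest distinct hash values."""
    return heapq.nsmallest(k, {hash64(x) for x in items})


def estimate_jaccard(sketch_a: Sequence[int], sketch_b: Sequence[int], k: int) -> float:
    """Estimate the Jaccard similarity of two sets from their bottom-k sketches."""
    a, b = set(sketch_a), set(sketch_b)
    # The k smallest values of the merged sketches form the bottom-k sketch of the union.
    union_sketch = heapq.nsmallest(k, a | b)
    if not union_sketch:
        return 0.0
    # Count the union-sketch values that occur in both input sketches.
    shared = sum(1 for v in union_sketch if v in a and v in b)
    return shared / len(union_sketch)


def classify_1nn(query: Sequence[int],
                 training: Sequence[Tuple[Sequence[int], str]],
                 k: int) -> str:
    """Return the label of the training sketch most similar to the query sketch."""
    _, label = max(training, key=lambda pair: estimate_jaccard(query, pair[0], k))
    return label


# Hypothetical toy data, for illustration only.
k = 64
train = [
    (bottom_k_sketch(shingles("the quick brown fox jumps over the lazy dog"), k), "animals"),
    (bottom_k_sketch(shingles("stock prices fell sharply on monday morning"), k), "finance"),
]
query = bottom_k_sketch(shingles("the quick brown fox jumps over a sleeping dog"), k)
print(classify_1nn(query, train, k))
```

When both sets contain at most k distinct shingles, the sketches hold every hash value and the estimate equals the exact Jaccard similarity (barring hash collisions); for larger sets it is an estimate whose accuracy improves with k.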

Original language	English
Title	2013 IEEE International Conference on Big Data : proceedings
Number of pages	7
Publisher	IEEE
Publication date	2013
Pages	28-34
ISBN (Electronic)	978-1-4799-1293-3
DOI	https://doi.org/10.1109/BigData.2013.6691730
Status	Published - 2013
Event	IEEE International Conference on Big Data 2013: BigData Congress 2013 - Santa Clara, CA, USA; Duration: 28 Jun 2013 → 3 Jul 2013

Conference

Conference	IEEE International Conference on Big Data 2013
Country/Territory	USA
City	Santa Clara, CA
Period	28/06/2013 → 03/07/2013

Access to document

10.1109/BigData.2013.6691730

Nearest Neighbor Classification Using Bottom-k Sketches: publisher's published version, 773 KB

Citation formats

@inproceedings{d34f1d346baa4d11bf0b45555e3df245,

title = "Nearest neighbor classification using bottom-k sketches",

abstract = "Bottom-k sketches are an alternative to k×minwise sketches when using hashing to estimate the similarity of documents represented by shingles (or set similarity in general) in large-scale machine learning. They are faster to compute and have nicer theoretical properties. In the case of k×minwise hashing, the bias introduced by hash functions that are not truly random is independent of the number k of hashes, while this bias decreases with increasing k when employing bottom-k. In practice, bottom-k sketches can expedite classification systems if the trained classifiers are applied to many data points with many features (i.e., to many documents encoded by a large number of shingles on average). An advantage of b-bit k×minwise hashing is that it can be efficiently incorporated into machine learning methods relying on scalar products, such as support vector machines (SVMs). Still, experimental results indicate that a nearest neighbor classifier with bottom-k sketches can be preferable to using a linear SVM and b-bit k×minwise hashing if the amount of training data is low or the number of features is high.",

author = "S{\o}ren Dahlgaard and Christian Igel and Mikkel Thorup",

year = "2013",

doi = "10.1109/BigData.2013.6691730",

language = "English",

pages = "28--34",

booktitle = "2013 IEEE International Conference on Big Data",

publisher = "IEEE",

note = "IEEE International Conference on Big Data 2013 ; Conference date: 28-06-2013 Through 03-07-2013",

}

TY - GEN

T1 - Nearest neighbor classification using bottom-k sketches

AU - Dahlgaard, Søren

AU - Igel, Christian

AU - Thorup, Mikkel

PY - 2013

Y1 - 2013

AB - Bottom-k sketches are an alternative to k×minwise sketches when using hashing to estimate the similarity of documents represented by shingles (or set similarity in general) in large-scale machine learning. They are faster to compute and have nicer theoretical properties. In the case of k×minwise hashing, the bias introduced by hash functions that are not truly random is independent of the number k of hashes, while this bias decreases with increasing k when employing bottom-k. In practice, bottom-k sketches can expedite classification systems if the trained classifiers are applied to many data points with many features (i.e., to many documents encoded by a large number of shingles on average). An advantage of b-bit k×minwise hashing is that it can be efficiently incorporated into machine learning methods relying on scalar products, such as support vector machines (SVMs). Still, experimental results indicate that a nearest neighbor classifier with bottom-k sketches can be preferable to using a linear SVM and b-bit k×minwise hashing if the amount of training data is low or the number of features is high.

U2 - 10.1109/BigData.2013.6691730

DO - 10.1109/BigData.2013.6691730

M3 - Article in proceedings

SP - 28

EP - 34

BT - 2013 IEEE International Conference on Big Data

PB - IEEE

T2 - IEEE International Conference on Big Data 2013

Y2 - 28 June 2013 through 3 July 2013

ER -
