Nearest neighbor classification using bottom-k sketches

Søren Dahlgaard, Christian Igel, Mikkel Thorup

1 Citationer (Scopus)

Abstract

Bottom-k sketches are an alternative to k×minwise sketches when using hashing to estimate the similarity of documents represented by shingles (or set similarity in general) in large-scale machine learning. They are faster to compute and have nicer theoretical properties. In the case of k×minwise hashing, the bias introduced by not truly random hash function is independent of the number k of hashes, while this bias decreases with increasing k when employing bottom-k. In practice, bottom-k sketches can expedite classification systems if the trained classifiers are applied to many data points with a lot of features (i.e., to many documents encoded by a large number of shingles on average). An advantage of b-bit k×minwise hashing is that it can be efficiently incorporated into machine learning methods relying on scalar products, such as support vector machines (SVMs). Still, experimental results indicate that a nearest neighbors classifier with bottom-k sketches can be preferable to using a linear SVM and b-bit k×minwise hashing if the amount of training data is low or the number of features is high.

OriginalsprogEngelsk
Titel2013 IEEE International Conference on Big Data : proceedings
Antal sider7
ForlagIEEE
Publikationsdato2013
Sider28-34
ISBN (Elektronisk)978-1-4799-1293-3
DOI
StatusUdgivet - 2013
BegivenhedIEEE International Conference on Big Data 2013: BigData Congress 2013 - Santa Clara, CA, USA
Varighed: 28 jun. 20133 jul. 2013

Konference

KonferenceIEEE International Conference on Big Data 2013
Land/OmrådeUSA
BySanta Clara, CA
Periode28/06/201303/07/2013

Fingeraftryk

Dyk ned i forskningsemnerne om 'Nearest neighbor classification using bottom-k sketches'. Sammen danner de et unikt fingeraftryk.

Citationsformater