Heavy hitters via cluster-preserving clustering

Kasper Green Larsen; Jelani Nelson; Huy L. Nguyen; Mikkel Thorup

doi:10.1109/FOCS.2016.16

Department of Computer Science

26 Citations (Scopus)

Abstract

In the turnstile $\ell_p$ heavy hitters problem with parameter $\epsilon$, one must maintain a high-dimensional vector $x \in \mathbb{R}^n$ subject to updates of the form update$(i, \delta)$ causing the change $x_i \leftarrow x_i + \delta$, where $i \in [n]$, $\delta \in \mathbb{R}$. Upon receiving a query, the goal is to report every 'heavy hitter' $i \in [n]$ with $|x_i| \ge \epsilon \|x\|_p$ as part of a list $L \subseteq [n]$ of size $O(1/\epsilon^p)$, i.e. proportional to the maximum possible number of heavy hitters. For any $p \in (0, 2]$ the COUNTSKETCH of [CCFC04] solves $\ell_p$ heavy hitters using $O(\epsilon^{-p} \log n)$ words of space with $O(\log n)$ update time, $O(n \log n)$ query time to output $L$, and whose output after any query is correct with high probability (whp) $1 - 1/\mathrm{poly}(n)$ [JST11, Section 4.4]. This space bound is optimal even in the strict turnstile model [JST11] in which it is promised that $x_i \ge 0$ for all $i \in [n]$ at all points in the stream, but unfortunately the query time is very slow. To remedy this, the work [CM05] proposed the 'dyadic trick' for the COUNTMIN sketch for $p = 1$ in the strict turnstile model, which to maintain whp correctness achieves suboptimal space $O(\epsilon^{-1} \log^2 n)$, worse update time $O(\log^2 n)$, but much better query time $O(\epsilon^{-1} \mathrm{poly}(\log n))$. An extension to all $p \in (0, 2]$ appears in [KNPW11, Theorem 1], and can be obtained from [Pag13]. We show that this tradeoff between space and update time versus query time is unnecessary. We provide a new algorithm, EXPANDERSKETCH, which in the most general turnstile model achieves optimal $O(\epsilon^{-p} \log n)$ space, $O(\log n)$ update time, and fast $O(\epsilon^{-p} \mathrm{poly}(\log n))$ query time, providing correctness whp. In fact, a simpler version of our algorithm for $p = 1$ in the strict turnstile model answers queries even faster than the 'dyadic trick' by roughly a $\log n$ factor, dominating it in all regards. Our main innovation is an efficient reduction from the heavy hitters to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a much bigger graph, and the goal is to identify every cluster.
Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a 'cluster-preserving clustering' algorithm, that partitions the graph into clusters with the promise of not destroying any original cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel combinatorial techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our cluster-preserving clustering may be of broader interest much beyond heavy hitters.
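
To make the baseline in the abstract concrete, here is a minimal Python sketch of a CountSketch-style structure with turnstile updates and the slow query that scans all n coordinates. It is not the paper's EXPANDERSKETCH and not the exact construction of [CCFC04]: the class name CountSketch, the function slow_heavy_hitters, the hash family (a*i + b) mod P, and the parameters rows, width, seed and threshold are all illustrative assumptions. In particular, the caller supplies the reporting threshold (roughly ϵ times the norm of x), which a real implementation would have to estimate.

```python
import random
import statistics

P = (1 << 61) - 1  # Mersenne prime; modulus of the toy hash family (an assumption)


class CountSketch:
    """Toy CountSketch: `rows` independent rows of `width` signed counters."""

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(rows)]
        # Per row: (a, b) drive the bucket hash, (c, d) drive the sign hash.
        self.params = [tuple(rng.randrange(1, P) for _ in range(4))
                       for _ in range(rows)]

    def _bucket_and_sign(self, row, i):
        a, b, c, d = self.params[row]
        bucket = ((a * i + b) % P) % self.width
        sign = 1 if ((c * i + d) % P) & 1 else -1
        return bucket, sign

    def update(self, i, delta):
        """Turnstile update x_i <- x_i + delta; delta may be negative."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, i)
            self.table[row][bucket] += sign * delta

    def estimate(self, i):
        """Median over rows of the signed counter that i hashes to."""
        readings = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, i)
            readings.append(sign * self.table[row][bucket])
        return statistics.median(readings)


def slow_heavy_hitters(sketch, n, threshold):
    """Brute-force query: one point estimate for every index in [0, n)."""
    return [i for i in range(n) if abs(sketch.estimate(i)) >= threshold]


if __name__ == "__main__":
    sketch = CountSketch(rows=7, width=256, seed=1)
    sketch.update(7, 1000)
    for j in range(100, 150):
        sketch.update(j, 1)
    sketch.update(7, -200)  # a negative delta, as allowed in the turnstile model
    print(slow_heavy_hitters(sketch, n=1000, threshold=400))
```

The update touches one counter per row, while the query loop visits every index, which is the slow query time the abstract sets out to remove.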

Original language	English
Title	Proceedings - 57th Annual IEEE Symposium on Foundations of Computer Science
Number of pages	10
Publisher	IEEE
Publication date	14 Dec 2016
Pages	61-70
ISBN (Electronic)	978-1-5090-3933-3
DOI	https://doi.org/10.1109/FOCS.2016.16
Status	Published - 14 Dec 2016
Event	57th IEEE Annual Symposium on Foundations of Computer Science - New Brunswick, USA Duration: 9 Oct 2016 → 11 Oct 2016 Conference number: 57

Conference

Conference	57th IEEE Annual Symposium on Foundations of Computer Science
Number	57
Country/Territory	USA
City	New Brunswick
Period	09/10/2016 → 11/10/2016

Access to document

10.1109/FOCS.2016.16

http://ieee-focs.org/FOCS-2016-Papers/3933a061.pdf

Citation formats

@inproceedings{de1cdaa0e4c14d618e3d78d7971f60b2,

title = "Heavy hitters via cluster-preserving clustering",

author = "Larsen, {Kasper Green} and Jelani Nelson and Nguyen, {Huy L.} and Mikkel Thorup",

year = "2016",

month = dec,

day = "14",

doi = "10.1109/FOCS.2016.16",

language = "English",

pages = "61--70",

booktitle = "Proceedings - 57th Annual IEEE Symposium on Foundations of Computer Science",

publisher = "IEEE",

note = "57th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2016 ; Conference date: 09-10-2016 Through 11-10-2016",

}

TY - GEN

T1 - Heavy hitters via cluster-preserving clustering

AU - Larsen, Kasper Green

AU - Nelson, Jelani

AU - Nguyen, Huy L.

AU - Thorup, Mikkel

N1 - Conference code: 57

PY - 2016/12/14

Y1 - 2016/12/14

U2 - 10.1109/FOCS.2016.16

DO - 10.1109/FOCS.2016.16

M3 - Article in proceedings

SP - 61

EP - 70

BT - Proceedings - 57th Annual IEEE Symposium on Foundations of Computer Science

PB - IEEE

T2 - 57th IEEE Annual Symposium on Foundations of Computer Science

Y2 - 9 October 2016 through 11 October 2016

ER -
