Efficient stream sampling for variance-optimal estimation of subset sums

Edith Cohen; Nick Duffield; Haim Kaplan; Carsten Lund; Mikkel Thorup

doi:10.1137/10079817X

17 Citations (Scopus)

Abstract

From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOpt_k, that dominates all previous schemes in terms of estimation quality. VarOpt_k provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time. Finally, it is particularly well suited for combinations of samples from different streams in a distributed setting.
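The scheme described in the abstract can be illustrated with a short Python sketch. It is a naive version under stated assumptions, not the authors' implementation: it recomputes the threshold from scratch for every item, so it costs far more than the O(log k) per item reported above, and it assumes strictly positive weights. The names varopt_update, estimate_subset_sum, and tau are hypothetical and chosen here for illustration. Each arriving item joins the k sampled items, a threshold tau is chosen so that the inclusion probabilities min(1, w / tau) sum to k, one item is dropped at random, and every survivor below the threshold has its adjusted weight raised to tau.

```python
import random


def varopt_update(reservoir, key, weight, k, rng=random):
    """Return a new reservoir after seeing one (key, weight) item.

    reservoir is a list of (key, adjusted_weight) pairs with at most k
    entries. Assumes weight > 0 and k >= 1 (illustrative sketch only).
    """
    candidates = reservoir + [(key, weight)]
    if len(candidates) <= k:
        # Still filling up: every item is kept with its own weight.
        return candidates

    # k + 1 candidates, largest adjusted weight first.
    items = sorted(candidates, key=lambda kw: kw[1], reverse=True)

    # Find the number t of "large" items kept with probability 1 and the
    # threshold tau such that sum(min(1, w / tau)) == k over all items.
    for t in range(k):
        tau = sum(w for _, w in items[t:]) / (k - t)
        if items[t][1] <= tau:
            break

    # Only "small" items (adjusted weight <= tau) can be dropped; item i is
    # dropped with probability 1 - w_i / tau, and these sum to 1.
    small = items[t:]
    drop_probs = [max(0.0, 1.0 - w / tau) for _, w in small]
    drop = rng.choices(range(len(small)), weights=drop_probs)[0]

    # Surviving small items all receive adjusted weight tau.
    kept_small = [(kk, tau) for i, (kk, _) in enumerate(small) if i != drop]
    return items[:t] + kept_small


def estimate_subset_sum(reservoir, predicate):
    """Estimate the total weight of the items whose key satisfies predicate."""
    return sum(w for kk, w in reservoir if predicate(kk))


if __name__ == "__main__":
    rng = random.Random(0)
    sample = []
    for i in range(1000):
        sample = varopt_update(sample, i, 1.0 + (i % 7), k=20, rng=rng)
    # Estimated total weight of the even-numbered items.
    print(estimate_subset_sum(sample, lambda kk: kk % 2 == 0))
```

In this sketch the adjusted weights in the reservoir always add up to the total weight seen so far (up to floating-point rounding), so the estimate for the full set carries no sampling error; estimates for proper subsets vary from run to run.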

Original language	English
Journal	SIAM Journal on Computing
Volume	40
Issue number	5
Pages (from-to)	1402-1431
Number of pages	30
ISSN	0097-5397
DOI	https://doi.org/10.1137/10079817X
Status	Published - 2011
Published externally	Yes

Citation formats

@article{b4e01f762ccb420fa703dc9ce2ce9e4d,
  title = "Efficient stream sampling for variance-optimal estimation of subset sums",
  abstract = "From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOpt$_k$, that dominates all previous schemes in terms of estimation quality. VarOpt$_k$ provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time. Finally, it is particularly well suited for combinations of samples from different streams in a distributed setting.",
  author = "Edith Cohen and Nick Duffield and Haim Kaplan and Carsten Lund and Mikkel Thorup",
  year = "2011",
  doi = "10.1137/10079817X",
  language = "English",
  volume = "40",
  pages = "1402--1431",
  journal = "SIAM Journal on Computing",
  issn = "0097-5397",
  publisher = "Society for Industrial and Applied Mathematics",
  number = "5",
}

TY - JOUR
T1 - Efficient stream sampling for variance-optimal estimation of subset sums
AU - Cohen, Edith
AU - Duffield, Nick
AU - Kaplan, Haim
AU - Lund, Carsten
AU - Thorup, Mikkel
PY - 2011
Y1 - 2011
N2 - From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOpt_k, that dominates all previous schemes in terms of estimation quality. VarOpt_k provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time. Finally, it is particularly well suited for combinations of samples from different streams in a distributed setting.
U2 - 10.1137/10079817X
DO - 10.1137/10079817X
M3 - Journal article
SN - 0097-5397
VL - 40
SP - 1402
EP - 1431
JO - SIAM Journal on Computing
JF - SIAM Journal on Computing
IS - 5
ER -
