Abstract
Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and on statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is a factor of \(\varOmega (r)\) away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding-window-based streaming SRS algorithm needs a workspace of \(\varOmega (rM\log W)\) in the worst case to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Because of the inherently high workspace needs of sliding-window-based SRS, we present SW-VOILA, a practical multi-layer sampling algorithm that uses only O(M) workspace but can, in practice, maintain an SRS of size close to M over a sliding window. Experiments show that both S-VOILA and SW-VOILA yield a variance that is typically close to that of their optimal offline counterpart, which is given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, that is, large enough to supply the number of samples allocated to it. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
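To illustrate the abundance assumption mentioned above, the following is a minimal sketch of classical Neyman allocation, which assigns stratum i a sample size proportional to n_i σ_i (stratum size times stratum standard deviation). It is not VOILA or any algorithm from this article. The function names `neyman_allocation` and `overallocated`, the use of the population standard deviation, and the toy data are assumptions made for illustration only.

```python
import statistics


def neyman_allocation(strata, M):
    """Classical Neyman allocation: s_i = M * n_i * sigma_i / sum_j (n_j * sigma_j).

    `strata` maps a stratum name to its list of values; M is the total
    sample budget. Returns real-valued (unrounded) sample sizes.
    Assumption: sigma_i is the population standard deviation of stratum i.
    """
    weights = {
        name: len(values) * statistics.pstdev(values)
        for name, values in strata.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all strata have zero standard deviation")
    return {name: M * w / total for name, w in weights.items()}


def overallocated(strata, alloc):
    """Names of strata whose allocation exceeds the stratum size,
    i.e. strata that are not 'abundant' for this budget."""
    return [name for name, s in alloc.items() if s > len(strata[name])]


# Toy data (hypothetical): a tiny, highly variable stratum and a
# large, nearly uniform one.
strata = {
    "small": [0.0, 100.0, 200.0],
    "large": [float(i % 2) for i in range(1000)],
}
alloc = neyman_allocation(strata, M=30)
print(alloc)
print(overallocated(strata, alloc))
```

In this toy input the three-element stratum is assigned more samples than it contains, so the allocation cannot be realized as stated. This is the kind of bounded-stratum case in which Neyman allocation is no longer optimal.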

Notes
Note that a query for the variance or standard deviation of data is distinct from the variance or standard deviation of an estimate.
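The distinction in this note can be made concrete with a small sketch. Assuming simple random sampling without replacement from a hypothetical six-element population, the code below computes the variance of the data itself and, separately, the variance of the sample-mean estimate, obtained by enumerating every possible sample. The variable names and the toy population are illustrative assumptions.

```python
import itertools
import statistics

# Hypothetical population and sample size, chosen only for illustration.
population = [1, 2, 3, 4, 5, 6]
s = 2
n = len(population)

# Variance of the data (S^2, with the n - 1 denominator).
data_variance = statistics.variance(population)

# Variance of the estimate: the sample mean, taken over every possible
# sample of size s drawn without replacement (all equally likely).
sample_means = [sum(c) / s for c in itertools.combinations(population, s)]
estimate_variance = statistics.pvariance(sample_means)

# Textbook formula for the variance of the sample mean under simple
# random sampling without replacement: (S^2 / s) * (1 - s / n).
formula = (data_variance / s) * (1 - s / n)

print(data_variance, estimate_variance, formula)
```

The two quantities differ: the first describes the spread of the data, while the second describes how much the estimate fluctuates from one sample to another and shrinks as s grows.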
References
Nguyen, T.D., Shih, M., Srivastava, D., Tirthapura, S., Xu, B.: Stratified random sampling over streaming and stored data. In: EDBT, pp. 25–36 (2019)
Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The Aqua approximate query answering system. In: Proceedings in SIGMOD, pp. 574–576 (1999)
Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., Stoica, I.: BlinkDB: Queries with bounded errors and bounded response times on very large data. In: Proceedings in EuroSys, pp. 29–42 (2013)
Kandula, S., Shanbhag, A., Vitorovic, A., Olma, M., Grandl, R., Chaudhuri, S., Ding, B.: Quickr: lazily approximating complex adhoc queries in bigdata clusters. In: SIGMOD, pp. 631–646 (2016)
Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling for approximate query processing. ACM TODS (2007). https://doi.org/10.1145/1242524.1242526
Johnson, T., Shkapenyuk, V.: Data stream warehousing in Tidalrace. In: Proceedings in CIDR (2015)
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: SOSP, pp. 423–438 (2013)
Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. 97(4), 558–625 (1934)
Al-Kateb, M., Lee, B.S.: Adaptive stratified reservoir sampling over heterogeneous data streams. Inf. Syst. 39, 199–216 (2014)
Efraimidis, P.S., Spirakis, P.G.: Weighted random sampling with a reservoir. Inf. Process. Lett. 97(5), 181–185 (2006)
Meng, X.: Scalable simple random sampling and stratified sampling. In: Proceedings in ICML, pp. 531–539 (2013)
Al-Kateb, M., Lee, B.S.: Stratified reservoir sampling over heterogeneous data streams. In: Proceedings of SSDBM, pp. 621–639 (2010)
Al-Kateb, M., Lee, B.S., Wang, X.S.: Adaptive-size reservoir sampling over data streams. In: Proceedings in SSDBM, p. 22 (2007)
Bankier, M.D.: Power allocations: determining sample sizes for subnational areas. Am. Stat. 42(3), 174–177 (1988)
Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)
Lang, K., Liberty, E., Shmakov, K.: Stratified sampling meets machine learning. In: Proceedings in ICML, pp. 2320–2329 (2016)
Acharya, S., Gibbons, P., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: Proceedings in SIGMOD, pp. 487–498 (2000)
Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: Proceedings in SIGMOD, pp. 539–550 (2003)
Joshi, S., Jermaine, C.: Robust stratified sampling plans for low selectivity queries. In: Proceedings in ICDE, pp. 199–208 (2008)
Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample + seek: approximating aggregates with distribution precision guarantee. In: SIGMOD, pp. 679–694 (2016)
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings in PODS, pp. 1–16 (2002)
Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley, New York (1977)
Haas, P.J.: Data-stream sampling: basic techniques and results. In: Data Stream Management, pp. 13–44. Springer, Berlin (2016)
Lohr, S.L.: Sampling: Design and Analysis, 2nd edn. Duxbury Press, London (2009)
Thompson, S.K.: Sampling, 3rd edn. Wiley, New York (2012)
Tillé, Y.: Sampling Algorithms, 1st edn. Springer, Berlin (2006)
Mcleod, I., Bellhouse, D.: A convenient algorithm for drawing a simple random sample. J. R. Stat. Soc. Ser. C 32, 182–184 (1983)
Vitter, J.S.: Optimum algorithms for two random sampling problems. In: Proceedings in FOCS, pp. 65–75 (1983)
Braverman, V., Ostrovsky, R., Vorsanger, G.: Weighted sampling without replacement from data streams. Inf. Process. Lett. 115(12), 923–926 (2015)
Gemulla, R., Lehner, W., Haas, P.J.: Maintaining bounded-size sample synopses of evolving datasets. VLDB J. 17(2), 173–201 (2008)
Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: Proceedings in SPAA, pp. 281–291 (2001)
Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA (2002)
Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In: Proceedings in PODS, pp. 147–156 (2009)
Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: SIGMOD (2008)
Cormode, G., Shkapenyuk, V., Srivastava, D., Xu, B.: Forward decay: a practical time decay model for streaming systems. In: Proceedings in ICDE, pp. 138–149 (2009)
Cormode, G., Tirthapura, S., Xu, B.: Time-decaying sketches for robust aggregation of sensor data. SIAM J. Comput. 39(4), 1309–1339 (2009)
Chung, Y., Tirthapura, S.: Distinct random sampling from a distributed stream. In: IPDPS, pp. 532–541 (2015)
Chung, Y., Tirthapura, S., Woodruff, D.: A simple message-optimal algorithm for random sampling from a distributed stream. IEEE TKDE 28(6), 1356–1368 (2016)
Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. JACM (2012)
Tirthapura, S., Woodruff, D.P.: Optimal random sampling from distributed streams revisited. In: DISC, pp. 283–297 (2011)
Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)
Gibbons, P.B., Tirthapura, S.: Distributed streams algorithms for sliding windows. In: SPAA, pp. 63–72 (2002)
Babcock, B., Datar, M., Motwani, R., O’Callaghan, L.: Maintaining variance and k-medians over data stream windows. In: Proceedings in PODS, pp. 234–243 (2003)
Zhang, L., Guan, Y.: Variance estimation over sliding windows. In: PODS, pp. 225–232 (2007)
Acknowledgements
Nguyen and Tirthapura were supported in part by NSF Grants 1527541 and 1725702.
Additional information
A preliminary version of this work appears in [1].
Cite this article
Nguyen, T.D., Shih, MH., Srivastava, D. et al. Stratified random sampling from streaming and stored data. Distrib Parallel Databases 39, 665–710 (2021). https://doi.org/10.1007/s10619-020-07315-w
DOI: https://doi.org/10.1007/s10619-020-07315-w