Abstract
Many RDF descriptions today are text-rich: besides structured data, they also contain large amounts of unstructured text. Text-rich RDF data is frequently queried with hybrid queries, which combine predicates over structured data with string predicates that express textual constraints. Evaluating hybrid queries efficiently requires selectivity estimation. Previous work on selectivity estimation, however, suffers from inherent drawbacks that lead to efficiency and effectiveness issues. We propose a novel estimation approach, TopGuess, which exploits topic models as a data synopsis. This way, we capture correlations between structured and unstructured data in a holistic and compact manner. In a theoretical analysis, we show that TopGuess guarantees linear space complexity w.r.t. text data size. Further, we show that the time complexity of selectivity estimation is independent of the synopsis size. In experiments on real-world data, TopGuess yielded large improvements in estimation accuracy without sacrificing efficiency.
This work was supported by the European Union through project XLike (FP7-ICT-2011-288342).
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Wagner, A., Bicer, V., Tran, T., Studer, R. (2014). Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs. In: Mika, P., et al. The Semantic Web – ISWC 2014. ISWC 2014. Lecture Notes in Computer Science, vol 8797. Springer, Cham. https://doi.org/10.1007/978-3-319-11915-1_7
DOI: https://doi.org/10.1007/978-3-319-11915-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11914-4
Online ISBN: 978-3-319-11915-1
eBook Packages: Computer Science (R0)