Compressed vertical partitioning for efficient RDF management

Álvarez-García, Sandra; Brisaboa, Nieves; Fernández, Javier D.; Martínez-Prieto, Miguel A.; Navarro, Gonzalo

doi:10.1007/s10115-014-0770-y

Regular Paper
Published: 01 August 2014

Volume 44, pages 439–474 (2015)

Knowledge and Information Systems

Abstract

The Web of Data has been gaining momentum in recent years. As a result, more and more semi-structured datasets are being published, in many cases following the RDF (Resource Description Framework) data model, which is based on atomic triples of subject, predicate, and object. Although the model is very simple, specific compression methods become necessary because datasets keep growing and scalability issues arise around their organization and storage. This requirement is even more restrictive in RDF stores because efficient SPARQL resolution on the compressed RDF datasets is also required. This article introduces a novel RDF indexing technique that supports efficient SPARQL resolution in compressed space. Our technique, called $k^2$-triples, uses the predicate to vertically partition the dataset into disjoint subsets of (subject, object) pairs, one per predicate. These subsets are represented as binary matrices of subjects $\times$ objects in which a 1-bit means that the corresponding triple exists in the dataset. This model results in very sparse matrices, which are efficiently compressed using $k^2$-trees. To address the specific weaknesses of vertically partitioned representations, we enhance this model with two compact indexes that list the predicates related to each distinct subject and object in the dataset. The resulting technique not only achieves by far the most compressed representations, but also the best overall performance for RDF retrieval in our experimental setup. Our approach uses up to 10 times less space than a state-of-the-art baseline and outperforms it in query time by several orders of magnitude on the most basic query patterns. In addition, we optimize traditional join algorithms on $k^2$-triples and define a novel one that leverages its specific features. Our experimental results show that our technique also outperforms traditional vertical partitioning for join resolution, reporting the best numbers for joins in which the non-joined nodes are provided, and being competitive in most of the other cases.
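
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that subjects, predicates, and objects have already been mapped to 0-based integer IDs, it fixes k = 2, and it replaces the succinct rank structure of a real $k^2$-tree with a plain prefix-sum array. All names (`vertical_partition`, `K2Tree`, `build_index`, `objects_by_predicate`) are hypothetical.

```python
from collections import defaultdict, deque

K = 2  # each node splits its submatrix into K x K quadrants (assumed value)


def vertical_partition(triples):
    """Group (s, p, o) ID triples into one set of (s, o) pairs per predicate."""
    parts = defaultdict(set)
    for s, p, o in triples:
        parts[p].add((s, o))
    return parts


class K2Tree:
    """Bitmap-only k^2-tree over a square binary matrix (illustrative sketch)."""

    def __init__(self, pairs, size):
        n = K
        while n < size:  # pad the side length to a power of K
            n *= K
        self.n = n
        self.bits = []  # node bits in breadth-first order
        queue = deque([(n, list(pairs))])
        while queue:
            side, cells = queue.popleft()
            sub = side // K
            buckets = [[] for _ in range(K * K)]
            for r, c in cells:
                buckets[(r // sub) * K + (c // sub)].append((r % sub, c % sub))
            for bucket in buckets:
                self.bits.append(1 if bucket else 0)
                if bucket and sub > 1:  # non-empty internal node: expand later
                    queue.append((sub, bucket))
        # rank[i] = number of 1-bits in bits[0:i]; this prefix sum stands in
        # for the succinct rank structure a real implementation would use.
        self.rank = [0]
        for bit in self.bits:
            self.rank.append(self.rank[-1] + bit)

    def _children(self, idx):
        """Position of the first child of the 1-bit stored at idx."""
        return self.rank[idx + 1] * K * K

    def contains(self, r, c):
        """Pattern (s, p, o): is cell (r, c) set?"""
        base, side = 0, self.n
        while True:
            side //= K
            idx = base + (r // side) * K + (c // side)
            if not self.bits[idx]:
                return False
            if side == 1:
                return True
            r, c, base = r % side, c % side, self._children(idx)

    def row(self, r):
        """Pattern (s, p, ?): all columns set in row r, in increasing order."""
        out, stack = [], [(0, self.n, r, 0)]
        while stack:
            base, side, local_r, col_off = stack.pop()
            side //= K
            for j in range(K):
                idx = base + (local_r // side) * K + j
                if self.bits[idx]:
                    if side == 1:
                        out.append(col_off + j)
                    else:
                        stack.append((self._children(idx), side,
                                      local_r % side, col_off + j * side))
        return sorted(out)


def build_index(triples, size):
    """Build one K2Tree per predicate plus a subject -> predicates index."""
    trees = {p: K2Tree(pairs, size)
             for p, pairs in vertical_partition(triples).items()}
    sp = defaultdict(set)
    for s, p, _ in triples:
        sp[s].add(p)
    return trees, sp


def objects_by_predicate(trees, sp, s):
    """Pattern (s, ?, ?): visit only the predicates known to occur with s."""
    return {p: trees[p].row(s) for p in sorted(sp[s])}
```

The subject-to-predicates dictionary plays the role of one of the two additional indexes mentioned in the abstract: when the predicate is unbound, only the matrices of predicates actually related to the subject are visited instead of every matrix in the dataset. An object-to-predicates index would be built symmetrically.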

Notes

http://www.w3.org/TR/rdf-syntax-grammar/.
For simplicity, we have used strings instead of URIs and literals in the RDF excerpt.
A quad can be regarded as a triple enhanced with a fourth component of provenance: (s,p,o,c), where c is the context of the triple (s,p,o).
The division is similar to that proposed in the MX-Quadtree [41, Section 1.4.2.1].
This is done by traversing the $k^2$-tree in the proper order or by sorting the results afterward.
The relation (8,2) is added to P4 in order to provide a more interesting example of the interactive evaluation algorithm.
Hexastore has been kindly provided by its authors.
http://code.google.com/p/rdf3x/.
http://dbtune.org/jamendo/.
http://dblp.l3s.de/dblp++.php.
http://download.geonames.org/all-geonames-rdf.zip.
http://wiki.dbpedia.org/Downloads351.
http://any23.apache.org/.
The full testbed is available at http://dataweb.infor.uva.es/queries-k2triples.tgz.
The pattern (?,?,?), which returns all triples in the dataset, is excluded because it is rarely used in practice.

References

Abadi D, Marcus A, Madden S, Hollenbach K (2009) SW-store: a vertically partitioned DBMS for semantic web data management. VLDB J 18:385–406
Abadi D, Madden S, Ferreira M (2006) Integrating compression and execution in column-oriented database systems. In: Proceedings of 33rd international conference on management of data (SIGMOD), pp 671–682
Abadi D, Marcus A, Madden S, Hollenbach K (2007) Scalable semantic web data management using vertical partitioning. In: Proceedings of 33rd international conference on very large data bases (VLDB), pp 411–422
Anglés R, Gutiérrez C (2005) Querying RDF data from a graph database perspective. In: Proceedings of 2nd European semantic web conference (ESWC), pp 346–360
Arias M, Fernández J, Martínez-Prieto M (2011) An empirical study of real-world SPARQL queries. In: Proceedings of 1st international workshop on usage analysis and the web of data (USEWOD). Available at http://arxiv.org/abs/1103.5043
Atre M, Chaoji V, Zaki M, Hendler J (2010) Matrix “bit” loaded: a scalable lightweight join query processor for RDF data. In: Proceedings of 19th international conference on world wide web (WWW), pp 41–50
Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z (2007) DBpedia: a nucleus for a web of open data. In: Proceedings of 6th international semantic web conference (ISWC) and 2nd Asian semantic web conference (ASWC), pp 722–735
Berners-Lee T, Hendler J, Lassila O (2001) The semantic web. Scientific American Magazine
Binna R, Gassler W, Zangerle E, Pacher D, Specht G (2011) SpiderStore: a native main memory approach for graph storage. In: Proceedings of 23rd workshop Grundlagen von Datenbanken (GvDB), pp 91–96
Bizer C, Heath T, Berners-Lee T (2009) Linked data - the story so far. Int J Semant Web Inf Syst 5:1–22
Bönström V, Hinze A, Schweppe H (2003) Storing RDF as a graph. In: Proceedings of 1st Latin American Web Congress (LA-WEB), pp 27–36
Brisaboa N, Ladra S, Navarro G (2013) DACs: Bringing direct access to variable-length codes. Inf Process Manag 49(1):392–404
Brisaboa N, Ladra S, Navarro G (2014) Compact representation of web graphs with extended functionality. Inf Syst 39(1):152–174
Brisaboa N, de Bernardo G, Navarro G (2012) Compressed dynamic binary relations. In: Proceedings of 22nd data compression conference (DCC), pp 52–61
Broekstra J, Kampman A, van Harmelen F (2003) Sesame: an architecture for storing and querying RDF data and schema information. In: Spinning the semantic web, chapter , MIT Press, pp 197–222
Claude F, Ladra S (2011) Practical representations for Web and social graphs. In: Proceedings of 20th ACM conference on information and knowledge management (CIKM), pp 1185–1190
Fernández JD, Martínez-Prieto MA, Gutiérrez C, Polleres A (2011) Binary RDF representation for publication and exchange (HDT), W3C Member Submission. http://www.w3.org/Submission/2011/03/
Fernández JD, Martínez-Prieto MA, Gutiérrez C, Polleres A, Arias M (2013) Binary RDF representation for publication and exchange (HDT). J Web Semant. (in press). Available at: doi:10.1016/j.websem.2013.01.002
González R, Grabowski S, Mäkinen V, Navarro G (2005) Practical implementation of rank and select queries. In: Proceedings of posters of 4th workshop on experimental algorithms (WEA), pp 27–38
Grant J, Beckett D (2004) RDF test cases, W3C recommendation. http://www.w3.org/TR/rdf-testcases/
Groppe S (2011) Data management and query processing in semantic web databases. Springer, Berlin
Groza T, Grimnes G, Handschuh S, Decker S (2013) From raw publications to linked data. Knowl Inf Syst 34:1–21
Harris S, Gibbins N (2003) 3store: efficient bulk RDF storage. In: Proceedings of 1st international workshop on practical and scalable semantic systems (PSSS), pp 1–15
Harth A, Decker S (2005) Optimized index structures for querying RDF from the web. In: Proceedings of 3rd Latin American Web Congress (LA-WEB), pp 71–80
Hayes J, Gutiérrez C (2004) Bipartite graphs as intermediate model for RDF. In: Proceedings of 3rd international semantic web conference (ISWC), pp 47–61
Huang J, Abadi D, Ren K (2011) Scalable SPARQL querying of large RDF graphs. Proc VLDB Endow 4(11):1123–1134
Janik M, Kochut K (2005) BRAHMS: a workbench RDF store and high performance memory system for semantic association discovery. In: Proceedings of 4th international semantic web conference (ISWC), pp 431–445
Jing Y, Jeong D, Baik D (2009) SPARQL graph pattern rewriting for OWL-DL inference queries. Knowl Inf Syst 20:243–262
Knuth D (1973) The art of computer programming, vol. 3: sorting and searching. Addison Wesley, Reading
Manola F, Miller E (eds) (2004) RDF primer, W3C recommendation. http://www.w3.org/TR/rdf-primer/
Martínez-Prieto M, Fernández J, Cánovas R (2012) Querying RDF dictionaries in compressed space. ACM SIGAPP Appl Comput Rev 12(2):64–77
MonetDB (2013). http://www.monetdb.org/
Navarro G, Mäkinen V (2007) Compressed full-text indexes. ACM Comput Surv 39(1):article 2
Neumann T, Weikum G (2010) The RDF-3X engine for scalable management of RDF data. VLDB J 19:91–113
Neumann T, Weikum G (2009) Scalable join processing on very large RDF graphs. In: Proceedings of 35th international conference on management of data (SIGMOD), pp 627–640
Prud’hommeaux E, Seaborne A (eds) (2008) SPARQL query language for RDF, W3C recommendation. http://www.w3.org/TR/rdf-sparql-query/
Ramakrishnan R, Gehrke J (2000) Database management systems. Osborne/McGraw-Hill
Sakr S, Al-Naymat G (2010) Relational processing of RDF queries: a survey. SIGMOD Rec 38:23–28
Sakr S, Elnikety S, He Y (2012) G-SPARQL: a hybrid engine for querying large attributed graphs. In: Proceedings of 21st ACM conference on information and knowledge management (CIKM), pp 335–344
Salomon D (2007) Variable-length codes for data compression. Springer, Berlin
Samet H (2006) Foundations of multidimensional and metric data structures. Morgan Kaufmann Publishers Inc, Los Altos
Sánchez D, Isern D, Millan M (2011) Content annotation for the semantic web: an automatic web-based approach. Knowl Inf Syst 27:393–418
Schmidt M, Hornung T, Küchlin N, Lausen G, Pinkel C (2008) An experimental comparison of RDF data management approaches in a SPARQL benchmark scenario. In: Proceedings of 7th international conference on the semantic web (ISWC), pp 82–97
Sidirourgos L, Goncalves R, Kersten M, Nes N, Manegold S (2008) Column-store support for RDF data management: not all swans are white. Proc VLDB Endow 1(2):1553–1563
Stonebraker M, Abadi D, Batkin A, Chen X, Cherniack M, Ferreira M, Lau E, Lin A, Madden S, O’Neil E, O’Neil P, Rasin A, Tran N, Zdonik S (2005) C-store: a column-oriented DBMS. In: Proceedings of 31st international conference on very large data bases (VLDB), pp 553–564
Urbani J, Maassen J, Bal H (2010) Massive semantic web data compression with MapReduce. In: Proceedings of 19th ACM international symposium on high performance distributed computing (HPDC), pp 795–802
Virtuoso Universal Server (2013) http://virtuoso.openlinksw.com/
Weiss C, Karras P, Bernstein A (2008) Hexastore: sextuple indexing for semantic web data management. Proc VLDB Endow 1(1):1008–1019
Wilkinson K (2006) Jena property table implementation. In: Proceedings of 2nd international workshop on scalable semantic web knowledge base systems (SSWS), pp 35–46

Acknowledgments

This work was partially funded by the Spanish Ministry of Economy and Competitiveness (PGE & FEDER), grants TIN2009-14560-C03-02 (first and second authors) and TIN2013-46238-C4-3-R (first, second, third, and fourth authors); CDTI, Spanish Ministry of Economy and Competitiveness, and Axencia Galega de Innovación (CDTI EXP 00064563 / ITC-20133062); the Xunta de Galicia with FEDER, ref. GRC2013/053 (first and second authors); and Chilean Fondecyt, refs. 1-110066 and 1-140796. The first author is supported by the Spanish Ministry of Economy and Competitiveness, ref. BES-2010-039022. The third author is supported by the Regional Government of Castilla y León (Spain) and the European Social Fund. The fourth author holds an Ibero-American Young Teachers and Researchers Grant funded by Santander Universidades.

Author information

Authors and Affiliations

Database Lab, Facultade de Informática, University of A Coruña, A Coruña, Spain
Sandra Álvarez-García & Nieves Brisaboa
DataWeb Research, Department of Computer Science, University of Valladolid, Valladolid, Spain
Javier D. Fernández & Miguel A. Martínez-Prieto
Department of Computer Science, University of Chile, Santiago, Chile
Javier D. Fernández, Miguel A. Martínez-Prieto & Gonzalo Navarro
Escuela Universitaria de Informática, Campus María Zambrano, Segovia, Spain
Miguel A. Martínez-Prieto

Authors

Sandra Álvarez-García
Nieves Brisaboa
Javier D. Fernández
Miguel A. Martínez-Prieto
Gonzalo Navarro

Corresponding author

Correspondence to Miguel A. Martínez-Prieto.

Additional information

A preliminary version of this article appeared in Proc. 17th Americas Conference on Information Systems (AMCIS 2011): article 350.

Appendices

Appendix 1: Complete triple pattern experiments

Figures 11, 12, 13 and 14 summarize triple pattern experiments for all the datasets in our setup. We provide figures for cold (left column) and warm (right column) scenarios.

Appendix 2: Further join experiments

We show join performance figures for the remaining datasets in our setup: jamendo in Fig. 15, which discards all times over $10^5$ milliseconds; dblp in Fig. 16, which discards all times over $10^6$ milliseconds; and geonames in Fig. 17, which also discards all times over $10^6$ milliseconds. All these numbers are obtained in warm state because resolution times for RDF3X and MonetDB are less competitive in cold scenarios.

About this article

Cite this article

Álvarez-García, S., Brisaboa, N., Fernández, J.D. et al. Compressed vertical partitioning for efficient RDF management. Knowl Inf Syst 44, 439–474 (2015). https://doi.org/10.1007/s10115-014-0770-y

Received: 01 April 2013
Revised: 25 February 2014
Accepted: 11 July 2014
Published: 01 August 2014
Issue Date: August 2015
DOI: https://doi.org/10.1007/s10115-014-0770-y
