Scalable SPARQL querying using path partitioning

Buwen Wu; Yongluan Zhou; Pingpeng Yuan; Ling Liu; Hai Jin

doi:10.1109/ICDE.2015.7113334

32 citations (Scopus)

Abstract

The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets. Queries over RDF data often involve complex self-joins, which would be very expensive to run if the data are not carefully partitioned across the cluster and hence distributed joins over massive amounts of data are necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that have to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced and low-redundancy data partitioning scheme that can avoid or largely reduce the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.

Original language	English
Title of host publication	2015 IEEE 31st International Conference on Data Engineering (ICDE)
Number of pages	12
Publisher	IEEE
Publication date	26 May 2015
Pages	795-806
ISBN (Electronic)	978-1-4799-7964-6
DOI	https://doi.org/10.1109/ICDE.2015.7113334
Status	Published - 26 May 2015
Published externally	Yes
Event	31st IEEE International Conference on Data Engineering - Seoul, South Korea. Duration: 13 Apr 2015 → 17 Apr 2015. Conference number: 31

Conference

Conference	31st IEEE International Conference on Data Engineering
Number	31
Country/Territory	South Korea
City	Seoul
Period	13/04/2015 → 17/04/2015

Access to document

10.1109/ICDE.2015.7113334

Citation formats

@inproceedings{f91415ef9a54430bb51a99fe4c357ed6,

title = "Scalable SPARQL querying using path partitioning",

abstract = "The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets. Queries over RDF data often involve complex self-joins, which would be very expensive to run if the data are not carefully partitioned across the cluster and hence distributed joins over massive amounts of data are necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that have to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced and low-redundancy data partitioning scheme that can avoid or largely reduce the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.",

author = "Buwen Wu and Yongluan Zhou and Pingpeng Yuan and Ling Liu and Hai Jin",

year = "2015",

month = may,

day = "26",

doi = "10.1109/ICDE.2015.7113334",

language = "English",

pages = "795--806",

booktitle = "2015 IEEE 31st International Conference on Data Engineering (ICDE)",

publisher = "IEEE",

note = "31st IEEE International Conference on Data Engineering ; Conference date: 13-04-2015 Through 17-04-2015",

}

TY - GEN

T1 - Scalable SPARQL querying using path partitioning

AU - Wu, Buwen

AU - Zhou, Yongluan

AU - Yuan, Pingpeng

AU - Liu, Ling

AU - Jin, Hai

N1 - Conference code: 31

PY - 2015/5/26

Y1 - 2015/5/26

N2 - The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets. Queries over RDF data often involve complex self-joins, which would be very expensive to run if the data are not carefully partitioned across the cluster and hence distributed joins over massive amounts of data are necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that have to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced and low-redundancy data partitioning scheme that can avoid or largely reduce the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.

AB - The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets. Queries over RDF data often involve complex self-joins, which would be very expensive to run if the data are not carefully partitioned across the cluster and hence distributed joins over massive amounts of data are necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that have to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced and low-redundancy data partitioning scheme that can avoid or largely reduce the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.

U2 - 10.1109/ICDE.2015.7113334

DO - 10.1109/ICDE.2015.7113334

M3 - Article in proceedings

SP - 795

EP - 806

BT - 2015 IEEE 31st International Conference on Data Engineering (ICDE)

PB - IEEE

T2 - 31st IEEE International Conference on Data Engineering

Y2 - 13 April 2015 through 17 April 2015

ER -
