
Towards a Top-K SPARQL Query Benchmark Generator


Presentation Transcript


  1. Towards a Top-K SPARQL Query Benchmark Generator Shima Zahmatkesh1, Emanuele Della Valle1, Daniele Dell’Aglio1, and Alessandro Bozzon2 1Politecnico di Milano 2TU Delft

  2. Agenda • Rankings, Rankings everywhere • What are top-k SPARQL queries • Jim Gray's Benchmarking Principles • The problem • Some Definitions • Research Hypothesis • Background work: DBpedia SPARQL Benchmark • Our proposal: Top-k DBPSB • Preliminary Evaluation • Conclusions

  3. Rankings, rankings everywhere

  4. Rankings, rankings everywhere

  5. Rankings, rankings everywhere

  6. Why do we need to optimize them? A very intuitive and simplified example: • Top 3 largest countries (by both area and population)

  7. The standard way: materialize-then-sort scheme • Start from all 242 countries • Compute the scoring function that accounts for area and population • Sort all the 242 countries • Fetch the 3 best results
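A minimal SPARQL sketch of the materialize-then-sort scheme for the countries example, assuming hypothetical dbo:areaTotal and dbo:populationTotal properties and purely illustrative min/max constants for the normalization (not real DBpedia statistics):

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country
       (0.5 * ((?area - 2.0) / (17098246.0 - 2.0)) +    # normalized area (illustrative bounds)
        0.5 * ((?pop - 800.0) / (1400000000.0 - 800.0)) # normalized population (illustrative bounds)
        AS ?s)
WHERE {
  ?country a dbo:Country .
  ?country dbo:areaTotal ?area .
  ?country dbo:populationTotal ?pop .
}
ORDER BY DESC(?s)   # the engine scores and sorts all matching countries ...
LIMIT 3             # ... before returning the 3 best results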

  8. Innovative optimization: split-and-interleave scheme • Sorted access to the 242 countries ordered by population • Incrementally order the partial results by area (only 9 intermediate results) • Fetch the 3 best results

  9. State-of-the-art database method • Split the evaluation of the scoring function into single criteria • Interleave them with other operators • Use partial orders to incrementally construct the final order • Standard assumptions: • Monotone scoring function • Each criterion is evaluated as a number in [0,1] (normalization) • Optimized for the case of fast sorted access for each criterion

  10. Top-k SPARQL queries E.g., the 10 most recent books written by the youngest authors
SELECT ?book ?author (0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth) AS ?s)
WHERE {
  ?book dbp:isbn ?v .
  ?book dbp:author ?author .
  ?book dbp:releaseDate ?releaseDate .
  ?author dbp:dateOfBirth ?dateOfBirth .
}
ORDER BY DESC(?s)
LIMIT 10
The scoring function is expressed as a SELECT expression; normalization casts each value into [0..1] with norm(x) = (x - min_x) / (max_x - min_x); ORDER BY and LIMIT order and slice the results.
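Note that norm() is notation used on the slides, not a SPARQL built-in. A minimal sketch of how a single norm() call can be rewritten into standard SPARQL 1.1, assuming a numeric dbp:pages scoring variable and illustrative min/max values (3 and 1500) looked up beforehand:

PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?book ((?pages - 3) / (1500 - 3) AS ?s)   # norm(?pages) expanded inline
WHERE {
  ?book dbp:pages ?pages .
  FILTER(isNumeric(?pages))   # guard against dirty, non-numeric values
}
ORDER BY DESC(?s)
LIMIT 10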

  11. The Problem Set up a benchmark for top-k SPARQL queries that • resembles reality • stresses the features of top-k queries • Syntax: SELECT expression + ORDER BY + LIMIT • Performance: hits the SPARQL engine where it hurts

  12. Jim Gray's Benchmarking Principles • Relevant: measures the performance and price/performance of systems when performing typical operations within the problem domain • Portable: easy to implement on many different systems • Scalable: applies to small and large computer systems • Simple: understandable

  13. Definitions E.g., the 10 most recent books written by the youngest authors • Scoring function: 0.5*norm(?releaseDate) + 0.5*norm(?birthDate) • Rankable variables: ?book and ?author • Rankable data properties: releaseDate and dateOfBirth • Rankable triple patterns: the triple patterns that bind the rankable data properties • Scoring variables: ?releaseDate and ?birthDate
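As a sketch, these terms can be mapped onto the slide-10 query (keeping the slides' norm() notation); the comments are illustrative annotations, not part of the benchmark:

SELECT ?book ?author
       (0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth) AS ?s)   # scoring function
WHERE {
  ?book dbp:isbn ?v .                     # ordinary triple pattern
  ?book dbp:author ?author .              # ordinary triple pattern
  ?book dbp:releaseDate ?releaseDate .    # rankable triple pattern: ?book is a rankable variable,
                                          # dbp:releaseDate a rankable data property
  ?author dbp:dateOfBirth ?dateOfBirth .  # rankable triple pattern: ?author / dbp:dateOfBirth
}                                         # ?releaseDate and ?dateOfBirth are the scoring variables
ORDER BY DESC(?s)
LIMIT 10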

  14. Research Hypotheses • H.0: top-k SPARQL queries that resemble reality can be obtained by extending the DBpedia SPARQL Benchmark • H.1: more rankable variables → longer execution time • H.2: more scoring variables → longer execution time • H.3: changing LIMIT → unchanged execution time

  15. DBpedia SPARQL Benchmark (DBPSB) • A method to generate a SPARQL benchmark from DBpedia and its query logs • It can be applied to other datasets and other query logs • Characteristics: • Resembles reality • Stresses SPARQL features • [Pipeline: query logs and dataset generation → query analysis and clustering → query templates → auxiliary queries → query instances]

  16. Proposed Solution: Top-k DBPSB • An extension of the DBPSB auxiliary queries with top-k clauses, using the DBPSB datasets as a source of meaningful rankable variables • It is also a method: it can be applied to other benchmarks obtained with the DBPSB approach • [Pipeline: Auxiliary Queries → Find Rankable Variables → Compute Max and Min Values → Generate Scoring Function → Generate Top-k Queries → Top-k Queries]

  17. A DBPSB Auxiliary Query
SELECT DISTINCT ?v
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
}

  18. Top-k DBPSB step 1a To generate queries with 1 rankable variable:
SELECT ?p (COUNT(?p) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}
GROUP BY ?p
ORDER BY DESC(?n)

  19. Top-k DBPSB step 1b Results – not all sortable properties resemble reality: • Pages • ISBN • NumberOfPages • Year • Volume • wikiPageID • releaseDate • … NOTE: it requires manual selection

  20. Top-k DBPSB step 1c To generate queries with 2 rankable variables:
SELECT ?p ?p1 (COUNT(?p1) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  ?o ?p1 ?o1 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
}
GROUP BY ?p ?p1
ORDER BY DESC(?n)
NOTE: in practice we loop through all properties of ?v6 whose object is an IRI, in decreasing order of frequency

  21. Top-k DBPSB step 1d Results: • author, wikiPageID • author, wikiPageRevisionID • … • author, dateOfBirth • … • publisher, wikiPageID • publisher, wikiPageRevisionID • … • publisher, founded • … NOTE: it requires manual selection

  22. Top-k DBPSB step 2
SELECT (MAX(?o) AS ?max) (MIN(?o) AS ?min)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:pages ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}
NOTE: the FILTER clause should not be necessary, but DBpedia is very dirty …

  23. Top-k DBPSB step 3 • Choose the number of ranking variables (max three), e.g., books and authors • Choose the number of scoring variables per ranking variable (max three), e.g., releaseDate for books and dateOfBirth for authors • Look up the min and the max of each ranking variable to normalise it • Choose the weights; the sum of the weights must be 1 • Assemble the scoring function, e.g., 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)

  24. Top-k DBPSB step 4
SELECT ?v6 ?v3 (0.5*norm(?o1) + 0.5*norm(?o2) AS ?s)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:releaseDate ?o1 .
  ?v3 dbp:dateOfBirth ?o2 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
  FILTER(isNumeric(?o2) || datatype(?o2) = xsd:dateTime)
}
ORDER BY DESC(?s)
LIMIT 10
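Since norm() is not a SPARQL built-in, a generated query has to inline the normalization before it can run on an engine. A minimal executable sketch of this step-4 query, where the min/max values of step 2 are replaced by purely illustrative year bounds (1900–2015 for release dates, 1850–1995 for dates of birth), dateTime values are reduced to years via year(), and the filters therefore keep only xsd:dateTime values:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?v6 ?v3
       (0.5 * ((year(?o1) - 1900) / (2015 - 1900)) +   # norm(?o1), illustrative bounds
        0.5 * ((year(?o2) - 1850) / (1995 - 1850))     # norm(?o2), illustrative bounds
        AS ?s)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:releaseDate ?o1 .
  ?v3 dbp:dateOfBirth ?o2 .
  FILTER(datatype(?o1) = xsd:dateTime)
  FILTER(datatype(?o2) = xsd:dateTime)
}
ORDER BY DESC(?s)
LIMIT 10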

  25. Preliminary Results 1/2 • We tested our hypotheses using • Virtuoso Open-Source Edition version 6.1.6 • Jena TDB version 2.10.1 • DBpedia 10% • In this setting, Top-k DBPSB generates queries • adequate to test • H.2: more scoring variables → longer execution time • H.3: changing LIMIT → unchanged execution time • only partially adequate to test • H.1: more rankable variables → longer execution time

  26. Preliminary Results 2/2 • H.1: more rankable variables → longer execution time • confirmed in some cases • not confirmed when aggregating by query across engines • confirmed when aggregating by engine across queries • H.2: more scoring variables → longer execution time • confirmed for Jena TDB • confirmed in most of the cases for Virtuoso • H.3: changing LIMIT → unchanged execution time • confirmed for Jena TDB • confirmed in most of the cases for Virtuoso

  27. Conclusions • Top-k DBPSB is a successful first attempt to automatically generate top-k SPARQL queries that • resemble reality • hit SPARQL engines where it hurts • More investigation is required to • better understand the relationship between the number of rankable variables and the execution time • e.g., cardinalities, selectivity and joins • include other known features of top-k queries that impact execution time • e.g., the correlation between the orders induced on the result set by the different scoring variables in the scoring function • e.g., the distribution of the values matched by the scoring variables

  28. Thank you! Any questions? Shima Zahmatkesh1, Emanuele Della Valle1, Daniele Dell’Aglio1, and Alessandro Bozzon2 1Politecnico di Milano 2TU Delft

  29. Preliminary Results - details
