
An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce


Presentation Transcript


  1. An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce
  Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu
  COUL: Semantic COmpUting research Lab

  2. Outline
  • Introduction
  • Background
    • MapReduce, Pig and Join Processing
    • RDF Graph Pattern Matching in Pig
  • Approach
    • TripleGroup data model and Nested TripleGroup Algebra (NTGA)
    • Comparing NTGA-based plans and Pig Latin plans for graph pattern matching queries
  • Evaluation
  • Related Work
  • Conclusion and Future Work

  3. Basics: MapReduce
  • Large-scale processing of data on a cluster of commodity-grade machines
  • Users encode tasks as map / reduce functions, which are executed in parallel across the cluster
  • Apache Hadoop* – open-source implementation
  • Key Terms
    • Hadoop Distributed File System (HDFS)
    • Slave nodes / TaskTracker – Mappers (Reducers) execute the map (reduce) function
    • Master node / JobTracker – manages and assigns tasks to Mappers / Reducers
  * http://hadoop.apache.org/

  4. Data Processing on Hadoop
  [Figure: dataflow of one MapReduce job. The JobTracker assigns work; Mappers read input from HDFS, run map(), and write (key, value) pairs to local disk; the sort / shuffle phase remotely reads and groups values by key, e.g. (k1, v1), (k1, v2), (k1, v3) becomes (k1, {v1, v2, v3}); Reducers run reduce() and write output back to HDFS]
  • Each MR cycle incurs I/O and communication costs
  • Supports partition parallelism

  5. Joins in MapReduce: Single Join Workload
  • Map phase – scan input records
    • the map function annotates each record based on its join column, e.g. (joinKey, Record)
  • Reduce phase – records with the same joinKey are collected by the same reduce task
    • the reduce function joins the tuples
  • Output written into HDFS
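  To make the phases concrete: a single equijoin over the UserVisits and Ranking relations of the appendix slides, written in the Pig Latin introduced on the next slide, compiles to exactly this pattern. A sketch (relation and column names follow slides 33 and 38):
    J = JOIN UserVisits BY destURL, Ranking BY pageURL;
    -- map:     each tuple is annotated with its join key, e.g. (joinKey, Record)
    -- shuffle: tuples sharing a join key are routed to the same reduce task
    -- reduce:  matching UserVisits and Ranking tuples are joined and written to HDFS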

  6. Data Processing in Pig
  • Express data flow using high-level query primitives
    • usability, code reuse, automatic optimization
  • Pig Latin
    • Data model: atom, tuple, bag (nesting)
    • Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggregate functions
    • Ex. equijoin on REL A (column 0) and REL B (column 1): JOIN A BY $0, B BY $1;
  • Extensibility support via UDFs
  • Dataflow is compiled into a workflow of MapReduce jobs

  7. Example Pig Query Plan
  SELECT ?vlabel ?hpage ?price ?prod
  WHERE { ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
          ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod . }
  [Figure: the compiled plan. Input.rdf is LOADed and FILTERed once per property (homepage, label, country, ..., product), then the relations are joined pairwise on Subject, e.g. T1 = JOIN A ON Sub, B ON Sub (MR1); T2 = JOIN C ON Sub, T1 ON Sub (MR2); ... ; T3 = JOIN H ON Sub, T7 ON Sub (MR6); STORE]
  • #MR cycles = #joins = 6, so the (I/O & communication) cost is paid 6 times
  • The LOADs incur I/O as well (reducible with the SPLIT operator, which shares a single scan)
  • Expensive!!!
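  A minimal Pig Latin sketch of this 6-cycle plan; the file name, triple schema, and positional offsets are assumptions, in the style of the appendix scripts:
    A = LOAD 'Input.rdf' USING PigStorage(' ') AS (Sub, Prop, Obj);
    H = FILTER A BY Prop eq 'homepage';   L = FILTER A BY Prop eq 'label';
    C = FILTER A BY Prop eq 'country';    V = FILTER A BY Prop eq 'vendor';
    P = FILTER A BY Prop eq 'price';      D = FILTER A BY Prop eq 'delDays';
    R = FILTER A BY Prop eq 'product';
    T1 = JOIN H BY $0, L BY $0;      -- MR1: join on Subject (?v star)
    T2 = JOIN T1 BY $0, C BY $0;     -- MR2: ?v star complete
    T3 = JOIN V BY $0, P BY $0;      -- MR3: ?o star
    T4 = JOIN T3 BY $0, D BY $0;     -- MR4
    T5 = JOIN T4 BY $0, R BY $0;     -- MR5: ?o star complete
    T6 = JOIN T5 BY $2, T2 BY $0;    -- MR6: chain join (?o vendor ?v); $2 is the vendor triple's Object
    STORE T6 INTO '/out';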

  8. Possible Optimizations: m-way Join
  (same query as slide 7)
  [Figure: MR1 computes star-join SJ1 (the ?v star) with one m-way JOIN; MR2 computes star-join SJ2 (the ?o star); MR3 computes J1, the join between the two stars, with intermediate results passing through local disk and HDFS between cycles]
  • #MR cycles reduced from 6 to 3
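  Under the same assumed schema as the previous sketch, each star collapses into a single m-way JOIN:
    SJ1 = JOIN H BY $0, L BY $0, C BY $0;             -- MR1: the whole ?v star in one cycle
    SJ2 = JOIN V BY $0, P BY $0, D BY $0, R BY $0;    -- MR2: the whole ?o star in one cycle
    J1  = JOIN SJ2 BY $2, SJ1 BY $0;                  -- MR3: chain join (?o vendor ?v)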

  9. BUT?
  • Still expensive! 1 MR cycle per star-join
  • Many pattern matching queries involve multiple star-join subpatterns
    • 50% of BSBM* benchmark queries have two or more star patterns
  • Our proposal:
    • Coalesce the computation of ALL star-join subpatterns into a single MR cycle
  • How?
    • Don't think of them as a set of joins!
    • Think of it as a GROUP BY operation

  10. GROUP BY Subject
  WHERE { ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
          ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod . }
  • Grouping all triples on Subject assembles every star subpattern at once: 1 MapReduce cycle!!!
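  A sketch of the reinterpretation in plain Pig Latin (the structural check is deferred to the NTGA operators introduced below):
    T = LOAD 'Input.rdf' USING PigStorage(' ') AS (Sub, Prop, Obj);
    G = GROUP T BY Sub;   -- ONE MR cycle: every subject's triples (both ?v and ?o stars) land in one bag
    -- each group is a candidate star; structure-based filtering (TG_GroupFilter, slide 16)
    -- then discards the groups that match neither star pattern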

  11. What are we proposing? • A new data model (TripleGroup) and algebra (Nested TripleGroup Algebra - NTGA) for more efficient graph pattern matching on MapReduce platforms

  12. Outline
  • Introduction
  • Background
    • MapReduce, Pig and Join Processing
    • RDF Graph Pattern Matching in Pig
  • Approach
    • TripleGroup data model and Nested TripleGroup Algebra (NTGA)
    • Comparing NTGA-based plans and Pig Latin plans for graph pattern matching queries
  • Evaluation
  • Related Work
  • Conclusion and Future Work

  13. Our Approach: RAPID+
  • Goal: minimize I/O and communication costs by reducing MR cycles
  • Reinterpret and refactor operations into a more suitable (coalesced) set of operators – NTGA
  • Foundation:
    • Re-interpret multiple star-joins as a grouping operation
    • This yields "groups of triples" (TripleGroups) instead of n-tuples
    • Different structure BUT "content equivalent"
  • NTGA – an algebra on TripleGroups

  14. NTGA – Data Model
  • Data model based on nested TripleGroups; captures graphs more naturally
  • TripleGroup: a group of triples sharing a Subject / Object component
  • Can be nested at the Object component
  • Flat:
    {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
  • Nested at vendor:
    {(&Offer1, price, 108),
     (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.vendors...)}),
     (&Offer1, product, &P1), (&Offer1, delDays, 2)}

  15. NTGA Operators (1)
  • TG_Flatten – generate the equivalent n-tuple ("content equivalence")
    input:  {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...)}
    output: (&V1, label, vendor1, &V1, country, US, &V1, homepage, www.ven...), i.e. triples t1, t2, t3 concatenated
  • TG_Unnest – unnest a nested TripleGroup
    input:  {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
    output: {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

  16. NTGA Operators (2)
  • TG_GroupFilter – retain only TripleGroups that satisfy the required query substructure: structure-based filtering
  • TG_GroupFilter(TG, {price, vendor, delDays, product}) eliminates TripleGroups with missing triples (edges):
    input TG:
      {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)},
      {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)},
      {(&Offer2, vendor, &V2), (&Offer2, product, &P3), (&Offer2, delDays, 1)}   (missing price)
    output TG{price, vendor, delDays, product}:
      {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
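  In the Pig-based prototype this amounts to a structural check over each subject's bag. A hypothetical sketch; the HasAllProps UDF is invented for illustration:
    G  = GROUP T BY Sub;
    -- keep only groups containing every property of the ?o star pattern
    TG = FILTER G BY HasAllProps(T.Prop, 'price', 'vendor', 'delDays', 'product');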

  17. NTGA Operators (3)
  • TG_Filter – filter out TripleGroups containing triples that do not satisfy the filter condition (FILTER clause): value-based filtering
  • TG_Filter[price<200](TG):
    input TG{price, vendor, delDays, product}:
      {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)},
      {(&Offer3, vendor, &V2), (&Offer3, product, &P3), (&Offer3, price, 306), (&Offer3, delDays, 1)}
    output TG{price, vendor, delDays, product}:
      {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
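  Value-based filtering needs no grouping at all, which is why RAPID+ folds it into load (the LoadFilter of slide 19): a triple that violates the condition is dropped early, and its TripleGroup then fails TG_GroupFilter's structural check. A sketch in the appendix scripts' dialect; the cast is an assumption:
    T  = LOAD 'Input.rdf' USING PigStorage(' ') AS (Sub, Prop, Obj);
    -- drop price triples violating FILTER(?price < 200) before any grouping
    Tf = FILTER T BY not (Prop eq 'price' and (int)Obj >= 200);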

  18. NTGA Operators (4)
  • TG_Join – join between TripleGroups of different structures, based on the join triple patterns (here ?o vendor ?v and ?v country ?vcountry)
    input TG{price, vendor, delDays, product}:
      {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
    and TG{label, country, homepage}:
      {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...)}
    output (nested TripleGroup):
      {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

  19. Pattern Matching using NTGA in Pig
  • LoadFilter (load + TG_Filter)
  • StarGroupFilter (TG_GroupBy + TG_GroupFilter) yields the two stars:
    {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)},
    {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
  • RDFJoin (TG_Join) nests them:
    {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
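  Putting the coalesced operators together, a hypothetical end-to-end dataflow; the operator names come from this slide, but the surface syntax is invented for illustration:
    -- one MR cycle computes ALL star-joins; one more handles the chain join
    T   = LoadFilter 'Input.rdf' BY (price < 200);    -- load + TG_Filter
    TGs = StarGroupFilter T BY Sub;                   -- TG_GroupBy + TG_GroupFilter
    Out = RDFJoin TGs BY vendor;                      -- TG_Join: nests the ?v star under vendor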

  20. Mapping to Pig Latin / Relational Algebra
  [Table mapping NTGA operators to their Pig Latin / relational algebra counterparts; not preserved in the transcript]

  21. RDFMap: Efficient Data Representation
  • Compact representation of intermediate results during TripleGroup-based processing
  • Efficient look-up of triples matching a given Property type, via a property-based indexing scheme
  • Ability to represent structure-label information for groups of triples

  22. Outline
  • Introduction
  • Background
    • MapReduce, Pig and Join Processing
    • RDF Graph Pattern Matching in Pig
  • Approach
    • TripleGroup data model and Nested TripleGroup Algebra (NTGA)
    • Comparing NTGA-based plans and Pig Latin plans for graph pattern matching queries
  • Evaluation
  • Related Work
  • Conclusion and Future Work

  23. Evaluation
  • Setup: 5-node to 25-node Hadoop clusters on NCSU's Virtual Computing Lab*
  • Dataset: synthetic benchmark dataset generated using the BSBM** tool (max. 40GB of data, approx. 175 million triples)
  • Evaluation of Pig (Pig_opt) vs. RAPID+
    • Task 1 – Scalability with size of RDF graphs
    • Task 2 – Scalability with denser star patterns
    • Task 3 – Scalability with increasing cluster sizes
  * https://vcl.ncsu.edu
  ** http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/

  24. Experimental Results (1)
  Cost Analysis across Increasing Size of RDF Graphs (5-node)
  • Key Observations:
    • Benefit of TripleGroup-based processing seen across data sizes – up to 60% in some cases (RAPID+ << Pig_opt < Pig)
    • The Pig approaches did not complete for the largest data size

  25. Experimental Results (2)
  Cost Analysis across Increasing Star Density (5-node / 20GB); %gain of RAPID+ over Pig (10-node / 32GB)
  • Key Observations:
    • RAPID+ maintains a consistent gain of about 50% across the varying density
    • Cost savings come from eliminating redundant Subject values and join triples

  26. Experimental Results (3)
  Cost Analysis across Increasing Cluster Sizes; query pattern with three star-joins and two chain-joins (32GB)
  • Key Observations:
    • RAPID+ shows a 56% gain over the Pig approaches on the 10-node cluster
    • The Pig approaches catch up with increasing cluster size: more nodes decrease the probability of disk spills with the SPLIT approach
    • RAPID+ still maintains a 45% gain across the experiments

  27. And some Updates…
  • Additional evaluation:
    • Up to 65% performance gain on another synthetic benchmark dataset* for three/two star-join queries
    • Experiments extended to 1 billion triples (43GB): 31% (10-node) to 41% (30-node) performance gain
  • RAPID+ now includes a SPARQL interface
  • In future: cost-based optimizations to select between Pig and NTGA execution plans
  • Join us for a demo of RAPID+ @ VLDB 2011**
  * Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proc. SIGMOD (2009)
  ** Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The journey using a nested TripleGroup algebra. To appear in: Proc. VLDB (2011)

  28. Related Work
  • MapReduce-based processing: Map-Reduce-Merge [Yang07], join optimization [Afrati10]
  • High-level dataflow languages: Pig Latin [Olston08], [HiveQL], [JAQL]
  • Partitioning schemes and other extensions: [Husain10]*, Hadoop++ [Dittrich10], HadoopDB [Abadi09]
  • Indexing and reasoning over RDF: [Newman08]*, [Hunter08]*, [Urbani07]*

  29. Conclusion
  • TripleGroup-based processing for evaluating pattern matching queries on MapReduce platforms
  • NTGA operators re-factored to minimize #MR cycles, and hence costs
  • Reduced costs of repeated data handling via operator coalescing
  • Efficient data representation (RDFMap)

  30. References
  [Dean04] Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Commun. ACM 51 (2008) 107–113
  [Olston08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: A not-so-foreign language for data processing. In: Proc. International Conference on Management of Data (2008)
  [Abadi09] Abouzied, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D.J., Silberschatz, A.: HadoopDB in action: Building real world applications. In: Proc. International Conference on Management of Data (2010)
  [Sridhar09] Sridhar, R., Ravindra, P., Anyanwu, K.: RAPID: Enabling scalable ad-hoc analytics on the semantic web. In: ISWC (2009)
  [Yu08] Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., Currey, J.: DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI (2008)
  [Newman08] Newman, A., Li, Y.F., Hunter, J.: Scalable semantics: The silver lining of cloud computing. In: IEEE International Conference on eScience (2008)
  [Hunter08] Newman, A., Hunter, J., Li, Y., Bouton, C., Davis, M.: A scale-out RDF molecule store for distributed processing of biomedical data. In: Semantic Web for Health Care and Life Sciences Workshop (2008)
  [Urbani07] Urbani, J., Kotoulas, S., Oren, E., van Harmelen, F.: Scalable distributed reasoning using MapReduce. In: Proc. International Semantic Web Conference (2009)
  [Abadi07] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable semantic web data management using vertical partitioning. In: VLDB (2007)
  [Dittrich10] Dittrich, J., Quiane-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB (2010)
  [Yang07] Yang, H., Dasdan, A., Hsiao, R., Parker Jr., D.S.: Map-reduce-merge: Simplified relational data processing on large clusters. In: SIGMOD (2007)
  [Afrati10] Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proc. International Conference on Extending Database Technology (2010)
  [Husain10] Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data intensive query processing for large RDF graphs using cloud computing tools. In: IEEE International Conference on Cloud Computing (CLOUD) (2010)
  [HiveQL] http://hadoop.apache.org/hive/
  [JAQL] http://code.google.com/p/jaql

  31. Thank You!

  32. Environment
  • Node specifications
    • Single / dual core Intel x86
    • 2.33 GHz processor speed
    • 4 GB memory
    • Red Hat Linux
  • Pig 0.5.0
  • Hadoop 0.20, block size 256MB

  33. Benchmark Data*
  • Log files of HTTP server traffic, stored as column-delimited text files
  • Rankings: pageRank | pageURL | avgDuration
  • UserVisits: sourceIPAddr | destinationURL | visitDate | adRevenue | UserAgent | cCode | lCode | sKeyword | avgTimeOnSite
  * Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proc. SIGMOD (2009)

  34. Scripts (Q1) – grouping-based plan
  • A = load '/data/' using PigStorage(' ');
  • A1 = filter A by $1 eq 'pageRank' or $1 eq 'pageURL' or $1 eq 'destURL' or $1 eq 'srcIP' or $1 eq 'adRevenue' or ($1 eq 'type' and ($2 eq 'Ranking' or $2 eq 'UserVisits'));
  • B = group A1 by $0 PARALLEL 5;  -- one grouping assembles the star subpatterns
  • C = foreach B generate flatten(ReassembleRDF($1, 'pageURL|destURL', '1'));
  • D = group C by $0 PARALLEL 5;
  • E = foreach D generate flatten(ReassembleRDF($1, 'srcIP|', '2')) as (srcIP:chararray, vals:bytearray);
  • store E into '/q1_app1';

  35. Scripts (Q1) – join-based plan
  • A1 = load '/data/' using PigStorage(' ');
  • split A1 into pageRank IF $1 eq 'pageRank',
    srcIP IF $1 eq 'srcIP', pageURL IF $1 eq 'pageURL',
    destURL IF $1 eq 'destURL', adRevenue IF $1 eq 'adRevenue',
    typeRanking IF $1 eq 'type' and $2 eq 'Ranking',
    typeUV IF $1 eq 'type' and $2 eq 'UserVisits';
  • Ranking = join pageURL by $0, pageRank by $0, typeRanking by $0 PARALLEL 5;  -- star-join 1
  • UserVisits = join srcIP by $0, destURL by $0, adRevenue by $0, typeUV by $0 PARALLEL 5;  -- star-join 2
  • C1 = join Ranking by $2, UserVisits by $5 PARALLEL 5;  -- chain join between the stars
  • D1 = foreach C1 generate $11, $17, $5;
  • store D1 into '/q1_app2';

  36. Experiment Results
  • Percentage performance gain = ((exec time 1) - (exec time 2)) / (exec time 1) x 100
  • e.g., if approach 1 takes 100 min and approach 2 takes 44 min, the gain of approach 2 is (100 - 44) / 100 = 56%

  37. Possible Optimizations (2)
  • Coalesce join operations into as few MR cycles as possible
  • Compute star patterns via m-way JOIN: one star-join = 1 MR cycle
  • Reduced #MR cycles means reduced I/O + communication costs

  38. Structured Data Processing in Pig
  Query: Retrieve the pageRank and adRevenue of pages visited by particular users between "1979/12/01" and "1979/12/30"
  [Figure: dataflow; LOAD UserVisits and FILTER on visitDate; LOAD Ranking; JOIN UserVisits ON destURL, Ranking ON pageURL; STORE]
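  A minimal Pig Latin rendering of this dataflow; the delimiter, paths, and final projection are assumptions, with schemas following slide 33:
    UV  = load '/UserVisits' using PigStorage('|') as (srcIP, destURL, visitDate, adRevenue, agent, cCode, lCode, sKeyword, avgTime);
    R   = load '/Rankings' using PigStorage('|') as (pageRank, pageURL, avgDuration);
    UVf = filter UV by visitDate >= '1979/12/01' and visitDate <= '1979/12/30';
    J   = join UVf by destURL, R by pageURL;
    Out = foreach J generate pageRank, adRevenue;
    store Out into '/q_out';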

  39. JOIN: from Pig Latin to MapReduce
  JOIN UserVisits ON destURL, Ranking ON pageURL;
  [Figure: map tasks annotate each UserVisits / Ranking tuple with its join key (url1, url2, ...); the shuffle routes all tuples with the same key to one reducer (url1 to Reducer 1, url2 to Reducer 2); reducers package the matching tuples into join results]

  40. RDF Data Model (Resource Description Framework)
  • Statements (triples) of the form (Subject, Property, Object), e.g. (&UV1, srcIP, 158.112.27.3)
  [Figure: graph representation of the Ranking and UserVisits data as RDF]

  41. Example SPARQL Query
  Data: descriptions of Vendors, their product Offers, and Reviews of products (BSBM* dataset)
  Query: Retrieve the details of US-based Vendors
  SELECT ?vlabel ?hpage
  WHERE { ?v type Vendor .
          ?v country ?vcountry .
          ?v label ?vlabel .
          ?v homepage ?hpage .
          FILTER (?vcountry = "US") }
  * http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
