Reachability Query Over A Large Graph

Institute of Computer Science and Technology, Peking University • Instructor: Lei Zou

Reachability Query • Problem Definition: Given a single large directed graph G and two vertices u1 and u2, a reachability query verifies whether there exists a directed path from u1 to u2. • ?Query(1,11): Yes • ?Query(3,9): No [Figure: example directed graph with vertices 1 to 15]
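The problem definition above can be illustrated with a plain breadth-first search. This is a minimal sketch only: the function name `reachable` and the dictionary `toy` are assumptions for illustration, and `toy` is a made-up graph, not the 15-vertex graph from the slide. The sketch treats every vertex as reaching itself.

```python
from collections import deque

def reachable(adj, source, target):
    """Return True if a directed path leads from source to target."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical toy graph (adjacency lists), not the slide's figure.
toy = {1: [2, 3], 2: [4], 3: [], 4: []}
print(reachable(toy, 1, 4))
print(reachable(toy, 3, 4))
```

Each query costs time linear in the graph size, which is why the indexing schemes in the rest of the deck trade index space for faster answers.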

Applications • XML Data “//”: ancestor-descendant search • Biological Networks “Finding all genes whose expressions are directly or indirectly influenced by a given molecule” • Paper Citation Network “Given two papers A and B, we want to check whether A is directly or indirectly cited by B”.

Naïve Methods 10,000 nodes and 20,000 edges

Existing Solutions • Chain Cover [Figure: example graph with vertices 1 to 8]

Existing Solutions • Tree Cover [Figure: example graph with vertices 1 to 8]

Existing Solutions • 2-hop Labeling [Figure: example graph with vertices 1 to 8 and its hops]

Preliminary • A directed acyclic graph (commonly abbreviated to DAG) is a directed graph with no directed cycles. • Any directed graph G can be transformed into a DAG G' by condensing each strongly connected component into one node. • G and G' carry the same reachability information.
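One way to perform the condensation is Kosaraju's two-pass algorithm, sketched below. The slides do not say which SCC algorithm is used, so this choice, the function name `condense`, and the sample graph `g` are all assumptions for illustration.

```python
def condense(adj):
    """Return (comp, dag): comp maps each vertex to an SCC representative,
    dag is the condensed graph over those representatives."""
    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    # Pass 1: iterative DFS, recording vertices in order of finish time.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj.get(root, ())))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj.get(nxt, ()))))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()
    # Pass 2: DFS on the reversed graph in decreasing finish time.
    rev = {v: [] for v in nodes}
    for v, ws in adj.items():
        for w in ws:
            rev[w].append(v)
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in rev[node]:
                if prev not in comp:
                    comp[prev] = root
                    stack.append(prev)
    # Keep only edges that connect two different components.
    dag = {c: set() for c in set(comp.values())}
    for v, ws in adj.items():
        for w in ws:
            if comp[v] != comp[w]:
                dag[comp[v]].add(comp[w])
    return comp, dag

# Hypothetical graph: a, b, c form a cycle; d hangs off c.
g = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
comp, dag = condense(g)
```

Two vertices in the same component reach each other, so a reachability query on G reduces to a query between `comp[u]` and `comp[v]` on the condensed DAG.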

Preliminary • In graph theory, a topological sort or topological ordering of a directed acyclic graph (DAG) is a linear ordering of its nodes in which each node comes before all nodes to which it has outbound edges. • Every DAG has one or more topological sorts.
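A topological ordering can be computed with Kahn's algorithm, as in the sketch below. The function name `topological_order` and the edge set in `dag` are assumptions: the edges are made up for illustration and are not read off the slide's figure.

```python
from collections import deque

def topological_order(adj):
    """Return one topological ordering of a DAG given as adjacency lists."""
    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    indeg = {v: 0 for v in nodes}
    for ws in adj.values():
        for w in ws:
            indeg[w] += 1
    # Start from the vertices that have no incoming edges.
    ready = deque(sorted(v for v in nodes if indeg[v] == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in adj.get(v, ()):
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle")
    return order

# Hypothetical DAG on vertices A to G.
dag = {"A": ["B"], "B": ["D", "C"], "D": ["F"],
       "G": ["C"], "C": ["E"], "E": ["F"]}
order = topological_order(dag)  # ['A', 'G', 'B', 'D', 'C', 'E', 'F']
```

A DAG usually has several valid orderings; this sketch returns just one of them, determined by the queue discipline.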

Topological Ordering [Figure: a DAG on vertices A to G and one topological ordering of it: A, G, B, D, C, E, F]

Outline • Chain Cover Chain@TODS1990 • Tree Cover GRIPP @ SIGMOD 2007, Dual Labeling@ICDE2006 • Path-Tree Cover pathtree@SIGMOD 2008 • 2Hop-Labeling HOPI@EDBT2004, FAST2Hop@EDBT2008

Chain@TODS1990 • Chains: C0 = A → B → D → F; C1 = G → C → E • Labels (position, chain): A (0, C0), B (1, C0), D (2, C0), F (3, C0), G (0, C1), C (1, C1), E (2, C1)

Chain@TODS1990 • Can A reach E? [Figure: DAG on vertices A to G with (position, chain) labels A (0, C0), B (1, C0), D (2, C0), F (3, C0), G (0, C1), C (1, C1), E (2, C1)]
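A chain-cover index is commonly described as storing, for every vertex, the smallest position it can reach on each chain; u then reaches v iff that position on v's chain is at most v's position. The sketch below follows that description. The names `build_chain_index` and `chain_reachable` are hypothetical, and the edge set in `adj` is an assumption chosen to be consistent with the two chains A → B → D → F and G → C → E; it is not taken from the slide's figure.

```python
def build_chain_index(adj, chains):
    """Return (label, reach): label[v] = (position, chain id);
    reach[v][c] = smallest position on chain c that v can reach."""
    label = {v: (pos, cid)
             for cid, chain in enumerate(chains)
             for pos, v in enumerate(chain)}
    reach = {}

    def visit(v):
        if v in reach:
            return reach[v]
        pos, cid = label[v]
        best = {cid: pos}  # every vertex reaches itself
        for w in adj.get(v, ()):
            for c, p in visit(w).items():
                if p < best.get(c, p + 1):
                    best[c] = p
        reach[v] = best
        return best

    for v in label:
        visit(v)
    return label, reach

def chain_reachable(label, reach, u, v):
    """u reaches v iff u reaches some vertex at or before v on v's chain."""
    pos, cid = label[v]
    return reach[u].get(cid, pos + 1) <= pos

# Assumed edges (hypothetical), consistent with the two chains.
adj = {"A": ["B"], "B": ["D", "C"], "D": ["F"],
       "G": ["C"], "C": ["E"], "E": ["F"]}
chains = [["A", "B", "D", "F"], ["G", "C", "E"]]
label, reach = build_chain_index(adj, chains)
print(chain_reachable(label, reach, "A", "E"))
```

With k chains, each vertex stores at most k entries, so a query is a single lookup regardless of the path length.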

Chain Decomposition [Figure: the example DAG on vertices A to G and its topological ordering A, G, B, D, C, E, F]

Chain Decomposition • A Simple Method:
1. Perform a topological sort over the DAG.
2. Find v, the smallest remaining vertex (in ascending order of the topological sort), and add it to a new path P.
3. Find v', the smallest remaining vertex such that there is an edge from v to v', and add v' to the path P.
4. Set v to v' and repeat Step 3 until no such v' can be found.
5. Remove all vertices in P from the DAG.
6. Repeat from Step 2 until no vertex is left in the DAG.
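The simple method can be sketched as follows. The function name `path_decomposition` is hypothetical, and the edges in `adj` are assumed for illustration (the slide's figure does not survive in the transcript); the topological order is passed in explicitly.

```python
def path_decomposition(adj, topo):
    """Greedily split a DAG into vertex-disjoint paths.
    topo is a topological ordering of all vertices."""
    rank = {v: i for i, v in enumerate(topo)}
    remaining = set(topo)
    paths = []
    for start in topo:  # smallest remaining vertex first
        if start not in remaining:
            continue
        path, v = [], start
        while v is not None:
            path.append(v)
            remaining.discard(v)
            # Smallest remaining successor in topological order, if any.
            nxt = [w for w in adj.get(v, ()) if w in remaining]
            v = min(nxt, key=rank.get) if nxt else None
        paths.append(path)
    return paths

# Assumed edges (hypothetical) and a valid topological ordering for them.
adj = {"A": ["B"], "B": ["D", "C"], "D": ["F"],
       "G": ["C"], "C": ["E"], "E": ["F"]}
topo = ["A", "G", "B", "D", "C", "E", "F"]
paths = path_decomposition(adj, topo)
# paths == [['A', 'B', 'D', 'F'], ['G', 'C', 'E']]
```

The greedy method is simple but does not guarantee the minimum number of chains.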

GRIPP @ SIGMOD 2007 • Pre- and Postorder Numbering For Trees: node labeling during depth-first traversal of G • w is reachable from v iff v_pre < w_pre < v_post • Labels [pre, post]: R [0, 17], A [1, 16], B [2, 7], C [8, 9], D [10, 15], E [3, 4], F [5, 6], G [11, 12], H [13, 14] • Is E reachable from A? E_pre = 3, A_pre = 1, A_post = 16; 1 < 3 < 16 → true
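The numbering can be reproduced with one counter that is incremented on entering and on leaving each node. This is a minimal sketch: the function names `label_tree` and `tree_reachable` are hypothetical, and the child lists encode the tree as it can be read from the intervals on the slide (R over A; A over B, C, D; B over E, F; D over G, H).

```python
def label_tree(children, root):
    """Assign (pre, post) numbers with a single depth-first traversal."""
    labels, counter = {}, 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        for child in children.get(node, ()):
            visit(child)
        labels[node] = (pre, counter)
        counter += 1

    visit(root)
    return labels

def tree_reachable(labels, v, w):
    """w is reachable from v iff v_pre < w_pre < v_post."""
    v_pre, v_post = labels[v]
    return v_pre < labels[w][0] < v_post

children = {"R": ["A"], "A": ["B", "C", "D"],
            "B": ["E", "F"], "D": ["G", "H"]}
labels = label_tree(children, "R")
print(labels["A"], labels["E"], tree_reachable(labels, "A", "E"))
```

The check works only for trees; the following slides extend it to general graphs by adding non-tree instances.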

Non-Tree Edge • General Idea In GRIPP • Input: a graph G = (V, E) • Output: the GRIPP index table IND(G), with one instance for every node v in G; each instance stores a node identifier, a preorder value, a postorder value, and an instance type [Figure: graph on nodes R and A to H with a non-tree edge highlighted]

Index creation • Depth-first traversal of G • Tree instances [pre, post]: R [0, 21], A [1, 20], B [2, 7], C [8, 9], D [10, 19], E [3, 4], F [5, 6], G [11, 14], H [15, 18] • Non-tree instances: B [12, 13], A [16, 17]

GRIPP Index Table, IND(G) • Depth-first traversal of G [Figure: graph G labeled with [pre, post] intervals, rooted at R [0, 21]]

Order Tree, O(G) • Tree instances: inner or leaf nodes • Non-tree instances: always leaf nodes [Figure: instances of G plotted by preorder value (x-axis, pre) and postorder value (y-axis, post)]

Reachable Instance Set (RIS) [Figure: order tree O(G) plotted by pre and post values, with a reachable instance set highlighted]

Query Processing • Query: A → H? Yes [Figure: order tree O(G) plotted by pre and post values]

Query Processing • Query: D → C? We must extend the search [Figure: order tree O(G) plotted by pre and post values]

Query Processing • In the worst case, we need to perform (|E| - |V|) queries [Figure: order tree O(G) plotted by pre and post values]

Algorithm (example query: D → C)
Step 1: Retrieve RIS(D). If C ∈ RIS(D), stop the search and return TRUE; else proceed to Step 2.
Step 2: Check every i ∈ RIS(D). If i is a tree instance [G and H], continue; else [A and B], i has no successors in O(G) but possibly in G, so proceed to Step 3.
Step 3: Obtain the tree instance of node i and proceed to Step 1 with it.
Repeat Steps 1 to 3 until an instance of node C is found or no more hop nodes are available.
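The three steps can be sketched end to end on a small graph. This is an illustration under stated assumptions, not the GRIPP implementation: the names `build_gripp_index` and `gripp_reachable` are hypothetical, the instance table is scanned linearly instead of using a range query over a database index, only the "never retrieve a reachable instance set twice" rule is applied, and the edge list (tree edges plus the non-tree edges G → B and H → A) is inferred from the intervals on the slides.

```python
def build_gripp_index(adj, root):
    """Return (tree, instances). tree[v] = (pre, post) of v's tree instance;
    instances holds (pre, post, node, is_tree) for every instance."""
    tree, instances, counter = {}, [], 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        is_tree = node not in tree  # first visit creates the tree instance
        if is_tree:
            tree[node] = None  # mark as visited before descending
            for child in adj.get(node, ()):
                visit(child)
            tree[node] = (pre, counter)
        instances.append((pre, counter, node, is_tree))
        counter += 1

    visit(root)
    return tree, instances

def gripp_reachable(tree, instances, source, target):
    """Search reachable instance sets, hopping through non-tree instances."""
    used, hops = {source}, [source]
    while hops:
        pre, post = tree[hops.pop()]
        for i_pre, _, node, is_tree in instances:
            if pre < i_pre < post:  # instance lies in the RIS
                if node == target:
                    return True
                if not is_tree and node not in used:
                    used.add(node)  # hop node: continue from its tree instance
                    hops.append(node)
    return False

# Edges inferred from the slide's intervals (an assumption).
adj = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
       "D": ["G", "H"], "G": ["B"], "H": ["A"]}
tree, instances = build_gripp_index(adj, "R")
print(gripp_reachable(tree, instances, "D", "C"))
```

For the query D → C, the sketch finds the non-tree instances of A and B in RIS(D), hops to the tree instance of A, and finds C there.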

Pruning Rules • Pruning hop nodes • Simple: never retrieve a reachable instance set twice • Skip: skip already searched areas • Stop: certain nodes cannot have hop nodes in their reachable instance set; this property can be pre-computed

Simple and Skip-Strategy • Keep a list of used hop nodes, U • When examining a new hop node h, there are four possible positions of h relative to u ∈ U, which decide whether RIS(h) is retrieved: • h = u: No • h ∈ RIS(u): No • u ∈ RIS(h): Yes, but skip • h sibling to u: Yes

Stop Nodes • Node s is a stop node iff all non-tree instances in RIS(s) also have their corresponding tree instances in RIS(s) or are instances of s itself • 1. Node A is a stop node → no hop node in RIS(A) can be used • 2. Nodes B, C, E, F, and R are also stop nodes • 3. Nodes D, G, and H are not stop nodes [Figure: order tree O(G) plotted by pre and post values]

Implementation Issues: GRIPP • 1. Introducing one virtual root node r. • 2. Connect r to the node with the highest degree. • 3. Traverse and Label the nodes. • 4. For the remaining part, repeat steps 2 – 3 until all vertices are covered.

Dual Labeling @ICDE 2006 • Naïve Dual Labeling: • Tree Interval-Based Labeling • Link Table: 11 → [1, 8], 16 → [10, 15] [Figure: spanning tree labeled with the intervals [0, 21], [1, 8], [9, 20], [2, 7], [10, 15], [16, 19], [3, 4], [5, 6], [11, 12], [13, 14], [17, 18]]

Query Processing Over Naïve Dual Labeling • Case 1: Query B → F? 17 ∈ [9, 20] • Case 2: Query C → X? 16 → [10, 15], 11 ∈ [10, 15], 11 → [1, 8]: YES [Figure: interval-labeled tree on nodes A, B, C, D, E, F, G, H, I, X, Y]

Query Processing Over Naïve Dual Labeling • Lemma 1. Assume two nodes u and v are labeled [a, b] and [c, d], respectively. There is a path from u to v iff c ∈ [a, b] or the link table contains a series of m non-tree edges i_1 → [j_1, k_1], ..., i_m → [j_m, k_m] such that i_1 ∈ [a, b], c ∈ [j_m, k_m], and i_m' ∈ [j_(m'-1), k_(m'-1)] for all 1 < m' ≤ m.

Transitive Link Table • Size of the link table is O(t), where t = |E(G)| - |V(G)| • Naïve Query Processing: traverse and explore the non-tree edges in an iterative fashion. • Transitive Link Table: given two links i1 → [j1, k1] and i2 → [j2, k2] in the link table, if i2 ∈ [j1, k1], we add a new link i1 → [j2, k2]. • Size of the transitive link table: O(t^2).

Transitive Link Table • An Example: the links 16 → [10, 15] and 11 → [1, 8] yield the new link 16 → [1, 8] [Figure: interval-labeled tree on nodes A, B, C, D, E, F, G, H, I, X, Y]

Query Over Transitive Link Table • Assume two nodes u and v are labeled [a, b] and [c, d], respectively. There is a path from u to v iff 1) c ∈ [a, b]; or 2) the transitive link table contains a link i → [j, k], where i ∈ [a, b] and c ∈ [j, k].
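The closure rule and the two-case query can be sketched directly on intervals. The function names `transitive_links` and `dual_reachable` are hypothetical, intervals are treated as closed as written on the slides, and the closure is computed by naive repetition until no new link appears, which is an assumption about how one might build the table rather than the paper's construction.

```python
def transitive_links(links):
    """Close the link table: from i1 -> [j1, k1] and i2 -> [j2, k2]
    with i2 in [j1, k1], derive i1 -> [j2, k2]."""
    closed = set(links)
    changed = True
    while changed:
        changed = False
        for i1, (j1, k1) in list(closed):
            for i2, target in list(closed):
                if j1 <= i2 <= k1 and (i1, target) not in closed:
                    closed.add((i1, target))
                    changed = True
    return closed

def dual_reachable(u, v, closed):
    """u = (a, b) and v = (c, d) are tree intervals."""
    a, b = u
    c, _ = v
    if a <= c <= b:  # case 1: v lies in u's subtree
        return True
    # case 2: one transitive link leaves u's subtree and covers v
    return any(a <= i <= b and j <= c <= k for i, (j, k) in closed)

# Link table from the slide example.
links = {(11, (1, 8)), (16, (10, 15))}
closed = transitive_links(links)
print(sorted(closed))
print(dual_reachable((16, 19), (1, 8), closed))
```

On the slide's example the closure adds exactly one link, 16 → [1, 8], so the query from the node labeled [16, 19] to the node labeled [1, 8] is answered with a single table lookup.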

N(.,.) • Query: B → G? N(9, 2) - N(9, 20) > 0 → (B → G) TRUE • N(9, 2) = 2, N(9, 20) = 0 [Figure: grid with x-axis ticks 9, 11, 16, 20 and y-axis ticks 1, 2, 7, 8, 10, 15]

Reducing Space Cost • Within each cell, all H(.,.) values are exactly the same. [Figure: grid with x-axis ticks 9, 11, 16, 20 and y-axis ticks 1, 2, 7, 8, 10, 15]

Path-Tree in a Nutshell [Figure: the 15-vertex example DAG decomposed into paths P1 to P4, next to the path-tree over P1 to P4]

Key Problems • How to construct a path-tree? • Algorithm • How can a path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality

Constructing Path-Tree • Step 1: Path-Decomposition of DAG • Step 2: Minimal Equivalent Edge Set between any two paths • Step 3: Path-Graph Construction • Step 4: Path-Tree Cover Extraction

Step 1: Path-Decomposition • Example label: (PID, SID) = (2, 5) • For any two nodes (u, v) in the same path, u → v if and only if u.sid ≤ v.sid • A simple linear algorithm based on topological sort can achieve a path-decomposition [Figure: the 15-vertex example DAG split into paths P1 to P4]

Step 2: Minimal equivalent edge set • The reachability between any two paths can be captured by a unique minimal set of edges • The edges in the minimal equivalent edge set do not cross (always parallel)! [Figure: the edges P1 → P2 between two paths, before and after reduction to the minimal equivalent edge set]

Step 2: Constructing Minimal Equivalent Edge Set (Pi → Pj)
1. Order the vertices in Pi and Pj in decreasing order.
2. Find the first vertex v in Pj that Pi can reach.
3. Find the last vertex u in Pi that reaches v.
4. Remove all the edges that cross (u, v).
5. Repeat steps 2 to 4.
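The effect of the procedure can be illustrated with a dominance filter, which is a swapped-in formulation rather than the slide's exact steps. It assumes each edge is given as a pair (s, t) of sequence ids, with s on Pi and t on Pj, and that a vertex reaches every later vertex on its own path. Under that assumption an edge (s, t) is redundant whenever another edge (s', t') has s' ≥ s and t' ≤ t. The function name `minimal_equivalent_edges` and the sample edges are hypothetical.

```python
def minimal_equivalent_edges(edges):
    """Keep only the edges (s, t) from Pi to Pj that no other edge dominates.
    s and t are positions (sequence ids) on Pi and Pj."""
    kept, best_target = [], None
    # Scan from the last source on Pi backwards; for equal sources,
    # look at the earliest target first.
    for s, t in sorted(set(edges), key=lambda e: (-e[0], e[1])):
        # An edge survives only if it reaches Pj strictly earlier than
        # every edge leaving Pi at or after position s.
        if best_target is None or t < best_target:
            kept.append((s, t))
            best_target = t
    return kept

# Hypothetical edges between two paths.
edges = [(1, 3), (2, 2), (2, 4), (3, 5), (1, 1)]
kept = minimal_equivalent_edges(edges)
# kept == [(3, 5), (2, 2), (1, 1)]
```

In the result, sources and targets decrease together, so no two kept edges cross, which matches the "always parallel" property stated on the previous slide.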

Step 3: Path-Graph Construction • Weight reflects the cost we have to pay for the transitive closure computation if we exclude this path-tree edge [Figure: the example DAG with paths P1 to P4 and the resulting weighted directed path-graph]

Step 4: Extracting Path-Tree Cover • Weighted directed path-graph → maximal directed spanning tree • Chu-Liu/Edmonds algorithm, O(m' + k log k) [Figure: the weighted directed path-graph over P1 to P4 and its maximal directed spanning tree]

Key Problems • How to construct a path-tree? • Algorithm • How can path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality

3-Tuple Labeling for Reachability • Interval labeling (2-tuple): high-level description about paths, Pi → Pj? • DFS labeling (1-tuple) [Figure: path-tree over P1 to P4 with interval labels [1,3], [1,4], [1,1], [2,2], next to the 15-vertex example DAG]
