Reachability Query Over A Large Graph. Institute of Computer Science and Technology of Peking University (北京大学计算机科学技术研究所). Instructor: Lei Zou.
Reachability Query • Problem Definition: Given a single large directed graph G and two vertices u1 and u2, a reachability query verifies whether there exists a directed path from u1 to u2. • ?Query(1,11) • Yes • ?Query(3,9) • No [Figure: example directed graph on vertices 1–15]
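The problem definition can be illustrated with a plain breadth-first search, which answers one query by exploring the graph from u1. This is a minimal sketch, not one of the indexing methods in these slides. The edge list below is hypothetical (the edges of the slide's figure are not recoverable) and was chosen only so that the two example queries give the slide's answers.

```python
from collections import deque

def query(graph, u1, u2):
    """Return True if a directed path from u1 to u2 exists (u1 reaches itself)."""
    seen = {u1}
    queue = deque([u1])
    while queue:
        v = queue.popleft()
        if v == u2:
            return True
        for w in graph.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

# Hypothetical edges, NOT the edges of the slide's figure.
example_graph = {1: [3], 3: [6], 6: [11], 2: [4], 4: [9]}

answer_1_11 = query(example_graph, 1, 11)
answer_3_9 = query(example_graph, 3, 9)
```

Each query costs time linear in the size of the graph, which is what the index structures in the rest of the deck try to avoid.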
Applications • XML Data “//”: ancestor-descendant search • Biological Networks “Finding all genes whose expressions are directly or indirectly influenced by a given molecule” • Paper Citation Network “Given two papers A and B, we want to check whether A is directly or indirectly cited by B”.
Naïve Methods 10,000 nodes and 20,000 edges
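Existing Solutions • Chain Cover [Figure: example graph on vertices 1–8 covered by chains]
Existing Solutions • Tree Cover [Figure: example graph on vertices 1–8 covered by a tree]
Existing Solutions • 2-hop Labeling [Figure: example graph on vertices 1–8 with hop nodes marked]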
Preliminary • A directed acyclic graph (commonly abbreviated to DAG) is a directed graph with no directed cycles. • Any directed graph G can be transformed into a DAG G’ by condensing each “strongly connected component” into one node. • G and G’ have the same “reachability” information.
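The condensation step can be sketched with Kosaraju's two-pass depth-first search. The slides do not say which SCC algorithm is used, so this choice is an assumption. The function name `condense` and the small example graph are illustrative only. Each component is named after one of its members.

```python
def condense(graph):
    """graph: dict mapping every node to a list of its successors.
    Returns (component, dag): component[v] is the representative of v's
    strongly connected component; dag maps representatives to successor sets."""
    # Pass 1: record nodes in order of DFS finishing time.
    visited, finish = set(), []

    def dfs(v):
        visited.add(v)
        for w in graph[v]:
            if w not in visited:
                dfs(w)
        finish.append(v)

    for v in graph:
        if v not in visited:
            dfs(v)

    # Pass 2: DFS on the reversed graph, latest-finished node first.
    reverse = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            reverse[w].append(v)

    component = {}

    def assign(v, root):
        component[v] = root
        for w in reverse[v]:
            if w not in component:
                assign(w, root)

    for v in reversed(finish):
        if v not in component:
            assign(v, v)

    # Keep only the edges that connect two different components.
    dag = {root: set() for root in set(component.values())}
    for v in graph:
        for w in graph[v]:
            if component[v] != component[w]:
                dag[component[v]].add(component[w])
    return component, dag

# Illustrative graph: 1 -> 2 -> 3 -> 1 is a cycle, followed by 3 -> 4 -> 5.
component, dag = condense({1: [2], 2: [3], 3: [1, 4], 4: [5], 5: []})
```

The recursion is fine for small examples; a large graph would need an iterative traversal.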
Preliminary • In graph theory, a topological sort or topological ordering of a directed acyclic graph (DAG) is a linear ordering of its nodes in which each node comes before all nodes to which it has outbound edges. • Every DAG has one or more topological sorts.
Topological Ordering [Figure: a DAG on nodes A–G, with the node sequence A G B D C E F shown below it]
Outline • Chain Cover Chain@TODS1990 • Tree Cover GRIPP @ SIGMOD 2007, Dual Labeling@ICDE2006 • Path-Tree Cover pathtree@SIGMOD 2008 • 2Hop-Labeling HOPI@EDBT2004, FAST2Hop@EDBT2008
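Chain@TODS1990 • Chains: C0 = A B D F, C1 = G C E • Labels (position, chain): A (0, C0), B (1, C0), D (2, C0), F (3, C0), G (0, C1), C (1, C1), E (2, C1) [Figure: the DAG on nodes A–G with each node's chain label]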
Chain@TODS1990 • Can A reach E? [Figure: the same chain-labeled DAG, with A (0, C0) and E (2, C1)]
Chain Decomposition [Figure: the DAG on nodes A–G, with the node sequence A G B D C E F shown below it]
Chain Decomposition • A Simple Method: 1. Perform a topological sort over the DAG. 2. Find v, the smallest vertex in the graph (in the ascending order of the topological sort), and add it to a new path P. 3. Find v’, the smallest vertex in the graph such that there is an edge from v to v’, and add v’ to the path P. 4. Set v to v’ and repeat Step 3 until no v’ can be found. 5. Remove all vertices in P from the DAG. 6. Go back to Step 2 until no vertex is left in the DAG.
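The simple method above can be sketched as follows. The topological sort uses Kahn's algorithm, which is an assumption since the slide does not name one. Vertices are removed as soon as they join a path rather than after the path is complete; in a DAG this gives the same result because a path never revisits a vertex. The example edges are hypothetical, chosen so that the output matches the two chains shown on the Chain@TODS1990 slide.

```python
from collections import deque

def topological_order(graph):
    """Kahn's algorithm; graph maps every node to a list of successors."""
    indegree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indegree[w] += 1
    queue = deque(v for v in graph if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return order

def chain_decomposition(graph):
    rank = {v: i for i, v in enumerate(topological_order(graph))}
    remaining = set(graph)
    chains = []
    while remaining:
        # Step 2: start a new path at the smallest remaining vertex.
        v = min(remaining, key=rank.get)
        chain = [v]
        remaining.discard(v)
        while True:
            # Step 3: smallest remaining successor of the current vertex.
            successors = [w for w in graph[v] if w in remaining]
            if not successors:
                break
            v = min(successors, key=rank.get)
            chain.append(v)
            remaining.discard(v)
        chains.append(chain)
    return chains

# Hypothetical edges for the seven-node example.
graph = {"A": ["B", "G"], "B": ["D"], "D": ["F"], "G": ["C"],
         "C": ["E"], "E": [], "F": []}
chains = chain_decomposition(graph)
```

The method is greedy, so it does not necessarily produce the minimum number of chains.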
GRIPP @ SIGMOD 2007 • Pre- and Postorder Numbering For Trees • Node labeling during depth-first traversal of G. • w is reachable from v iff v_pre < w_pre < v_post. • Example: is E reachable from A? E_pre = 3, A_pre = 1, A_post = 16; 1 < 3 < 16, so true. [Figure: tree with labels R [0, 17]; A [1, 16]; B [2, 7], C [8, 9], D [10, 15]; E [3, 4], F [5, 6], G [11, 12], H [13, 14]]
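The numbering can be reproduced with one counter that is incremented both when the traversal enters a node and when it leaves it. This is a minimal sketch. The parent-child structure below is read off the interval nesting in the slide's labels, and the function names are illustrative.

```python
def label_tree(children, root):
    """Assign (pre, post) numbers during a depth-first traversal."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter          # number taken when entering the node
        counter += 1
        for child in children.get(node, []):
            visit(child)
        labels[node] = (pre, counter)  # number taken when leaving the node
        counter += 1

    visit(root)
    return labels

def reachable(labels, v, w):
    """w is reachable from v iff v_pre < w_pre < v_post."""
    v_pre, v_post = labels[v]
    w_pre, _ = labels[w]
    return v_pre < w_pre < v_post

children = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"], "D": ["G", "H"]}
labels = label_tree(children, "R")
a_reaches_e = reachable(labels, "A", "E")
```

With these labels a tree reachability query is two integer comparisons, independent of the tree size.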
General Idea In GRIPP • Input: a graph G = (V, E) • Output: the GRIPP index table IND(G), with one instance for every node v in G, storing – Node identifier – Preorder value – Postorder value – Instance type [Figure: graph on nodes R, A, B, C, D, E, F, G, H with its non-tree edges marked]
Index creation • Depth-first traversal of G [Figure: DFS tree with labels R [0, 21]; A [1, 20]; B [2, 7], C [8, 9], D [10, 19]; E [3, 4], F [5, 6], G [11, 14], H [15, 18]; non-tree instances [12, 13] and [16, 17]]
GRIPP Index Table, IND(G) • Depth-first traversal of G [Figure: the same labeled DFS tree, R [0, 21] through H [15, 18], with non-tree instances [12, 13] and [16, 17]]
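A minimal sketch of how such an index table could be built: the traversal numbers nodes as before, but when it follows an edge to a node that already has a tree instance, it records a non-tree instance and does not descend. The edges G → B and H → A are assumptions inferred from the positions of the two non-tree intervals, and the row layout `(node, pre, post, kind)` is illustrative.

```python
def build_gripp_index(graph, root):
    """Return rows (node, pre, post, kind), sorted by preorder value."""
    index = []
    seen = set()
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        if node in seen:
            # Node already has a tree instance: record a leaf, do not descend.
            index.append((node, pre, counter, "non-tree"))
            counter += 1
            return
        seen.add(node)
        for child in graph.get(node, []):
            visit(child)
        index.append((node, pre, counter, "tree"))
        counter += 1

    visit(root)
    return sorted(index, key=lambda row: row[1])

# Tree edges read from the slide's labels; G -> B and H -> A are assumed.
graph = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
         "D": ["G", "H"], "G": ["B"], "H": ["A"]}
index = build_gripp_index(graph, "R")
```

Every edge of the graph yields exactly one instance, so the table has one row per edge plus one for the root.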
Order Tree, O(G) • Tree instances: inner or leaf node • Non-tree instances: always leaf node [Figure: left, the DFS tree with labels R [0, 21] through H [15, 18] and non-tree instances [12, 13], [16, 17]; right, every instance plotted by preorder value (x axis, pre) against postorder value (y axis, post)]
Reachable Instance Set (RIS) [Figure: the labeled DFS tree and the pre/post plot of its instances]
Query Processing • Query: A → H? Yes [Figure: the labeled DFS tree and the pre/post plot of its instances]
Query Processing • Query: D → C? • No instance of C lies in RIS(D), so we must extend the search. [Figure: the labeled DFS tree and the pre/post plot of its instances]
Query Processing • Answer: Yes • In the worst case, we need to perform (|E| − |V|) queries. [Figure: the labeled DFS tree and the pre/post plot of its instances]
Algorithm (query D → C) • Step 1: Retrieve RIS(D). If C ∈ RIS(D), stop the search and return TRUE; else proceed to Step 2. • Step 2: Check every i ∈ RIS(D). If i is a tree instance [G and H], continue; else [A and B], i has no successors in O(G), but possibly in G, so proceed to Step 3. • Step 3: Obtain the tree instance of node i and proceed to Step 1. • Repeat steps 1–3 until an instance of node C is found or no more hop nodes are available. [Figure: the pre/post plot of the instances]
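The three steps can be sketched over a hard-coded copy of the index table. The rows use the labels from the slides; which non-tree interval belongs to A and which to B is an assumption inferred from the plot. Only the simplest pruning (never expand the same hop node twice) is applied, and the names `IND`, `TREE`, `ris` and `reachable` are illustrative.

```python
# Rows: (node, pre, post, kind). Non-tree assignment is an assumption.
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non-tree"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non-tree"),
]
TREE = {node: (pre, post) for node, pre, post, kind in IND if kind == "tree"}

def ris(node):
    """Reachable instance set: instances whose pre lies inside node's interval."""
    pre, post = TREE[node]
    return [row for row in IND if pre < row[1] < post]

def reachable(source, target):
    used = set()          # hop nodes whose RIS was already retrieved
    pending = [source]
    while pending:
        hop = pending.pop()
        if hop in used:
            continue
        used.add(hop)
        for node, _, _, kind in ris(hop):      # Step 1
            if node == target:
                return True
            if kind == "non-tree" and node not in used:
                pending.append(node)           # Steps 2 and 3: new hop node
    return False

a_to_h = reachable("A", "H")
d_to_c = reachable("D", "C")
```

For D → C the first retrieval finds only G, H and the non-tree instances of A and B; the search then hops to the tree instance of one of them and finds C there.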
Pruning Rules • Pruning hop nodes • Simple – Never retrieve reachable instance sets twice • Skip – Skip already searched areas • Stop – Certain nodes cannot have hop nodes in their reachable instance set – This property can be pre-computed
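Simple and Skip-Strategy • Keep a list of used hop nodes, U • When examining a new hop node h, there are four possible positions of h relative to u ∈ U, which decide whether RIS(h) is retrieved: • h = u: No • h ∈ RIS(u): No (RIS(h) lies inside the already retrieved RIS(u)) • u ∈ RIS(h): Yes, but skip the area already covered by RIS(u) • h sibling to u: Yes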
Stop Nodes • Node s is a stop node iff every non-tree instance in RIS(s) either has its corresponding tree instance in RIS(s) or is an instance of s itself. 1. Node A is a stop node: no hop node in RIS(A) can be used. 2. Nodes B, C, E, F, and R are also stop nodes. 3. Nodes D, G, and H are not stop nodes. [Figure: the pre/post plot of the instances]
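The stop-node property can be pre-computed directly from the index table. A minimal sketch, using the same hard-coded rows as the slides' example (the assignment of the two non-tree intervals to A and B is an assumption); `is_stop_node` is an illustrative name.

```python
# Rows: (node, pre, post, kind). Non-tree assignment is an assumption.
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non-tree"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non-tree"),
]
TREE = {node: (pre, post) for node, pre, post, kind in IND if kind == "tree"}

def ris(node):
    pre, post = TREE[node]
    return [row for row in IND if pre < row[1] < post]

def is_stop_node(s):
    inside = ris(s)
    tree_nodes = {node for node, _, _, kind in inside if kind == "tree"}
    # Every non-tree instance must be s itself or have its tree instance inside.
    return all(node == s or node in tree_nodes
               for node, _, _, kind in inside if kind == "non-tree")

stop_nodes = sorted(n for n in TREE if is_stop_node(n))
non_stop_nodes = sorted(n for n in TREE if not is_stop_node(n))
```

The result agrees with the slide: A, B, C, E, F and R are stop nodes, while D, G and H are not.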
Implementation Issues: GRIPP • 1. Introducing one virtual root node r. • 2. Connect r to the node with the highest degree. • 3. Traverse and Label the nodes. • 4. For the remaining part, repeat steps 2 – 3 until all vertices are covered.
Dual Labeling @ICDE 2006 • Naïve Dual Labeling: • Tree Interval-Based Labeling • Link Table: 11 → [1, 8], 16 → [10, 15] [Figure: tree with interval labels [0, 21]; [1, 8], [9, 20]; [2, 7], [10, 15], [16, 19]; [3, 4], [5, 6], [11, 12], [13, 14], [17, 18]; nodes X and Y marked]
Query Processing Over Naïve Dual Labeling • Case 1: Query B → F? 17 ∈ [9, 20] • Case 2: Query C → X? 16 → [10, 15], 11 ∈ [10, 15], 11 → [1, 8]: YES [Figure: the interval-labeled tree over nodes A, B, C, D, E, F, G, H, I, X, Y with the link table 11 → [1, 8], 16 → [10, 15]]
Query Processing Over Naïve Dual Labeling • Lemma 1. Assume two nodes u and v are labeled [a, b] and [c, d], respectively. There is a path from u to v iff c ∈ [a, b] or the link table contains a series of m non-tree edges i_1 → [j_1, k_1], …, i_m → [j_m, k_m] (1) such that i_1 ∈ [a, b], c ∈ [j_m, k_m], and i_m’ ∈ [j_(m’−1), k_(m’−1)] for all 1 < m’ ≤ m.
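Lemma 1 translates into a search that follows links from interval to interval. A minimal sketch: the two links and the query intervals are taken from the slides' worked cases, intervals are treated as closed as written in the lemma, and the function name is illustrative.

```python
# Link table from the slides: i -> [j, k] stored as (i, (j, k)).
LINKS = [(11, (1, 8)), (16, (10, 15))]

def reachable(u, v, links=LINKS):
    """u = (a, b) and v = (c, d) are interval labels; follows Lemma 1."""
    a, b = u
    c, _ = v
    frontier = [(a, b)]
    used = set()                      # indices of links already followed
    while frontier:
        lo, hi = frontier.pop()
        if lo <= c <= hi:             # c falls in the current interval
            return True
        for n, (i, (j, k)) in enumerate(links):
            if n not in used and lo <= i <= hi:
                used.add(n)
                frontier.append((j, k))
    return False

case_1 = reachable((9, 20), (17, 18))   # 17 is in [9, 20]
case_2 = reachable((16, 19), (1, 8))    # 16 -> [10, 15], then 11 -> [1, 8]
```

Each link is followed at most once, so one query touches the link table at most t times, where t is the number of links.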
Transitive Link Table • Size of the link table is O(t), where t = |E(G)| − |V(G)| • Naïve Query Processing: traversing and exploring the non-tree edges in an iterative fashion. • Transitive Link Table: given two links i1 → [j1, k1] and i2 → [j2, k2] in the link table, if i2 ∈ [j1, k1], we add a new link i1 → [j2, k2]. • Size of the transitive link table: O(t^2).
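The closure rule can be applied repeatedly until no new link appears. A minimal sketch, using the two links from the slides' example as input; the fixed-point loop is a straightforward reading of the rule, not necessarily how the original paper computes the table.

```python
def transitive_link_table(links):
    """links: iterable of (i, (j, k)) meaning i -> [j, k]."""
    table = set(links)
    changed = True
    while changed:
        changed = False
        for i1, (j1, k1) in list(table):
            for i2, (j2, k2) in list(table):
                new_link = (i1, (j2, k2))
                # Rule: if i2 lies in [j1, k1], add i1 -> [j2, k2].
                if j1 <= i2 <= k1 and new_link not in table:
                    table.add(new_link)
                    changed = True
    return table

table = transitive_link_table([(11, (1, 8)), (16, (10, 15))])
```

Here 11 ∈ [10, 15], so the closure adds the single link 16 → [1, 8].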
Transitive Link Table • An Example: the link table holds 16 → [10, 15] and 11 → [1, 8]; the transitive link table adds 16 → [1, 8]. [Figure: the interval-labeled tree over nodes A, B, C, D, E, F, G, H, I, X, Y]
Query Over Transitive Link Table • Assume two nodes u and v are labeled [a, b] and [c, d], respectively. There is a path from u to v iff 1) c ∈ [a, b]; or 2) the transitive link table contains a link i → [j, k], where i ∈ [a, b] AND c ∈ [j, k]. [Figure: number line showing i between a and b, and c between j and k]
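With the transitive link table, a query needs no iteration: one pass over the table is enough. A minimal sketch over the three links of the slides' example, with closed intervals as written on the slide; a linear scan is used here for clarity, and the function name is illustrative.

```python
# Transitive link table of the example: i -> [j, k] stored as (i, (j, k)).
TRANSITIVE_LINKS = [(11, (1, 8)), (16, (10, 15)), (16, (1, 8))]

def reachable(u, v, table=TRANSITIVE_LINKS):
    a, b = u
    c, _ = v
    if a <= c <= b:                       # condition 1: tree reachability
        return True
    # condition 2: one link with i in [a, b] and c in [j, k]
    return any(a <= i <= b and j <= c <= k for i, (j, k) in table)

c_to_x = reachable((16, 19), (1, 8))
b_to_g = reachable((9, 20), (2, 7))
```

The scan costs time proportional to the table size, which can be quadratic in t.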
N(·,·) • Query: B → G? N(9, 2) − N(9, 20) > 0, so (B → G) is TRUE • N(9, 2) = 2, N(9, 20) = 0 [Figure: grid over the link table with x-axis ticks 9, 11, 16, 20 and y-axis ticks 1, 2, 7, 8, 10, 15]
Reducing Space Cost • In each cell of the grid, all H(·,·) values are exactly the same. [Figure: the same grid with x-axis ticks 9, 11, 16, 20 and y-axis ticks 1, 2, 7, 8, 10, 15, with marked cells]
Path-Tree in a Nutshell [Figure: a DAG on vertices 1–15 decomposed into paths P1–P4, next to the tree formed over P1–P4]
Key Problems • How to construct a path-tree? • Algorithm • How can a path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality
Constructing Path-Tree • Step 1: Path-Decomposition of DAG • Step 2: Minimal Equivalent Edge Set between any two paths • Step 3: Path-Graph Construction • Step 4: Path-Tree Cover Extraction
Step 1: Path-Decomposition • Each vertex is labeled (PID, SID), e.g. (PID, SID) = (2, 5) • For any two nodes (u, v) in the same path, u → v if and only if (u.sid ≤ v.sid) • A simple linear algorithm based on topological sort can achieve a path-decomposition [Figure: the DAG on vertices 1–15 decomposed into paths P1–P4]
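The (PID, SID) labels can be assigned by numbering paths and positions. A minimal sketch: the two paths below are hypothetical (the figure's actual paths are not recoverable), numbering starts at 1, and the same-path test assumes the comparison is u.sid ≤ v.sid.

```python
def label_paths(paths):
    """Map each vertex to (pid, sid): its path number and position on that path."""
    labels = {}
    for pid, path in enumerate(paths, start=1):
        for sid, vertex in enumerate(path, start=1):
            labels[vertex] = (pid, sid)
    return labels

def reaches_on_same_path(labels, u, v):
    (u_pid, u_sid), (v_pid, v_sid) = labels[u], labels[v]
    if u_pid != v_pid:
        raise ValueError("u and v are on different paths")
    return u_sid <= v_sid   # assumed comparison

# Hypothetical paths, not the decomposition shown in the figure.
paths = [[1, 3, 6, 13, 14, 15], [2, 4, 7, 10, 11]]
labels = label_paths(paths)
```

Reachability across different paths is what the later steps of the path-tree construction handle.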
Step 2: Minimal equivalent edge set • The reachability between any two paths can be captured by a unique minimal set of edges. • The edges in the minimal equivalent edge set do not cross (always parallel)! [Figure: two paths P1 and P2 shown twice, with all edges between them and with only the minimal equivalent edge set]
Step 2: Constructing Minimal Equivalent Edge Set (Pi → Pj) 1. Order the vertices in Pi and Pj in decreasing order. 2. Find the first vertex v in Pj that Pi can reach. 3. Find the last vertex u in Pi that reaches v. 4. Remove all the edges that cross (u, v). 5. Repeat steps 2–4. [Figure: paths P1 and P2 with the edges between them]
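One way to obtain a non-crossing edge set is a single sweep instead of the iterative removal listed above; this swapped-in formulation is an assumption, not the slide's procedure. Each edge is given as (position on Pi, position on Pj), with positions increasing along the direction of each path. An edge is redundant when another edge leaves Pi no earlier and enters Pj no later, because the path segments before and after that other edge already provide the connection. The example edges are hypothetical.

```python
def minimal_equivalent_edges(edges):
    """edges: iterable of (src_pos, dst_pos) between two paths.
    Keeps only the edges that are not implied by another edge."""
    kept = []
    best_target = float("inf")
    # Sweep from the last source position to the first; ties by earliest target.
    for src, dst in sorted(set(edges), key=lambda e: (-e[0], e[1])):
        if dst < best_target:     # reaches strictly earlier than any later source
            kept.append((src, dst))
            best_target = dst
    return sorted(kept)

# Hypothetical edges between two paths.
edges = [(1, 2), (1, 4), (3, 3), (2, 5), (4, 6)]
minimal = minimal_equivalent_edges(edges)
```

In the result, larger source positions always go with larger target positions, so no two kept edges cross.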
Step 3: Path-Graph Construction • Weight reflects the cost we have to pay for the transitive closure computation if we exclude this path-tree edge. [Figure: the DAG on vertices 1–15 with paths P1–P4, and the resulting weighted directed path-graph over P1–P4]
Step 4: Extracting Path-Tree Cover • Maximal directed spanning tree of the weighted directed path-graph • Chu-Liu/Edmonds algorithm, O(m’ + k log k) [Figure: the weighted directed path-graph over P1–P4 and its maximal directed spanning tree]
3-Tuple Labeling for Reachability • Interval labeling (2-tuple): high-level description about paths, answering Pi → Pj? • DFS labeling (1-tuple) [Figure: the DAG on vertices 1–15 with paths P1–P4, and the path-tree over P1–P4 with interval labels [1, 3], [1, 4], [1, 1], [2, 2]]