Managing XML and Semistructured Data


Presentation Transcript


  1. Managing XML and Semistructured Data Lecture : Indexes

  2. OEM vs. XML • OEM’s objects correspond to elements in XML. • Sub-elements in XML are inherently ordered. • XML elements may optionally include a list of attribute-value pairs. • Graph structure (nodes with multiple incoming edges) is specified in XML with references (ID, IDREF attributes), e.g. the project attribute.

  3. OEM to XML • Example: <member project="&5 &6"> <name>Jones</name> <age>46</age> <office> <building>gates</building> <room>252</room> </office> </member> • This corresponds to the rightmost member in the example OEM, where project is an attribute.

  4. Select x From A.B x Where exists y in x.C: y = 5

  5. In this lecture • Indexes • XSet • Region algebras • Indexes for Arbitrary Semistructured Data • Dataguides • 1-2 indexes • Resources: • Index Structures for Path Expressions by Milo and Suciu, in ICDT'99 • XSet description: http://www.openhealth.org/XSet/ • Data on the Web by Abiteboul, Buneman, Suciu: section 8.2

  6. The problem • Input: large, irregular data graph • Output: index structure for evaluating regular path expressions

  7. The Data Semistructured data instance = a large graph

  8. The queries • Regular expressions (using Lorel-like syntax): Select X From (Bib.*.author).(lastname|firstname).Abiteboul X • Select x From part._*.supplier.name x: requires traversing the data from the root and returning all nodes x reachable by a path matching the given path expression. • Select X From part._*.supplier: {name: X, address: “Philadelphia”}: needs an index on values to narrow the search to the parts of the database that contain the string “Philadelphia”.

  9. Analyzing the problem • what kind of data • tree data (XML): easier to index • graph data: used in more complex applications • what kind of queries • restricted regular expressions (e.g. XPath): may be more efficient

  10. XSet: a simple index for XML • Part of the Ninja project at Berkeley • Example XML data:

  11. XSet: a simple index for XML Each node = a hashtable Each entry = list of pointers to data nodes (not shown)

  12. XSet: Efficient query evaluation • (R1) SELECT X FROM part.name X: yes • (R2) SELECT X FROM part.supplier.name X: yes • (R3) SELECT X FROM *.supplier.name X: maybe • (R4) SELECT X FROM part.*.subpart.name X: maybe • To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name. • R4: following part leads to h2; traverse all index nodes below h2 (corresponding to *), then continue with the path subpart.name. Thus we explore the entire subtree dominated by h2. • This is efficient if the index is small and fits in memory. • R3: the leading wildcard forces us to consider all nodes in the index tree, which is less efficient than R4. • Remedy: index the index itself. Retrieve all hash tables that contain a supplier entry and continue a normal search from there.
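The evaluation strategy on this slide can be sketched in Python. This is a minimal illustration, not the XSet implementation: the class `IndexNode`, the functions `build_index` and `evaluate`, and the sample paths and data-node names (`n1` … `n4`) are all assumptions made for the example.

```python
class IndexNode:
    """One index node: a hash table from label to child index node."""
    def __init__(self):
        self.children = {}   # label -> IndexNode
        self.extent = []     # pointers to data nodes (plain strings here)


def build_index(paths):
    """Build the index tree from (label_path, data_node) pairs."""
    root = IndexNode()
    for labels, data_node in paths:
        node = root
        for label in labels:
            node = node.children.setdefault(label, IndexNode())
        node.extent.append(data_node)
    return root


def descendants_or_self(node):
    """Yield node and every index node below it (what '*' has to visit)."""
    stack = [node]
    while stack:
        n = stack.pop()
        yield n
        stack.extend(n.children.values())


def evaluate(root, steps):
    """Evaluate a path such as ['part', '*', 'subpart', 'name'] on the index."""
    current = [root]
    for step in steps:
        if step == "*":
            candidates = [d for n in current for d in descendants_or_self(n)]
        else:
            # A plain label is a single hash-table lookup per current node.
            candidates = [n.children[step] for n in current if step in n.children]
        # Remove duplicates (possible after repeated wildcards), keep order.
        current = list({id(n): n for n in candidates}.values())
    return [p for n in current for p in n.extent]


# Hypothetical data: label paths from the root to four name nodes.
index = build_index([
    (("part", "name"), "n1"),
    (("part", "supplier", "name"), "n2"),
    (("part", "subpart", "name"), "n3"),
    (("part", "subpart", "subpart", "name"), "n4"),
])
r1 = evaluate(index, ["part", "name"])
r3 = evaluate(index, ["*", "supplier", "name"])
r4 = sorted(evaluate(index, ["part", "*", "subpart", "name"]))
```

R1 costs two hash lookups, while R3 and R4 walk a whole subtree of the index, which is why the slide marks them "maybe".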

  13. Region Algebras • Structured text = text with tags (like XML) • New Oxford English Dictionary • Critical limitation: ordered data only (like text) • Assume: data is given as an XML text file, with the ordering implicit in the file. • Less critical limitation: restricted regular expressions

  14. Region Algebras: Definitions • data = sequence of characters [c_1 c_2 c_3 …] • region = segment of the text in a file • representation (x,y) = [c_x, c_(x+1), …, c_y], where x is the start position and y the end position of the region • example: <section> … </section> • region set = a set of regions s.t. any two regions are either disjoint or one is included in the other • example: all <section> regions (may be nested) • Tree data: each node defines a region and each set of nodes defines a region set. • example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions
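The region-set condition can be checked mechanically. The sketch below assumes regions are `(start, end)` tuples with inclusive positions; the function name `is_region_set` is made up for the example.

```python
from itertools import combinations


def is_region_set(regions):
    """True if any two regions are either disjoint or nested."""
    for (x1, y1), (x2, y2) in combinations(regions, 2):
        disjoint = y1 < x2 or y2 < x1
        nested = (x1 <= x2 and y2 <= y1) or (x2 <= x1 and y1 <= y2)
        if not (disjoint or nested):
            return False
    return True


nested_ok = is_region_set({(1, 10), (2, 5), (6, 9)})   # nested and disjoint regions
overlap_bad = is_region_set({(1, 5), (4, 8)})          # partial overlap
```

Regions that come from well-formed tags always pass this test, because tags cannot partially overlap.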

  15. Representation of a region set • Example: the <subpart> region set: • region algebra = operators on region sets; s1 op s2 defines a new region set

  16. Region algebra: some operators • s1 intersect s2 = {r | r ∈ s1, r ∈ s2} • s1 included s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r ⊆ r′} • s1 including s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r ⊇ r′} • s1 parent s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r is a parent of r′} • s1 child s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r is a child of r′} • Examples: <subpart> included <part> = {s1, s2, s3, s5} • <part> including <subpart> = {p2, p3} • <name> child <part> = {n1, n3, n12}
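A possible Python rendering of these operators, as a sketch only: regions are assumed to be `(start, end)` tuples, "parent" is taken to mean the smallest enclosing region among all known regions (passed in as `universe`), and the positions in the sample document are invented, not taken from the slides' figure.

```python
def contains(outer, inner):
    """True if region inner lies within region outer (non-strict)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersect(s1, s2):
    return s1 & s2


def included(s1, s2):
    return {r for r in s1 if any(contains(r2, r) for r2 in s2)}


def including(s1, s2):
    return {r for r in s1 if any(contains(r, r2) for r2 in s2)}


def is_parent(r, r2, universe):
    """r is the parent of r2: r encloses r2 and no other region lies between."""
    if r == r2 or not contains(r, r2):
        return False
    return not any(
        x != r and x != r2 and contains(r, x) and contains(x, r2)
        for x in universe
    )


def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, r2, universe) for r2 in s2)}


def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r2, r, universe) for r2 in s2)}


# Hypothetical document: two parts; the first holds two nested subparts.
root = {(0, 100)}
part = {(1, 40), (41, 60)}
subpart = {(6, 30), (11, 25)}
name = {(2, 5), (7, 10), (12, 15), (42, 45)}
universe = root | part | subpart | name

subparts_in_parts = included(subpart, part)
parts_with_subparts = including(part, subpart)
part_names = child(name, part, universe)
```

Note that `parent` and `child` need the full set of regions, since whether one region is the direct parent of another depends on what lies in between.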

  17. From path expressions to region expressions • Use region algebra operators to answer regular path expressions. • Only restricted forms of regular path expressions can be translated into region algebra operators: expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *. • part.name ⇒ name child (part child root) • part.supplier.name ⇒ name child (supplier child (part child root)) • *.supplier.name ⇒ name child supplier • part.*.subpart.name ⇒ name child (subpart included (part child root)) • Region expressions correspond to simple XPath expressions
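The four translations above follow a simple pattern, which the sketch below turns into a string rewriter. The rules are inferred from the slide's examples rather than quoted from a specification: a label directly after another label becomes `child`, a label after `*` becomes `included`, and a leading `*` drops the anchoring to `root`. The function name `to_region_expr` is an assumption.

```python
def to_region_expr(path):
    """Translate R1.R2...Rn (labels or '*') into a region algebra expression."""
    steps = path.split(".")
    if steps[-1] == "*":
        raise ValueError("the path must end with a label")
    expr = None
    after_star = False
    for step in steps:
        if step == "*":
            after_star = True
            continue
        if expr is None:
            # First label: anchored at the root unless preceded by '*'.
            expr = step if after_star else f"{step} child root"
        else:
            op = "included" if after_star else "child"
            operand = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {operand}"
        after_star = False
    return expr


translated = to_region_expr("part.*.subpart.name")
```

Each output is a plain string; evaluating it would still require region sets for every label and an implementation of the operators.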

  18. From path expressions to region expressions • Answering more complex queries: Select X From *.subpart: {name: X, *.supplier.address: “Philadelphia”} • Translates into the following region algebra expression: name child (subpart including (supplier parent (address intersect “Philadelphia”))) • “Philadelphia” denotes a region set consisting of all regions corresponding to the word “Philadelphia” in the text. • Such a region set can be computed dynamically using a full-text index. • Region expressions correspond to simple XPath expressions

  19. Indexes for Arbitrary Semistructured Data • A semistructured data instance that is a DAG

  20. Indexes for Arbitrary Semistructured Data • The data represents employees and projects in a company. • Two kinds of employees: programmers and statisticians • Three kinds of links to projects: leads, workson, consults • Index graph: a reduced graph that summarizes all paths from the root in the data graph • Example: node p1 is reached from the root by paths labeled with the following five sequences: project; employee.leads; employee.workson; programmer.employee.leads; programmer.employee.workson • Node p2 is reached by paths labeled with the same five sequences • p1 and p2 are language-equivalent

  21. Indexes for Arbitrary Semistructured Data • For each node x in the data graph G, Lx = {w | ∃ a path from the root to x labeled w} • Note that Lx can be infinite if the graph has a cycle! • Two nodes x and y are language-equivalent: x ≡ y ⇔ Lx = Ly • Equivalence class of x: [x] = {y | x ≡ y} • The index I: Nodes(I) = {[x] | x ∈ Nodes(G)}, Edges(I) = {[x] -a-> [y] | x′ ∈ [x], y′ ∈ [y], (x′ -a-> y′) ∈ Edges(G)}
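For acyclic data the definition can be applied literally: compute Lx for every node, group nodes with equal languages, and connect the groups. The sketch below assumes a DAG with a single root given as `(source, label, target)` triples; the function name and the small employee/project graph are invented for illustration and are not the slide's figure.

```python
from collections import defaultdict
from functools import lru_cache


def build_language_index(root, edges):
    """Return (index_nodes, index_edges); each index node is a frozenset of data nodes."""
    incoming = defaultdict(list)
    nodes = {root}
    for src, label, dst in edges:
        incoming[dst].append((src, label))
        nodes.update((src, dst))

    @lru_cache(maxsize=None)
    def language(x):
        # L_x: all label sequences on paths from the root to x (finite on a DAG).
        words = {()} if x == root else set()
        for src, label in incoming[x]:
            words |= {w + (label,) for w in language(src)}
        return frozenset(words)

    classes = defaultdict(set)
    for x in nodes:
        classes[language(x)].add(x)
    class_of = {x: frozenset(classes[language(x)]) for x in nodes}

    index_nodes = set(class_of.values())
    index_edges = {(class_of[s], a, class_of[d]) for s, a, d in edges}
    return index_nodes, index_edges


# Hypothetical data graph with eight edges.
data_edges = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("root", "project", "p1"), ("root", "project", "p2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p2"), ("e2", "workson", "p1"),
]
index_nodes, index_edges = build_language_index("root", data_edges)
```

Here five data nodes collapse into three index nodes and eight edges into four. Enumerating Lx explicitly is exponential in the worst case and impossible on cyclic graphs, which illustrates why checking language equivalence is considered expensive.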

  22. Indexes for Arbitrary Semistructured Data • We have the following equivalences: e1 ≡ e2; e3 ≡ e4 ≡ e5; p1 ≡ p2; p3 ≡ p4; p5 ≡ p6 ≡ p7

  23. Indexes for Arbitrary Semistructured Data • Computing path expression queries: • Compute the query on I and obtain a set of index nodes • Compute the union of their extents (an extent is a list of pointers to all data nodes in the equivalence class) • Example: Select X From statistician.employee.(leads|consults): X • Returns index nodes h8, h9. • Their extents are [p5, p6, p7] and [p8], respectively; • result set = [p5, p6, p7, p8] • Always: size(I) ≤ size(G) • Efficient when I can be stored in main memory • Checking x ≡ y is expensive.
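The two evaluation steps can be sketched as follows. Only the index nodes h8 and h9 and their extents come from the slide; the other index node names (`h1`, `h3`, `h5`, `h7`), the shape of the index fragment, and the function name are assumptions made so that the example runs.

```python
def evaluate_on_index(index_edges, extents, root, steps):
    """Run a path query on the index graph, then union the extents.

    steps is a list of label sets; an alternation such as (leads|consults)
    is written as {"leads", "consults"}.
    """
    current = {root}
    for labels in steps:
        current = {
            dst for (src, label, dst) in index_edges
            if src in current and label in labels
        }
    index_hits = sorted(current)
    result = [p for h in index_hits for p in extents.get(h, [])]
    return index_hits, result


# Hypothetical fragment of the index graph.
index_edges = [
    ("h1", "statistician", "h3"),
    ("h3", "employee", "h5"),
    ("h5", "workson", "h7"),
    ("h5", "leads", "h8"),
    ("h5", "consults", "h9"),
]
extents = {"h7": ["p3", "p4"], "h8": ["p5", "p6", "p7"], "h9": ["p8"]}

hits, answer = evaluate_on_index(
    index_edges, extents, "h1",
    [{"statistician"}, {"employee"}, {"leads", "consults"}],
)
```

The data graph is never touched during the traversal; it is only reached through the extents of the final index nodes.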

  24. DataGuides • Goldman & Widom [VLDB 97] • graph data • arbitrary regular expressions

  25. DataGuides Definition given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.: - every path in DB also occurs in G - every path in G occurs in DB - every path in G is unique

  26. Dataguides Example:

  27. DataGuides • Multiple DataGuides for the same data:

  28. DataGuides • Definition: let w, w′ be two words (i.e. word queries) and G a graph; w ≡_G w′ if w(G) = w′(G) • Definition: G is a strong dataguide for a database DB if ≡_G is the same as ≡_DB

  29. DataGuides Example: • G1 is a strong dataguide • G2 is not strong: person.project ≢_DB dept.project, but person.project ≡_G2 dept.project

  30. DataGuides • Constructing the strong DataGuide G: Nodes(G) = {{root}}; Edges(G) = ∅; while changes do: choose s in Nodes(G), a in Labels; let s′ = {y | x in s, (x -a-> y) in Edges(DB)}; if s′ is nonempty, add s′ to Nodes(G) and add (s -a-> s′) to Edges(G) • Use a hash table for Nodes(G)
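The construction above is essentially the subset construction used to determinize an automaton. A minimal Python sketch, assuming the database is a list of `(source, label, target)` triples; the function name and the sample data are hypothetical. A Python `set` of `frozenset`s plays the role of the hash table for Nodes(G).

```python
from collections import defaultdict, deque


def strong_dataguide(root, edges):
    """Return (guide_nodes, guide_edges); each guide node is a frozenset (its extent)."""
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    start = frozenset({root})
    guide_nodes = {start}          # hash table of the extents seen so far
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        # s' for every label a: all targets of a-edges leaving some x in s.
        by_label = defaultdict(set)
        for x in s:
            for label, targets in out.get(x, {}).items():
                by_label[label] |= targets
        for label, targets in by_label.items():
            s_next = frozenset(targets)
            guide_edges.add((s, label, s_next))
            if s_next not in guide_nodes:
                guide_nodes.add(s_next)
                queue.append(s_next)
    return guide_nodes, guide_edges


# Hypothetical database: project p1 is shared by a person and a department.
db_edges = [
    ("root", "person", "a"), ("root", "person", "b"), ("root", "dept", "d"),
    ("a", "project", "p1"), ("b", "project", "p2"), ("d", "project", "p1"),
]
guide_nodes, guide_edges = strong_dataguide("root", db_edges)
```

The loop terminates on cyclic data too, because there are only finitely many node sets, but the number of guide nodes can then grow exponentially. In this example p1 appears in two extents, `{p1, p2}` and `{p1}`, which cannot happen when the database is a tree.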

  31. DataGuides • How large are the dataguides? • If DB is a tree, then size(G) <= size(DB) • Why? Answer: every node is in exactly one extent of G • Here: dataguide = XSet • Dataguides usually fail on data with cyclic schemas, like:

  32. T-Indexes • Milo & Suciu [ICDT 99] • 1-index: • data graph • arbitrary regular expressions • 2-index, T-index: for more complex queries, consisting of several regular expressions.

  33. T-Indexes • T-index: template index • Trades space for generality • The class of paths associated with a given T-index is specified by a path template • Example 1: P x P y. Here P can be replaced by any regular expression. • Example 2: (*.Restaurant) x P y. The first regular expression is fixed; this T-index takes less space but is less general. • T-indexes can be generated efficiently. • The size of a T-index associated with a single regular expression is at most linear in that of the database

  34. 1-Indexes • Database: DB = (V, E, Roots), where V is a finite set of nodes, E is a set of labeled edges, and Roots is a set of root nodes. • Regular path expressions: P ::=  |  | ƒ | (P|P) | (P.P) | P.* where ƒ are formulas defined over predicates p1, p2, … on the set of data values. • A path: p = v0 -a1-> v1 -a2-> v2 … vn-1 -an-> vn • Queries: regular path expressions q(DB) • A query path is an expression of the form P1 x1 P2 x2 … Pn xn, where the xi are variable names and the Pi are path expressions • A query has the form Select x1, x2, …, xn from P1 x1 P2 x2 … Pn xn

  35. 1-Indexes • Path template t = T1 x1 T2 x2 … Tn xn, where each Ti is a regular expression or one of the placeholders P or F • Instantiating query paths: a query path q is obtained from template t by instantiating each P with a regular path expression and each F with a formula • Example: path template t = (*.Restaurant) x1 P x2 Name x3 F x4 • Query path instantiations: • q1 = (*.Restaurant) x1 * x2 Name x3 Fridays x4 • q2 = (*.Restaurant) x1 * x2 Name x3 _ x4 ( _ is the predicate that is always True) • q3 = (*.Restaurant) x1 (ε | _) x2 Name x3 Fridays x4

  36. 1-Indexes • Goal: efficiently compute queries q ∈ inst(P x) • A first attempt: • Lu is the set of words labeling paths from a root to u, that is, all the path queries that lead to u: ∀u ∈ V. Lu = {a1…an | ∃ path v0 -a1-> … -an-> vn in DB, v0 ∈ Roots, vn = u} • ∀u,v ∈ V. u ≡ v ⇔ Lu = Lv; that is, u and v are indistinguishable by path queries from the root. • ∀u ∈ V. [u] = {v | u ≡ v} is the equivalence class containing u
