1 / 45

Efficient XML Storage, Query, and Update

Efficient XML Storage, Query, and Update . Shi Xu Heng Yuan Spring 2004 CS240B Prof. Zaniolo. XML Storage Methods. Flat Streams Metamodeling Mixed Redundant Hybrid. Method Covered. “Efficient storage of XML data” covers hybrid method using a custom made storage system called Natix.

Download Presentation

Efficient XML Storage, Query, and Update

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient XML Storage, Query, and Update Shi Xu Heng Yuan Spring 2004 CS240B Prof. Zaniolo

  2. XML Storage Methods • Flat Streams • Metamodeling • Mixed • Redundant • Hybrid

  3. Method Covered • “Efficient storage of XML data” covers hybrid method using a custom made storage system called Natix. • “Efficient relational storage and retrieval of XML documents” covers Metamodeling using their Monet database.

  4. Natix Overview • Natix is an efficient, native repository for storing, retrieving and managing XML documents. • It supports tree-structured objects like XML documents at low architecture level.

  5. Natix architectural overview

  6. Logic Model • Tree is often used in logic model of semistructured data. • Each non-leaf node is labeled with a symbol taken from an alphabet DTD. • Leaf nodes can be labeled as the data itself.

  7. A sample XML with its associated logical tree Example XML: <SPEECH> <SPEAKER>OTHELLO</SPEAKER> <LINE>Let me see your eyes;</LINE> <LINE>Look in my face.</LINE> </SPEECH>

  8. Physical Model • Object Content: • Node and objects are used interchangeably. • A record contains a set of nodes/objects. • Aggregate nodes are inner nodes of the tree. They contain their respective child nodes. • Literal nodes are leaf nodes containing an uninterpreted stream of bytes, like text strings, graphics, etc. • Proxy nodes are nodes which point to different records.

  9. Node Representation • Whole documents (or subtrees of documents) can be stored in one record. • Each record contains exactly one subtree. • The root nodes of each record’s subtree are called standalone objects, other nodes are called embedded objects. • The record size has an upper limit, the page size.

  10. Large Trees • For a large tree, physical model must provide a mechanism for distributing data trees over several pages. • Method 1: “flat” representation. It wastes the available structural information about the data. • Method 2: split large objects based on the underlying tree structure. • Use proxy objects to connect subtrees of the large object residing in other records.

  11. A Sample Distribution of logical nodes on records • Proxies (p1, p2) • Helper aggregate objects (h1, h2) • Scaffolding objects include proxies and helper aggregates. • Facade objects (f i)

  12. Dynamic maintenance of an efficient storage • The principle problem is that a record containing a subtree can grow larger than a page if a node is added or grows. • Subtree contains in the record has to be partitioned into several subtrees. • Scaffolding nodes link the new records together in the physical tree.

  13. Multiway tree representation of records

  14. Tree Growth Procedure • Step 1: Determine the record r into which the node has to be inserted. • Step 2: If there is not enough on the page, try to move r. If the record still does not fit, split the record: • (a) Determine the separator by recursively descending into the r’s subtree • (b) Distribute the resulting partitions onto records • (c) Insert the separator into the parent record, recursively calling this procedure • Step 3: Insert the new node

  15. Determining the Insertion Location • There are several possibilities to insert a new node f n into the physical tree. • This choice can be determined by a configuration parameters.

  16. Determining the separator • Separator – a tree structure with proxies pointing to the new records to indicate where which part of the old record was moved. • Consists of all the nodes on the path from d to the subtree’s root. • Partition the tree into left partition L, right partition R and Separator S.

  17. A record’s subtree before a split occurs

  18. Splitting a Record • Distributing the nodes on records • After determining the partitioning, the contents of the record has to be distributed onto new records. • Each resulting subtree is then stored in its own record, called partition records. • Inserting the separator • The separator is moved to the parent record.

  19. Split Algorithm • Find a node d, such that the resulting L and R. • The ratio between the sizes of L and R is determined by a configuration parameter (split target). • Another configuration parameter Split tolerance specifies the minimum size for the subtree of d. It is used to prevent fragmentation.

  20. Record assembly for the subtree from previous figure

  21. Physical storage of the tree represented inside one record

  22. Performance Test • XML markup version of Shakspeare’s play with 8MB with 320,000 nodes. • Pentium-II 333Mhz with 128MB under Windows NT4.0 with IBM DCAS 34330 disk. • The implementation of the record and tree storage managers was done in C++.

  23. Test Conditions • Record:Node 1:1 indicating smart record splitting being inhibited. • Record:Node 1:n indicating that the algorithm has full control over distribution of nodes on records. • Incremental updates distributed over the whole document. • Updates in pre-order (append).

  24. Insertion

  25. Full tree traversal

  26. Queries • Retrieve all speakers in the third act and second scene of every play, which means it accesses all leaf nodes of a certain type in one selected subtree of the document. • Recreate the textual representation of the complete first speech in every scene, hence reading a lot of small contiguous fragments of each document. • A simple path query was evaluated by reading only the opening speech of each play.

  27. Selection on leaf nodes of document subtree

  28. Small contiguous fragments

  29. Single path for each document

  30. Space requirements

  31. Monet Model • XML document is decomposed into binary relations. • Efficient for storage and retrieval of XML documents in a relational database. • The database used is their Monet database server which supports the Monet model.

  32. Some Definitions • An XML document is a rooted treed = (V, E, r, labelE, labelA, rank) with nodes V and edges EVV and a distinguished node rV. • The function labelE : Vstring assigns labels to nodes • labelA : Vstringstring assigns pairs of strings, attributes and their values, to nodes. • rank : Vint establishes a ranking to allow for an order among nodes with the same parent node.

  33. A sample XML document <bibliography> <article key=“BB88”> <author>Ben Bit</author> <title>How to Hack</title> </article> <article key=“BK99”> <editor>Ed Itor</editor> <author>Bob Byte</author> <author>Ken Key</author> <title>Hacking & RSI</title> </article> </bibliography>

  34. Syntax Tree of the Previous XML Document

  35. Monet Transform • Given an XML document d, the Monet transform is a quadruple Mt(d)=(r,R,A,T) where • R is the set of binary relations that contain all associations between nodes; • A is the set of binary relations that contain all associations between nodes and their attribute values, including character data; • T is set of binary relations that contain all pairs of nodes and their rank; • r is the root of the document;

  36. Monet Transform of the Example Document

  37. OQL-like query

  38. Query Handling

  39. Assessment • Implemented within the Monet database server • Tested on 550 MHz Silicon Graphics 1400 Server with 1 GB main memory. • Also used Sun UltraSparc-IIi with 360 MHz and 256 MB main memory to contrast with a related work.

  40. Size of document collections in XML and Monet XML format

  41. Scaling of Document • Scaled the ACM Anthology from 30 to 3x106 which corresponds to XML source size between 10KB and 1GB. • Run 4 queries consisting of path expressions of length 1 through 4 for various sizes of the anthology.

  42. Response Time vs. Result Size

  43. Comparison of response time for query set of SYU, another method for storage/retrieval of XML document.

  44. Compare/Contrast Natix and Monet • Natix uses custom database while Monet is built on top of relational database • Neither uses DTD. • Natix focuses on XML query as well as update. • Monet focuses on XML storage and query. • Though lacking equivalent test, Monet is faster than Natix on query. • Monet seems to be more space efficient than Natix as well.

  45. References • “Efficient storage of XML data” By Carl-Christian Kanne, et al. ICDE 2000 http://citeseer.nj.nec.com/kanne99efficient.html • “Efficient Relational Storage and Retrieval of XML Documents” By Albrecht Schmidt, et al. WebDB 2000 http://www.research.att.com/conf/webdb2000/program.html

More Related