XML Storage and Indexing Native XML

XML Storage and Indexing Native XML Ervin Domazet Ahmed Cavit Zafar Hakan Demir

Indexing Native XML Hakan Demir

Contents • What is a Native XML Database • Native XML Database Architectures • Features of Native XML Databases • Normalization, Referential Integrity, and Scalability • XML Indexing Optimizations • Elementary Indices • Content Indices • Navigational Indices • Index Comparison

1. What is a Native XML Database • Native XML databases are databases designed specifically to store XML documents. Like other databases, they support transactions, security, multi-user access, programmatic APIs, query languages, and so on. • The only difference from other databases is that their internal model is based on XML rather than on something else, such as the relational model.

Where are they used? • Native XML databases are most commonly used to store document-centric documents. The main reason is their support for XML query languages, which allow you to pose queries that are difficult to express in a language like SQL. • Native XML databases are also commonly used to integrate data. They handle schema changes more easily than relational databases and can handle schemaless data as well. • The third major use case is semi-structured data, such as that found in finance and biology, which changes so frequently that definitive schemas are often not possible. • The final major use of native XML databases is handling schema evolution.

2. Native XML Database Architectures The architectures of native XML databases fall into two broad categories: text-based and model-based. • A text-based native XML database stores XML as text. This might be a file in a file system, a BLOB in a relational database, or a proprietary text format. • A model-based native XML database does not store the XML document as text. Instead, it builds an internal object model from the document and stores this model. How the model is stored depends on the database. Some databases store the model in a relational or object-oriented database.

3. Features of Native XML Databases This section briefly introduces a number of the features found in native XML databases. • Document Collections: Many native XML databases support the notion of a collection. This plays a role similar to a table in a relational database or a directory in a file system. • Query Languages: Almost all native XML databases support one or more query languages. The most popular of these are XPath and XQuery, although numerous proprietary query languages are supported as well.

Updates and Deletes: Native XML databases have a variety of strategies for updating and deleting documents, from simply replacing or deleting the existing document, through modifications on a live DOM tree, to languages that specify how to modify fragments of a document. • Transactions, Locking, and Concurrency: Virtually all native XML databases support transactions. However, locking is often at the level of entire documents rather than individual nodes, so multi-user concurrency can be relatively low. • Indexes: All native XML databases support indexes as a way to increase query speed. Indexes are explained in detail later, so they are not discussed further here.

Application Programming Interfaces (APIs): Almost all native XML databases offer programmatic APIs. These usually take the form of an ODBC-like interface, with methods for connecting to the database, exploring metadata, executing queries, and retrieving results. • Round-Tripping: One important feature of native XML databases is that they can round-trip XML documents. That is, you can store an XML document in a native XML database and get the "same" document back again. This is vital to many legal and medical applications, which are required by law to keep exact copies of documents.

4. Normalization, Referential Integrity, and Scalability For many people, especially those with relational database backgrounds, native XML databases raise a number of controversial issues, particularly regarding the storage of data. These are: • Normalization: Normalization refers to the process of designing a database schema in which a given piece of data is represented only once. Normalizing data for a native XML database is largely the same as normalizing it for a relational database: you need to design your documents so that no data is repeated. One difference between native XML databases and relational databases is that XML supports multi-valued properties while (most) relational databases do not. This makes it possible to "normalize" data in a native XML database in a way that is not possible in a relational database.

Referential Integrity: Referential integrity refers to the validity of pointers to related data and is a necessary part of maintaining a consistent database state. In a relational database, referential integrity means ensuring that foreign keys point to valid primary keys, that is, checking that the primary key row corresponding to any foreign key exists. In a native XML database, referential integrity means ensuring that "pointers" in XML documents point to valid documents or document fragments.

Scalability: Like hierarchical and relational databases, native XML databases use indexes as a way to initially find data. This means that locating documents and document fragments is related solely to index size, not to document size or the number of documents, and that native XML databases can locate the start of a document or fragment as fast as other databases using the same indexing technology. Thus, native XML databases will scale as well as other databases in this respect.

5. XML Indexing Optimizations • The history of database systems development is marked by considerable research effort to speed up data retrieval using various specific methods referred to as indexing mechanisms. • Three basic classes of indices will be explained: • Elementary Indices • Content Indices • Navigational Indices

5.1. Elementary Indices • Elementary indices take a single parameter without any regard to the structure of an XML document. No relationship within the XML structure is visible to them. They usually map a given parameter to a set of elements. Let us now briefly describe the basic elementary indices: • Text index: finds all elements holding a given keyword. • Value index: works on the same principle as the text index in that it searches the contents of elements, but it also accepts parameters with data-type-specific conditions. • Combined vocabulary index: combines the search options of both the text index and the label index.

5.1.1. Text Index • A text index can be a powerful tool in environments with substantial demand for querying textual information. Used in semi-structured databases, it is responsible for finding the sets of elements that contain given keywords. • We now present one notable implementation of the text index, which makes use of an inverted file structure. The idea is a simplified reversal of an XML tree, in which sets of elements are assigned to a keyword that they contain.

Its basic form consists of two tables. The first table maps keywords to addresses of inverted files. In the second table, known as a simple inverted file, each row (which is addressed by the first table) contains a sequence of physical addresses. • An example of the inverted file structure storing element addresses is depicted in the Figure:

Retrieval can be sped up by using different indexes for different types of elements. However, this method requires additional page lookups to find the appropriate index.
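The two-table layout described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the sample element contents, and the use of small integers as stand-ins for physical addresses are hypothetical and do not come from the slides.

```python
def build_text_index(elements):
    """Build a two-table inverted file from {element_address: text}."""
    keyword_table = {}   # first table: keyword -> row number in the inverted file
    inverted_file = []   # second table: each row holds a sequence of element addresses
    for address, text in sorted(elements.items()):
        for keyword in set(text.lower().split()):
            if keyword not in keyword_table:
                keyword_table[keyword] = len(inverted_file)
                inverted_file.append([])
            inverted_file[keyword_table[keyword]].append(address)
    return keyword_table, inverted_file


def lookup(keyword, keyword_table, inverted_file):
    """Return the addresses of all elements that contain the keyword."""
    row = keyword_table.get(keyword.lower())
    return [] if row is None else inverted_file[row]


# Hypothetical element addresses and contents.
elements = {1: "truth and method", 2: "method of doubt", 3: "being and time"}
keyword_table, inverted_file = build_text_index(elements)
print(lookup("method", keyword_table, inverted_file))
```

A lookup costs one probe of the keyword table plus one read of the addressed row, which is the reason the structure suits keyword-heavy workloads.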

5.1.2. Value Index • The value index is a structure capable of retrieving string, real, or integer values using a selection predicate over these types. The predicates are those typical for the data types mentioned: the common predicates for numeric types are =, ≤, ≥, >, and <, and the operators supported for strings are = and containment. • A common problem for value indices lies in predicates whose data type is ambiguous. For instance, the data that match the query predicate ='05' are an integer with value 5 or a string with value '05'. The string with value '5', on the other hand, does not match.

To address this problem, a coercion function is usually employed. A coercion function, denoted t1→t2, states that in a particular database the data type t1 can be converted to the data type t2. • For instance, if a string query ='05' is converted to an integer query =5, we write string→integer.
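The '05' example can be reproduced with a small sketch. It assumes a single coercion string→integer and an equality predicate only; the helper names and the sample node IDs are hypothetical.

```python
def coerce_to_int(literal):
    """Assumed coercion string -> integer; None when the literal is not an integer."""
    try:
        return int(literal)
    except (TypeError, ValueError):
        return None


def matches_equals(stored, query_literal):
    """Evaluate the predicate = query_literal against one stored value."""
    if isinstance(stored, str):
        # Stored strings are compared as strings, so '5' does not match '05'.
        return stored == query_literal
    if isinstance(stored, int):
        # Stored integers are compared after coercing the query literal.
        return coerce_to_int(query_literal) == stored
    return False


# Hypothetical indexed values keyed by node ID.
values = {"&1": 5, "&2": "05", "&3": "5"}
hits = [node for node, value in values.items() if matches_equals(value, "05")]
print(hits)
```

The integer 5 and the string '05' match, while the string '5' does not, as stated in the text.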

5.1.3. Combined Vocabulary Index • As a combination of the text index and the label index, the combined vocabulary index answers queries that ask for elements with a given name containing a given keyword. An example of such a query is TeacherName = 'Gadamer'. • Databases that employ only a separate text index and label index have to intersect the results of these two indices in order to answer such a query. However, the intersection operation is expensive. The combined vocabulary index is designed to avoid it.

The combined vocabulary index extends the inverted file so that it encompasses not only keywords but also element names. The retrieval process searches first for a given keyword and then for a given element. The data structure used by this index is a nested table. An example of a combined vocabulary index is shown in the Figure:
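A nested table of this kind might look as follows in Python. This is a minimal sketch: the keyword-then-element-name lookup order follows the description above, while the function names, element names other than TeacherName, and the addresses are assumptions made for illustration.

```python
from collections import defaultdict


def build_combined_index(elements):
    """elements: iterable of (address, element_name, text)."""
    # Outer level: keyword. Inner level: element name -> list of addresses.
    index = defaultdict(lambda: defaultdict(list))
    for address, name, text in elements:
        for keyword in set(text.split()):
            index[keyword][name].append(address)
    return index


def query(index, name, keyword):
    """Answer name = 'keyword' with two nested lookups and no intersection."""
    return index.get(keyword, {}).get(name, [])


# Hypothetical data.
elements = [
    (1, "TeacherName", "Gadamer"),
    (2, "StudentName", "Gadamer"),
    (3, "TeacherName", "Heidegger"),
]
index = build_combined_index(elements)
print(query(index, "TeacherName", "Gadamer"))
```

With separate indices the same query would need the set of all elements containing 'Gadamer' and the set of all TeacherName elements, followed by an intersection of the two.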

5.2. Content Indices • The ability to search textual data rapidly is the main characteristic of content indices. They are commonly employed in retrieval systems that put special emphasis on indexing textual information with regard to the structure of a document. • We have chosen three instances that represent three main methods in this domain. • The first method makes use of a compact tree structure for encoding databases of words. (PATRICIA Trie) • The second method is capable of answering different path queries with a containment statement. (Inverted List Extension) • The third method enhances text search by filtering. (Context Filter)

5.2.1. PATRICIA Trie • The PATRICIA trie, a full-text indexing technique dating from 1968, is still widely used today. • A trie encodes an array of strings into a tree structure whose nodes represent prefix characters of a string. Each node corresponds to a single character of a string. An example of a trie for the set of strings {know, respect, response, recognize} is presented in the following Figure (a):

Strings that share only a few characters with other strings, such as knowledge in the example, introduce a storage overhead. To reduce this overhead, prefix trees are compacted by encoding only the characters in which strings differ. Common parts of strings are ignored. The compacted version of the trie is known as the PATRICIA trie. Edges in a PATRICIA trie are labeled with pairs <x; y>, where x is the first character of the common part of the strings and y is the length of the common part. The rest of the common part is omitted. A leaf node in a PATRICIA trie is a reference to the original string. The Figure (b) shows how the whole prefix tree is simplified by conversion to a PATRICIA trie.

A PATRICIA trie can be searched recursively. When searching for a pattern P, we compare the first character of P with the labels x on the edges connected to the root node. The matching edge is chosen, and its label number y identifies how many characters of P are skipped in the next search step. The remaining subpattern is then used recursively to search the children of the selected node. When the traversal reaches a leaf node, the search pattern must be compared with the original string referenced by the leaf node. This is required because of the lossy compression of the common parts. • A significant drawback of the PATRICIA trie is the complexity of updating the trie.
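The search procedure can be sketched as follows. This is a simplified illustration rather than a full PATRICIA implementation: it assumes a terminator character appended to every string so that no key is a prefix of another, stores leaves as the strings themselves instead of references, and uses hypothetical function names.

```python
import os

END = "\0"  # assumed terminator, not expected to occur inside the indexed strings


def build(strings, start=0):
    """Return a leaf (the terminated string) or a dict: x -> (y, child)."""
    if len(strings) == 1:
        return strings[0]
    groups = {}
    for s in strings:
        groups.setdefault(s[start], []).append(s)
    node = {}
    for x, group in groups.items():
        # y is the length of the part shared by every string in the group.
        y = len(os.path.commonprefix([s[start:] for s in group]))
        node[x] = (y, build(group, start + y))
    return node


def search(node, pattern):
    """Follow edge labels <x; y>, then verify against the string at the leaf."""
    key = pattern + END
    pos = 0
    while isinstance(node, dict):
        if pos >= len(key) or key[pos] not in node:
            return False
        y, node = node[key[pos]]
        pos += y  # skip the common part without comparing it
    # The skipped characters were never checked, so compare the whole string now.
    return node == key


words = ["know", "respect", "response", "recognize"]
root = build([w + END for w in sorted(set(words))])
print(search(root, "respect"), search(root, "rxspect"), search(root, "resp"))
```

The pattern rxspect follows the same edges as respect because only the first character of each common part is inspected; it is rejected only by the final comparison at the leaf, which is the step the text attributes to the lossy compression.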

5.2.2. Inverted List Extension • A simple inverted list is suitable for environments that require indexing of any textual information. If an index that answers containment queries is required, a slightly redesigned and augmented inverted file structure can be used. • The occurrences of both kinds of terms, elements and words, are indexed “by its document number, its position and its nesting depth within the document. This is denoted as (docno; begin:end; level) for an element and (docno; wordno; level) for a text word.”

An example of an XML document and its index representation is presented in the Figure: • The inverted list extension has the following properties: - Containment property - Direct containment property - Tight containment property - Proximity property We will give an example only of the direct containment property and not go into more detail.

For example, when the XPath query book/section is issued, the book and section elements are retrieved from the label index, and then those elements are selected for which a book element directly contains a section element. • Because of its labeling scheme, this index is not resistant to updates. Each insertion or deletion of a term, and any change to the text, implies relabeling and reconstruction of the whole index structure.
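The direct containment check for book/section can be sketched with the (docno; begin:end; level) entries. The slides do not spell out the test, so the condition used here (same document, the child's interval nested inside the parent's, and a level difference of exactly one) is an assumption based on the usual reading of such labels; the names and sample entries are hypothetical.

```python
from typing import NamedTuple


class ElementOcc(NamedTuple):
    docno: int
    begin: int
    end: int
    level: int


def directly_contains(parent, child):
    """Assumed test: same document, nested interval, child one level deeper."""
    return (
        parent.docno == child.docno
        and parent.begin < child.begin
        and child.end < parent.end
        and child.level == parent.level + 1
    )


def child_step(parents, children):
    """Keep the children that are directly contained in some parent."""
    return [c for c in children if any(directly_contains(p, c) for p in parents)]


# Hypothetical index entries for the labels book and section.
books = [ElementOcc(1, 1, 20, 0)]
sections = [
    ElementOcc(1, 2, 9, 1),    # direct child of the book
    ElementOcc(1, 12, 15, 2),  # nested one level too deep
    ElementOcc(2, 2, 5, 1),    # belongs to a different document
]
print(child_step(books, sections))
```

Dropping the level condition would turn the same test into plain containment, which corresponds to a descendant step rather than a child step.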

5.2.3. Context Filter • A context filter improves the efficiency of retrieval by filtering out unimportant data. Even though a context filter is not itself an index structure, it essentially extends the functionality of content indices. • The context filter is a powerful tool when combined with a content index. Occurrences of terms in a content index are extended to also contain linear context information. • The linear context of an occurrence is the set of all ancestor labels on the path from the root node to the occurrence.

A linear context is encoded as a bitstring of fixed length. Both the linear context of occurrences and the linear context of a query are encoded in the same way. • The Figure shows an example of a source document (a) with its label mapping to a bitstring (b) and a text index enriched with bitstrings (c).

To better understand how the context filter works, consider the following example: a query processor queries data from the database in the given Figure. The XPath query has the form: /book/chapter/name[contains(.,'totality')] • In the preprocessing stage, the linear context 101001 is assigned to the keyword 'totality'. Then all occurrences of the keyword are probed from the text index: {(&9, 101000), (&12, 111001)}. Afterwards, the filtering operation is executed and the node &12 is filtered out of the result set.
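The filtering step can be illustrated with bitmasks. The slides do not name the exact operation, so the test below is an assumption chosen because it reproduces the stated outcome: an occurrence is dropped when its linear context carries a label bit that is absent from the query context. The function name is hypothetical.

```python
def passes_filter(occurrence_context, query_context):
    """Assumed test: the occurrence has no ancestor label outside the query context."""
    return occurrence_context & ~query_context == 0


query_context = int("101001", 2)  # linear context assigned to 'totality'
occurrences = {"&9": int("101000", 2), "&12": int("111001", 2)}

kept = [node for node, ctx in occurrences.items() if passes_filter(ctx, query_context)]
print(kept)
```

Under this assumption &12 is removed because its second bit marks an ancestor label that does not occur on the queried path, while &9 stays in the candidate set for further evaluation.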

5.3. Navigational Indices • In this section we concentrate on the following navigational indices: • DataGuide index: represents the summary indices, which index by paths, and offers reasonable results for incremental updates. • T-index: the most general approach in template indexing; it can be used in databases that adjust index structures according to statistics of the most frequent queries. (It is the general version of the 1-index and 2-index. Since it is more complicated, we will not go into it much.) • 1-index: suitable for indexing absolute paths. • 2-index: suitable for indexing relative paths.

5.3.1. DataGuide Index • The DataGuide index was developed as a convenient tool for databases of semistructured documents without any fixed document schema. • When no schema is provided, the obscurity of the document structure prevents the user from forming meaningful queries over the database, and every search within such a document is burdened with inefficient traversal of the whole graph, without any hints about which branches are worthless and which are valuable. • The DataGuide addresses these issues and "serves as dynamic schema, generated from the database."

This index summarizes all paths in the database graph. Each edge from the source graph appears exactly once in the DataGuide graph. The DataGuide nodes contain annotations: the set of IDs of the nodes reachable by a given path. • Examples of DataGuides are presented in the Figure, where (a) is a source document and (b), (c) are two DataGuides that can be derived from it.

Minimal DataGuide: the smallest variation that can be constructed. An example is presented in the previous Figure (b). It is the desirable form of a DataGuide in environments that do not support data updates. • Strong DataGuide: addresses the problem of insertion. The definition of a strong DataGuide states that whenever one of the index's nodes can be reached by multiple paths, these paths reach the same set of nodes in the source document. Thus, inserting a new node results in simply adding a new edge or annotation to the index. An example of a strong DataGuide is presented in the previous Figure (c).
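The idea of a path summary with node-ID annotations can be sketched for tree-shaped data, where each distinct label path is recorded once together with the IDs of the nodes it reaches. This is a simplification under assumptions: real DataGuides are defined over graphs, and the document, the labels, and the function name below are hypothetical.

```python
def build_path_summary(node, path=(), summary=None):
    """node is (node_id, label, children); returns {label_path: set of node IDs}."""
    if summary is None:
        summary = {}
    node_id, label, children = node
    path = path + (label,)
    # Each label path appears once; the annotation collects every node it reaches.
    summary.setdefault(path, set()).add(node_id)
    for child in children:
        build_path_summary(child, path, summary)
    return summary


# Hypothetical source document.
doc = ("&1", "book", [
    ("&2", "chapter", [("&3", "name", [])]),
    ("&4", "chapter", [("&5", "name", []), ("&6", "section", [])]),
])
summary = build_path_summary(doc)
print(sorted(summary[("book", "chapter", "name")]))
```

A path query such as book/chapter/name is then answered by a single lookup in the summary instead of a traversal of the document, and a path that is missing from the summary is known not to exist in the data.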

5.3.2. T-Index • The template index, known as the T-index, was designed to provide an effective general index structure for semi-structured data with competitive results. • Because it accepts regular expressions as input parameters, it can index almost any type of path relationship within a document. • The essential principle used by the T-index is a non-deterministic automaton. • Although the T-index is the most general structure in template indexing, we will not go into its details and will describe only two of its simpler instances, the 1-index and the 2-index. • The 1-index is suitable for indexing absolute paths and the 2-index for indexing relative paths. The T-index is built to evaluate regular path expression queries with path templates.

First let us define regular path expressions and then we focus on path templates.

1-index: A node in the 1-index refers to a set of database nodes in the same equivalence class. Each node in the 1-index has a unique ID. The nodes and edges of the 1-index form a graph, which is stored in a standard way. In the worst case, the graph of the 1-index is as large as the graph of the original database. • When a query instantiated from the template P x is evaluated, all nodes that satisfy the query path are selected from the 1-index. The result of the query is then the union of all database nodes referenced by the selected 1-index nodes. • If the query q = d.v x is evaluated in the 1-index, the index processor finds two paths d.v in the 1-index and returns the set of nodes: {&3, &7} ∪ {&8}.

2-index: It can be thought of as an ancestor-descendant index retrieving all pairs of nodes that are connected by a path defined by the placeholder P. A node in the 2-index annotates all pairs of nodes in the same equivalence class. • An example of a 2-index for the data from the previous Figure (a) is shown in a new Figure. For the query *x1 v.z x2, all edges v from the root are traversed, then all edges z are traversed. The result is the union of all pairs referenced by the selected nodes: {(&6, &9)} ∪ {(&2, &4)}.

An overview of templates for the 1-index, 2-index, and T-index is given in the following Table: • Other navigational indices include the F&B index, XR-Tree, AC-Tree, A(k)-index, Fabric Index, and Virtual Suffix Tree Indexing. Interested readers can look into these as well.

6. Index Comparison • Having described the behavior of the selected indices, we now compare their properties and features. • Since the features of these indices have already been explained, it is better to compare their properties in tables rather than in text. We now present three tables which, we hope, summarize the overall comparison of the indices.

REFERENCES • Milos Janek, Indexing Techniques for Native XML Database Systems, Czech Technical University, 2010. • http://www.rpbourret.com/xml/XMLAndDatabases.htm
