XPath: A Complete Guide to XML Document Navigation

XPath By Laouina Marouane

Outline • Introduction • Data Model • Expression • Patterns • Location Paths • Example • XPath 2.0 • Practice • Conclusion

What is XPath? • A scheme for locating documents and identifying sub-structures within them. • A language designed to be used by both XSL Transformations (XSLT) and XPointer. • Provides common syntax and semantics for functionality shared between XSLT and XPointer. • Primary purpose: Address ‘parts’ of an XML document, and provide basic facilities for manipulation of strings, numbers and booleans. • W3C Recommendation. November 16, 1999 • Latest version: http://www.w3.org/TR/xpath

Why XPath? • Unique identifiers are not sufficient • Assigning unique identifier to every element is a burden • Identity of element may be unknown • Identifiers cannot handle ranges of text • May be inconvenient to identify a large number of objects by listing their identifiers

Introduction • XPath uses a compact, string-based, rather than XML element-based syntax. • Operates on the abstract, logical structure of an XML document (tree of nodes) rather than its surface syntax. • Uses a path notation (like URLs) to navigate through this hierarchical tree structure, from which it got its name. • A subset of it can be used for matching, i.e. testing whether or not a node matches a pattern. • Models an XML document as a tree of nodes of types: element, attribute, text. • Supports Namespaces. • Name of a node (a pair consisting of a local part and namespace URI). • Example of an XPath expression: /bib/book/publisher

Data Model • Treats an XML document as a logical tree • The tree contains seven types of node: • Root Node – the root of the document, not the document element • Element Nodes – one for each element in the document; an element may have a unique ID • Attribute Nodes • Namespace Nodes • Processing Instruction Nodes • Comment Nodes • Text Nodes • The tree is ordered: document order reads from top to bottom and left to right

Data Model (diagram) • Below the root: a processing instruction, a comment, and the root element bib • bib contains book elements • A book contains publisher (Addison-Wesley) and author (Serge Abiteboul) elements, among others

Example • For this simple doc: <doc> <?Pub Caret?> <para>Some emphasis here. </para> <para>Some more stuff.</para> </doc> • It might be represented as a tree: the root has the child <doc>, whose children are <?Pub Caret?> and two <para> elements, with text nodes as the leaves

Expressions • A text string to select an element, attribute, processing instructions, or text • The primary syntactic construct in XPath. • An expression is evaluated to yield an object, which has one of the following four basic types: • node-set (an unordered collection of nodes without duplicates) • boolean (true or false) • number (a floating-point number) • string (a sequence of UCS characters)

Element Context • Meaning of element can depend upon its context • <book><title>…</title></book><person><title>…</title></person> • Want to search for, e.g. title of book, not title of person • XPath exploits sequential and hierarchical context of XML to specify elements by their context (i.e. location in hierarchy) • title book/title person/title
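The difference between title, book/title and person/title can be tried out with a minimal sketch. It assumes Python's standard xml.etree.ElementTree module, which supports only a subset of XPath 1.0; the library wrapper element and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: both <book> and <person> contain a <title>.
doc = ET.fromstring(
    "<library>"
    "<book><title>XML in a Nutshell</title></book>"
    "<person><title>Dr</title></person>"
    "</library>"
)

# Every <title>, regardless of context.
all_titles = [t.text for t in doc.findall(".//title")]
# Only titles whose parent is <book>, or <person>.
book_titles = [t.text for t in doc.findall("book/title")]
person_titles = [t.text for t in doc.findall("person/title")]

print(all_titles)
print(book_titles)
print(person_titles)
```

Specifying the parent in the path is what separates the two kinds of title.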

Context • Expression evaluation occurs with respect to a context . • The context consists of: • a node (the context node) • a pair of non-zero positive integers (the context position and the context size) • a set of variable bindings • a function library • the set of namespace declarations in scope for the expression

More on context types • The context position is always less than or equal to the context size • The variable bindings consist of a mapping from variable names to variable values • The function library consists of a mapping from function names to functions. Each function takes zero or more arguments and returns a single result • The namespace declarations consist of a mapping from prefixes to namespace URIs

Patterns • A pattern is an expression used not to find objects, but to establish if a specific object matches certain criteria • Very important in XSLT specification • The '|' symbol is used to specify alternative patterns for matching • note|warning|/book/intro

Location Paths • One important kind of expression is a location path (a special case of an expression) • The result of evaluating an expression that is a location path is the node-set containing the nodes selected by the location path • Location paths can recursively contain expressions that are used to filter sets of nodes • A location path (the most important construct) describes a path from one point to another • Analogy: a set of street directions. “Second store on the left after the third light” • Two types of paths: Relative & Absolute • Composed of a series of one or more steps, each with optional predicates

Relative Paths • A relative location path consists of a sequence of one or more location steps separated by / • Each step selects a set of nodes relative to a context node; each node in that set is then used as a context node for the following step • E.g. para will select children of the current node that are named 'para' • <chapter> //Current node <title>…</title> <para>…</para> //Selected <note> <para>…</para> //Not selected until the context node is note </note></chapter> • Verbose expression is child::para
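The chapter example can be reproduced with a short sketch. It assumes Python's standard xml.etree.ElementTree module (a subset of XPath 1.0); the text content of the elements is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The chapter element is the context node.
chapter = ET.fromstring(
    "<chapter>"
    "<title>Intro</title>"
    "<para>direct child</para>"
    "<note><para>nested in note</para></note>"
    "</chapter>"
)

# 'para' is short for child::para: only direct children are selected.
children = [p.text for p in chapter.findall("para")]

# In 'note/para' each <note> selected by the first step becomes
# the context node for the second step.
via_note = [p.text for p in chapter.findall("note/para")]

print(children)
print(via_note)
```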

Absolute Paths • An absolute location path consists of / optionally followed by a relative location path • A / by itself selects the root node of the document containing the context node

Location Steps • A location step has three parts: • an axis, which specifies the tree relationship between the nodes selected by the location step and the context node, • a node test, which specifies the node type and expanded-name of the nodes selected by the location step, and • zero or more predicates, which use arbitrary expressions to further refine the set of nodes selected by the location step.

Location Steps parts explained • Axes • 13 axes defined in XPath (axis names are lowercase) • ancestor, ancestor-or-self • attribute • child • descendant, descendant-or-self • following • preceding • following-sibling, preceding-sibling • namespace • parent • self • Node test • Identifies the name or type of node; evaluates to true/false • Can be a name, or a node-type test such as node() that checks the type • Predicate • XPath boolean expressions in square brackets following the basis (axis & node test)

Location Steps in syntax • The syntax for a location step is the axis name and node test separated by a double colon, followed by zero or more expressions each in square brackets. • For example, in child::para[position()=1], child is the name of the axis, para is the node test and [position()=1] is a predicate

Abbreviated Syntax • child:: can be omitted from a location step (child is the default axis): div/para is equivalent to child::div/child::para • attribute:: can be abbreviated to @ • // is short for /descendant-or-self::node()/ • A location step of . is short for self::node(), e.g. .//para is short for self::node()/descendant-or-self::node()/child::para • A location step of .. is short for parent::node()

Wildcards • Sometimes you don't or can't know names • Can use wildcard '*' for any single element • book/intro/title and book/chapter/title are matched by book/*/title (but so is book/appendix/title) • Verbose form: child::* • Multiple asterisks can match several levels • But you must know the exact level and be sure that inappropriate matches won't be made
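A minimal sketch of the wildcard, assuming Python's standard xml.etree.ElementTree module (a subset of XPath 1.0). The element contents are hypothetical. Because the context node here is the book element itself, the path is written */title rather than book/*/title.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book>"
    "<title>Top-level title</title>"
    "<intro><title>Welcome</title></intro>"
    "<chapter><title>Basics</title></chapter>"
    "<appendix><title>Tables</title></appendix>"
    "</book>"
)

# '*' stands for exactly one element level, whatever its name.
titles = [t.text for t in book.findall("*/title")]
print(titles)  # ['Welcome', 'Basics', 'Tables']
```

The title that is a direct child of book is not selected, since the wildcard must consume one intermediate level. The appendix title is selected, which may or may not be wanted.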

Descendants • Rather than use a wildcard, recursively search through descendants • chapter//para will go through the chapter hierarchy and select any para elements • <chapter> //Starting node <title>…</title> <para>…</para> //Selected <note> <para>…</para> //Selected </note></chapter> • Verbose expression is child::chapter/descendant-or-self::node()/child::para
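The descendant search can be sketched as follows, assuming Python's standard xml.etree.ElementTree module (a subset of XPath 1.0). The book wrapper and the text values are hypothetical.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book>"
    "<chapter>"
    "<title>Intro</title>"
    "<para>direct child</para>"
    "<note><para>nested in note</para></note>"
    "</chapter>"
    "</book>"
)

# chapter//para: every <para> at any depth below a <chapter> child.
paras = [p.text for p in book.findall("chapter//para")]
print(paras)  # ['direct child', 'nested in note']
```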

Ancestors • To signify the parent of the context element • '..' • parent::node() • To find all 'title' elements that share the parent of the context node • ../title • parent::node()/child::title
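A small sketch of the parent step, assuming Python's standard xml.etree.ElementTree module (a subset of XPath 1.0). ElementTree cannot step above the element a search starts from, so the example starts at a hypothetical book root and reaches each para first.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book>"
    "<chapter>"
    "<title>Intro</title>"
    "<para>first</para>"
    "<note><para>second</para></note>"
    "</chapter>"
    "</book>"
)

# For every <para>, step up to its parent with '..', then select the
# parent's <title> children (the equivalent of ../title from each para).
sibling_titles = [t.text for t in book.findall(".//para/../title")]
print(sibling_titles)  # ['Intro']
```

Only the first para contributes a result: the parent of the second para is note, which has no title child.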

Other Relationships • May move around siblings of the current context element • preceding-sibling:: • following-sibling::

Other Relationships (2) • Can access all ancestors and descendants of the current context element • ancestor:: • descendant:: • These axes don't select siblings

Other Relationships (3) • Can access all ancestors and descendants of the current context element, together with the context element itself • ancestor-or-self:: • descendant-or-self:: • These axes don't select siblings

Other Relationships (4) • Can access all nodes that end before the context element starts, or start after it ends, in document order (ancestors and descendants are excluded) • preceding:: • following:: • Can access attributes • attribute::

Predicate Filters • Location paths are indiscriminate • May get a list of items that are selected • Predicate filter is used to filter the list • Filter is held between '[ ]' • Simplest is position() function predicate • exon[position() = 1] //1st exon • intron[2] //2nd intron • Can combine tests with 'and' and 'or'
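Positional predicates can be tried with the sketch below. It assumes Python's standard xml.etree.ElementTree module, which supports [n] and [last()] but not position() comparisons or 'and'/'or'. The id attributes are hypothetical labels added so the results can be told apart.

```python
import xml.etree.ElementTree as ET

transcript = ET.fromstring(
    "<transcript>"
    '<exon id="e1"/><intron id="i1"/>'
    '<exon id="e2"/><intron id="i2"/>'
    '<exon id="e3"/>'
    "</transcript>"
)

# exon[1] is the abbreviated form of exon[position() = 1].
first_exon = transcript.find("exon[1]").get("id")
# intron[2] counts only <intron> siblings, not all children.
second_intron = transcript.find("intron[2]").get("id")
# exon[last()] selects the last <exon>.
last_exon = transcript.find("exon[last()]").get("id")

print(first_exon, second_intron, last_exon)
```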

Position Tests • The last() function • Returns the context size, so [last()] locates the last sibling in the list • The count() function • Evaluates the number of items in a node-set • child::transcript[count(child::intron) = 1] • The id() function • Selects the element with the given unique ID • id("ENS0001")

Attribute Tests • Attributes can be selected • feature/@type • Elements can be selected depending on an attribute value • feature[@type="exon"]
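A sketch of both attribute tests, assuming Python's standard xml.etree.ElementTree module (a subset of XPath 1.0). ElementTree cannot return attribute nodes from a path, so feature/@type is emulated with .get(). The gene wrapper and the start attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

gene = ET.fromstring(
    "<gene>"
    '<feature type="exon" start="1"/>'
    '<feature type="intron" start="101"/>'
    '<feature type="exon" start="201"/>'
    "</gene>"
)

# Equivalent of feature/@type: read the attribute from each element.
types = [f.get("type") for f in gene.findall("feature")]

# feature[@type="exon"]: keep only elements with that attribute value.
exon_starts = [f.get("start") for f in gene.findall("feature[@type='exon']")]

print(types)
print(exon_starts)
```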

Functions • Node tests and functions in XPath: • text() = node test that matches text nodes • node() = node test that matches any node, whatever its type • name() = function that returns the name of the context node

Booleans • A boolean can only have two values: true or false • The following expressions can be evaluated: • or • and • =, != • <=, <, >=, >

Example • Operations perform boolean tests on conditions • exon[not(position() = 1)] • transcript[not(exon)] • intron[position() != last()] • exon[position() > 2] • exon[position() >= 3] • exon[position() = 1 or position() = last()]

Numbers • A number represents a floating-point number • The numeric operators convert their operands to numbers • Operators include: • +, -, *, div, mod • Since XML allows - in names, the - operator typically needs to be preceded by whitespace • Example: 5 mod 2 returns 1

Strings • Strings consist of a sequence of zero or more characters • A character is defined in the XML Recommendation

Example • Strings can be tested for characters and substrings • <note>hello there</note> • note[contains(text(), "hello")] • <note>hello there</note> • note[contains(., "hello")] • The '.' is the current node; its string-value is the concatenation of all of its descendant text nodes
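The two forms differ once the note contains child elements. In XPath 1.0, text() as a string argument yields the first text node child, while '.' yields the full string-value. The sketch below imitates this in Python with the standard xml.etree.ElementTree module instead of evaluating XPath functions (which ElementTree does not support); the b element is a hypothetical addition to the note.

```python
import xml.etree.ElementTree as ET

note = ET.fromstring("<note>hello <b>there</b></note>")

# Roughly what contains(text(), ...) sees: the first text node child.
first_text_node = note.text
# Roughly what contains(., ...) sees: all descendant text, concatenated.
string_value = "".join(note.itertext())

print("there" in first_text_node)  # False
print("there" in string_value)     # True
```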

Example (2) • starts-with(string, pattern) • note[starts-with(., "hello")] • string(exp) • note[contains(., string(2))] • substring-after(string, terminator) • substring-before(string, terminator) • substring(string, offset, length)

Example (3) • normalize-space(string) • Removes leading and trailing whitespace and collapses internal runs of whitespace to a single space • translate(string, source, replace) • translate(., ";+", ",") • concat(strings) • string-length(string)
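The behaviour of the whitespace-normalizing function and of translate() can be imitated in plain Python. This is a sketch of the XPath 1.0 semantics, not an XPath engine; the function names xpath_normalize_space and xpath_translate are hypothetical.

```python
import re

XML_WHITESPACE = " \t\r\n"

def xpath_normalize_space(s: str) -> str:
    """Strip leading/trailing whitespace and collapse internal runs."""
    stripped = s.strip(XML_WHITESPACE)
    return " ".join(re.split(r"[ \t\r\n]+", stripped))

def xpath_translate(s: str, source: str, replace: str) -> str:
    """Map each character of source to the character at the same
    position in replace; source characters without a counterpart
    are removed from the result."""
    table = {}
    for i, ch in enumerate(source):
        if ord(ch) in table:
            continue  # the first occurrence in source wins
        table[ord(ch)] = replace[i] if i < len(replace) else None
    return s.translate(table)

print(xpath_normalize_space("  hello \n  there "))
print(xpath_translate("a;b+c", ";+", ","))
```

With source ";+" and replace ",", every ';' becomes ',' and every '+' is dropped, because '+' has no counterpart in the shorter replace string.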

Core Function Library • XPath defines a core set of functions and operators • All implementations of XPath must implement the core function library • Node Set Functions: list/item[position() mod 2 = 1] selects all odd-numbered items of a list; id("foo")/child::para[position()=5] selects the 5th para child of the element with the unique ID foo • String Functions: substring("12345", 0, 3) returns "12" • Boolean Functions: boolean true() returns true • Number Functions: number sum(node-set) returns the sum of the nodes, each converted to a number
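The result "12" for substring("12345", 0, 3) follows from the XPath 1.0 rule that character positions start at 1 and that the selected positions p satisfy round(start) <= p < round(start) + round(length). The Python sketch below imitates that rule for finite numeric arguments only (NaN and infinities are not handled); the function names are hypothetical.

```python
import math

def xpath_round(x: float) -> int:
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)

def xpath_substring(s: str, start: float, length=None) -> str:
    """Imitate XPath 1.0 substring() for finite arguments."""
    first = xpath_round(start)
    if length is None:
        # Two-argument form: everything from position 'first' onward.
        return "".join(ch for pos, ch in enumerate(s, start=1) if pos >= first)
    end = first + xpath_round(length)
    return "".join(ch for pos, ch in enumerate(s, start=1) if first <= pos < end)

print(xpath_substring("12345", 0, 3))      # 12
print(xpath_substring("12345", 1.5, 2.6))  # 234
print(xpath_substring("12345", 2))         # 2345
```

Starting at 0 with length 3 covers positions 0, 1 and 2; position 0 does not exist, so only two characters come back.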

Example for XPath Queries <bib> <book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year> </book> <book price="55"> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year> </book> </bib>
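A few queries against this document can be run with the sketch below, which assumes Python's standard xml.etree.ElementTree module (a subset of XPath 1.0). ElementTree starts at the document element, so /bib/book/publisher is written book/publisher relative to bib. The padding spaces inside the elements are left out here for readability.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring(
    "<bib>"
    "<book>"
    "<publisher>Addison-Wesley</publisher>"
    "<author>Serge Abiteboul</author>"
    "<author><first-name>Rick</first-name><last-name>Hull</last-name></author>"
    "<author>Victor Vianu</author>"
    "<title>Foundations of Databases</title>"
    "<year>1995</year>"
    "</book>"
    '<book price="55">'
    "<publisher>Freeman</publisher>"
    "<author>Jeffrey D. Ullman</author>"
    "<title>Principles of Database and Knowledge Base Systems</title>"
    "<year>1998</year>"
    "</book>"
    "</bib>"
)

# /bib/book/publisher
publishers = [p.text for p in bib.findall("book/publisher")]
# //author
author_count = len(bib.findall(".//author"))
# /bib/book/author/last-name
last_names = [n.text for n in bib.findall("book/author/last-name")]
# /bib/book[@price="55"]/title
priced_titles = [t.text for t in bib.findall("book[@price='55']/title")]

print(publishers, author_count, last_names, priced_titles)
```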

Example summary • bib matches a bib element • * matches any element • / matches the root node • /bib matches a bib element under root • bib/paper matches a paper in bib • bib//paper matches a paper in bib, at any depth • //paper matches a paper at any depth • paper|book matches a paper or a book • @price matches a price attribute • bib/book/@price matches price attribute in book, in bib • bib/book[@price<"55"]/author/lastname matches…

XPath 2.0 • Latest version: • http://www.w3.org/TR/xpath20/ • W3C Working Draft 22 August 2003 • Any expression that is syntactically valid and executes successfully in both XPath 2.0 and XQuery 1.0 will return the same result in both languages

XPath 2.0 (2) • XPath 2.0 is a much more powerful language than XPath 1.0 and operates on a much larger domain of data types • A better way of describing XPath 2.0 is as an expression language for processing sequences, with built-in support for querying XML documents • Driving forces behind XPath 2.0 include not only the XPath 2.0 Requirements document but also many of the XML Query language requirements • XPath 2.0 is a strict syntactic subset of XQuery 1.0

XPath 2.0 (3) • XPath 2.0 introduces support for the XML Schema primitive types, which immediately gives the user access to 19 simple types, including dates, years, months, URIs, etc. • In addition, a number of functions and operators are provided for processing and constructing these different data types

XPath 2.0 (4) • Everything is a sequence • sequences are ordered • In XPath 1.0, if you wanted to process a collection of nodes, you had to deal with node-sets. • In XPath 2.0, the concept of the node-set has been generalized and extended. • sequences may contain simple-typed values as well as nodes • “for” expression enables iteration over sequences

XPath 2.0 (5) • sum(for $x in /order/item return $x/price * $x/quantity) • Conditional expression: • if ($widget1/unit-cost < $widget2/unit-cost) • then $widget1 • else $widget2 • Quantifiers: • some $x in /students/student/name satisfies $x = "Fred" • every $x in /students/student/name satisfies $x = "Fred"
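For readers more familiar with a general-purpose language, the "for" expression and the quantifiers have close Python analogues. The sketch below is not XPath 2.0: it uses Python's standard xml.etree.ElementTree module only to fetch the nodes, then generator expressions with sum(), any() and all(). The sample order and student data are hypothetical.

```python
import xml.etree.ElementTree as ET

order = ET.fromstring(
    "<order>"
    "<item><price>10</price><quantity>2</quantity></item>"
    "<item><price>5</price><quantity>3</quantity></item>"
    "</order>"
)

# Analogue of: sum(for $x in /order/item return $x/price * $x/quantity)
total = sum(
    float(item.findtext("price")) * float(item.findtext("quantity"))
    for item in order.findall("item")
)

students = ET.fromstring(
    "<students>"
    "<student><name>Fred</name></student>"
    "<student><name>Ann</name></student>"
    "</students>"
)
names = [n.text for n in students.findall("student/name")]

# Analogue of: some $x in /students/student/name satisfies $x = "Fred"
some_fred = any(name == "Fred" for name in names)
# Analogue of: every $x in /students/student/name satisfies $x = "Fred"
every_fred = all(name == "Fred" for name in names)

print(total, some_fred, every_fred)
```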

XPath 2.0 (6) • Intersections, differences, unions: • The except operator to select all of a given node-set, except for certain nodes • @* except @exc:foo • the intersect operator • $x intersect /foo/bar

Some Practice • Try XPath Visualizer. • You can download it from: http://www.vbxml.com/downloads/files/xpathvisualiserseptember.zip • It can help you with: • Learning and playing with XPath expressions. • Composing and visually verifying the exact XPath expression when designing an XSLT stylesheet. • Obtaining the quantitative characteristics of an xml document, counts, sums, arithmetical and relational results, strings, substrings, etc.

Conclusion • XPath provides a concise and intuitive way to address parts of XML documents • Standard part of the XSLT and XPointer specifications • Using XPath basically requires learning the abbreviated syntax of location path expressions and the functions of the core library

References • http://www.w3.org/TR/xpath • http://www.w3.org/TR/xpath20/ • http://www.vbxml.com/xpathvisualizer/default.asp • http://www.xml.com/pub/a/2002/03/20/xpath2.html • XML in a Nutshell
