Lecture 08: XML and Semistructured Data

Presentation Transcript


  1. Lecture 08: XML and Semistructured Data

  2. Outline • XML (Section 17) • XML syntax, semistructured data • Document Type Definitions (DTDs) • XPath

  3. Additional Readings on XML • XML • http://www.w3.org/XML/1999/XML-in-10-points • www.zvon.org/xxl/XMLTutorial/General/book_en.html • http://db.bell-labs.com/galax/ • http://www.w3.org/TR/REC-xml-names (1/99) • XPath • http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html • XQuery • http://www.w3.org/TR/xmlquery-use-cases/ • http://www.xmlportfolio.com/xquery.html • Main source: www.w3.org (but hard to read)

  4. XML • eXtensible Markup Language • XML 1.0 – a recommendation from W3C, 1998 • Roots: SGML (used in publishing). • After the roots: a format for sharing data

  5. XML Data • Relational data does not have a syntax • I can’t “give” you my relational database • Need to import it from other syntax, like CSV (comma-separated-values) • XML = rich syntax for data • But XML is not relational: semistructured • Usage: • Map any data to XML • Store it in files, exchange on the Web, etc. • Even query it directly, using XPath, XQuery

  6. XML Data Sharing and Exchange [Figure: applications, object-relational data, relational data, and legacy data exchange XML data over the Web (HTTP). Specific data management tasks: integrate, transform, warehouse.]

  7. From HTML to XML HTML describes the layout

  8. HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

  9. XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the structure

  10. XML Terminology • tags: book, title, author, … • start tag: <book>, end tag: </book> • elements: <book>…</book>,<author>…</author> • elements are nested • empty element: <red></red> abbrv. <red/> • well formed XML document • if it has matching tags • tags are properly nested • single root element • and more constraints, e.g. on names
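The well-formedness conditions on slide 10 are exactly what an XML parser checks before it hands back a tree. A minimal sketch in Python, assuming the standard-library `xml.etree.ElementTree` parser is an acceptable stand-in for "an XML parser"; the helper name `is_well_formed` and the three sample strings are illustrative and not part of the lecture:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for mismatched tags, bad nesting, multiple roots, bad names, ...
        return False


good = "<book><title>Foundations</title><red/></book>"   # matching tags, one root
bad_nesting = "<book><title>Foundations</book></title>"  # tags cross each other
two_roots = "<book/><book/>"                              # more than one root element

for sample in (good, bad_nesting, two_roots):
    print(sample, "->", is_well_formed(sample))
```

Well-formedness says nothing about which elements are allowed where; that is the job of a DTD, covered later in the lecture.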

  11. More XML: Attributes <book price="55" currency="USD"> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> Attributes are an alternative way to represent data.

  12. More XML: IDs and References
      <person id="o555"> <name> Jane </name> </person>
      <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person>
      <person id="o123" mother="o456"> <name> John </name> </person>
      The scope of IDs and references is the document.
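IDs and references turn the tree into a graph: following a reference means looking up the element that carries the matching ID. A minimal sketch in Python, assuming a wrapper element `<people>` (added here only to obtain a single root) and treating `id`, `idref`, and `mother` as ID/IDREF attributes by convention, since no DTD declares them in this snippet:

```python
import xml.etree.ElementTree as ET

# <people> is a hypothetical wrapper; the three persons come from slide 12.
doc = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""
root = ET.fromstring(doc)

# IDs are unique within the document, so a dictionary can serve as the index.
by_id = {p.get("id"): p for p in root.iter("person")}

# An IDREFS-style value is a whitespace-separated list of IDs.
mary = by_id["o456"]
child_ids = mary.find("children").get("idref").split()
child_names = [by_id[i].findtext("name") for i in child_ids]

# A single reference is followed the same way.
mother_of_john = by_id[by_id["o123"].get("mother")].findtext("name")

print(child_names, mother_of_john)
```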

  13. More XML: CDATA Section • Syntax: <![CDATA[ .....any text here...]]> • Example: <example> <![CDATA[ some text here </notAtag> <>]]></example>
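A CDATA section only changes how the text is written, not what the parser sees: the characters inside arrive as ordinary text. A small sketch in Python using the standard-library `xml.etree.ElementTree`, applied to the example from slide 13:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring(
    "<example><![CDATA[ some text here </notAtag> <>]]></example>"
)

# The markup-looking characters are plain text, not child elements.
text = elem.text
child_count = len(elem)

# On output, ElementTree escapes the same text instead of emitting CDATA.
serialized = ET.tostring(elem, encoding="unicode")

print(repr(text), child_count)
print(serialized)
```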

  14. More XML: Entity References • Syntax: &entityname; • Used like macros • Example: <element> this is less than &lt; </element> some predefined entities complete list: http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html

  15. More XML: Processing Instructions • Syntax: <?target argument?> • Processed by external applications, e.g. PHP (bad style) • Example: <product> <name> Alarm Clock </name> <?ringBell 20?> <price> 19.99 </price> </product>

  16. More XML: Comments • Syntax <!-- .... Comment text... --> • Yes, they are part of the data model !!!

  17. XML Data: a Tree!
      <data>
        <person id="o555">
          <name> Mary </name>
          <address>
            <street> Maple </street>
            <no> 345 </no>
            <city> Seattle </city>
          </address>
        </person>
        <person>
          <name> John </name>
          <address> Thailand </address>
          <phone> 23456 </phone>
        </person>
      </data>
      [Figure: the same document drawn as a tree rooted at data, with a legend distinguishing element nodes, attribute nodes, and text nodes.]
      Order matters!

  18. From Relational Data to XML Data
      XML:
      <persons>
        <row> <name>John</name> <phone>3634</phone> </row>
        <row> <name>Sue</name> <phone>6343</phone> </row>
        <row> <name>Dick</name> <phone>6363</phone> </row>
      </persons>
      [Figure: the persons table (name, phone) next to the XML tree: a persons root with three row children, each with a name and a phone child.]
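The mapping on slide 18 is mechanical: the table name becomes the root, each tuple becomes a `<row>`, and each column becomes a child element. A minimal sketch in Python, assuming the table is available as a list of tuples; the function name `table_to_xml` is illustrative:

```python
import xml.etree.ElementTree as ET


def table_to_xml(table_name, columns, rows):
    """Map a relational table to the <table><row><col>...</col></row></table> shape."""
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            # The column name is repeated in every row: the schema is part of the data.
            ET.SubElement(row, column).text = value
    return root


rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]
persons = table_to_xml("persons", ["name", "phone"], rows)
xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```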

  19. XML Data • XML is self-describing • Schema elements become part of the data • Relational schema: persons(name,phone) • In XML <persons>, <name>, <phone> are part of the data, and are repeated many times • Consequence: XML is much more flexible • XML = semistructured data

  20. Semi-structured Data Explained • Missing attributes: <person> <name> John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person> (no phone!) • Could represent in a table with nulls

  21. Semi-structured Data Explained • Repeated attributes: <person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone> </person> (two phones!) • Impossible in tables

  22. Semistructured Data Explained • Attributes with different types in different objects: <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person> (structured name!) • Nested collections (no 1NF) • Heterogeneous collections: • <db> contains both <book>s and <publisher>s
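Code that reads semistructured data has to expect zero, one, or many occurrences of a child element. A minimal sketch in Python, assuming a hypothetical wrapper element `<db>` around the person elements from slides 20 and 21 (whitespace trimmed for brevity):

```python
import xml.etree.ElementTree as ET

doc = """
<db>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</db>
"""
root = ET.fromstring(doc)

# findall returns a list, so missing and repeated phones need no special case.
phones = {
    person.findtext("name"): [ph.text for ph in person.findall("phone")]
    for person in root.findall("person")
}
print(phones)
```

A relational import of the same data would need either a nullable column plus a cap on the number of phones, or a separate phone table.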

  23. Document Type Definitions (DTD) • part of the original XML specification • an XML document may have a DTD • XML document: • well-formed = if tags are correctly closed • valid = if it has a DTD and conforms to it • validation is useful in data exchange

  24. Very Simple DTD <!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)> ]>

  25. Very Simple DTD Example of valid XML document: <company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ... </company>

  26. DTD: The Content Model <!ELEMENT tag (CONTENT)>, where (CONTENT) is the content model • Content model: • Complex = a regular expression over other elements • Text-only = #PCDATA • Empty = EMPTY • Any = ANY • Mixed content = (#PCDATA | A | B | C)*

  27. DTD: Regular Expressions
      • sequence: <!ELEMENT name (firstName, lastName)>
        <name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName> </name>
      • optional: <!ELEMENT name (firstName?, lastName)>
        <name> <lastName> . . . . . </lastName> </name>
      • star (repeated occurrence): <!ELEMENT person (name, phone*)>
        <person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . . </person>
      • alternation: <!ELEMENT person (name, (phone|email))>
        <person> <name> . . . . . </name> <email> . . . . . </email> </person>
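Because a complex content model is a regular expression over child element names, checking it can be sketched with an ordinary regex engine. The Python sketch below is not a DTD validator: the `CONTENT_MODELS` table is a hand-written, hypothetical translation of the company DTD from slide 24, each child name is encoded as the name followed by a space, and every element missing from the table is treated as `#PCDATA` (no child elements allowed).

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models (assumption: one regex per element name).
CONTENT_MODELS = {
    "company": r"(person |product )*",        # (person|product)*
    "person": r"ssn name office (phone )?",   # (ssn, name, office, phone?)
    "product": r"pid name (description )?",   # (pid, name, description?)
}


def conforms(elem) -> bool:
    """Check elem and its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return len(elem) == 0                 # treated as #PCDATA
    child_sequence = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, child_sequence) is None:
        return False
    return all(conforms(child) for child in elem)


valid_doc = ET.fromstring(
    "<company>"
    "<person><ssn>123456789</ssn><name>John</name><office>B432</office>"
    "<phone>1234</phone></person>"
    "<person><ssn>987654321</ssn><name>Jim</name><office>B123</office></person>"
    "</company>"
)
invalid_doc = ET.fromstring(
    "<company><person><ssn>1</ssn><name>Ann</name>"
    "<phone>1234</phone><office>B1</office></person></company>"  # phone before office
)
print(conforms(valid_doc), conforms(invalid_doc))
```

A real validating parser also checks attributes, entities, and mixed content, which this sketch ignores.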

  28. DTD: Attributes
      • Document Type Definition:
        <!ELEMENT person (ssn, name, office, phone?)>
        <!ATTLIST person
          age CDATA #REQUIRED
          birthdate CDATA #IMPLIED
          nationality CDATA #FIXED "CH"
          gender (male|female) "female">
      • Document:
        <person age="24" nationality="CH" gender="male"> <ssn> … </ssn> … <phone> … </phone> </person>
      • age is mandatory (#REQUIRED, which cannot be combined with a default value); birthdate is optional (#IMPLIED); nationality has a fixed value; gender is an enumeration with default "female"

  29. DTD: Entities • DTD: <!ENTITY address SYSTEM "address.xml"> (external entity) <!ENTITY name "<name>Tim Berners Lee</name>"> (internal entity) • Document: <celebrity>&name;&address;</celebrity>

  30. Inclusion of DTD in Documents
      • External DTD declaration (a PUBLIC identifier is followed directly by the system URL, without the SYSTEM keyword):
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" "http://www.test.org/test.dtd">
        <test> "test" is a document element </test>
      • Internal DTD declaration:
        <!DOCTYPE test [ <!ELEMENT test EMPTY> ]>
        <test/>
      • Mixed usage:
        <!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [ <!ENTITY hello "hello world"> ]>
        <test>&hello;</test>

  31. XML Namespaces • Different DTDs can use the same names! • How to avoid conflicts when combining names from different DTDs? • An XML namespace is a collection of names (markup vocabulary) • identified by a URI reference, which a document abbreviates with a prefix

  32. XML Namespaces • name ::= [prefix:]localname <book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'> <title> … </title> <number> 15 </number> <isbn:number> …. </isbn:number> </book> • xmlns='urn:loc.gov:book' declares the default namespace • unprefixed names (book, title, number) belong to the default namespace

  33. XML Namespaces • syntactic: <number> , <isbn:number> • semantic: URL used as unique identifier • URL may not exist, has no function <tag xmlns:mystyle="http://…"> … <mystyle:title> … </mystyle:title> <mystyle:number> … </mystyle:number> </tag> • mystyle:title and mystyle:number belong to this namespace
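A namespace-aware parser replaces each prefix with the URI it is bound to, which makes the "syntactic versus semantic" distinction visible. A sketch in Python with the standard-library `xml.etree.ElementTree`, using the book example from slide 32; the title text and the ISBN value are placeholders, since the slide elides them, and the prefixes `b` and `i` in the lookup table are arbitrary local choices:

```python
import xml.etree.ElementTree as ET

doc = """
<book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'>
  <title>placeholder title</title>
  <number>15</number>
  <isbn:number>placeholder-isbn</isbn:number>
</book>
"""
root = ET.fromstring(doc)

# ElementTree spells a qualified name as {namespace-uri}localname.
tags = [child.tag for child in root]
print(tags)

# Queries bind their own prefixes to the URIs; the document's prefixes do not matter.
ns = {"b": "urn:loc.gov:book", "i": "www.isbn-org.org/def"}
book_number = root.find("b:number", ns).text
isbn_number = root.find("i:number", ns).text
print(book_number, isbn_number)
```

The two `number` elements no longer collide because their expanded names differ, even though neither URI is ever dereferenced.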

  34. Querying XML Data • XPath = simple navigation through the tree • XQuery = the SQL of XML • XSLT = recursive traversal • will not discuss • XQuery and XSLT build on XPath

  35. Sample Data for Queries <bib> <book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year> </book> <book price="55"> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year> </book> </bib>

  36. Data Model for XPath [Figure: tree with the root at the top and the root element bib below it; bib has two book children; the first book shows publisher (Addison-Wesley), author (Serge Abiteboul), . . . .]

  37. XPath: Simple Expressions /bib/book/year Result: <year> 1995 </year> <year> 1998 </year> /bib/paper/year Result: empty (there were no papers)

  38. XPath: Restricted Kleene Closure //author Result: <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author> /bib//first-name Result: <first-name> Rick </first-name>
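The path expressions on slides 37 and 38 can be tried out with Python's standard-library `xml.etree.ElementTree`, which implements a limited subset of XPath. In this sketch the queries are evaluated relative to the `bib` root element, so `/bib/book/year` is written `book/year` and `//author` is written `.//author`; the sample data is the document from slide 35 with whitespace trimmed:

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""
bib = ET.fromstring(BIB)

years = [y.text for y in bib.findall("book/year")]            # /bib/book/year
paper_years = bib.findall("paper/year")                       # /bib/paper/year
authors = bib.findall(".//author")                            # //author
first_names = [e.text for e in bib.findall(".//first-name")]  # /bib//first-name

print(years, paper_years, len(authors), first_names)
```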

  39. XPath: Text Nodes /bib/book/author/text() Result: Serge Abiteboul Victor Vianu Jeffrey D. Ullman • Rick Hull doesn’t appear because his author element contains first-name and last-name elements rather than text • Functions in XPath: • text() = matches the text value • node() = matches any node (= * or @* or text()) • name() = returns the name of the current tag

  40. XPath: Wildcard //author/* Result: <first-name> Rick </first-name> <last-name> Hull </last-name> • * matches any element

  41. XPath: Attribute Nodes /bib/book/@price Result: "55" • @price means that price has to be an attribute

  42. XPath: Predicates /bib/book/author[first-name] Result: <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
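An existence predicate keeps only the nodes that have the named child or attribute. A sketch in Python with the standard-library `xml.etree.ElementTree`, whose XPath subset covers `[tag]` and `[@attribute]` predicates; the sample data from slide 35 spells the child element `first-name`, and paths are written relative to the `bib` root element:

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""
bib = ET.fromstring(BIB)

# Authors that have a first-name child element.
structured_authors = bib.findall("book/author[first-name]")
last_names = [a.findtext("last-name") for a in structured_authors]

# Books that carry a price attribute at all.
priced_titles = [b.findtext("title") for b in bib.findall("book[@price]")]

print(last_names, priced_titles)
```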

  43. XPath: More Predicates /bib/book/author[first-name][address[.//zip][city]]/last-name Result: <last-name> … </last-name> <last-name> … </last-name>

  44. XPath: More Predicates /bib/book[@price < "60"] /bib/book[author/@age < "25"] /bib/book[author/text()]
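Comparison predicates such as `[@price < "60"]` need a full XPath engine; the XPath subset in Python's standard-library `xml.etree.ElementTree` does not include `<`. A sketch of the same selection, assuming it is acceptable to pick the candidates with `[@price]` and apply the numeric comparison in Python; the data is the document from slide 35 with whitespace trimmed:

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""
bib = ET.fromstring(BIB)

# /bib/book/@price : attribute values are read with .get() in ElementTree.
prices = [b.get("price") for b in bib.findall("book[@price]")]

# /bib/book[@price < "60"] : filter on the attribute converted to a number.
cheap_titles = [
    b.findtext("title")
    for b in bib.findall("book[@price]")
    if float(b.get("price")) < 60
]

print(prices, cheap_titles)
```

The first book has no price attribute, so it is excluded before the comparison is ever attempted, which mirrors how the XPath predicate behaves on a missing attribute.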

  45. XPath: Summary
      • bib matches a bib element
      • * matches any element
      • / matches the root element
      • /bib matches a bib element under root
      • bib/paper matches a paper in bib
      • bib//paper matches a paper in bib, at any depth
      • //paper matches a paper at any depth
      • paper|book matches a paper or a book
      • @price matches a price attribute
      • bib/book/@price matches price attribute in book, in bib
      • bib/book[@price<"55"]/author/lastname matches…
