1 / 53

Introduction to Semistructured Data and XML

Introduction to Semistructured Data and XML. Chapter 27. How the Web is Today. HTML documents often generated by applications consumed by humans only easy access: across platforms, across organizations No application interoperability: HTML not understood by applications

Download Presentation

Introduction to Semistructured Data and XML

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Semistructured Data and XML Chapter 27

  2. How the Web is Today • HTML documents • often generated by applications • consumed by humans only • easy access: across platforms, across organizations • No application interoperability: • HTML not understood by applications • Database technology: client-server

  3. New Universal Data Exchange Format: XML A recommendation from the W3C • XML = data • XML generated by applications • XML consumed by applications • Easy access: across platforms, organizations

  4. Paradigm Shift on the Web • From documents (HTML) to data (XML) • From information retrieval to data management • For databases, also a paradigm shift: • from relational model to semistructured data • from data processing to data/query translation • from storage to transport

  5. HTML • HTML is widely used for formatting and structuring Web documents. • Designed to describe how a Web browser should arrange text, images and push-buttons on a page. • Easy to learn, but does not convey structure and meaning of data in the Web pages. • Fixed tag set. Text (PCDATA) Opening tag • <HTML> • <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD> • <BODY> • <H1>Introduction</H1> • <IMGSRC=”dragon.jpeg"WIDTH="200"HEIGHT="150” > • </BODY> • </HTML> Closing tag “Bachelor” tag Attribute name Attribute value

  6. Semistructure data • Information integration: important new application that motivates what follows. • Semistructured data: a new data model designed to cope with problems of information integration. • XML (Extensible Markup Language) : a new Web standard that is essentially semistructured data. • XQUERY: an emerging standard query language for XML data.

  7. Information Integration Problem: related data exists in many places. They talk about the same things, but differ in model, schema, conventions (e.g., terminology). Example: In the real world, every bar has its own database. • Some may have relations like beer-price; others have an Microsoft Word file from which the menu is printed. • Some keep phones of manufacturers but not addresses. • Some distinguish beers and ales; others do not.

  8. The Semistructured Data Model Bib Object Exchange Model (OEM) &o1 complex object paper paper book references &o12 &o24 &o29 references references author page author year author title http title title publisher author author author &o43 &25 &96 1997 last firstname firstname lastname first lastname &243 &206 “Serge” “Abiteboul” “Victor” 122 133 “Vianu” atomic object

  9. Characteristics of Semistructured Data • Missing or additional attributes • Multiple attributes • Different types in different objects • Heterogeneous collections Self-describing, irregular data, no a priori structure

  10. { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } } row row row name phone name phone name phone “John” 3634 “Sue” 6343 “Dick” 6363 Comparison with Relational Data

  11. XML (Extensible Markup Language) • A W3C standard to complement HTML • Origins: Structured text SGML • Large-scale electronic publishing • Data exchange on the web • Motivation: • HTML describes presentation • XML describes content

  12. From HTML to XML HTML describes the presentation

  13. HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

  14. XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content

  15. Why are we DB’ers interested? • It’s data. That’s us. • Database issues: • How are we going to model XML? (graphs). • How are we going to query XML? (XQuery) • How are we going to store XML (in a relational database? object-oriented? native?) • How are we going to process XML efficiently? (many interesting research questions!)

  16. XML Terminology • Tags: book, title, author, … • start tag: <book>, end tag: </book> • Elements: <book>…<book>,<author>…</author> • elements can be nested • empty element: <red></red> (Can be abbrv. <red/>) • XML document: Has a single root element • Well-formed XML document: Has matching tags • Valid XML document: conforms to a schema

  17. Well-Formed XML 1. Declaration = <? ... ?> . • Normal declaration is<? XML VERSION = "1.0" STANDALONE = "yes" ?> • “Standalone” means that there is no DTD specified. 2. Root tag surrounds the entire balance of the document. • <FOO> is balanced by </FOO>, as in HTML. 3. Any balanced structure of tags OK. • Option of tags that don’t require balance, like <P> in HTML.

  18. XML: An Example <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <BOOKLIST> <BOOK genre="Science" format="Hardcover"> <AUTHOR> <FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME> </AUTHOR> <TITLE>The Character of Physical Law</TITLE> <PUBLISHED>1980</PUBLISHED> </BOOK> <BOOK genre="Fiction"> <AUTHOR> <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME> </AUTHOR> <TITLE>Waiting for the Mahatma</TITLE> <PUBLISHED>1981</PUBLISHED> </BOOK> <BOOK genre="Fiction"> <AUTHOR> <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME> </AUTHOR> <TITLE>The English Teacher</TITLE> <PUBLISHED>1980</PUBLISHED> </BOOK> </BOOKLIST>

  19. attribute closing tag open tag data attribute value element name XML – Elements <BOOK genre="Science" format="Hardcover">…</BOOK> • Xml is case and space sensitive • Element opening and closing tag names must be identical • Opening tags: “<” + element name + “>” • Closing tags: “</” + element name + “>” • Empty Elements have no data and no closing tag: • They begin with a “<“ and end with a “/>” <BOOK/>

  20. attribute open tag closing tag attribute value data element name XML – Attributes <BOOK genre="Science" format="Hardcover">…</BOOK> • Attributes provide additional information for element tags. • There can be zero or more attributes in every element; each one has the the form: attribute_name=‘attribute_value’ • There is no space between the name and the “=‘” • Attribute values must be surrounded by “ or ‘ characters • Multiple attributes are separated by white space (one or more spaces or tabs).

  21. Elements The segment of an XML document between an opening and a corresponding closing tag is called an element. <person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <tel> (215) 898 4321 </tel> <email> mp@dcs.gla.ac.sc </email> </person> element element, a sub-element of not an element

  22. attribute open tag closing tag attribute value element name data XML – Data and Comments <BOOK genre="Science" format="Hardcover">…</BOOK> • Xml data is any information between an opening and closing tag • Xml data must not contain the ‘<‘ or ‘>’ characters • Comments:<!- comment ->

  23. XML text XML has only one “basic” type -- text. It is bounded by tags, e.g. <title> The Big Sleep </title> <year> 1935 </ year> --- 1935 is still text XML text is called PCDATA (for parsed character data). It uses a 16-bit encoding.

  24. XML – Nesting & Hierarchy • Xml tags can be nested in a tree hierarchy • Xml documents can have only one root tag • Between an opening and closing tag you can insert: 1. Data 2. More Elements 3. A combination of data and elements <root> <tag1> Some Text <tag2>More</tag2> </tag1> </root>

  25. projects: title budget managedBy employees: name ssn age Representing relational DBs:Two ways

  26. Project and Employee relations in XML Projects and employees are intermixed <db> <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe</managedBy> </project> <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 < /age> </employee> <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee> <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project> : </db>

  27. Project and Employee relations in XML (cont’d) Employees follows projects <db> <projects> <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy>Joe </managedBy> </project> <project> <title>Auto guided vehicles</title> <budget> 70000 </budget> <managedBy>Sandra</managedBy> </project> : </projects> <employees> <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee> <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age>35 </age> </employee> : <employees> </db>

  28. More XML: Oids and References <personid=“o555”> <name> Jane </name> </person> <personid=“o456”> <name> Mary </name> <childrenidref=“o123 o555”/> </person> <personid=“o123” mother=“o456”><name>John</name> </person> oids and references in XML are just syntax

  29. XML Data Model (Graph)

  30. Document Type Descriptors • Sort of like a schema but not really. • Inherited from SGML DTD standard • BNF grammar establishing constraints on element structure and content • Definitions of entities

  31. DTD – An Example <?xml version='1.0'?> <!ELEMENT Basket (Cherry+, (Apple | Orange)*) > <!ELEMENT Cherry EMPTY> <!ATTLIST Cherry flavor CDATA #REQUIRED> <!ELEMENT Apple EMPTY> <!ATTLIST Apple color CDATA #REQUIRED> <!ELEMENT Orange EMPTY> <!ATTLIST Orange location ‘Florida’> -------------------------------------------------------------------------------- <Basket> <Cherry flavor=‘good’/> <Apple color=‘red’/> <Apple color=‘green’/> </Basket> <Basket> <Apple/> <Cherry flavor=‘good’/> <Orange/> </Basket>

  32. DTD - !ELEMENT <!ELEMENT Basket (Cherry+, (Apple | Orange)*) > • !ELEMENT declares an element name, and what children elements it should have • Content types: • Other elements • #PCDATA (parsed character data) • EMPTY (no content) • ANY (no checking inside this structure) • A regular expression Name Children

  33. DTD - !ELEMENT (Contd.) • A regular expression has the following structure: • exp1, exp2, exp3, …, expk: A list of regular expressions • exp*: An optional expression with zero or more occurrences • exp+: An optional expression with one or more occurrences • exp1 | exp2 | … | expk: A disjunction of expressions

  34. DTD - !ATTLIST <!ATTLIST Cherry flavor CDATA #REQUIRED> <!ATTLIST Orange location CDATA #REQUIRED color ‘orange’> • !ATTLISTdefines a list of attributes for an element • Attributes can be of different types, can be required or not required, and they can have default values. Element Attribute Type Flag

  35. DTD – Well-Formed and Valid <?xml version='1.0'?> <!ELEMENT Basket (Cherry+)> <!ELEMENT Cherry EMPTY> <!ATTLIST Cherry flavor CDATA #REQUIRED> -------------------------------------------------------------------------------- Not Well-Formed <basket> <Cherry flavor=good> </Basket> Well-Formed but Invalid <Job> <Location>Home</Location> </Job> Well-Formed and Valid <Basket> <Cherry flavor=‘good’/> </Basket>

  36. Example: An Address Book Exactly one name <person> <name> MacNiel, John </name> <greet> Dr. John MacNiel </greet> <addr>1234 Huron Street </addr> <addr> Rome, OH 98765 </addr> <tel> (321) 786 2543 </tel> <fax> (321) 786 2543 </fax> <tel> (321) 786 2543 </tel> <email> jm@abc.com </email> </person> At most one greeting As many address lines as needed (in order) Mixed telephones and faxes As many as needed

  37. Specifying the structure • name to specify a nameelement • greet? to specify an optional (0 or 1) greet elements • name,greet? to specify a name followed by an optional greet

  38. Specifying the structure (cont) • addr* to specify 0 or more address lines • tel | faxa telor a fax element • (tel | fax)* 0 or more repeats of tel or fax • email*0 or more email elements

  39. A DTD for the address book <!DOCTYPE addressbook [ <!ELEMENT addressbook (person*)> <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)> <!ELEMENT name (#PCDATA)> <!ELEMENT greet (#PCDATA)> <!ELEMENT address (#PCDATA)> <!ELEMENT tel (#PCDATA)> <!ELEMENT fax (#PCDATA)> <!ELEMENT email (#PCDATA)> ]>

  40. DTD for the example relational DB <!DOCTYPE db [ <!ELEMENT db (projects,employees)> <!ELEMENT projects (project*)> <!ELEMENT employees (employee*)> <!ELEMENT project (title, budget, managedBy)> <!ELEMENT employee (name, ssn, age)> ... ]>

  41. Summary of XML regular expressions • Each element name is a tag. • Its components are the tags that appear nested within, in the order specified. • A The tag A occurs • e1,e2 The expression e1 followed by e2 • e* 0 or more occurrences of e • e? Optional -- 0 or 1 occurrences • e+ 1 or more occurrences • e1 | e2 either e1 or e2 • (e) grouping

  42. XML Querying Path Expressions : • Bib.paper • Bib.book.publisher • Bib.paper.author.lastname Given an OEM instance, the value of a path expression p is a set of objects

  43. Bib &o1 paper paper book references &o12 &o24 &o29 references references author page author year author title http title title publisher author author author &o43 &25 &o44 &o45 &o46 &o52 &96 1997 &o51 &o50 &o49 &o47 &o48 last firstname firstname lastname first lastname &o70 &o71 &243 &206 “Serge” “Abiteboul” “Victor” 122 133 “Vianu” Path Expressions Examples: DB = Bib.paper={&o12,&o29} Bib.book.publisher={&o51} Bib.paper.author.lastname={&o71,&206}

  44. XQuery Emerging standard for querying XML documents. Basic form: FOR <variables ranging oversets of elements> WHERE <condition> RETURN <set of elements>; • Sets of elements described by paths, consisting of: • URL, if necessary. • Element names forming a path in the semistructured data graph, e.g., //BAR/NAME =“start at any BAR node and go to a NAME child.” • Ending condition of the form [<condition about subelements, @attributes, and values>]

  45. XQuery Overview: • FOR-LET-WHERE-ORDERBY-RETURN = FLWOR FOR/LET Clauses List of tuples WHERE Clause List of tuples ORDERBY/RETURN Clause Instance of Xquery data model

  46. XQuery • FOR$x in expr -- binds $x to each value in the list expr • LET$x = expr -- binds $x to the entire list expr • Useful for common subexpressions and for aggregations

  47. FOR v.s. LET Returns: <result> <book>...</book></result> <result> <book>...</book></result> <result> <book>...</book></result> ... FOR$xINdocument("bib.xml")/bib/book RETURN <result> $x </result> LET$xINdocument("bib.xml")/bib/book RETURN <result> $x </result> Returns: <result> <book>...</book> <book>...</book> <book>...</book> ... </result>

  48. XQuery Find all book titles published after 1995: FOR$xINdocument("bib.xml")/bib/book WHERE$x/year > 1995 RETURN$x/title Result: <title> abc </title> <title> def </title> <title> ghi </title>

  49. XQuery For each author of a book by Morgan Kaufmann, list all books s/he published: FOR$aINdistinct(document("bib.xml")/bib/book[publisher=“Morgan Kaufmann”]/author) RETURN <result> $a, FOR$tIN /bib/book[author=$a]/title RETURN$t </result> distinct = a function that eliminates duplicates

  50. XQuery Result: <result> <author>Jones</author> <title> abc </title> <title> def </title> </result> <result> <author> Smith </author> <title> ghi </title> </result>

More Related