CIS 550, Fall 2001
Handout 6: XML
Outline (ambitious)
• Background: documents (SGML/HTML) and databases (structured and semistructured data)
• XML basics and Document Type Definitions (DTDs)
• XML query languages: XML-QL and XSL
• XML additions: XLink, XPointer, RDF, SOX, XML-Data
• Document Object Model (XML APIs)
Some Useful Articles
• XML, Java, and the future of the web: http://webreview.com/wr/pub/97/12/19/xml/index.html
• XML and the Second-Generation Web: http://www.sciam.com/1999/0599issue/0599bosak.html
• Articles/standards for XML, XSL, XML-QL: http://www.w3c.org/ and http://www.w3.org/TR/REC-xml
Part I: Background
What is the difference between the world of documents and information retrieval, and the world of databases and query interfaces?
Documents vs Databases

Document world                                  Database world
> plenty of small documents                     > a few large databases
> usually static                                > usually dynamic
> implicit structure: section, paragraph, toc   > explicit structure (schema)
> tagging                                       > records
> human friendly                                > machine friendly
> content: form/layout, annotation              > content: schema, data, methods
> paradigms: "Save as", wysiwyg                 > paradigms: Atomicity, Consistency, Isolation, Durability
> meta-data: author name, date, subject         > meta-data: schema description
What to do with them

Documents                  Databases
editing                    updating
printing                   cleaning
spell-checking             querying
counting words             composing/transforming
retrieving (IR)
searching
HTML
• Lingua franca for publishing hypertext on the World Wide Web.
• Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
• Easy to learn, but does not convey structure.
• Fixed tag set.

  <HTML>
  <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
  <BODY>
  <H1>Introduction</H1>
  <IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
  </BODY>
  </HTML>

Parts labeled on the slide:
• Opening tag, e.g. <H1>
• Text (PCDATA), e.g. Introduction
• Closing tag, e.g. </H1>
• "Bachelor" tag (no closing tag), e.g. <IMG ...>
• Attribute name, e.g. SRC
• Attribute value, e.g. "dragon.jpeg"
Thin red line
• The line between the document world and the database world is not clear.
• In some cases, both approaches are legitimate.
• An interesting middle ground is data formats, of which XML is an example.
• Examples:
  • Personal address book
  • Swissprot
  • ASN.1
Personal address book over 20 years

1977
N Achison, Malcolm
F Dr. M.P. Achison
A Dept. of Computer Science
A University of Edinburgh
A Kings Buildings
A Edinburgh E12 8QQ
A Scotland
T 031-123-8855 ext. 4359 (work)
T 031-345-7570 (home)
N Albani, Paolo
F Prof. Paolo Albani
A Dip. Informatica e Sistemistica
A Universita di Roma La Sapienza
...

1980
N Achison, Malcolm
F Dr. M.P. Achison
A Dept. of Computer Science
....
T 031-667-7570 (home)
C mpa@uk.ac.ed.cs

1990
N Achison, Malcolm
F Prof. M.P. Achison
A Dept. of Computing Science
A University of Glasgow
A Lilybank Gardens
A Glasgow G12 8QQ
A Scotland
T 041-339-8855 ext. 4359
T 041-357-3787 (private)
T 031-667-7570 (home)
X 041-339-0090
C mpa@uk.ac.gla.cs
N Achison, Malcolm
F Prof. M.P. Achison
A 34 Inverness Place
A Edinburgh, EH3 8UV

1997
N Achison, Malcolm
F Prof. M.P. Achison
A Department of Computing Science
...
T 031-667-7570 (home)
X 041-339-0090
C mpa@dcs.gla.ac.uk
W http://www.dcs.gla.ac.uk/mpa

2000
?
Swissprot

ID 11SB_CUCMA STANDARD; PRT; 480 AA.
AC P13744;
DT 01-JAN-1990 (REL. 13, CREATED)
DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC VIOLALES; CUCURBITACEAE.
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX MEDLINE; 88166744.
RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARANISHIMURA I.;
RL EUR. J. BIOCHEM. 172:627-632(1988).
RN [2]
RP SEQUENCE OF 22-30 AND 297-302.
RA OHMIYA M., HARA I., MASTUBARA H.;
RL PLANT CELL PHYSIOL. 21:157-167(1980).
Swissprot (cont'd)

CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC DISULFIDE BOND.
CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR EMBL; M36407; G167492; -.
DR PIR; S00366; FWPU1B.
DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.
KW SEED STORAGE PROTEIN; SIGNAL.
FT SIGNAL 1 21
FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.
FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).
FT CHAIN 297 480 DELTA CHAIN (BASIC).
FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.
FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT CONFLICT 27 27 S -> E (IN REF. 2).
FT CONFLICT 30 30 E -> S (IN REF. 2).
SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;
MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV ...
The Structure of XML
• XML consists of tags and text.
• Tags come in pairs: <date> ... </date>
• They must be properly nested:
  <date> <day> ... </day> ... </date>   --- good
  <date> <day> ... </date> ... </day>   --- bad
  (You can't do <i> ... <b> ... </i> ... </b> in HTML either.)
XML text
XML has only one "basic" type: text. It is bounded by tags, e.g.
  <title> The Big Sleep </title>
  <year> 1935 </year>   --- 1935 is still text
XML text is called PCDATA (for parsed character data). Its characters are Unicode (treated as a 16-bit encoding at the time of these slides) and may be written as numeric character references, e.g. &#x05DE; for the Hebrew letter Mem.
Later we shall see how new types are specified by XML-Data.
XML structure
Nested tags can be used to express various structures. E.g. a tuple (record):
  <person>
    <name>Malcolm Atchison</name>
    <tel>(215) 898 4321</tel>
    <email>mp@dcs.gla.ac.sc</email>
  </person>
XML structure (cont.)
• We can represent a list by using the same tag repeatedly:
  <addresses>
    <person> ... </person>
    <person> ... </person>
    <person> ... </person>
    ...
  </addresses>
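Terminology
The segment of an XML document between an opening tag and its corresponding closing tag is called an element.
  <person>
    <name> Malcolm Atchison </name>
    <tel> (215) 898 4321 </tel>
    <tel> (215) 898 4321 </tel>
    <email> mp@dcs.gla.ac.sc </email>
  </person>
Labels on the slide:
• <person> ... </person>: an element.
• <name> ... </name>: an element, a sub-element of <person>.
• A fragment that does not run from an opening tag to its corresponding closing tag: not an element.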
XML is tree-like

  person
    name    Malcolm Atchison
    tel     (215) 898 4321
    tel     (215) 898 4321
    email   mp@dcs.gla.ac.sc

Semistructured data models typically put the labels on the edges.
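The tree view can be made concrete with a few lines of Java. This is a minimal sketch, assuming the JAXP parser bundled with the JDK (javax.xml.parsers), which the slides do not name; the class TreeSketch and its methods are hypothetical names chosen for illustration. It prints one line per element and per non-blank text node, indented by depth.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TreeSketch {
    // Parses an XML string with the JDK's default DOM parser (an assumption of this sketch).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Appends the tag of each element and each non-blank text node, one per line.
    static void render(Node node, String indent, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(indent).append('"').append(text).append("\"\n");
            }
        }
        NodeList kids = node.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            render(kids.item(i), indent + "  ", out);
        }
    }

    static String render(String xml) {
        StringBuilder out = new StringBuilder();
        render(parse(xml).getDocumentElement(), "", out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(
            "<person><name>Malcolm Atchison</name><tel>(215) 898 4321</tel>"
            + "<email>mp@dcs.gla.ac.sc</email></person>"));
    }
}
```

Note that the text values appear as leaf nodes one level below their enclosing element, which is the node-labeled view; an edge-labeled semistructured model would instead attach name, tel and email to the edges.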
Mixed Content
An element may contain a mixture of sub-elements and PCDATA:
  <airline>
    <name> British Airways </name>
    <motto> World's <dubious> favorite </dubious> airline </motto>
  </airline>
Data of this form is not typically generated from databases. It is needed for consistency with HTML.
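To see what mixed content looks like to a program, the following sketch lists the children of an element and reports whether each is text or a sub-element. It assumes the JAXP parser bundled with the JDK; MixedContentSketch and describeChildren are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MixedContentSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Describes each child of the first element named `tag` as "text:..." or "element:...".
    static List<String> describeChildren(String xml, String tag) {
        Element e = (Element) parse(xml).getElementsByTagName(tag).item(0);
        List<String> out = new ArrayList<>();
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (k.getNodeType() == Node.TEXT_NODE) {
                out.add("text:" + k.getNodeValue().trim());
            } else if (k.getNodeType() == Node.ELEMENT_NODE) {
                out.add("element:" + k.getNodeName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<airline><name>British Airways</name>"
                + "<motto>World's <dubious>favorite</dubious> airline</motto></airline>";
        System.out.println(describeChildren(xml, "motto"));
    }
}
```

The motto element has three children: a text node, the dubious element, and another text node. Code that assumes "an element holds either text or sub-elements" will mishandle such data.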
A Complete XML Document
  <?xml version="1.0"?>
  <person>
    <name> Malcolm Atchison </name>
    <tel> (215) 898 4321 </tel>
    <email> mp@dcs.gla.ac.sc </email>
  </person>
Two ways of representing a DB
  projects:  title, budget, managedBy
  employees: name, ssn, age
Project and Employee relations in XML
Projects and employees are intermixed:
  <db>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
    <project>
      <title> Auto guided vehicle </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
    :
  </db>
Project and Employee relations in XML (cont'd)
Employees follow projects:
  <db>
    <projects>
      <project>
        <title> Pattern recognition </title>
        <budget> 10000 </budget>
        <managedBy> Joe </managedBy>
      </project>
      <project>
        <title> Auto guided vehicles </title>
        <budget> 70000 </budget>
        <managedBy> Sandra </managedBy>
      </project>
      :
    </projects>
    <employees>
      <employee>
        <name> Joe </name>
        <ssn> 344556 </ssn>
        <age> 34 </age>
      </employee>
      <employee>
        <name> Sandra </name>
        <ssn> 2234 </ssn>
        <age> 35 </age>
      </employee>
      :
    </employees>
  </db>
Project and Employee relations in XML (cont'd)
Or without "separator" tags:
  <db>
    <projects>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
      :
    </projects>
    <employees>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
      :
    </employees>
  </db>
Attributes
An (opening) tag may contain attributes. These are typically used to describe the content of an element:
  <entry>
    <word language="en"> cheese </word>
    <word language="fr"> fromage </word>
    <word language="ro"> branza </word>
    <meaning> A food made … </meaning>
  </entry>
Attributes (cont'd)
Another common use for attributes is to express dimension or type:
  <picture>
    <height dim="cm"> 2400 </height>
    <width dim="in"> 96 </width>
    <data encoding="gif" compression="zip">
      M05-.+C$@02!G96YE<FEC ...
    </data>
  </picture>
A document that obeys the "nested tags" rule and does not repeat an attribute within a tag is said to be well-formed.
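Well-formedness can be checked mechanically: any conforming parser must reject a document that breaks the nesting rule or repeats an attribute. The sketch below assumes the JAXP parser bundled with the JDK; WellFormedSketch and isWellFormed are hypothetical names. A DefaultHandler is installed only to keep the parser from printing its own error messages.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedSketch {
    // Returns true if the parser accepts the text, false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors silently
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document violates a well-formedness rule
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<date><day>1</day></date>"));   // properly nested
        System.out.println(isWellFormed("<date><day>1</date></day>"));   // crossed tags
        System.out.println(isWellFormed("<height dim=\"cm\" dim=\"in\">1</height>")); // repeated attribute
    }
}
```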
When to use attributes
It's not always clear when to use attributes. The same fact can be an attribute:
  <person ssno="123 45 6789">
    <name> F. MacNiel </name>
    <email> fmacn@dcs.barra.ac.sc </email>
    ...
  </person>
or a sub-element:
  <person>
    <ssno>123 45 6789</ssno>
    <name> F. MacNiel </name>
    <email> fmacn@dcs.barra.ac.sc </email>
    ...
  </person>
Using IDs
  <family>
    <person id="jane" mother="mary" father="john">
      <name> Jane Doe </name>
    </person>
    <person id="john" children="jane jack">
      <name> John Doe </name>
      <mother/>
    </person>
    <person id="mary" children="jane jack">
      <name> Mary Doe </name>
    </person>
    <person id="jack" mother="mary" father="john">
      <name> Jack Doe </name>
    </person>
  </family>
ODL schema
  class Movie ( extent Movies, key title ) {
    attribute string title;
    attribute string director;
    relationship set<Actor> casts inverse Actor::acted_In;
    attribute int budget;
  };
  class Actor ( extent Actors, key name ) {
    attribute string name;
    relationship set<Movie> acted_In inverse Movie::casts;
    attribute int age;
    attribute set<string> directed;
  };
An example
  <db>
    <movie id="m1">
      <title>Waking Ned Divine</title>
      <director>Kirk Jones III</director>
      <cast idrefs="a1 a3"></cast>
      <budget>100,000</budget>
    </movie>
    <movie id="m2">
      <title>Dragonheart</title>
      <director>Rob Cohen</director>
      <cast idrefs="a2 a9 a21"></cast>
      <budget>110,000</budget>
    </movie>
    <movie id="m3">
      <title>Moondance</title>
      <director>Dagmar Hirtz</director>
      <cast idrefs="a1 a8"></cast>
      <budget>90,000</budget>
    </movie>
    :
    <actor id="a1">
      <name>David Kelly</name>
      <acted_In idrefs="m1 m3 m78"></acted_In>
    </actor>
    <actor id="a2">
      <name>Sean Connery</name>
      <acted_In idrefs="m2 m9 m11"></acted_In>
      <age>68</age>
    </actor>
    <actor id="a3">
      <name>Ian Bannen</name>
      <acted_In idrefs="m1 m35"></acted_In>
    </actor>
    :
  </db>
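Following an idrefs attribute takes some work when no DTD declares which attributes are IDs. The sketch below assumes the JAXP parser bundled with the JDK and builds its own index from id values to elements; IdrefSketch, indexById and castNames are hypothetical names, and references to actors that are not present in the document (the elided ones above) are simply skipped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class IdrefSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Maps every "id" attribute value to the element that carries it.
    static Map<String, Element> indexById(Document doc) {
        Map<String, Element> byId = new HashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("id")) byId.put(e.getAttribute("id"), e);
        }
        return byId;
    }

    // Names of the actors referenced by <cast idrefs="..."> in the given movie.
    static List<String> castNames(String xml, String movieId) {
        Map<String, Element> byId = indexById(parse(xml));
        List<String> names = new ArrayList<>();
        Element movie = byId.get(movieId);
        if (movie == null) return names;
        Element cast = (Element) movie.getElementsByTagName("cast").item(0);
        if (cast == null) return names;
        for (String ref : cast.getAttribute("idrefs").trim().split("\\s+")) {
            Element actor = byId.get(ref); // null for dangling references
            if (actor != null) {
                names.add(actor.getElementsByTagName("name").item(0).getTextContent().trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<db><movie id=\"m1\"><title>Waking Ned Divine</title>"
                + "<cast idrefs=\"a1 a3\"></cast></movie>"
                + "<actor id=\"a1\"><name>David Kelly</name></actor>"
                + "<actor id=\"a3\"><name>Ian Bannen</name></actor></db>";
        System.out.println(castNames(xml, "m1"));
    }
}
```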
Part II: The Document Object Model (DOM)
Programming with XML
XML Parsers
• Traditional: build a data structure (DOM)
• Event based: SAX (Simple API for XML)
  • http://www.megginson.com/SAX
  • write a handler for the start tag and for the end tag
Programming in XML: why we need database technology
• Let's examine some elements of the Document Object Model. It provides an interface to parsed data.
DOM: overview
• Interface to parsed XML
• "… language-neutral ..." interface (IDL)
• Level 1: functionality for XML document navigation and manipulation
• Level 2: stylesheets and namespaces
• Level 3 (future): document loading and saving; DTDs and schemas
http://www.w3.org/DOM/
DOM representation: a tree of nodes

  <partdb>
    <part id="a123" units="mks">
      <name> widget </name>
      <weight> 0.454 </weight>
    </part>
  </partdb>

Node tree shown on the slide:
  Document
    Element        getTagName() = "part"
      Attr         getName() = "id",    getValue() = "a123"
      Attr         getName() = "units", getValue() = "mks"
      Element      getTagName() = "name"
        CharacterData  getData() = "widget"
      Element      getTagName() = "weight"
        CharacterData  getData() = "0.454"
Node interface
  public interface Node {
    ...
    public String getNodeName();
    ...
    public NodeList getChildNodes();
    ...
    public NamedNodeMap getAttributes();
    ...
  }
Sub-elements: an "array"
  public interface NodeList {
    public Node item(int index);
    public int getLength();
  }
Attributes: a "dictionary"
  public interface NamedNodeMap {
    public Node getNamedItem(String name);
    ...
  }
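The two interfaces can be exercised on the part example from the previous slide. This is a minimal sketch, assuming the JAXP parser bundled with the JDK; NodeApiSketch and describePart are hypothetical names. Attributes are looked up by name through NamedNodeMap, while sub-elements are visited by position through NodeList.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NodeApiSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Summarizes the first <part>: its id and units attributes, then each sub-element.
    static String describePart(String xml) {
        Element part = (Element) parse(xml).getElementsByTagName("part").item(0);
        StringBuilder out = new StringBuilder();

        NamedNodeMap attrs = part.getAttributes();          // the "dictionary"
        out.append("id=").append(attrs.getNamedItem("id").getNodeValue());
        out.append(" units=").append(attrs.getNamedItem("units").getNodeValue());

        NodeList kids = part.getChildNodes();               // the "array"
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (k.getNodeType() == Node.ELEMENT_NODE) {     // skip whitespace text nodes
                out.append(' ').append(k.getNodeName())
                   .append('=').append(k.getTextContent().trim());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describePart(
            "<partdb><part id=\"a123\" units=\"mks\">"
            + "<name> widget </name><weight> 0.454 </weight></part></partdb>"));
    }
}
```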
A common form of data extraction
Find the names and telephones of all employees in Math.

Input:
  <doc1>
    <employee>
      <name> John Doe </name>
      <contact-info>
        <address> … </address>
        <tel> 123 7456 </tel>
        <email> jd@abc.edu </email>
      </contact-info>
      <dept> Math </dept>
    </employee>
    <employee> … </employee>
    ...
  </doc1>

Output:
  John Doe   123 7456
  Jane Dee   234 5678
  …          ...
Top-level traversal in DOM
  import java.io.FileInputStream;
  import org.w3c.dom.*;

  public class Test {
    public static void main(String args[]) throws Exception {
      // Parser and readStream are not specified by DOM; they come from the parser vendor.
      Parser parser = new Parser(args[0]);
      Document doc = parser.readStream(new FileInputStream(args[0]));
      NodeList nodes = doc.getDocumentElement().getChildNodes();
      for (int i = 0; i < nodes.getLength(); i++) {
        Element n = (Element) nodes.item(i); // coercion: fails if the child is not an element
        // select Math depts
      }
    }
  }
Selecting the math departments
  // n is the current employee element
  NodeList ndl = n.getChildNodes();
  for (int idl = 0; idl < ndl.getLength(); idl++) {
    Node c = ndl.item(idl);
    if (c.getNodeName().equals("dept")
        && ((CharacterData) c.getFirstChild()).getData().trim().equals("Math")) { // coercion
      // inner code
      break;
    }
  }
The inner code
  // n is the current employee element
  NodeList nnl = n.getChildNodes();
  for (int ii = 0; ii < nnl.getLength(); ii++) {
    Node nn = nnl.item(ii);
    if (nn.getNodeName().equals("name")) {
      // Preorder search of subtree. Convenient, but is it what we want?
      NodeList ncl = n.getElementsByTagName("tel");
      for (int iii = 0; iii < ncl.getLength(); iii++) {
        Node nc = ncl.item(iii);
        System.out.print(((CharacterData) nn.getFirstChild()).getData());   // coercion
        System.out.println(((CharacterData) nc.getFirstChild()).getData()); // coercion
      }
    }
  }
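Putting the three fragments together gives something like the following. It is a sketch under stated assumptions, not the course's code: it uses the JAXP parser bundled with the JDK in place of the vendor-specific Parser, follows the path contact-info/tel explicitly instead of a preorder search, and the names MathExtractSketch, child, text and namesAndTels are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class MathExtractSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // First direct child element with the given tag, or null if there is none.
    static Element child(Element parent, String tag) {
        for (Node k = parent.getFirstChild(); k != null; k = k.getNextSibling()) {
            if (k.getNodeType() == Node.ELEMENT_NODE && k.getNodeName().equals(tag)) {
                return (Element) k;
            }
        }
        return null;
    }

    // Trimmed text of an element; empty string when the element is missing.
    static String text(Element e) {
        return e == null ? "" : e.getTextContent().trim();
    }

    // One "name tel" row per employee whose <dept> equals the given department.
    static List<String> namesAndTels(String xml, String dept) {
        List<String> rows = new ArrayList<>();
        Element root = parse(xml).getDocumentElement();
        for (Node k = root.getFirstChild(); k != null; k = k.getNextSibling()) {
            if (k.getNodeType() != Node.ELEMENT_NODE || !k.getNodeName().equals("employee")) {
                continue; // no blind coercion: skip text nodes and other elements
            }
            Element emp = (Element) k;
            if (!text(child(emp, "dept")).equals(dept)) continue;
            Element contact = child(emp, "contact-info");
            Element tel = contact == null ? null : child(contact, "tel");
            rows.add(text(child(emp, "name")) + " " + text(tel));
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<doc1><employee><name> John Doe </name><contact-info>"
                + "<tel> 123 7456 </tel></contact-info><dept> Math </dept></employee></doc1>";
        System.out.println(namesAndTels(xml, "Math"));
    }
}
```

Even with helper methods, every missing element has to be anticipated by hand, which is the point the next slide makes about types.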
Comments on our code
• Already quite cumbersome. Compare with an "equivalent" semistructured database query:
    select {name: $N, tel: $T}
    where {name: $N, dept: "Math", contact-info.tel: $T} <- DB
• Code may fail wherever there is a coercion, or give "empty" results.
• Code is already inefficient (double iterations over the same set of nodes).
• Need for types!
Constructing data

From:
  <doc1>
    <employee>
      <name> John Doe </name>
      <contact-info>
        <address> … </address>
        <tel> 123 7456 </tel>
        <email> jd@abc.edu </email>
      </contact-info>
      <dept> Math </dept>
    </employee>
    <employee> … </employee>
    ...
  </doc1>

To:
  <doc2>
    <employee>
      <name> John Doe </name>
      <tel> 123 7456 </tel>
    </employee>
    <employee>
      <name> Jane Dee </name>
      <tel> 234 5678 </tel>
    </employee>
    ...
  </doc2>
Constructing Data using the DOM
  Document d = new DocumentImplementation …
  Element root = d.createElement("doc2");
  d.appendChild(root);                      // set root of document
  // top level loop
  {
    Element emp = d.createElement("employee");
    root.appendChild(emp);
    // innermost loop
    {
      ...
      Element name = d.createElement("name");
      // set s to appropriate character string
      name.appendChild(d.createCDATASection(s));
      emp.appendChild(name);
      ...
    }
    …
  }
• All node constructors are methods of the document implementation.
• Could also use an existing node or a "cloned" node.
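A runnable version of the construction might look like this. It is a minimal sketch, assuming the JAXP factory bundled with the JDK to obtain an empty Document (the slide leaves the document implementation unspecified) and plain text nodes rather than CDATA sections; BuildSketch and build are hypothetical names, and the input rows are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class BuildSketch {
    // Builds <doc2> with one <employee><name/><tel/></employee> per {name, tel} row.
    static Document build(String[][] rows) {
        Document d;
        try {
            d = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a document", e);
        }
        Element root = d.createElement("doc2");
        d.appendChild(root); // set root of document
        for (String[] row : rows) {
            Element emp = d.createElement("employee");
            root.appendChild(emp);

            // Every node is created by the document, then attached to its parent.
            Element name = d.createElement("name");
            name.appendChild(d.createTextNode(row[0]));
            emp.appendChild(name);

            Element tel = d.createElement("tel");
            tel.appendChild(d.createTextNode(row[1]));
            emp.appendChild(tel);
        }
        return d;
    }

    public static void main(String[] args) {
        Document d = build(new String[][] {{"John Doe", "123 7456"}, {"Jane Dee", "234 5678"}});
        System.out.println(d.getElementsByTagName("employee").getLength() + " employees built");
    }
}
```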
A join
The names of employees and their department buildings.

Employees:
  <doc1>
    <employee>
      <name> John Doe </name>
      <contact-info>...</contact-info>
      <dept> Math </dept>
    </employee>
    ...
  </doc1>

Departments:
  <doc2>
    <department>
      <dname> Math </dname>
      <building> A123 </building>
    </department>
    ...
  </doc2>

Result:
  <doc3>
    <row>
      <name> John Doe </name>
      <building> A123 </building>
    </row>
    <row>
      <name> Jane Dee </name>
      <building> B456 </building>
    </row>
    ...
  </doc3>
Implementing a Join in the DOM??
  nl1 = r1.getChildNodes();     // r1 is root of doc1
  for (int i1 = 0; i1 < nl1.getLength(); i1++) {
    nl2 = r2.getChildNodes();   // r2 is root of doc2
    for (int i2 = 0; i2 < nl2.getLength(); i2++) {
      …
    }
  }
• Even if we can get both documents in core (main memory), this nested loop is not the most efficient method.
• And if we cannot?
• This is a typical database query!
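One of the "clever join algorithms" a database would pick here is a hash join: index the smaller input once, then probe it while scanning the other. The sketch below shows that idea over two in-memory DOM trees. It assumes the JAXP parser bundled with the JDK; JoinSketch, text and join are hypothetical names, and the department "CS" in the usage example is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class JoinSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Trimmed text of the first descendant with the given tag, or "" if absent.
    static String text(Element parent, String tag) {
        Node n = parent.getElementsByTagName(tag).item(0);
        return n == null ? "" : n.getTextContent().trim();
    }

    // Joins employees with departments on dept = dname; returns "name building" rows.
    static List<String> join(String employeesXml, String departmentsXml) {
        // Build phase: one pass over the departments.
        Map<String, String> buildingByDept = new HashMap<>();
        NodeList depts = parse(departmentsXml).getElementsByTagName("department");
        for (int i = 0; i < depts.getLength(); i++) {
            Element d = (Element) depts.item(i);
            buildingByDept.put(text(d, "dname"), text(d, "building"));
        }
        // Probe phase: one pass over the employees, constant-time lookup each.
        List<String> rows = new ArrayList<>();
        NodeList emps = parse(employeesXml).getElementsByTagName("employee");
        for (int i = 0; i < emps.getLength(); i++) {
            Element e = (Element) emps.item(i);
            String building = buildingByDept.get(text(e, "dept"));
            if (building != null) rows.add(text(e, "name") + " " + building);
        }
        return rows;
    }

    public static void main(String[] args) {
        String doc1 = "<doc1><employee><name> John Doe </name><dept> Math </dept></employee>"
                + "<employee><name> Jane Dee </name><dept> CS </dept></employee></doc1>";
        String doc2 = "<doc2><department><dname> Math </dname><building> A123 </building></department>"
                + "<department><dname> CS </dname><building> B456 </building></department></doc2>";
        System.out.println(join(doc1, doc2));
    }
}
```

This still requires both documents to fit in memory, so it addresses only the first bullet; the second is where storage and query technology from databases comes in.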
Conclusions so far
• We need types for our XML documents, and we'd like to be able to use them as a static check of our program correctness.
• We need database technology for large "documents" and, perhaps, for simplicity of code.
Before we leave APIs…
SAX, a low-level alternative to DOM
• SAX = Simple API for XML
• Supported by most XML parsers
• Event-driven
• Instead of reading the entire file into memory and building a tree, SAX reads a stream of tokens and triggers events, e.g.
  • startDocument
  • startElement
  • endElement
  • endDocument
• The programmer has to write a document handler that captures these events and does something with the tokens.
• Simpler than DOM, but the programmer must do more storage management.
• Less likely than DOM to fail on large documents!
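A document handler of the kind described can be sketched as follows. This assumes the SAX parser bundled with the JDK (javax.xml.parsers.SAXParserFactory); SaxSketch and tels are hypothetical names. The handler keeps only the text of the tel element currently open, so memory use does not grow with the size of the document apart from the collected results.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxSketch {
    // Collects the text of every <tel> element without building a tree.
    static List<String> tels(String xml) {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a <tel>

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if (qName.equals("tel")) current = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it.
                if (current != null) current.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("tel") && current != null) {
                    out.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tels(
            "<doc1><employee><name>John Doe</name><tel> 123 7456 </tel></employee></doc1>"));
    }
}
```

The state kept in the `current` field is the "storage management" the slide refers to: the handler, not the parser, decides what to remember between events.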
Why Databases Can Help
• Efficiency: querying and storage.
• Simplicity (a consequence of efficiency?). Query languages have a simple "algebraic" structure. Query transformation = optimization.
• (Maybe) suggestions for type systems/constraints.

  select name, bldg
  from emps, depts
  where emps.dname = depts.dname

(Efficiently executed in SQL through use of indexing and clever join algorithms.)
Other things that databases might do for XML
• Concurrency (allow a granularity finer than that of a file)
• Proposals for updates?
• Integrity: keys and referential integrity (XML proposals still weak)
• Data mining
• Compression
The Down-Side
• Query languages are typically rather weak. You can't always express a computation in a QL.
• We may have to tamper with the semantics of XML (sets vs. lists).
• We may have to simplify XML or its attendant "type systems" (DTDs, XML Schema). No bad thing :)