
e-Science e-Business e-Government and their Technologies XML Schema

e-Science e-Business e-Government and their Technologies XML Schema. Bryan Carpenter, Geoffrey Fox, Marlon Pierce Pervasive Technology Laboratories Indiana University Bloomington IN 47404 January 12 2004 dbcarpen@indiana.edu gcf@indiana.edu mpierce@cs.indiana.edu


Presentation Transcript


  1. e-Science e-Business e-Government and their Technologies XML Schema Bryan Carpenter, Geoffrey Fox, Marlon Pierce Pervasive Technology Laboratories Indiana University Bloomington IN 47404 January 12 2004 dbcarpen@indiana.edu gcf@indiana.edu mpierce@cs.indiana.edu http://www.grid2004.org/spring2004

  2. Introduction • We saw that DTDs provide an approach to validating XML documents: ensuring they have the structure expected for a particular application. • With the increasing use of XML for data-centric applications—e.g. XML formats for messages exchanged by Web Services—limitations of DTDs (which were inherited from SGML) soon became apparent. • XML Schema is a more recent validation framework for XML, which attempts to address the shortcomings of DTDs for data-centric applications, for example by providing a much richer set of data types.

  3. Problems with DTDs • DTDs have some clear limitations: • Restricted set of data types: attribute data is either general character data, name tokens, ID or IDREF (or arcane cases); element content is either general character data or nested elements or some mixture. • For data-centric applications, we might want a value to be a well-formed number, date, etc, etc. • DTDs are not convenient for dealing with XML Namespaces—essential for modularity on the Web. • The uniqueness and consistency requirements associated with ID, IDREF are powerful, but could be much more refined. • There are various obscure constraints on element content specifications, needed purely for historical SGML compatibility.

  4. XML Schema • XML Schema addresses all the issues mentioned on the previous slide. • It also has the interesting property that an XML Schema is itself a well-formed XML document; some people consider this a significant advantage. • This is the good news. The less good news is that the XML Schema 1.0 specification is longer by almost an order of magnitude than the basic XML specification, DTDs and all.

  5. General Comparison

  6. Reading Material • The XML Schema Specification itself comes in parts 0, 1, and 2. Parts 1 and 2 are long and tough to read, but part 0 is a reasonable (“non-normative”) introduction: XML Schema Part 0: Primer, May 2001. http://www.w3.org/TR/xmlschema-0/ • There are some good and bad books. A good one is: Definitive XML Schema, Priscilla Walmsley, Prentice Hall, 2002. • There is a comprehensive (but again rather long) tutorial introduction to XML Schema by Roger Costello at: http://www.xfront.com/

  7. “Report” Format Revisited
     • When discussing DTDs we described a simple “report” format. Here is a slightly expanded version of the DTD given there:
         <!DOCTYPE report [
           <!ELEMENT report (title, (paragraph | figure)*, bibliography?) >
           <!ELEMENT title (#PCDATA)>
           <!ELEMENT paragraph (#PCDATA)>
           <!ELEMENT figure EMPTY>
           <!ATTLIST figure source CDATA #REQUIRED >
           <!ELEMENT bibliography (reference)* >
           …
         ] >
     • We begin our detailed discussion of schema by considering how to give an equivalent XML Schema for this document.

  8. Declaring a paragraph Element
     • The report schema is surprisingly long: we will build up to it in several incremental steps. First consider the paragraph element.
     • Using DTD, we declared this element by:
         <!ELEMENT paragraph (#PCDATA)>
     • An equivalent declaration in XML schema might be:
         <xsd:element name="paragraph" type="xsd:string"/>
     • xsd:element is itself an element in the XML Schema namespace; this example assumes we use xsd as the prefix for that namespace.
     • xsd:string is a predefined type in that namespace.

  9. xsd:string Primitive Type • XML Schema has a complex system of types. Different types may describe: • the allowed values of attributes, • the allowed content of elements, or • the allowed content and the allowed attributes of elements. • There is a subset of types, called the simple types, that can be used in either of the first two roles. • One of the simplest of all is string. Used as an attribute type, this is equivalent to the DTD type CDATA; used as an element type, this is equivalent to the DTD content specification (#PCDATA).

  10. Declaring a report Element
     • We initially simplify to a schema in which a report consists only of a series of paragraphs. In DTD a possible declaration of the root element would be:
         <!ELEMENT report (paragraph)*>
     • An equivalent declaration in XML schema might be:
         <xsd:element name="report">
           <xsd:complexType>
             <xsd:sequence>
               <xsd:element ref="paragraph"
                            minOccurs="0" maxOccurs="unbounded"/>
             </xsd:sequence>
           </xsd:complexType>
         </xsd:element>

  11. Elements with Complex Type • This rather verbose declaration says: • The element named report has a complex type. • The content associated with this complex type is a sequence of elements. • This sequence consists of at least 0 and at most an unbounded number of occurrences of paragraph elements. • Here the xsd:element element has different roles: • Outermost xsd:element declares the element named report. • Innermost xsd:element uses the element named paragraph, declared elsewhere. • The role is determined by the presence or absence of the ref attribute.

  12. Local Declarations
     • In fact xsd:element can have a third role, which is considered to be a combined declaration and use, e.g.:
         <xsd:element name="report">
           <xsd:complexType>
             <xsd:sequence>
               <xsd:element name="paragraph" type="xsd:string"
                            minOccurs="0" maxOccurs="unbounded"/>
             </xsd:sequence>
           </xsd:complexType>
         </xsd:element>
     • Here the report element has its own local declaration of paragraph; no separate global declaration is necessary.

  13. Global vs Local Element Declarations • Declarations that occur as children of the top-level schema element are global declarations. • These are the only declarations that can actually be “used” from elsewhere. • “Local declarations”—like the one illustrated on the previous slide—are “used” exactly once at their point of declaration. • This is different from the concept of local declarations in most programming languages. • Local element declarations interact with namespaces in a non-obvious way: perhaps best avoided until you are sure you know what you are doing.

  14. Global vs Local Type Definitions • The type of the report element was specified by an xsd:complexType element nested within the element declaration. • The type of the paragraph element was specified by a type attribute on the declaration, referencing a named type. • In fact types, like elements, can always be defined locally where they are used, or defined globally, then referenced from a point of use. • The following slide illustrates yet another way to declare report.

  15. Named Type Definitions
     • In this version we introduce a named complex type called reportType, then declare the report element with this type:
         <xsd:complexType name="reportType">
           <xsd:sequence>
             <xsd:element ref="paragraph"
                          minOccurs="0" maxOccurs="unbounded"/>
           </xsd:sequence>
         </xsd:complexType>

         <xsd:element name="report" type="reportType"/>
     • This abstraction facility (introducing new named types) is a central theme of XML Schema.

  16. A Complete XML Schema
         <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                     targetNamespace="http://www.grid2004.org/ns/report1"
                     xmlns="http://www.grid2004.org/ns/report1">
           <xsd:element name="report">
             <xsd:complexType>
               <xsd:sequence>
                 <xsd:element ref="paragraph"
                              minOccurs="0" maxOccurs="unbounded"/>
               </xsd:sequence>
             </xsd:complexType>
           </xsd:element>
           <xsd:element name="paragraph" type="xsd:string"/>
         </xsd:schema>

  17. Remarks
     • Recall this schema is essentially equivalent to the DTD:
         <!DOCTYPE report [
           <!ELEMENT report (paragraph)* >
           <!ELEMENT paragraph (#PCDATA)>
         ] >
       Clearly the schema has more baggage (or more added value, according to your point of view!)
     • Our schema declares two element names, report and paragraph, and puts them in a namespace called http://www.grid2004.org/ns/report1.

  18. Namespace Considerations • The root element of any schema is a schema element from the http://www.w3.org/2001/XMLSchema namespace. • The targetNamespace attribute on this element specifies which namespace the elements declared here “go into”. • We have seen the other namespace attributes before: • The xmlns:xsd attribute associates the prefix xsd with the XML Schema namespace. • The xmlns attribute makes the default namespace http://www.grid2004.org/ns/report1 for this document. • Often one uses xsd as the prefix for schema elements, and makes the target namespace the default namespace of the schema document, but neither is essential.

  19. An XML Instance Document
         <?xml version="1.0"?>
         <report xmlns="http://www.grid2004.org/ns/report1"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://www.grid2004.org/ns/report1 report1.xsd">
           <paragraph>Recently uncovered documents prove... </paragraph>
           <paragraph>The author is grateful to W3C for making this research possible.</paragraph>
         </report>

  20. Namespace Considerations • Assuming the document vocabulary belongs to a namespace, we must declare this namespace. • In this example http://www.grid2004.org/ns/report1 is declared as the default namespace. • If the instance document is to be validated against a schema, we must normally define where the schema for the namespace is located. • This is done here by putting an attribute schemaLocation on the root element of the document. • This attribute is itself defined in a standard namespace, called http://www.w3.org/2001/XMLSchema-instance. So we must introduce a prefix for this (xsi is traditional).

  21. schemaLocation • The value of the schemaLocation attribute should be a pair of IRIs: a namespace name and the corresponding Schema URI. • If the document uses more than one namespace, the value can be several consecutive pairs. • All tokens are separated by white space. • In this example the schema should be in the file report1.xsd in the same directory as the instance document.

  22. Schema Validation Using dom.Writer
     • If I save the instance document in a file called “xsdreport1.xml”, and the schema in a file called “report1.xsd”, I can validate the file with the Xerces parser by using the dom.Writer sample application as follows:
         > java dom.Writer -v -s -f xsdreport1.xml
     • If validation is successful, this simply prints a formatted version of the input file. If schema validation fails, you will see error messages early in the output.
     • The -v and -s flags are needed here. Without -s the parser will try to do just DTD validation. -f means “full” schema validation, presumably a good thing.

  23. Schema Validation from Java • Unfortunately it doesn’t seem to be possible to enable XML Schema validation in Xerces using the “vendor-neutral” JAXP API. • The DOM Level 3 API will enable this, but it is not finalized or fully deployed at the time of this writing. • For now you must directly use the “proprietary” org.apache.xerces.parsers.DOMParser Xerces implementation class. • Use is sketched on the next slide.

  24. The Xerces DOMParser API
         import org.apache.xerces.parsers.DOMParser;
         import org.w3c.dom.*;
         …
         static final String VALIDATION_FEATURE_ID =
             "http://xml.org/sax/features/validation";
         static final String SCHEMA_VALIDATION_FEATURE_ID =
             "http://apache.org/xml/features/validation/schema";
         static final String SCHEMA_FULL_CHECKING_FEATURE_ID =
             "http://apache.org/xml/features/validation/schema-full-checking";
         …
         DOMParser parser = new DOMParser();
         // Turn Schema Validation on
         parser.setFeature(VALIDATION_FEATURE_ID, true);
         parser.setFeature(SCHEMA_VALIDATION_FEATURE_ID, true);
         parser.setFeature(SCHEMA_FULL_CHECKING_FEATURE_ID, true);
         parser.setErrorHandler(new MyErrorHandler());
         parser.parse(uri);   // uri is the XML instance file
         Document document = parser.getDocument();
         …
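The slides above use the Xerces-specific parser class. As a complementary sketch that is not part of the original slides, the same check can probably be expressed with the `javax.xml.validation` package that later JDKs include (Java 8 or later is assumed here). The class name `Report1Validation` and the inlined strings are illustrative assumptions; the schema text is the one from slide 16, and the `xsi:schemaLocation` hint is omitted because the schema is handed to the validator directly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class Report1Validation {
    public static final String NS = "http://www.grid2004.org/ns/report1";

    // The schema from slide 16, inlined so the example needs no files.
    public static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='" + NS + "' xmlns='" + NS + "'>"
        + "<xsd:element name='report'><xsd:complexType><xsd:sequence>"
        + "<xsd:element ref='paragraph' minOccurs='0' maxOccurs='unbounded'/>"
        + "</xsd:sequence></xsd:complexType></xsd:element>"
        + "<xsd:element name='paragraph' type='xsd:string'/>"
        + "</xsd:schema>";

    // Returns true if the XML text is well-formed and valid against XSD.
    public static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default error handler throws on validation errors
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<report xmlns='" + NS + "'><paragraph>Text</paragraph></report>";
        String bad = "<report xmlns='" + NS + "'><figure/></report>";
        System.out.println("good document valid? " + isValid(good));
        System.out.println("bad document valid?  " + isValid(bad));
    }
}
```

The second document fails because this first version of the schema declares no figure element in the target namespace.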

  25. More on Complex Types
     • If an element may have nested elements, or if it may have attributes, it must be described by a complex type.
     • If neither of these conditions holds (the element has only character data content and no attributes) it is usually more convenient to use a simple type.
     • Attributes on complex types are specified by an attribute element, e.g.:
         <xsd:element name="figure">
           <xsd:complexType>
             <xsd:attribute name="source" type="xsd:string"/>
           </xsd:complexType>
         </xsd:element>

  26. Attribute Declarations • Like element declarations, attributes may be declared globally, then used inside a complex type declaration, through an xsd:attribute element with a ref attribute. • In contrast to the situation with elements, local declaration of attributes is often a natural choice. • The figure example above has a complex type with no content. In general attribute specifications go after the content specification, in the body of the xsd:complexType element.

  27. Element Sequences and Choices • To finish this introductory foray into XML Schema, we restore our report element to its original specification. The XML Schema declaration is given on the next slide. • Recall this is supposed to be equivalent to the DTD declaration: <!ELEMENT report (title, (paragraph | figure)*, bibliography?) > • The use of the xsd:sequence and xsd:choice elements should be reasonably self-explanatory. • Note how the minOccurs and maxOccurs attributes replace use of the * and ? operators: both attributes have a default value of 1.

  28. Original report Element Structure
         <xsd:element name="report">
           <xsd:complexType>
             <xsd:sequence>
               <xsd:element ref="title"/>
               <xsd:choice minOccurs="0" maxOccurs="unbounded">
                 <xsd:element ref="paragraph"/>
                 <xsd:element ref="figure"/>
               </xsd:choice>
               <xsd:element ref="bibliography" minOccurs="0"/>
             </xsd:sequence>
           </xsd:complexType>
         </xsd:element>
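To see how the sequence and choice on this slide behave, here is a minimal Java sketch using the standard `javax.xml.validation` API (an assumption: the slides themselves use Xerces directly). The class name `ReportStructureCheck` is hypothetical, no target namespace is used, and the global declarations of title, paragraph and bibliography as plain strings are placeholders invented for the example, since the slides do not show them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class ReportStructureCheck {
    public static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        // The report declaration from slide 28.
        + "<xsd:element name='report'><xsd:complexType><xsd:sequence>"
        + "<xsd:element ref='title'/>"
        + "<xsd:choice minOccurs='0' maxOccurs='unbounded'>"
        + "<xsd:element ref='paragraph'/><xsd:element ref='figure'/>"
        + "</xsd:choice>"
        + "<xsd:element ref='bibliography' minOccurs='0'/>"
        + "</xsd:sequence></xsd:complexType></xsd:element>"
        // Placeholder declarations (assumptions made for this sketch).
        + "<xsd:element name='title' type='xsd:string'/>"
        + "<xsd:element name='paragraph' type='xsd:string'/>"
        + "<xsd:element name='bibliography' type='xsd:string'/>"
        // The figure declaration from slide 25.
        + "<xsd:element name='figure'><xsd:complexType>"
        + "<xsd:attribute name='source' type='xsd:string'/>"
        + "</xsd:complexType></xsd:element>"
        + "</xsd:schema>";

    public static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Title first, then any mix of paragraphs and figures.
        System.out.println(isValid(
            "<report><title>T</title><paragraph>p</paragraph>"
          + "<figure source='a.png'/><paragraph>q</paragraph></report>"));
        // A missing title violates the sequence.
        System.out.println(isValid("<report><paragraph>p</paragraph></report>"));
    }
}
```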

  29. Simple Types:Schema Datatypes

  30. XML Schema Simple Types • Recall simple types can be used to describe the values of attributes, or the content of elements that have no nested elements (“character data” content). • So far we only illustrated one simple type built in to XML Schema: namely string. • As an attribute type this is similar to the DTD attribute type CDATA; as an element type, it is similar to the DTD content specification (#PCDATA). • Most of the details of simple types are defined in the W3C Recommendation XML Schema Part 2: Datatypes.

  31. Built In and User-Defined Types • XML Schema provides over 40 built in simple types. • It also provides flexible mechanisms for creating your own simple types, which may in fact impose rather complex patterns on text content.

  32. Schema Built In Types

  33. Built In Simple Types

  34. Built In Simple Types (continued)

  35. Built In Simple Types (continued)

  36. Built In Simple Types (continued)

  37. Built In Simple Types (continued)

  38. Built In Simple Types (continued)

  39. Creating New Simple Types • There are three basic approaches to building new simple types (deriving simple types): • Restricting facets of an existing simple type. • Creating a list type from an existing simple type. • Creating a union type from some existing simple types. • The most sophisticated mechanism is the first—restriction using facets.

  40. Facets • The 19 primitive types (the built in types derived directly from anySimpleType) have a set of constraining facets restricting allowed values. • The constraining facets of a simple type are a subset of: • length, minLength, maxLength, pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minExclusive, minInclusive, totalDigits, fractionDigits • Restricted types have all the facets of their base types—though values of the facets may be different. • There is no way for schema writers to introduce new facets—users cannot directly restrict anySimpleType. • Technically simple types have additional fundamental facets, but values of these flags cannot be set directly. They are: equal, ordered, bounded, cardinality, numeric

  41. Restriction
     • Here is a characteristic example of restriction:
         <xsd:simpleType name="singleDigit">
           <xsd:restriction base="xsd:integer">
             <xsd:minInclusive value="-9"/>
             <xsd:maxInclusive value="9"/>
           </xsd:restriction>
         </xsd:simpleType>
     • This starts from the built in xsd:integer, and defines a derived type singleDigit by setting the facet minInclusive to -9 and the facet maxInclusive to 9.
     • Thus the type singleDigit represents a whole number between -9 and +9 inclusive.
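The effect of the two facets can be checked with a short Java sketch. It assumes the standard `javax.xml.validation` API rather than the Xerces class used earlier in the slides, and it wraps the singleDigit type in a hypothetical element named digit (with no namespace) so that there is something to validate; the class name `SingleDigitCheck` is also made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class SingleDigitCheck {
    public static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:simpleType name='singleDigit'>"
        + "<xsd:restriction base='xsd:integer'>"
        + "<xsd:minInclusive value='-9'/><xsd:maxInclusive value='9'/>"
        + "</xsd:restriction></xsd:simpleType>"
        // Assumed wrapper element so the type can be exercised.
        + "<xsd:element name='digit' type='singleDigit'/>"
        + "</xsd:schema>";

    // True if <digit>text</digit> is valid against the schema above.
    public static boolean accepts(String text) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        String xml = "<digit>" + text + "</digit>";
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"7", "-9", "10", "3.5"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Values such as 10 fail the maxInclusive facet, while 3.5 is rejected earlier because it is not a legal xsd:integer at all.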

  42. Length
     • The facets length, minLength, maxLength allow you to constrain the length of an item like a string (they also allow you to constrain the number of items in a list type, see later).
     • Values of length, minLength, maxLength should be non-negative integers. Example:
         <xsd:simpleType name="state">
           <xsd:restriction base="xsd:string">
             <xsd:length value="2"/>
           </xsd:restriction>
         </xsd:simpleType>
       defines a type state representing strings containing exactly two characters.
     • These facets are supported by all primitive types other than numeric and date- and time-related types. They are also supported by list types.
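A small Java sketch of the length facet in action, again assuming the standard `javax.xml.validation` API. The class name `StateLengthCheck` and the wrapper element named state are assumptions introduced only to exercise the type from the slide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class StateLengthCheck {
    public static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:simpleType name='state'>"
        + "<xsd:restriction base='xsd:string'><xsd:length value='2'/>"
        + "</xsd:restriction></xsd:simpleType>"
        // Assumed wrapper element so the type can be exercised.
        + "<xsd:element name='state' type='state'/>"
        + "</xsd:schema>";

    // True if <state>text</state> is valid against the schema above.
    public static boolean accepts(String text) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        String xml = "<state>" + text + "</state>";
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("IN      -> " + accepts("IN"));
        System.out.println("Indiana -> " + accepts("Indiana"));
    }
}
```

Note that the facet only counts characters: any two-character string is accepted, whether or not it names a real state.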

  43. Pattern
     • Perhaps the most powerful facet is pattern, which allows you to specify a regular expression: any allowed value must satisfy the pattern of this expression.
     • Example:
         <xsd:simpleType name="weekday">
           <xsd:restriction base="xsd:string">
             <xsd:pattern value="(Mon|Tues|Wednes|Thurs|Fri)day"/>
           </xsd:restriction>
         </xsd:simpleType>
       defines a type weekday representing the names of the week days.
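The weekday type can be tried out with the following Java sketch, which assumes the standard `javax.xml.validation` API and a hypothetical wrapper element named day (the class name `WeekdayPatternCheck` is likewise invented). It also illustrates that a pattern has to match the whole value, not just part of it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class WeekdayPatternCheck {
    public static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:simpleType name='weekday'>"
        + "<xsd:restriction base='xsd:string'>"
        + "<xsd:pattern value='(Mon|Tues|Wednes|Thurs|Fri)day'/>"
        + "</xsd:restriction></xsd:simpleType>"
        // Assumed wrapper element so the type can be exercised.
        + "<xsd:element name='day' type='weekday'/>"
        + "</xsd:schema>";

    // True if <day>text</day> is valid against the schema above.
    public static boolean accepts(String text) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        String xml = "<day>" + text + "</day>";
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Monday", "Sunday", "Mondays"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

"Mondays" is rejected even though it contains a match, because the expression is implicitly anchored at both ends of the value.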

  44. Regular Expressions • XML Schema has its own notation for regular expressions, but very much based on the corresponding Perl notation. • For the most part Schema uses a subset of the Perl 5 grammar for regular expressions. • Includes most of the purely “declarative” features from Perl regular expressions, but omits many “procedural” features related to search, matching algorithm, substitution, etc. • XML Schema adds a few features of its own, e.g.: • Matching characters legal in XML names. • Character class subtraction. • Inherits general XML escape mechanisms for Unicode characters, replacing analogous Perl mechanisms.

  45. Metacharacters • The following characters, called metacharacters, have special roles in Schema regular expressions: . \ ? * + | { } ( ) [ ] • Like Perl, but treats }, ] uniformly as metacharacters, and omits search-related metacharacters ^ and $. • To match these characters literally in patterns, must escape them with \, e.g.: • The pattern “2\+2” matches the string “2+2”. • The pattern “f\(x\)” matches the string “f(x)”.

  46. Escape Sequences • In general one should use XML character references to include hard-to-type characters. But for convenience Schema regular expressions allow: • \n matches a newline character (same as &#xA;) • \r matches a carriage return character (same as &#xD;) • \t matches a tab character (same as &#x9;) • All other escape sequences (except \- and \^, used only in character class expressions) match any single character out of some set of possible values. • For example \d matches any decimal digit, so the pattern “Boeing \d\d\d” matches the strings “Boeing 747”, “Boeing 777”, etc.
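The escaping rules from the last two slides can be tried out with a sketch like the one below. It is written against the standard `javax.xml.validation` API (an assumption; the slides use Xerces directly), and the class `EscapeDemo`, its `matches` method, and the element name v are hypothetical. The method builds a throwaway schema around a pattern and validates a single value against it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class EscapeDemo {
    // Escape the characters that would break the XML text built below.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("'", "&apos;");
    }

    // True if the whole value matches the Schema regular expression.
    public static boolean matches(String pattern, String value) {
        String xsd =
              "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
            + "<xsd:element name='v'><xsd:simpleType>"
            + "<xsd:restriction base='xsd:string'>"
            + "<xsd:pattern value='" + escape(pattern) + "'/>"
            + "</xsd:restriction></xsd:simpleType></xsd:element>"
            + "</xsd:schema>";
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("bad pattern: " + pattern, e);
        }
        String xml = "<v>" + escape(value) + "</v>";
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Java string literals need a second backslash: "2\\+2" is the pattern 2\+2.
        System.out.println(matches("2\\+2", "2+2"));       // escaped +, literal match
        System.out.println(matches("2+2", "2+2"));         // unescaped + means "one or more 2s"
        System.out.println(matches("f\\(x\\)", "f(x)"));
        System.out.println(matches("Boeing \\d\\d\\d", "Boeing 747"));
    }
}
```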

  47. Multicharacter Escapes • The simplest patterns matching classes of characters are: • . matches any character except carriage return or newline. • \d matches any decimal digit. • \s matches any white space character. • \i matches any character that can start an XML name. • \c matches any character that can appear in an XML name. • \w matches any “word” character (excludes punctuation, etc.) The escapes \D, \S, \I, \C and \W are negative forms, e.g. \D matches any character except a decimal digit. • Similar to Perl, except: Perl doesn’t have \i, \I; Perl uses \c, \C for other things; detailed definitions of \w, \W are different.

  48. Category Escapes • A large and interesting family of escapes is based on the Unicode standard. General form in Perl or Schema is \p{Name} where Name is a Unicode-defined class name. • The negative form \P{Name} matches any character not in the class. • Simple examples include: \p{L} (any letter), \p{Lu} (upper case letters), \p{Ll} (lower case letters), etc. • More interesting cases are based on the Unicode block names for alphabets, e.g.: • \p{IsBasicLatin}, \p{IsLatin-1Supplement}, \p{IsGreek}, \p{IsArabic}, \p{IsDevanagari}, \p{IsHangulJamo}, \p{IsCJKUnifiedIdeographs}, etc, etc, etc.
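As an illustration of the multicharacter and category escapes, the following Java sketch checks a few values against the patterns `\i\c*` (roughly, an XML name) and `\p{Lu}\p{Ll}*` (a capitalized word). It assumes the standard `javax.xml.validation` API; the class `NameAndCategoryEscapes`, its `matches` method, and the element name v are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class NameAndCategoryEscapes {
    // Escape the characters that would break the XML text built below.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("'", "&apos;");
    }

    // True if the whole value matches the Schema regular expression.
    public static boolean matches(String pattern, String value) {
        String xsd =
              "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
            + "<xsd:element name='v'><xsd:simpleType>"
            + "<xsd:restriction base='xsd:string'>"
            + "<xsd:pattern value='" + escape(pattern) + "'/>"
            + "</xsd:restriction></xsd:simpleType></xsd:element>"
            + "</xsd:schema>";
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("bad pattern: " + pattern, e);
        }
        String xml = "<v>" + escape(value) + "</v>";
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xmlName = "\\i\\c*";             // name-start character, then name characters
        String capitalized = "\\p{Lu}\\p{Ll}*"; // one upper case letter, then lower case letters
        System.out.println(matches(xmlName, "xsd:element"));
        System.out.println(matches(xmlName, "9lives"));
        System.out.println(matches(capitalized, "Boeing"));
        System.out.println(matches(capitalized, "boeing"));
    }
}
```

Because the category escapes are defined in terms of Unicode, the second pattern is not limited to ASCII letters.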

  49. Character Class Expressions • Allow you to define terms that match any character from a custom set of characters. Basic syntax is familiar from Perl and UNIX: [List-of-characters] or the negative form: [^List-of-characters] Here List-of-characters can include individual characters, and also ranges of the form First-Last where First and Last are characters. • Examples: • [RGB] matches one of R, G, or B. • [0-9A-F] or [\dA-F] match one of 0, 1, …, 9, A, B, …, F. • [^\r\n] matches anything except CR, NL (same as .).

  50. Class Subtractions • A feature of XML Schema, not present in Perl 5. A class character expression can take the form: [List-of-characters-Class-char-expr] or: [^List-of-characters-Class-char-expr] where Class-char-expr is another class character expression. • Example: • [a-zA-Z-[aeiouAEIOU]] matches any consonant in the Latin alphabet.
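The consonant example can be checked with the sketch below. Note that `java.util.regex` does not accept this subtraction syntax, so the sketch routes the pattern through a schema validator instead, assuming the standard `javax.xml.validation` API. The class `ClassSubtractionDemo`, its `matches` method, and the element name v are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not from the slides.
public class ClassSubtractionDemo {
    // The pattern from the slide: Latin letters minus the vowels.
    public static final String CONSONANT = "[a-zA-Z-[aeiouAEIOU]]";

    // True if the whole value matches the Schema regular expression.
    // The pattern and value are assumed to contain no XML special characters.
    public static boolean matches(String pattern, String value) {
        String xsd =
              "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
            + "<xsd:element name='v'><xsd:simpleType>"
            + "<xsd:restriction base='xsd:string'>"
            + "<xsd:pattern value='" + pattern + "'/>"
            + "</xsd:restriction></xsd:simpleType></xsd:element>"
            + "</xsd:schema>";
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("bad pattern: " + pattern, e);
        }
        String xml = "<v>" + value + "</v>";
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "Z", "e", "bc"}) {
            System.out.println(s + " -> " + matches(CONSONANT, s));
        }
        // A quantifier applies to the whole subtracted class.
        System.out.println("rhythm -> " + matches(CONSONANT + "+", "rhythm"));
    }
}
```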
