TDX: a High-Performance Table-Driven XML Parser Wei Zhang Robert van Engelen Department of Computer Science Florida State University
Outline • Motivation • Introduction • Recent Work • Table-Driven XML Parsing – TDX • TDX Construction Toolkit • Results and Preliminary Conclusion
Motivation • Enhance performance for XML-based Web Services • Provide flexibility • Offer high-level modularity
Roadmap • Motivation • Introduction • Recent Work • Table-Driven XML parsing – TDX • TDX construction Tool Kit • Experiment Results and Preliminary Conclusion
Introduction • Validating XML parsing • Three stages • Well-formedness • Validation • Data conversion • Separating the stages introduces overhead and requires frequent access to the schema [Figure: XML input passes through well-formedness, validation, and data conversion stages before reaching the application]
Introduction (cont’d) • Schema-specific XML parsing (SSP) • Merges well-formedness checking and validation • No need for frequent access to the schema • Data conversion remains a separate stage in implemented SSPs [Figure: well-formedness and validation merged into one stage, followed by data conversion]
Recent Work • Chiu: “A compiler-based approach to schema-specific XML parsing” • Merges parsing and validation by constructing a PDA • No namespace support • Conversion from NFA to DFA may result in exponentially growing space requirements
Recent Work (cont'd) • van Engelen: “Constructing finite automata for high-performance web services” • Integrates parsing and validation into one stage, with parsing actions encoded by a DFA • Cannot process cyclic XML schemas
Recent Work (cont'd) • van Engelen: “The gSOAP toolkit for web services and peer-to-peer computing networks” • Namespace support • Merges parsing and validation • Implements recursive-descent parsing • Disadvantages of recursive descent • Code size and function call overhead
Table-Driven XML Parsing (TDX) • An LL(1) grammar can be derived from the schema • XML documents can be parsed and validated using the LL(1) grammar • Well-formedness (parsing) can be verified through grammar rules • Validation can be accomplished using semantic actions • Application-specific events can also be encoded as semantic actions
Illustrating Example
<schema>
  <element name="book" type="bookType"/>
  <complexType name="bookType">
    <sequence>
      <element name="title" type="string"/>
      <element name="author" type="string"/>
    </sequence>
  </complexType>
</schema>
LL(1) Grammar:
s → '<book>' t '</book>'
t → t1 t2
t1 → '<title>' DATA //imp_s(s.val) '</title>'
t2 → '<author>' DATA //imp_s(s.val) '</author>'
Illustrating Example (cont'd)
(a) An XML instance:
<book>
  <title> XML Tech </title>
  <author> Bob </author>
</book>
(b) Predictive parsing (parse tree):
s → '<book>' t '</book>'
t → t1 t2
t1 → '<title>' DATA '</title>', triggering imp_s("XML Tech")
t2 → '<author>' DATA '</author>', triggering imp_s("Bob")
Roadmap • Recent Work • Table-Driven XML parsing – TDX • Illustrating example • Architecture • Token generation • Mapping schema to LL(1) • Parsing table • Parsing engine • Scanner/tokenizer • TDX construction Tool Kit • Experiment Results and Preliminary Conclusion
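TDX - Architecture [Figure: <XML> input feeds the Scanner/Tokenizer (DFA), which passes tokens and CDATA to the Parsing Engine (TDX); the engine emits events to the application or reports "Error: invalid". Modules loaded into the engine: tokens, LL(1) parsing table, LL(1) grammar productions and actions]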
Token Generation • Defined by • <namespace, tag> • Element name (opening and closing) • Attribute name • Some data types • Such as enumeration • Namespace binding • Identical tag names under different namespaces are represented as different tokens • Normalized tokens
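The token definition above can be sketched as follows. This is a minimal illustration, assuming that "normalized" means each distinct <namespace, tag> pair is assigned one small integer; `TokenTable` and `intern` are hypothetical names and are not taken from TDX.

```python
class TokenTable:
    """Assigns one small integer to each distinct (namespace, tag) pair."""

    def __init__(self):
        self._ids = {}

    def intern(self, namespace, tag):
        # Reuse the id if the pair is known, otherwise allocate the next one.
        key = (namespace, tag)
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]


tokens = TokenTable()
# Same tag name under two namespaces yields two different tokens.
x_title = tokens.intern("http://www.x.org", "title")
y_title = tokens.intern("http://www.y.org", "title")
# Opening and closing tags are separate tokens as well.
x_title_close = tokens.intern("http://www.x.org", "/title")
```

With integer tokens, a parsing table can be indexed directly instead of comparing strings at parse time.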
Mapping Schema to LL(1) Grammar • Structural constraints are mapped to rules • Validation constraints are mapped to semantic actions • Note that many types of validation constraints can instead be mapped to rules • Such as occurrence and enumeration constraints
Mapping Example (1)
<simpleType name="state">
  <restriction base="string">
    <enumeration value="OFF"/>
    <enumeration value="ON"/>
  </restriction>
</simpleType>
state → "OFF" | "ON"
<simpleType name="value">
  <restriction base="integer">
    <minInclusive value="10"/>
    <maxInclusive value="250"/>
  </restriction>
</simpleType>
value → DATA //imp_i(char *s)
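The slide gives only the name of the action imp_i, not its body. The following is a hypothetical Python stand-in, assuming the action converts the character data to an integer and enforces the minInclusive and maxInclusive facets; `make_int_action` is an invented helper.

```python
def make_int_action(min_inclusive, max_inclusive):
    """Build a semantic action that converts DATA and checks its range."""

    def imp_i(text):
        value = int(text.strip())  # raises ValueError for non-integer data
        if not (min_inclusive <= value <= max_inclusive):
            raise ValueError(
                f"{value} is outside [{min_inclusive}, {max_inclusive}]"
            )
        return value

    return imp_i


# Facets taken from the "value" simple type on the slide.
imp_i = make_int_action(10, 250)
```

Conversion and the value check happen in the same call, which is how an action can fold data conversion into the parsing pass.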
Mapping Example (2)
<complexType name="example">
  <choice>
    <element name="id" type="id_type" minOccurs="0"/>
    <element name="value" type="value_type" minOccurs="2" maxOccurs="unbounded"/>
  </choice>
</complexType>
example → c1 | c2
c1 → '<id>' id_type '</id>'
c1 → ε
c'2 → '<value>' value_type '</value>'
c2 → c'2 c'2 c''2
c''2 → c'2 c''2
c''2 → ε
With <sequence> in place of <choice>: example → c1 c2
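The occurrence mapping in this example can be generated mechanically. The sketch below is one possible expansion, not the wsdl2TDX algorithm: `expand_occurrence` and the `_more` / `_optN` nonterminal names are assumptions made for illustration.

```python
EPSILON = ()  # the empty right-hand side


def expand_occurrence(name, item, min_occurs, max_occurs=None):
    """Return {nonterminal: [rhs, ...]} for `item` repeated min..max times.

    max_occurs=None stands for maxOccurs="unbounded".
    """
    rules = {}
    if max_occurs is None:
        tail = name + "_more"
        # name -> item ... item tail ; tail -> item tail | epsilon
        rules[name] = [(item,) * min_occurs + (tail,)]
        rules[tail] = [(item, tail), EPSILON]
        return rules
    optional = max_occurs - min_occurs
    if optional == 0:
        rules[name] = [(item,) * min_occurs]
        return rules
    # Chain of optional occurrences: opt_k -> item opt_{k+1} | epsilon
    rules[name] = [(item,) * min_occurs + (f"{name}_opt1",)]
    for k in range(1, optional + 1):
        rhs = (item,) if k == optional else (item, f"{name}_opt{k + 1}")
        rules[f"{name}_opt{k}"] = [rhs, EPSILON]
    return rules


# minOccurs="2" maxOccurs="unbounded", as for the "value" element.
value_rules = expand_occurrence("c2", "value", 2)
```

Here `value_rules` has the same shape as the slide's rules for c2: two mandatory occurrences followed by a right-recursive tail that can derive the empty string.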
LL(1) Parsing Table • Constructed from LL(1) grammar • Indexed by nonterminals and terminals • Contains either index of grammar production or error entry
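As a sketch of how such a table can be built, the code below applies the standard FIRST/FOLLOW construction to the book grammar from the illustrating example. The data layout (a dict keyed by nonterminal and terminal, with missing keys acting as error entries) is an assumption, not the layout TDX uses.

```python
EPS, END = "ε", "$"

GRAMMAR = [  # production index -> (lhs, rhs)
    ("s", ["<book>", "t", "</book>"]),
    ("t", ["t1", "t2"]),
    ("t1", ["<title>", "DATA", "</title>"]),
    ("t2", ["<author>", "DATA", "</author>"]),
]


def first_of(symbols, first):
    """FIRST set of a symbol sequence; keys of `first` are the nonterminals."""
    out = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol can vanish, or the sequence is empty
    return out


def build_table(grammar, start):
    first = {lhs: set() for lhs, _ in grammar}
    follow = {lhs: set() for lhs, _ in grammar}
    follow[start].add(END)

    def size():
        return sum(len(s) for s in first.values()) + sum(
            len(s) for s in follow.values()
        )

    while True:  # iterate FIRST and FOLLOW to a fixed point
        before = size()
        for lhs, rhs in grammar:
            first[lhs] |= first_of(rhs, first)
            for i, sym in enumerate(rhs):
                if sym in first:
                    rest = first_of(rhs[i + 1:], first)
                    follow[sym] |= rest - {EPS}
                    if EPS in rest:
                        follow[sym] |= follow[lhs]
        if size() == before:
            break

    table = {}
    for index, (lhs, rhs) in enumerate(grammar):
        rhs_first = first_of(rhs, first)
        lookahead = rhs_first - {EPS}
        if EPS in rhs_first:
            lookahead |= follow[lhs]
        for terminal in lookahead:
            if (lhs, terminal) in table:
                raise ValueError(f"not LL(1): conflict at {(lhs, terminal)}")
            table[(lhs, terminal)] = index
    return table


TABLE = build_table(GRAMMAR, "s")
```

Each entry maps a (nonterminal, lookahead token) pair to a production index; any pair absent from `TABLE` corresponds to an error entry.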
Parsing Engine • Schema Independent • Maintains • Parsing table • Production table • Action table • Stack
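A schema-independent engine of this kind can be sketched as a generic stack-driven loop over the three tables. The tables below are hand-written for the book example; the "#name" convention for action markers and the `parse` signature are assumptions for illustration, not the TDX interface.

```python
# Tables for the book grammar; only these depend on the schema.
PRODUCTIONS = [
    ["<book>", "t", "</book>"],                   # 0: s
    ["t1", "t2"],                                 # 1: t
    ["<title>", "DATA", "#imp_s", "</title>"],    # 2: t1
    ["<author>", "DATA", "#imp_s", "</author>"],  # 3: t2
]
NONTERMINALS = {"s", "t", "t1", "t2"}
PARSING_TABLE = {
    ("s", "<book>"): 0,
    ("t", "<title>"): 1,
    ("t1", "<title>"): 2,
    ("t2", "<author>"): 3,
}


def parse(tokens, actions, start="s"):
    """Parse a list of (kind, text) tokens; raise ValueError if invalid."""
    stack = ["$", start]
    stream = iter(list(tokens) + [("$", "")])
    kind, text = next(stream)
    last_text = ""  # text of the most recently matched terminal
    while stack:
        top = stack.pop()
        if top.startswith("#"):  # action marker: call the semantic action
            actions[top[1:]](last_text)
        elif top in NONTERMINALS:
            index = PARSING_TABLE.get((top, kind))
            if index is None:  # error entry
                raise ValueError(f"invalid: unexpected {kind} while parsing {top}")
            stack.extend(reversed(PRODUCTIONS[index]))
        else:  # terminal: must match the lookahead token
            if top != kind:
                raise ValueError(f"invalid: expected {top}, got {kind}")
            last_text = text
            if kind != "$":
                kind, text = next(stream)


DOC = [
    ("<book>", ""), ("<title>", ""), ("DATA", "XML Tech"), ("</title>", ""),
    ("<author>", ""), ("DATA", "Bob"), ("</author>", ""), ("</book>", ""),
]
seen = []
parse(DOC, {"imp_s": seen.append})
```

After the call, `seen` holds the strings passed to the action in document order. Supporting a different schema would only require swapping the three tables; the loop itself stays unchanged.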
Scanner/Tokenizer • Constructed from schema • Schema provides DFA state information • Element name • Has attribute? • Attribute name • Root element needs special care • Schema information
Scanner/Tokenizer Example
<book xmlns:x="http://www.x.org" xmlns:y="http://www.y.org" targetnamespace="http://www.x.org">
  <title>XML Bible</title>
  <author>
    <name> Bob </name>
    <y:title> professor</y:title>
  </author>
</book>
Tokens for <title>XML Bible</title>: <"www.x.org", "title"> DATA <"www.x.org", "/title">
Token for <y:title>: <"www.y.org", "title">
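The namespace resolution in this example can be sketched with a small regex-based tokenizer. This is a stand-in for illustration only: the slides describe a DFA-based scanner, whereas this sketch assumes prefixes are declared before use, treats the slide's `targetnamespace` attribute as the namespace of unprefixed elements, and keeps full namespace URIs rather than the shortened form shown on the slide.

```python
import re

TAG = re.compile(r"<(/?)([\w.-]+:)?([\w.-]+)([^>]*)>|([^<]+)")
ATTR = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def tokenize(xml):
    """Yield (namespace, tag) for tags and ("DATA", text) for character data."""
    prefixes, default_ns = {}, ""
    for match in TAG.finditer(xml):
        slash, prefix, name, attrs, text = match.groups()
        if text is not None:
            if text.strip():  # skip whitespace between tags
                yield ("DATA", text.strip())
            continue
        for key, value in ATTR.findall(attrs):
            if key.startswith("xmlns:"):
                prefixes[key[len("xmlns:"):]] = value
            elif key == "targetnamespace":  # attribute name as on the slide
                default_ns = value
        # `prefix` includes the trailing colon, e.g. "y:".
        ns = prefixes[prefix[:-1]] if prefix else default_ns
        yield (ns, slash + name)


XML_DOC = """<book xmlns:x="http://www.x.org" xmlns:y="http://www.y.org"
      targetnamespace="http://www.x.org">
  <title>XML Bible</title>
  <author>
    <name> Bob </name>
    <y:title> professor</y:title>
  </author>
</book>"""

token_list = list(tokenize(XML_DOC))
```

The two title elements end up as different tokens because their namespaces differ, which is what allows a grammar to tell them apart.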
TDX Construction Toolkit [Figure: wsdl2TDX takes Service.wsdl as input and generates Service_flex.l, Service_TDX.h, and Service_TDX.c; flex then processes Service_flex.l to produce tab.yy.c]
Experiment Setup • Compared with • DFA-based parser • gSOAP 2.7 • eXpat 1.2 • Xerces 2.7.0 • Memory-resident XML message • Elapsed real time measured using gettimeofday()
Conclusion • Enhanced parsing speed • Flexible framework • Encodes value-based validation and application-specific events as semantic rules • Combines structural, syntactic, and semantic constraints in one pass • High level of modularity