Reading Microsoft Word XML files with SAS® Larry Hoyle, Policy Research Institute, University of Kansas
Three Scenarios • Extracting text and attributes • Extracting data from tables • Extracting drawing object parameters
XML - Syntax
• Must begin with this prolog tag: <?xml version="1.0" ?>
• Tags are paired and case sensitive; a document must have exactly 1 root tag
• Tags and their content are called an "element"
• Tags can be qualified by attributes
• Elements can be nested, but must start and end in the same parent
<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag>
  <nestedTag anAttribute="wha"> Other content </nestedTag>
</LarryRootTag>
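The syntax rules above can be checked outside SAS with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the helper name `describe` is hypothetical and is not part of the presentation. It parses the sample document and lists the root tag, the nested elements, and their attributes.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide, unchanged apart from whitespace.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag >
  <nestedTag anAttribute="wha"> Other content </nestedTag >
</LarryRootTag>"""


def describe(xml_text):
    """Return the root tag and (tag, attributes, text) for each child element."""
    root = ET.fromstring(xml_text)
    children = [
        (child.tag, dict(child.attrib), (child.text or "").strip())
        for child in root
    ]
    return root.tag, children


root_tag, children = describe(SAMPLE)
print(root_tag)
for tag, attrs, text in children:
    print(tag, attrs, repr(text))
```

Because tags are case sensitive, a mismatched pair such as `<a></A>` is rejected by the parser with a `ParseError` instead of being silently accepted.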
Word XML • Hierarchy: Body > Section > Paragraph > Run > Text • Paragraphs and runs carry Properties
What Does SAS Need? • SAS XML Engine • Needs XMLMAP file • Can use XML Mapper to generate XMLMAP • Only needs to be generated once for each type of extract
Example Document Styles and Colors Have Meaning I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
Style and Color • Style is “Treated” – a statement about treatment • Color is “Red” - represents negative affect
Example Document as XML I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
• Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
• Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows • The XMLMap has to describe a path that delineates rows: • In this case it’s each text element in a run (in a paragraph…) <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
Columns – the Text • The XMLMap has to describe a path that delineates each column: • The text itself is: <COLUMN name="t"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
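To see what the row path selects, the same path can be evaluated outside SAS. This is only an illustrative sketch in Python, not the SAS XML engine: the sample document is a made-up fragment, the namespace URIs are assumed to be those of Word 2003 WordprocessingML, and `text_rows` is a hypothetical helper. Each `w:t` element found by the path becomes one row with a single column `t`.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 XML namespaces for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Made-up miniature document: two paragraphs, three runs of text.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p><w:r><w:t>What a pleasant experience.</w:t></w:r>
         <w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
    <w:p><w:r><w:t>I love you guys.</w:t></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def text_rows(xml_text):
    """One row per w:t element, in document order."""
    root = ET.fromstring(xml_text)  # root is w:wordDocument
    path = "w:body/wx:sect/w:p/w:r/w:t"  # row path relative to the root
    return [{"t": t.text} for t in root.findall(path, NS)]


rows = text_rows(SAMPLE)
for row in rows:
    print(row)
```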
Columns – the Text Element Number • A sequential number for the text element is: <COLUMN name="tNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
Columns – the Paragraph Number • A sequential number for the paragraph is: <COLUMN name="pNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
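The two counter columns can be imitated in a few lines to show how they behave. The sketch below is Python, not SAS, and rests on assumptions: the namespace URIs are taken to be those of Word 2003 XML, the sample fragment is invented, and `numbered_rows` is a hypothetical helper. `pNum` goes up at the beginning of every `w:p` and `tNum` at the beginning of every `w:t`, so an empty paragraph advances `pNum` without producing a row.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 XML namespaces for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Made-up fragment: the second paragraph is empty on purpose.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p><w:r><w:t>That was very rude treatment.</w:t></w:r>
         <w:r><w:t>What a pleasant experience.</w:t></w:r></w:p>
    <w:p/>
    <w:p><w:r><w:t>I love you guys.</w:t></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def numbered_rows(xml_text):
    """One row per w:t, with running paragraph and text-element counters."""
    root = ET.fromstring(xml_text)
    rows = []
    p_num = 0  # incremented at the start of each paragraph
    t_num = 0  # incremented at the start of each text element
    for p in root.findall("w:body/wx:sect/w:p", NS):
        p_num += 1
        for t in p.findall("w:r/w:t", NS):
            t_num += 1
            rows.append({"pNum": p_num, "tNum": t_num, "t": t.text})
    return rows


rows = numbered_rows(SAMPLE)
for row in rows:
    print(row)
```

In this sample the last row carries `pNum` 3 and `tNum` 3: the gap in `pNum` marks the empty paragraph.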
Columns – Paragraph Color <COLUMN name="PColorVal" retain="YES"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
Columns – Run Color <COLUMN name="RColorVal" retain="YES"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
Columns – Run Style <COLUMN name="RStyleval"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val</PATH> <TYPE>character</TYPE> <DATATYPE>string</DATATYPE> <LENGTH>11</LENGTH> </COLUMN>
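The difference between the retained color columns and the non-retained style column is easiest to see on a small case. The following Python sketch only approximates the behavior and is not the SAS XML engine: the fragment is invented, the namespace URIs and the namespaced `w:val` attribute are assumptions about Word 2003 XML, and `styled_rows` is a hypothetical helper. Retained values are kept until a new value replaces them; the style value is cleared after each row.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 XML namespaces for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}
W_VAL = "{" + NS["w"] + "}val"  # assumed namespaced form of the val attribute

# Made-up fragment: only the first paragraph carries color and style.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r><w:rPr><w:rStyle w:val="Treated"/><w:color w:val="FF0000"/></w:rPr>
           <w:t>That was very rude treatment.</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def styled_rows(xml_text):
    """One row per w:t with retained colors and a per-row run style."""
    root = ET.fromstring(xml_text)
    rows = []
    p_color = None  # imitates retain="YES": kept until replaced
    r_color = None  # imitates retain="YES": kept until replaced
    for p in root.findall("w:body/wx:sect/w:p", NS):
        color = p.find("w:pPr/w:rPr/w:color", NS)
        if color is not None:
            p_color = color.get(W_VAL)
        for r in p.findall("w:r", NS):
            color = r.find("w:rPr/w:color", NS)
            if color is not None:
                r_color = color.get(W_VAL)
            style = r.find("w:rPr/w:rStyle", NS)
            r_style = style.get(W_VAL) if style is not None else None
            for t in r.findall("w:t", NS):
                rows.append({"t": t.text, "PColorVal": p_color,
                             "RColorVal": r_color, "RStyleval": r_style})
                r_style = None  # not retained: cleared after each row
    return rows


rows = styled_rows(SAMPLE)
for row in rows:
    print(row)
```

Note that the second row still shows the first paragraph's colors, because nothing replaced the retained values. Downstream code that relies on retained columns may need to account for this carry-over.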
Our Sample Tables • Read all data from all tables into one dataset • Add variables to indicate table, row, column
The Tables Dataset (screenshot of the dataset; callouts: "Ended first table", "Started third table")
Word XML – Tables • Absolute Path /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t • Relative Path w:tc/w:p/w:r/w:t
Count Table Beginnings • <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> w:tbl</INCREMENT-PATH>
Count Table Endings • <INCREMENT-PATH beginend="END" syntax="XPath"> w:tbl</INCREMENT-PATH>
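The goal of the table extract (all cell text in one dataset, with variables for table, row, and column) can be sketched as follows. This is a Python illustration, not the XMLMAP counters themselves: it numbers tables, rows, and cells directly instead of counting table beginnings and endings. The fragment is invented, the namespace URIs are assumed to be those of Word 2003 XML, and `table_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 XML namespaces for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}


def cell(text):
    """Build the XML for one table cell holding a single run of text."""
    return "<w:tc><w:p><w:r><w:t>" + text + "</w:t></w:r></w:p></w:tc>"


# Made-up fragment: a 2x2 table followed by a 1x2 table.
SAMPLE = (
    '<w:wordDocument'
    ' xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
    ' xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">'
    "<w:body><wx:sect>"
    "<w:tbl><w:tr>" + cell("a") + cell("b") + "</w:tr>"
    "<w:tr>" + cell("c") + cell("d") + "</w:tr></w:tbl>"
    "<w:tbl><w:tr>" + cell("e") + cell("f") + "</w:tr></w:tbl>"
    "</wx:sect></w:body></w:wordDocument>"
)


def table_rows(xml_text):
    """One row per text element inside a cell, tagged with table/row/column."""
    root = ET.fromstring(xml_text)
    rows = []
    tables = root.findall("w:body/wx:sect/w:tbl", NS)
    for tbl_num, tbl in enumerate(tables, start=1):
        for row_num, tr in enumerate(tbl.findall("w:tr", NS), start=1):
            for col_num, tc in enumerate(tr.findall("w:tc", NS), start=1):
                for t in tc.findall("w:p/w:r/w:t", NS):  # the relative path
                    rows.append({"table": tbl_num, "row": row_num,
                                 "column": col_num, "t": t.text})
    return rows


rows = table_rows(SAMPLE)
for row in rows:
    print(row)
```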
Drawing Object Parameters VML – Vector Markup Language • This example reads only lines (they’re the easiest) • Other drawing objects have different XML elements
One Row for Each Line Element <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line </TABLE-PATH>
Columns – Parameters as Attributes <COLUMN name="from"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from </PATH>
The Dataset Trick: "Flip" indicates coordinates are swapped
Example Code in Paper • Convert colors • Parse stroke weight (e.g. 2pt) • Detect the keyword “flip” and flip coordinates
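The paper's code is not reproduced in these slides, so the following is only a hedged Python sketch of the three steps listed above, not the author's SAS code. It makes several assumptions: the attribute names `to`, `strokecolor`, `strokeweight`, and `style` follow VML conventions, colors are given as `#rrggbb`, and the flip keyword appears as `flip:x` or `flip:y` inside `style`. The sample fragment and all helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespaces: Word 2003 XML (w, wx) and VML (v).
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:vml",
}

# Made-up fragment with two lines; the second one is flipped vertically.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:v="urn:schemas-microsoft-com:vml">
  <w:body><wx:sect><w:p><w:r><w:pict><v:group>
    <v:line from="10,20" to="110,70" strokecolor="#ff0000" strokeweight="2pt"/>
    <v:line from="0,0" to="50,40" strokecolor="#0000ff" strokeweight="0.75pt"
            style="flip:y"/>
  </v:group></w:pict></w:r></w:p></wx:sect></w:body>
</w:wordDocument>"""

NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def leading_number(text):
    """Parse the leading number of a value such as '2pt' or '0.75pt'."""
    return float(NUMBER.match(text.strip()).group())


def parse_point(text):
    """Split 'x,y' into two floats, ignoring any unit suffix."""
    x, y = (leading_number(part) for part in text.split(","))
    return x, y


def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple; other color forms are not handled."""
    return tuple(int(color[i:i + 2], 16) for i in (1, 3, 5))


def flip_axes(style):
    """Return the set of axes named after 'flip:' in a CSS-like style string."""
    for item in style.split(";"):
        key, _, value = item.partition(":")
        if key.strip().lower() == "flip":
            return set(value.split())
    return set()


def read_lines(xml_text):
    """One row per v:line element, with flipped coordinates swapped back."""
    root = ET.fromstring(xml_text)
    rows = []
    path = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"
    for line in root.findall(path, NS):
        x1, y1 = parse_point(line.get("from"))
        x2, y2 = parse_point(line.get("to"))
        axes = flip_axes(line.get("style", ""))
        if "x" in axes:
            x1, x2 = x2, x1
        if "y" in axes:
            y1, y2 = y2, y1
        rows.append({"from": (x1, y1), "to": (x2, y2),
                     "rgb": hex_to_rgb(line.get("strokecolor")),
                     "weight": leading_number(line.get("strokeweight"))})
    return rows


rows = read_lines(SAMPLE)
for row in rows:
    print(row)
```

Swapping the endpoints' coordinates on the flipped axis is one plausible reading of "flip coordinates"; the exact rule used in the paper may differ.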
Contact Information Larry Hoyle Policy Research Institute, University of Kansas LarryHoyle@ku.edu http://www.ku.edu/pri/ksdata/sashttp/sugi31