Reading Microsoft Word XML files with SAS®

Larry Hoyle, Policy Research Institute, University of Kansas



  1. Reading Microsoft Word XML files with SAS® Larry Hoyle, Policy Research Institute, University of Kansas

  2. Three Scenarios • Extracting text and attributes • Extracting data from tables • Extracting drawing object parameters

  3. XML - Syntax • Begins with the prolog shown on the first line of the example • Tags are paired and case sensitive, and a document has exactly one root tag • A tag pair and its content are called an "element" • Tags can be qualified by attributes • Elements can be nested, but each must start and end in the same parent
      <?xml version="1.0" ?>
      <LarryRootTag>
        <EmptyTag/>
        <nestedTag> Some content </nestedTag>
        <nestedTag anAttribute="wha"> Other content </nestedTag>
      </LarryRootTag>
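
The syntax rules on this slide can be checked with any XML parser. The following is a minimal sketch in Python (not SAS, and not part of the original presentation) that parses the slide's example document and shows that a case mismatch or a second root element is rejected. The helper name `is_well_formed` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The example document from the slide.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag>
  <nestedTag anAttribute="wha"> Other content </nestedTag>
</LarryRootTag>"""

root = ET.fromstring(SAMPLE)

# Each child element: (tag name, attributes, trimmed text content).
children = [
    (child.tag, dict(child.attrib), (child.text or "").strip())
    for child in root
]


def is_well_formed(text):
    """Hypothetical helper: True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Tags are case sensitive, so <a> is not closed by </A>.
case_mismatch_ok = is_well_formed("<a></A>")
# Only one root element is allowed.
two_roots_ok = is_well_formed("<a/><b/>")
```

Here `children` holds one entry per nested element, with the attribute `anAttribute` attached to the second `nestedTag`, and both malformed strings are rejected.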

  4. Word XML

  5. Word XML • Body (w:body) • Section (wx:sect) • Paragraph (w:p) • Run (w:r) • Text (w:t) • Properties (w:pPr, w:rPr)

  6. Extracting Text and Properties

  7. What Does SAS Need? • SAS XML Engine • Needs XMLMAP file • Can use XML Mapper to generate XMLMAP • Only needs to be generated once for each type of extract

  8. Example Document: Styles and Colors Have Meaning
      I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

  9. Style and Color • Style is “Treated” – a statement about treatment • Color is “Red” - represents negative affect

  10. Example Document as XML
      I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
      Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
      Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr

  11. Rows • The XMLMAP has to describe a path that delineates rows • In this case it’s each text element in a run (in a paragraph…)
      <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

  12. Columns – the Text • The XMLMAP has to describe a path that delineates each column • The text itself is:
      <COLUMN name="t">
        <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>

  13. Columns – the Text Element Number • A sequential number for the text element is:
      <COLUMN name="tNum" ordinal="YES" retain="YES">
        <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>

  14. Columns – the Paragraph Number • A sequential number for the paragraph is:
      <COLUMN name="pNum" ordinal="YES" retain="YES">
        <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
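
To make the row and counter paths concrete, here is a small Python sketch (an illustration only, not the SAS XML engine) that produces one row per `w:t` element and numbers text elements and paragraphs the way `tNum` and `pNum` are described above. The sample document, including how its sentences are split into runs, is invented for the example; the two namespace URIs are the ones used by Word 2003 XML.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hypothetical two-paragraph document; the second paragraph has two runs.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p><w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
    <w:p>
      <w:r><w:t>The sky is blue </w:t></w:r>
      <w:r><w:t>and the sea is green.</w:t></w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def read_rows(xml_text):
    """One row per w:t element, with running text and paragraph counters."""
    root = ET.fromstring(xml_text)  # root is w:wordDocument
    rows = []
    t_num = 0
    # pNum goes up at the start of every paragraph.
    for p_num, para in enumerate(root.findall("w:body/wx:sect/w:p", NS), start=1):
        # tNum goes up at the start of every text element.
        for text_el in para.findall("w:r/w:t", NS):
            t_num += 1
            rows.append({"t": text_el.text, "tNum": t_num, "pNum": p_num})
    return rows


rows = read_rows(SAMPLE)
```

With this sample, the three rows carry `tNum` 1, 2, 3 and `pNum` 1, 2, 2, so the two runs of the second paragraph can be regrouped by `pNum`.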

  15. Columns – Paragraph Color
      <COLUMN name="PColorVal" retain="YES">
        <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>

  16. Columns – Run Color
      <COLUMN name="RColorVal" retain="YES">
        <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>

  17. Columns – Run Style
      <COLUMN name="RStyleval">
        <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val</PATH>
        <TYPE>character</TYPE>
        <DATATYPE>string</DATATYPE>
        <LENGTH>11</LENGTH>
      </COLUMN>
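
The difference between the retained color columns and the non-retained style column can be sketched in Python (again an illustration, not SAS). The sketch assumes that `retain="YES"` means a value is carried forward from row to row until a new non-missing value replaces it; the sample document and the helper names `attr_val` and `read_styled_rows` are invented for the example.

```python
import xml.etree.ElementTree as ET

WORD_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": WORD_NS, "wx": "http://schemas.microsoft.com/office/word/2003/auxHint"}
VAL = "{" + WORD_NS + "}val"  # w:val is a namespace-qualified attribute

# Hypothetical document: a red "Treated" paragraph, then a plain one.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r>
        <w:rPr><w:rStyle w:val="Treated"/><w:color w:val="FF0000"/></w:rPr>
        <w:t>That was very rude treatment.</w:t>
      </w:r>
    </w:p>
    <w:p><w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def attr_val(parent, path):
    """Return the w:val attribute of the element at path, or None."""
    el = parent.find(path, NS)
    return el.get(VAL) if el is not None else None


def read_styled_rows(xml_text):
    root = ET.fromstring(xml_text)
    rows = []
    p_color = r_color = None  # assumed retain="YES" behavior: carried forward
    for para in root.findall("w:body/wx:sect/w:p", NS):
        p_color = attr_val(para, "w:pPr/w:rPr/w:color") or p_color
        for run in para.findall("w:r", NS):
            r_color = attr_val(run, "w:rPr/w:color") or r_color
            r_style = attr_val(run, "w:rPr/w:rStyle")  # not retained
            for text_el in run.findall("w:t", NS):
                rows.append({
                    "t": text_el.text,
                    "PColorVal": p_color,
                    "RColorVal": r_color,
                    "RStyleval": r_style,
                })
    return rows


rows = read_styled_rows(SAMPLE)
```

Under that assumption, the second row keeps the colors of the first paragraph even though it has none of its own, while its style is missing. Whether stale retained values need to be cleared in a later step depends on the document and is worth checking against the real output.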

  18. The Data as Read into SAS

  19. Tables

  20. Our Sample Tables • Read all data from all tables into one dataset • Add variables to indicate table, row, column

  21. The Tables Dataset

  22. The Tables Dataset • Callouts: "Ended first table", "Started third table"

  23. Word XML – Tables
      • Absolute Path: /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
      • Relative Path: w:tc/w:p/w:r/w:t

  24. Count Table Beginnings
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">w:tbl</INCREMENT-PATH>

  25. Count Table Endings
      <INCREMENT-PATH beginend="END" syntax="XPath">w:tbl</INCREMENT-PATH>
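
Counting both the beginnings and the endings of `w:tbl` makes it possible to tell which table a piece of text belongs to, and whether it sits inside a table at all. A rough Python sketch of that idea (not SAS) follows; the column names `tblStart`, `tblEnd`, and `inTable` and the sample document are assumptions made for the illustration.

```python
import io
import xml.etree.ElementTree as ET

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

# Hypothetical document: a table, a plain paragraph, then a second table.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:p><w:r><w:t>between</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""


def read_table_text(xml_text):
    """One row per w:t, tagged with the table begin and end counters."""
    started = ended = 0
    rows = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, el in ET.iterparse(source, events=("start", "end")):
        if el.tag == W + "tbl":
            if event == "start":
                started += 1  # like beginend="BEGIN"
            else:
                ended += 1    # like beginend="END"
        elif el.tag == W + "t" and event == "end":
            rows.append({
                "t": el.text,
                "tblStart": started,
                "tblEnd": ended,
                # More tables started than ended: we are inside one.
                "inTable": started > ended,
            })
    return rows


rows = read_table_text(SAMPLE)
```

In this sample the text "between" gets equal begin and end counts, which marks it as lying outside any table.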

  26. Graphics

  27. Drawing Object Parameters • VML: Vector Markup Language • This example reads only lines (they’re the easiest) • Other drawing objects use different XML elements

  28. Our Example Drawing

  29. Word XML – Drawn Lines

  30. One Row for Each Line Element
      <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>

  31. Columns: Parameters as Attributes
      <COLUMN name="from">
        <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>

  32. The Dataset • Trick: "flip" indicates that the coordinates are swapped

  33. Example Code in Paper • Convert colors • Parse stroke weight (e.g. 2pt) • Detect the keyword “flip” and flip coordinates
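
The three post-processing steps listed on this slide might look roughly like the following Python sketch. It is not the code from the paper, which is written in SAS. The function names are hypothetical, and the sketch assumes a 6-digit hex color, a stroke weight given in points, "x,y" coordinate pairs without units, and a `flip:x` or `flip:y` entry in the line's style string.

```python
import re


def to_sas_color(text):
    """'#ff0000' -> 'CXFF0000' (SAS RGB color name); None if not 6-digit hex."""
    m = re.fullmatch(r"#([0-9a-fA-F]{6})", text.strip())
    return "CX" + m.group(1).upper() if m else None


def parse_weight(text):
    """'2pt' -> 2.0; None if the value is not given in points."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*pt\s*", text)
    return float(m.group(1)) if m else None


def parse_xy(text):
    """'10,5' -> (10.0, 5.0)."""
    x, y = (float(part) for part in text.split(","))
    return x, y


def line_endpoints(from_attr, to_attr, style=""):
    """Return the two endpoints, swapping coordinates named by 'flip'."""
    (x1, y1), (x2, y2) = parse_xy(from_attr), parse_xy(to_attr)
    m = re.search(r"flip\s*:\s*([xy ]+)", style)
    axes = m.group(1) if m else ""
    if "x" in axes:
        x1, x2 = x2, x1  # assumed meaning of flip:x
    if "y" in axes:
        y1, y2 = y2, y1  # assumed meaning of flip:y
    return (x1, y1), (x2, y2)


plain = line_endpoints("0,0", "10,5")
flipped = line_endpoints("0,0", "10,5", "position:absolute;flip:y")
```

Without a flip the line runs from (0, 0) to (10, 5); with `flip:y` the y values trade places, giving (0, 5) to (10, 0).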

  34. As Drawn by SAS

  35. Contact Information • Larry Hoyle, Policy Research Institute, University of Kansas • LarryHoyle@ku.edu • http://www.ku.edu/pri/ksdata/sashttp/sugi31
