1 / 28

OrientX4.0 系统开发报告

OrientX4.0 系统开发报告. XML Group July 25, 2009. XML Keyword Search. <?xml version="1.0" encoding="GB2312"?> <bib> <book year="1994"> <title>TCP/IP Illustrated</title> <author> <last>Stevens</last> <first>W.</first> </author> <publisher>Addison-Wesley</publisher> <price>65.95</price>

vea
Download Presentation

OrientX4.0 系统开发报告

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. OrientX4.0系统开发报告 XML Group July 25, 2009

  2. XML Keyword Search • <?xml version="1.0" encoding="GB2312"?> • <bib> • <book year="1994"> • <title>TCP/IP Illustrated</title> • <author> • <last>Stevens</last> • <first>W.</first> • </author> • <publisher>Addison-Wesley</publisher> • <price>65.95</price> • </book> • ...... • <book year="1992"> • <title>Advanced Programming in the Unix environment</title> • <author> • <last>Stevens</last> • <first>W.</first> • </author> • <publisher>Addison-Wesley</publisher> • <price>65.95</price> • </book> Keyword: Stevens, Addison-Wesley <book year="1994"> <title>TCP/IP Illustrated</title> <author> <last>Stevens</last> <first>W.</first> </author> <publisher>Addison-Wesley</publisher> <price>65.95</price> </book> <book year="1992"> <title>Advanced Programming in the Unix environment</title> <author> <last>Stevens</last> <first>W.</first> </author> <publisher>Addison-Wesley</publisher> <price>65.95</price> </book>

  3. 大纲 • OrientX4.0已完成的工作 • XML文档的SAX解析 • XML文档的Dewey编码 • 存储和索引 • SLCA 算法的实现

  4. 大纲 • OrientX4.0已完成的工作 • XML文档的SAX解析 • XML文档的Dewey编码 • 存储和索引 • SLCA 算法的实现

  5. XML Documents Query Result Query/ Keywords Execute Engine Data Definition XQuery XPath Update Keyword-search Element Node Schema Address Schema Manager Index Manager Data manager Coder Records Storage OrientX4.0系统结构 xml文档的解析 Storage Manager

  6. SAX简介 • SAX(Simple API for XML) • SAX是一种解析XML文件的技术。 • 使用事件基础来处理XML文件,目前大部分XML解析器除了支持DOM外,都会一并支持SAX解析。 • SAX是一组程序接口。 • 可以将XML文件视为字符串流的数据,在读取XML元素时触发一系列事件,只需撰写所需的事件处理程序,就可以分析或取得XML元素。

  7. <book> <title>TCP/IP</title> <author> Stevens</author> <publisher>Addison</publisher> <price>65.95</price> </book> 解析文件 产生事件 SAX解析器 事件处理程序 startElement (book) startElement (title) characters (TCP/IP) . . . . . . . XML 文件 SAX 图例 • XML文件在经过SAX解析后,产生一系列事件,我们可以建立事件处理程序来处理这些事件。

  8. SAX主要事件 • startDocument事件 • startElement事件 • attribute a; • a.getLength(); • a.getQname(); • a.getValue(); • Characters事件 • endElement事件 • endDocument事件

  9. bib book year author publisher price title 1994 last first Addison 65.95 TCP/IP W. Stevens SAX解析举例 • <?xml version="1.0" encoding="GB2312"?> • <bib> • <book year="1994"> • <title>TCP/IP Illustrated</title> • <author> • <last>Stevens</last> • <first>W.</first> • </author> • <publisher>Addison-Wesley</publisher> • <price>65.95</price> • </book> • </bib>

  10. 大纲 • OrientX4.0已完成的工作 • XML文档的SAX解析 • xml文档的Dewey编码 • 存储和索引 • SLCA 算法的实现

  11. XML Documents Query Result Query/ Keywords Execute Engine Data Definition XQuery XPath Update Keyword-search Element Node Schema Address Schema Manager Index Manager Data manager Coder Records Storage OrientX4.0系统结构 xml文档的编码 Storage Manager

  12. bib book year author publisher price title 1994 last first Addison 65.95 TCP/IP W. Stevens Dewey编码 • intdewey[51]={0,-1,-1,…,-1} • 对元素、属性、属性值、文本进行编码 • 举例 0 0.0 0.0.2 0.0.0 0.0.1 0.0.4 0.0.3 0.0.0.0 0.0.1.0 0.0.2.0 0.0.2.1 0.0.3.0 0.0.4.0 0.0.2.0.0 0.0.2.1.0

  13. bib book year author publisher price title 1994 last first Addison 65.95 TCP/IP W. Stevens 编码举例 初始情况:intdewey[51]={0,-1,-1,-1,…,-1} start: bib {1,0,………} start: book {2,0,0,……..} start: year {3,0,0,0,……} start:1994 {4,0,0,0,0,…….} end:1994 {3,0,0,0,0,-1,……} end: year {2,0,0,0,-1,-1} start: title {3,0,0,1,……} • <?xml version="1.0" encoding="GB2312"?> • <bib> • <book year="1994"> • <title>TCP/IP Illustrated</title> • <author> • <last>Stevens</last> • <first>W.</first> • </author> • <publisher>Addison-Wesley</publisher> • <price>65.95</price> • </book> • </bib> 0 0.0 0.0.0 0.0.1 0.0.0.0

  14. 大纲 • OrientX4.0已完成的工作 • XML文档的SAX解析 • xml文档的Dewey编码 • 存储和索引 • SLCA 算法的实现

  15. XML Documents Query Result Query/ Keywords Execute Engine Data Definition XQuery XPath Update Keyword-search Element Node Schema Address Schema Manager Index Manager Data manager Coder Records Storage OrientX4.0系统结构 Dewey编码的存储 与索引 Storage Manager

  16. 编码的组织 • 目标:给出要查询的关键字迅速找到对应的Dewey编码 • hash倒排表存储编码 • 几点要注意的问题 • 关键字的分割 • 无意义关键字 • Hash冲突解决

  17. 倒排表结构 图 hash inverted list

  18. 552 Stevens 5 0.0.2.0.0 NULL NULL 一个例子 • key: Stevens; Deweycode: 0.0.2.0.0 • hash(Stevens)=((S*31+t)*31+…)/Nhash=552

  19. 552 Stevens 5 0.0.2.0.0 NULL 一个例子 • key: Stevens; Deweycode: 0.1.2.0.0 • hash(Stevens)=552 5 0.1.2.0.0 NULL

  20. 552 Stevens 5 0.0.2.0.0 一个例子 • key: xxx, 0.1.3.0 • hash(xxx)=552 5 0.1.2.0.0 NULL xxx 4 0.1.3.0 NULL NULL

  21. 一些问题 • 关键字的分割 • 文本结点 • 属性值结点 • 选择合适的分隔符: 空格,逗号,冒号,…. • 意义不大的关键字 • 定冠词 • 连词 • …… 可能含有多个关键字

  22. 大纲 • OrientX4.0已完成的工作 • XML文档的SAX解析 • xml文档的Dewey编码 • 存储和索引 • SLCA 算法的实现

  23. XML Documents Query Result Query/ Keywords Execute Engine Data Definition XQuery XPath Update Keyword-search Element Node Schema Address Schema Manager Index Manager Data manager Coder Records Storage OrientX4.0系统结构 SLCA算法的实现 Storage Manager

  24. SLCA的实现 • Naïve 方法 • 对所有关键字的组合求LCA • 从结果中去掉祖先结点 • 剩余的就是SLCA结点 • Key S Stevens:0.0.2.0.0,0.1.2.0.0 • Key A Addison-Wesley:0.0.3.0,0.1.3.0

  25. bib book book …… author publisher price author publisher …. …. …. last first Addison 65.95 last first Addison W. W. Stevens Stevens Stack算法 • Key S Stevens:0.0.2.0.0,0.1.2.0.0 • Key A Addison-Wesley:0.0.3.0,0.1.3.0 S A (a) node0.0.2.0.0 0 0.0 0.1 0.0.2 0.0.3 0.0.2.0 0.0.3.0 0.1.3.0 0.0.2.0.0 0.1.2.0.0

  26. bib book book …… author publisher price author publisher …. …. …. last first Addison 65.95 last first Addison W. W. Stevens Stevens Stack算法 • Key S Stevens:0.0.2.0.0,0.1.2.0.0 • Key A Addison-Wesley:0.0.3.0,0.1.3.0 S A S A (a) node0.0.2.0.0 0 (b) node0.0.3.0 0.0 0.1 0.0.2 0.0.3 0.0.2.0 0.0.3.0 0.1.3.0 0.0.2.0.0 0.1.2.0.0

  27. bib book book …… author publisher price author publisher …. …. …. last first Addison 65.95 last first Addison W. W. Stevens Stevens Stack算法 • Key S Stevens:0.0.2.0.0,0.1.2.0.0 • Key A Addison-Wesley:0.0.3.0,0.1.3.0 S A S A (a) node0.0.2.0.0 0 report 0.0 as a slca (b) node0.0.3.0 0.0 0.1 0.0.2 0.0.3 (c) node0.1.2.0.0 0.0.2.0 0.0.3.0 0.1.3.0 0.0.2.0.0 0.1.2.0.0 report 0.1 as a slca (d) node0.1.3.0

  28. Thanks!

More Related