
A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort

Toshiyuki Shimizu (Kyoto University), Masatoshi Yoshikawa (Kyoto University). ICADL 2007, 12th December.


Presentation Transcript


  1. A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort. Toshiyuki Shimizu (Kyoto University), Masatoshi Yoshikawa (Kyoto University). ICADL 2007, 12th December

  2. XML-IR systems
     • Growing demand for XML Information Retrieval (XML-IR) systems
     • Encoding documents in XML lets us identify meaningful document fragments
       • Example: sections, subsections, and paragraphs in scholarly articles
       • Users can browse only the document fragments relevant to a certain topic
     • The simplest form of query for XML-IR is just a set of keywords
       • Simple and intuitively understandable, yet useful, especially for unskilled end users
     • Active research area, as in INEX*
     * INitiative for the Evaluation of XML Retrieval (http://inex.is.informatik.uni-duisburg.de/)

  3. Results of XML-IR Systems
     • Document fragment (element)
     • With relevance degree (score)
     • Example: the query term is “XML”

     <?xml version="1.0"?>
     <article>
       <sec>
         <p>XML labeling</p>
         <p>The structure of XML is a tree, and each node in the XML is labeled.</p>
         <p>We can get tag name of each XML element.</p>
       </sec>
       <sec>
         <p>Tree index</p>
         <p>XML index is constructed using the labels</p>
       </sec>
     </article>

     Element tree with scores:
     e0 article (0.56)
       e1 sec (0.64)
         e2 p (0.4)
         e3 p (0.9)
         e4 p (0.33)
       e5 sec (0.35)
         e6 p (0)
         e7 p (0.8)

  4. Naïve XML-IR System
     • Thorough strategy of INEX 2005
       • Simply retrieves the relevant elements from among all elements and ranks them in order of relevance
       • Result for the example: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
     • Thorough is intended for system evaluation
     • User behavior when browsing search results must also be considered
     (Figure: the element tree annotated with scores; e6 has score 0 and is not retrieved.)
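
As a minimal sketch of the Thorough strategy described on this slide, the following Python snippet ranks the example elements by relevance score. The score values are the ones shown in the example; the function name `thorough` and the rule that zero-score elements are dropped are assumptions made for illustration.

```python
# Relevance scores from the running example (query term "XML").
SCORES = {
    "e0": 0.56, "e1": 0.64, "e2": 0.4, "e3": 0.9,
    "e4": 0.33, "e5": 0.35, "e6": 0.0, "e7": 0.8,
}


def thorough(scores):
    """Return (element, score) pairs with a positive score, best first."""
    relevant = [(name, score) for name, score in scores.items() if score > 0]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)


ranking = thorough(SCORES)
print([name for name, _ in ranking])
```

Note that the ranking ignores the tree structure entirely, which is why nested elements such as e3 and e1 both appear in it.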

  5. Problems of Thorough Retrieval for XML-IR
     • Nesting elements
       • Browsing both of two nested elements is redundant
       • Ancestor element ea, then descendant element ed: ed has already been fully seen
       • Descendant element ed, then ancestor element ea: ea has already been partially seen
     • Element size
       • Elements retrieved by XML-IR systems vary widely in size
         • Large elements, such as article (whole document)
         • Small elements, such as p (paragraph)
       • The total output size of the top-k elements cannot be controlled simply by giving an integer k
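
The nesting problem can be made concrete with a short sketch. The snippet below, written in Python for illustration, lists the pairs in a ranked result that stand in an ancestor/descendant relation. The tree shape is that of the running example; the helper names `is_ancestor` and `overlapping_pairs` are hypothetical.

```python
# Parent of each element in the running example (None marks the root).
PARENT = {
    "e0": None, "e1": "e0", "e2": "e1", "e3": "e1",
    "e4": "e1", "e5": "e0", "e6": "e5", "e7": "e5",
}


def is_ancestor(a, d):
    """True if element a is a proper ancestor of element d."""
    p = PARENT[d]
    while p is not None:
        if p == a:
            return True
        p = PARENT[p]
    return False


def overlapping_pairs(ranked):
    """List (earlier, later) pairs of a ranked list that nest in either direction."""
    pairs = []
    for i, first in enumerate(ranked):
        for second in ranked[i + 1:]:
            if is_ancestor(first, second) or is_ancestor(second, first):
                pairs.append((first, second))
    return pairs


top4 = ["e3", "e7", "e1", "e0"]  # top four elements of the Thorough ranking
print(overlapping_pairs(top4))
```

In this example four of the six possible pairs among the top four elements overlap, so a user reading down the list sees much of the same content repeatedly.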

  6. Overview of our Approach
     • Introduction of the concepts of benefit and reading effort
       • Users can control the total output size
       • Systems can retrieve non-overlapping elements

  7. Properties of Benefit and Reading Effort (1/2)
     • Benefit
       • The benefit of an element is the amount of information about the query gained by reading the element
       • Assumption 1: The benefit of an element is greater than or equal to the sum of the benefits of its child elements
         • Information complementation among sibling elements
         • Example: for two query terms A and B, if e6 contains topics about A and e7 contains topics about B, the benefit of e5 seems to be greater than the sum of the benefits of e6 and e7
     (Figure: e5 (sec) with child elements e6 (p) and e7 (p).)

  8. Properties of Benefit and Reading Effort (2/2)
     • Reading Effort
       • The reading effort of an element is the cost of reading the content of the element
       • Assumption 2: The reading effort of an element is less than or equal to the sum of the reading efforts of its child elements
         • Readability of continuous reading
         • Example: users can read the same content more easily by reading e5 than by reading e6 and e7 separately
     (Figure: e5 (sec) with child elements e6 (p) and e7 (p), comparing the reading effort of e5 with that of e6 and e7.)
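
Assumptions 1 and 2 can be checked mechanically for any assignment of values to a tree. The Python sketch below does so for the running example. The per-element benefit and reading-effort numbers are illustrative assumptions: they were chosen to be consistent with the scores and totals that appear elsewhere in this transcript, but the slides' own per-element figures are not preserved. The function name `violations` is hypothetical.

```python
# Child lists of the non-leaf elements in the running example.
CHILDREN = {"e0": ["e1", "e5"], "e1": ["e2", "e3", "e4"], "e5": ["e6", "e7"]}

# Illustrative values (assumed, see the note above).
BENEFIT = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
EFFORT = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 15, "e7": 10}


def violations(children, benefit, effort):
    """Return (element, property) pairs that break Assumption 1 or 2."""
    found = []
    for parent, kids in children.items():
        # Assumption 1: benefit(parent) >= sum of the children's benefits.
        if benefit[parent] < sum(benefit[k] for k in kids):
            found.append((parent, "benefit"))
        # Assumption 2: effort(parent) <= sum of the children's reading efforts.
        if effort[parent] > sum(effort[k] for k in kids):
            found.append((parent, "reading effort"))
    return found


print(violations(CHILDREN, BENEFIT, EFFORT))
```

With these values no element violates either assumption; for instance, e1 has benefit 18, which is at least 2 + 9 + 5, and reading effort 28, which is at most 5 + 10 + 15.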

  9. Overview of our Approach
     • Introduction of the concepts of benefit and reading effort
       • Users can control the total output size
       • Systems can retrieve non-overlapping elements
     • Flexible retrieval
       • Users specify a threshold for the total amount of reading effort
       • The system returns relevant elements that provide a larger benefit and can be read within the specified reading effort

  10. Flexible Retrieval
     • Systems calculate benefit and reading effort
     • A variant of the knapsack problem
     • Example: threshold of reading effort 15 -> retrieve {e2, e3} (total benefit: 11)
     (Figure: the element tree, e0 to e7.)

  11. Flexible Retrieval
     • Systems calculate benefit and reading effort
     • A variant of the knapsack problem
     • Example: threshold of reading effort 20 -> retrieve {e3, e7} (total benefit: 17)
     (Figure: the element tree, e0 to e7.)

  12. Search Result Continuity
     • Example:
       • Reading effort 15: retrieve {e2, e3} (benefit: 11)
       • Reading effort 20: retrieve {e3, e7} (benefit: 17)
     • The running example violates search result continuity
       • The content of the element set for reading effort r must be contained in the content of the element set for reading effort r' if r <= r'
     • The optimal solution
       • is NP-hard to compute (a variant of the knapsack problem)
       • may violate search result continuity
     • Approach: greedy retrieval algorithm
     (Figure: the element tree, e0 to e7.)
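
To illustrate why the optimal solution can violate search result continuity, the following Python sketch finds the best non-overlapping element set by exhaustive search, which is feasible only because the example has eight elements. The benefit and reading-effort values are assumptions: those of e3, e7, and e1 follow from the running totals on slides 14 to 16, and the others were chosen to be consistent with the displayed scores and with the results for thresholds 15 and 20. The function names are hypothetical.

```python
from itertools import combinations

PARENT = {
    "e0": None, "e1": "e0", "e2": "e1", "e3": "e1",
    "e4": "e1", "e5": "e0", "e6": "e5", "e7": "e5",
}
# Illustrative values (assumed, see the note above).
BENEFIT = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
EFFORT = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 15, "e7": 10}


def ancestors(name):
    """Set of proper ancestors of an element."""
    found = set()
    p = PARENT[name]
    while p is not None:
        found.add(p)
        p = PARENT[p]
    return found


def non_overlapping(elements):
    """True if no two elements are in an ancestor/descendant relation."""
    return all(a not in ancestors(b) and b not in ancestors(a)
               for a, b in combinations(elements, 2))


def optimal(threshold):
    """Exhaustively find the non-overlapping set with maximum total benefit."""
    best, best_benefit = (), 0
    names = sorted(PARENT)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            effort = sum(EFFORT[n] for n in subset)
            benefit = sum(BENEFIT[n] for n in subset)
            if effort <= threshold and non_overlapping(subset) and benefit > best_benefit:
                best, best_benefit = subset, benefit
    return set(best), best_benefit


def content_contained(small, large):
    """True if every element of small is in large or lies below an element of large."""
    return all(e in large or bool(ancestors(e) & large) for e in small)


set15, benefit15 = optimal(15)
set20, benefit20 = optimal(20)
print(sorted(set15), benefit15)
print(sorted(set20), benefit20)
print(content_contained(set15, set20))
```

Under these assumed values the search reproduces the slide's results, {e2, e3} for threshold 15 and {e3, e7} for threshold 20, and the containment check fails because e2 is dropped when the threshold grows.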

  13. Retrieval Algorithm
     • Based on the result of the Thorough strategy*
     • Adjust the benefit and reading effort of the elements that nest with a retrieved element, and rerank
     • Remove overlapping content caused by nesting
     * Simply retrieves the relevant elements from among all elements and ranks them in order of relevance
     Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
     (Figure: the element tree annotated with scores; e6 has score 0.)

  14. Retrieval Algorithm (threshold of reading effort: 40)
     • Our result: e3 (0.9)
     • Amount of benefit: 9
     • Amount of reading effort: 10
     • Adjust e1 and e0: e1 0.64 -> 0.5, e0 0.56 -> 0.48
     • Rest of the list: e7 (0.8), e1 (0.5), e0 (0.48), e2 (0.4), e5 (0.35), e4 (0.33)

  15. Retrieval Algorithm (threshold of reading effort: 40)
     • Our result: e3 (0.9), e7 (0.8)
     • Amount of benefit: 9 -> 17
     • Amount of reading effort: 10 -> 20
     • Adjust and rerank e5 and e0: e5 0.35 -> 0, e0 0.48 -> 0.37
     • Rest of the list: e1 (0.5), e2 (0.4), e0 (0.37), e4 (0.33), e5 (0)

  16. Retrieval Algorithm (threshold of reading effort: 40)
     • Our result: e3 (0.9), e7 (0.8), e1 (0.5)
     • Amount of benefit: 17 -> 26
     • Amount of reading effort: 20 -> 38
     • Adjust and rerank e0: 0.37 -> 0.17
     • Rest of the list: e2 (0.4), e4 (0.33), e0 (0.17), e5 (0)

  17. Retrieval Algorithm (threshold of reading effort: 40)
     • Our result: e7 (0.8), e1 (0.5); e3 is contained in e1, so it is removed from the result as overlapping content
     • Amount of benefit: 26
     • Amount of reading effort: 38
     • Rest of the list: e2 (0.4), e4 (0.33), e0 (0.17), e5 (0)
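
The walkthrough on slides 14 to 17 can be sketched in Python as follows. This is one reading of the slides, not the authors' implementation. Three points are assumptions inferred from the numbers shown: the score is taken to be benefit divided by reading effort, ancestors are adjusted by subtracting the retrieved element's remaining benefit and effort, and the loop stops at the first candidate that does not fit under the threshold. The per-element values are likewise illustrative, chosen to match the scores and totals on the slides.

```python
PARENT = {
    "e0": None, "e1": "e0", "e2": "e1", "e3": "e1",
    "e4": "e1", "e5": "e0", "e6": "e5", "e7": "e5",
}
# Illustrative values (assumed, see the note above).
BENEFIT = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
EFFORT = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 15, "e7": 10}


def ancestors(name):
    """Proper ancestors of an element, nearest first."""
    chain = []
    p = PARENT[name]
    while p is not None:
        chain.append(p)
        p = PARENT[p]
    return chain


def greedy_retrieve(threshold):
    """Return (result list, total benefit, total reading effort)."""
    benefit = dict(BENEFIT)  # remaining benefit not yet covered by the result
    effort = dict(EFFORT)    # remaining effort not yet covered by the result
    candidates = set(PARENT)
    result, total_benefit, total_effort = [], 0, 0

    def score(name):
        return benefit[name] / effort[name] if effort[name] > 0 else 0.0

    while candidates:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        if total_effort + effort[best] > threshold:
            break  # assumed stopping rule: stop at the first element that does not fit
        candidates.remove(best)
        total_benefit += benefit[best]
        total_effort += effort[best]
        # An ancestor replaces its already retrieved descendants in the result.
        result = [r for r in result if best not in ancestors(r)] + [best]
        # Descendants of a retrieved element are already covered.
        candidates = {c for c in candidates if best not in ancestors(c)}
        # Ancestors lose the part of their content that is now covered.
        for a in ancestors(best):
            benefit[a] -= benefit[best]
            effort[a] -= effort[best]
    return result, total_benefit, total_effort


print(greedy_retrieve(40))
print(greedy_retrieve(20))
print(greedy_retrieve(15))
```

For the threshold of 40 this reproduces the final state on slide 17: the result is e7 and e1 with benefit 26 and reading effort 38. Because elements are always added in the same order, the content returned for a smaller threshold is contained in the content returned for a larger one, which is the continuity property the optimal solution lacks.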


  20. Evaluation Metrics
     • Based on benefit and reading effort
     • b/e graph (benefit/effort graph)
       • Comparison with BTIL (Best Thorough Input List)
         • The BTIL system is a system that uses the actual benefit and reading effort
         • Actual benefit is calculated using manually constructed assessments (e.g., INEX)
       • We can observe the benefit obtained relative to BTIL as the specified threshold of reading effort changes
       • The same reading-effort values are used for the implemented system and the BTIL system

  21. (Figure: two copies of the element tree, one annotated with calculated benefit/reading effort and one with actual benefit/reading effort.)
     • For a reading-effort threshold of 30:
       • The BTIL system retrieves {e3, e6}; the obtained actual benefit is 23
       • The implemented system retrieves {e3, e7}; the obtained actual benefit is 10
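
One plausible way to turn this comparison into a single point of a b/e graph is to divide the actual benefit obtained by the implemented system by the actual benefit obtained by the BTIL system at the same threshold. The slides do not state the exact normalization, so the ratio below is an assumption, and the function name `relative_effectiveness` is hypothetical.

```python
def relative_effectiveness(system_benefit, btil_benefit):
    """Actual benefit of a system as a fraction of the BTIL benefit (assumed measure)."""
    if btil_benefit <= 0:
        raise ValueError("BTIL benefit must be positive")
    return system_benefit / btil_benefit


# Values from the example above, for a reading-effort threshold of 30.
print(round(relative_effectiveness(10, 23), 3))
```

For the example this gives 10/23, or about 0.435; repeating the calculation over a range of thresholds would give one curve per system.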

  22. Examples of b/e Graph using INEX 2005 Test Collection (1/2)
     • XML document set, topics, assessments
     • Calculate actual benefit and reading effort from the assessments
       • ex (exhaustivity): highly exhaustive (HE) -> 1, partially exhaustive (PE) -> 0.5, not exhaustive (NE) -> 0
       • rsize: relevant text length (in number of characters)
       • size: element length (in number of characters)
     • We implemented a system using tf-ief
       • ief stands for inverse element frequency
       • Satisfies the assumptions for benefit and reading effort
       • Uses a parameter (the formulas and the parameter symbol are missing from this transcript)

  23. Examples of b/e Graph using INEX 2005 Test Collection (2/2)
     (Figure: b/e graphs for Topic 207 and Topic 206.)
     • We can observe the relative effectiveness of the implemented systems against the BTIL system

  24. Conclusions and Future Work
     • Conclusions
       • Introduction of benefit and reading effort
         • Handling nesting elements
         • Variety of element size
       • Algorithm for flexible retrieval
         • Result elements change depending on the specified reading effort
       • System evaluation
     • Future work
       • Introduction of switching effort
         • Cost of switching a result item in the results list
         • Retrieving numerous results increases the cost of browsing
       • Integration with user interface
