XML Retrieval Chapter 10
Introduction • Chapter Outline • XML basic concepts • Differences between XML and Unstructured Retrieval • Vector space model in XML Retrieval • Evaluation on XML retrieval: INEX • Text-centric vs. Data-centric XML Retrieval
XML Retrieval • Structured Retrieval • Finds information in structured documents • This chapter mainly covers text-centric XML • cf. data-centric XML • XML retrieval vs. parametric and zone search • An intermediate concept between unstructured and structured retrieval • Parametric fields and zones (author, title…) • Flat: no nesting of attributes, and the number of attributes is small • Characteristics specific to XML • Has a more complex tree structure • Attributes can be nested • The number of attributes is greater than in parametric and zone search • In this book, "structured retrieval" always refers to XML retrieval
Basic concept of XML • XML document • An ordered, labeled tree • Each node of the tree is an XML element • Has an opening and a closing tag • Can have attributes • Internal nodes and leaf nodes • Internal nodes encode structure • Leaf nodes contain text • XML DOM API • A standard for processing XML documents • Hierarchical object structure • Processing starts at the root element and descends into its children
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <verse>Will I with..</verse>
      <title>Macbeth's castle</title>
    </scene>
  </act>
</play>
Basic concept of XML • XPath • A standard for enumerating paths in an XML document • act/scene • Selects all scene elements whose parent is an act element • play//scene • Selects all scene elements occurring anywhere inside a play element • /play/title • Selects the play's title • /play//title • Selects the play's title and the scene's title • /scene/title • Selects no elements, because scene is not the root element • title#"Macbeth" • Selects all titles containing the term "Macbeth"
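The path expressions above can be tried out with a small sketch. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and does not accept absolute paths on an element, so each expression is rewritten relative to the root `<play>` element. The document string and the helper `texts` are assumptions made for illustration (a play with `author`, `title`, and `act` as direct children of `play`), and `title#"Macbeth"` is approximated by a plain substring test.

```python
import xml.etree.ElementTree as ET

# Illustrative document (an assumption for this sketch).
PLAY_XML = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <verse>Will I with..</verse>
      <title>Macbeth's castle</title>
    </scene>
  </act>
</play>
"""

root = ET.fromstring(PLAY_XML)  # root is the <play> element


def texts(elements):
    """Return the text content of each element in a list."""
    return [e.text for e in elements]


# act/scene: scene elements whose parent is an act element
act_scene = root.findall(".//act/scene")

# play//scene: scene elements anywhere below the play element
play_scene = root.findall(".//scene")

# /play/title: title elements that are direct children of the root play
play_title = texts(root.findall("title"))

# /play//title: title elements anywhere below the root play
all_titles = texts(root.findall(".//title"))

# /scene/title: only matches if the root element itself is a scene
scene_title = texts(root.findall("title")) if root.tag == "scene" else []

# title#"Macbeth": titles containing "Macbeth" (substring test as a stand-in)
macbeth_titles = [t for t in all_titles if "Macbeth" in t]
```

Here `play_title` holds only the play's title, while `all_titles` and `macbeth_titles` hold both the play's and the scene's title, which is the ambiguity the later slides on the retrieval principle discuss.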
Basic concept of XML • NEXI • Narrowed Extended XPath I • A common format for XML queries • Element + Modifier • Example • //article[.//yr=2001 or .//yr=2002]//section[about(.,summer holidays)] • Path filter • The two yr conditions (arithmetic filtering) • The about clause (string filtering)
Challenges in XML Retrieval • Structured Retrieval • Queries and documents can each be structured or unstructured • ex) //article//section vs. "summer holidays" • Most users want only part of a document • ex) Shakespeare, "Macbeth's castle" • Should we return the <scene>, the <act>, or the entire <play> element? • "Macbeth's castle" is the title of a scene, so the scene is probably what the user needs • Structured document retrieval principle • A system should always retrieve the most specific part of a document that answers the query
Challenges in XML Retrieval • Structured document retrieval principle • Applying the principle in practice is not easy • title#"Macbeth" • → matches both /play/title ("Macbeth") and /play/act/scene/title ("Macbeth's castle") • In this case, the play's title is preferred, even though it is the higher node • Indexing unit problem • Which parts of a document should be indexed? • In unstructured retrieval, the whole document is the indexing unit • In structured retrieval, several strategies exist
Challenges in XML Retrieval • Indexing unit strategies • Group nodes into non-overlapping pseudo-documents • Use one of the largest elements as the indexing unit • Descend toward its leaves in post-processing (two-step, top-down) • Use all leaves as indexing units • Extend to larger units in post-processing (two-step, bottom-up) • Index all elements • Ex) Non-overlapping pseudo-documents
Challenges in XML Retrieval • Relevant statistics for XML retrieval • Nested elements complicate the computation of term statistics • Ex) inverse document frequency • The term "Gates" occurs both as author#"Gates" and as section#"Gates" • Should idf for "Gates" be computed for the term alone, or for each structure+term pair? • Schema heterogeneity • Also referred to as schema diversity • Equivalent elements may have different names → creator (d2) vs. author (d3) • Equivalent elements may have different structure → author (q3) vs. first name/last name (d3)
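The idf question raised on this slide can be made concrete with a minimal sketch. The four-document collection, the context strings, and the function names `idf_term` and `idf_structural` are all hypothetical, and idf is taken as log10(N/df); the sketch assumes the term occurs in at least one document.

```python
import math

# Hypothetical collection: each document is a set of (context, term) pairs.
docs = {
    "d1": {("book/author", "Gates")},
    "d2": {("book/section", "Gates")},
    "d3": {("book/author", "Smith")},
    "d4": {("book/title", "XML")},
}


def idf_term(term):
    """idf computed for the term alone, ignoring the XML context."""
    df = sum(1 for pairs in docs.values() if any(t == term for _c, t in pairs))
    return math.log10(len(docs) / df)


def idf_structural(context, term):
    """idf computed for one structure+term pair."""
    df = sum(1 for pairs in docs.values() if (context, term) in pairs)
    return math.log10(len(docs) / df)


gates_any = idf_term("Gates")                           # df = 2
gates_author = idf_structural("book/author", "Gates")   # df = 1
```

In this toy collection the structure+term pair is rarer than the bare term, so it receives a higher idf; which of the two statistics is appropriate is exactly the open choice the slide points out.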
Challenges in XML Retrieval • Schema heterogeneity • Extended query • Transform the query: q3 → q4 • In pseudo-XPath notation: book//#"Gates" • Users are often not familiar with element names and structure • Allows any number of intervening nodes between "book" and "Gates"
Challenges in XML Retrieval • Schema heterogeneity • Extended query q6 will return nothing • Structural mismatch • Extended queries do not help here • Such documents should be ranked lower, but should not be omitted from the search results • Structural constraints should be interpreted as "hints"
Vector space model • Concept: structural term • A path of elements ending in a single vocabulary term • An XML context/term pair, denoted <c, t> • The figure shows 7 of the 9 structural terms • The 2 not shown are /book/author#"Bill" and /book/author#"Gates" • Lexicalized subtree (in the figure) → not a structural term
Vector space model • XML query examples as structural terms • General form: q = { (t1, c1), (t2, c1), (t3, c2), (t4)… } • c1, c2: XML contexts • t4, given without a context: a non-structural term
<chapter><title>XML tutorials</title></chapter>
q = { (XML, chapter/title), (tutorials, chapter/title) }
<article> <sec>non-monotonic reasoning</sec> "belief revision" </article>
q = { (non-monotonic, article/sec), (reasoning, article/sec), (belief revision, article) }
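The mapping from an XML fragment to (term, context) pairs shown in these examples can be sketched as follows. The function name `structural_terms` is hypothetical, and the sketch assumes plain whitespace tokenization, so a phrase such as "belief revision" is split into two terms rather than kept as one unit as on the slide.

```python
import xml.etree.ElementTree as ET


def structural_terms(xml_text):
    """Return (term, context) pairs, where the context is the element path."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(elem, path):
        context = path + [elem.tag]
        ctx = "/".join(context)
        # Text directly inside this element belongs to its own context.
        for term in (elem.text or "").split():
            pairs.append((term, ctx))
        for child in elem:
            walk(child, context)
            # Text after a child's closing tag belongs to the parent context.
            for term in (child.tail or "").split():
                pairs.append((term, ctx))

    walk(root, [])
    return pairs


q1 = structural_terms("<chapter><title>XML tutorials</title></chapter>")
q2 = structural_terms(
    "<article><sec>non-monotonic reasoning</sec> belief revision</article>"
)
```

A real system would also apply its usual term normalization (case folding, stemming, phrase handling) before pairing terms with contexts; none of that is attempted here.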
Vector space model • SimNoMerge(q, d) • CR: context resemblance • B: the set of all XML contexts • V: the vocabulary of non-structural terms • weight(q, t, c), weight(d, t, c) • The weights of term t in XML context c in query q and document d • Computed with one of the weightings from Chapter 6, such as idf_t · wf_{t,d} • Not a true cosine measure: its value can be larger than 1
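For reference, the score that these symbols belong to can be written out as below. This is a reconstruction from the definitions on this slide and the textbook's formulation of SimNoMerge, not a formula preserved in the slide text; c_k and c_l are index names chosen here for the query context and the document context.

```latex
\mathrm{SimNoMerge}(q, d) =
  \sum_{c_k \in B} \sum_{c_l \in B}
  \mathrm{CR}(c_k, c_l)
  \sum_{t \in V}
  \mathrm{weight}(q, t, c_k)\,
  \frac{\mathrm{weight}(d, t, c_l)}
       {\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}}
```

Only the document weights are length-normalized, and one query context can match several document contexts, which is why the result is not a true cosine and can exceed 1.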
Vector space model • Relevance scoring function • Cosine similarity between query q and document d • Cosine similarity between XML fragment q and XML document d (from Carmel et al. 2002, An Extension of the Vector Space Model for Querying XML Document)
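Vector space model • Structural resemblance • CR: context resemblance
$$\mathrm{CR}(C_q, C_d) = \begin{cases} \dfrac{1 + |C_q|}{1 + |C_d|} & \text{if } C_q \text{ matches } C_d \\ 0 & \text{if } C_q \text{ does not match } C_d \end{cases}$$
• |Cq|, |Cd|: the number of nodes in the query path and the document path • Cq matches Cd if Cq can be transformed into Cd by inserting additional nodes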
Vector space model • CR example • CR(Cq, Cd) = 1 if the query path and the document path are identical • ex) CR(Cq4, Cd2) = 3/4 = 0.75 • CR(Cq4, Cd3) = 3/5 = 0.6
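The two example values can be reproduced with a short sketch of context resemblance, CR = (1 + |Cq|) / (1 + |Cd|) when the query path matches the document path and 0 otherwise. Paths are represented as tuples of node labels, and "matches" is implemented as an ordered subsequence test, meaning the query path can be turned into the document path by inserting nodes. The concrete paths below are assumptions chosen to give path lengths 2, 3, and 4, consistent with the values 3/4 and 3/5; the function names are hypothetical.

```python
def context_matches(cq, cd):
    """True if cq is an ordered subsequence of cd."""
    remaining = iter(cd)
    # `node in remaining` advances the iterator until node is found.
    return all(node in remaining for node in cq)


def context_resemblance(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    if not context_matches(cq, cd):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))


# Assumed paths (term node included in the node count).
cq4 = ("book", "Gates")
cd2 = ("book", "creator", "Gates")
cd3 = ("book", "author", "lastname", "Gates")

cr_q4_d2 = context_resemblance(cq4, cd2)
cr_q4_d3 = context_resemblance(cq4, cd3)
cr_same = context_resemblance(cq4, cq4)
cr_mismatch = context_resemblance(("book", "title"), cd2)
```

Longer document paths need more inserted nodes and therefore score lower, while a query path that cannot be embedded in the document path at all scores 0.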
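Vector space model • SimNoMerge pseudo-code • N: number of documents • B: set of all XML contexts • V: vocabulary of (non-structural) terms • q: query (a set of structural terms, i.e. context/term pairs) • normalizer: sqrt( sum of (term-document weight)^2 ), one value per document • Postings are read from an inverted index over structural terms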
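The scoring loop described by these parameters might look like the following sketch. It is not the slide's pseudo-code verbatim: the dictionary-based index layout, the function names, and the toy data are assumptions for illustration, and contexts here are tuples of element names without the term node.

```python
import math
from collections import defaultdict


def context_matches(cq, cd):
    """True if cq is an ordered subsequence of cd."""
    remaining = iter(cd)
    return all(node in remaining for node in cq)


def cr(cq, cd):
    """Context resemblance: (1 + |cq|) / (1 + |cd|) on a match, else 0."""
    if not context_matches(cq, cd):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))


def build_normalizer(index):
    """Per document: sqrt of the sum of squared term-document weights."""
    squares = defaultdict(float)
    for postings in index.values():
        for doc_id, weight in postings:
            squares[doc_id] += weight ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in squares.items()}


def sim_no_merge(query, index, normalizer):
    """Score documents; query and index are keyed by (context, term)."""
    contexts = {context for context, _term in index}  # plays the role of B
    scores = defaultdict(float)
    for (cq, term), wq in query.items():
        for c in contexts:
            resemblance = cr(cq, c)
            if resemblance > 0:
                for doc_id, wd in index.get((c, term), []):
                    scores[doc_id] += resemblance * wq * wd
    return {doc_id: s / normalizer[doc_id] for doc_id, s in scores.items()}


# Hypothetical inverted index: (context, term) -> [(doc_id, weight), ...]
index = {
    (("book", "creator"), "gates"): [("d2", 1.0)],
    (("book", "author", "lastname"), "gates"): [("d3", 1.0)],
    (("article", "title"), "gates"): [("d5", 1.0)],
}
query = {(("book",), "gates"): 1.0}
scores = sim_no_merge(query, index, build_normalizer(index))
```

With this toy data, d2 scores above d3 because its context is closer to the query context, and d5 receives no score because its context does not match at all. Sorting `scores` by value would give the ranked result list.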
Evaluation on XML Retrieval • INEX • INitiative for the Evaluation of XML retrieval • Collection: about 12,000 articles from IEEE journals (2002) → English Wikipedia, en.wikipedia.org (2006) • 2 types of topics • CAS (Content and Structure) • CO (Content Only) • Component coverage • Exact coverage (E) • Too small (S) • Too large (L) • No coverage (N) • Topical relevance • Four levels, from 3 (highly relevant) to 0 (non-relevant)
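Evaluation on XML Retrieval • Quantizer function • Combines relevance and coverage into a single value
$$Q(rel, cov) = \begin{cases} 1.00 & \text{if } (rel, cov) = 3E \\ 0.75 & \text{if } (rel, cov) \in \{2E, 3L\} \\ 0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\ 0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\ 0.00 & \text{if } (rel, cov) = 0N \end{cases}$$
• Ex) a 2S component contributes 0.50 • $\#(\text{relevant items retrieved}) = \sum_{c \in A} Q(rel(c), cov(c))$, where A is the set of retrieved components • As an approximation, precision, recall, and the F measure can be applied using this definition of relevant items retrieved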
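As a worked example of the quantizer, the sketch below sums Q(rel, cov) over a retrieved list and divides by the list length to get a graded precision. The table values come from the slide; the assessment list and the function names are hypothetical.

```python
# Quantization table from the slide: "<relevance><coverage>" -> value
QUANTIZER = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}


def relevant_items_retrieved(assessments):
    """Graded count of relevant items: sum of Q over retrieved components."""
    return sum(QUANTIZER[a] for a in assessments)


def graded_precision(assessments):
    """Graded relevant items retrieved divided by the number retrieved."""
    if not assessments:
        return 0.0
    return relevant_items_retrieved(assessments) / len(assessments)


# Hypothetical assessments for four retrieved components.
retrieved = ["3E", "2S", "0N", "1L"]
total = relevant_items_retrieved(retrieved)  # 1.00 + 0.50 + 0.00 + 0.25
precision = graded_precision(retrieved)
```

Under binary relevance the same list could count up to three relevant items out of four; the graded sum is 1.75, which illustrates why scores from XML retrieval tend to come out lower.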
Evaluation on XML Retrieval • Effectiveness scores in XML retrieval are often lower than in unstructured retrieval • XML retrieval is harder • Partial retrieval (coverage issue) • XML retrieval is scored more strictly • Binary relevance is either 1 or 0, whereas graded XML relevance reaches 1 only in the best case • Scores from structured retrieval are therefore not directly comparable with scores from unstructured retrieval
Evaluation on XML Retrieval • Large increase in precision at k for k = 5 and k = 10 • Structure helps to increase precision at the top of the result list • Structured retrieval is better suited to precision-oriented tasks • Recall may suffer
Text-centric vs. Data-centric XML • Text-centric XML Retrieval • Long text fields • Inexact matching • Relevance-ranked results • Assembly manuals, issues of journals, newswire articles… • Data-centric XML Retrieval • No ranking • Exact matching • Commonly used for data collections with complex structure • Mainly contains non-text data • Most data-centric XML retrieval systems are extensions of relational database systems
Appendix • Text-centric vs. Data-centric XML document • Also referred to as document-like vs. record-like XML documents • Document-like XML is also referred to as narrative-like XML • in XML in a Nutshell (O'Reilly, 3rd ed.) • Document-like (= text-centric) XML examples • XHTML: MSDN library documents, Wikipedia • Meant for human beings to read (with an appropriate Schema/DTD) • Record-like (= data-centric) XML examples • SOAP, RSS specifications using XML • Commonly used in communication-type applications • cf. XML database • Does not necessarily store the native (text) XML document • Provides the XML document as the fundamental unit of logical storage • XML-enabled RDBMS vs. native XML database
Text-centric XML document • Page from INEX 2009 corpus
Text-centric XML document • Page from Wikipedia
Data-centric XML document • From a SOAP response • Clearly, information for machines
Data-centric XML document • From the RSS format • Classified as record-like XML, but partially human-readable