This paper presents an effective approach to XML keyword search that addresses ambiguities and ranks results based on user intent. The study identifies key issues in existing search methods, including the failure to consider semantic meanings and the challenges posed by keyword ambiguity. It introduces a relevance-oriented ranking system utilizing TF-IDF metrics and algorithms to evaluate node types and their confidence as potential search results. Experimental evaluations demonstrate significant improvements in the relevance of search outcomes, ultimately leading to more meaningful user experiences.
Effective XML Keyword Search with Relevance Oriented Ranking
Presentation by Volker Rehberg
Paper by Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu

Agenda
I) Motivation and Background
II) Inferring Keyword Search Intention
III) Relevance Oriented Ranking
IV) Algorithms
V) Experimental Evaluation
VI) Conclusion
Motivation and Background
What is "Effective XML Keyword Search with Relevance Oriented Ranking" all about?
• Keyword search
Issue 1: identify the search for node
Issue 2: identify the search via node
Issue 3: rank each query result

Motivation and Background
Ambiguities in interpreting the search for node and the search via node:
Ambiguity 1: A keyword can appear both as an XML tag name and as a text value of some other nodes.

Motivation and Background
Ambiguities in interpreting the search for node and the search via node:
Ambiguity 2: A keyword can appear as the text value of different types of XML nodes and carry different meanings.
Motivation and Background
Keyword query: Customer interest art
SLCA returns 5 results without any ranking.
Only the customer with ID C4 is desired and should be top ranked.

Motivation and Background
Problems of SLCA:
• does not consider the semantics of the query and the XML data
• keyword ambiguity problem
• no relevance oriented ranking
Consequences:
• answers irrelevant to the user's search intention
• answers not meaningful and informative enough

Motivation and Background
TF*IDF (Term Frequency * Inverse Document Frequency)
• Rule 1: Inverse Document Frequency
• Rule 2: Term Frequency
• Rule 3: Normalization
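Motivation and Background
[Formula: classic TF*IDF similarity between a query q and a flat document d, summed over the keywords k they share]
Annotations:
• query, flat document, keyword
• normalization of document/term frequency
• number of documents
• occurrences of k in document d
• documents containing k
• weights of query q and document d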
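The three rules can be illustrated with a small sketch. It scores a query against a flat document using one common cosine-style TF*IDF variant. The exact weighting used here (ln(N / (f_k + 1)) on the query side, 1 + ln(f_dk) on the document side) and the names `tfidf_similarity` and `docs` are assumptions made for illustration; the slide's own formula is not preserved in this transcript.

```python
import math
from collections import Counter

def tfidf_similarity(query, doc, docs):
    """Cosine-style TF*IDF score of `doc` for `query` (both lists of terms).

    Assumed weights (not taken from the slide):
      query side:    ln(N / (f_k + 1))   -> Rule 1, inverse document frequency
      document side: 1 + ln(f_dk)        -> Rule 2, term frequency
    """
    n_docs = len(docs)            # N: number of documents
    tf = Counter(doc)             # f_dk: occurrences of k in document d

    def df(k):                    # f_k: number of documents containing k
        return sum(1 for d in docs if k in d)

    def w_query(k):
        return math.log(n_docs / (df(k) + 1))

    def w_doc(k):
        return 1 + math.log(tf[k])

    shared = set(query) & set(doc)
    dot = sum(w_query(k) * w_doc(k) for k in shared)

    # Rule 3: normalize by the lengths of the query and document weight vectors
    norm_q = math.sqrt(sum(w_query(k) ** 2 for k in set(query)))
    norm_d = math.sqrt(sum(w_doc(k) ** 2 for k in set(doc)))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical toy corpus
docs = [["art", "art", "music"], ["rock", "music"],
        ["street", "road"], ["road", "map"]]
scores = [tfidf_similarity(["art"], d, docs) for d in docs]
```

In this toy corpus only the first document contains "art", so it is the only one with a non-zero score.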
Inferring Keyword Search Intention
Talking about "art":
• Intuition: it is an element of an "interest" node, because many people are interested in art
• Statistics of the underlying database
Inferring Keyword Search Intention
Node type T is the search for node if:
1: T is intuitively related to every query keyword in q.
2: T is informative enough to contain enough relevant information.
3: T does not contain too much irrelevant information.
Formula annotations:
• number of T-typed nodes that contain k as either values or tag names in their subtrees
• keyword in query q
• reduction factor (range 0-1), normally chosen to be 0.8

Inferring Keyword Search Intention
Confidence of a node type T to be the desired search for node:
• number of T-typed nodes that contain k as either values or tag names in their subtrees
• keyword in query q
• reduction factor (range 0-1), normally chosen to be 0.8
Confidence of a node type T to be the desired search via node:
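The two confidence measures can be sketched as follows. The formulas themselves did not survive in the transcript, so the forms used here are assumptions built from the annotations: a product of per-keyword frequencies damped by a depth-based reduction factor for the search for node, and a sum of per-keyword frequencies for the search via node. The function names and the sample frequencies are hypothetical.

```python
import math

def confidence_for(freqs, depth, r=0.8):
    """Confidence of node type T to be the search for node.

    Assumed form: ln(1 + product of f_k^T over query keywords) * r ** depth(T)
    freqs: keyword -> number of T-typed nodes containing that keyword
           (as a value or a tag name) in their subtrees
    depth: depth of node type T in the document tree
    r:     reduction factor in (0, 1], 0.8 as on the slide
    """
    product = math.prod(freqs.values())
    return math.log(1 + product) * r ** depth

def confidence_via(freqs):
    """Confidence of node type T to be a search via node.

    Assumed form: ln(1 + sum of f_k^T over query keywords)
    """
    return math.log(1 + sum(freqs.values()))

# Hypothetical statistics for the query "customer interest art"
customer = confidence_for({"customer": 5, "interest": 5, "art": 3}, depth=2)
interest = confidence_for({"customer": 0, "interest": 8, "art": 3}, depth=3)
```

Because the frequencies are multiplied, a node type that misses any query keyword gets confidence 0, which reflects condition 1. The factor r ** depth lowers the confidence of deeply nested node types.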
Inferring Keyword Search Intention
Keyword query: Customer name rock interest art
• "art" should be searched for in interest and "rock" should be searched for in name
• the order of keywords in the query is important
Inferring Keyword Search Intention
Value Type Distance (Dist): max(Dist_q(q, v, k_t, k), Dist_s(q, v, k_t, k))
In-Query Distance (IQD, Dist_q): position distance between k_t and k in q, if k_t appears before k in the query
Structural Distance (Dist_s): depth distance between v and the nearest k_t-typed ancestor node of v
v: node
k: keyword that matches in v
k_t: keyword that matches the type of an ancestor node of v

Inferring Keyword Search Intention
Keyword query: Customer name rock interest art
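A minimal sketch of the three distances, under stated assumptions: the query is a list of lowercase keywords, the ancestors of v are given as a list of node types from the root down to v's parent, and a distance is treated as infinite when k_t does not precede k in the query or when no k_t-typed ancestor exists. The infinite fallback and all function names are assumptions of this sketch, not taken from the slide.

```python
import math

def in_query_distance(query, k_t, k):
    """Position distance between k_t and k in the query, if k_t comes before k."""
    if k_t not in query or k not in query:
        return math.inf
    i, j = query.index(k_t), query.index(k)
    return j - i if i < j else math.inf

def structural_distance(ancestor_types, k_t):
    """Depth distance between v and its nearest k_t-typed ancestor.

    ancestor_types: node types on the path from the root down to v's parent.
    """
    for steps, node_type in enumerate(reversed(ancestor_types), start=1):
        if node_type == k_t:
            return steps
    return math.inf

def value_type_distance(query, ancestor_types, k_t, k):
    """Dist = max(Dist_q, Dist_s)."""
    return max(in_query_distance(query, k_t, k),
               structural_distance(ancestor_types, k_t))

query = ["customer", "name", "rock", "interest", "art"]
# Hypothetical value node v holding "art" below customer/interests/interest
ancestors = ["customer", "interests", "interest"]
d_interest = value_type_distance(query, ancestors, "interest", "art")
d_customer = value_type_distance(query, ancestors, "customer", "art")
d_name = value_type_distance(query, ancestors, "name", "art")
```

Here "interest" is both adjacent to "art" in the query and the direct parent type of v, so it has the smallest distance, while "name" is not an ancestor type of v at all.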
Relevance Oriented Ranking
Ranking Principles
Searching for a customer via the street node with keyword query: Art Street
Principle 1: only search via nodes affect relevance

Relevance Oriented Ranking
Ranking Principles
Searching for customers interested in art using query: "art"
Principle 1: only search via nodes affect relevance
Principle 2: a search via node should contain the keyword

Relevance Oriented Ranking
Ranking Principles
Keyword query: Customer name rock interest art
Principle 1: only search via nodes affect relevance
Principle 2: a search via node should contain the keyword
Principle 3: the order of keywords in the query is important
Relevance Oriented Ranking
Capture XML's hierarchical structure to compute XML TF*IDF similarity
(a) a is a value node (base case)
(b) a is an internal node (recursive case)
Annotations: node, query, similarity value between q and a
First (base) case: similarities between a leaf node and the query
Recursive case: recursive similarities between an internal node n and the query

Relevance Oriented Ranking
Capture XML's hierarchical structure to compute XML TF*IDF similarity
(a) a is a value node (base case)
(b) a is an internal node (recursive case)
Annotations: node, query, similarity value between q and a
The base case is similar to classic TF*IDF: query, flat document, keyword

Relevance Oriented Ranking
Capture XML's hierarchical structure to compute XML TF*IDF similarity
(a) a is a value node (base case)
(b) a is an internal node (recursive case)
Annotations: node, query, similarity value between q and a
Annotations of the recursive case:
• confidence of T_c to be a search via node
• child node c of a
• similarity between c and q (computed recursively)
• overall weight of a for the given query q
Intuition: a is relevant if its children have high confidence to be a search via node and are relevant to q
Intuition: more relevant children increase the relevance of the node type
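The recursion can be sketched as follows, assuming the recursive case sums each child's similarity multiplied by the child type's search via confidence and divides by the overall weight of the node. The base-case score of a value node is passed in precomputed rather than derived from TF*IDF weights. The `Node` class, its fields, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    base_score: float = 0.0   # similarity of a value (leaf) node to the query
    weight: float = 1.0       # overall weight of the node for the given query
    children: list = field(default_factory=list)

def similarity(node, conf_via):
    """Similarity between `node` and a fixed query.

    conf_via: node type T_c -> confidence of T_c to be a search via node
    """
    if not node.children:                      # (a) base case: value node
        return node.base_score
    total = sum(similarity(c, conf_via) * conf_via.get(c.node_type, 0.0)
                for c in node.children)        # (b) recursive case
    return total / node.weight

# Hypothetical confidences and leaf scores for a query about "art"
conf_via = {"interest": 2.0, "name": 0.5}
c4 = Node("customer", children=[Node("name", 0.0), Node("interest", 0.9)])
c1 = Node("customer", children=[Node("name", 0.5), Node("interest", 0.0)])
ranked = sorted([("C1", c1), ("C4", c4)],
                key=lambda p: similarity(p[1], conf_via), reverse=True)
```

With these numbers the customer whose interest matches ranks above the customer whose name matches, because interest has the higher search via confidence.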
Algorithms
Parsing the input XML document
For each node n visited:
(1) Assign a Dewey ID to n
(2) Store the prefix path prefixPath of n in a hash table
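The parsing step can be sketched with Python's standard `xml.etree.ElementTree`. Numbering the root as "1" and children from 1 upward is one common Dewey convention and is an assumption here, as are the function name and the sample document.

```python
import xml.etree.ElementTree as ET

def assign_dewey_ids(xml_text):
    """Return a hash table {dewey_id: prefix_path} covering every element."""
    root = ET.fromstring(xml_text)
    prefix_paths = {}

    def visit(elem, dewey, parent_path):
        path = parent_path + "/" + elem.tag
        prefix_paths[dewey] = path            # (2) store the prefix path
        for i, child in enumerate(elem, start=1):
            visit(child, dewey + "." + str(i), path)   # (1) child Dewey ID

    visit(root, "1", "")
    return prefix_paths

# Hypothetical sample document
sample = ("<customers>"
          "<customer><name>Art Smith</name><interest>art</interest></customer>"
          "<customer><name>Rock</name></customer>"
          "</customers>")
paths = assign_dewey_ids(sample)
```

A Dewey ID encodes the path from the root, so one node is an ancestor of another exactly when its ID is a proper prefix at a component boundary, for example "1.1" and "1.1.2".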
Algorithms
Build 2 indices:
1. Keyword inverted list:
(1) Dup: DeweyID and XML TF*IDF (f_{a,k})
(2) DupType: Dup + node type (prefix path)
(3) DupTypeNorm: DupType + normalization factor W_a
"Node" tuple: <DeweyID, prefixPath, f_{a,k}, W_a>
2. Frequency table:
- stores the frequency of k in node type T

Algorithms
The Algorithm:
1. Input: keywords of the query, inverted list, frequency table
2. Identify the search intention and the search for node type
3. Rank by computing the XML TF*IDF similarity between n and the given query
4. Return the ranked list
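A minimal sketch of the two indices, under simplifying assumptions: f_{a,k} is taken to be the raw count of keyword k in the node's own text, the normalization factor W_a is left out, node types are identified by their prefix path, and the frequency table counts how many nodes of a type contain a keyword (as a value or a tag name) anywhere in their subtree. The function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indices(xml_text):
    """Return (inverted_list, frequency_table) for an XML document."""
    root = ET.fromstring(xml_text)
    inverted = defaultdict(list)   # keyword -> [(dewey_id, prefix_path, count)]
    freq = defaultdict(int)        # (node type, keyword) -> number of such nodes

    def visit(elem, dewey, parent_path):
        path = parent_path + "/" + elem.tag
        own = (elem.text or "").lower().split()
        for kw in set(own):
            inverted[kw].append((dewey, path, own.count(kw)))
        # terms seen in this subtree: own values, own tag name, descendants
        terms = set(own) | {elem.tag.lower()}
        for i, child in enumerate(elem, start=1):
            terms |= visit(child, dewey + "." + str(i), path)
        for kw in terms:
            freq[(path, kw)] += 1
        return terms

    visit(root, "1", "")
    return dict(inverted), dict(freq)

sample = ("<customers>"
          "<customer><name>Art Smith</name><interest>art</interest></customer>"
          "<customer><name>Rock</name><interest>art rock</interest></customer>"
          "</customers>")
inverted, freq = build_indices(sample)
```

In this sample both customer nodes contain "art" somewhere in their subtree, while only one contains "rock", which is the kind of statistic the confidence computation relies on.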
Experimental Evaluation
XReal vs. SLCA vs. XSeek
Aims of testing:
• Search effectiveness
• Ranking effectiveness
Datasets:
• real datasets (Washington XML Data Repository, DBLP)
• synthetic datasets (XMark benchmark)

Conclusion
• Identify the search intention and rank results with statistics
• Confidence level to be the search for/via node with XML TF*IDF
• XML TF*IDF similarity ranking scheme
• The approach tries to solve the ambiguity problem
• Prototype: XReal