
Cheshire II at INEX 2003: Component and Algorithm Fusion


Presentation Transcript


  1. Cheshire II at INEX 2003: Component and Algorithm Fusion. Ray R. Larson, School of Information Management and Systems, University of California, Berkeley. INEX 2003 -- Ray R. Larson

  2. Overview
     • Cheshire II feature overview
     • Logistic Regression Ranking and Boolean Operations
     • Additions from INEX ’02
     • XML Schemas and Element Retrieval
     • CORI, Okapi BM-25 ranking algorithms
     • Result Set sorting, merging and ranking operators
     • Evaluation Results

  3. Overview of Cheshire II
     • Supports SGML and XML with components and component indexes
     • Is a client/server application
     • Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP, and SDLIP is also implemented
     • Server supports a Relational Database Gateway
     • Supports Boolean searching of all servers
     • Supports probabilistic ranked retrieval in the Cheshire search engine, as well as Boolean and proximity search
     • Search engine supports “nearest neighbor” searches and relevance feedback
     • GUI interface on X Window displays and Windows NT
     • WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
     • Scriptable clients using Tcl and (new) Python
     • Stores SGML/XML as files or in a “Datastore” database

  4. XML Element Extraction
     • A new search “ElementSetName” is XML_ELEMENT_
     • Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
     • The matching elements are extracted from the records matching the search and delivered in a simple format

  5. XML Extraction
     % zselect sherlock
     372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
     % zfind topic mathematics
     {OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
     % zset recsyntax XML
     % zset elementset XML_ELEMENT_Fld245
     % zdisplay
     {OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {
     <RESULT_DATA DOCID="1">
     <ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
     <Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>
     </ITEM>
     </RESULT_DATA>
     … etc…
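
A client receiving the extracted elements still has to pull the XPath and text out of each RESULT_DATA wrapper. The following is a minimal Python sketch of that step, assuming each result arrives as a well-formed RESULT_DATA fragment like the one shown in the session above. The function name `extract_items` is hypothetical and is not part of any Cheshire client library.

```python
import xml.etree.ElementTree as ET

# Sample fragment copied from the session transcript (assumed well-formed).
SAMPLE = (
    '<RESULT_DATA DOCID="1">'
    '<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">'
    '<Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>'
    '</ITEM>'
    '</RESULT_DATA>'
)


def extract_items(result_xml):
    """Return (docid, xpath, text) tuples for every ITEM in one RESULT_DATA."""
    root = ET.fromstring(result_xml)
    docid = root.get("DOCID")
    items = []
    for item in root.findall("ITEM"):
        # Concatenate all text nodes beneath the extracted element.
        text = "".join(item.itertext()).strip()
        items.append((docid, item.get("XPATH"), text))
    return items


for docid, xpath, text in extract_items(SAMPLE):
    print(docid, xpath, text)
```

Keeping the XPath next to the text lets a caller map each extracted element back to its position in the source record.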

  6. Boolean Search Capability
     • All Boolean operations are supported
       • “zfind author x and (title y or subject z) not subject A”
     • Named sets are supported and stored on the server
     • Boolean operations between stored sets are supported
       • “zfind SET1 and subject widgets or SET2”
     • Nested parentheses and truncation are supported
       • “zfind xtitle Alice#”

  7. Probabilistic Retrieval
     • Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time
     • The Z39.50 “relevance” operator is used to indicate a probabilistic search
     • Any index can have probabilistic searching performed:
       • zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
       • zfind title @ caucus races
     • Boolean and probabilistic elements can be combined:
       • zfind topic @ government documents and title guidebooks

  8. Probabilistic Retrieval: Logistic Regression
     • Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients (TREC)
     • At retrieval, the probability estimate is obtained from the fitted coefficients and the 6 X attribute measures shown on the next slide
     • Note that we did NOT retrain the coefficients this year
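
For reference, the general logistic regression form that such an estimate takes can be written as below. The symbols are assumed notation rather than the slide's own: R is relevance, Q the query, C the component, c_0 through c_6 the fitted coefficients, and X_1 through X_6 the six attribute measures.

```latex
\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i
\qquad
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}
```

Components are then ranked by P(R | Q, C). Because the logistic function is monotonic, ranking by the log odds gives the same order.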

  9. Probabilistic Retrieval: Logistic Regression attributes
     • Average Absolute Query Frequency
     • Query Length
     • Average Absolute Component Frequency
     • Document Length
     • Average Inverse Component Frequency
     • Inverse Component Frequency
     • Number of Terms in common between query and Component -- logged

  10. Combining Boolean and Probabilistic Search Elements
      • Two original approaches:
        • Boolean Approach
        • Non-probabilistic “Fusion Search”: the set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
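
One way to picture a weighted merger of this kind is sketched below in Python. This illustrates the general idea only and is not Cheshire's implementation: the function name `fusion_merge`, the equal default weights, and the choice to give every Boolean match a score of 1.0 are all assumptions.

```python
def fusion_merge(prob_scores, boolean_hits, prob_weight=0.5, bool_weight=0.5):
    """Weighted merger of a ranked result set and a Boolean result set.

    prob_scores: dict mapping document id -> probabilistic score.
    boolean_hits: set of document ids matched by the Boolean query.
    Assumption: a Boolean match counts as 1.0 and a non-match as 0.0.
    """
    merged = {}
    for doc in set(prob_scores) | set(boolean_hits):
        p = prob_scores.get(doc, 0.0)
        b = 1.0 if doc in boolean_hits else 0.0
        merged[doc] = prob_weight * p + bool_weight * b
    # Highest merged score first; ties broken by document id for stability.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = fusion_merge({"d1": 0.8, "d2": 0.4}, {"d2", "d3"})
print(ranked)
```

With these inputs, a document found by both queries ranks above a document found by only one, which is the behaviour a fusion of this sort is meant to produce.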

  11. Ranking Methods added since INEX ’02
      • CORI -- From Jamie Callan: a simple implementation of a weighting scheme for distributed search. Very effective for collection selection in distributed search. Not used for official INEX runs.
      • OKAPI BM-25 -- From Steve Robertson. This now seems to be the “default” retrieval algorithm in experimental IR
      • New operators (later) let us mix and match ranking methods and Boolean operations

  12. Okapi BM25
      • Where:
        • Q is a query containing terms T
        • K is k1((1-b) + b·dl/avdl)
        • k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
        • tf is the frequency of the term in a specific document
        • qtf is the frequency of the term in a topic from which Q was derived
        • dl and avdl are the document length and the average document length measured in some convenient unit
        • w(1) is the Robertson-Sparck Jones weight
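
These definitions correspond to the standard Okapi BM25 sum over query terms, w(1) · ((k1+1)·tf / (K+tf)) · ((k3+1)·qtf / (k3+qtf)). The Python sketch below implements that standard form under stated assumptions: with no relevance information, the Robertson-Sparck Jones weight reduces to log((N - n + 0.5) / (n + 0.5)), where N is the collection size and n the number of documents containing the term. The function names and the k3 default of 7 are illustrative choices, not taken from Cheshire.

```python
import math


def rsj_weight(n_docs, doc_freq):
    """Robertson-Sparck Jones weight without relevance information (assumed form)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, n_docs, doc_freqs,
               k1=1.2, b=0.75, k3=7.0):
    """Score one document against a query with the standard BM25 sum.

    query_tf: dict term -> qtf (frequency of the term in the topic).
    doc_tf: dict term -> tf (frequency of the term in the document).
    doc_freqs: dict term -> number of documents containing the term.
    """
    # Length normalization factor K, as defined on the slide.
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(n_docs, doc_freqs[term])
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


print(bm25_score({"traffic": 1}, {"traffic": 1}, 100, 100, 100, {"traffic": 10}))
```

When a document has exactly average length and both tf and qtf equal 1, the two saturation factors equal 1 and the score reduces to the term weight w(1), which makes a convenient sanity check.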

  13. INEX ’02 Fusion Search (diagram labels: Final Ranked List, Sort/Merge, Query Results)
      • Merge multiple resultsets and sort new set
      • Sort by index name/key (ATTRIBUTE)
      • Sort by rank (ELEMENTS)
      • Merges ranked results and Boolean results
      • Sort by XML/SGML Tag contents (TAG)

  14. Merging and Ranking Operators
      • Extends the capabilities of merging to include merger operations in queries, like Boolean operators
      • Fuzzy Logic Operators (not used for INEX)
        • !FUZZY_AND
        • !FUZZY_OR
        • !FUZZY_NOT
      • Containment operators: restrict components to or with a particular parent
        • !RESTRICT_FROM
        • !RESTRICT_TO
      • Merge Operators
        • !MERGE_SUM
        • !MERGE_MEAN
        • !MERGE_NORM
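
The slide names the operators without defining them, so the Python sketch below shows only one plausible reading. The fuzzy operators use the classic min/max definitions from fuzzy logic, and the merge operators combine per-document scores by sum, by mean, or by mean after min-max normalization. All of these semantics, and every function name, are assumptions made for illustration; Cheshire's actual operators may differ.

```python
def _combine(a, b, fn, union=True):
    """Apply fn to the scores of each document; missing scores count as 0.0."""
    keys = (set(a) | set(b)) if union else (set(a) & set(b))
    return {k: fn(a.get(k, 0.0), b.get(k, 0.0)) for k in keys}


def fuzzy_and(a, b):
    # Classic fuzzy AND: minimum score, only for documents in both sets.
    return _combine(a, b, min, union=False)


def fuzzy_or(a, b):
    # Classic fuzzy OR: maximum score over the union of both sets.
    return _combine(a, b, max)


def merge_sum(a, b):
    return _combine(a, b, lambda x, y: x + y)


def merge_mean(a, b):
    return _combine(a, b, lambda x, y: (x + y) / 2.0)


def normalize(scores):
    """Min-max normalize scores into [0, 1]; a constant set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def merge_norm(a, b):
    # Assumed reading of MERGE_NORM: normalize each set, then take the mean.
    return merge_mean(normalize(a), normalize(b))


a = {"d1": 0.9, "d2": 0.3}
b = {"d2": 0.6, "d3": 0.2}
print(fuzzy_and(a, b), merge_norm(a, b))
```

Normalizing before merging matters when the two result sets come from different ranking algorithms whose raw scores are not on a comparable scale.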

  15. Query Generation - CO
      • #91 TITLE = Internet traffic
      • (topicshort @+ {Internet traffic internet, web, traffic, measurement, congestion}) !MERGE_NORM (alltitles @+ {Internet traffic}) !MERGE_NORM (kwd @+ {Internet traffic}) !MERGE_NORM (topicshort @ {Internet traffic internet, web, traffic, measurement, congestion}) !MERGE_NORM (alltitles @ {Internet traffic}) !MERGE_NORM (kwd @ {Internet traffic})
      • TARGETPATH = XML_ELEMENT_article
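
The generated query follows a regular pattern: three indexes, each searched with two ranking operators, joined by !MERGE_NORM. The Python sketch below reproduces that string pattern from a topic title and keyword list. It is an illustration only: the function name `build_co_query` is hypothetical, and how the official runs actually derived the keyword text is not stated in the slides.

```python
def build_co_query(title, keywords,
                   indexes=("topicshort", "alltitles", "kwd"),
                   operators=("@+", "@")):
    """Build a CO query string in the pattern shown for topic #91.

    Assumption: topicshort receives the title followed by the
    comma-separated keywords, while the other indexes receive the title only.
    """
    keyword_text = ", ".join(keywords)
    clauses = []
    for op in operators:
        for index in indexes:
            text = f"{title} {keyword_text}" if index == "topicshort" else title
            clauses.append(f"({index} {op} {{{text}}})")
    return " !MERGE_NORM ".join(clauses)


query = build_co_query(
    "Internet traffic",
    ["internet", "web", "traffic", "measurement", "congestion"],
)
print(query)
```

Called with the title and keywords of topic #91, the function yields the same query string as the one listed above.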

  16. INEX CO Runs

  17. Query Generation - SCAS
      • #66 TITLE = /article[./fm//yr < '2000'] //sec[about(.,'"search engines"')]
      • ((date < '2000')) !RESTRICT_FROM ((sec_words @ {"search engines"} !MERGE_MEAN (sec_words {$search engines$})))
      • TARGETPATH = XML_ELEMENT_sec

  18. Query Generation -- SCAS
      • This run uses Logistic regression matching combined with Boolean phrase matching and MERGE_MEAN partial result combinations
      • FUZZY_AND and FUZZY_OR operators were used in combining AND and OR elements within an "about" predicate
      • Containment operators were used to constrain component searches within ancestor elements

  19. INEX SCAS Runs

  20. Future Plans
      • Bug fixes -- incorrect query generation for some SCAS queries, for example…
        • TITLE = //article[about(.,'security +biometrics') AND about(.//sec,'"facial recognition"')]
        • Submitted: (topicshort @ {security biometrics} !MERGE_MEAN (topicshort @ {biometrics biometrics biometrics biometrics}) ) !FUZZY_AND (sec_title @ {"facial recognition"} !MERGE_MEAN (sec_title {$facial recognition$}))
        • Should have included sec_words and a Boolean subquery for biometrics merged with the ranked subquery

  21. Future Plans
      • Add Language Model ranking for components
      • Retrain Logistic Regression coefficients on INEX assessment data -- and experiment with including new variables, such as relative component size
      • Find bugs in Okapi BM-25
      • Find more bugs ahead of time, and be more consistent in runs!
