
Internet and Industry Panel Discussion



Presentation Transcript


  1. Internet and Industry Panel Discussion James G. Shanahan Email: Jimi@clairvoyancecorp.com FLINT-CIBI 2003, December 2003

  2. How Much ‘Quality’ is Possible? Human Effort vs. Retrieval Performance. [Chart: performance (precision/recall) vs. human effort in seconds, on a log scale from 10^0 to 10^6 (1 minute, 10 minutes, 1 hour, 1 day, 1 week, 1 month); search engines sit at the low-effort end, text mining applications in the middle, and knowledge management systems at the high-effort end. Source: David E. Evans]

  3. Internet and Industry Panel Discussion • Anticipatory information systems • Document Souls • Blind feedback • Stunning cluster hypothesis • Feature engineering

  4. Internet and Industry Panel Discussion • Anticipatory information systems • Document Souls • Blind feedback • Stunning cluster hypothesis • Feature engineering

  5. Document Souls: a new paradigm for information access. James Shanahan* and Gregory Grefenstette* (*work performed at Xerox Research, France)

  6. Interesting juncture • High bandwidth, lots of sleeping computers, very cheap memory/disks • Niche browsers, very specialised information services • Recent studies have estimated the size of the hidden web at 500 billion pages, while the indexed web is about three billion pages (http://www.completeplanet.com/Tutorials/DeepWeb/index.asp) • Search engines have significant limitations: they are out of date, index only 1% of online pages, and generally cannot index documents behind authentication; context is ignored • Anticipatory services

  7. Xerox Document Soul. [Diagram: an old, dead document becomes an endowed document on the INTRANET: always connected, anticipating, adding value]

  8. Document Content Service Providers. [Architecture diagram: service providers offer packages of content services (company names, find products, stock chart, org chart, similar patents, URLs, job openings, press releases, ...); a Document Soul Coordinator attaches a personality (e.g., TechWatch) to the XML/DOM of a Xerox Document Soul endowed document]

  9. Giving personalities to documents… [Illustration: (1) the original document, “Xerox Spins Out ‘Gyricon Media Inc.' To Commercialize Electronic Reusable Paper -- ‘The Paper Of The Future’”; (2) a user click on an entity such as Xerox (here with a Lawyer personality); (3) the annotated document; (4) service propositions drawn from information sources, e.g.: “This is a company. I can point you to its R&D, publications, patents, patent statistics. I can track changes and alert you.” / “This is a technical category. I can point you to publications and patent information in this category. I can track changes and alert you.” / “I can point you to all Xerox patent files, publications and news in this technical category. I can track changes and alert you.”]

  10. [Screenshot: service propositions for the entity “Anne Mulcahy”: Google Search (Business, Society); Patent Search (EPO) (Business Process)]

  11. Document Souls System • A new paradigm of information access: the document gets a life, constantly anticipating your information needs • $6 million project • DS System (beta product) • ViviDocs

  12. Internet and Industry Panel Discussion • Anticipatory information systems • Document Souls • Blind feedback • Stunning cluster hypothesis • Feature engineering

  13. Blind feedback for query expansion • Given a query Q, use it to rank documents • Expand the original query Q with prominent terms (e.g., 30 terms) from the top N documents (e.g., N = 6) • Rerank documents using the expanded query • [Evans 1993; Buckley 1994]
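A minimal sketch of that loop, assuming a generic bag-of-words TF-IDF ranker; the scoring function and all names below are illustrative, not the CLARIT implementation:

    import math
    from collections import Counter

    def tfidf_score(terms, doc, df, n_docs):
        # Simple TF-IDF match score; doc is a list of tokens,
        # df maps term -> document frequency in the corpus.
        tf = Counter(doc)
        return sum(tf[t] * math.log(n_docs / (1 + df.get(t, 0))) for t in terms)

    def blind_feedback_rank(query, docs, df, n_top=6, n_terms=30):
        # 1. Rank documents with the original query.
        n_docs = len(docs)
        ranked = sorted(docs, key=lambda d: tfidf_score(query, d, df, n_docs),
                        reverse=True)
        # 2. Expand the query with prominent terms from the top-N documents.
        pool = Counter(t for d in ranked[:n_top] for t in d)
        expanded = set(query) | {t for t, _ in pool.most_common(n_terms)}
        # 3. Rerank all documents using the expanded query.
        return sorted(docs, key=lambda d: tfidf_score(expanded, d, df, n_docs),
                      reverse=True)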

  14. Cluster Hypothesis • We explored the cluster hypothesis in the context of relevance feedback • The cluster hypothesis states that closely associated documents tend to be relevant to the same requests [Van Rijsbergen, 1979] • Supervised feedback, via cluster-based clarification forms • Automatic feedback, based upon clusters of documents, as an alternative to the commonly used blind feedback over the top N ranked documents (see the sketch below)
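A sketch of the automatic variant: cluster the top-ranked documents and feed one cluster, rather than the flat top N, into expansion. The use of k-means, the cluster count, and the query-similarity heuristic for choosing a cluster are assumptions for illustration; a user (via a clarification form) or an oracle could do the choosing instead:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def feedback_cluster(top_docs, query, n_clusters=4):
        # top_docs: raw text of, e.g., the top 50 documents for the query.
        vec = TfidfVectorizer(stop_words="english")
        x = vec.fit_transform(top_docs)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(x)
        # Heuristic: pick the cluster whose centroid is most similar to the query.
        q = vec.transform([query]).toarray().ravel()
        best = (km.cluster_centers_ @ q).argmax()
        # Return that cluster's documents as the feedback set for expansion.
        return [d for d, c in zip(top_docs, km.labels_) if c == best]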

  15. Illustration of Clustering Effect. [Diagram: a query Q retrieves 20 documents (Doc1 … Doc20), of which 7/20 are relevant. Clustering them yields {Doc1, Doc4, Doc5, Doc7, Doc8, Doc10, Doc14, Doc18} with 6/8 relevant, plus {Doc2, Doc3, Doc6, Doc19}, {Doc9, Doc11, Doc12, Doc16, Doc17}, and {Doc13, Doc15, Doc20}]

  16. Can a Cluster Out-Perform Top-N? [Diagram: for a query Q over Doc1 … Doc50: (1) the baseline result (no BF); (2) the BF result using the top-N documents (Top-20, 85); (3) clustering of the top 50, producing a BF result per cluster and a max BF result]

  17. CLARIT Indexing and Retrieval • Indexes based on sub-documents • Vector space model, with term weights built from IDF(t), the inverse document frequency; TF(t), the term frequency; and C(t), an “importance coefficient”
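A minimal sketch of such a weighting over sub-document vectors, assuming the three factors combine as the product C(t) · TF(t) · IDF(t); the product form and the function names are assumptions, not the CLARIT formula itself:

    import math

    def term_weight(t, tf, df, n_docs, c):
        # Assumed composition: C(t) * TF(t) * IDF(t).
        idf = math.log(n_docs / df[t]) if df.get(t) else 0.0
        return c.get(t, 1.0) * tf.get(t, 0) * idf

    def score(query_vec, subdoc_vec):
        # Vector-space match between a query and a sub-document vector.
        return sum(w * subdoc_vec.get(t, 0.0) for t, w in query_vec.items())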

  18. Query Expansion • Rank terms using Prob2 weighting and select the top k terms, where: N = the number of documents in the target corpus; Nt = the number of documents in the corpus that contain term t; R = the number of documents for feedback that are (presumed to be) relevant to the topic; Rt = the number of documents that are (presumed to be) relevant to the topic and contain term t • Coefficients of the expanded query terms: terms in both the expanded set and the original query = 1.5; terms that occur in the query only = 1.0; terms that occur in the expanded set only = 0.5
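The Prob2 formula itself is not reproduced here, so the sketch below substitutes the Robertson–Sparck Jones relevance weight, a standard ranking function over the same four statistics, and then applies the slide's 1.5/1.0/0.5 coefficients; the stand-in weight and the function names are assumptions:

    import math

    def rsj_weight(n, nt, r, rt):
        # Robertson-Sparck Jones relevance weight; a stand-in for Prob2.
        return math.log(((rt + 0.5) * (n - nt - r + rt + 0.5)) /
                        ((nt - rt + 0.5) * (r - rt + 0.5)))

    def expand_query(query_terms, stats, n, r, k=30):
        # stats: {term: (nt, rt)} for candidate terms from the feedback docs.
        ranked = sorted(stats,
                        key=lambda t: rsj_weight(n, stats[t][0], r, stats[t][1]),
                        reverse=True)
        expansion = set(ranked[:k])
        query = set(query_terms)
        # Coefficients from the slide: 1.5 / 1.0 / 0.5.
        return {t: (1.5 if t in expansion and t in query else
                    1.0 if t in query else 0.5)
                for t in expansion | query}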

  19. Per-Topic Performance, Sorted by Decreasing Performance of Top-N BF. [Chart] Max cluster BF is typically much better than top-N: 0.2848 vs. 0.2390 vs. 0.2050. NRRC [Evans et al. 2003]: TREC Disks 4 and 5; Financial Times, LA Times, etc.

  20. Clarification Form 1 - Titles and Terms

  21. [Chart: TREC median & max vs. CC new baseline & optimal cluster run]

  22. TREC 2003 HARD Track • Corpus: 2 GB of newswire and Federal Register documents [Shanahan et al., 2003]

  23. Conclusions • Exploiting the cluster hypothesis in a manual setting boosts performance by 20% over blind feedback • Stunning cluster hypothesis: relevant documents tend to cluster; selecting the best cluster(s) (in an ideal setting, using an oracle) and expanding the query with them likewise boosts performance by 20% over blind feedback • Ongoing work is investigating techniques to automatically select the optimal cluster(s) (the oracle step is sketched below)
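The oracle step can be made concrete: with TREC-style relevance judgments in hand, pick the cluster with the best precision. A sketch, with illustrative names; the reliance on judgments is exactly what makes this an oracle rather than a deployable method:

    def oracle_best_cluster(clusters, relevant_ids):
        # clusters: list of lists of document ids;
        # relevant_ids: set of judged-relevant document ids.
        def precision(cluster):
            return sum(d in relevant_ids for d in cluster) / len(cluster)
        return max(clusters, key=precision)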

  24. Internet and Industry Panel Discussion • Anticipatory information systems • Document Souls • Blind feedback • Stunning cluster hypothesis • Feature engineering
