
Overview of Information Retrieval and our Solutions

Presentation Transcript


  1. Overview of Information Retrieval and our Solutions. Qiang Yang, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong

  2. Why Need Information Retrieval (IR)? • More and more online information in general (information overload) • Many tasks rely on effective management and exploitation of information • Textual information plays an important role in our lives • Effective text management directly improves productivity

  3. What is IR? • Narrow-sense: • IR = search engine technologies (Google/Yahoo!/Live Search) • IR = text matching/classification • Broad-sense: IR = text information management: • How to find useful information? (info. retrieval) (e.g., Yahoo!) • How to organize information? (text classification) (e.g., automatically assign email to different folders) • How to discover knowledge from text? (text mining) (e.g., discover correlations of events)

  4. Difficulties • Huge amount of online data • Yahoo! has nearly 20 billion pages in its index (as of the beginning of 2005) • Different types of data • Web pages, emails, blogs, chat-room messages • Ambiguous queries • Short: 2-4 words • Ambiguous: apple; bank…

  5. Our Solutions • Query Classification • Champion of KDDCUP’05; TOIS (Vol. 24); SIGIR’06; KDD Exploration (Vol. 7) • Query Expansion/Suggestion • Submissions to: SIGIR’07; AAAI’07; KDD’07 • Entity Resolution • Submission to SIGIR’07 • Web page Classification/Clustering • SIGIR’04; CIKM’04; ICDM’04; ICDE’06; WWW’06; IPM (2007), DMKD (Vol. 12) • Document Summarization • SIGIR’05; IJCAI’07 • Analysis of Blogs, Emails, Chatting-room messages • SIGIR’06; ICDM’06 (2); IJCAI’07

  6. Outline • Query Classification (QC) • Introduction • Solution 1: Query/category enrichment; • Solution 2: Bridging classifiers; • Entity Resolution • Summary of Other works

  7. Query Classification

  8. Introduction • Web queries are difficult to manage: • short; ambiguous; evolving • Query Classification (QC) can help understand queries better • Vertical search • Re-ranking of search results • Online advertisements • Difficulties of QC (different from text classification) • How to represent queries • Target taxonomy is dynamic, e.g. an online-ads taxonomy • Training data is difficult to collect

  9. Problem Definition Inspired by the KDDCUP’05 competition • Classify a query into a ranked list of categories • Queries are collected from real search engines • Target categories are organized in a tree with each node being a category

  10. Related Work • Document Classification • Feature selection [Yang et al. 1997] • Feature generation [Cai et al. 2003] • Classification algorithms • Naïve Bayes [McCallum and Nigam 1998] • KNN [Yang 1999] • SVM [Joachims 1999] • …… An overall survey appears in [Sebastiani 2002]

  11. Related Work • Query Classification/Clustering • Classify Web queries by geographical locality [Gravano 2003] • Classify queries according to their functional types [Kang 2003] • Beitzel et al. studied topical classification as we do, but with manually classified data [Beitzel 2005] • Beeferman and Wen worked on query clustering using clickthrough data [Beeferman 2000; Wen 2001]

  12. Related Work • Document/Query Expansion • Borrow text from extra data source • Using hyperlink [Glover 2002]; • Using implicit links from query log [Shen 2006]; • Using existing taxonomies [Gabrilovich 2005]; • Query expansion [Manning 2007] • Global methods: independent of the queries • Local methods using relevance feedback or pseudo-relevance feedback

  13. Solutions • Solution 1: Query/Category Enrichment • Solution 2: Bridging Classifier

  14. Solution 1: Query/Category Enrichment • Assumptions & Architecture • Query Enrichment • Classifiers • Synonym-based classifiers • Statistical classifiers • Experiments

  15. Assumptions & Architecture • The intended meanings of Web queries should be reflected by the Web; • A set of objects exist that cover the target categories.

  16. Query Enrichment • Textual information of the returned Web pages: title, snippet, full text • Category information of the returned Web pages (a minimal enrichment sketch follows)

  17. Category Mapping (figure: synonym-based classifiers map categories of the intermediate taxonomy to the target categories)

  18. Synonym-based Classifiers • Map by word matching • Direct matching: high precision, low recall • Extended matching via WordNet, e.g. "Hardware" → "Hardware; Device; Equipment" (a minimal matching sketch follows)

  19. Statistical Classifiers: SVM • Apply the synonym-based classifiers to map Web pages from the intermediate taxonomy to the target taxonomy • Obtain <page, target category> pairs as training data • Train SVM classifiers for the target categories (a minimal training sketch follows)

  20. Statistical Classifiers: SVM (Cont.) • In the figure, circles and triangles denote crawled pages • The black ones are mapped to the two target categories successfully; the white ones fail to map • A query represented only by the white pages cannot be classified correctly by the synonym-based method, but the SVM can • Advantage: recall can be higher • Disadvantages: precision may suffer; once the target taxonomy changes, the classifiers must be retrained

  21. Putting Them Together: Ensemble of Classifiers • Why ensemble? • The two kinds of classifiers are based on different mechanisms • They can be complementary to each other • A proper combination can improve performance • Combination strategies • EV (uses validation data) • EN (no validation data) • (a generic combination sketch follows)

  22. Experiment--Data Sets & Eval. Criteria • Queries: from KDDCUP 2005 • 800,000 queries • 800 labeled, by three labelers • Evaluation: precision, recall and F1 measured against each labeler (formulas below)

  23. Experiment: Quality of the Data Sets • Consistency between labelers (table: performance of each labeler measured against the other labelers) • (table: distribution of the labels assigned by the three labelers)

  24. Experiment Results--Direct vs. Extended Matching • Number of pages collected for training using the different mapping methods • F1 of the synonym-based classifier and the SVM

  25. Experiment Results--The number of assigned labels

  26. Experiment Results-- Effect of Base Classifiers

  27. Solutions • Solution 1: Query/Category Enrichment • Solution 2: Bridging Classifier

  28. Solution 2: Bridging Classifiers • Our Algorithm • Bridging Classifier • Category Selection • Experiments • Data Set and Evaluation Criteria • Results and Analysis

  29. Algorithm--Bridging Classifier • Problem with Solution 1: • the classifiers are tied to a fixed target taxonomy, so training must be repeated whenever the taxonomy changes • Goal: • Connect the target taxonomy and the queries by taking an intermediate taxonomy as a bridge

  30. Algorithm--Bridging Classifier (Cont.) • How to connect? Combine three quantities: • the relation between the target category C_i^T and the intermediate category C_j^I • the relation between the query q and C_j^I • the prior probability of C_j^I • (the combination formula is reconstructed below)

  31. Algorithm--Bridging Classifier (Cont.) • Understanding the bridging classifier • Given C_i^T and q: • p(C_i^T | C_j^I) and p(q | C_j^I) are fixed • p(C_j^I), which reflects the size of C_j^I, acts as a weighting factor • The product tends to be larger when C_i^T and q tend to belong to the same smaller intermediate categories

  32. Algorithm--Category Selection • Category selection for reducing complexity • Total Probability (TP) • Mutual Information (MI) (a generic MI-scoring sketch follows)

  33. Experiment--Data Sets and Eval. Criteria • Intermediate taxonomy • ODP: 1.5M Web pages in 172,565 categories • (tables: number of categories on different levels; statistics of the number of documents in the categories on different levels)

  34. Experiment--Result of Bridging Classifiers • All intermediate categories are used • Snippet only • Best result when n = 60 • Improvements of 10.4% and 7.1% in precision and F1, respectively, compared with the two previous approaches

  35. Experiment--Result of Bridging Classifiers • (table: performance of the bridging classifier with different granularities of the intermediate taxonomy) • Best results when using all intermediate categories • Reason: a category with larger granularity may be a mixture of several target categories, so it cannot be used to distinguish different target categories

  36. Experiment--Effect of Category Selection • When the number of categories is around 18,000, the bridging classifier is comparable to, if not better than, the previous approaches • MI works better than TP • It favors categories that are more discriminative for the target categories

  37. Entity Resolution

  38. Definition: Reference & Entity • Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006 • Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006 • (figure: name references and venue references in these citations map to author entities and journal/conference entities)

  39. Current Author Search • DBLP • CiteSeer • Google • All of them return a MIXED list of references

  40. Graphical Model • We convert entity resolution into a graph partition problem • Each node denotes a reference • Each edge denotes the relation between two references • (a minimal partitioning sketch follows)

  41. How to Measure the Reference Relation • Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006 • Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005 • (figure: the two references are related through plaintext similarity, shared authors and coauthors, research community, and research area)

  42. Features • F1: Title Similarity • F2: Coauthor Similarity • F3: Venue Similarity • F4: Research Community Overlap • F5: Research Area Overlap

  43. Research Community Overlap • A1, A2 stand for two author name references • F4.1: Similarity(A1, A2) = Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2)) • F4.2: Similarity(A1, A2) = Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2)) • Coauthors(X) returns the coauthor name set of each author in set X • Venues(Y) returns the venue name set of each author in set Y

  44. Research Area Overlap • V1, V2 stand for two venue references • F5.1: Similarity(V1, V2) = Authors(Articles(V1)) ∩ Authors(Articles(V2)) • F5.2: Similarity(V1, V2) = Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2))) • Authors(X) returns the author name set of each article in set X • Articles(Y) returns the set of articles holding a reference to each element in set Y • (a minimal overlap-feature sketch follows)

  45. System Framework (figure: pairwise feature similarities are converted into match probabilities)

  46. Experiment Results • Our dataset: 1,000 references to 20 author entities from DBLP • Getoor's datasets: CiteSeer (2,892 author references to 1,165 author entities) and arXiv (58,515 references to 9,200 author entities) • F1 = 97.0%

  47. Summary of Other Work

  48. Summary of Other Work • Summarization using Conditional Random Fields (IJCAI ’07) • Thread Detection in Dynamic Text Message Streams (SIGIR ’06) • Implicit Links for Web Page Classification (WWW ’06) • Text Classification Improved by Multigram Models (CIKM ’06) • Latent Friend Mining from Blog Data (ICDM ’06) • Web-page Classification through Summarization (SIGIR ’04)

  49. Summarization using Conditional Random Fields (IJCAI '07) • Motivation • Observation • Summarization as sequence labeling over the sentences of a document • Solution: CRF with feature functions and parameters (the standard linear-chain form is recalled below) • (figure: the labeling of sentences 1-6 proceeds in three steps)

  50. Thread Detection in Dynamic Text Message Streams (SIGIR '06) • Representation • Content-based features • Structure-based features: sentence type; personal pronouns • Clustering of messages into threads (a minimal sketch follows)
