
Overview of Information Retrieval and our Solutions

Presentation Transcript


  1. Overview of Information Retrieval and our Solutions. Qiang Yang, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong

  2. Why Need Information Retrieval (IR)? • More and more online information in general (information overload) • Many tasks rely on effective management and exploitation of information • Textual information plays an important role in our lives • Effective text management directly improves productivity

  3. What is IR? • Narrow-sense: • IR = search engine technologies (Google/Yahoo!/Live Search) • IR = text matching/classification • Broad-sense: IR = text information management: • How to find useful information? (info. retrieval) (e.g., Yahoo!) • How to organize information? (text classification) (e.g., automatically assign email to different folders) • How to discover knowledge from text? (text mining) (e.g., discover correlations of events)

  4. Difficulties • Huge amount of online data • Yahoo! has nearly 20 billion pages in its index (as of the beginning of 2005) • Different types of data • Web pages, emails, blogs, chat-room messages • Ambiguous queries • Short: 2-4 words • Ambiguous: apple; bank…

  5. Our Solutions • Query Classification • Champion of KDDCUP’05; TOIS (Vol. 24); SIGIR’06; KDD Exploration (Vol. 7) • Query Expansion/Suggestion • Submissions to: SIGIR’07; AAAI’07; KDD’07 • Entity Resolution • Submission to SIGIR’07 • Web page Classification/Clustering • SIGIR’04; CIKM’04; ICDM’04; ICDE’06; WWW’06; IPM (2007), DMKD (Vol. 12) • Document Summarization • SIGIR’05; IJCAI’07 • Analysis of Blogs, Emails, Chatting-room messages • SIGIR’06; ICDM’06 (2); IJCAI’07

  6. Outline • Query Classification (QC) • Introduction • Solution 1: Query/category enrichment; • Solution 2: Bridging classifiers; • Entity Resolution • Summary of Other works

  7. Query Classification

  8. Introduction • Web queries are difficult to manage: • short; ambiguous; evolving • Query Classification (QC) can help understand queries better • Vertical search • Re-ranking of search results • Online advertisements • Difficulties of QC (different from text classification) • How to represent queries • Target taxonomy is dynamic, e.g. an online-ads taxonomy • Training data is difficult to collect

  9. Problem Definition Inspired by the KDDCUP’05 competition • Classify a query into a ranked list of categories • Queries are collected from real search engines • Target categories are organized in a tree with each node being a category

  10. Related Work • Document Classification • Feature selection [Yang et al. 1997] • Feature generation [Cai et al. 2003] • Classification algorithms • Naïve Bayes [McCallum and Nigam 1998] • KNN [Yang 1999] • SVM [Joachims 1999] • …… An overall survey appears in [Sebastiani 2002]

  11. Related Work • Query Classification/Clustering • Classify Web queries by geographical locality [Gravano 2003] • Classify queries according to their functional types [Kang 2003] • Beitzel et al. studied topical classification as we do, but with manually classified data [Beitzel 2005] • Beeferman and Wen worked on query clustering using clickthrough data [Beeferman 2000; Wen 2001]

  12. Related Work • Document/Query Expansion • Borrow text from extra data source • Using hyperlink [Glover 2002]; • Using implicit links from query log [Shen 2006]; • Using existing taxonomies [Gabrilovich 2005]; • Query expansion [Manning 2007] • Global methods: independent of the queries • Local methods using relevance feedback or pseudo-relevance feedback

  13. Solutions • Solution 1: Query/Category Enrichment • Solution 2: Bridging Classifier

  14. Solution 1: Query/Category Enrichment • Assumptions & Architecture • Query Enrichment • Classifiers • Synonym-based classifiers • Statistical classifiers • Experiments

  15. Assumptions & Architecture • The intended meanings of Web queries should be reflected by the Web; • A set of objects exist that cover the target categories.

  16. Query Enrichment • Textual information of the returned Web pages: title, snippet, full text • Category information of the returned Web pages (a minimal enrichment sketch follows)

  17. Category Mapping (figure: synonym-based classifiers map categories of the intermediate taxonomy to the target categories)

  18. Synonym-based Classifiers • Map by word matching • Direct matching: high precision, low recall • Extended matching via WordNet, e.g. "Hardware" → "Hardware; Device; Equipment" (a minimal matching sketch follows)

  19. Statistical Classifiers: SVM • Apply the synonym-based classifiers to map Web pages from the intermediate taxonomy to the target taxonomy • Obtain <page, target category> pairs as training data • Train SVM classifiers for the target categories (a minimal training sketch follows)

  20. Statistical Classifiers: SVM (Cont.) • In the figure, circles and triangles denote crawled pages • The black ones are mapped to the two target categories successfully; the white ones fail to map • A query represented only by the white pages cannot be classified correctly by the synonym-based method, but the SVM can • Advantage: recall can be higher • Disadvantages: precision may suffer; once the target taxonomy changes, the classifiers must be retrained

  21. Putting Them Together: Ensemble of Classifiers • Why ensemble? • The two kinds of classifiers are based on different mechanisms • They can be complementary to each other • A proper combination can improve performance • Combination strategies • EV (uses validation data) • EN (no validation data) • (a generic combination sketch follows)

  22. Experiment--Data Sets & Eval. Criteria • Queries: from KDDCUP 2005 • 800,000 queries • 800 labeled, by three labelers • Evaluation: precision, recall and F1 measured against each labeler (formulas below)

  23. Experiment: Quality of the Data Sets • Consistency between labelers (table: performance of each labeler measured against the other labelers) • (table: distribution of the labels assigned by the three labelers)

  24. Experiment Results--Direct vs. Extended Matching • Number of pages collected for training using the different mapping methods • F1 of the synonym-based classifier and the SVM

  25. Experiment Results--The number of assigned labels

  26. Experiment Results-- Effect of Base Classifiers

  27. Solutions • Solution 1: Query/Category Enrichment • Solution 2: Bridging Classifier

  28. Solution 2: Bridging Classifiers • Our Algorithm • Bridging Classifier • Category Selection • Experiments • Data Set and Evaluation Criteria • Results and Analysis

  29. Algorithm--Bridging Classifier • Problem with Solution 1: • the classifiers are tied to a fixed target taxonomy, so training must be repeated whenever the taxonomy changes • Goal: • Connect the target taxonomy and the queries by taking an intermediate taxonomy as a bridge

  30. Algorithm--Bridging Classifier (Cont.) • How to connect? Combine three quantities: • the relation between the target category C_i^T and the intermediate category C_j^I • the relation between the query q and C_j^I • the prior probability of C_j^I • (the combination formula is reconstructed below)

  31. Algorithm--Bridging Classifier (Cont.) • Understanding the bridging classifier • Given C_i^T and q: • p(C_i^T | C_j^I) and p(q | C_j^I) are fixed • p(C_j^I), which reflects the size of C_j^I, acts as a weighting factor • The product tends to be larger when C_i^T and q tend to belong to the same smaller intermediate categories

  32. Algorithm--Category Selection • Category selection for reducing complexity • Total Probability (TP) • Mutual Information (MI) (a generic MI-scoring sketch follows)

  33. Experiment--Data Sets and Eval. Criteria • Intermediate taxonomy • ODP: 1.5M Web pages in 172,565 categories • (tables: number of categories on different levels; statistics of the number of documents in the categories on different levels)

  34. Experiment--Result of Bridging Classifiers • All intermediate categories are used • Snippet only • Best result when n = 60 • Improvements of 10.4% and 7.1% in precision and F1, respectively, compared with the two previous approaches

  35. Experiment--Result of Bridging Classifiers • (table: performance of the bridging classifier with different granularities of the intermediate taxonomy) • Best results when using all intermediate categories • Reason: a category with larger granularity may be a mixture of several target categories, so it cannot be used to distinguish different target categories

  36. Experiment--Effect of Category Selection • When the number of categories is around 18,000, the bridging classifier is comparable to, if not better than, the previous approaches • MI works better than TP • It favors categories that are more discriminative for the target categories

  37. Entity Resolution

  38. Definition: Reference & Entity • Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006 • Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006 • (figure: name references and venue references in these citations map to author entities and journal/conference entities)

  39. Current Author Search • DBLP • CiteSeer • Google • All of them return a MIXED list of references

  40. Graphical Model • We convert entity resolution into a graph partition problem • Each node denotes a reference • Each edge denotes the relation between two references • (a minimal partitioning sketch follows)

  41. How to Measure the Reference Relation • Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006 • Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005 • (figure: the two references are related through plaintext similarity, shared authors and coauthors, research community, and research area)

  42. Features • F1: Title Similarity • F2: Coauthor Similarity • F3: Venue Similarity • F4: Research Community Overlap • F5: Research Area Overlap

  43. Research Community Overlap • A1, A2 stand for two author name references • F4.1: Similarity(A1, A2) = Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2)) • F4.2: Similarity(A1, A2) = Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2)) • Coauthors(X) returns the coauthor name set of each author in set X • Venues(Y) returns the venue name set of each author in set Y

  44. Research Area Overlap • V1, V2 stand for two venue references • F5.1: Similarity(V1, V2) = Authors(Articles(V1)) ∩ Authors(Articles(V2)) • F5.2: Similarity(V1, V2) = Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2))) • Authors(X) returns the author name set of each article in set X • Articles(Y) returns the set of articles holding a reference to each element in set Y • (a minimal overlap-feature sketch follows)

  45. System Framework (figure: pairwise feature similarities are converted into match probabilities)

  46. Experiment Results • Our dataset: 1,000 references to 20 author entities from DBLP • Getoor's datasets: CiteSeer (2,892 author references to 1,165 author entities) and arXiv (58,515 references to 9,200 author entities) • F1 = 97.0%

  47. Summary of Other Work

  48. Summary of Other Work • Summarization using Conditional Random Fields (IJCAI ’07) • Thread Detection in Dynamic Text Message Streams (SIGIR ’06) • Implicit Links for Web Page Classification (WWW ’06) • Text Classification Improved by Multigram Models (CIKM ’06) • Latent Friend Mining from Blog Data (ICDM ’06) • Web-page Classification through Summarization (SIGIR ’04)

  49. Summarization using Conditional Random Fields (IJCAI '07) • Motivation • Observation • Summarization as sequence labeling over the sentences of a document • Solution: CRF with feature functions and parameters (the standard linear-chain form is recalled below) • (figure: the labeling of sentences 1-6 proceeds in three steps)

  50. Thread Detection in Dynamic Text Message Streams (SIGIR '06) • Representation • Content-based features • Structure-based features: sentence type; personal pronouns • Clustering of messages into threads (a minimal sketch follows)
