
Semantic-based Language Models for Text Retrieval and Clustering



  1. Semantic-based Language Models for Text Retrieval and Clustering Xiaohua (Davis) Zhou College of Information Science & Technology Drexel University

  2. Summary of Research Work • Publications in the last three years • 2 journal papers, 13 conference papers, and 2 workshop papers, 10 of which are first-authored. • IJCAI’07 (a premier conference in AI and knowledge bases), SIGIR’06 (a premier conference in Information Retrieval), and ICDM’06 (one of the three primary conferences in Data Mining) • Topic Distribution • Information Retrieval (4), Information Extraction (4), Text Mining (7), Other (2) • Main Research Direction • Statistical Language Models for Text Retrieval and Mining

  3. Selected Publications • Zhou, X., Zhang, X., and Hu, X., Semantic Smoothing of Document Models for Agglomerative Clustering, to appear in IJCAI 2007 (15.7%) • Zhou, X., Hu, X., Zhang, X., Lin, X., and Song, I.-Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, ACM SIGIR 2006 (18.5%) • Zhou, X., Hu, X., Lin, X., and Zhang, X., Relation-based Document Retrieval for Biomedical IR, Transactions on Computational Systems Biology, 2006, Vol. 4 • Zhou, X., Zhang, X., and Hu, X., Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR, ECIR 2006 (21%) • Zhang, X., Zhou, X., and Hu, X., Semantic Smoothing for Model-based Document Clustering, to appear in ICDM 2006 (short paper, 20%)

  4. Statistical Language Models • Statistical Language Models (LM) • A language model is a probability distribution over words. • Text is assumed to be randomly generated according to a given language model. • Basic Questions • Text Generation: given the model, compute the probability of generating a text. • Inference: given a text, infer the underlying model that generated it.
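To make the two basic questions concrete, here is a minimal unigram-model sketch in Python (illustrative only; the helper names and the toy sentence are hypothetical, not from the talk):

```python
from collections import Counter

def infer_model(tokens):
    """Inference: estimate a unigram model from observed text (maximum likelihood)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generation_prob(model, tokens):
    """Generation: probability that the model produces this token sequence."""
    p = 1.0
    for w in tokens:
        p *= model.get(w, 0.0)  # words outside the model get probability zero
    return p

text = "the cat sat on the mat".split()
model = infer_model(text)                      # p('the') = 2/6, all others 1/6
print(generation_prob(model, ["the", "cat"]))  # (2/6) * (1/6) ≈ 0.056
```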

  5. Example of Language Modeling Figure 1. Illustration of the generative process and the problem of statistical inference underlying topic models.

  6. Applications of LM • Text Prediction • Compute the probability of generating a sequence of words according to the trained model. • Applications: Information Extraction, Text Retrieval, Text Clustering, and Text Classification • Model Inference • Infer the underlying model from the generated texts. • Applications: Text Decoding, Topic Models

  7. LM for Text Retrieval • Relevance • The probability of generating the query from the document (model), i.e., p(q|d) • Example • Document 1 = {(A, 3), (B, 5), (C, 2)} • Document 2 = {(A, 4), (B, 1), (C, 5)} • Query = {A, B} • Which document is more relevant to the query? Doc 1: 0.3 * 0.5 = 0.15; Doc 2: 0.4 * 0.1 = 0.04. Doc 1 is more relevant to the query than Doc 2.
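The slide's example can be reproduced in a few lines (a sketch; `query_likelihood` is a made-up helper, and no smoothing is applied yet):

```python
def query_likelihood(doc_counts, query):
    """Score a document by the product of p(w|d) over the query words."""
    total = sum(doc_counts.values())
    score = 1.0
    for w in query:
        score *= doc_counts.get(w, 0) / total
    return score

d1 = {"A": 3, "B": 5, "C": 2}
d2 = {"A": 4, "B": 1, "C": 5}
print(query_likelihood(d1, ["A", "B"]))  # 0.3 * 0.5 = 0.15
print(query_likelihood(d2, ["A", "B"]))  # 0.4 * 0.1 = 0.04
```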

  8. LM for Text Clustering • Agglomerative Clustering • The pairwise document similarity is defined on the two document models (e.g., via KL-divergence). • Partitional Clustering • The similarity between a document and a cluster is defined as the generative probability of the document under the cluster model, i.e., p(d|cj).
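A minimal sketch of the agglomerative measure, assuming background-smoothed models so the divergence stays finite (`smoothed_model`, the corpus model, and the mixing weight `beta` are illustrative choices, not the talk's exact settings):

```python
import math

def smoothed_model(doc_counts, corpus_model, beta=0.5):
    """Mix the document's ML estimate with a corpus model so every word has p > 0."""
    total = sum(doc_counts.values())
    return {w: (1 - beta) * doc_counts.get(w, 0) / total + beta * p_c
            for w, p_c in corpus_model.items()}

def kl_divergence(p, q):
    """KL(p || q); q is smoothed over the corpus vocabulary, so q[w] > 0."""
    return sum(p_w * math.log(p_w / q[w]) for w, p_w in p.items() if p_w > 0)

corpus = {"A": 0.35, "B": 0.30, "C": 0.35}  # hypothetical corpus model
m1 = smoothed_model({"A": 3, "B": 5, "C": 2}, corpus)
m2 = smoothed_model({"A": 4, "B": 1, "C": 5}, corpus)
print(kl_divergence(m1, m2))  # lower divergence = more similar documents
```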

  9. Where Is the Problem? • Sparsity of Topical Words • Document 1 = {(A, 3), (B, 5), (C, 2)} • Document 2 = {(A, 4), (B, 1), (C, 5)} • Query = {A, D} • Which document is more relevant to the query? Doc 1: 0.3 * 0 = 0; Doc 2: 0.4 * 0 = 0 (D is unseen in both documents). Obviously, this result is not reasonable. Text clustering suffers from the same problem.

  10. Where Is the Problem? • Density of Topic-free General Words • An extreme example is stop words. • Such words are assigned high probability but contribute nothing to the retrieval or clustering task. • Any document pair could be considered similar for clustering because they share many common words. • We need to discount the effect of these general words, the same idea behind the TF*IDF weighting scheme.

  11. Summary of the LM Problems • Need to assign reasonable probability (count) to words “unseen” in the training data. • Technically, avoid zero probability. • Account for the semantic relationship between training words and testing words, e.g., the query “car” against a document containing “auto”. • Need to discount topic-free general words. • Remove noise • Concentrate on topic-related words • These two issues are exactly the goals of statistical model smoothing.

  12. Laplacian Smoothing • Basic Idea • Simply assume each word has a constant prior count (here, 2). • This merely prevents zero probabilities; it does not make much sense for real applications.
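The slide's formula image did not survive the transcript; with a per-word prior count of $\delta$ (the slide's 2, or 1 for classic add-one smoothing), the standard Laplacian estimate is (a reconstruction):

```latex
p(w \mid d) = \frac{c(w; d) + \delta}{|d| + \delta\,|V|}
```

where $|d|$ is the document length and $|V|$ the vocabulary size.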

  13. Background Smoothing • The document model is smoothed with the corpus (background) model; c(w; d) is the count of word w in document d, and C denotes the corpus (Zhai and Lafferty 2001).
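The formula image is missing; the standard Jelinek-Mercer form from Zhai and Lafferty (2001), consistent with the notation above, is:

```latex
p(w \mid d) = (1-\lambda)\,\frac{c(w; d)}{\sum_{w'} c(w'; d)} + \lambda\, p(w \mid C)
```

where $\lambda$ is the background interpolation coefficient.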

  14. Two-Stage Smoothing • Basic Idea • The first stage smooths the document model with the corpus model; the second stage computes the likelihood of the query according to a query background model (Zhai and Lafferty 2002).
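Again the formula image is lost; the two-stage estimate of Zhai and Lafferty (2002) is usually written as (a reconstruction):

```latex
p(w \mid d) = (1-\lambda)\,\frac{c(w; d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)
```

where $\mu$ is the Dirichlet prior of the first (estimation) stage and $p(w \mid U)$ is the query background model of the second stage.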

  15. Cluster Language Model • Motivation • Pool similar documents so that a more accurate smoothed model can be estimated (Liu and Croft 2004). • Weaknesses • Clustering a large collection is time-consuming and does not scale. • The assumption that one document is associated with only one cluster does not hold very well.

  16. Statistical Translation Model • Motivating Example • A document containing “auto” should be returned for the query “car”. • Statistical Translation Model • Semantic relationships between document terms and query terms are considered (Berger and Lafferty 1999). • Follow-up: Jin, Hauptmann, and Zhai 2002; Lafferty and Zhai 2001; Cao et al. 2005 • Unable to incorporate contextual and sense information into the translation procedure.
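For reference (the transcript preserves no formula), Berger and Lafferty's (1999) translation model scores a query $q = q_1 \ldots q_m$ as:

```latex
p(q \mid d) = \prod_{i=1}^{m} \sum_{w} t(q_i \mid w)\, p(w \mid d)
```

where $t(q_i \mid w)$ is the probability of translating document word $w$ into query word $q_i$.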

  17. Context-Sensitive Semantic Smoothing (Our Approach) • Definition • Like the statistical translation model, term semantic relationships are used for model smoothing. • Unlike the statistical translation model, contextual and sense information is considered. • Method • Decompose a document into a set of context-sensitive topic signatures and then statistically translate the topic signatures into individual words.

  18. Topic Signatures • Concept Pairs • A pair of concepts that are semantically and syntactically related to each other • Examples: computer and mouse, hypertension and obesity • Extraction: ontology-based approach (Zhou et al. 2006, SIGIR) • Multiword Phrases • Examples: Space Program, Star Wars, White House • Extraction: Xtract (Smadja 1993)

  19. Translation Probability Estimate • Method • Use co-occurrence counts (topic signature and individual words). • Use a mixture model to remove noise from topic-free general words. Figure 2. Illustration of document indexing. Vt, Vd, and Vw are the topic signature set, document set, and word set, respectively. • Denote by Dk the set of documents containing topic signature tk. The parameter α is the coefficient controlling the influence of the corpus model in the mixture model.
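The mixture model on the slide (image lost) plausibly has the standard form from the SIGIR 2006 paper: each word occurrence in Dk is generated either by the topic-signature model or by the corpus background model. Under that assumption:

```latex
p(w \mid D_k) = (1-\alpha)\, p(w \mid \theta_k) + \alpha\, p(w \mid C)
```

where $\theta_k$ is the translation model of topic signature $t_k$ to be estimated.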

  20. Translation Probability Estimate • Log likelihood of generating Dk • EM for estimation, where c(w; Dk) is the document frequency of term w in Dk, i.e., the co-occurrence count of w and tk in the whole collection.
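Both formula images are missing; under the mixture model above, the log likelihood and the EM updates take the following standard form (a reconstruction consistent with the slide's description, not a verbatim copy):

```latex
\log p(D_k) = \sum_{w} c(w; D_k)\,\log\bigl[(1-\alpha)\, p(w \mid \theta_k) + \alpha\, p(w \mid C)\bigr]
```

```latex
\hat{p}^{(n)}(w) = \frac{(1-\alpha)\, p^{(n)}(w \mid \theta_k)}{(1-\alpha)\, p^{(n)}(w \mid \theta_k) + \alpha\, p(w \mid C)}, \qquad
p^{(n+1)}(w \mid \theta_k) = \frac{c(w; D_k)\,\hat{p}^{(n)}(w)}{\sum_{w'} c(w'; D_k)\,\hat{p}^{(n)}(w')}
```

The E-step computes the posterior probability that an occurrence of w was generated by the topic-signature model; the M-step renormalizes the counts weighted by that posterior.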

  21. Contrasting Translation Example

  22. Topic Signature LM • Basic Idea • Linearly interpolate the topic-signature-based translation model with a simple language model. • The document expansions based on context-sensitive semantic smoothing will be very specific. • The simple language model can capture the points the topic signatures miss. The translation coefficient (λ) controls the influence of the translation component in the mixture model.
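The interpolation formula itself (image missing) is, in the deck's notation, plausibly:

```latex
p_{bt}(w \mid d) = (1-\lambda)\, p_{b}(w \mid d) + \lambda\, p_{t}(w \mid d)
```

where $p_b$ is the background-smoothed simple language model and $p_t$ is the topic-signature translation model, both defined on the next slide.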

  23. Topic Signature LM • The Simple Language Model • The Topic Signature Translation Model • c(ti, d) is the frequency of topic signature ti in document d.
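The two component formulas are lost in the transcript; assuming they follow the standard definitions in the SIGIR 2006 paper, they would read:

```latex
p_{b}(w \mid d) = (1-\beta)\, p_{ml}(w \mid d) + \beta\, p(w \mid C), \qquad
p_{t}(w \mid d) = \sum_{k} p(w \mid t_k)\, \frac{c(t_k; d)}{\sum_{i} c(t_i; d)}
```

where $\beta$ is the background coefficient (0.05 in the experiments below) and $p(w \mid t_k)$ is the translation probability estimated by the EM procedure above.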

  24. Text Retrieval Experiments • Collections • TREC Genomics Track 2004 and 2005 • Sub-collections used: • 2004: 48,753 documents • 2005: 41,018 documents • Measures • Mean Average Precision (MAP), Recall • Settings • Simple language model as the baseline • Concept pairs as topic signatures • Background coefficient: 0.05 • Pseudo-relevance feedback: top 50 documents, expand 10 terms

  25. IR Experiment Results Table 1. The comparison of the baseline language model with the topic signature document model and the topic signature query model. The parameters λ and γ are trained on the TREC04 dataset.

  26. Effect of Document Smoothing Figure 3. The variance of MAP with the translation coefficient (λ), which controls the influence of the translation component in the topic signature language model.

  27. Effect of Document Smoothing Figure 4. The variance of MAP with the translation coefficient (λ), which controls the influence of the translation component in the topic signature language model.

  28. vs. Context-Insensitive Model • Context-Insensitive Semantic Smoothing • c(t(w, wk)) is the frequency count of the topic signature t(w, wk) in the whole collection.
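The accompanying formula is missing; given the definition above, the context-insensitive variant presumably translates individual words rather than topic signatures, along the lines of (a hedged reconstruction):

```latex
t(w \mid w_k) = \frac{c\bigl(t(w, w_k)\bigr)}{\sum_{w'} c\bigl(t(w', w_k)\bigr)}, \qquad
p_{t}(w \mid d) = \sum_{k} t(w \mid w_k)\, p_{ml}(w_k \mid d)
```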

  29. vs. Context-Insensitive Model • Experiment Results Table 2. Comparison of context-sensitive semantic smoothing (Sensitive) to context-insensitive semantic smoothing (Insensitive) on MAP. The rightmost column is the change of Sensitive over Insensitive.

  30. vs. Other Approaches • Other Approaches • Simple language model with word as the indexing unit • Local Information Flow (Song and Bruza 2003) • Context-sensitive semantic smoothing • Cannot incorporate domain knowledge • Model-based Feedback (Zhai and Lafferty 2001) • Findings • Our approach achieved the best result for both 2004 and 2005. • The incorporation of domain knowledge did not help much when using the simple language model.

  31. vs. Other Approaches • Experiment Results Table 3. Comparison of the retrieval performance of six approaches on TREC Genomics Track 2004 and 2005. The concept-based indexing is based on the UMLS Metathesaurus. We implemented all approaches ourselves.

  32. Text Clustering Experiments • Using multiword phrases as topic signatures • The meaning is unambiguous in most cases. • There are many statistical approaches to phrase extraction. • Applicable to any domain • Testing Collections • 20-newsgroups, LA Times, TDT2 • Evaluation Criteria • Normalized mutual information (NMI; Banerjee and Ghosh, 2002) • Entropy (Steinbach et al., 2000) • Purity (Zhao and Karypis, 2001)

  33. Statistics of Three Datasets Table 4. Statistics of the three datasets. Notes: In the testing phase, we create both small and large testing collections. For small collections, 100 documents are randomly selected for each class; for large collections, 1,000 documents are randomly selected for each class. Agglomerative clustering is evaluated only on small collections, while partitional clustering is evaluated on both.

  34. Agglomerative Clustering Table 5. NMI results of agglomerative hierarchical clustering with the complete linkage criterion. “Bkg” and “Semantic” denote simple background smoothing and semantic smoothing, respectively. * means stop words are not removed. The translation coefficient λ is trained on TDT2.

  35. Effect of Document Smoothing Figure 5. The variance of the cluster quality with the translation coefficient (λ), which controls the influence of semantic smoothing.

  36. Partitional Clustering Table 6. NMI results of partitional clustering on large and small datasets. “Lap”, “Bkg”, and “Semantic” denote Laplacian smoothing, background smoothing, and semantic smoothing, respectively. * means stop words are not removed. The translation coefficient λ is trained on TDT2.

  37. Effect of Cluster Smoothing Figure 6. The variance of the cluster quality on small datasets with the translation coefficient (λ), which controls the influence of semantic smoothing. Stop words are removed.

  38. Effect of Cluster Smoothing Figure 7. The variance of the cluster quality on large datasets with the translation coefficient (λ), which controls the influence of semantic smoothing. Stop words are removed.

  39. Clustering Result Summary • Semantic smoothing is much more effective than the other schemes on agglomerative clustering, where data sparsity is the major problem. • For partitional clustering, when the dataset is small and data sparsity is the major problem, semantic smoothing is very effective; otherwise, it performs on par with background smoothing. • Although both semantic smoothing and background smoothing can weaken the effect of general words, they are less effective than TF*IDF, which discounts general words more aggressively. • Laplacian smoothing is the worst among all tested schemes. • Removing stop words has no effect on TF*IDF, background smoothing, or semantic smoothing, but a significant effect on the other schemes.

  40. Conclusions and Future Work • The topic signature language model is very effective in discounting general words and smoothing unseen topic-related words. • With different implementations of topic signatures, the model can work with or without domain knowledge and can be applied to either general or specific domains. • Future work • Improve the optimization of the translation coefficient. • Apply the model to other applications such as text summarization and text classification.

  41. Software and Papers • Download the Dragon Toolkit: http://www.ischool.drexel.edu/dmbio/dragontool • Related Papers and Slides: http://www.pages.drexel.edu/~xz37 or http://www.daviszhou.net

  42. Acknowledgements • Thanks to my advisor Dr. Tony Hu as well as thesis committee members Dr. Song and Dr. Lin for their advice on my research work. • Thanks to Dr. Han and Dr. Weber for their help while I was working with them as an RA. • Thanks to IST for the generous support of my travels to academic conferences. • My research work is supported in part by an NSF CAREER grant (NSF IIS 0448023), NSF CCF 0514679, and PA Dept. of Health grants (No. 240205, No. 240196, No. 239667).

  43. Questions/Comments?
