
Probabilistic Dyadic Data Analysis with Local and Global Consistency

This paper presents a probabilistic approach to topic modeling in text collections that takes the local and global consistency of the data's geometric structure into account. Experimental results on text clustering and classification demonstrate the effectiveness of the proposed approach.



Presentation Transcript


  1. Probabilistic Dyadic Data Analysis with Local and Global Consistency. Deng Cai, Xuanhui Wang, Xiaofei He. Zhejiang University / University of Illinois at Urbana-Champaign. ICML 2009.

  2. Outline
  • Motivation
    • Traditional topic modeling (e.g. PLSA, LDA)
    • The geometric structure of the data
  • Topic Modeling with Local Consistency
    • Locally-consistent Topic Modeling (LTM)
  • Experiments
  • Summary

  3. Why Topic Modeling?
  Text Collection → topic models (multinomial distributions over terms), e.g.:
  • Topic 1: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, …
  • Topic 2: web 0.21, search 0.10, link 0.08, graph 0.05, …
  Probabilistic topic modeling is a powerful tool for text mining:
  • Topic discovery
  • Summarization
  • Opinion mining
  • Many more…

  4. Probabilistic Latent Semantic Analysis [figure: the document-term co-occurrence data to be modeled]

  5. Probabilistic Latent Semantic Analysis
  • Naive approach: estimate P(w|d) directly from n(d, w), the number of occurrences of term w in document d.
  • Zero-frequency problem: terms not occurring in a document get zero probability.
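
A minimal NumPy sketch of the naive estimate and its zero-frequency problem (the toy count matrix is made up for illustration):

```python
import numpy as np

# Toy document-term count matrix n(d, w): rows are documents, columns terms.
counts = np.array([
    [3, 1, 0, 2],   # document 0 never uses term 2
    [0, 4, 5, 1],   # document 1 never uses term 0
])

# Naive maximum-likelihood estimate: P(w|d) = n(d, w) / sum_w' n(d, w')
p_w_given_d = counts / counts.sum(axis=1, keepdims=True)

# Zero-frequency problem: any term unseen in a document gets probability 0,
# even when the term is perfectly plausible for that document's topic.
print(p_w_given_d[0])   # [0.5  0.1667  0.  0.3333] -- term 2 is "impossible"
```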

  6. Probabilistic Latent Semantic Analysis
  [figure: documents and terms linked through latent concepts; e.g. the terms "trade", "imports", "economic" load onto a latent concept TRADE]
  • Model fitting over latent concepts.
  Reference: T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
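
For reference, the standard PLSA decomposition from Hofmann (1999), which the figure depicts; the slide's own equations are images, so this is a reconstruction:

```latex
% PLSA: each document-term co-occurrence is explained by latent topics z
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)

% Log-likelihood of the collection, where n(d,w) counts term w in document d
\mathcal{L}_{\mathrm{pLSA}} = \sum_{d}\sum_{w} n(d,w)\,\log \sum_{z} P(w \mid z)\, P(z \mid d)
```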

  7. Various Topic Modeling Approaches
  • Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
  • Latent Dirichlet Allocation (LDA) [Blei et al. 03]
  • Pachinko allocation [Li & McCallum 06]
  • Many more…
  All of these fail to consider the geometric structure of the data.

  8. Manifold?
  • Manifold assumption (maybe too strong)
  • Local consistency assumption (much weaker): nearby points (neighbors) share similar properties.

  9. Geometric Structure for Topic Modeling
  • Build a p-nearest-neighbor graph W over the documents (see the sketch below).
  • Run pLSA, but smooth the topic distributions P(z|d) over the graph.
  Intuition: a document has topics similar to those of its neighbors.
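
A short sketch of the graph-construction step, assuming TF-IDF features and cosine distance; the toy corpus, p = 2, and binary edge weights are illustrative choices, not necessarily the paper's:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import kneighbors_graph

# Toy corpus standing in for the document collection.
docs = ["stocks fell on trade fears",
        "imports and exports shaped trade policy",
        "the team won the match",
        "players trained hard before the match"]

X = TfidfVectorizer().fit_transform(docs)

# p-nearest-neighbor graph: W[j, l] = 1 if d_l is among the p nearest
# neighbors of d_j under cosine distance.
p = 2
W = kneighbors_graph(X, n_neighbors=p, metric="cosine", mode="connectivity")

# Symmetrize, treating "being neighbors" as a mutual relation.
W = W.maximum(W.T).toarray()
```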

  10. Objective Function
  The regularized log-likelihood of LTM adds to the log-likelihood of pLSA a term that measures the smoothness of P(z|d) over the geometric structure of the data (the slide's equations are reconstructed below).
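
A reconstruction of the two equations, assuming a squared-difference smoothness penalty over the graph (the paper may use a divergence-based variant; treat the exact form as an assumption):

```latex
% Smoothness of the topic distributions P(z|d) over the neighbor graph W
\mathcal{R} = \frac{1}{2} \sum_{j,l=1}^{N} W_{jl} \sum_{z} \bigl( P(z \mid d_j) - P(z \mid d_l) \bigr)^{2}

% Regularized log-likelihood of LTM: fit the data, stay smooth on the graph
\mathcal{L}_{\mathrm{LTM}} = \sum_{d}\sum_{w} n(d,w)\,\log \sum_{z} P(w \mid z)\, P(z \mid d) \;-\; \lambda\,\mathcal{R}
```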

  11. Parameter Estimation via EM
  • E-step: posterior probability of the latent variables ("concepts"); same as PLSA.
  • M-step: parameter estimation based on the "completed" statistics; same as PLSA.
  Reference: A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society B, vol. 39, no. 1, pp. 1-38, 1977.
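
The E-step posterior in its standard PLSA form (reconstructed; the slide shows it as an image):

```latex
% E-step: posterior of the latent concept z for each (d, w) pair
P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}
```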

  12. Parameter Estimation via EM
  • M-step: parameter estimation based on the "completed" statistics.
  • If λ = 0, the updates reduce exactly to the PLSA M-step.
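
A minimal NumPy sketch of one EM iteration at λ = 0, where LTM coincides with plain PLSA; the graph-smoothing modification to the P(z|d) update is deliberately omitted, and `plsa_em_step` is an illustrative helper, not the paper's code:

```python
import numpy as np

def plsa_em_step(counts, p_w_z, p_z_d):
    """One EM iteration of PLSA, i.e. LTM with lambda = 0.
    counts: (D, V) term counts n(d, w); p_w_z: (V, K) P(w|z);
    p_z_d: (D, K) P(z|d)."""
    # E-step: P(z|d,w) proportional to P(z|d) P(w|z), normalized over z
    post = p_z_d[:, None, :] * p_w_z[None, :, :]           # (D, V, K)
    post /= post.sum(axis=2, keepdims=True) + 1e-12

    # "Completed" statistics: expected counts n(d,w) * P(z|d,w)
    expected = counts[:, :, None] * post                    # (D, V, K)

    # M-step with lambda = 0: same as PLSA. The LTM M-step would
    # additionally smooth the P(z|d) update over the graph W.
    p_w_z = expected.sum(axis=0)                            # (V, K)
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = expected.sum(axis=1)                            # (D, K)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d
```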

  13. Experimental Results
  • Text clustering: Reuters-21578 corpus; 30 categories, 8067 documents, 18832 distinct terms.
  • Text classification: TDT2 corpus; 10 categories, 7456 documents, 33947 distinct terms.
  For reproducibility, the algorithms and data sets used in the experiments are available at: http://www.cs.uiuc.edu/homes/dengcai2/Data/data.html

  14. Clustering Results on Reuters
  • P(z|d) can be used to indicate the cluster (see the snippet below).
  • Six algorithms compared: three topic-modeling approaches and three clustering algorithms.
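
How P(z|d) indicates the cluster, with toy values (purely illustrative):

```python
import numpy as np

# Fitted topic distributions P(z|d): rows are documents, columns are topics.
p_z_d = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])

# Assign each document to its most probable topic as its cluster label.
clusters = p_z_d.argmax(axis=1)   # -> array([0, 2])
```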

  15. Accuracy vs. Parameter λ (regularization)

  16. Accuracy vs. Parameter p (nearest neighbors)

  17. Classification Results on TDT2
  • Classifier: SVM
  • LTM with Label: construct the graph W taking the label information into account (see the sketch below).
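
A sketch of one plausible reading of "construct the graph W considering the label information": keep a neighbor edge only when the two documents share a label. `labeled_knn_graph`, its arguments, and this exact rule are assumptions, not the paper's stated construction:

```python
import numpy as np

def labeled_knn_graph(distances, labels, p):
    # distances: (n, n) pairwise document distances; labels: length-n array.
    n = distances.shape[0]
    W = np.zeros((n, n))
    for j in range(n):
        order = np.argsort(distances[j])
        neighbors = [l for l in order if l != j][:p]
        for l in neighbors:
            if labels[j] == labels[l]:     # keep edges only within a class
                W[j, l] = W[l, j] = 1.0
    return W
```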

  18. Performance vs. Number of hidden topics

  19. Summary
  • Topic modeling with local and global consistency (taking the geometric structure of the data into account).
  • We use EM to solve the resulting optimization problem.
  • Experimental results on text clustering and classification show the effectiveness of the proposed approach.
  • Future work:
    • Experiments on real applications
    • Extension to other topic models (e.g. LDA)
    • Other ways of constructing the document graph

  20. Thanks!
