
CS246


Presentation Transcript


  1. CS246 Latent Dirichlet Allocation

  2. LSI • LSI uses SVD to find the best rank-K approximation • The result is difficult to interpret, especially because of the negative values • Q: Can we develop a more interpretable method?

  3. Theory of LDA (Model-based Approach) • Develop a simplified model on how users write a document based on topics. • Fit the model to the existing corpus and “reverse engineer” the topics used in a document • Q: How do we write a document? • A: (1) Pick the topic(s) (2) Start writing on the topic(s) with related terms

  4. Two Probability Vectors • For every document d, we assume that the user first picks the topics to write about • P(z|d): the probability of picking topic z when the user writes each word in document d • Document-topic vector of d • We also assume that every topic is associated with each term with a certain probability • P(w|z): the probability of picking the term w when the user writes on the topic z • Topic-term vector of z
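
These two vectors combine to give the probability of seeing a term in a document; the mixture form below is a standard consequence of the model rather than something shown on the slide:

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d),
\qquad \sum_{w} P(w \mid z) = 1,
\qquad \sum_{z} P(z \mid d) = 1
```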

  5. Result of LDA (Latent Dirichlet Allocation) • TASA corpus • 37,000 text passages from educational materials collected by Touchstone Applied Science Associates • Set T=300 (300 topics)

  6. Inferred Topics

  7. Word Topic Assignments

  8. LDA Algorithm: Simulation • Two topics: “River”, “Money” • Five words: “river”, “stream”, “bank”, “money”, “loan” • Generate 16 documents by randomly mixing the two topics and using the LDA model

  9. Generated Documents and Initial Topic Assignment before Inference • The first 6 and the last 3 documents are purely from one topic; the others are mixtures • White dot: “River”. Black dot: “Money”

  10. nwt: how many times the word w has been assigned to the topic t • ndt: how many words in the document d have been assigned to the topic t • Q: What is the meaning of each term?

  11. Topic Assignment After LDA Inference • The first 6 and the last 3 documents are purely from one topic; the others are mixtures • After 64 iterations

  12. Inferred Topic-Term Matrix • Model parameter • Estimated parameter • Not perfect, but very close especially given the small data size

  13. Probabilistic Topic Model • There exist T topics • The topic-term vector for each topic is set before any document is written • P(wj|zi) is set for every zi and wj • Then for every document d, • The user decides the topics to write on, i.e., P(zi|d) • For each word in d • The user selects a topic zi with probability P(zi|d) • The user selects a word wj with probability P(wj|zi)

  14. Probabilistic Document Model • [Diagram: topic-term vectors P(w|z) (Topic 1: money, loan, bank; Topic 2: river, stream, bank) and document-topic weights P(z|d) generate three documents: DOC 1 purely from Topic 1 (weight 1.0, e.g. “money1 bank1 loan1 bank1 money1 ...”), DOC 2 a 0.5/0.5 mixture (e.g. “money1 river2 bank1 stream2 bank2 ...”), DOC 3 purely from Topic 2 (weight 1.0, e.g. “river2 stream2 river2 bank2 stream2 ...”); the superscripts mark each word’s topic assignment]

  15. Example: Calculating Probability • z1 = {w1: 0.8, w2: 0.1, w3: 0.1}, z2 = {w1: 0.1, w2: 0.2, w3: 0.7} • d’s topics are {z1: 0.9, z2: 0.1} and d has three terms {w3², w1¹, w2¹} • Q: What is the probability that a user will write such a document?
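
A worked answer (my own arithmetic), reading the superscripts as topic assignments in the style of the document-model diagram on the previous slide, i.e. w3 drawn from z2 and w1, w2 drawn from z1:

```latex
P(d) = \underbrace{P(z_2 \mid d)\,P(w_3 \mid z_2)}_{0.1 \times 0.7}
\times \underbrace{P(z_1 \mid d)\,P(w_1 \mid z_1)}_{0.9 \times 0.8}
\times \underbrace{P(z_1 \mid d)\,P(w_2 \mid z_1)}_{0.9 \times 0.1}
= 0.07 \times 0.72 \times 0.09 \approx 0.0045
```

If instead the topic of each word is left unspecified, each factor becomes a sum over topics, e.g. P(w3|d) = 0.9·0.1 + 0.1·0.7 = 0.16.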

  16. Corpus Generation Probability • T: # topics • D: # documents • M: # words per document • Probability of generating the corpus C
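
The probability formula itself did not survive in the transcript; a standard reconstruction under the model above, writing z_ij for the topic chosen for the j-th word w_ij of document d_i:

```latex
P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(z_{ij} \mid d_i)\, P(w_{ij} \mid z_{ij})
\qquad\text{or, with the topic assignments summed out,}\qquad
P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} \sum_{z} P(z \mid d_i)\, P(w_{ij} \mid z)
```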

  17. Generative Model vs Inference (1) • [Same diagram as slide 14: the topic-term vectors P(w|z), the document-topic weights P(z|d), and every word’s topic assignment are all known]

  18. Generative Model vs Inference (2) • [Same diagram with every value replaced by “?”: given only the documents (“money? bank? loan? bank? money? ...”, “money? river? bank? stream? bank? ...”, “river? stream? river? bank? stream? ...”), the topic-term vectors, document-topic weights, and per-word topic assignments must all be inferred]

  19. Probabilistic Latent Semantic Indexing (pLSI) • Basic Idea: We pick the P(zj|di), P(wk|zj), and zij values to maximize the corpus generation probability • Maximum-likelihood estimation (MLE) • More discussion later on how to compute the P(zj|di), P(wk|zj), and zij values that maximize the probability

  20. Problem of pLSI • Q: 1M documents, 1000 topics, 1M words, 1000 words/doc. How much input data? How many variables do we have to estimate? • Q: Too much freedom. How can we avoid the overfitting problem? • A: Add constraints to reduce the degrees of freedom
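
A rough count under the slide's numbers (this arithmetic is mine, not from the slides):

```latex
\underbrace{10^{6} \times 10^{3}}_{P(z|d):\ \text{docs}\times\text{topics}}
+ \underbrace{10^{3} \times 10^{6}}_{P(w|z):\ \text{topics}\times\text{vocabulary}}
+ \underbrace{10^{6} \times 10^{3}}_{z_{ij}:\ \text{one per word occurrence}}
\approx 3 \times 10^{9}\ \text{values}
```

That is set against only 10^6 × 1000 = 10^9 observed words, i.e. roughly three estimated values per data point, which is why overfitting is the concern here.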

  21. Latent Dirichlet Allocation (LDA) • When term probabilities are selected for each topic • The topic-term probability vector, (P(w1|zj), …, P(wW|zj)), is sampled randomly from a Dirichlet distribution • When users select topics for a document • The document-topic probability vector, (P(z1|d), …, P(zT|d)), is sampled randomly from a Dirichlet distribution

  22. What is Dirichlet Distribution? • Multinomial distribution • Given the probability pi of each event ei, what is the probability that each event ei occurs αi times after n trials? • We assume the pi’s; the distribution assigns probability to the αi’s. • Dirichlet distribution • “Inverse” of the multinomial distribution: we assume the αi’s; the distribution assigns probability to the pi’s.
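
For reference, the multinomial probability and its mirror-image reading (standard formulas, not reproduced in the transcript; the αi here play the role of the event counts):

```latex
P(\alpha_1,\dots,\alpha_k \mid p_1,\dots,p_k)
  = \frac{n!}{\alpha_1!\cdots\alpha_k!}\; p_1^{\alpha_1}\cdots p_k^{\alpha_k}
  \quad\text{(fix the } p_i\text{, assign probability to the counts } \alpha_i,\ \textstyle\sum_i \alpha_i = n)
\\[4pt]
f(p_1,\dots,p_k \mid \alpha_1,\dots,\alpha_k) \;\propto\; p_1^{\alpha_1}\cdots p_k^{\alpha_k}
  \quad\text{(fix the } \alpha_i\text{, assign a density to the } p_i)
```

The second line is presumably the non-standard form that slide 29 later corrects to the standard Dirichlet density.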

  23. Dirichlet Distribution • Q: Given α1, α2, …, αk, what are the most likely p1, p2, …, pk values?
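
A hedged answer, not in the transcript: for the standard density Dir(α1, …, αk) the mode is the left expression (valid when every αi > 1); for the unnormalized ∏ pi^αi form used before slide 29's correction, maximizing subject to Σ pi = 1 gives the right expression:

```latex
p_i^{*} = \frac{\alpha_i - 1}{\sum_{j=1}^{k}\alpha_j - k}
\qquad\text{vs.}\qquad
p_i^{*} = \frac{\alpha_i}{\sum_{j=1}^{k}\alpha_j}
```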

  24. Normalized Probability Vector and Simplex • Remember that ∑j P(wj|z) = 1 and ∑i P(zi|d) = 1 • When (p1, …, pn) satisfies p1 + … + pn = 1, they are on a “simplex plane” • (p1, p2, p3) and their 2-simplex plane

  25. Effect of α values • [Simplex plots of the Dirichlet density over (p1, p2, p3) for one choice of α]

  26. Effect of α values • [Simplex plots for another choice of α]

  27. Effect of α values • [Simplex plots for another choice of α]

  28. Effect of α values • [Simplex plots for another choice of α]

  29. Minor Correction • The formula on the previous slides is not the “standard” Dirichlet distribution. The “standard” Dirichlet distribution formula is given below. • The non-standard form was used to make the connection to the multinomial distribution clear • From now on, we use the standard formula
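
The standard density referred to on this slide (a textbook formula, reproduced here because it did not survive in the transcript):

```latex
\mathrm{Dir}(p_1,\dots,p_k \mid \alpha_1,\dots,\alpha_k)
= \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}
  \prod_{i=1}^{k} p_i^{\alpha_i - 1}
```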

  30. Back to LDA Document Generation Model • For each topic z • Pick the word probability vector P(wj|z)’s by taking a random sample from Dir(β1, …, βW) • For every document d • The user decides its topic vector P(zi|d)’s by taking a random sample from Dir(α1, …, αT) • For each word in d • The user selects a topic z with probability P(z|d) • The user selects a word w with probability P(w|z) • Once all is said and done, we have • P(wj|z): topic-term vector for each topic • P(zi|d): document-topic vector for each document • Topic assignment to every word in each document
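
A minimal sketch of this generation loop in Python; the toy sizes, symmetric priors, and variable names are my own assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

T, W = 2, 5              # number of topics, vocabulary size (toy values)
D, N = 16, 16            # number of documents, words per document
alpha, beta = 1.0, 1.0   # Dirichlet parameters (symmetric, for simplicity)

# Topic-term vectors P(w|z): one sample from Dir(beta, ..., beta) per topic.
phi = rng.dirichlet([beta] * W, size=T)        # shape (T, W)

docs = []
for _ in range(D):
    # Document-topic vector P(z|d): one sample from Dir(alpha, ..., alpha).
    theta = rng.dirichlet([alpha] * T)         # shape (T,)
    words = []
    for _ in range(N):
        z = rng.choice(T, p=theta)             # pick a topic with prob P(z|d)
        w = rng.choice(W, p=phi[z])            # pick a word with prob P(w|z)
        words.append(w)
    docs.append(words)
```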

  31. Symmetric Dirichlet Distribution • In principle, we need to assume two vectors, (α1, …, αT) and (β1, …, βW), as input parameters. • In practice, we often assume all αi’s are equal to α and all βi’s are equal to β • Use two scalar values α and β, not two vectors. • Symmetric Dirichlet distribution • Q: What is the implication of this assumption?

  32. Effect of α value on Symmetric Dirichlet • Q: What does it mean? How will the sampled document-topic vectors change as α grows? • Common choice: α = 50/T, β = 200/W • [Simplex plots of the symmetric Dirichlet density over (p1, p2, p3) for small and large α]

  33. Plate Notation • [Plate diagram: α → P(z|d) → z → w ← P(w|z) ← β; z and w sit in a plate of size N (words per document), nested inside a plate of size M (documents); P(w|z) sits in a plate of size T (topics)]

  34. LDA as Topic Inference • Given a corpus d1: w11, w12, …, w1m; …; dN: wN1, wN2, …, wNm • Find the P(z|d), P(w|z), zij that are most “consistent” with the given corpus • Q: How can we compute such P(z|d), P(w|z), zij? • The best method so far is to use the Monte Carlo method together with Gibbs sampling

  35. Monte Carlo Method (1) • Class of methods that compute a number through repeated random sampling of certain event(s). • Q: How can we compute π?

  36. Monte Carlo Method (2) • Define the domain of possible events • Generate the events randomly from the domain using a certain probability distribution • Perform a deterministic computation using the events • Aggregate the results of the individual computation into the final result • Q: How can we take random samples from a particular distribution?
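
A minimal sketch of these four steps for the π question on the previous slide (my own illustration):

```python
import random

def estimate_pi(n_samples=1_000_000):
    # Domain: the unit square [0, 1) x [0, 1); events: uniformly random points.
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        # Deterministic computation: does the point fall inside the quarter circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # Aggregation: the fraction inside approximates the area pi/4.
    return 4.0 * inside / n_samples

print(estimate_pi())   # prints roughly 3.14
```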

  37. Gibbs Sampling • Q: How can we take a random sample x from the distribution f(x)? • Q: How can we take a random sample (x, y) from the distribution f(x, y)? • Gibbs sampling • Given the current sample (x1, …, xn), pick an axis xi and take a random sample of its value conditioned on all the other values in (x1, …, xn) • In practice, we iterate over the xi’s sequentially
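
A small illustration (my own example, not from the slides): Gibbs sampling from a standard bivariate normal with correlation rho, where each conditional distribution is a one-dimensional normal:

```python
import random

def gibbs_bivariate_normal(rho, n_samples=10_000):
    """Draw (x, y) samples from a standard bivariate normal with correlation rho."""
    samples = []
    x, y = 0.0, 0.0                       # arbitrary starting point
    sigma = (1.0 - rho * rho) ** 0.5      # std dev of each conditional
    for _ in range(n_samples):
        # Resample one coordinate at a time from its conditional given the other.
        x = random.gauss(rho * y, sigma)  # x | y  ~  N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, sigma)  # y | x  ~  N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples
```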

  38. Markov-Chain Monte-Carlo Method (MCMC) • Gibbs sampling is in the class of Markov Chain sampling • Next sample depends only on the current sample • Markov-Chain Monte-Carlo Method • Generate random events using Markov-Chain sampling and apply Monte-Carlo method to compute the result

  39. Applying MCMC to LDA • Let us apply Monte Carlo method to estimate LDA parameters. • Q: How can we map the LDA inference problem to random events? • We first focus on identifying topics {zij} for each word {wij}. • Event: Assignment of the topics {zij} to wij’s. The assignment should be done according to P({zij}|C) • Q: How to sample according to P({zij}|C)? • Q: Can we use Gibbs sampling? How will it work? • Q: What is P(zij|{z-ij},C)?

  40. nwt: how many times the word w has been assigned to the topic t • ndt: how many words in the document d have been assigned to the topic t • Q: What is the meaning of each term?
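
The sampling formula these counts feed into did not survive in the transcript; a standard form of the collapsed-Gibbs conditional for LDA, with W the vocabulary size, T the number of topics, and all counts taken excluding the word currently being resampled, is:

```latex
P\!\left(z_{ij} = t \mid \{z_{-ij}\}, C\right) \;\propto\;
\frac{n_{wt} + \beta}{\sum_{w'} n_{w't} + W\beta}
\cdot
\frac{n_{dt} + \alpha}{\sum_{t'} n_{dt'} + T\alpha}
```

The first factor measures how strongly topic t is associated with the word w; the second measures how prevalent topic t already is in document d, which is one way to answer the slide's question about the meaning of each term.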

  41. LDA with Gibbs Sampling • For each word wij • Assign to topic t with probability • For the prior topic t of wij, decrease nwt and ndt by 1 • For the new topic t of wij, increase nwt and ndt by 1 • Repeat the process many times • At least hundreds of times • Once the process is over, we have • zij for every wij • nwt and ndt
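
A compact collapsed-Gibbs sketch of this loop (my own illustration; the function name, priors, and random initialization are assumptions, and `docs` is a list of word-id lists as in the generative sketch after slide 30):

```python
import numpy as np

def lda_gibbs(docs, T, W, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))               # word-topic counts
    n_dt = np.zeros((len(docs), T))       # document-topic counts
    z = []                                # topic assignment of every word
    for d, words in enumerate(docs):      # random initial assignment
        z_d = rng.integers(T, size=len(words))
        z.append(z_d)
        for w, t in zip(words, z_d):
            n_wt[w, t] += 1
            n_dt[d, t] += 1
    for _ in range(iters):                # repeat the process many times
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                t = z[d][i]
                n_wt[w, t] -= 1           # remove the current assignment
                n_dt[d, t] -= 1
                # P(t | rest) is proportional to
                # (n_wt + beta) / (sum_w n_wt + W*beta) * (n_dt + alpha)
                p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta) * (n_dt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t               # record the new assignment
                n_wt[w, t] += 1
                n_dt[d, t] += 1
    return z, n_wt, n_dt
```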

  42. SVD vs LDA • Both perform the following decomposition • SVD views this as matrix approximation • LDA views this as probabilistic inference based on a generative model • Each entry corresponds to “probability”: better interpretability • [Diagram: a doc × term matrix X = (doc × topic matrix) × (topic × term matrix)]

  43. LDA as Soft Classification • Soft vs hard clustering/classification • After LDA, every document is assigned to a small number of topics with some weights • Documents are not assigned exclusively to a topic • Soft clustering

  44. Summary • Probabilistic Topic Model • Generative model of documents • pLSI and overfitting • LDA, MCMC, and probabilistic interpretation • Statistical parameter estimation • Multinomial distribution and Dirichlet distribution • Monte Carlo method • Gibbs sampling • Markov-Chain class of sampling
