
CS246


Presentation Transcript


  1. CS246 Latent Dirichlet Allocation

  2. LSI • LSI uses SVD to find the best rank-K approximation • The result is difficult to interpret, especially because of the negative values • Q: Can we develop a more interpretable method?

  3. Theory of LDA (Model-based Approach) • Develop a simplified model on how users write a document based on topics. • Fit the model to the existing corpus and “reverse engineer” the topics used in a document • Q: How do we write a document? • A: (1) Pick the topic(s) (2) Start writing on the topic(s) with related terms

  4. Two Probability Vectors • For every document d, we assume that the user first picks the topics to write about • P(z|d): the probability of picking topic z when the user writes each word in document d • Document-topic vector of d • We also assume that every topic is associated with each term with a certain probability • P(w|z): the probability of picking the term w when the user writes on the topic z • Topic-term vector of z
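
These two vectors combine to give the probability of seeing a term in a document; the mixture form below is a standard consequence of the model rather than something shown on the slide:

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d),
\qquad \sum_{w} P(w \mid z) = 1,
\qquad \sum_{z} P(z \mid d) = 1
```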

  5. Result of LDA (Latent Dirichlet Allocation) • TASA corpus • 37,000 text passages from educational materials collected by Touchstone Applied Science Associates • Set T=300 (300 topics)

  6. Inferred Topics

  7. Word Topic Assignments

  8. LDA Algorithm: Simulation • Two topics: “River”, “Money” • Five words: “river”, “stream”, “bank”, “money”, “loan” • Generate 16 documents by randomly mixing the two topics and using the LDA model

  9. Generated Documents and Initial Topic Assignment before Inference • The first 6 and the last 3 documents are purely from one topic; the others are mixtures • White dot: “River”. Black dot: “Money”

  10. nwt: how many times the word w has been assigned to the topic t • ndt: how many words in the document d have been assigned to the topic t • Q: What is the meaning of each term?

  11. Topic Assignment After LDA Inference • The first 6 and the last 3 documents are purely from one topic; the others are mixtures • After 64 iterations

  12. Inferred Topic-Term Matrix • Model parameter • Estimated parameter • Not perfect, but very close especially given the small data size

  13. Probabilistic Topic Model • There exist T topics • The topic-term vector for each topic is set before any document is written • P(wj|zi) is set for every zi and wj • Then for every document d, • The user decides the topics to write on, i.e., P(zi|d) • For each word in d • The user selects a topic zi with probability P(zi|d) • The user selects a word wj with probability P(wj|zi)

  14. Probabilistic Document Model • [Diagram: topic-term vectors P(w|z) (Topic 1: money, loan, bank; Topic 2: river, stream, bank) and document-topic weights P(z|d) generate three documents: DOC 1 purely from Topic 1 (weight 1.0, e.g. “money1 bank1 loan1 bank1 money1 ...”), DOC 2 a 0.5/0.5 mixture (e.g. “money1 river2 bank1 stream2 bank2 ...”), DOC 3 purely from Topic 2 (weight 1.0, e.g. “river2 stream2 river2 bank2 stream2 ...”); the superscripts mark each word’s topic assignment]

  15. Example: Calculating Probability • z1 = {w1: 0.8, w2: 0.1, w3: 0.1}, z2 = {w1: 0.1, w2: 0.2, w3: 0.7} • d’s topics are {z1: 0.9, z2: 0.1} and d has three terms {w3², w1¹, w2¹} • Q: What is the probability that a user will write such a document?
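
A worked answer (my own arithmetic), reading the superscripts as topic assignments in the style of the document-model diagram on the previous slide, i.e. w3 drawn from z2 and w1, w2 drawn from z1:

```latex
P(d) = \underbrace{P(z_2 \mid d)\,P(w_3 \mid z_2)}_{0.1 \times 0.7}
\times \underbrace{P(z_1 \mid d)\,P(w_1 \mid z_1)}_{0.9 \times 0.8}
\times \underbrace{P(z_1 \mid d)\,P(w_2 \mid z_1)}_{0.9 \times 0.1}
= 0.07 \times 0.72 \times 0.09 \approx 0.0045
```

If instead the topic of each word is left unspecified, each factor becomes a sum over topics, e.g. P(w3|d) = 0.9·0.1 + 0.1·0.7 = 0.16.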

  16. Corpus Generation Probability • T: # topics • D: # documents • M: # words per document • Probability of generating the corpus C
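
The probability formula itself did not survive in the transcript; a standard reconstruction under the model above, writing z_ij for the topic chosen for the j-th word w_ij of document d_i:

```latex
P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(z_{ij} \mid d_i)\, P(w_{ij} \mid z_{ij})
\qquad\text{or, with the topic assignments summed out,}\qquad
P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} \sum_{z} P(z \mid d_i)\, P(w_{ij} \mid z)
```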

  17. Generative Model vs Inference (1) • [Same diagram as slide 14: the topic-term vectors P(w|z), the document-topic weights P(z|d), and every word’s topic assignment are all known]

  18. Generative Model vs Inference (2) • [Same diagram with every value replaced by “?”: given only the documents (“money? bank? loan? bank? money? ...”, “money? river? bank? stream? bank? ...”, “river? stream? river? bank? stream? ...”), the topic-term vectors, document-topic weights, and per-word topic assignments must all be inferred]

  19. Probabilistic Latent Semantic Indexing (pLSI) • Basic Idea: We pick the P(zj|di), P(wk|zj), and zij values to maximize the corpus generation probability • Maximum-likelihood estimation (MLE) • More discussion later on how to compute the P(zj|di), P(wk|zj), and zij values that maximize the probability

  20. Problem of pLSI • Q: 1M documents, 1000 topics, 1M words, 1000 words/doc. How much input data? How many variables do we have to estimate? • Q: Too much freedom. How can we avoid the overfitting problem? • A: Add constraints to reduce the degrees of freedom
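
A rough count under the slide's numbers (this arithmetic is mine, not from the slides):

```latex
\underbrace{10^{6} \times 10^{3}}_{P(z|d):\ \text{docs}\times\text{topics}}
+ \underbrace{10^{3} \times 10^{6}}_{P(w|z):\ \text{topics}\times\text{vocabulary}}
+ \underbrace{10^{6} \times 10^{3}}_{z_{ij}:\ \text{one per word occurrence}}
\approx 3 \times 10^{9}\ \text{values}
```

That is set against only 10^6 × 1000 = 10^9 observed words, i.e. roughly three estimated values per data point, which is why overfitting is the concern here.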

  21. Latent Dirichlet Allocation (LDA) • When term probabilities are selected for each topic • The topic-term probability vector, (P(w1|zj), …, P(wW|zj)), is sampled randomly from a Dirichlet distribution • When users select topics for a document • The document-topic probability vector, (P(z1|d), …, P(zT|d)), is sampled randomly from a Dirichlet distribution

  22. What is Dirichlet Distribution? • Multinomial distribution • Given the probability pi of each event ei, what is the probability that each event ei occurs αi times after n trials? • We assume the pi’s; the distribution assigns probability to the αi’s. • Dirichlet distribution • “Inverse” of the multinomial distribution: we assume the αi’s; the distribution assigns probability to the pi’s.
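
For reference, the multinomial probability and its mirror-image reading (standard formulas, not reproduced in the transcript; the αi here play the role of the event counts):

```latex
P(\alpha_1,\dots,\alpha_k \mid p_1,\dots,p_k)
  = \frac{n!}{\alpha_1!\cdots\alpha_k!}\; p_1^{\alpha_1}\cdots p_k^{\alpha_k}
  \quad\text{(fix the } p_i\text{, assign probability to the counts } \alpha_i,\ \textstyle\sum_i \alpha_i = n)
\\[4pt]
f(p_1,\dots,p_k \mid \alpha_1,\dots,\alpha_k) \;\propto\; p_1^{\alpha_1}\cdots p_k^{\alpha_k}
  \quad\text{(fix the } \alpha_i\text{, assign a density to the } p_i)
```

The second line is presumably the non-standard form that slide 29 later corrects to the standard Dirichlet density.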

  23. Dirichlet Distribution • Q: Given α1, α2, …, αk, what are the most likely p1, p2, …, pk values?
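
A hedged answer, not in the transcript: for the standard density Dir(α1, …, αk) the mode is the left expression (valid when every αi > 1); for the unnormalized ∏ pi^αi form used before slide 29's correction, maximizing subject to Σ pi = 1 gives the right expression:

```latex
p_i^{*} = \frac{\alpha_i - 1}{\sum_{j=1}^{k}\alpha_j - k}
\qquad\text{vs.}\qquad
p_i^{*} = \frac{\alpha_i}{\sum_{j=1}^{k}\alpha_j}
```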

  24. Normalized Probability Vector and Simplex • Remember that ∑j P(wj|z) = 1 and ∑i P(zi|d) = 1 • When (p1, …, pn) satisfies p1 + … + pn = 1, they are on a “simplex plane” • (p1, p2, p3) and their 2-simplex plane

  25. Effect of α values • [Simplex plots of the Dirichlet density over (p1, p2, p3) for one choice of α]

  26. Effect of α values • [Simplex plots for another choice of α]

  27. Effect of α values • [Simplex plots for another choice of α]

  28. Effect of α values • [Simplex plots for another choice of α]

  29. Minor Correction • The formula on the previous slides is not the “standard” Dirichlet distribution. The “standard” Dirichlet distribution formula is given below. • The non-standard form was used to make the connection to the multinomial distribution clear • From now on, we use the standard formula
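
The standard density referred to on this slide (a textbook formula, reproduced here because it did not survive in the transcript):

```latex
\mathrm{Dir}(p_1,\dots,p_k \mid \alpha_1,\dots,\alpha_k)
= \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}
  \prod_{i=1}^{k} p_i^{\alpha_i - 1}
```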

  30. Back to LDA Document Generation Model • For each topic z • Pick the word probability vector P(wj|z)’s by taking a random sample from Dir(β1, …, βW) • For every document d • The user decides its topic vector P(zi|d)’s by taking a random sample from Dir(α1, …, αT) • For each word in d • The user selects a topic z with probability P(z|d) • The user selects a word w with probability P(w|z) • Once all is said and done, we have • P(wj|z): topic-term vector for each topic • P(zi|d): document-topic vector for each document • Topic assignment to every word in each document
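
A minimal sketch of this generation loop in Python; the toy sizes, symmetric priors, and variable names are my own assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

T, W = 2, 5              # number of topics, vocabulary size (toy values)
D, N = 16, 16            # number of documents, words per document
alpha, beta = 1.0, 1.0   # Dirichlet parameters (symmetric, for simplicity)

# Topic-term vectors P(w|z): one sample from Dir(beta, ..., beta) per topic.
phi = rng.dirichlet([beta] * W, size=T)        # shape (T, W)

docs = []
for _ in range(D):
    # Document-topic vector P(z|d): one sample from Dir(alpha, ..., alpha).
    theta = rng.dirichlet([alpha] * T)         # shape (T,)
    words = []
    for _ in range(N):
        z = rng.choice(T, p=theta)             # pick a topic with prob P(z|d)
        w = rng.choice(W, p=phi[z])            # pick a word with prob P(w|z)
        words.append(w)
    docs.append(words)
```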

  31. Symmetric Dirichlet Distribution • In principle, we need to assume two vectors, (α1, …, αT) and (β1, …, βW), as input parameters. • In practice, we often assume all αi’s are equal to α and all βi’s are equal to β • Use two scalar values α and β, not two vectors. • Symmetric Dirichlet distribution • Q: What is the implication of this assumption?

  32. Effect of α value on Symmetric Dirichlet • Q: What does it mean? How will the sampled document-topic vectors change as α grows? • Common choice: α = 50/T, β = 200/W • [Simplex plots of the symmetric Dirichlet density over (p1, p2, p3) for small and large α]

  33. Plate Notation • [Plate diagram: α → P(z|d) → z → w ← P(w|z) ← β; z and w sit in a plate of size N (words per document), nested inside a plate of size M (documents); P(w|z) sits in a plate of size T (topics)]

  34. LDA as Topic Inference • Given a corpus d1: w11, w12, …, w1m; …; dN: wN1, wN2, …, wNm • Find the P(z|d), P(w|z), zij that are most “consistent” with the given corpus • Q: How can we compute such P(z|d), P(w|z), zij? • The best method so far is to use the Monte Carlo method together with Gibbs sampling

  35. Monte Carlo Method (1) • Class of methods that compute a number through repeated random sampling of certain event(s). • Q: How can we compute π?

  36. Monte Carlo Method (2) • Define the domain of possible events • Generate the events randomly from the domain using a certain probability distribution • Perform a deterministic computation using the events • Aggregate the results of the individual computation into the final result • Q: How can we take random samples from a particular distribution?
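
A minimal sketch of these four steps for the π question on the previous slide (my own illustration):

```python
import random

def estimate_pi(n_samples=1_000_000):
    # Domain: the unit square [0, 1) x [0, 1); events: uniformly random points.
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        # Deterministic computation: does the point fall inside the quarter circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # Aggregation: the fraction inside approximates the area pi/4.
    return 4.0 * inside / n_samples

print(estimate_pi())   # prints roughly 3.14
```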

  37. Gibbs Sampling • Q: How can we take a random sample x from the distribution f(x)? • Q: How can we take a random sample (x, y) from the distribution f(x, y)? • Gibbs sampling • Given the current sample (x1, …, xn), pick an axis xi and take a random sample of its value conditioned on all the other values in (x1, …, xn) • In practice, we iterate over the xi’s sequentially
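
A small illustration (my own example, not from the slides): Gibbs sampling from a standard bivariate normal with correlation rho, where each conditional distribution is a one-dimensional normal:

```python
import random

def gibbs_bivariate_normal(rho, n_samples=10_000):
    """Draw (x, y) samples from a standard bivariate normal with correlation rho."""
    samples = []
    x, y = 0.0, 0.0                       # arbitrary starting point
    sigma = (1.0 - rho * rho) ** 0.5      # std dev of each conditional
    for _ in range(n_samples):
        # Resample one coordinate at a time from its conditional given the other.
        x = random.gauss(rho * y, sigma)  # x | y  ~  N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, sigma)  # y | x  ~  N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples
```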

  38. Markov-Chain Monte-Carlo Method (MCMC) • Gibbs sampling is in the class of Markov Chain sampling • Next sample depends only on the current sample • Markov-Chain Monte-Carlo Method • Generate random events using Markov-Chain sampling and apply Monte-Carlo method to compute the result

  39. Applying MCMC to LDA • Let us apply Monte Carlo method to estimate LDA parameters. • Q: How can we map the LDA inference problem to random events? • We first focus on identifying topics {zij} for each word {wij}. • Event: Assignment of the topics {zij} to wij’s. The assignment should be done according to P({zij}|C) • Q: How to sample according to P({zij}|C)? • Q: Can we use Gibbs sampling? How will it work? • Q: What is P(zij|{z-ij},C)?

  40. nwt: how many times the word w has been assigned to the topic t • ndt: how many words in the document d have been assigned to the topic t • Q: What is the meaning of each term?
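
The sampling formula these counts feed into did not survive in the transcript; a standard form of the collapsed-Gibbs conditional for LDA, with W the vocabulary size, T the number of topics, and all counts taken excluding the word currently being resampled, is:

```latex
P\!\left(z_{ij} = t \mid \{z_{-ij}\}, C\right) \;\propto\;
\frac{n_{wt} + \beta}{\sum_{w'} n_{w't} + W\beta}
\cdot
\frac{n_{dt} + \alpha}{\sum_{t'} n_{dt'} + T\alpha}
```

The first factor measures how strongly topic t is associated with the word w; the second measures how prevalent topic t already is in document d, which is one way to answer the slide's question about the meaning of each term.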

  41. LDA with Gibbs Sampling • For each word wij • Assign to topic t with probability • For the prior topic t of wij, decrease nwt and ndt by 1 • For the new topic t of wij, increase nwt and ndt by 1 • Repeat the process many times • At least hundreds of times • Once the process is over, we have • zij for every wij • nwt and ndt
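
A compact collapsed-Gibbs sketch of this loop (my own illustration; the function name, priors, and random initialization are assumptions, and `docs` is a list of word-id lists as in the generative sketch after slide 30):

```python
import numpy as np

def lda_gibbs(docs, T, W, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))               # word-topic counts
    n_dt = np.zeros((len(docs), T))       # document-topic counts
    z = []                                # topic assignment of every word
    for d, words in enumerate(docs):      # random initial assignment
        z_d = rng.integers(T, size=len(words))
        z.append(z_d)
        for w, t in zip(words, z_d):
            n_wt[w, t] += 1
            n_dt[d, t] += 1
    for _ in range(iters):                # repeat the process many times
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                t = z[d][i]
                n_wt[w, t] -= 1           # remove the current assignment
                n_dt[d, t] -= 1
                # P(t | rest) is proportional to
                # (n_wt + beta) / (sum_w n_wt + W*beta) * (n_dt + alpha)
                p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta) * (n_dt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t               # record the new assignment
                n_wt[w, t] += 1
                n_dt[d, t] += 1
    return z, n_wt, n_dt
```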

  42. SVD vs LDA • Both perform the following decomposition • SVD views this as matrix approximation • LDA views this as probabilistic inference based on a generative model • Each entry corresponds to “probability”: better interpretability • [Diagram: a doc × term matrix X = (doc × topic matrix) × (topic × term matrix)]

  43. LDA as Soft Classification • Soft vs hard clustering/classification • After LDA, every document is assigned to a small number of topics with some weights • Documents are not assigned exclusively to a topic • Soft clustering

  44. Summary • Probabilistic Topic Model • Generative model of documents • pLSI and overfitting • LDA, MCMC, and probabilistic interpretation • Statistical parameter estimation • Multinomial distribution and Dirichlet distribution • Monte Carlo method • Gibbs sampling • Markov-Chain class of sampling
