Learn about Latent Semantic Analysis (LSA) limitations and explore Non-Negative Matrix Factorization (NMF) as an alternative for theme extraction in text clustering. Understand how NMF components track themes effectively and the significance of using KL-divergence in NMF for clustering tasks.
HCC class, lecture 14 comments. John Canny, 3/9/05
Clustering: LSA again • The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents) • Columns are distinct terms • Matrix elements are term counts (× tf-idf weight) • The idea is to "factor" this matrix into A D B: [Diagram: the Textblocks × Terms matrix M is factored as A (Textblocks × Themes) · D · B (Themes × Terms)]
LSA again • A encodes the representation of each text block in a space of themes. • B encodes each theme with term weights. It can be used to explicitly describe the theme. [Diagram: M = A D B, as on the previous slide]
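As an illustration of the factorization, here is a minimal sketch (not from the slides) that computes a rank-k LSA decomposition of a tiny, made-up term-count matrix with a truncated SVD; the matrix, the vocabulary and k = 2 are all assumptions chosen for the example.

```python
# Minimal LSA sketch: factor a small term-count matrix M so that M ≈ A · D · B.
import numpy as np

# Rows = text blocks, columns = terms; entries are (weighted) term counts.
M = np.array([
    [2., 1., 0., 0.],
    [1., 2., 0., 1.],
    [0., 0., 3., 1.],
    [0., 1., 2., 2.],
])

k = 2                               # number of themes to keep (an assumption)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

A = U[:, :k]                        # text blocks represented in theme space
D = np.diag(s[:k])                  # theme strengths (singular values)
B = Vt[:k, :]                       # themes described by term weights

approx = A @ D @ B                  # rank-k reconstruction of M
print("reconstruction error:", np.linalg.norm(M - approx))
print("theme term weights:\n", B)   # note: entries may be negative
```

The printed B matrix typically contains negative entries, which is exactly the first of the LSA limitations discussed next.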
LSA limitations • LSA has a few assumptions that don't make much sense: • If documents really do comprise different "themes" there shouldn't be negative weights in the LSA matrices. • LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian. • SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
Non-negative Matrix Factorization • NMF deals with the non-negativity and orthogonality problems, but still relies on Gaussian statistics: • Theme and term weights are constrained to be non-negative, matching the intuition that documents are additive mixtures of themes. • Themes are not forced to be orthogonal in the A and B matrices. • Like LSA, least-squares NMF still implicitly models Gaussian random processes for theme and word generation, even though actual document statistics are far from Gaussian.
LSA again • The consequences are: • LSA themes are not meaningful beyond the first few (the ones with the largest singular values). • LSA is largely insensitive to the choice of semantic space (most 300-dimensional spaces will do).
NMF • The corresponding properties: • NMF components track themes well (up to 30 or more). • The NMF components can be used directly as topic markers, so the choice of component space does matter.
NMF • NMF is an umbrella term for several algorithms. • The one in this paper uses least squares to match the original term matrix, i.e. it minimizes ‖M − AB‖². • Another natural metric is the KL or Kullback-Leibler divergence. The KL-divergence between two probability distributions p and q is KL(p‖q) = Σᵢ pᵢ log(pᵢ / qᵢ). • Another natural version of NMF uses the KL-divergence between M and its approximation A B.
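For concreteness, here is a minimal sketch of the least-squares variant using the standard multiplicative updates (the Lee–Seung form); the slides do not say which algorithm the paper uses, so the update rules, the toy matrix and the parameters here are illustrative assumptions, not the paper's method.

```python
# Least-squares NMF sketch: minimize ||M - A·B||^2 with A, B >= 0
# via the standard multiplicative updates.
import numpy as np

def nmf_ls(M, k, iters=200, eps=1e-9):
    m, n = M.shape
    rng = np.random.default_rng(0)
    A = rng.random((m, k))          # text blocks × themes, non-negative
    B = rng.random((k, n))          # themes × terms, non-negative
    for _ in range(iters):
        # Multiplicative updates keep A and B non-negative and
        # monotonically decrease ||M - A·B||^2.
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
    return A, B

M = np.abs(np.random.default_rng(1).random((6, 8)))   # toy non-negative matrix
A, B = nmf_ls(M, k=3)
print("least-squares error:", np.linalg.norm(M - A @ B))
```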
NMF • KL-divergence is usually a more accurate way to compare probability distributions. • However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters. • KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares for clustering. • The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.
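For comparison, here is a sketch of the KL-divergence variant in the same multiplicative-update style; this is an illustrative formulation, not necessarily the exact recurrences the slide has in mind, and like the least-squares version it only finds locally optimal solutions.

```python
# KL-divergence NMF sketch: minimize the (generalized) KL-divergence
# between M and A·B using the Lee–Seung multiplicative updates.
import numpy as np

def nmf_kl(M, k, iters=200, eps=1e-9):
    m, n = M.shape
    rng = np.random.default_rng(0)
    A = rng.random((m, k))
    B = rng.random((k, n))
    for _ in range(iters):
        R = M / (A @ B + eps)                             # elementwise ratio M / (AB)
        B *= (A.T @ R) / (A.sum(axis=0)[:, None] + eps)
        A *= (R @ B.T) / (B.sum(axis=1)[None, :] + eps)
    return A, B

M = np.abs(np.random.default_rng(1).random((6, 8)))
A, B = nmf_kl(M, k=3)
V = A @ B
kl = np.sum(M * np.log((M + 1e-9) / (V + 1e-9)) - M + V)  # generalized KL-divergence
print("KL divergence:", kl)
```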
A Simpler Text Summarizer • A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference). • This algorithm, called "TextRank", was based on an analysis of the graph of similarities between sentences in the text.
A Simpler Text Summarizer • Vertices in the graph represent sentences; edge weights are the similarity between sentences. [Diagram: similarity graph over sentences S1–S7]
TextRank • TextRank computes vertex strength using a variant of Google's PageRank: the strength of a vertex is the probability of being at that vertex during a long random walk on the graph. [Diagram: the same sentence similarity graph, S1–S7]
TextRank • The highest-ranked vertices comprise the summary. • TextRank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004.)
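A minimal TextRank-style sketch, with two assumptions where the slides leave details open: cosine similarity between term-count vectors as the edge weight (the original paper uses a word-overlap measure), and plain power iteration for the weighted PageRank scores.

```python
# TextRank-style extractive summarization sketch:
# build a sentence similarity graph, run a weighted PageRank-like
# random walk, and pick the highest-ranked sentences.
import numpy as np

def summarize(sent_vectors, n_pick=2, damping=0.85, iters=100):
    """sent_vectors: one term-count row per sentence."""
    X = np.asarray(sent_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    S = (X / (norms + 1e-12)) @ (X / (norms + 1e-12)).T   # cosine similarity
    np.fill_diagonal(S, 0.0)                              # no self-edges
    W = S / (S.sum(axis=1, keepdims=True) + 1e-12)        # row-stochastic transition weights
    n = len(X)
    r = np.full(n, 1.0 / n)                               # random-walk scores
    for _ in range(iters):
        r = (1 - damping) / n + damping * (W.T @ r)       # damped power iteration
    return np.argsort(-r)[:n_pick]                        # indices of top-ranked sentences

# Toy example: 4 "sentences" as term-count vectors over a 5-term vocabulary.
sents = [[2, 1, 0, 0, 1], [1, 2, 0, 1, 0], [0, 0, 3, 1, 0], [0, 1, 2, 2, 1]]
print("summary sentence indices:", summarize(sents))
```

The returned indices would then be mapped back to the original sentences, in document order, to form the summary.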
Discussion Topics T1: The best text analysis algorithms for a variety of tasks seem to use numerical representations (BOW or graphical models) of texts. Discuss what information these representations capture and why they might be effective.