Learn about Latent Semantic Analysis (LSA) limitations and explore Non-Negative Matrix Factorization (NMF) as an alternative for theme extraction in text clustering. Understand how NMF components track themes effectively and the significance of using KL-divergence in NMF for clustering tasks.
HCC class, lecture 14 comments. John Canny, 3/9/05
Clustering: LSA again • The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents) • Columns are distinct terms • Matrix elements are term counts (× tf-idf weight) • The idea is to "factor" this matrix into A D B: [Diagram: the Textblocks × Terms matrix M is factored as A (Textblocks × Themes) · D · B (Themes × Terms)]
LSA again • A encodes the representation of each text block in a space of themes. • B encodes each theme with term weights. It can be used to explicitly describe the theme. [Diagram: M = A D B, as on the previous slide]
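As an illustration of the factorization, here is a minimal sketch (not from the slides) that computes a rank-k LSA decomposition of a tiny, made-up term-count matrix with a truncated SVD; the matrix, the vocabulary and k = 2 are all assumptions chosen for the example.

```python
# Minimal LSA sketch: factor a small term-count matrix M so that M ≈ A · D · B.
import numpy as np

# Rows = text blocks, columns = terms; entries are (weighted) term counts.
M = np.array([
    [2., 1., 0., 0.],
    [1., 2., 0., 1.],
    [0., 0., 3., 1.],
    [0., 1., 2., 2.],
])

k = 2                               # number of themes to keep (an assumption)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

A = U[:, :k]                        # text blocks represented in theme space
D = np.diag(s[:k])                  # theme strengths (singular values)
B = Vt[:k, :]                       # themes described by term weights

approx = A @ D @ B                  # rank-k reconstruction of M
print("reconstruction error:", np.linalg.norm(M - approx))
print("theme term weights:\n", B)   # note: entries may be negative
```

The printed B matrix typically contains negative entries, which is exactly the first of the LSA limitations discussed next.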
LSA limitations • LSA has a few assumptions that don't make much sense: • If documents really do comprise different "themes" there shouldn't be negative weights in the LSA matrices. • LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian. • SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
Non-negative Matrix Factorization • NMF deals with the non-negativity and orthogonality problems, but still relies on Gaussian statistics: • Theme and term weights are constrained to be non-negative, matching the intuition that documents are additive mixtures of themes. • Themes are not forced to be orthogonal in the A and B matrices. • Like LSA, least-squares NMF still implicitly models Gaussian random processes for theme and word generation, even though actual document statistics are far from Gaussian.
LSA again • The consequences are: • LSA themes are not meaningful beyond the first few (the ones with the largest singular values). • LSA is largely insensitive to the choice of semantic space (most 300-dimensional spaces will do).
NMF • The corresponding properties: • NMF components track themes well (up to 30 or more). • The NMF components can be used directly as topic markers, so the choice of component space does matter.
NMF • NMF is an umbrella term for several algorithms. • The one in this paper uses least squares to match the original term matrix, i.e. it minimizes ‖M − AB‖². • Another natural metric is the KL or Kullback-Leibler divergence. The KL-divergence between two probability distributions p and q is KL(p‖q) = Σᵢ pᵢ log(pᵢ / qᵢ). • Another natural version of NMF uses the KL-divergence between M and its approximation A B.
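For concreteness, here is a minimal sketch of the least-squares variant using the standard multiplicative updates (the Lee–Seung form); the slides do not say which algorithm the paper uses, so the update rules, the toy matrix and the parameters here are illustrative assumptions, not the paper's method.

```python
# Least-squares NMF sketch: minimize ||M - A·B||^2 with A, B >= 0
# via the standard multiplicative updates.
import numpy as np

def nmf_ls(M, k, iters=200, eps=1e-9):
    m, n = M.shape
    rng = np.random.default_rng(0)
    A = rng.random((m, k))          # text blocks × themes, non-negative
    B = rng.random((k, n))          # themes × terms, non-negative
    for _ in range(iters):
        # Multiplicative updates keep A and B non-negative and
        # monotonically decrease ||M - A·B||^2.
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
    return A, B

M = np.abs(np.random.default_rng(1).random((6, 8)))   # toy non-negative matrix
A, B = nmf_ls(M, k=3)
print("least-squares error:", np.linalg.norm(M - A @ B))
```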
NMF • KL-divergence is usually a more accurate way to compare probability distributions. • However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters. • KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares for clustering. • The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.
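For comparison, here is a sketch of the KL-divergence variant in the same multiplicative-update style; this is an illustrative formulation, not necessarily the exact recurrences the slide has in mind, and like the least-squares version it only finds locally optimal solutions.

```python
# KL-divergence NMF sketch: minimize the (generalized) KL-divergence
# between M and A·B using the Lee–Seung multiplicative updates.
import numpy as np

def nmf_kl(M, k, iters=200, eps=1e-9):
    m, n = M.shape
    rng = np.random.default_rng(0)
    A = rng.random((m, k))
    B = rng.random((k, n))
    for _ in range(iters):
        R = M / (A @ B + eps)                             # elementwise ratio M / (AB)
        B *= (A.T @ R) / (A.sum(axis=0)[:, None] + eps)
        A *= (R @ B.T) / (B.sum(axis=1)[None, :] + eps)
    return A, B

M = np.abs(np.random.default_rng(1).random((6, 8)))
A, B = nmf_kl(M, k=3)
V = A @ B
kl = np.sum(M * np.log((M + 1e-9) / (V + 1e-9)) - M + V)  # generalized KL-divergence
print("KL divergence:", kl)
```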
A Simpler Text Summarizer • A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference). • This algorithm, called "TextRank", was based on an analysis of the graph of similarities between sentences in the text.
A Simpler Text Summarizer • Vertices in the graph represent sentences; edge weights are the similarity between sentences. [Diagram: similarity graph over sentences S1–S7]
TextRank • TextRank computes vertex strength using a variant of Google's PageRank: the strength of a vertex is the probability of being at that vertex during a long random walk on the graph. [Diagram: the same sentence similarity graph, S1–S7]
TextRank • The highest-ranked vertices comprise the summary. • TextRank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004.)
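A minimal TextRank-style sketch, with two assumptions where the slides leave details open: cosine similarity between term-count vectors as the edge weight (the original paper uses a word-overlap measure), and plain power iteration for the weighted PageRank scores.

```python
# TextRank-style extractive summarization sketch:
# build a sentence similarity graph, run a weighted PageRank-like
# random walk, and pick the highest-ranked sentences.
import numpy as np

def summarize(sent_vectors, n_pick=2, damping=0.85, iters=100):
    """sent_vectors: one term-count row per sentence."""
    X = np.asarray(sent_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    S = (X / (norms + 1e-12)) @ (X / (norms + 1e-12)).T   # cosine similarity
    np.fill_diagonal(S, 0.0)                              # no self-edges
    W = S / (S.sum(axis=1, keepdims=True) + 1e-12)        # row-stochastic transition weights
    n = len(X)
    r = np.full(n, 1.0 / n)                               # random-walk scores
    for _ in range(iters):
        r = (1 - damping) / n + damping * (W.T @ r)       # damped power iteration
    return np.argsort(-r)[:n_pick]                        # indices of top-ranked sentences

# Toy example: 4 "sentences" as term-count vectors over a 5-term vocabulary.
sents = [[2, 1, 0, 0, 1], [1, 2, 0, 1, 0], [0, 0, 3, 1, 0], [0, 1, 2, 2, 1]]
print("summary sentence indices:", summarize(sents))
```

The returned indices would then be mapped back to the original sentences, in document order, to form the summary.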
Discussion Topics T1: The best text analysis algorithms for a variety of tasks seem to use numerical representations (BOW or graphical models) of texts. Discuss what information these representations capture and why they might be effective.