LING / C SC 439/539 Statistical Natural Language Processing • Lecture 25 • 4/17/2013
Recommended reading • http://en.wikipedia.org/wiki/Cluster_analysis • C. Christodoulopoulos, S. Goldwater, & M. Steedman. 2010. Two Decades of Unsupervised POS induction: How far have we come? Proceedings of EMNLP • Marie Labelle. 2005. The acquisition of grammatical categories: a state of the art. In Henri Cohen & Claire Lefebvre (eds.), Handbook of Categorization in Cognitive Science, Elsevier, 433-457.
Outline • The POS induction problem • Clustering and similarity • K-means clustering • Some results of POS induction • Evaluation of clustering
POS categories: definitions • Traditional prescriptive grammar • Nouns, Verbs, Adjectives, Prepositions, etc. • Generative grammar: • Nouns, Verbs, Adjectives, Prepositions, etc. • Chomsky 1970: [ +/-N, +/-V ] • Noun: [ +N, -V ] • Verb: [ -N, +V ] • Adj: [ +N, +V ] • Prep: [ -N, -V ] • Corpus tag sets • Penn Treebank tags (45 for English), other corpora
Open- or closed-class • Open-class • Nouns, Verbs, Adjectives • Freely add new members • Closed-class • Prepositions, Determiners, Conjunctions, etc. • English adverbs: not quite open- or closed-class
Words vs. POS tags as features • Highly specific features are more precise, but probability estimation is more difficult • One solution is backing off • Example from smoothing: back off p(Z | X, Y ) to p(Z | Y ) • POS tags define equivalence classes for words • The use of POS tags in place of words can also be viewed as backing off • Example: p(farmer|DT, JJ) instead of p(farmer|the, happy)
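As a rough illustration of this backoff idea, here is a minimal Python sketch; the count tables and the Penn Treebank tag names (DT, JJ) are hypothetical, made up purely for this example:

```python
# Sketch of backing off from word contexts to POS-tag contexts.
# Both count tables below are hypothetical, for illustration only.

word_context_counts = {
    ("the", "happy"): {"farmer": 1},                       # sparse: a specific word pair is rarely seen
}
tag_context_counts = {
    ("DT", "JJ"): {"farmer": 40, "dog": 65, "idea": 12},   # denser: tags pool counts over many word pairs
}

def p_word_given_context(word, context, counts):
    """MLE estimate of p(word | context) from a count table."""
    table = counts.get(context, {})
    total = sum(table.values())
    return table.get(word, 0) / total if total > 0 else 0.0

# p(farmer | the, happy): estimated from very few observations
print(p_word_given_context("farmer", ("the", "happy"), word_context_counts))
# p(farmer | DT, JJ): the POS tags act as equivalence classes, giving a more reliable estimate
print(p_word_given_context("farmer", ("DT", "JJ"), tag_context_counts))
```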
Problems with traditional / generative grammar definitions of POS categories • Categories of traditional prescriptive grammar based on English and Latin • Used in generative linguistics, but not all linguistic theories • Typological problems: • Not universal: other languages have different sets of categories • Many languages don’t distinguish adjectives from stative intransitive verbs: Choctaw, Slave, Mojave, Hopi, Edo, Yoruba • Also Chinese • Many languages don’t distinguish adjectives from nouns (Quechua, Nahuatl, Greenlandic Eskimo, Australian languages) • Even the Noun/Verb contrast is controversial (Wakashan and Salish families, some Austronesian languages) • Field linguists sometimes disagree about inventory of categories in a language
More problems of POS tags • POS tag sets are not necessarily the best way to group words into equivalence classes • Words in the same POS may have very different statistical properties • Number of POS tags for a language? • Traditional grammars have a small number of word categories • Corpus POS tag sets vary in size • Penn Treebank: 45 + punctuation; CLAWS: 132 • Example of a larger tag set: http://www.natcorp.ox.ac.uk/docs/c7spec.html • Larger number of finer-grained classes may be more desirable in statistical models • Example: improved performance in statistical parsing • “Low density” languages, with limited annotated resources
Results for unlexicalized parsers (outperforms best lexicalized parser) • Naïve Treebank grammar: F-measure 72.6 • Manually split: 86.3 (Klein & Manning 2003) • Automatically split, same # of clusters per tag: 86.7 (Matsuzaki et al. 2005) • Automatically split, varying number of clusters per tag: 89.7 (Petrov et al. 2006)
Discover word classes • Automatically discover word classes that better reflect the statistical properties of the data • Define according to local distribution in sentence • Apply clustering algorithms (unsupervised) • Also useful for languages with limited annotated resources
Empiricist definitions of POS categories • Categories are derived from experience, rather than innate • Solution: define word classes by context • Examples: • class 1: the ___ of • class 2: to ___ • Firth 1957: “You shall know a word by the company it keeps” • Apply a clustering algorithm to induce word classes
Input to POS induction • Data points: words to be clustered • Features: neighboring words • Example: Feat23 = “the word to the left is The” • Data matrix: rows are the words (wordi), columns are the features (featurej) • M[i,j] = # of times word i appears in corpus with feature j
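A minimal sketch of how such a data matrix might be built, assuming a toy corpus and only “word to the left” features (real systems also use right-neighbor features, as described on the next slide):

```python
import numpy as np

# Toy corpus; in practice this would be a large tokenized corpus.
corpus = "the dog saw the cat and the cat saw the dog".split()

words = sorted(set(corpus))                  # words to be clustered (rows)
features = ["L=" + w for w in words]         # feature j: "the word to the left is w"
w2i = {w: i for i, w in enumerate(words)}
f2j = {f: j for j, f in enumerate(features)}

# M[i, j] = number of times word i occurs in the corpus with feature j
M = np.zeros((len(words), len(features)), dtype=int)
for k in range(1, len(corpus)):
    i = w2i[corpus[k]]
    j = f2j["L=" + corpus[k - 1]]
    M[i, j] += 1

print(words)
print(M)
```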
Commonly used features for POS induction • Count frequencies of N-most frequent words to immediate left/right of words to be clustered • Sometimes also 2 words to left/right • N = 200, 500, 1000 • Intuition: N-most frequent words in a corpus are predominantly function words (which tell you about the syntax of the word being clustered) • Prepositions, determiners, modal verbs, etc. • Open-class categories defined by closed-class context
Count co-occurrences of words with contextual features • X: n x m matrix • n training instances (the words to be clustered) • m contextual features (adjacent words) • X[i, j] = fj(Xi), the frequency of feature fj in the context of word Xi • Often normalize each row, so that each row is a probability distribution over features
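A short sketch of the row-normalization step, with a small hardcoded count matrix standing in for a real X:

```python
import numpy as np

# Hypothetical 3 x 4 count matrix X: X[i, j] = frequency of feature j with word i
X = np.array([[1, 3, 0, 2],
              [0, 0, 5, 5],
              [2, 2, 2, 2]], dtype=float)

# Normalize each row so that it sums to 1, i.e. a probability distribution over features
row_sums = X.sum(axis=1, keepdims=True)
X_norm = X / row_sums

print(X_norm)                 # each row is now a distribution over the features
print(X_norm.sum(axis=1))     # [1. 1. 1.]
```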
Example clustering procedure 1. Data matrix encodes context vectors Example: • 1000 words to be clustered (most-frequent in corpus), based on frequency of 100 most-freq words to immediate left and right (i.e., 200 possible contextual features) • 1000 x 200 matrix Then either 2a or 2b: 2a. Similarity first, clustering afterwards (e.g. agglomerative clustering) • Calculate similarity between every pair of vectors. Result: 1000 x 1000 matrix, indicating similarity between every pair of words being clustered • Then apply clustering algorithm to place the 1000 words into separate classes 2b. Apply clustering and similarity simultaneously (e.g. k-means) • Iteratively apply clustering to place the 1000 words into separate classes • Re-compute similarity at each iteration
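A sketch of step 2a under scaled-down, assumed dimensions (10 words and 6 features instead of 1000 and 200, with random counts standing in for real ones); it computes the full pairwise cosine-similarity matrix that an agglomerative clusterer would then take as input:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_feats = 10, 6             # stand-ins for 1000 words x 200 features
X = rng.integers(0, 20, size=(n_words, n_feats)).astype(float)

# Normalize rows to unit length, so the dot product of two rows is their cosine similarity
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / np.maximum(norms, 1e-12)

S = X_unit @ X_unit.T                # n_words x n_words similarity matrix
print(S.shape)                       # (10, 10); would be (1000, 1000) in the slide's example

# S would then be passed to a clustering algorithm (e.g. agglomerative clustering).
```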
Clustering • Want to determine how data points group together into clusters • Data points that are close together in the feature space should belong to the same cluster • We’ll look at the most common clustering algorithms • K-means clustering: specific number of clusters • Agglomerative clustering: number of clusters is not specified in advance
Outline • The POS induction problem • Clustering and similarity • K-means clustering • Some results of POS induction • Evaluation of clustering
Another possible clustering: 2 clusters • Solution you get depends on learning bias of your clustering algorithm
Clustering according to similarity • Group together data points that are similar • i.e., rows in matrix with similar feature vectors • These are similar: • [ 1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8 ] • [ 2, 8, 20, 12, 0, 0, 0, 0, 1, 6, 9, 12 ] • These are not that similar: • [ 1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8 ] • [ 0, 0, 0, 0, 1, 5, 34, 21, 1, 6, 9, 12 ] • These are least similar: • [ 1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8 ] • [ 0, 0, 0, 0, 1, 5, 34, 21, 0, 0, 0, 0 ] • Need to measure the distance or similarity between two vectors
Similarity functions • Choice of similarity function depends on: • Output range required for algorithm • How well it works • Computation time for computing the function • (and whether you prefer to use fancier math) • Also called distance functions, depending on interpretation • Example functions: • Euclidean distance: lower distance = more similar • Cosine similarity: closer to 1 = more similar (angle closer to zero) • Rank correlation: higher correlation = more similar • Jensen-Shannon divergence: lower divergence = more similar
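A sketch of the four measures in plain numpy; the rank computation here ignores ties, the Jensen-Shannon divergence uses log base 2 to get the [0, 1] range, and its inputs are assumed to already be probability distributions. The two vectors are the “similar” pair from the previous slide.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))           # lower = more similar

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))   # closer to 1 = more similar

def spearman(x, y):
    # Simple rank transform (ignores ties), then Pearson correlation of the ranks
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]               # closer to +1 = more similar

def js_divergence(p, q):
    # p, q assumed to be probability distributions (non-negative, summing to 1)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)         # 0 = identical, 1 = maximally different

a = np.array([1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8], dtype=float)
b = np.array([2, 8, 20, 12, 0, 0, 0, 0, 1, 6, 9, 12], dtype=float)
print(euclidean(a, b), cosine_sim(a, b), spearman(a, b))
print(js_divergence(a / a.sum(), b / b.sum()))
```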
Euclidean distance • Euclidean distance is the straight-line distance between two points x and y • Most commonly used distance function • Euclidean distance ranges between 0 and infinity, rather than the bounded range we may want
Cosine: angle between two vectors; range is between -1 and 1
Is there a correlation between IQ and # of hours of TV watched? • (Wikipedia)
Spearman’s rank correlation • Sort according to the first set of data, X • Write out the ranks of X and Y according to the original order
Rank correlation ranges between -1 and 1 • -1: fully negative corr.; +1: fully positive corr. • Measure how far apart the ranks are: di • Spearman’s rank correlation: ρ = 1 − 6 Σ di² / (n(n² − 1)) ≈ −.175 for this example
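Written out, with di the difference between the ranks of item i and n the number of items, and plugging in the figures from the Wikipedia IQ/TV example referenced above (n = 10, Σ di² = 194), which is where the ≈ −.175 comes from:

```latex
\rho \;=\; 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}
     \;=\; 1 - \frac{6 \cdot 194}{10 \cdot (10^2 - 1)}
     \;=\; 1 - \frac{1164}{990}
     \;\approx\; -0.175
```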
Jensen-Shannon divergence • Information-theoretic measure of distance between two probability distributions • Value ranges between 0 and 1 • 0: distributions are maximally similar (i.e., equal) • 1: distributions are maximally dissimilar • http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
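For reference, one standard formulation (M is the average of the two distributions and D_KL the Kullback–Leibler divergence; the [0, 1] range holds when the logarithm is taken base 2):

```latex
\mathrm{JSD}(P \parallel Q) \;=\; \tfrac{1}{2}\, D_{\mathrm{KL}}(P \parallel M) \;+\; \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \parallel M),
\qquad M \;=\; \tfrac{1}{2}\,(P + Q)
```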
Outline • The POS induction problem • Clustering and similarity • K-means clustering • Some results of POS induction • Evaluation of clustering
Previous: K-nearest neighbors classifier • Pick a value for k. • Training: • Training data consists of labeled data points • Remember all the examples in the training data • Classification of a new data point: • Calculate distance between item and all training points • Find k nearest training data points • Look up labels of these k points, choose majority class
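A compact sketch of the classifier just described, on made-up 2-D points with Euclidean distance:

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(train_X - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest points
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Made-up labeled training data
train_X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
train_y = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_classify(np.array([2.0, 2.0]), train_X, train_y, k=3))   # -> "red"
print(knn_classify(np.array([7.5, 8.0]), train_X, train_y, k=3))   # -> "blue"
```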
Classification of a new data point: should it be red triangle or blue square? • Classify by labels of nearest neighbors, use Euclidean dist. • If k=3, neighbors are 2 red, 1 blue → red • If k=5, neighbors are 2 red, 3 blue → blue
Decision boundary gets smoother as k increases. Left: k = 1, Right: k = 15 (from Hastie et al., The Elements of Statistical Learning)
K-means clustering(Unsupervised version of K-nearest neighbors) • Choose k, the number of clusters • Randomly choose k data points as the initial cluster centroids • Iterate until no change in cluster membership: • Assign each data point to its nearest cluster centroid, which produces a clustering of the data points • Recompute each cluster centroid as the mean of the points in the cluster
Centroid of a cluster • A centroid is the center of a cluster of data points • Centroid is not necessarily one of the data points
k-means clustering algorithm (1) • Initialization • Choose a value for k • Choose k random points in the input space • Assign the cluster centroids μj to these positions
k-means clustering algorithm (2) • Learning. Repeat until clusters converge: • For each data point xi: • Compute distance to each centroid • Assign the data point to the nearest cluster, i.e. the cluster j whose centroid μj has minimum distance d(xi, μj) • For each cluster j: • Move the position of its centroid to the mean of the points in that cluster: μj = (1/Nj) Σ xi, summing over the points xi assigned to cluster j (Nj is the number of points in cluster j)
k-means clustering algorithm (3) • Usage (classification) • For each test point: • Compute the distance to each centroid • Assign the data point to the nearest cluster, i.e. the one whose centroid has minimum distance
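Putting the three slides together, a sketch in numpy (toy data; the initial centroids are chosen as random data points, as in the earlier description):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: returns (centroids, cluster assignments)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the initial cluster centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    for _ in range(n_iter):
        # Assignment step: each data point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                          # no change in cluster membership: converged
        assign = new_assign
        # Update step: move each centroid to the mean of the points in its cluster
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign

# Toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
centroids, assign = kmeans(X, k=2)
print(centroids)
```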
Initialization: randomly select k = 3 points to be cluster centroids
Initialization: assign each data point to its nearest centroid
Optimization criterion • K-means clustering is minimizing the within-cluster scatter • Pick a value for k • For each cluster, compute the distance between every pair of points xi and xi’ • Sum within-cluster scatter over all clusters
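One common way to write this criterion, using squared Euclidean distance (some presentations drop the 1/2 or use a general distance d(xi, xi′)); Cj is the set of points in cluster j:

```latex
W(C) \;=\; \frac{1}{2} \sum_{j=1}^{k} \sum_{x_i \in C_j} \sum_{x_{i'} \in C_j} \lVert x_i - x_{i'} \rVert^{2}
```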
Initialization • We may end up with poor results if we make a poor choice of data points for initial cluster centroids