
LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 25 • 4/17/2013

  2. Recommended reading • http://en.wikipedia.org/wiki/Cluster_analysis • C. Christodoulopoulos, S. Goldwater, & M. Steedman. 2010. Two Decades of Unsupervised POS induction: How far have we come? Proceedings of EMNLP • Marie Labelle. 2005. The acquisition of grammatical categories: a state of the art. In Henri Cohen & Claire Lefebvre (eds.), Handbook of Categorization in Cognitive Science, Elsevier, 433-457.

  3. Outline • The POS induction problem • Clustering and similarity • K-means clustering • Some results of POS induction • Evaluation of clustering

  4. POS categories: definitions • Traditional prescriptive grammar • Nouns, Verbs, Adjectives, Prepositions, etc. • Generative grammar: • Nouns, Verbs, Adjectives, Prepositions, etc. • Chomsky 1970: [ +/-N, +/-V ] • Noun: [ +N, -V ] • Verb: [ -N, +V ] • Adj: [ +N, +V ] • Prep: [ -N, -V ] • Corpus tag sets • Penn Treebank tags (45 for English), other corpora

  5. Open- or closed-class • Open-class • Nouns, Verbs, Adjectives • Freely add new members • Closed-class • Prepositions, Determiners, Conjunctions, etc. • English adverbs: not quite open- or closed-class

  6. Words vs. POS tags as features • Highly specific features are more precise, but probability estimation is more difficult • One solution is backing off • Example from smoothing: back off p(Z | X, Y ) to p(Z | Y ) • POS tags define equivalence classes for words • The use of POS tags in place of words can also be viewed as backing off • Example: p(farmer|DT, JJ) instead of p(farmer|the, happy)
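To make the backoff intuition concrete, here is a minimal sketch (not from the lecture) that estimates a word's probability from a raw word context versus a POS-tag context; the tiny tagged corpus and variable names are invented for illustration.

```python
# Minimal sketch (invented corpus, not the lecture's data): estimating a word's
# probability from a word context vs. a POS-tag context. Tag contexts pool
# counts over more examples, so their estimates are less sparse.
from collections import Counter

corpus = [("the", "DT"), ("happy", "JJ"), ("farmer", "NN"),
          ("the", "DT"), ("cheerful", "JJ"), ("farmer", "NN")]

word_tri, word_bi = Counter(), Counter()   # (w_{i-2}, w_{i-1}, w_i) and (w_{i-2}, w_{i-1})
tag_tri, tag_bi = Counter(), Counter()     # (t_{i-2}, t_{i-1}, w_i) and (t_{i-2}, t_{i-1})

for i in range(2, len(corpus)):
    (w2, t2), (w1, t1), (w0, _) = corpus[i - 2], corpus[i - 1], corpus[i]
    word_tri[(w2, w1, w0)] += 1
    word_bi[(w2, w1)] += 1
    tag_tri[(t2, t1, w0)] += 1
    tag_bi[(t2, t1)] += 1

# p(farmer | the, happy): estimated from a single matching word context
print(word_tri[("the", "happy", "farmer")] / word_bi[("the", "happy")])
# p(farmer | DT, JJ): both adjective contexts contribute to the estimate
print(tag_tri[("DT", "JJ", "farmer")] / tag_bi[("DT", "JJ")])
```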

  7. Problems with traditional / generative grammar definitions of POS categories • Categories of traditional prescriptive grammar based on English and Latin • Used in generative linguistics, but not all linguistic theories • Typological problems: • Not universal: other languages have different sets of categories • Many languages don’t distinguish adjectives from stative intransitive verbs: Choctaw, Slave, Mojave, Hopi, Edo, Yoruba • Also Chinese • Many languages don’t distinguish adjectives from nouns (Quechua, Nahuatl, Greenlandic Eskimo, Australian languages) • Even the Noun/Verb contrast is controversial (Wakashan and Salish families, some Austronesian languages) • Field linguists sometimes disagree about the inventory of categories in a language

  8. More problems of POS tags • POS tag sets are not necessarily the best way to group words into equivalence classes • Words in the same POS may have very different statistical properties • Number of POS tags for a language? • Traditional grammars have a small number of word categories • Corpus POS tag sets vary in size • Penn Treebank: 45 + punctuation; CLAWS: 132 • Example of a larger tag set: http://www.natcorp.ox.ac.uk/docs/c7spec.html • Larger number of finer-grained classes may be more desirable in statistical models • Example: improved performance in statistical parsing • “Low density” languages, with limited annotated resources

  9. Results for unlexicalized parsers (outperforms best lexicalized parser), F-measure: • Naïve Treebank grammar: 72.6 • Manually split: 86.3 (Klein & Manning 2003) • Automatically split, same # of clusters per tag: 86.7 (Matsuzaki et al. 2005) • Automatically split, varying number of clusters per tag: 89.7 (Petrov et al. 2006)

  10. Discover word classes • Automatically discover word classes that better reflect the statistical properties of the data • Define according to local distribution in sentence • Apply clustering algorithms (unsupervised) • Also useful for languages with limited annotated resources

  11. Empiricist definitions of POS categories • Categories are derived from experience, rather than innate • Solution: define word classes by context • Examples: • class 1: the ___ of • class 2: to ___ • Firth 1957: “You shall know a word by the company it keeps” • Apply a clustering algorithm to induce word classes

  12. Input to POS induction • Data points: words to be clustered • Features: neighboring words • Example: Feat23 = “the word to the left is The” • Data matrix • M[i,j] = # of times word i appears in corpus with feature j (rows: words i, columns: features j)

  13. Commonly used features for POS induction • Count frequencies of N-most frequent words to immediate left/right of words to be clustered • Sometimes also 2 words to left/right • N = 200, 500, 1000 • Intuition: N-most frequent words in a corpus are predominantly function words (which tell you about the syntax of the word being clustered) • Prepositions, determiners, modal verbs, etc. • Open-class categories defined by closed-class context

  14. Count co-occurrences of words with contextual features • X: n x m matrix • n training instances (words to be clustered) • m contextual features (adjacent words) • X[i, j] = f_j(x_i), the frequency of feature f_j in the context of word x_i • Often normalize each row, so that each row is a probability distribution over features
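As a concrete, purely illustrative sketch of building this matrix, the snippet below counts immediate left/right neighbors for the most frequent words of a toy corpus and then row-normalizes; the toy sentence, the sizes, and all variable names are placeholders, not the lecture's setup.

```python
# A rough, illustrative sketch of building the word-by-feature count matrix X
# and row-normalizing it; the toy sentence, sizes, and names are placeholders.
from collections import Counter
import numpy as np

tokens = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(tokens)

targets = [w for w, _ in counts.most_common(5)]        # words to be clustered
context_words = [w for w, _ in counts.most_common(4)]  # most frequent words used as context

# two features per context word: "w immediately to the left" / "w immediately to the right"
feat_index = {}
for w in context_words:
    feat_index[("L", w)] = len(feat_index)
    feat_index[("R", w)] = len(feat_index)
word_index = {w: i for i, w in enumerate(targets)}

X = np.zeros((len(targets), len(feat_index)))
for i, w in enumerate(tokens):
    if w not in word_index:
        continue
    if i > 0 and ("L", tokens[i - 1]) in feat_index:
        X[word_index[w], feat_index[("L", tokens[i - 1])]] += 1
    if i + 1 < len(tokens) and ("R", tokens[i + 1]) in feat_index:
        X[word_index[w], feat_index[("R", tokens[i + 1])]] += 1

# normalize each row into a probability distribution over features
row_sums = X.sum(axis=1, keepdims=True)
X_norm = np.divide(X, row_sums, out=np.zeros_like(X), where=row_sums > 0)
```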

  15. Example clustering procedure 1. Data matrix encodes context vectors Example: • 1000 words to be clustered (most-frequent in corpus), based on frequency of 100 most-freq words to immediate left and right (i.e., 200 possible contextual features) • 1000 x 200 matrix Then either 2a or 2b: 2a. Similarity first, clustering afterwards (e.g. agglomerative clustering) • Calculate similarity between every pair of vectors. Result: 1000 x 1000 matrix, indicating similarity between every pair of words being clustered • Then apply clustering algorithm to place the 1000 words into separate classes 2b. Apply clustering and similarity simultaneously (e.g. k-means) • Iteratively apply clustering to place the 1000 words into separate classes • Re-compute similarity at each iteration
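Route 2a could look roughly like the following SciPy sketch; the 1000 x 200 matrix is simulated with random numbers as a stand-in for a real normalized count matrix, and the choices of cosine distance, average linkage, and 45 clusters are illustrative assumptions, not the lecture's.

```python
# Sketch of route 2a: similarity first, then agglomerative clustering.
# Random data stands in for the real 1000 x 200 normalized count matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((1000, 200))

condensed = pdist(X, metric="cosine")              # pairwise distances between all word vectors
similarity = 1 - squareform(condensed)             # 1000 x 1000 similarity matrix (step 1)

Z = linkage(condensed, method="average")           # agglomerative clustering (step 2)
labels = fcluster(Z, t=45, criterion="maxclust")   # one cluster id per word
```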

  16. Clustering • Want to determine how data points group together into clusters • Data points that are close together in the feature space should belong to the same cluster • We’ll look at the most common clustering algorithms • K-means clustering: specific number of clusters • Agglomerative clustering: number of clusters is not specified in advance

  17. Outline • The POS induction problem • Clustering and similarity • K-means clustering • Some results of POS induction • Evaluation of clustering

  18. Points in feature space: cluster “similar” points together

  19. One possible clustering: 3 clusters

  20. Another possible clustering: 2 clusters

  21. Another possible clustering: 2 clusters • Solution you get depends on learning bias of your clustering algorithm

  22. Clustering according to similarity • Group together data points that are similar • i.e., rows in matrix with similar feature vectors • These are similar: • [ 1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8 ] • [ 2, 8, 20, 12, 0, 0, 0, 0, 1, 6, 9, 12 ] • These are not that similar: • [ 1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8 ] • [ 0, 0, 0, 0, 1, 5, 34, 21, 1, 6, 9, 12 ] • These are least similar: • [ 1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8 ] • [ 0, 0, 0, 0, 1, 5, 34, 21, 0, 0, 0, 0 ] • Need to measure the distance or similarity between two vectors

  23. Similarity functions • Choice of similarity function depends on: • Output range required for algorithm • How well it works • Computation time for computing the function (if you prefer to use fancier math) • Also called distance functions, depending on interpretation • Example functions: • Euclidean distance: lower distance = more similar • Cosine similarity: closer to 1 (angle closer to zero) = more similar • Rank correlation: higher correlation = more similar • Jensen-Shannon divergence: lower divergence = more similar
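For instance, applying Euclidean distance and cosine similarity to the example vectors from slide 22 gives numbers that match the intuition above (this quick check is mine, not part of the slides):

```python
# Euclidean distance and cosine similarity on the example vectors of slide 22.
import numpy as np

a = np.array([1, 3, 45, 23, 0, 0, 0, 0, 2, 3, 6, 8], dtype=float)
b = np.array([2, 8, 20, 12, 0, 0, 0, 0, 1, 6, 9, 12], dtype=float)
c = np.array([0, 0, 0, 0, 1, 5, 34, 21, 1, 6, 9, 12], dtype=float)

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))                      # lower = more similar

def cosine_sim(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))    # closer to 1 = more similar

print(euclidean(a, b), cosine_sim(a, b))   # similar pair: distance ~28, cosine ~0.90
print(euclidean(a, c), cosine_sim(a, c))   # dissimilar pair: distance ~65, cosine ~0.08
```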

  24. Euclidean distance • Euclidean distance is the straight-line distance between two points x and y • Most commonly used distance function • Euclidean distance ranges between 0 and infinity, which is not the bounded range we would like for a similarity score

  25. Cosine: angle between two vectors; range is between -1 and 1 http://www.bhpanthers.org/6067182020424/lib/6067182020424/cosine.gif http://media.wiley.com/CurrentProtocols/MB/mb2205/mb2205-fig-0002-1-full.gif

  26. Is there a correlation between IQ and # of hours of TV watched? • (Wikipedia)

  27. Spearman’s rank correlation • Sort according to first set of data, X • State the ranks of X and Y according to original order

  28. Rank correlation ranges between -1 and 1 • -1: fully negative correlation; +1: fully positive correlation • Measure how far apart the ranks are: d_i • Spearman’s rank correlation: ρ = 1 − 6 Σ d_i² / (n(n² − 1)) = .175
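A small sketch of Spearman's formula is below; the data are invented (they are not the IQ/TV-hours example from the slides), and ties are ignored for simplicity.

```python
# Spearman's rank correlation, rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
# The data below are invented and assume no tied values.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank                      # rank 1 = smallest value
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman([106, 100, 86, 101, 99], [7, 27, 2, 50, 28]))   # 0.3 for this toy data
```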

  29. Jensen-Shannon divergence • Information-theoretic measure of distance between two probability distributions • Value ranges between 0 and 1 • 0: distributions are maximally similar (i.e., equal) • 1: distributions are maximally dissimilar • http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
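One way to compute it is sketched below, with base-2 logarithms so the value stays in [0, 1]; the two toy distributions are mine, and the inputs are assumed to already be probability distributions (e.g. row-normalized context vectors).

```python
# Minimal Jensen-Shannon divergence sketch (base-2 logs, so values fall in [0, 1];
# 0 means the two distributions are identical).
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence, ignoring zero-probability entries of p
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(jsd(p, p))   # 0.0: identical distributions
print(jsd(p, q))   # 0.5: partially overlapping distributions
```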

  30. Outline • The POS induction problem • Clustering and similarity • K-means clustering • Some results of POS induction • Evaluation of clustering

  31. Previous: K-nearest neighbors classifier • Pick a value for k. • Training: • Training data consists of labeled data points • Remember all the examples in the training data • Classification of a new data point: • Calculate distance between item and all training points • Find k nearest training data points • Look up labels of these k points, choose majority class
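A bare-bones version of this classifier might look like the sketch below (Euclidean distance, majority vote); the training data and names are made up.

```python
# k-nearest-neighbors sketch: memorize the labeled points, then classify by
# majority vote among the k closest training points (Euclidean distance).
from collections import Counter
import numpy as np

def knn_classify(train_X, train_y, x, k):
    dists = np.linalg.norm(train_X - x, axis=1)       # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)      # labels of those k points
    return votes.most_common(1)[0][0]                 # majority class

train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.0], [4.2, 3.9]])
train_y = ["red", "red", "red", "blue", "blue"]
print(knn_classify(train_X, train_y, np.array([1.1, 1.0]), k=3))   # -> "red"
```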

  32. Classification of a new data point: should it be red triangle or blue square? • Classify by labels of nearest neighbors, use Euclidean dist. • If k=3, neighbors are 2 red, 1 blue → red • If k=5, neighbors are 2 red, 3 blue → blue https://kiwi.ecn.purdue.edu/rhea/images/5/58/KNN_Old_Kiwi.jpg

  33. Decision boundary gets smoother as k increases. Left: k = 1; right: k = 15. From Hastie et al., The Elements of Statistical Learning

  34. K-means clustering(Unsupervised version of K-nearest neighbors) • Choose k, the number of clusters • Randomly choose k data points as the initial cluster centroids • Iterate until no change in cluster membership: • Assign each data point to its nearest cluster centroid, which produces a clustering of the data points • Recompute each cluster centroid as the mean of the points in the cluster

  35. Centroid of a cluster • A centroid is the center of a cluster of data points • Centroid is not necessarily one of the data points


  37. k-means clustering algorithm (1) • Initialization • Choose a value for k • Choose k random points in the input space • Assign the cluster centroids μ_j to these positions

  38. k-means clustering algorithm (2) • Learning • Repeat until clusters converge: • For each data point x_i: • Compute the distance d(x_i, μ_j) to each centroid μ_j • Assign the data point to the nearest cluster, i.e. the cluster whose centroid has distance min_j d(x_i, μ_j) • For each cluster: • Move the position of its centroid to the mean of the points in that cluster (N_j is the number of points in cluster j): μ_j = (1/N_j) Σ_{x_i ∈ cluster j} x_i

  39. k-means clustering algorithm (3) • Usage (classification) • For each test point x: • Compute the distance to each centroid • Assign the data point to the nearest cluster, i.e. the cluster whose centroid has distance min_j d(x, μ_j)
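Putting slides 37-39 together, a compact, illustrative (and unoptimized) implementation could look like the sketch below; the synthetic 2-D data are invented, and convergence is checked by unchanged cluster membership as described above.

```python
# k-means sketch: random initialization, then alternate assignment and
# centroid updates until cluster membership stops changing.
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # k random data points
    labels = np.full(len(X), -1)
    while True:
        # assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):                 # membership unchanged: stop
            return centroids, labels
        labels = new_labels
        # update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)

# three well-separated synthetic clusters in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
centroids, labels = kmeans(X, k=3)
```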

  40. Initialization: randomly select k = 3 points to be cluster centroids

  41. Initialization: assign each data point to its nearest centroid

  42. Recompute centroids

  43. Assign to nearest centroid

  44. Stop because cluster membership doesn’t change

  45. Optimization criterion • K-means clustering is minimizing the within-cluster scatter • Pick a value for k • For each cluster, compute the distance between every pair of points x_i and x_i′ • Sum within-cluster scatter over all clusters
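A direct, unoptimized way to evaluate this criterion for a given clustering is sketched below; whether an extra normalization (e.g. by cluster size) is applied varies by presentation, and none is applied here.

```python
# Within-cluster scatter: sum over clusters of the squared distances between
# every pair of points assigned to the same cluster.
import numpy as np

def within_cluster_scatter(X, labels):
    total = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        diffs = pts[:, None, :] - pts[None, :, :]   # all ordered pairs in cluster j
        total += np.sum(diffs ** 2) / 2.0           # halve: each pair was counted twice
    return total

# e.g. reusing X and labels from the k-means sketch, the k-means clustering
# should score much lower than random labels:
# within_cluster_scatter(X, labels)
# within_cluster_scatter(X, np.random.default_rng(2).integers(0, 3, len(X)))
```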

  46. Initialization • We may end up with poor results if we make a poor choice of data points for initial cluster centroids

  47. New example: k=3, bad initialization

  48. Recompute centroids

  49. Algorithm is stuck in current configuration
