LING / C SC 439/539 Statistical Natural Language Processing. Lecture 21, 4/3/2013. Recommended reading. Banko & Brill. 2001. Scaling to very very large corpora for natural language disambiguation.
LING / C SC 439/539 Statistical Natural Language Processing • Lecture 21 • 4/3/2013
Recommended reading • Banko & Brill. 2001. Scaling to very very large corpora for natural language disambiguation. • Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of COLT. • Best 10-year paper, awarded in 2008 • Thorsten Joachims. 1999. Transductive inference for text classification using Support Vector Machines. ICML. • Best 10-year paper, awarded in 2009 • David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. ACL. • Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. EMNLP.
Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4
Data quantity vs. performance • NLP: fully annotated data sets for testing machine learning algorithms • WSJ (1.3 million words) • Brown corpus (1 million words) • Prague dependency treebank (2 million words) • What happens when we train on much larger data sets?
Banko & Brill 2001 • “Confusion set disambiguation” • { principle, principal } { then, than } • { to, two, too } { weather, whether } • Corpus generation • Replace all occurrences by a marker • The school hired a new principal • The school hired a new PRINCIPLE/PRINCIPAL. • Easy to generate data sets for very large corpora • 1 billion word corpus • Task • Algorithm must choose correct word • Similar to word sense disambiguation
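The corpus generation step can be sketched in a few lines of Python. This is a minimal illustration, not Banko & Brill's actual pipeline: the function name `make_examples`, the tokenization regex, and the tuple layout are assumptions made for the example; only the confusion sets and the sample sentence come from the slide.

```python
import re

# Confusion sets taken from the slide's examples.
CONFUSION_SETS = [
    {"principle", "principal"},
    {"then", "than"},
    {"to", "two", "too"},
    {"weather", "whether"},
]

def make_examples(sentence, confusion_sets=CONFUSION_SETS):
    """Return (masked_tokens, position, candidates, gold_word) tuples.

    Every occurrence of a confusion-set member is replaced by a marker,
    and the original word is kept as the label to predict.
    """
    # Hypothetical tokenization: lowercase alphabetic tokens only.
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    examples = []
    for i, tok in enumerate(tokens):
        for cset in confusion_sets:
            if tok in cset:
                candidates = sorted(cset)
                marker = "/".join(candidates).upper()
                masked = tokens[:i] + [marker] + tokens[i + 1:]
                examples.append((masked, i, candidates, tok))
    return examples

examples = make_examples("The school hired a new principal")
for masked, position, candidates, gold in examples:
    print(" ".join(masked), "| gold:", gold)
```

Because the label is simply the word that was originally in the text, no human annotation is needed, which is what makes billion-word training sets feasible for this task.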
Banko & Brill 2001 • Algorithms tested • Winnow • Perceptron • Naive Bayes • Memory-based / nearest-neighbor • Features: • Words within a window • POS n-grams • Word n-grams
Banko & Brill 2001 • Conclusions: • Quantity of data positively affects performance (more data is better) • Relative performance of different algorithms differs depending on amount of training data (this is disturbing; makes standardized test sets for algorithm comparison seem less meaningful)
Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4
Annotated data is expensive ($ and time) • (from J. Zhu via A. Blum)
Data, annotation, and performance • Not much labeled data (annotated) • Lots of unlabeled data (unannotated) • Limited level of performance from training on labeled data only • Can we use unlabeled data to improve performance?
Utilizing unlabeled data • Easy to collect unlabeled data • Existing corpora containing billion(s) of words • Internet • Unlabeled data: • Missing the most important information (labels) • But there are other statistical regularities that can be exploited
Amount of supervision • Supervised learning: • Given a sample of object-label pairs (xi , yi), find the predictive relationship between objects and labels • Unsupervised learning: • Discover structures in unlabeled data • Semi-supervised learning: use both labeled and unlabeled data • Supervised learning + additional unlabeled data • Unsupervised learning + additional labeled data (“bootstrapping”)
Semi-supervised learning algorithms and applications • Supervised learning + additional unlabeled data • Transductive SVM • Co-training • Web page classification • Unsupervised learning + additional labeled data (“bootstrapping”) • Yarowsky algorithm; bootstrapping with seeds • Word sense disambiguation • Co-training with decision list • Named Entity Recognition
Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4
Inductive vs. transductive SVM • Inductive • Find max margin hyperplane on training set • Standard SVM algorithm • Transductive • Useful when only a small amount of data is labeled • Goal is really to minimize error on test set • Take testing data into account when finding max margin hyperplane
Inductive vs. transductive SVM • Transductive SVM has better performance than standard SVM • [Figure: max margin hyperplane of the inductive SVM vs. max margin hyperplane of the transductive SVM; additional unlabeled data points are assigned to the nearest class from the labeled data]
Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4
Co-training: Blum & Mitchell 1998 • Combines 2 ideas • Semi-supervised training • Small amount of labeled data • Larger amount of unlabeled data • Use two supervised classifiers simultaneously • Outperforms a single classifier
Example problem • Collected a data set • 1051 web pages from CS departments at 4 universities • Manually labeled as + or - + is a course home page (22% of web pages) - is not a course home page (rest of web pages) • Use Naïve Bayes to classify web pages
Features for web page classification • x1: text in hyperlinks (bag of words) <a href = … >CS 100, Fall semester</a> • x2: text in the page (bag of words) <html>Exam #1</html> • Training instances contain both features: x = (x1, x2) • [Figure: a "My Advisor" link pointing to the page "Prof. Avrim Blum"; x1 = link info, x2 = text info, x = link info & text info]
Views • A sufficient set of features is called a view • Each view by itself is sufficient to produce an optimal classifier • For web page example, pages can be classified accurately with either text or hyperlinks • Two views are conditionally independent (given the label) • p(x1|x2, y) = p(x1|y) • p(x2|x1, y) = p(x2|y)
Co-Training algorithm • Start with small portion of labeled data • Train two classifiers from the same pool of data; each classifier is based on a different “view” • Use the two classifiers to label the data set • Data points that are classified with high confidence are added to pool of labeled data • Amount of labeled data gradually increases until it covers the entire data set
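The loop described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not Blum & Mitchell's exact procedure: the class `TinyNB`, the function `co_train`, the confidence `threshold`, and the toy web-page data are all hypothetical, and the sketch adds every confidently classified point each round rather than a fixed number of positives and negatives.

```python
import math
from collections import Counter

class TinyNB:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for words, y in zip(docs, labels):
            self.counts[y].update(words)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, words):
        logp = {}
        for c in self.classes:
            # +1 in the denominator reserves mass for unseen words.
            total = sum(self.counts[c].values()) + len(self.vocab) + 1
            lp = math.log(self.prior[c])
            for w in words:
                lp += math.log((self.counts[c][w] + 1) / total)
            logp[c] = lp
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return {c: math.exp(v - m) / z for c, v in logp.items()}

def co_train(labeled, unlabeled, rounds=10, threshold=0.9):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        ys = [y for _, y in labeled]
        # One classifier per view, both trained on the same labeled pool.
        h1 = TinyNB().fit([x[0] for x, _ in labeled], ys)
        h2 = TinyNB().fit([x[1] for x, _ in labeled], ys)
        still_unlabeled = []
        for x in unlabeled:
            # Take the more confident of the two view-specific predictions.
            guesses = [max(h.predict_proba(v).items(), key=lambda kv: kv[1])
                       for h, v in ((h1, x[0]), (h2, x[1]))]
            label, conf = max(guesses, key=lambda kv: kv[1])
            if conf >= threshold:
                labeled.append((x, label))
            else:
                still_unlabeled.append(x)
        if len(still_unlabeled) == len(unlabeled):
            break  # nothing new was labeled, so stop
        unlabeled = still_unlabeled
    return h1, h2, labeled, unlabeled

# Toy data: x1 = words in hyperlinks, x2 = words in the page.
labeled = [
    ((["cs", "100", "fall"], ["exam", "homework", "syllabus"]), "+"),
    ((["my", "advisor"], ["research", "publications"]), "-"),
]
unlabeled = [
    (["cs", "200", "fall"], ["exam", "syllabus", "grading"]),
    (["advisor", "prof"], ["research", "papers"]),
]
h1, h2, grown, remaining = co_train(labeled, unlabeled, threshold=0.7)
```

With only two labeled pages the classifiers are never very confident, so the toy call lowers the threshold to 0.7; on a realistic data set a stricter threshold would limit the number of wrong labels that enter the pool.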
Error rate in classifying web pages • Combined classifier • Supervised: combine features with Naïve Bayes: p(cj|x1,2) = p(cj|x1)p(cj|x2) • Co-training: use both page-based and hyperlink-based classifiers
Co-training: error rate vs. # iterations • [Figure: error rate vs. number of co-training iterations for the page-based classifier and the hyperlink-based classifier; baseline: always predict “not a course web page”]
Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4
Decision List • A simple discriminative classifier • Compute argmaxC p(C|F) • Compare: p(C1|f1), p(C2|f1), … p(C1|fn), p(C2|fn) • Choose class based on largest difference in p( Ci | fj ) for a feature fj in the data to be classified
Decision List for WSD: p(sense|feature) • The decision list compares the conditional probabilities of senses given various features, to determine the probabilistically most likely sense for a word. • Example: disambiguate ‘bank’ in this sentence: • I checked my boat at the marina next to the bank of the river near the shore. • p( money-sense | ‘check’ ) • p( river-sense | ‘check’ ) • … • p( money-sense | ‘shore’ ) • p( river-sense | ‘shore’ ) (let’s say this has the highest probability)
Automatically build disambiguation system • Yarowsky’s method: • Get corpus with words annotated for different categories • Formulate templates for generation of disambiguating rules • Algorithm constructs all such rules from a corpus • Algorithm selects relevant rules through statistics of usage for each category • Methodology can be applied to any binary disambiguation problem
[Diagram: rule templates + annotated corpus → possible rules → statistics of usage → ranked rules]
Decision list algorithm: step 1, identify ambiguities • Example problem: accent restoration
Step 2: Collect training contexts • Begin with an annotated corpus • (In this context, a corpus with accents indicated)
Step 3: Specify rule templates • Given a particular training context, collect: • Word immediately to the right (+1 W) or left (-1 W) • Word found in ±k word window • Pair of words at fixed offsets • Other evidence can be used: • Lemma (morphological root) • Part of speech category • Other types of word classes (e.g. set of days of week)
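The first three templates can be instantiated with a short function. This is a sketch under assumptions: the function name `extract_features`, the feature names (`"+1W"`, `"WIN"`, and so on), the choice of three specific offset pairs, and the default window size k=3 are illustrative and not taken from the slides.

```python
def extract_features(tokens, i, k=3):
    """Instantiate the rule templates for the target word at tokens[i]."""
    feats = []
    n = len(tokens)
    # Word immediately to the right (+1 W) or left (-1 W).
    if i + 1 < n:
        feats.append(("+1W", tokens[i + 1]))
    if i - 1 >= 0:
        feats.append(("-1W", tokens[i - 1]))
    # Any word found in the +/- k word window around the target.
    for j in range(max(0, i - k), min(n, i + k + 1)):
        if j != i:
            feats.append(("WIN", tokens[j]))
    # Pairs of words at fixed offsets (three hypothetical offset pairs).
    if i - 2 >= 0:
        feats.append(("-2W,-1W", (tokens[i - 2], tokens[i - 1])))
    if i - 1 >= 0 and i + 1 < n:
        feats.append(("-1W,+1W", (tokens[i - 1], tokens[i + 1])))
    if i + 2 < n:
        feats.append(("+1W,+2W", (tokens[i + 1], tokens[i + 2])))
    return feats

tokens = "the crippled nuclear plant in japan".split()
feats = extract_features(tokens, tokens.index("plant"))
for name, value in feats:
    print(name, value)
```

Each (template, value) pair is one candidate rule; lemma, part-of-speech, or word-class templates could be added in the same way if those annotations are available.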
Which rules are indicative of a category? • Two categories c1 and c2; p(c1|rule) + p(c2|rule) = 1 • Log-likelihood ratio: log( p(c1|rule) / p(c2|rule) ) • If p(c1|rule) = 0.5 and p(c2|rule) = 0.5, doesn’t distinguish log( p(c1 | rule) / p(c2 | rule) ) = 0 • If p(c1|rule) > 0.5 and p(c2|rule) < 0.5, c1 is more likely log( p(c1 | rule) / p(c2 | rule) ) > 0 • If p(c1|rule) < 0.5 and p(c2|rule) > 0.5, c2 is more likely log( p(c1 | rule) / p(c2 | rule) ) < 0
Which rules are best for disambiguating between categories? • Use absolute value of log-likelihood ratio: abs(log( p(sense1 | rule) / p(sense2 | rule) )) • Rank rules by abs. value of log-likelihood ratio • Rules that best distinguish between the two categories are ranked highest
Step 5: Choose rules that are indicative of categories: sort by abs(LogL) • This is the final decision list
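Ranking rules by the absolute log-likelihood ratio can be sketched as below. The function name `build_decision_list`, the smoothing constant `alpha`, and the toy "plant" examples are assumptions for illustration; the slides do not specify how zero counts are handled, so the additive smoothing here is just one simple option.

```python
import math
from collections import Counter, defaultdict

def build_decision_list(examples, c1, c2, alpha=0.1):
    """examples: list of (features, label). Returns rules sorted by abs(LogL).

    Each rule is (abs_log_likelihood_ratio, feature, predicted_category).
    """
    counts = defaultdict(Counter)
    for features, label in examples:
        for f in set(features):
            counts[f][label] += 1
    rules = []
    for f, cnt in counts.items():
        # alpha is an assumed smoothing constant that avoids log(0).
        n1, n2 = cnt[c1] + alpha, cnt[c2] + alpha
        # p(c1|f) / p(c2|f) equals n1 / n2 because the denominators cancel.
        llr = math.log(n1 / n2)
        if llr != 0:  # a ratio of exactly 0 does not distinguish the categories
            rules.append((abs(llr), f, c1 if llr > 0 else c2))
    rules.sort(key=lambda r: r[0], reverse=True)
    return rules

# Toy training contexts for "plant" (hypothetical data).
examples = [
    (["manufacturing", "workers"], "factory"),
    (["nuclear", "workers"], "factory"),
    (["life", "species"], "living"),
    (["species", "workers"], "living"),
]
rules = build_decision_list(examples, "factory", "living")
for score, feature, sense in rules:
    print(round(score, 2), feature, "->", sense)
```

In the toy data, "workers" appears with both senses and therefore lands at the bottom of the list, while features seen with only one sense rank near the top.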
Step 6: classify new data with decision list • For a sentence with a word to be disambiguated: • Go down the ranked list of rules in the decision list • Find the first rule with a matching context • Assign a sense according to that rule • Finished. • Ignore other lower-ranked rules, even if they have matching contexts as well
Example: disambiguate “plant” • Radiation from the crippled nuclear plant in Japan is showing up in rain in the United States.
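Step 6 applied to this sentence might look like the following sketch. The function name `classify`, the three rules, and their scores are hypothetical values chosen for illustration, not output from a trained model.

```python
def classify(context_words, decision_list, default=None):
    """Return (sense, matched_feature) from the first matching rule."""
    context = set(context_words)
    for score, feature, sense in decision_list:
        if feature in context:
            # Stop at the first match; lower-ranked rules are ignored.
            return sense, feature
    return default, None

# Hypothetical decision list, already sorted by abs(LogL), highest first.
decision_list = [
    (3.0, "species", "living"),
    (2.4, "nuclear", "factory"),
    (0.6, "rain", "living"),
]
sentence = ("radiation from the crippled nuclear plant in japan "
            "is showing up in rain in the united states").split()
sense, rule = classify(sentence, decision_list)
print(sense, "(matched rule:", rule + ")")
```

Both "nuclear" and "rain" occur in the sentence, but only the higher-ranked "nuclear" rule is used, so the lower-ranked rule pointing to the other sense has no effect.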
Limitations of supervised WSD • Practical issue in applying the algorithm to WSD: need a corpus tagged for word senses • If you had a large, fully annotated corpus, WSD would be easy enough • But producing such a corpus would be extremely laborious • Senseval and other corpora only provide partial coverage • Another problem: each word is a unique disambiguation problem • Later: apply this algorithm in a semi-supervised setting (Yarowsky 1995)
Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4
Bootstrapping • [Images: bootstraps]
Yarowsky 1995 • “Unsupervised word sense disambiguation rivaling supervised methods” • Actually semi-supervised bootstrapping: • Very small amount of human-annotated data • Iterative procedure for label propagation over unlabeled data set
One sense per discourse hypothesis • “Words tend to exhibit only one sense in a given discourse or document”
Step 1: identify all examples of target word • Store contexts in initial untagged training set
Step 2: tag a small number of examples • For each sense, label a small set of training examples by hand • Do this according to seed words (features) • Example, “plant”: • manufacturing vs. life
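Seed-based tagging can be sketched as follows. This is a minimal illustration assuming one seed word per sense and that the seed words are literally "manufacturing" and "life"; the function name `seed_label` and the toy contexts are hypothetical.

```python
def seed_label(contexts, seeds):
    """Split contexts into seed-labeled examples and an untagged residual.

    seeds maps each sense name to its seed word. A context is labeled only
    if it contains the seed word of exactly one sense.
    """
    labeled, residual = [], []
    for ctx in contexts:
        words = set(ctx)
        matches = {sense for sense, seed in seeds.items() if seed in words}
        if len(matches) == 1:
            labeled.append((ctx, matches.pop()))
        else:
            residual.append(ctx)  # no seed, or conflicting seeds
    return labeled, residual

# One seed word per sense of "plant", following the slide's example.
seeds = {"manufacturing": "manufacturing", "life": "life"}
contexts = [
    ["plant", "life", "species"],
    ["manufacturing", "plant", "workers"],
    ["plant", "closed"],
    ["plant", "life", "manufacturing"],
]
labeled, residual = seed_label(contexts, seeds)
print(len(labeled), "labeled;", len(residual), "left untagged")
```

The labeled subset plays the role of the small hand-tagged training set, and the residual is what the iterative procedure would try to label in later rounds.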