LING / C SC 439/539 Statistical Natural Language Processing, Lecture 8, 2/6/2013
Recommended Reading • Support Vector Machines • Hastie Chapter 12, sections 1-3 • http://www-stat.stanford.edu/~tibs/ElemStatLearn/ • Sentiment analysis • Opinion Mining, Sentiment Analysis, and Opinion Spam Detection • http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html • Survey of the field in 2008 • http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html • Lots of additional links on course web page
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
2 cases of linear inseparability • 1. Data is inherently nonlinear • 2. Errors or noise in the training data
Transform the feature space • Suppose data is inseparable in current dimensions • Apply a transformation to the feature space, adding extra dimension(s) • Data may be linearly separable in new dimensions
Example 1: XOR • Linearly inseparable in 2 dimensions • Dimensions: x1, x2 • Points: (0, 1), (1, 1), (1, 0), (0, 0)
XOR: linearly separable in 3 dimensions • Create a third dimension x3 • Convert data points: T(x1, x2) = (x1, x2, x3) • Value in dimension x3: 0 if x1=x2, 1 otherwise • Dimensions: x1, x2, x3 • Points: (0, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0)
XOR: linearly separable in 3 dimensions • Now red points are in front of blue • Separate points with the hyperplane g(X) = x3 - 0.5 • Dimensions: x1, x2, x3 • Points: (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)
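A tiny sketch (not from the slides) of the XOR transformation and the hyperplane test in Python; the function names are just for illustration:

```python
# Transform each 2-d XOR point into 3-d and classify with the hyperplane
# g(X) = x3 - 0.5 from the slide.

def transform(x1, x2):
    x3 = 0 if x1 == x2 else 1          # third dimension: 0 if x1 = x2, 1 otherwise
    return (x1, x2, x3)

def g(point):
    return point[2] - 0.5              # separating hyperplane: g(X) = x3 - 0.5

xor_data = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}   # (x1, x2) -> XOR label

for (x1, x2), label in xor_data.items():
    p = transform(x1, x2)
    predicted = 1 if g(p) > 0 else 0
    print(p, label, predicted)         # prediction matches the XOR label for all four points
```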
Example 2: linearly separable in 2-d after transformation T(x) = (x, x²)
Example 3: not linearly separable in 2-d • Decision boundary: x1² + x2² ≤ 1 • from Artificial Intelligence: A Modern Approach (Third edition) by S. Russell and P. Norvig
Linearly separable in 3-D after transformation T(x1, x2) = (x1², x2², x1x2) • from Artificial Intelligence: A Modern Approach (Third edition) by S. Russell and P. Norvig
Transformations of feature space, and linear separability • Transformations are not necessarily linear • E.g. T(x1, x2) = (x1², x2², x1x2) • We can always increase the number of dimensions in order to find a separating hyperplane • Can run standard perceptron algorithm (or any other classifier) in new feature space • But not necessarily a good idea by itself because: • Could need a huge number of new dimensions in order to be linearly separable • May overfit to noise in data
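A minimal sketch of running a plain perceptron in the transformed space, assuming numpy; T here is the quadratic map from the slide, and the toy points follow the circular example above (inside vs. outside the unit circle):

```python
import numpy as np

def T(x):
    # quadratic transformation from the slide: T(x1, x2) = (x1^2, x2^2, x1*x2)
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, x1 * x2])

def perceptron(points, labels, epochs=100, lr=1.0):
    # ordinary perceptron with a bias term, run in the transformed space
    X = np.array([np.append(T(p), 1.0) for p in points])  # appended 1.0 = bias feature
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, labels):                        # labels y are +1 / -1
            if y * np.dot(w, x) <= 0:
                w += lr * y * x
    return w

# toy points: inside the unit circle -> +1, outside -> -1
points = [(0.2, 0.1), (-0.5, 0.4), (1.5, -1.2), (-2.0, 0.3)]
labels = [+1, +1, -1, -1]
print(perceptron(points, labels))    # weight vector separating the data in the new space
```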
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
Hyperplanes found by perceptron • Algorithm assigns random initial values to the weight vector • Assuming linear separability, the final hyperplane is a function of the training data, learning rate, and initial weight vector • Result: perceptron can find many different separating hyperplanes for the same data
Linearly separable data; which hyperplane is the best?
Best separating hyperplane: I • We should think about where new data from each class is likely to be • First answer: the best hyperplane is the one with the max margin between points of the different classes • Maximizes the distance between the hyperplane and the nearest points of each class
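A worked statement of the idea, using standard definitions not given on the slide: the distance from a point to the hyperplane w·x + b = 0, and the margin as the smallest such distance over the training set.

```latex
% distance from x_i to the hyperplane, and the geometric margin of a separable training set
\[
  \operatorname{dist}(x_i) = \frac{\lvert w \cdot x_i + b \rvert}{\lVert w \rVert},
  \qquad
  \text{margin}(w, b) = \min_i \frac{y_i\,(w \cdot x_i + b)}{\lVert w \rVert}
\]
```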
Compare size of margin of each hyperplane • Best hyperplane has the max margin
But data may be linearly inseparable • Inseparability may be caused by noise or errors in the data • We don’t necessarily want to raise the dimensionality of the data, as the classifier might then overfit to noise in the data • Example: data at right appears to be inherently 2-dimensional • Could convert to 3-d to separate, but that would be overfitting noise
Best separating hyperplane: II • Data may be linearly inseparable • Redefine best hyperplane: • Maximizes the margin between hyperplane and points around it, and • Minimizes number of incorrectly classified points
Linearly inseparable data; best hyperplane has max margin
Best separating hyperplane: III • Also allow for incorrectly classified data points within the margin of the hyperplane
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
Support Vector Machines: Vladimir Vapnik (1992) http://slouchingagalmia.blogspot.com/2007/01/all-your-bayes-are-belong-to-us.html
Key ideas of SVM • 1. Transform feature space with a user-specified function, called the kernel function • 2. Discover the max margin hyperplane • Find the support vectors • Also minimize misclassifications • Details of algorithm are beyond the scope of this course; see book or take a machine learning course • But it’s a very popular algorithm so it should be briefly discussed
Some common kernel functions • Polynomial of degree d: K(x, y) = (1 + x∙y)^d • Radial basis: K(x, y) = exp( -‖x-y‖² / 2σ² ) • Sigmoid: K(x, y) = tanh(k x∙y - σ)
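A small sketch of these kernels in Python with numpy; the parameter names (d, sigma, k, offset) are illustrative choices standing in for the slide's symbols:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (1 + x . y)^d
    return (1.0 + np.dot(x, y)) ** d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, k=1.0, offset=0.0):
    # K(x, y) = tanh(k * x . y - offset)
    return np.tanh(k * np.dot(x, y) - offset)
```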
Example: quadratic kernel • Current data vectors: d dimensions • x = (x1, x2, …, xd) • Quadratic kernel: K(xi, xj) = (1 + ⟨xi, xj⟩)² • Equivalent to a dot product in O(d²) dimensions, via the feature map • Φ(x) = (1, √2·x1, …, √2·xd, x1², …, xd², √2·x1x2, √2·x1x3, …, √2·xd-1xd)
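A quick check (a sketch, not from the slides) that the quadratic kernel really is a dot product in the expanded feature space; the √2 factors in the feature map are what make the two values match:

```python
import numpy as np
from itertools import combinations

def phi(x):
    # explicit feature map for the quadratic kernel (1 + x.y)^2:
    # (1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j for i < j)
    features = [1.0]
    features += [np.sqrt(2) * xi for xi in x]
    features += [xi ** 2 for xi in x]
    features += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.array(features)

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.5, 0.3, -0.7])
print((1.0 + np.dot(x, y)) ** 2)    # kernel value computed directly
print(np.dot(phi(x), phi(y)))       # same value via the explicit O(d^2) feature map
```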
Support vectors • Support vectors are the training points that lie on the margin of the hyperplane; they are the points that determine where the hyperplane goes
Linearly inseparable case: allow for errors • Application of kernel function doesn’t guarantee that data will be linearly separable • Represent misclassifications through “slack” variables ξi
SVM training • Optimization problem • Want to concurrently: • Minimize the misclassification rate • Maximize distance between hyperplane and its support vectors • Solution involves quadratic programming
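A standard statement of this optimization (the soft-margin primal, not written out on the slide); C is a user-set constant trading margin size against slack:

```latex
\[
  \min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
  \quad \text{subject to} \quad
  y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0 \;\text{ for all } i
\]
```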
SVMs: pros and cons • Pros • Often works better than other, older algorithms • Max margin is a principled way of choosing hyperplane • Cons • Training is computationally intensive • Transformation by kernel function can greatly increase the dimensionality of the data, so training takes longer • Choice of kernel isn’t obvious for non-trivial problems
SVM and model selection • Available models: range of possible parameterizations of the model, defined by the parameters of the SVM and the choice of kernel function • How well the model fits the data: separate points in training data, generalize to new data • Balance simplicity of model and fit to data • Noisy data: some robustness to noisy data • Separability: OK • Maximum margin: yes • Computational issues: slow training
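A minimal sketch of how kernel and parameter choice might be handled in practice, using scikit-learn (a library assumption, not mentioned on the slides) and cross-validation over a small grid; the toy data is generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy feature matrix and labels standing in for real training data
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# candidate kernels and regularization strengths
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)   # selected kernel/C and cross-validated accuracy
```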
SVM applets • http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html • http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
… the story of the sheep dog who was herding his sheep, and serendipitously invented both large margin classifiers and Sheep Vectors… (Schölkopf & Smola 2002, p. xviii, illustrations by Ana Martin Larranaga)
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
Hot research topic: sentiment analysis • Determine subjective information • Whether the author of a document was writing positively or negatively (or was neutral) • Introduced by Pang, Lee, and Vaithyanathan 2002 in the context of movie reviews • Will discuss this paper in detail • Assignment #2 is based on this paper
Applications of sentiment analysis • Assign numerical scores to product reviews • Develop recommender systems • Monitor political debates, determine whether a politician is going to vote for or against a bill • Monitor discussion about candidates in elections
Other kinds of sentiment analysis • Whether a person is being deceptive, or telling the truth • Determine when product reviews are fake • Automated interrogation of criminal suspects • Monitor political sentiment in social media • e.g. Does the author favor Republican or Democrat?
Sentiment analysis in the news • Mining the Web for Feelings, Not Facts (August 23, 2009) • http://www.nytimes.com/2009/08/24/technology/internet/24emotion.html • For $2 a Star, an Online Retailer Gets 5-Star Product Reviews (January 26, 2012) • http://www.nytimes.com/2012/01/27/technology/for-2-a-star-a-retailer-gets-5-star-reviews.html • Software That Listens for Lies (December 3, 2011) • http://www.nytimes.com/2011/12/04/business/lie-detection-software-parses-the-human-voice.html
Sentiment analysis in the news • Facebook Tests Negative Sentiment Analysis Feature For Pages (December 2, 2011) • http://mashable.com/2011/12/02/facebook-negative-sentiment/ • ACLU criticizes Facebook 'election sentiment' tool (January 31, 2012) • http://campaign2012.washingtonexaminer.com/blogs/beltway-confidential/aclu-criticizes-facebook-election-sentiment-tool/350691
Example of a positive book review (from Amazon.com) At our library we have a section for New Arrivals. I happened to find Twilight and was awed by the cover art, I figured anything with such beauty on the cover had to be worth reading. I was so right. Twilight may be marketed as a teen read but speaks to all ages. Bella is someone any woman can relate to and I found myself thinking about her and Edward days after I finished the book. I have read alot of horror romance over the years and Twilight ranks among the highest. Borrow it, Buy it, whatever just make sure you read it!
Example of a negative review (from Amazon.com) I wasn't going to review the novel at all because I simply hated it too much and, well, why spend more time dwelling on it than necessary? But the amount of people who claim her writing is flawless, the story is original and perfect, and the book appeals to all ages just drove me crazy. No, her writing is not flawless. In fact, it's very juvenile for someone who has had as much schooling as Stephenie Meyer. The story itself is completely predictable from the drab and ridiculous Preface, to the very last sentence. And the overall plot of the series? Well, I'm not actually sure there is one.
Training data • Extracted movie reviews from imdb.com • Data can be downloaded • http://www.cs.cornell.edu/people/pabo/movie-review-data/ • 1000 positive reviews, 1000 negative reviews • (expanded from original paper, which had 800 / 800 reviews) • Determine whether review is negative or positive based on numerical score • Example: within a five-star system, 3.5 stars and up is considered positive; 2 stars and below is considered negative
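The labeling rule above can be written as a small hypothetical helper (the function name and the None return for middling reviews are assumptions for illustration):

```python
def label_from_stars(stars):
    # thresholds from the slide, five-star scale
    if stars >= 3.5:
        return "positive"
    if stars <= 2.0:
        return "negative"
    return None   # middling reviews are left out of the training set

print(label_from_stars(4.0))   # positive
print(label_from_stars(1.5))   # negative
print(label_from_stars(3.0))   # None
```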
What features could one use for sentiment analysis? • Lookup word in dictionary of positive/negative terms • Frequency of words in document • Bigrams w/ negation • Curse words • !!! • (verb, POStag) • Hypothetical or negative words before some other word • (word, ‘but’)
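A hedged sketch of two of these feature ideas: binary unigram presence plus a crude negation marking in the spirit of Pang et al. (the NOT_ prefix convention and the regex tokenizer are simplifying assumptions):

```python
import re

NEGATION = {"not", "never", "no", "n't", "isn't", "doesn't", "wasn't"}

def unigram_features(text):
    """Binary unigram-presence features with simple negation marking:
    words between a negation word and the next punctuation get a NOT_ prefix."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    features = set()
    negate = False
    for tok in tokens:
        if tok in ".,!?;":
            negate = False
        elif tok in NEGATION:
            negate = True
        else:
            features.add("NOT_" + tok if negate else tok)
    return features

print(unigram_features("This movie was not good, but the acting was great!"))
# e.g. {'this', 'movie', 'was', 'NOT_good', 'but', 'the', 'acting', 'great'}
```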
Compare performance of different machine learning algorithms • Algorithms • Naïve Bayes (we’ll see this later) • Maximum Entropy (we’ll see this later) • Support Vector Machine • When developing a machine learning system, you should always have a baseline • Baseline: use a simple method to perform the task • The system you develop should perform better than the baseline
Baseline system • Asked 2 graduate students to think of words that they thought would be indicative of a negative or positive movie review • Human 1 • Positive: dazzling, brilliant, phenomenal, excellent, fantastic • Negative: suck, terrible, awful, unwatchable, hideous • Human 2 • Positive: gripping, mesmerizing, riveting, spectacular, cool, awesome, thrilling, badass, excellent, moving, exciting • Negative: bad, cliched, sucks, boring, stupid, slow
Baseline system • Given a movie review, count how many (token) occurrences of “positive” and “negative” words are in the review • Classify a document as Positive or Negative depending on whether it had more “positive” or “negative” words • Classify as Tie if equal number • Comparing against gold standard, score each review as Correctly classified, Incorrect, or Tie
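A minimal sketch of this counting baseline using Human 1's word lists from the earlier slide; tokenization by a simple regex is an assumption:

```python
import re

POSITIVE = {"dazzling", "brilliant", "phenomenal", "excellent", "fantastic"}
NEGATIVE = {"suck", "terrible", "awful", "unwatchable", "hideous"}

def classify(review_text):
    # count token occurrences from each word list, then compare the counts
    tokens = re.findall(r"[a-z]+", review_text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "tie"

print(classify("A dazzling, excellent film."))    # positive
print(classify("It was awful."))                  # negative
print(classify("Nothing from either list."))      # tie
```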
Baseline results • Doesn’t do so well
Questions about baseline system • Why is accuracy so low? • Could have negation • List of words is very short • Not enough features (e.g. “good”, “great”) • Why are there so many ties? • Didn’t account for negation (“not” + word) • With short word lists, many reviews don’t have any occurrences of any of those words, which leads to ties