LING / C SC 439/539 Statistical Natural Language Processing, Lecture 8, 2/6/2013
Recommended Reading • Support Vector Machines • Hastie Chapter 12, sections 1-3 • http://www-stat.stanford.edu/~tibs/ElemStatLearn/ • Sentiment analysis • Opinion Mining, Sentiment Analysis, and Opinion Spam Detection • http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html • Survey of the field in 2008 • http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html • Lots of additional links on course web page
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
2 cases of linear inseparability • 1. Data is inherently nonlinear • 2. Errors or noise in the training data
Transform the feature space • Suppose data is inseparable in current dimensions • Apply a transformation to the feature space, adding extra dimension(s) • Data may be linearly separable in new dimensions
Example 1: XOR • Linearly inseparable in 2 dimensions • Dimensions: x1, x2 • Points: (0, 1), (1, 1), (1, 0), (0, 0)
XOR: linearly separable in 3 dimensions • Create a third dimension x3 • Convert data points: T(x1, x2) = (x1, x2, x3) • Value in dimension x3: 0 if x1=x2, 1 otherwise • Dimensions: x1, x2, x3 • Points: (0, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0)
XOR: linearly separable in 3 dimensions • Now red points are in front of blue • Separate points with the hyperplane g(X) = x3 - 0.5 • Dimensions: x1, x2, x3 • Points: (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)
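A tiny sketch (not from the slides) of the XOR transformation and the hyperplane test in Python; the function names are just for illustration:

```python
# Transform each 2-d XOR point into 3-d and classify with the hyperplane
# g(X) = x3 - 0.5 from the slide.

def transform(x1, x2):
    x3 = 0 if x1 == x2 else 1          # third dimension: 0 if x1 = x2, 1 otherwise
    return (x1, x2, x3)

def g(point):
    return point[2] - 0.5              # separating hyperplane: g(X) = x3 - 0.5

xor_data = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}   # (x1, x2) -> XOR label

for (x1, x2), label in xor_data.items():
    p = transform(x1, x2)
    predicted = 1 if g(p) > 0 else 0
    print(p, label, predicted)         # prediction matches the XOR label for all four points
```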
Example 2: linearly separable in 2-d after transformation T(x) = (x, x²)
Example 3: not linearly separable in 2-d • Decision boundary: x1² + x2² ≤ 1 • from Artificial Intelligence: A Modern Approach (Third edition) by S. Russell and P. Norvig
Linearly separable in 3-D after transformation T(x1, x2) = (x1², x2², x1x2) • from Artificial Intelligence: A Modern Approach (Third edition) by S. Russell and P. Norvig
Transformations of feature space, and linear separability • Transformations are not necessarily linear • E.g. T(x1, x2) = (x1², x2², x1x2) • We can always increase the number of dimensions in order to find a separating hyperplane • Can run standard perceptron algorithm (or any other classifier) in new feature space • But not necessarily a good idea by itself because: • Could need a huge number of new dimensions in order to be linearly separable • May overfit to noise in data
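A minimal sketch of running a plain perceptron in the transformed space, assuming numpy; T here is the quadratic map from the slide, and the toy points follow the circular example above (inside vs. outside the unit circle):

```python
import numpy as np

def T(x):
    # quadratic transformation from the slide: T(x1, x2) = (x1^2, x2^2, x1*x2)
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, x1 * x2])

def perceptron(points, labels, epochs=100, lr=1.0):
    # ordinary perceptron with a bias term, run in the transformed space
    X = np.array([np.append(T(p), 1.0) for p in points])  # appended 1.0 = bias feature
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, labels):                        # labels y are +1 / -1
            if y * np.dot(w, x) <= 0:
                w += lr * y * x
    return w

# toy points: inside the unit circle -> +1, outside -> -1
points = [(0.2, 0.1), (-0.5, 0.4), (1.5, -1.2), (-2.0, 0.3)]
labels = [+1, +1, -1, -1]
print(perceptron(points, labels))    # weight vector separating the data in the new space
```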
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
Hyperplanes found by perceptron • Algorithm assigns random initial values to the weight vector • Assuming linear separability, the final hyperplane is a function of the training data, learning rate, and initial weight vector • Result: perceptron can find many different separating hyperplanes for the same data
Linearly separable data; which hyperplane is the best?
Best separating hyperplane: I • We should think about where new data from each class is likely to be • First answer: the best hyperplane is the one with the max margin between points of the different classes • Maximizes the distance between the hyperplane and the nearest points of each class
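A worked statement of the idea, using standard definitions not given on the slide: the distance from a point to the hyperplane w·x + b = 0, and the margin as the smallest such distance over the training set.

```latex
% distance from x_i to the hyperplane, and the geometric margin of a separable training set
\[
  \operatorname{dist}(x_i) = \frac{\lvert w \cdot x_i + b \rvert}{\lVert w \rVert},
  \qquad
  \text{margin}(w, b) = \min_i \frac{y_i\,(w \cdot x_i + b)}{\lVert w \rVert}
\]
```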
Compare size of margin of each hyperplane • Best hyperplane has the max margin
But data may be linearly inseparable • Inseparability may be caused by noise or errors in the data • We don’t necessarily want to raise the dimensionality of the data, as the classifier might then overfit to noise in the data • Example: data at right appears to be inherently 2-dimensional • Could convert to 3-d to separate, but that would be overfitting noise
Best separating hyperplane: II • Data may be linearly inseparable • Redefine best hyperplane: • Maximizes the margin between hyperplane and points around it, and • Minimizes number of incorrectly classified points
Linearly inseparable data; best hyperplane has max margin
Best separating hyperplane: III • Also allow for incorrectly classified data points within the margin of the hyperplane
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
Support Vector Machines: Vladimir Vapnik (1992) http://slouchingagalmia.blogspot.com/2007/01/all-your-bayes-are-belong-to-us.html
Key ideas of SVM • 1. Transform feature space with a user-specified function, called the kernel function • 2. Discover the max margin hyperplane • Find the support vectors • Also minimize misclassifications • Details of algorithm are beyond the scope of this course; see book or take a machine learning course • But it’s a very popular algorithm so it should be briefly discussed
Some common kernel functions • Polynomial of degree d: K(x, y) = (1 + x∙y)^d • Radial basis: K(x, y) = exp( -‖x-y‖² / 2σ² ) • Sigmoid: K(x, y) = tanh(k x∙y - σ)
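A small sketch of these kernels in Python with numpy; the parameter names (d, sigma, k, offset) are illustrative choices standing in for the slide's symbols:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (1 + x . y)^d
    return (1.0 + np.dot(x, y)) ** d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, k=1.0, offset=0.0):
    # K(x, y) = tanh(k * x . y - offset)
    return np.tanh(k * np.dot(x, y) - offset)
```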
Example: quadratic kernel • Current data vectors: d dimensions • x = (x1, x2, …, xd) • Quadratic kernel: K(xi, xj) = (1 + ⟨xi, xj⟩)² • Equivalent to a dot product in O(d²) dimensions, via the feature map • Φ(x) = (1, √2·x1, …, √2·xd, x1², …, xd², √2·x1x2, √2·x1x3, …, √2·xd-1xd)
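A quick check (a sketch, not from the slides) that the quadratic kernel really is a dot product in the expanded feature space; the √2 factors in the feature map are what make the two values match:

```python
import numpy as np
from itertools import combinations

def phi(x):
    # explicit feature map for the quadratic kernel (1 + x.y)^2:
    # (1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j for i < j)
    features = [1.0]
    features += [np.sqrt(2) * xi for xi in x]
    features += [xi ** 2 for xi in x]
    features += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.array(features)

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.5, 0.3, -0.7])
print((1.0 + np.dot(x, y)) ** 2)    # kernel value computed directly
print(np.dot(phi(x), phi(y)))       # same value via the explicit O(d^2) feature map
```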
Support vectors • Support vectors are the training points that lie on the margin of the hyperplane; they are the points that determine where the hyperplane goes
Linearly inseparable case: allow for errors • Application of kernel function doesn’t guarantee that data will be linearly separable • Represent misclassifications through “slack” variables ξi
SVM training • Optimization problem • Want to concurrently: • Minimize the misclassification rate • Maximize distance between hyperplane and its support vectors • Solution involves quadratic programming
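A standard statement of this optimization (the soft-margin primal, not written out on the slide); C is a user-set constant trading margin size against slack:

```latex
\[
  \min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
  \quad \text{subject to} \quad
  y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0 \;\text{ for all } i
\]
```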
SVMs: pros and cons • Pros • Often works better than other, older algorithms • Max margin is a principled way of choosing hyperplane • Cons • Training is computationally intensive • Transformation by kernel function can greatly increase the dimensionality of the data, so training takes longer • Choice of kernel isn’t obvious for non-trivial problems
SVM and model selection • Available models: range of possible parameterizations of the model, defined by the parameters of the SVM and the choice of kernel function • How well the model fits the data: separate points in training data, generalize to new data • Balance simplicity of model and fit to data • Noisy data: some robustness to noisy data • Separability: OK • Maximum margin: yes • Computational issues: slow training
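A minimal sketch of how kernel and parameter choice might be handled in practice, using scikit-learn (a library assumption, not mentioned on the slides) and cross-validation over a small grid; the toy data is generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy feature matrix and labels standing in for real training data
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# candidate kernels and regularization strengths
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)   # selected kernel/C and cross-validated accuracy
```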
SVM applets • http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html • http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
… the story of the sheep dog who was herding his sheep, and serendipitously invented both large margin classifiers and Sheep Vectors… (Schölkopf & Smola 2002, p. xviii, illustrations by Ana Martin Larranaga)
Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry
Hot research topic: sentiment analysis • Determine subjective information • Whether the author of a document was writing positively or negatively (or was neutral) • Introduced by Pang, Lee, and Vaithyanathan 2002 in the context of movie reviews • Will discuss this paper in detail • Assignment #2 is based on this paper
Applications of sentiment analysis • Assign numerical scores to product reviews • Develop recommender systems • Monitor political debates, determine whether a politician is going to vote for or against a bill • Monitor discussion about candidates in elections
Other kinds of sentiment analysis • Whether a person is being deceptive, or telling the truth • Determine when product reviews are fake • Automated interrogation of criminal suspects • Monitor political sentiment in social media • e.g. Does the author favor Republican or Democrat?
Sentiment analysis in the news • Mining the Web for Feelings, Not Facts (August 23, 2009) • http://www.nytimes.com/2009/08/24/technology/internet/24emotion.html • For $2 a Star, an Online Retailer Gets 5-Star Product Reviews (January 26, 2012) • http://www.nytimes.com/2012/01/27/technology/for-2-a-star-a-retailer-gets-5-star-reviews.html • Software That Listens for Lies (December 3, 2011) • http://www.nytimes.com/2011/12/04/business/lie-detection-software-parses-the-human-voice.html
Sentiment analysis in the news • Facebook Tests Negative Sentiment Analysis Feature For Pages (December 2, 2011) • http://mashable.com/2011/12/02/facebook-negative-sentiment/ • ACLU criticizes Facebook 'election sentiment' tool (January 31, 2012) • http://campaign2012.washingtonexaminer.com/blogs/beltway-confidential/aclu-criticizes-facebook-election-sentiment-tool/350691
Example of a positive book review (from Amazon.com) At our library we have a section for New Arrivals. I happened to find Twilight and was awed by the cover art, I figured anything with such beauty on the cover had to be worth reading. I was so right. Twilight may be marketed as a teen read but speaks to all ages. Bella is someone any woman can relate to and I found myself thinking about her and Edward days after I finished the book. I have read alot of horror romance over the years and Twilight ranks among the highest. Borrow it, Buy it, whatever just make sure you read it!
Example of a negative review (from Amazon.com) I wasn't going to review the novel at all because I simply hated it too much and, well, why spend more time dwelling on it than necessary? But the amount of people who claim her writing is flawless, the story is original and perfect, and the book appeals to all ages just drove me crazy. No, her writing is not flawless. In fact, it's very juvenile for someone who has had as much schooling as Stephenie Meyer. The story itself is completely predictable from the drab and ridiculous Preface, to the very last sentence. And the overall plot of the series? Well, I'm not actually sure there is one.
Training data • Extracted movie reviews from imdb.com • Data can be downloaded • http://www.cs.cornell.edu/people/pabo/movie-review-data/ • 1000 positive reviews, 1000 negative reviews • (expanded from original paper, which had 800 / 800 reviews) • Determine whether review is negative or positive based on numerical score • Example: within a five-star system, 3.5 stars and up is considered positive; 2 stars and below is considered negative
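The labeling rule above can be written as a small hypothetical helper (the function name and the None return for middling reviews are assumptions for illustration):

```python
def label_from_stars(stars):
    # thresholds from the slide, five-star scale
    if stars >= 3.5:
        return "positive"
    if stars <= 2.0:
        return "negative"
    return None   # middling reviews are left out of the training set

print(label_from_stars(4.0))   # positive
print(label_from_stars(1.5))   # negative
print(label_from_stars(3.0))   # None
```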
What features could one use for sentiment analysis? • Lookup word in dictionary of positive/negative terms • Frequency of words in document • Bigrams w/ negation • Curse words • !!! • (verb, POStag) • Hypothetical or negative words before some other word • (word, ‘but’)
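A hedged sketch of two of these feature ideas: binary unigram presence plus a crude negation marking in the spirit of Pang et al. (the NOT_ prefix convention and the regex tokenizer are simplifying assumptions):

```python
import re

NEGATION = {"not", "never", "no", "n't", "isn't", "doesn't", "wasn't"}

def unigram_features(text):
    """Binary unigram-presence features with simple negation marking:
    words between a negation word and the next punctuation get a NOT_ prefix."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    features = set()
    negate = False
    for tok in tokens:
        if tok in ".,!?;":
            negate = False
        elif tok in NEGATION:
            negate = True
        else:
            features.add("NOT_" + tok if negate else tok)
    return features

print(unigram_features("This movie was not good, but the acting was great!"))
# e.g. {'this', 'movie', 'was', 'NOT_good', 'but', 'the', 'acting', 'great'}
```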
Compare performance of different machine learning algorithms • Algorithms • Naïve Bayes (we’ll see this later) • Maximum Entropy (we’ll see this later) • Support Vector Machine • When developing a machine learning system, you should always have a baseline • Baseline: use a simple method to perform the task • The system you develop should perform better than the baseline
Baseline system • Asked 2 graduate students to think of words that they thought would be indicative of a negative or positive movie review • Human 1 • Positive: dazzling, brilliant, phenomenal, excellent, fantastic • Negative: suck, terrible, awful, unwatchable, hideous • Human 2 • Positive: gripping, mesmerizing, riveting, spectacular, cool, awesome, thrilling, badass, excellent, moving, exciting • Negative: bad, cliched, sucks, boring, stupid, slow
Baseline system • Given a movie review, count how many (token) occurrences of “positive” and “negative” words are in the review • Classify a document as Positive or Negative depending on whether it had more “positive” or “negative” words • Classify as Tie if equal number • Comparing against gold standard, score each review as Correctly classified, Incorrect, or Tie
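A minimal sketch of this counting baseline using Human 1's word lists from the earlier slide; tokenization by a simple regex is an assumption:

```python
import re

POSITIVE = {"dazzling", "brilliant", "phenomenal", "excellent", "fantastic"}
NEGATIVE = {"suck", "terrible", "awful", "unwatchable", "hideous"}

def classify(review_text):
    # count token occurrences from each word list, then compare the counts
    tokens = re.findall(r"[a-z]+", review_text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "tie"

print(classify("A dazzling, excellent film."))    # positive
print(classify("It was awful."))                  # negative
print(classify("Nothing from either list."))      # tie
```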
Baseline results • Doesn’t do so well
Questions about baseline system • Why is accuracy so low? • Could have negation • List of words is very short • Not enough features (e.g. “good”, “great”) • Why are there so many ties? • Didn’t account for negation (“not” + word) • With short word lists, many reviews don’t have any occurrences of any of those words, which leads to ties