
LING / C SC 439/539 Statistical Natural Language Processing



  1. LING / C SC 439/539 Statistical Natural Language Processing, Lecture 8, 2/6/2013

  2. Recommended Reading • Support Vector Machines • Hastie Chapter 12, sections 1-3 • http://www-stat.stanford.edu/~tibs/ElemStatLearn/ • Sentiment analysis • Opinion Mining, Sentiment Analysis, and Opinion Spam Detection • http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html • Survey of the field in 2008 • http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html • Lots of additional links on course web page

  3. Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry

  4. 2 cases of linear inseparability • 1. Data is inherently nonlinear • 2. Errors or noise in the training data [figure: example scatter plots of the two cases]

  5. Transform the feature space • Suppose data is inseparable in current dimensions • Apply a transformation to the feature space, adding extra dimension(s) • Data may be linearly separable in new dimensions

  6. Example 1: XOR. Linearly inseparable in 2 dimensions • Dimensions: x1, x2 • Data points: (0, 0), (0, 1), (1, 0), (1, 1)

  7. XOR: linearly separable in 3 dimensions • Create a third dimension x3 • Convert data points: T(x1, x2) = (x1, x2, x3) • Value in dimension x3: 0 if x1 = x2, 1 otherwise • Dimensions: x1, x2, x3; transformed points: (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)

  8. XOR: linearly separable in 3 dimensions • Now red points are in front of blue • Separate points with the hyperplane g(X) = x3 - 0.5 [figure: the transformed points (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0) in the dimensions x1, x2, x3, split by the separating hyperplane g(X) = x3 - 0.5]
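A minimal sketch of this idea in Python (the function names are illustrative, not from the slides): it applies T to the four XOR points and classifies them with the hyperplane g(X) = x3 - 0.5.

```python
# Sketch: the XOR transform T(x1, x2) = (x1, x2, x3), with x3 = 0 if x1 == x2
# else 1, followed by the separating hyperplane g(X) = x3 - 0.5.

def transform(x1, x2):
    """Map a 2-d XOR point into 3-d by adding the dimension x3."""
    x3 = 0 if x1 == x2 else 1
    return (x1, x2, x3)

def g(point):
    """Separating hyperplane in the new space: g(X) = x3 - 0.5."""
    return point[2] - 0.5

# XOR data: label 1 when exactly one input is 1, else 0
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for (x1, x2), label in data:
    point = transform(x1, x2)
    predicted = 1 if g(point) > 0 else 0
    print(point, "label =", label, "predicted =", predicted)
```

Every transformed point is classified correctly by the linear test in the new third dimension, even though no line separates the original 2-d points.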

  9. Example 2: Not linearly separable in 1-d

  10. Linearly separable in 2-d after transformation T(x) = (x, x²)

  11. Example 3: not linearly separable in 2-d. Decision boundary: x1² + x2² <= 1 (from Artificial Intelligence: A Modern Approach (Third edition) by S. Russell and P. Norvig)

  12. Linearly separable in 3-D after transformation T(x1, x2) = (x1², x2², x1x2) (from Artificial Intelligence: A Modern Approach (Third edition) by S. Russell and P. Norvig)
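A small sketch of this circle example under the slide's transform. In the new coordinates (z1, z2, z3) = (x1², x2², x1x2), the original boundary x1² + x2² <= 1 becomes the linear test z1 + z2 <= 1. The sample points here are made up for illustration.

```python
# Sketch: the quadratic transform T(x1, x2) = (x1^2, x2^2, x1*x2).
# In the transformed space the circle boundary is linear: z1 + z2 <= 1.

def T(x1, x2):
    return (x1 * x1, x2 * x2, x1 * x2)

points = [(0.0, 0.5), (0.3, -0.4), (1.2, 0.1), (-0.9, 0.9)]
for x1, x2 in points:
    z1, z2, z3 = T(x1, x2)
    inside = z1 + z2 <= 1           # linear decision in the transformed space
    print((x1, x2), "->", (z1, z2, z3), "inside circle:", inside)
```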

  13. Transformations of feature space, and linear separability • Transformations are not necessarily linear • E.g. T(x1, x2) = (x1², x2², x1x2) • We can always increase the number of dimensions, in order to find a separating hyperplane • Can run standard perceptron algorithm (or any other classifier) in new feature space • But not necessarily a good idea by itself because: • Could need a huge number of new dimensions in order to be linearly separable • May overfit to noise in data

  14. Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry

  15. Hyperplanes found by perceptron • Algorithm assigns random initial values to the weight vector • Assuming linear separability, the final hyperplane is a function of the training data, learning rate, and initial weight vector • Result: perceptron can find many different separating hyperplanes for the same data
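A rough sketch of this dependence on initialization, assuming the standard perceptron update from earlier lectures (the toy data and random seeds are illustrative): different initial weights give different final hyperplanes that all separate the same data.

```python
# Sketch: train a perceptron from several random initial weight vectors and
# observe that the final (separating) weight vectors differ.

import numpy as np

def train_perceptron(X, y, w_init, lr=1.0, epochs=100):
    w = w_init.copy()
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):           # labels y are in {-1, +1}
            if yi * np.dot(w, xi) <= 0:    # misclassified point
                w += lr * yi * xi          # perceptron update
                errors += 1
        if errors == 0:                    # converged: all points separated
            break
    return w

# toy separable data, with a constant bias feature appended to each point
X = np.array([[1.0, 2.0, 1], [2.0, 3.0, 1], [-1.0, -1.0, 1], [-2.0, -0.5, 1]])
y = np.array([1, 1, -1, -1])

for seed in (0, 1, 2):
    rng = np.random.default_rng(seed)
    w = train_perceptron(X, y, rng.normal(size=3))
    print("seed", seed, "-> weights", np.round(w, 2))
```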

  16. Each hyperplane separates the training data

  17. Linearly separable data; which hyperplane is the best? [figure: the same two-class data separated by several different candidate hyperplanes]

  18. Best separating hyperplane: I • We should think about where new data from each class is likely to be • First answer: the best hyperplane is the one with the max margin between points of the different classes • Maximizes the distance between the hyperplane and the nearest points of each class

  19. Compare size of margin of each hyperplane. Best hyperplane has the max margin [figure: candidate hyperplanes with their margins]

  20. But data may be linearly inseparable • Inseparability may be caused by noise or errors in the data • We don’t necessarily want to raise the dimensionality of the data, as the classifier might then overfit to noise in the data • Example: data at right appears to be inherently 2-dimensional • Could convert to 3-d to separate, but that would be overfitting noise [figure: 2-d data with a few noisy points]

  21. Best separating hyperplane: II • Data may be linearly inseparable • Redefine best hyperplane: • Maximizes the margin between hyperplane and points around it, and • Minimizes number of incorrectly classified points

  22. Linearly inseparable data; best hyperplane has max margin [figure: inseparable two-class data with the max-margin hyperplane]

  23. Best separating hyperplane: III • Also allow for incorrectly classified data points within the margin of the hyperplane [figure: points falling inside the margin]

  24. Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry

  25. Support Vector Machines: Vladimir Vapnik (1992) http://slouchingagalmia.blogspot.com/2007/01/all-your-bayes-are-belong-to-us.html

  26. Key ideas of SVM • 1. Transform feature space with a user-specified function, called the kernel function • 2. Discover the max margin hyperplane • Find the support vectors • Also minimize misclassifications • Details of algorithm are beyond the scope of this course; see book or take a machine learning course • But it’s a very popular algorithm so it should be briefly discussed

  27. Some common kernel functions • Polynomial of degree d: K(x, y) = (1 + x∙y)^d • Radial basis: K(x, y) = exp( -||x-y||² / 2σ² ) • Sigmoid: K(x, y) = tanh(k x∙y - σ)
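These three kernels written out as plain Python functions (the parameter names sigma, kappa, and delta and their default values are illustrative, not fixed by the slide):

```python
# Sketch of the three common kernels from the slide.

import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel of degree d: (1 + x.y)^d."""
    return (1.0 + np.dot(x, y)) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis (Gaussian) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """Sigmoid kernel: tanh(kappa * x.y - delta)."""
    return np.tanh(kappa * np.dot(x, y) - delta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```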

  28. Example: quadratic kernel • Current data vectors: d dimensions • x = (x1, x2, …, xd) • Quadratic kernel: O(d²) dimensions • K(x, y) = (1 + <x, y>)² corresponds to the feature map φ(x) = (1, √2·x1, …, √2·xd, x1², …, xd², √2·x1x2, √2·x1x3, …, √2·xd-1xd)
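A quick numerical check, assuming the √2 factors above, that the kernel value (1 + x·y)² equals the dot product of the explicit feature vectors, so the O(d²)-dimensional vectors never have to be built:

```python
# Sketch: verify (1 + x.y)^2 == phi(x).phi(y) for the explicit quadratic feature map.

import itertools
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel."""
    sq2 = np.sqrt(2.0)
    features = [1.0]
    features += [sq2 * xi for xi in x]                        # sqrt(2) * x_i
    features += [xi * xi for xi in x]                         # x_i^2
    features += [sq2 * x[i] * x[j]                            # sqrt(2) * x_i * x_j, i < j
                 for i, j in itertools.combinations(range(len(x)), 2)]
    return np.array(features)

x, y = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
print((1 + np.dot(x, y)) ** 2)       # kernel value computed in d dimensions
print(np.dot(phi(x), phi(y)))        # same number via the explicit feature map
```

Both lines print 49.0, which is the point of the "kernel trick": the kernel evaluates the high-dimensional dot product without ever constructing the transformed vectors.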

  29. Support vectors • Support vectors are points on the margin of the hyperplane

  30. Linearly inseparable case: allow for errors • Application of kernel function doesn’t guarantee that data will be linearly separable • Represent misclassifications through “slack” variables ξi [figure: misclassified points inside the margin]

  31. SVM training • Optimization problem • Want to concurrently: • Minimize the misclassification rate • Maximize distance between hyperplane and its support vectors • Solution involves quadratic programming
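In practice the quadratic program is handled by a library. A minimal sketch with scikit-learn (not part of the slides), where the kernel choice transforms the feature space and the parameter C controls how much slack is tolerated:

```python
# Sketch: train a soft-margin SVM on a tiny toy dataset with scikit-learn.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.9, 0.9], [0.1, 0.1]])
y = np.array([0, 1, 1, 0, 0, 0])

# kernel='rbf' implicitly transforms the feature space; C trades margin width
# against misclassifications (smaller C allows more slack / a wider margin)
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)
print("predictions:", clf.predict(X))
```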

  32. SVMs: pros and cons • Pros • Often works better than other, older algorithms • Max margin is a principled way of choosing hyperplane • Cons • Training is computationally intensive • Transformation by kernel function can greatly increase dimensions of data → training takes longer • Choice of kernel isn’t obvious for non-trivial problems

  33. SVM and model selection • Available models • Range of possible parameterizations of model • Defined by parameters of SVM, and choice of kernel function • How well the model fits the data • Separate points in training data • Generalize to new data • Balance simplicity of model and fit to data • Noisy data → some robustness to noisy data • Separability → OK • Maximum margin → yes • Computational issues → SLOW TRAINING

  34. SVM applets • http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html • http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml

  35. … the story of the sheep dog who was herding his sheep, and serendipitously invented both large margin classifiers and Sheep Vectors…(Schölkopf & Smola 2002, p. xviii, illustrations by Ana Martín Larrañaga)

  36. Outline • Transformation of feature space • Max Margin classification • Support Vector Machines • Sentiment Analysis • Programming assignment #2 • Stylometry

  37. Hot research topic: sentiment analysis • Determine subjective information • Whether the author of a document was writing positively or negatively (or was neutral) • Introduced by Pang, Lee, and Vaithyanathan 2002 in the context of movie reviews • Will discuss this paper in detail • Assignment #2 is based on this paper

  38. Applications of sentiment analysis • Assign numerical scores to product reviews • Develop recommender systems • Monitor political debates, determine whether a politician is going to vote for or against a bill • Monitor discussion about candidates in elections

  39. Other kinds of sentiment analysis • Whether a person is being deceptive, or telling the truth • Determine when product reviews are fake • Automated interrogation of criminal suspects • Monitor political sentiment in social media • e.g. Does the author favor Republican or Democrat?

  40. Sentiment analysis in the news • Mining the Web for Feelings, Not Facts (August 23, 2009) • http://www.nytimes.com/2009/08/24/technology/internet/24emotion.html • For $2 a Star, an Online Retailer Gets 5-Star Product Reviews (January 26, 2012) • http://www.nytimes.com/2012/01/27/technology/for-2-a-star-a-retailer-gets-5-star-reviews.html • Software That Listens for Lies (December 3, 2011) • http://www.nytimes.com/2011/12/04/business/lie-detection-software-parses-the-human-voice.html

  41. Sentiment analysis in the news • Facebook Tests Negative Sentiment Analysis Feature For Pages (December 2, 2011) • http://mashable.com/2011/12/02/facebook-negative-sentiment/ • ACLU criticizes Facebook 'election sentiment' tool (January 31, 2012) • http://campaign2012.washingtonexaminer.com/blogs/beltway-confidential/aclu-criticizes-facebook-election-sentiment-tool/350691

  42. Example of a positive book review(from Amazon.com) At our library we have a section for New Arrivals. I happened to find Twilight and was awed by the cover art, I figured anything with such beauty on the cover had to be worth reading. I was so right. Twilight may be marketed as a teen read but speaks to all ages. Bella is someone any woman can relate to and I found myself thinking about her and Edward days after I finished the book. I have read alot of horror romance over the years and Twilight ranks among the highest. Borrow it, Buy it, whatever just make sure you read it!

  43. Example of a negative review(from Amazon.com) I wasn't going to review the novel at all because I simply hated it too much and, well, why spend more time dwelling on it than necessary? But the amount of people who claim her writing is flawless, the story is original and perfect, and the book appeals to all ages just drove me crazy. No, her writing is not flawless. In fact, it's very juvenile for someone who has had as much schooling as Stephenie Meyer. The story itself is completely predictable from the drab and ridiculous Preface, to the very last sentence. And the overall plot of the series? Well, I'm not actually sure there is one.

  44. Training data • Extracted movie reviews from imdb.com • Data can be downloaded • http://www.cs.cornell.edu/people/pabo/movie-review-data/ • 1000 positive reviews, 1000 negative reviews • (expanded from original paper, which had 800 / 800 reviews) • Determine whether review is negative or positive based on numerical score • Example: within a five-star system, 3.5 stars and up is considered positive; 2 stars and below is considered negative
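A minimal sketch for reading the reviews into (text, label) pairs. It assumes the downloaded data has been unpacked so that positive and negative reviews sit in separate directories of .txt files named pos/ and neg/ (the directory names are an assumption about the local setup, not specified by the slide):

```python
# Sketch: load movie reviews from two directories and attach class labels.

import os

def load_reviews(directory, label):
    """Read every .txt review in a directory and pair it with its label."""
    examples = []
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                examples.append((f.read(), label))
    return examples

data = load_reviews("pos", "positive") + load_reviews("neg", "negative")
print(len(data), "reviews loaded")
```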

  45. What features could one use for sentiment analysis? • Look up word in a dictionary of positive/negative terms • Frequency of words in document • Bigrams w/ negation • Curse words • !!! • (verb, POStag) • Hypothetical or negative words before some other word • (word, ‘but’)
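A sketch of a couple of the feature ideas above: unigram counts, a NOT_ marker on the word that follows a negation, and an exclamation-mark feature. The negation list and feature names are illustrative choices, not taken from the slides or the paper.

```python
# Sketch: simple bag-of-words feature extraction with basic negation handling.

from collections import Counter

NEGATIONS = {"not", "no", "never", "n't"}

def extract_features(text):
    tokens = text.lower().split()
    features = Counter()
    negated = False
    for tok in tokens:
        if tok in NEGATIONS:
            negated = True
            continue
        features["NOT_" + tok if negated else tok] += 1
        negated = False                 # only mark the word right after a negation
    features["HAS_EXCLAMATION"] = int("!" in text)
    return features

print(extract_features("this movie was not good , but the music was great !"))
```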

  46. Compare performance of different machine learning algorithms • Algorithms • Naïve Bayes (we’ll see this later) • Maximum Entropy (we’ll see this later) • Support Vector Machine • When developing a machine learning system, you should always have a baseline • Baseline: use a simple method to perform task • The system you develop should perform better than baseline

  47. Baseline system • Asked 2 graduate students to think of words that they thought would be indicative of a negative or positive movie review • Human 1 • Positive: dazzling, brilliant, phenomenal, excellent, fantastic • Negative: suck, terrible, awful, unwatchable, hideous • Human 2 • Positive: gripping, mesmerizing, riveting, spectacular, cool, awesome, thrilling, badass, excellent, moving, exciting • Negative: bad, cliched, sucks, boring, stupid, slow

  48. Baseline system • Given a movie review, count how many (token) occurrences of “positive” and “negative” words are in the review • Classify a document as Positive or Negative depending on whether it had more “positive” or “negative” words • Classify as Tie if equal number • Comparing against gold standard, score each review as Correctly classified, Incorrect, or Tie
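A sketch of this baseline using Human 1's word lists from the previous slide; the tokenization here is simplistic (whitespace split) and the example inputs are made up.

```python
# Sketch: baseline sentiment classifier that counts positive vs. negative word tokens.

POSITIVE = {"dazzling", "brilliant", "phenomenal", "excellent", "fantastic"}
NEGATIVE = {"suck", "terrible", "awful", "unwatchable", "hideous"}

def baseline_classify(review):
    tokens = review.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Tie"

print(baseline_classify("an excellent , brilliant film"))      # Positive
print(baseline_classify("simply terrible and unwatchable"))    # Negative
print(baseline_classify("a film about sheep dogs"))            # Tie
```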

  49. Baseline results • Doesn’t do so well

  50. Questions about baseline system • Why is accuracy so low? • Could have negation • List of words is very short • Not enough features (e.g. “good”, “great”) • Why are there so many ties? • Didn’t account for negation (“not” + word) • With short word lists, many reviews don’t have any occurrences of any of those words → leads to ties
