
I256 Applied Natural Language Processing Fall 2009

Learn about defining classes, labeling text, extracting features, and choosing classifiers such as Naive Bayes, Maximum Entropy, and neural networks. Understand algorithms for binary and multi-class classification that assign labels to entities, and the importance of good object representations and data types in machine learning, with real-world examples such as fraud detection and sentiment classification.


Presentation Transcript


  1. I256 Applied Natural Language Processing, Fall 2009. Lecture 11: Classification. Barbara Rosario

  2. Announcements • Next Thursday: project ideas • Assignment 3 was due today • Solutions and grades for assignment 2 by end of this week • Assignment 4 out; due in 2 weeks (October 20) • Project proposals (1, 2 pages) due October 15 (more on this on Thursday)

  3. Classification • Define classes/categories • Label text • Extract features • Choose a classifier • Naive Bayes Classifier • NN (i.e. perceptron) • Maximum Entropy • …. • Train it • Use it to classify new examples
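
As a concrete (and purely illustrative) sketch of this workflow using NLTK's Naive Bayes classifier; the toy documents, labels, and bag-of-words feature function below are assumptions made for the example, not material from the lecture:

```python
import nltk

# Toy labeled text: (document, class) pairs -- hypothetical data for illustration
labeled_docs = [("great movie, loved it", "pos"),
                ("terrible plot, boring", "neg"),
                ("wonderful acting", "pos"),
                ("awful, a waste of time", "neg")]

def extract_features(text):
    # Bag-of-words features: presence of each lowercase token
    return {"contains(%s)" % w: True for w in text.lower().split()}

# Label text + extract features -> training set of (features, label) pairs
train_set = [(extract_features(doc), label) for doc, label in labeled_docs]

# Choose a classifier and train it
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Use it to classify a new example
print(classifier.classify(extract_features("loved the acting")))
```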

  4. Today • Algorithms for Classification • Binary classification • Linear and not linear • Multi-Class classification • Linear and not linear • Examples of classification algorithms for each case

  5. Classification • In classification problems, each entity in some domain can be placed in one of a discrete set of categories: yes/no, friend/foe, good/bad/indifferent, blue/red/green, etc. • Given a training set of labeled entities, develop a “rule” for assigning labels to entities in a test set • Many variations on this theme: • binary classification • multi-category classification • non-exclusive categories • ranking • Many criteria to assess rules and their predictions • overall errors • costs associated with different kinds of errors

  6. Algorithms for Classification • It's possible to treat these learning methods as black boxes. • But there's a lot to be learned from taking a closer look • An understanding of these methods can help guide our choice for the appropriate learning method (binary or multi-class for example)

  7. Representation of Objects • Each object to be classified is represented as a pair (x, y): • where x is a description of the object • where y is a label • Success or failure of a machine learning classifier often depends on choosing good descriptions of objects • the choice of description can also be viewed as a learning problem (feature selection) • but good human intuitions are often needed here
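
For example, a spam-filtering object might be represented as the pair (x, y) below; the particular features are hypothetical choices meant only to illustrate that x is a human-designed description and y is the label:

```python
# x: a description of the object -- here, hand-chosen features of one email (hypothetical)
x = {"contains(free)": True,
     "contains(meeting)": False,
     "num_exclamation_marks": 3,
     "all_caps_subject": True}

# y: the label assigned to that object
y = "spam"

# The classified object is simply the pair (x, y)
example = (x, y)
```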

  8. Data Types • text and hypertext (example: the raw HTML source of a web page, e.g. a “Welcome to FairmontNET” page with <head>/<style> declarations, a nested <table> layout, and <img> tags for decorative graphics)

  9. Data Types • network layout: graph • And many many others….

  10. Example: Digit Recognition • Input: images / pixel grids • Output: a digit 0-9 • Setup: get a large collection of example images, each labeled with a digit (note: someone has to hand-label all this data); we want to learn to predict labels of new, future digit images • Features: the attributes used to make the digit decision, e.g. Pixels: (6,8)=ON; Shape patterns: NumComponents, AspectRatio, NumLoops, … • Current state of the art: human-level performance
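
A rough sketch of the kind of pixel features mentioned above; the intensity threshold and the tiny grid are assumptions made for illustration:

```python
def pixel_features(grid, threshold=0.5):
    """Turn an image (2D list of intensities in [0, 1]) into a feature dictionary.

    A pixel is treated as ON when its intensity exceeds the (hypothetical) threshold.
    """
    feats = {}
    for i, row in enumerate(grid):
        for j, value in enumerate(row):
            feats["pixel(%d,%d)=ON" % (i, j)] = value > threshold
    return feats

# Tiny 3x3 "image" standing in for a real pixel grid
print(pixel_features([[0.0, 0.9, 0.0],
                      [0.0, 0.8, 0.0],
                      [0.0, 0.7, 0.0]]))
```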

  11. Text Classification tasks • Assign the correct class label for a given input/object • In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance • Examples:
  Problem | Object | Label categories
  Tagging | Word | POS
  Sense disambiguation | Word | The word's senses
  Information retrieval | Document | Relevant / not relevant
  Sentiment classification | Document | Positive / negative
  Text categorization | Document | Topics/classes
  Author identification | Document | Authors
  Language identification | Document | Language
  Adapted from: Foundations of Statistical NLP (Manning et al)

  12. Other Examples of Real-World Classification Tasks • Fraud detection (input: account activity; classes: fraud / no fraud) • Web page spam detection (input: HTML/rendered page; classes: spam / ham) • Speech recognition and speaker recognition (input: waveform; classes: phonemes or words) • Medical diagnosis (input: symptoms; classes: diseases) • Automatic essay grading (input: document; classes: grades) • Customer service email routing and foldering • Link prediction in social networks • … many many more • Classification is an important commercial technology

  13. Training and Validation • Data: labeled instances, e.g. emails marked spam/ham, divided into a training set, a validation set, and a test set • Training: estimate parameters on the training set; tune features on the validation set; report results on the test set; anything short of this yields over-optimistic claims • Evaluation: many different metrics; ideally, the criteria used to train the classifier should be closely related to those used to evaluate it • Statistical issues: we want a classifier that does well on test data; overfitting means fitting the training data very closely but not generalizing well; error bars: we want realistic (conservative) estimates of accuracy
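
A minimal sketch of the three-way split described on this slide, assuming the labeled instances have already been shuffled; the 80/10/10 proportions are an illustrative choice, not from the slides:

```python
def split_data(labeled_instances, train_frac=0.8, val_frac=0.1):
    """Split shuffled labeled data into training, validation, and test sets."""
    n = len(labeled_instances)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = labeled_instances[:n_train]                       # estimate parameters here
    validation = labeled_instances[n_train:n_train + n_val]   # tune features here
    test = labeled_instances[n_train + n_val:]                # report results here, once
    return train, validation, test
```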

  14. Intuitive Picture of the Problem (figure: data points from two classes, Class1 and Class2)

  15. Some Issues • There may be a simple separator (e.g., a straight line in 2D or a hyperplane in general) or there may not • There may be “noise” of various kinds • There may be “overlap” • Some classifiers explicitly represent separators (e.g., straight lines), while for other classifiers the separation is done implicitly • Some classifiers just make a decision as to which class an object is in; others estimate class probabilities

  16. Methods • Binary vs. multi class classification • Linear vs. non linear

  17. Methods • Linear Models: • Perceptron & Winnow (neural networks) • Large margin classifier • Support Vector Machine (SVM) • Probabilistic models: • Naïve Bayes • Maximum Entropy Models • Decision Models: • Decision Trees • Instance-based methods: • Nearest neighbor

  18. Binary Classification: examples • Spam filtering (spam, not spam) • Customer service message classification (urgent vs. not urgent) • Information retrieval (relevant, not relevant) • Sentiment classification (positive, negative) • Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class
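
The one-versus-all reduction in the last bullet can be sketched as follows; train_binary and score stand for any binary learner and its real-valued decision score (for example the perceptron introduced later) and are hypothetical names used only for this sketch:

```python
def train_one_vs_rest(labeled_data, classes, train_binary):
    """Train one binary classifier per class: that class (+1) versus all the others (-1)."""
    classifiers = {}
    for c in classes:
        binary_data = [(x, +1 if y == c else -1) for x, y in labeled_data]
        classifiers[c] = train_binary(binary_data)
    return classifiers

def predict_one_vs_rest(classifiers, score, x):
    """Predict the class whose binary classifier scores x the highest."""
    return max(classifiers, key=lambda c: score(classifiers[c], x))
```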

  19. Binary Classification • Given: some data items that belong to a positive (+1) or a negative (-1) class • Task: Train the classifier and predict the class for a new data item • Geometrically: find a separator

  20. Linear versus Non Linear algorithms • Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary

  21. Linearly separable data (figure: Class1 and Class2 separated by a linear decision boundary)

  22. Non linearly separable data (figure: Class1 and Class2 that cannot be separated by a straight line)

  23. Non linearly separable data (figure: Class1 and Class2 separated by a non-linear classifier's curved decision boundary)

  24. Linear versus Non Linear algorithms • Linear or non linear separable data? • We can find out only empirically • Linear algorithms (algorithms that find a linear decision boundary) • When we think the data is linearly separable • Advantages • Simpler, fewer parameters • Disadvantages • High dimensional data (like for NLP) is usually not linearly separable • Examples: Perceptron, Winnow, large margin • Note: we can use linear algorithms also for non linear problems (see Kernel methods)

  25. Linear versus Non Linear algorithms • Non Linear algorithms • When the data is non linearly separable • Advantages • More accurate • Disadvantages • More complicated, more parameters • Example: Kernel methods • Note: the distinction between linear and non linear applies also for multi-class classification (we’ll see this later)

  26. Simple linear algorithms • Perceptron and Winnow algorithm • Linear • Binary classification • Online (process data sequentially, one data point at a time) • Mistake driven • Simple single-layer Neural Networks

  27. Linear binary classification • Data: {(xi, yi)} for i = 1...n, where xi in R^d (a vector in d-dimensional space: the feature vector) and yi in {-1, +1} (the label: class, category) • Question: design a linear decision boundary w·x + b (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error • Classification rule: y = sign(w·x + b), which means: if w·x + b > 0 then y = +1; if w·x + b < 0 then y = -1 • From Gert Lanckriet, Statistical Learning Theory Tutorial
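
The classification rule y = sign(w·x + b) written out in code, with made-up weights and points purely for illustration:

```python
import numpy as np

def decision(x, w, b):
    # y = sign(w . x + b): +1 if w.x + b > 0, otherwise -1
    return 1 if np.dot(w, x) + b > 0 else -1

# Hypothetical 2-dimensional example
w = np.array([2.0, -1.0])
b = 0.5
print(decision(np.array([1.0, 1.0]), w, b))   # 2 - 1 + 0.5 > 0  -> +1
print(decision(np.array([-1.0, 2.0]), w, b))  # -2 - 2 + 0.5 < 0 -> -1
```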

  28. Linear binary classification • Find a good hyperplane (w, b) in R^(d+1) that correctly classifies as many data points as possible • In online fashion: one data point at a time, updating the weights as necessary • Hyperplane: w·x + b = 0; classification rule: y = sign(w·x + b) • From Gert Lanckriet, Statistical Learning Theory Tutorial

  29. Perceptron algorithm • Initialize: w1 = 0 • Updating rule, for each data point x: if class(x) != decision(x, w) then wk+1 ← wk + yi·xi and k ← k + 1; else wk+1 ← wk • Function decision(x, w): if w·x + b > 0 return +1, else return -1 • (figure: the hyperplane wk·x + b = 0 rotating toward wk+1·x + b = 0 after a mistake) • From Gert Lanckriet, Statistical Learning Theory Tutorial
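
A small NumPy sketch of this mistake-driven update; folding a bias update b ← b + yi into the loop, and also updating on points that fall exactly on the boundary, are implementation choices not spelled out on the slide:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Mistake-driven perceptron: on each error, w <- w + y_i * x_i (and b <- b + y_i)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):                  # online: one data point at a time
            if yi * (np.dot(w, xi) + b) <= 0:     # misclassified (or on the boundary)
                w = w + yi * xi
                b = b + yi
    return w, b

# Tiny linearly separable toy problem (illustrative)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X.dot(w) + b))  # matches y once the algorithm has converged
```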

  30. Perceptron algorithm • Online: can adjust to changing target, over time • Advantages • Simple and computationally efficient • Guaranteed to learn a linearly separable problem (convergence, global optimum) • Limitations • Only linear separations • Only converges for linearly separable data • Not really “efficient with many features” • From Gert Lanckriet, Statistical Learning Theory Tutorial

  31. Winnow algorithm • Another online algorithm for learning perceptron weights: f(x) = sign(wx + b) • Linear, binary classification • Update-rule: again error-driven, but multiplicative (instead of additive)

  32. Winnow algorithm • Initialize: w1 = 0 • Updating rule, for each data point x: if class(x) != decision(x, w) then wk+1 ← wk + yi·xi (Perceptron) versus wk+1 ← wk · exp(yi·xi) (Winnow), and k ← k + 1; else wk+1 ← wk • Function decision(x, w): if w·x + b > 0 return +1, else return -1 • (figure: as for the perceptron, the hyperplane wk·x + b = 0 moves to wk+1·x + b = 0 after a mistake) • From Gert Lanckriet, Statistical Learning Theory Tutorial
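
A sketch contrasting the additive perceptron step with the multiplicative Winnow step shown on this slide; initializing the Winnow weights to ones (so that multiplication can actually change them) and the learning rate eta are implementation assumptions, not from the slide:

```python
import numpy as np

def perceptron_update(w, xi, yi):
    # Additive, mistake-driven update: w <- w + y_i * x_i
    return w + yi * xi

def winnow_update(w, xi, yi, eta=1.0):
    # Multiplicative, mistake-driven update: w <- w * exp(eta * y_i * x_i), element-wise
    return w * np.exp(eta * yi * xi)

# One illustrative mistake on a hypothetical binary feature vector
xi = np.array([1.0, 0.0, 1.0])
print(perceptron_update(np.zeros(3), xi, yi=+1))  # [1. 0. 1.]
print(winnow_update(np.ones(3), xi, yi=+1))       # active features boosted by e, others unchanged
```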

  33. Perceptron vs. Winnow • Assume N available features, of which only K are relevant, with K << N • Perceptron: number of mistakes O(K·N) • Winnow: number of mistakes O(K·log N) • Winnow is more robust to high-dimensional feature spaces • Often used in NLP • From Gert Lanckriet, Statistical Learning Theory Tutorial

  34. Perceptron vs. Winnow • Perceptron: online (can adjust to a changing target over time); advantages: simple and computationally efficient, guaranteed to learn a linearly separable problem; limitations: only linear separations, only converges for linearly separable data, not really “efficient with many features” • Winnow: online (can adjust to a changing target over time); advantages: simple and computationally efficient, guaranteed to learn a linearly separable problem, suitable for problems with many irrelevant attributes; limitations: only linear separations, only converges for linearly separable data, not really “efficient with many features”; used in NLP

  35. Large margin classifier • Another family of linear algorithms • Intuition (Vapnik, 1965): if the classes are linearly separable, separate the data and place the hyperplane “far” from the data: large margin • Statistical results guarantee good generalization • (figure: a separating hyperplane that passes close to the data points: BAD) • From Gert Lanckriet, Statistical Learning Theory Tutorial

  36. Large margin classifier • Intuition (Vapnik, 1965): if linearly separable, separate the data and place the hyperplane “far” from the data: large margin • Statistical results guarantee good generalization • (figure: a separating hyperplane with a large margin on both sides: GOOD; this is the Maximal Margin Classifier) • From Gert Lanckriet, Statistical Learning Theory Tutorial
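
To make “far from the data” concrete: the geometric margin of a hyperplane (w, b) on a labeled set is the smallest value of yi·(w·xi + b) / ||w|| over the training points, and a maximal margin classifier picks the separating hyperplane that maximizes it. A small illustrative check on toy data (the data and the two candidate hyperplanes are made up):

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """min_i y_i * (w . x_i + b) / ||w||: positive only if (w, b) separates the data."""
    return np.min(y * (X.dot(w) + b)) / np.linalg.norm(w)

# Two separating hyperplanes for the same toy data: the larger margin is the "GOOD" one
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(np.array([1.0, 1.0]), 0.0, X, y))  # ~2.12: large margin
print(geometric_margin(np.array([3.0, 0.0]), 0.0, X, y))  # 1.0: separates, but smaller margin
```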

  37. Large margin classifier If not linearly separable • Allow some errors • Still, try to place hyperplane “far” from each class • From Gert Lanckriet, Statistical Learning Theory Tutorial

  38. Large Margin Classifiers • Advantages • Theoretically better (better error bounds) • Limitations • Computationally more expensive: training requires solving a large quadratic programming problem

  39. Non Linear problem

  40. Non Linear problem

  41. Non Linear problem • Kernel methods • A family of non-linear algorithms • Transform the non linear problem into a linear one (in a different feature space) • Use linear algorithms to solve the linear problem in the new space • From Gert Lanckriet, Statistical Learning Theory Tutorial

  42. Main intuition of Kernel methods • (Copy here from blackboard)

  43. Basic principle of kernel methods • Φ : R^d → R^D (D >> d) • Example: x = [x z], Φ(x) = [x² z² xz] • Decision boundary in the new space: wᵀΦ(x) + b = 0, i.e. f(x) = sign(w1·x² + w2·z² + w3·xz + b) • From Gert Lanckriet, Statistical Learning Theory Tutorial

  44. Basic principle of kernel methods • Linear separability: more likely in high dimensions • Mapping: Φ maps the input into a high-dimensional feature space • Classifier: construct a linear classifier in that high-dimensional feature space • Motivation: an appropriate choice of Φ leads to linear separability • We can do this efficiently! • From Gert Lanckriet, Statistical Learning Theory Tutorial

  45. Basic principle kernel methods • We can use the linear algorithms seen before (for example, perceptron) for classification in the higher dimensional space
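
A self-contained sketch combining slide 43's explicit map Φ(x) = [x², z², xz] with the perceptron from slide 29; a real kernel method would avoid computing Φ explicitly, and the “inside vs. outside a circle” data below is invented for illustration:

```python
import numpy as np

def phi(p):
    x, z = p
    return np.array([x * x, z * z, x * z])   # slide 43's example map: R^2 -> R^3

def train_perceptron(X, y, epochs=20):
    # Same mistake-driven update as slide 29, repeated here to keep the sketch self-contained
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:
                w, b = w + yi * xi, b + yi
    return w, b

# Points near the origin (+1) vs. far from it (-1): not linearly separable in R^2
X = np.array([[0.1, 0.2], [-0.2, 0.1], [0.2, -0.1], [1.5, 1.5], [-1.5, 1.4], [1.4, -1.6]])
y = np.array([1, 1, 1, -1, -1, -1])

X_mapped = np.array([phi(p) for p in X])     # in the new space a linear separator exists
w, b = train_perceptron(X_mapped, y)
print(np.sign(X_mapped.dot(w) + b))          # reproduces y
```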

  46. Multi-class classification • Given: some data items that belong to one of M possible classes • Task: Train the classifier and predict the class for a new data item • Geometrically: harder problem, no more simple geometry

  47. Multi-class classification

  48. Multi-class classification: Examples • Author identification • Language identification • Text categorization (topics)

  49. (Some) Algorithms for multi-class classification • Linear • Parallel class separators: Decision Trees • Non parallel class separators: Naïve Bayes and Maximum Entropy • Non Linear • K-nearest neighbors
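
A minimal sketch of K-nearest neighbors, the non-linear method in the last bullet; Euclidean distance and majority voting are the usual (here assumed) choices, and the toy data is illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Label x by a majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Tiny illustrative 3-class problem
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.2, 4.8]])
y_train = ["a", "a", "b", "b", "c", "c"]
print(knn_classify(np.array([0.1, 0.1]), X_train, y_train))   # -> "a"
```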

  50. Linear, parallel class separators (ex: Decision Trees)
