Text Classification

Text Classification Slides by Tom Mitchell (NB), William Cohen (kNN), Ray Mooney and others at UT-Austin, me

Outline • Problem definition and applications • Very Quick Intro to Machine Learning and Classification • Learning bounds • Bias-variance tradeoff, No free lunch theorem • Maximum Entropy Models • Other Classification Techniques • Representations • Vector Space Model (and variations) • Feature Selection • Dimensionality Reduction • Representations and independence assumptions • Sparsity and smoothing

Spam or not Spam? • Most people who’ve ever used email have developed a hatred of spam • In the days before Gmail (and still today), you could get hundreds of spam messages per day. • “Spam Filters” were developed to automatically classify, with high (but not perfect) accuracy, which messages are spam and which aren’t.

Terminology and Definitions Let D be an input space, e.g., the space of all possible English documents. Let C be an output space, e.g., {S, N} for Spam and Not-Spam. Let F be the space of all possible functions f: D → C. A hypothesis space for D and C is any subset H of F.

Loss Function: Measuring “Accuracy” A loss function is a function L: H × D × C → [0,1]. Given a hypothesis h, document d, and class c, L(h,d,c) returns the error or loss of h when making a prediction on d. Simple Example: L(h,d,c) = 0 if h(d)=c, and 1 otherwise. This is called 0-1 loss.
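As a minimal sketch of 0-1 loss, the snippet below scores one hypothesis on a handful of labeled documents. The hypothesis `contains_free`, the helper `average_loss`, and the four example documents are illustrative assumptions, not part of the slides.

```python
def zero_one_loss(h, d, c):
    """Return 0 if hypothesis h predicts class c for document d, else 1."""
    return 0 if h(d) == c else 1


def average_loss(h, examples):
    """Mean 0-1 loss of h over (document, class) pairs, i.e. its error rate."""
    return sum(zero_one_loss(h, d, c) for d, c in examples) / len(examples)


def contains_free(d):
    """Hypothetical hypothesis: call a document spam (S) if it mentions 'free'."""
    return "S" if "free" in d.lower().split() else "N"


# Made-up labeled documents, used only for illustration.
examples = [
    ("free credit score click here", "S"),
    ("meeting moved to noon", "N"),
    ("feel free to call me", "N"),
    ("win money now", "S"),
]

error_rate = average_loss(contains_free, examples)
print(error_rate)
```

Averaging the 0-1 loss over a labeled set gives the error rate, so accuracy is one minus this value.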

Machine Learning Problem

Example Text Mining Applications • News topic classification (e.g., Google News) C={politics,sports,business,health,tech,…} • “SafeSearch” filtering C={pornography, not pornography} • Language classification C={English,Spanish,Chinese,…} • Sentiment classification C={positive review,negative review} • Email sorting C={spam,meeting reminders,invitations, …} – user-defined!

Concrete Example Let C = {“Spam”, “Not Spam”} or {S,N}. Let H be the set of conjunctive rules, like: “if document d contains ‘free credit score’ AND ‘click here’ → Spam”

A Simple Learning Algorithm • Pick a class c (S or N) • Find the term t that correlates best with c • Construct a rule r: “If d contains t → c” • Repeatedly find more terms that correlate with c • Add the new terms to r, until the accuracy stops improving on the training data.
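The steps above can be sketched as a greedy learner for conjunctive rules. This is only one possible reading: the slides do not say how "correlates best" is measured, so the sketch assumes it means "gives the highest training accuracy when added to the rule". The function names and the toy training set are hypothetical.

```python
def rule_predicts(terms, target, other, doc):
    """Conjunctive rule: predict `target` only if doc contains every term."""
    words = set(doc.lower().split())
    return target if all(t in words for t in terms) else other


def accuracy(terms, target, other, data):
    """Fraction of (document, class) pairs the rule classifies correctly."""
    hits = sum(rule_predicts(terms, target, other, d) == c for d, c in data)
    return hits / len(data)


def learn_rule(data, target="S", other="N"):
    """Greedily add terms to the rule until training accuracy stops improving."""
    vocab = sorted({w for d, _ in data for w in d.lower().split()})
    terms, best = [], -1.0
    while True:
        # Assumption: "correlates best" = best training accuracy of the extended rule.
        scored = [(accuracy(terms + [t], target, other, data), t)
                  for t in vocab if t not in terms]
        if not scored:
            break
        acc, term = max(scored, key=lambda s: s[0])
        if acc <= best:  # no improvement on the training data: stop
            break
        terms.append(term)
        best = acc
    return terms, best


# Made-up training documents, used only for illustration.
train = [
    ("free credit score click here", "S"),
    ("click here for free money", "S"),
    ("free lunch at the meeting", "N"),
    ("click here to see the agenda", "N"),
    ("see you at lunch", "N"),
]

terms, train_accuracy = learn_rule(train)
print(terms, train_accuracy)
```

On this toy data the learner ends with a two-term conjunction, because neither term alone separates the spam from the non-spam documents.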

4 Things Everyone Should Know About Machine Learning • Assumptions • Generalization Bounds and Occam’s Razor • Bias-Variance Tradeoff • No Free Lunch

1. Assumptions Machine learning traditionally makes two important (and often unrealistic) assumptions. • There is a probability distribution P (not necessarily known, but it’s assumed to exist) from which all examples d are drawn (training and test examples). • Each example is drawn independently from this distribution. Together, these are known as ‘i.i.d.’: independent and identically distributed.

Why are the assumptions important? Basically, it’s hard to make a prediction about a document if all of your training examples are totally different. With these assumptions, you’re saying it’s very unlikely (with enough training data) that you’ll see a test example that’s totally different from all of your training data.

2. Generalization Bounds Theorem: Generalization Bound by Vapnik-Chervonenkis: With probability 1-δ over the choice of training data, Here, v is the VC-dimension of the hypothesis space. If the hypothesis space is complex, v is big. If it’s simple, v is small.
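The formula from this slide did not survive in the text, so the sketch below uses one commonly quoted form of the Vapnik-Chervonenkis bound as an assumption: with probability at least 1-δ, test error ≤ training error + sqrt((v (ln(2n/v) + 1) + ln(4/δ)) / n), where n is the number of training examples. The function name and the numbers passed to it are illustrative only.

```python
import math


def vc_confidence_term(v, n, delta):
    """Gap term of the assumed VC bound: sqrt((v(ln(2n/v) + 1) + ln(4/delta)) / n).

    v: VC-dimension of the hypothesis space
    n: number of training examples
    delta: allowed failure probability
    """
    return math.sqrt((v * (math.log(2 * n / v) + 1) + math.log(4 / delta)) / n)


# The gap shrinks with more data and grows with a more complex hypothesis space.
small_data = vc_confidence_term(v=10, n=1000, delta=0.05)
large_data = vc_confidence_term(v=10, n=10000, delta=0.05)
complex_space = vc_confidence_term(v=100, n=10000, delta=0.05)
print(small_data, large_data, complex_space)
```

Under this form, a larger v loosens the bound for the same amount of data, which is the sense in which simpler hypothesis spaces generalize more reliably.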

2. Bounds and Occam’s Razor Occam’s Razor: All other things being equal, the simplest explanation is the best. Generalization bounds lend some theoretical credence to this old rule-of-thumb.

3. Bias and Variance • Bias: The built-in tendency of a learning machine or hypothesis class to find a hypothesis in a pre-determined region of the space of all possible classifiers. e.g., our rule hypotheses are biased towards axis-parallel lines • Variance: The degree to which a learning algorithm is sensitive to small changes in the training data. • If a small change in training data causes a large change in the resulting classifier, then the learning algorithm has “high variance”.

3. Bias-Variance Tradeoff As a general rule, the more biased a learning machine, the less variance it has, and the more variance it has, the less biased it is.

4. No Free Lunch Theorem Simply put, this famous theorem says: If your learning machine has no bias at all, then it’s impossible to learn anything. The proof is simple, but out of the scope of this lecture. You should check it out.

Machine Learning Techniques for NLP • NLP people tend to favor certain kinds of learning machines: • Maximum entropy (or log-linear, or logistic regression, or logit) models (gaining in popularity lately) • Bayesian networks (directed graphical models, like Naïve Bayes) • Support vector machines (but only for certain things, like text classification and information extraction)

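Hypothesis Class A maximum entropy/log-linear model (ME) is any function with this form: P(c | d) = exp(Σi λi fi(c,d)) / Z(d). “Log-linear”: If you take the log, it’s a linear function of the λi, up to the normalization term: log P(c | d) = Σi λi fi(c,d) - log Z(d). Normalization function: Z(d) = Σc′ exp(Σi λi fi(c′,d)), where c′ ranges over all classes in C.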

Feature Functions The functions fi are called feature functions (or sometimes just features). These must be defined by the person designing the learning machine. Example: fi(c,d) = [If c=S, count of how often “free” appears in d. Otherwise, 0.]

Parameters The λi are called the parameters of the model. During training, the learning algorithm tries to find the best value for the λi.
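To make feature functions and parameters concrete, here is a minimal sketch of a two-class ME model in the standard log-linear form P(c | d) = exp(Σi λi fi(c,d)) / Z(d). The first feature is the "free" example from the slides; the second feature, the parameter values, and the test document are assumptions made up for illustration.

```python
import math

CLASSES = ["S", "N"]


def f_free_spam(c, d):
    """Feature from the slides: count of 'free' in d if c is S, otherwise 0."""
    return d.lower().split().count("free") if c == "S" else 0


def f_meeting_notspam(c, d):
    """Hypothetical second feature: count of 'meeting' in d if c is N, otherwise 0."""
    return d.lower().split().count("meeting") if c == "N" else 0


FEATURES = [f_free_spam, f_meeting_notspam]


def score(lambdas, c, d):
    """Linear part of the model: sum_i lambda_i * f_i(c, d)."""
    return sum(lam * f(c, d) for lam, f in zip(lambdas, FEATURES))


def prob(lambdas, c, d):
    """P(c | d) = exp(score(c, d)) / Z(d), with Z(d) summing over all classes."""
    z = sum(math.exp(score(lambdas, k, d)) for k in CLASSES)
    return math.exp(score(lambdas, c, d)) / z


lambdas = [1.5, 2.0]  # made-up parameter values, not learned
doc = "free free offer"
p_spam = prob(lambdas, "S", doc)
print(p_spam, prob(lambdas, "N", doc))
```

With all parameters at zero, every class gets the same score, so the model falls back to the uniform distribution over classes.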

Example ME Hypothesis

Why is it “Maximum Entropy”? Before we get into how to train one of these, let’s get an idea of why people use it. The basic intuition is from Occam’s Razor: we want to find the “simplest” probability distribution P(c | d) that explains the training data. Note that this also introduces bias: we’re biasing our search towards “simple” distributions. But what makes a distribution “simple”?

Entropy Entropy is a measure of how much uncertainty is in a probability distribution. Examples: Entropy of a deterministic event: H(1,0) = -1 log 1 - 0 log 0 = (-1) * (0) - 0 = 0, taking logs base 2 and using the convention 0 log 0 = 0.

Entropy Entropy is a measure of how much uncertainty is in a probability distribution. Examples: Entropy of flipping a coin: H(1/2,1/2) = -1/2 log 1/2 – 1/2 log 1/2 = -(1/2) * (-1) - (1/2) * (-1) = 1

Entropy Entropy is a measure of how much uncertainty is in a probability distribution. Examples: Entropy of rolling a six-sided die: H(1/6,…,1/6) = -1/6 log 1/6 - … - 1/6 log 1/6 = -(1/6) * (-2.585) - … - (1/6) * (-2.585) = 2.585

Entropy Entropy of a biased coin flip: Let P(Heads) represent the probability that the biased coin lands on Heads. Maximum Entropy Setting for P(Heads): P(Heads) = P(not Heads). If event X has N possible outcomes, the maximum entropy setting for p(x1),p(x2),…,p(xN) is p(x1)=p(x2)=…=p(xN)=1/N.
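The entropy examples can be checked numerically with a short sketch. It assumes base-2 logarithms (so entropy is measured in bits) and the convention 0 log 0 = 0; the 0.9/0.1 biased coin is an arbitrary illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


deterministic = entropy([1.0, 0.0])   # no uncertainty
fair_coin = entropy([0.5, 0.5])       # one bit
fair_die = entropy([1 / 6] * 6)       # log2(6) bits
biased_coin = entropy([0.9, 0.1])     # less than the fair coin

print(deterministic, fair_coin, fair_die, biased_coin)
```

The biased coin has lower entropy than the fair coin, which matches the claim that the uniform distribution is the maximum entropy setting.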

Occam’s Razor for Distributions Given a set of empirical expectations of the form E_{<c,d> in Train}[fi(c,d)], find a distribution P(c | d) such that • it provides the same expectations (matches the training data): E_{<c,d> ~ P(c|d)}[fi(c,d)] = E_{<c,d> in Train}[fi(c,d)] • it maximizes the entropy H(P) (Occam’s Razor bias)

Theorem The maximum entropy distribution for P(c|d), subject to the constraints E_{<c,d> ~ P(c|d)}[fi(c,d)] = E_{<c,d> in Train}[fi(c,d)], must have log-linear form. Thus, max-ent models have to be log-linear models.

Training a ME model Training is an optimization problem: find the value for λ that maximizes the conditional log-likelihood (CLL) of the training data: CLL(λ) = Σ_{<c,d> in Train} log P(c | d), where P is the ME model with parameters λ.

Quiz Let’s assume that I give you some feature functions fi(d,c). • What is the hypothesis class H of MaxEnt models using the feature functions fi? • What is the loss function L?

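Training a ME model Optimization is normally performed using some form of gradient ascent (equivalently, gradient descent on the negative CLL), since the CLL is being maximized. Writing λ(t) for the parameter vector at iteration t: 0) Initialize λ(0) to 0 1) Compute the gradient: ∇CLL(λ(t)) 2) Take a step in the direction of the gradient: λ(t+1) = λ(t) + α∇CLL(λ(t)), where α is the step size 3) Repeat until CLL doesn’t improve: stop when |CLL(λ(t+1)) - CLL(λ(t))| < ε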

Gradient Descent: Geometry Graph of f(x,y) = -(cos²x + cos²y)². On the bottom plane, the gradient of f is projected as a vector field.

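Training a ME model Computing the gradient: ∂CLL/∂λi = Σ_{<c,d> in Train} [ fi(c,d) - Σc′ P(c′ | d) fi(c′,d) ], i.e., the observed value of feature fi minus its expected value under the current model.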
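A minimal training sketch follows, assuming the standard gradient of the conditional log-likelihood for a log-linear model: for each feature, the observed value minus the value expected under the current model, summed over the training set. The two features, the step size, the stopping threshold, and the six training documents are all hypothetical choices for illustration.

```python
import math

CLASSES = ["S", "N"]


def features(c, d):
    """Hypothetical features: count of 'free' if c is S, count of 'meeting' if c is N."""
    words = d.lower().split()
    return [words.count("free") if c == "S" else 0,
            words.count("meeting") if c == "N" else 0]


def prob(lam, c, d):
    """P(c | d) under the log-linear model with parameters lam."""
    scores = {k: sum(l * f for l, f in zip(lam, features(k, d))) for k in CLASSES}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[c]) / z


def cll(lam, train):
    """Conditional log-likelihood of the (document, class) training pairs."""
    return sum(math.log(prob(lam, c, d)) for d, c in train)


def gradient(lam, train):
    """Observed feature values minus values expected under the current model."""
    grad = [0.0] * len(lam)
    for d, c in train:
        observed = features(c, d)
        for i in range(len(lam)):
            expected = sum(prob(lam, k, d) * features(k, d)[i] for k in CLASSES)
            grad[i] += observed[i] - expected
    return grad


def train_maxent(train, step=0.1, eps=1e-9, max_iter=10000):
    """Gradient ascent on the CLL, stopping when the CLL changes by less than eps."""
    lam = [0.0, 0.0]
    prev = cll(lam, train)
    for _ in range(max_iter):
        g = gradient(lam, train)
        lam = [l + step * gi for l, gi in zip(lam, g)]
        cur = cll(lam, train)
        if abs(cur - prev) < eps:
            break
        prev = cur
    return lam


# Made-up training set; it is deliberately not perfectly separable,
# so the best parameters stay finite.
TRAIN = [
    ("free money", "S"),
    ("free free offer", "S"),
    ("feel free to call", "N"),
    ("meeting at noon", "N"),
    ("lunch meeting", "N"),
    ("meeting invite inside", "S"),
]

lam = train_maxent(TRAIN)
print(lam, cll(lam, TRAIN))
```

If the training data were perfectly separable by these features, the parameters would keep growing and only the stopping rule would end training, which is one motivation for regularization.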

Overfitting [Figure: error rate plotted against training iteration, with one curve for Train and one for Test]

Regularization Regularizing an objective (or loss) function is the act of penalizing certain subsets of a hypothesis class. Typically the penalty is based on prior beliefs, like simpler models are better than more complex ones.

Regularizing MaxEnt Models Add a penalty term to the objective function: CLL(λ) - α Σi |λi| (L1 Regularization) or CLL(λ) - α Σi λi² (L2 Regularization). Both L1 and L2 Regularization penalize hypotheses with weights far from zero. The α term is called the “regularization parameter”. It’s typically set using a grid search.

Training a Regularized ME model Add regularization terms to the existing gradient of the CLL: for the L2 penalty α Σi λi², the partial derivative with respect to λi becomes ∂CLL/∂λi - 2αλi. Note that when λi is positive, the contribution of the regularizer to the gradient is negative, and vice versa. (This is for L2 Regularization. For L1, the partials don’t exist at zero, so a more complicated procedure is required.)
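As a small sketch of the L2 case, assume the regularized objective is CLL(λ) - α Σi λi², so each partial derivative gains a term -2αλi. The exact scaling of the penalty on the original slide is not recoverable, so treat the factor of 2 and the function name as assumptions.

```python
def l2_regularized_gradient(cll_grad, lam, alpha):
    """Gradient of CLL(lam) - alpha * sum(l**2 for l in lam).

    cll_grad: gradient of the unregularized CLL at lam
    lam: current parameter values
    alpha: regularization parameter (assumed non-negative)
    """
    return [g - 2.0 * alpha * l for g, l in zip(cll_grad, lam)]


# With a zero CLL gradient, only the regularizer acts:
# it pushes positive weights down and negative weights up, toward zero.
reg_grad = l2_regularized_gradient([0.0, 0.0], [1.5, -2.0], alpha=0.1)
print(reg_grad)
```

Setting alpha to zero recovers the unregularized gradient, so the same training loop can serve both cases.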

Classification Techniques • Book mentions three: • Naïve Bayes • k-Nearest Neighbor • Support Vector Machines • Others (besides ME): • Rule-based systems • Decision lists (e.g., Ripper) • Decision trees (e.g. C4.5) • Perceptron and Neural Networks

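Bayes Rule P(C | D) = P(D | C) P(C) / P(D) Which is shorthand for: for every class c and document d, P(C=c | D=d) = P(D=d | C=c) P(C=c) / P(D=d)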
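A short numeric sketch of Bayes rule, P(C=c | D=d) = P(D=d | C=c) P(C=c) / P(D=d), for the spam setting. The prior and likelihood values are invented for illustration, and P(D=d) is obtained by summing P(D=d | C=c) P(C=c) over the classes.

```python
def posterior(prior, likelihood):
    """Return P(C=c | D=d) for every class c.

    prior[c] = P(C=c); likelihood[c] = P(D=d | C=c) for one fixed observation d.
    """
    evidence = sum(prior[c] * likelihood[c] for c in prior)  # P(D=d)
    return {c: prior[c] * likelihood[c] / evidence for c in prior}


# Made-up numbers: 20% of mail is spam; the observation "contains 'free'"
# occurs in 60% of spam and 5% of non-spam.
prior = {"S": 0.2, "N": 0.8}
likelihood = {"S": 0.6, "N": 0.05}

post = posterior(prior, likelihood)
print(post)
```

Even with a low prior on spam, an observation that is much more likely under S than under N can make S the more probable class.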

For code, see www.cs.cmu.edu/~tom/mlbook.html and click on “Software and Data”.
