Linear Models for Classification: Probabilistic Methods. Adapted from Seung-Joon Yi, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Recall, Linear Methods for Classification Problem Definition: Given the training data {xn,tn}, find a linear model for each class yk(x) to partition the feature space into decision regions • Deterministic Models: • Discriminant Functions • Fisher Discriminant function • Perceptron
Probabilistic Approaches for Classification • Generative Models: • Inference: model p(x|Ck) and p(Ck) • Decision: model p(Ck|x) • Discriminative Models: • Model p(Ck|x) directly • Use the functional form of the generalized linear model explicitly • Determine the parameters directly using maximum likelihood
Logistic Sigmoid Function • A simple logistic function may be defined by the formula σ(a) = 1/(1 + e−a). • Originally arose in models of population growth. • Its shape closely resembles the cumulative distribution function of a Normal random variable. • If the class-conditional densities are Normal, the posterior becomes a logistic sigmoid of a linear function of x.
Posterior probabilities can be formulated as: • 2-Class: logistic sigmoid acting on a linear function of x • K-Class: softmax transformation of a linear function of x • The parameters of the densities, as well as the class priors, can then be determined using maximum likelihood.
Probabilistic Generative Models: 2-Class • Recall, given the class-conditional densities p(x|Ck) and priors p(Ck), • the posterior can be expressed by a logistic sigmoid acting on a, • where a is known as the logit (the log odds).
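Written out, the two-class result takes the standard form (restated here for reference, as in Bishop's PRML, Ch. 4):

$$ p(C_1|\mathbf{x}) = \frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_1)\,p(C_1) + p(\mathbf{x}|C_2)\,p(C_2)} = \sigma(a), \qquad a = \ln\frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_2)\,p(C_2)}, \qquad \sigma(a) = \frac{1}{1+e^{-a}} $$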
Probabilistic Generative Models: K-Class • The posterior can be expressed by the softmax function (also called the normalized exponential), • the multi-class generalisation of the logistic sigmoid.
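In standard notation (restated for reference):

$$ p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\big(p(\mathbf{x}|C_k)\,p(C_k)\big) $$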
Probabilistic Generative Models: Gaussian Class Conditionals for 2-Class • Assume both classes share the same covariance matrix Σ. • Note: • The quadratic terms in x from the exponents cancel. • The resulting decision boundary is linear in input space. • The prior only shifts the decision boundary, i.e., the contours remain parallel.
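Concretely, the posterior reduces to a sigmoid of a linear function of x; these are the standard shared-covariance results, restated for reference:

$$ p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^{T}\mathbf{x} + w_0), \qquad \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{T}\Sigma^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^{T}\Sigma^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)} $$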
Probabilistic Generative Models: Gaussian Class Conditionals for K-classes • When the covariance matrix is shared across classes, the decision boundaries are linear. • When each class-conditional density has its own covariance matrix, the ak become quadratic functions of x, giving rise to a quadratic discriminant.
Probabilistic Generative Models: Maximum Likelihood Solution • Two classes • Given a data set {xn, tn}, n = 1, …, N, where tn = 1 denotes class C1 and tn = 0 denotes class C2
Q: Find P(C1) = π and P(C2) = 1 − π, and the parameters of the class-conditional densities p(x|Ck): μ1, μ2, and Σ
Probabilistic Generative Models: Maximum Likelihood Solution • Let P(C1) = π and P(C2) = 1 − π
Probabilistic Generative Models: Maximize the log likelihood with respect to π, μ1, μ2, and Σ.
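For completeness, the likelihood and the resulting maximum-likelihood estimates are the standard ones (e.g., PRML Sec. 4.2.2), restated here:

$$ p(\mathbf{t}\,|\,\pi,\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,\Sigma) = \prod_{n=1}^{N}\big[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1,\Sigma)\big]^{t_n}\big[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2,\Sigma)\big]^{1-t_n} $$

$$ \pi = \frac{N_1}{N}, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n} t_n\mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n} (1-t_n)\mathbf{x}_n, \qquad \Sigma = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2 $$

where Nk is the number of points in class Ck and Sk is the within-class covariance of class Ck.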
Probabilistic Generative Models: Discrete Features • Discrete (binary) feature values xi ∈ {0, 1}. • With D inputs, a general class-conditional distribution would require a table of 2^D entries, growing exponentially with the number of features. • Naïve Bayes assumption: the features are treated as independent, conditioned on the class Ck. • The resulting ak is again linear with respect to the features, as in the continuous case (see the block below).
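Under the naive Bayes assumption the class-conditional density factorizes, giving the standard Bernoulli form (restated for reference):

$$ p(\mathbf{x}|C_k) = \prod_{i=1}^{D}\mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}, \qquad a_k(\mathbf{x}) = \sum_{i=1}^{D}\big\{x_i\ln\mu_{ki} + (1-x_i)\ln(1-\mu_{ki})\big\} + \ln p(C_k) $$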
Bayes Decision Boundaries: 2D. Pattern Classification, Duda et al., p. 42
Bayes Decision Boundaries: 3D. Pattern Classification, Duda et al., p. 43
For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.
Probabilistic Generative Models: Exponential Family • Recall that the Bernoulli, binomial, multinomial, and Gaussian distributions can all be expressed in a general exponential-family form.
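That general form (with the scale parameter omitted for simplicity) is:

$$ p(\mathbf{x}|\boldsymbol{\lambda}_k) = h(\mathbf{x})\,g(\boldsymbol{\lambda}_k)\exp\big\{\boldsymbol{\lambda}_k^{T}\mathbf{u}(\mathbf{x})\big\} $$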
Probabilistic Generative Models: Exponential Family • 2 Classes: logistic function, for the subclass of distributions with u(x) = x. • K Classes: softmax function, again linear with respect to x.
Probabilistic Discriminative Models • Goal: find p(Ck|x) directly • No inference step • Discriminative training: maximize the likelihood defined through p(Ck|x) • Improves prediction performance when p(x|Ck) is poorly estimated
Fixed basis functions • Assume a fixed nonlinear transformation of the inputs • Transform the inputs using a vector of basis functions φ(x) • The resulting decision boundaries will be linear in the feature space: y(x) = wT φ(x)
Posterior probability of a class for the two-class problem: • The number of adjustable parameters (M-dimensional features, 2 classes) • 2 Gaussian class-conditional densities (generative model): • 2M parameters for the means • M(M+1)/2 parameters for the (shared) covariance matrix • Grows quadratically with M • Logistic regression (discriminative model): • M parameters for w • Grows linearly with M
Determining the parameters using the likelihood function: • Take the negative log likelihood: the cross-entropy error function • Recall, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q rather than the "true" distribution p.
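Concretely, for targets tn ∈ {0, 1} and yn = σ(wTφn), the likelihood and the resulting error function take the standard logistic-regression forms:

$$ p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1-y_n)^{1-t_n}, \qquad E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N}\big\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\big\} $$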
The gradient of the error function w.r.t. w: ∇E(w) = Σn (yn − tn) φn • The same form as in linear regression: (prediction − target) times the basis vector.
Iterative Reweighted Least Squares • Recall, linear regression models in Ch. 3: • the ML solution under the assumption of Gaussian noise leads to a closed-form solution, as a consequence of the quadratic dependence of the log likelihood on the parameter w. • Logistic regression model: • No longer a closed-form solution • But the error function is convex and has a unique minimum • An efficient iterative technique can be used: • The Newton-Raphson update to minimize a function E(w), • where H is the Hessian matrix of second derivatives of E(w)
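The update itself takes the standard Newton-Raphson form:

$$ \mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w}) $$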
Iterative reweighted least squares (Cont'd) • CASE 1: sum-of-squares error function: • the Newton-Raphson update recovers the closed-form least-squares solution in a single step. • CASE 2: cross-entropy error function: • the Newton-Raphson update must be applied iteratively, with a weighting matrix R that depends on w (iterative reweighted least squares).
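A minimal NumPy sketch of these IRLS updates, under the assumption of a design matrix Phi of basis-function values and binary targets t; the function name irls_logistic and the toy data below are illustrative, not part of the original slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=20, tol=1e-6):
    """Fit logistic-regression weights by iterative reweighted least squares.

    Phi : (N, M) design matrix of basis-function values phi_n
    t   : (N,)   binary targets in {0, 1}
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                     # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                        # diagonal of the weighting matrix R
        grad = Phi.T @ (y - t)                   # gradient: sum_n (y_n - t_n) phi_n
        H = Phi.T @ (Phi * R[:, None])           # Hessian: Phi^T R Phi
        step = np.linalg.solve(H + 1e-10 * np.eye(M), grad)  # small ridge for stability
        w_new = w - step                         # Newton-Raphson update
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Toy usage: two 1-D Gaussian classes with a bias basis function phi(x) = (1, x)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
print(irls_logistic(Phi, t))
```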
Multiclass logistic regression • Posterior probability for multiclass classification • We can use ML to determine the parameters directly. • Likelihood function using the 1-of-K coding scheme • Cross-entropy error function for multiclass classification (written out below)
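Written out with 1-of-K coded targets tnk and softmax outputs ynk, the likelihood and cross-entropy error take the standard forms:

$$ p(\mathbf{T}|\mathbf{w}_1,\dots,\mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad E(\mathbf{w}_1,\dots,\mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk} $$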
Multiclass logistic regression (Cont'd) • The derivative of the error function: ∇wj E = Σn (ynj − tnj) φn • Same form as before: the product of the error times the basis function. • The Hessian matrix is a block matrix of size MK × MK. • The IRLS algorithm can also be used for batch processing.
Generalized Linear Models • Recall, for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. • However, this is not the case for all choices of class-conditional density. • It might be worth exploring other types of discriminative probabilistic model.
Generalized Linear Model: 2 Classes • For example: for each input we evaluate an = wTφn and set the target according to a noisy threshold θ: tn = 1 if an ≥ θ, and tn = 0 otherwise.
Noisy Threshold model • The corresponding activation function, when θ is drawn from a density p(θ), is the cumulative distribution f(a) = ∫−∞..a p(θ) dθ; the illustration uses a mixture of Gaussians for p(θ).
Probit Function • Sigmoidal shape • The generalized linear model based on a probit activation function is known as probit regression.
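When p(θ) is taken to be a zero-mean, unit-variance Gaussian, the activation function is the probit function, whose standard definition is:

$$ \Phi(a) = \int_{-\infty}^{a}\mathcal{N}(\theta\,|\,0,1)\,d\theta $$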
Canonical link functions • Recall, if we take the derivative of the error function w.r.t. the parameter w, it takes the form of the error times the feature vector: • for the logistic regression model with sigmoid activation function, • and for the logistic regression model with softmax activation function. • This is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function.
Canonical link functions (Cont'd) • Consider conditional distributions of the target variable from the exponential family: p(t|η, s) = (1/s) h(t/s) g(η) exp(ηt/s). • Log likelihood: ln p(t|η, s) = Σn { ln g(ηn) + ηn tn / s } + const. • Its derivative w.r.t. w involves the factor (tn − yn) ψ′(yn) f′(an), where yn = f(wTφn). • With the canonical link function f−1(y) = ψ(y), this simplifies to ∇E(w) = (1/s) Σn (yn − tn) φn.
The Laplace approximation • Goal: find a Gaussian approximation to a non-Gaussian density, centered on the mode z0 of the distribution. • Suppose p(z) = (1/Z) f(z) is non-Gaussian, with Z the (possibly unknown) normalization constant. • Taylor-expand the logarithm of the target function around the mode z0. • The result is an approximating Gaussian distribution (see below).
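The expansion and the resulting Gaussian, in the standard univariate form:

$$ \ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2}A(z-z_0)^2, \qquad A = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0} $$

$$ q(z) = \left(\frac{A}{2\pi}\right)^{1/2}\exp\Big\{-\tfrac{1}{2}A(z-z_0)^2\Big\} = \mathcal{N}(z\,|\,z_0, A^{-1}) $$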
Laplace approximation for p(z) ∝ exp(−z²/2) σ(20z + 4) • Left: the normalized distribution p(z) in yellow, together with the Laplace approximation centered on the mode z0 of p(z) in red. • Right: the negative logarithms of the corresponding curves.
Model comparison and BIC • Laplace approximation to the normalization constant Z • This result can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison. • Consider a set of models {Mi} having parameters {θi}. • The log of the model evidence can be approximated as shown below. • A further approximation, under some additional assumptions, gives the Bayesian Information Criterion (BIC).
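The two approximations written out, in their standard form (e.g., PRML Sec. 4.4.1); here A denotes the Hessian of the negative log posterior at θMAP, M the number of parameters, and N the number of data points:

$$ \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) + \ln p(\boldsymbol{\theta}_{\mathrm{MAP}}) + \frac{M}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{A}| $$

$$ \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2}M\ln N \qquad \text{(BIC)} $$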
Bayesian Logistic Regression • Exact Bayesian inference is intractable. • Gaussian prior: • Posterior: • Log of posterior: • Laplace approximation of the posterior distribution (formulas restated below)
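Restating the standard forms these bullets refer to (as in PRML Sec. 4.5.1), with yn = σ(wTφn):

$$ p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_0, \mathbf{S}_0) $$

$$ \ln p(\mathbf{w}|\mathbf{t}) = -\tfrac{1}{2}(\mathbf{w}-\mathbf{m}_0)^{T}\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) + \sum_{n=1}^{N}\big\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\big\} + \text{const} $$

$$ q(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_N), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n(1-y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{T} $$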
Predictive distribution • Can be obtained by marginalizing with respect to the posterior distribution p(w|t), which is approximated by a Gaussian q(w): p(C1|φ, t) ≃ ∫ σ(a) p(a) da, where a = wTφ. • The distribution p(a) is the marginal of a Gaussian and is therefore also Gaussian, with mean μa = wMAP^T φ and variance σa² = φ^T SN φ.
Predictive distribution • The resulting approximation to the predictive distribution follows by integrating over a. • To do so, we make use of the close similarity between the logistic sigmoid and the probit function, which yields the closed-form approximation below.
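A sketch of that closed-form result, in its standard form (e.g., PRML Sec. 4.5.2):

$$ \int \sigma(a)\,\mathcal{N}(a\,|\,\mu_a,\sigma_a^2)\,da \simeq \sigma\big(\kappa(\sigma_a^2)\,\mu_a\big), \qquad \kappa(\sigma^2) = \big(1 + \pi\sigma^2/8\big)^{-1/2} $$

$$ p(C_1|\boldsymbol{\phi},\mathbf{t}) \simeq \sigma\big(\kappa(\sigma_a^2)\,\mu_a\big) $$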