Linear Models for Classification: Probabilistic Methods. Adapted from Seung-Joon Yi, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Recall, Linear Methods for Classification Problem Definition: Given the training data {xn,tn}, find a linear model for each class yk(x) to partition the feature space into decision regions • Deterministic Models: • Discriminant Functions • Fisher Discriminant function • Perceptron
Probabilistic Approaches for Classification • Generative Models: • Inference: model p(x|Ck) and p(Ck) • Decision: model p(Ck|x) • Discriminative Models: • Model p(Ck|x) directly • Use the functional form of the generalized linear model explicitly • Determine the parameters directly using maximum likelihood
Logistic Sigmoid Function • A simple logistic function may be defined by the formula σ(a) = 1/(1 + e−a). • Originally arose in models of population growth. • Its shape closely resembles the cumulative distribution function of a Normal random variable. • If the class-conditional densities are Normal, the posterior becomes a logistic sigmoid of a linear function of x.
Posterior probabilities can be formulated as: • 2-Class: logistic sigmoid acting on a linear function of x • K-Class: softmax transformation of a linear function of x • The parameters of the densities, as well as the class priors, can then be determined using maximum likelihood.
Probabilistic Generative Models: 2-Class • Recall, given the class-conditional densities p(x|Ck) and priors p(Ck), • the posterior can be expressed by a logistic sigmoid acting on a, • where a is known as the logit (the log odds).
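Written out, the two-class result takes the standard form (restated here for reference, as in Bishop's PRML, Ch. 4):

$$ p(C_1|\mathbf{x}) = \frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_1)\,p(C_1) + p(\mathbf{x}|C_2)\,p(C_2)} = \sigma(a), \qquad a = \ln\frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_2)\,p(C_2)}, \qquad \sigma(a) = \frac{1}{1+e^{-a}} $$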
Probabilistic Generative Models: K-Class • The posterior can be expressed by the softmax function (also called the normalized exponential), • the multi-class generalisation of the logistic sigmoid.
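In standard notation (restated for reference):

$$ p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\big(p(\mathbf{x}|C_k)\,p(C_k)\big) $$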
Probabilistic Generative Models: Gaussian Class Conditionals for 2-Class • Assume both classes share the same covariance matrix Σ. • Note: • The quadratic terms in x from the exponents cancel. • The resulting decision boundary is linear in input space. • The prior only shifts the decision boundary, i.e., the contours remain parallel.
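Concretely, the posterior reduces to a sigmoid of a linear function of x; these are the standard shared-covariance results, restated for reference:

$$ p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^{T}\mathbf{x} + w_0), \qquad \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{T}\Sigma^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^{T}\Sigma^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)} $$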
Probabilistic Generative Models: Gaussian Class Conditionals for K-classes • When the covariance matrix is shared across classes, the decision boundaries are linear. • When each class-conditional density has its own covariance matrix, the ak become quadratic functions of x, giving rise to a quadratic discriminant.
Probabilistic Generative Models: Maximum Likelihood Solution • Two classes • Given a data set {xn, tn}, n = 1, …, N, where tn = 1 denotes class C1 and tn = 0 denotes class C2
Q: Find P(C1) = π and P(C2) = 1 − π, and the parameters of the class-conditional densities p(x|Ck): μ1, μ2, and Σ
Probabilistic Generative Models: Maximum Likelihood Solution • Let P(C1) = π and P(C2) = 1 − π
Probabilistic Generative Models: Maximize the log likelihood with respect to π, μ1, μ2, and Σ.
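For completeness, the likelihood and the resulting maximum-likelihood estimates are the standard ones (e.g., PRML Sec. 4.2.2), restated here:

$$ p(\mathbf{t}\,|\,\pi,\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,\Sigma) = \prod_{n=1}^{N}\big[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1,\Sigma)\big]^{t_n}\big[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2,\Sigma)\big]^{1-t_n} $$

$$ \pi = \frac{N_1}{N}, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n} t_n\mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n} (1-t_n)\mathbf{x}_n, \qquad \Sigma = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2 $$

where Nk is the number of points in class Ck and Sk is the within-class covariance of class Ck.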
Probabilistic Generative Models: Discrete Features • Discrete (binary) feature values xi ∈ {0, 1}. • With D inputs, a general class-conditional distribution would require a table of 2^D entries, growing exponentially with the number of features. • Naïve Bayes assumption: the features are treated as independent, conditioned on the class Ck. • The resulting ak is again linear with respect to the features, as in the continuous case (see the block below).
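Under the naive Bayes assumption the class-conditional density factorizes, giving the standard Bernoulli form (restated for reference):

$$ p(\mathbf{x}|C_k) = \prod_{i=1}^{D}\mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}, \qquad a_k(\mathbf{x}) = \sum_{i=1}^{D}\big\{x_i\ln\mu_{ki} + (1-x_i)\ln(1-\mu_{ki})\big\} + \ln p(C_k) $$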
Bayes Decision Boundaries: 2D. Pattern Classification, Duda et al., p. 42
Bayes Decision Boundaries: 3D. Pattern Classification, Duda et al., p. 43
For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.
Probabilistic Generative Models: Exponential Family • Recall that the Bernoulli, binomial, multinomial, and Gaussian distributions can all be expressed in a general exponential-family form.
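That general form (with the scale parameter omitted for simplicity) is:

$$ p(\mathbf{x}|\boldsymbol{\lambda}_k) = h(\mathbf{x})\,g(\boldsymbol{\lambda}_k)\exp\big\{\boldsymbol{\lambda}_k^{T}\mathbf{u}(\mathbf{x})\big\} $$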
Probabilistic Generative Models: Exponential Family • 2 Classes: logistic function, for the subclass of distributions with u(x) = x. • K Classes: softmax function, again linear with respect to x.
Probabilistic Discriminative Models • Goal: find p(Ck|x) directly • No inference step • Discriminative training: maximize the likelihood defined through p(Ck|x) • Improves prediction performance when p(x|Ck) is poorly estimated
Fixed basis functions • Assume a fixed nonlinear transformation of the inputs • Transform the inputs using a vector of basis functions φ(x) • The resulting decision boundaries will be linear in the feature space: y(x) = wT φ(x)
Posterior probability of a class for the two-class problem: • The number of adjustable parameters (M-dimensional features, 2 classes) • 2 Gaussian class-conditional densities (generative model): • 2M parameters for the means • M(M+1)/2 parameters for the (shared) covariance matrix • Grows quadratically with M • Logistic regression (discriminative model): • M parameters for w • Grows linearly with M
Determining the parameters using the likelihood function: • Take the negative log likelihood: the cross-entropy error function • Recall, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q rather than the "true" distribution p.
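Concretely, for targets tn ∈ {0, 1} and yn = σ(wTφn), the likelihood and the resulting error function take the standard logistic-regression forms:

$$ p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1-y_n)^{1-t_n}, \qquad E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N}\big\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\big\} $$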
The gradient of the error function w.r.t. w: ∇E(w) = Σn (yn − tn) φn • The same form as in linear regression: (prediction − target) times the basis vector.
Iterative Reweighted Least Squares • Recall, linear regression models in Ch. 3: • the ML solution under the assumption of Gaussian noise leads to a closed-form solution, as a consequence of the quadratic dependence of the log likelihood on the parameter w. • Logistic regression model: • No longer a closed-form solution • But the error function is convex and has a unique minimum • An efficient iterative technique can be used: • The Newton-Raphson update to minimize a function E(w), • where H is the Hessian matrix of second derivatives of E(w)
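The update itself takes the standard Newton-Raphson form:

$$ \mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w}) $$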
Iterative reweighted least squares (Cont'd) • CASE 1: sum-of-squares error function: • the Newton-Raphson update recovers the closed-form least-squares solution in a single step. • CASE 2: cross-entropy error function: • the Newton-Raphson update must be applied iteratively, with a weighting matrix R that depends on w (iterative reweighted least squares).
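A minimal NumPy sketch of these IRLS updates, under the assumption of a design matrix Phi of basis-function values and binary targets t; the function name irls_logistic and the toy data below are illustrative, not part of the original slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=20, tol=1e-6):
    """Fit logistic-regression weights by iterative reweighted least squares.

    Phi : (N, M) design matrix of basis-function values phi_n
    t   : (N,)   binary targets in {0, 1}
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                     # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                        # diagonal of the weighting matrix R
        grad = Phi.T @ (y - t)                   # gradient: sum_n (y_n - t_n) phi_n
        H = Phi.T @ (Phi * R[:, None])           # Hessian: Phi^T R Phi
        step = np.linalg.solve(H + 1e-10 * np.eye(M), grad)  # small ridge for stability
        w_new = w - step                         # Newton-Raphson update
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Toy usage: two 1-D Gaussian classes with a bias basis function phi(x) = (1, x)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
print(irls_logistic(Phi, t))
```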
Multiclass logistic regression • Posterior probability for multiclass classification • We can use ML to determine the parameters directly. • Likelihood function using the 1-of-K coding scheme • Cross-entropy error function for multiclass classification (written out below)
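Written out with 1-of-K coded targets tnk and softmax outputs ynk, the likelihood and cross-entropy error take the standard forms:

$$ p(\mathbf{T}|\mathbf{w}_1,\dots,\mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad E(\mathbf{w}_1,\dots,\mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk} $$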
Multiclass logistic regression (Cont'd) • The derivative of the error function: ∇wj E = Σn (ynj − tnj) φn • Same form as before: the product of the error times the basis function. • The Hessian matrix is a block matrix of size MK × MK. • The IRLS algorithm can also be used for batch processing.
Generalized Linear Models • Recall, for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. • However, this is not the case for all choices of class-conditional density. • It might be worth exploring other types of discriminative probabilistic model.
Generalized Linear Model: 2 Classes • For example: for each input we evaluate an = wTφn and set the target according to a noisy threshold θ: tn = 1 if an ≥ θ, and tn = 0 otherwise.
Noisy Threshold model • The corresponding activation function, when θ is drawn from a density p(θ), is the cumulative distribution f(a) = ∫−∞..a p(θ) dθ; the illustration uses a mixture of Gaussians for p(θ).
Probit Function • Sigmoidal shape • The generalized linear model based on a probit activation function is known as probit regression.
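When p(θ) is taken to be a zero-mean, unit-variance Gaussian, the activation function is the probit function, whose standard definition is:

$$ \Phi(a) = \int_{-\infty}^{a}\mathcal{N}(\theta\,|\,0,1)\,d\theta $$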
Canonical link functions • Recall, if we take the derivative of the error function w.r.t. the parameter w, it takes the form of the error times the feature vector: • for the logistic regression model with sigmoid activation function, • and for the logistic regression model with softmax activation function. • This is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function.
Canonical link functions (Cont'd) • Consider conditional distributions of the target variable from the exponential family: p(t|η, s) = (1/s) h(t/s) g(η) exp(ηt/s). • Log likelihood: ln p(t|η, s) = Σn { ln g(ηn) + ηn tn / s } + const. • Its derivative w.r.t. w involves the factor (tn − yn) ψ′(yn) f′(an), where yn = f(wTφn). • With the canonical link function f−1(y) = ψ(y), this simplifies to ∇E(w) = (1/s) Σn (yn − tn) φn.
The Laplace approximation • Goal: find a Gaussian approximation to a non-Gaussian density, centered on the mode z0 of the distribution. • Suppose p(z) = (1/Z) f(z) is non-Gaussian, with Z the (possibly unknown) normalization constant. • Taylor-expand the logarithm of the target function around the mode z0. • The result is an approximating Gaussian distribution (see below).
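The expansion and the resulting Gaussian, in the standard univariate form:

$$ \ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2}A(z-z_0)^2, \qquad A = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0} $$

$$ q(z) = \left(\frac{A}{2\pi}\right)^{1/2}\exp\Big\{-\tfrac{1}{2}A(z-z_0)^2\Big\} = \mathcal{N}(z\,|\,z_0, A^{-1}) $$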
Laplace approximation for p(z) ∝ exp(−z²/2) σ(20z + 4) • Left: the normalized distribution p(z) in yellow, together with the Laplace approximation centered on the mode z0 of p(z) in red. • Right: the negative logarithms of the corresponding curves.
Model comparison and BIC • Laplace approximation to the normalization constant Z • This result can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison. • Consider a set of models {Mi} having parameters {θi}. • The log of the model evidence can be approximated as shown below. • A further approximation, under some additional assumptions, gives the Bayesian Information Criterion (BIC).
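The two approximations written out, in their standard form (e.g., PRML Sec. 4.4.1); here A denotes the Hessian of the negative log posterior at θMAP, M the number of parameters, and N the number of data points:

$$ \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) + \ln p(\boldsymbol{\theta}_{\mathrm{MAP}}) + \frac{M}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{A}| $$

$$ \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2}M\ln N \qquad \text{(BIC)} $$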
Bayesian Logistic Regression • Exact Bayesian inference is intractable. • Gaussian prior: • Posterior: • Log of posterior: • Laplace approximation of the posterior distribution (formulas restated below)
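Restating the standard forms these bullets refer to (as in PRML Sec. 4.5.1), with yn = σ(wTφn):

$$ p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_0, \mathbf{S}_0) $$

$$ \ln p(\mathbf{w}|\mathbf{t}) = -\tfrac{1}{2}(\mathbf{w}-\mathbf{m}_0)^{T}\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) + \sum_{n=1}^{N}\big\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\big\} + \text{const} $$

$$ q(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_N), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n(1-y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{T} $$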
Predictive distribution • Can be obtained by marginalizing with respect to the posterior distribution p(w|t), which is approximated by a Gaussian q(w): p(C1|φ, t) ≃ ∫ σ(a) p(a) da, where a = wTφ. • The distribution p(a) is the marginal of a Gaussian and is therefore also Gaussian, with mean μa = wMAP^T φ and variance σa² = φ^T SN φ.
Predictive distribution • The resulting approximation to the predictive distribution follows by integrating over a. • To do so, we make use of the close similarity between the logistic sigmoid and the probit function, which yields the closed-form approximation below.
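A sketch of that closed-form result, in its standard form (e.g., PRML Sec. 4.5.2):

$$ \int \sigma(a)\,\mathcal{N}(a\,|\,\mu_a,\sigma_a^2)\,da \simeq \sigma\big(\kappa(\sigma_a^2)\,\mu_a\big), \qquad \kappa(\sigma^2) = \big(1 + \pi\sigma^2/8\big)^{-1/2} $$

$$ p(C_1|\boldsymbol{\phi},\mathbf{t}) \simeq \sigma\big(\kappa(\sigma_a^2)\,\mu_a\big) $$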