Linear methods for regression
Hege Leite Størvold
Tuesday 12.04.05
Linear regression models
• Assume that the regression function E(Y|X) is linear
• Linear models are old tools, but they are still very useful:
  • Simple
  • Allow an easy interpretation of the regressors' effects
  • A very broad class, since the Xi's can be any functions of other variables (quantitative or qualitative)
  • Useful to understand, because most other methods are generalizations of them
Matrix notation
• X is the n × (p+1) matrix of input vectors (one row per observation, with a leading 1 for the intercept)
• y is the n-vector of outputs (labels)
• β is the (p+1)-vector of parameters
Least squares estimation
• The linear regression model has the form $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$, where the βj's are unknown parameters or coefficients.
• Typically we have a set of training data (x1,y1), …, (xn,yn) from which we want to estimate the parameters β.
• The most popular estimation method is least squares.
Linear regression and least squares
• Least squares: find the solution β̂ by minimizing the residual sum of squares (RSS): $RSS(\beta) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = (y - X\beta)^T (y - X\beta)$
• A reasonable criterion when:
  • the training samples are random, independent draws, OR
  • the yi's are conditionally independent given the xi's
Geometrical view of least squares
• Simply find the best linear fit to the data
• ei is the residual of observation i
• [Figure: least squares fit with one covariate and with two covariates]
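Solving least squares
• Derivative of a quadratic product: $\frac{d}{dx}\left[(Ax+b)^T C (Dx+e)\right] = A^T C (Dx+e) + D^T C^T (Ax+b)$
• Then, with $RSS(\beta) = (y - X\beta)^T (y - X\beta)$: $\frac{\partial RSS}{\partial \beta} = -2 X^T (y - X\beta)$
• Setting the first derivative to zero: $X^T X \beta = X^T y$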
The normal equations
• Assuming that $X^T X$ is non-singular, the normal equations give the unique least squares solution: $\hat\beta = (X^T X)^{-1} X^T y$
• Least squares predictions: $\hat y = X\hat\beta = X (X^T X)^{-1} X^T y$
• When $X^T X$ is singular, the least squares coefficients are no longer uniquely defined.
• Some kind of alternative strategy is then needed to obtain a solution:
  • Recoding and/or dropping redundant columns
  • Filtering
  • Controlling the fit by regularization
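A minimal NumPy sketch of the normal equations on simulated data. The data, the seed and the variable names are assumptions made only for illustration; the system is solved with `np.linalg.solve` rather than by forming the inverse explicitly.

```python
import numpy as np

# Simulated data (illustrative only): n observations, p inputs, plus an intercept column.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares predictions and residuals.
y_hat = X @ beta_hat
residual = y - y_hat

# The residual is orthogonal to every column of X (up to rounding error).
print(np.abs(X.T @ residual).max())
```

The printed value should be close to machine precision, which is the normal equations restated as $X^T (y - X\hat\beta) = 0$.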
Geometrical interpretation of least squares estimates
• The predicted outcomes ŷ are the orthogonal projection of y onto the column space of X (a subspace of R^n).
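Properties of least squares estimators
• If the Yi are independent, X is fixed and Var(Yi) = σ² is constant, then $Var(\hat\beta) = (X^T X)^{-1}\sigma^2$, with σ² estimated by $\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^{n} (y_i - \hat y_i)^2$
• If, in addition, Yi = f(Xi) + ε with f the linear model above and ε ~ N(0,σ²), then $\hat\beta \sim N(\beta, (X^T X)^{-1}\sigma^2)$ and $(n-p-1)\hat\sigma^2 \sim \sigma^2 \chi^2_{n-p-1}$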
Properties of the least squares solution
• To test the hypothesis that a particular coefficient βj = 0 we calculate $z_j = \hat\beta_j / (\hat\sigma \sqrt{v_j})$, where vj is the jth diagonal element of $(X^T X)^{-1}$
• Under the null hypothesis that βj = 0, zj is distributed as $t_{n-p-1}$, and hence a large absolute value of zj leads to rejection of the null hypothesis
• A (1-2α) confidence interval for βj can be formed by: $\hat\beta_j \pm z^{(1-\alpha)} \hat\sigma \sqrt{v_j}$, where $z^{(1-\alpha)}$ is the 1-α percentile of the normal distribution
• An F-test can be used to test the nullity of a vector of parameters
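The z-scores above can be computed directly from the fitted model. The sketch below is illustrative only: the function name `z_scores` and the simulated data are assumptions, and the design matrix is assumed to carry the intercept as its first column.

```python
import numpy as np

def z_scores(X, y):
    """Return (beta_hat, standard errors, z-scores) for a design matrix X
    whose first column is the intercept column of ones."""
    n, k = X.shape                                   # k = p + 1 fitted parameters
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - k))     # n - p - 1 degrees of freedom
    v = np.diag(XtX_inv)                             # v_j from the slide
    se = sigma_hat * np.sqrt(v)
    return beta_hat, se, beta_hat / se

# Simulated example: the first input has a real effect, the second has none.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)
beta_hat, se, z = z_scores(X, y)
print(np.round(z, 2))
```

A |zj| well above 2 is the usual rough indication that βj differs from zero at about the 5% level.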
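Significance of many parameters
• We may want to test many features at once
• Compare the full model M0 with k parameters to a reduced (alternative) model MA with m of the parameters from M0 (m < k)
• Use the F statistic: $F = \frac{(RSS_A - RSS_0)/(k-m)}{RSS_0/(n-k)}$, where RSS_0 and RSS_A are the residual sums of squares of the full and the reduced fit, and k counts all fitted parameters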
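A small sketch of the F statistic for nested models. It assumes that both design matrices include the intercept column and that k and m count all fitted parameters; the helper names and the simulated data are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def f_statistic(X_full, X_reduced, y):
    """F statistic for a full model (k columns) against a nested reduced model (m columns)."""
    n, k = X_full.shape
    m = X_reduced.shape[1]
    rss_full, rss_reduced = rss(X_full, y), rss(X_reduced, y)
    return ((rss_reduced - rss_full) / (k - m)) / (rss_full / (n - k))

rng = np.random.default_rng(2)
n = 100
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 1.0 + 2.0 * X_full[:, 1] + 2.0 * X_full[:, 2] + rng.normal(size=n)
X_reduced = X_full[:, [0, 3]]        # drops the two inputs that actually matter
F = f_statistic(X_full, X_reduced, y)
print(F)
```

Under the null hypothesis that the reduced model is correct (and with Gaussian errors), F would be compared with an F distribution with k-m and n-k degrees of freedom.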
Example: Prostate cancer
• Response: level of prostate-specific antigen
• Regressors: 8 clinical measures in men receiving a prostatectomy
• Results from linear fit:
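Gram-Schmidt procedure
• Initialize z0 = x0 = 1
• For j = 1 to p:
  • For k = 0 to j-1, regress xj on the zk's to get the univariate least squares estimates $\hat\gamma_{kj} = \langle z_k, x_j \rangle / \langle z_k, z_k \rangle$
  • Then compute the next residual $z_j = x_j - \sum_{k=0}^{j-1} \hat\gamma_{kj} z_k$
• Let Z = [z0 z1 … zp] and let Γ be upper triangular with entries $\hat\gamma_{kj}$. Then $X = Z\Gamma = Z D^{-1} D \Gamma = QR$, where D is diagonal with Djj = ||zj||
• Computational cost: O(np²)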
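The procedure above can be sketched in a few lines of NumPy. This is the classical (not the numerically more robust modified) Gram-Schmidt variant; the function name and the simulated data are assumptions for illustration.

```python
import numpy as np

def gram_schmidt(X):
    """Successive orthogonalization of the columns of X.
    Returns Z (orthogonal columns) and unit upper-triangular Gamma with X = Z @ Gamma."""
    n, q = X.shape
    Z = np.zeros((n, q))
    Gamma = np.eye(q)
    for j in range(q):
        z = X[:, j].astype(float)          # astype returns a copy, so X is not modified
        for k in range(j):
            # Univariate regression coefficient of x_j on z_k.
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            z -= Gamma[k, j] * Z[:, k]
        Z[:, j] = z                        # residual of x_j after removing z_0..z_{j-1}
    return Z, Gamma

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

Z, Gamma = gram_schmidt(X)
# Regressing y on the last residual z_p gives the multiple regression coefficient of x_p.
beta_p = (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])
```

The last line illustrates why the procedure is useful for interpretation: the coefficient of xp in the full regression is the univariate coefficient of y on the part of xp that is not explained by the other inputs.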
Technique for multiple regression
• Computing $\hat\beta = (X^T X)^{-1} X^T y$ directly has poor numerical properties
• QR decomposition of X: decompose X = QR, where
  • Q is an n × (p+1) matrix with orthonormal columns ($Q^T Q = I_{p+1}$)
  • R is a (p+1) × (p+1) upper triangular matrix
• Then
  1) Compute $Q^T y$
  2) Solve $R\hat\beta = Q^T y$ by back-substitution
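The two steps can be sketched with `np.linalg.qr`, whose default mode returns the reduced factorization with Q of size n × (p+1). The explicit back-substitution loop is written out only to mirror the slide; the function name and the data are illustrative assumptions.

```python
import numpy as np

def lstsq_qr(X, y):
    """Least squares via the reduced QR decomposition X = QR."""
    Q, R = np.linalg.qr(X)          # Q: n x (p+1), R: (p+1) x (p+1) upper triangular
    b = Q.T @ y                     # step 1: compute Q^T y
    q = R.shape[0]
    beta = np.zeros(q)
    for i in range(q - 1, -1, -1):  # step 2: back-substitution, last row first
        beta[i] = (b[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta

rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)
beta_qr = lstsq_qr(X, y)
```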
Multiple outputs
• Suppose we want to predict multiple outputs Y1,Y2,…,YK from multiple inputs X1,X2,…,Xp. Assume a linear model for each output: $Y_k = \beta_{0k} + \sum_{j=1}^{p} X_j \beta_{jk} + \varepsilon_k$
• With n training cases the model can be written in matrix notation as Y = XB + E, where
  • Y is the n × K response matrix, with ik entry yik
  • X is the n × (p+1) input matrix
  • B is the (p+1) × K matrix of parameters
  • E is the n × K matrix of errors
Multiple outputs cont.
• A straightforward generalization of the univariate loss function is $RSS(B) = \sum_{k=1}^{K} \sum_{i=1}^{n} (y_{ik} - f_k(x_i))^2 = \mathrm{tr}\left[(Y - XB)^T (Y - XB)\right]$
• The least squares estimates have the same form as before, $\hat B = (X^T X)^{-1} X^T Y$: the coefficients for the k'th outcome are just the least squares estimates of the single-output regression of yk on x0,x1,…,xp
• If the errors ε = (ε1,…,εK) are correlated, a modified model might be more appropriate (details in textbook)
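A short check, on simulated data, that the multi-output estimate decouples into K separate single-output regressions. All names and numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, K = 40, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B_true = rng.normal(size=(p + 1, K))
Y = X @ B_true + 0.1 * rng.normal(size=(n, K))

# Joint estimate: B_hat = (X^T X)^{-1} X^T Y, solved for all K outputs at once.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# One single-output regression per column of Y.
B_cols = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, k]) for k in range(K)]
)
print(np.abs(B_hat - B_cols).max())
```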
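Why least squares?
• Gauss-Markov theorem: the least squares estimates have the smallest variance among all linear unbiased estimates
• However, there may exist a biased estimator with lower mean squared error: $MSE(\tilde\theta) = E(\tilde\theta - \theta)^2 = Var(\tilde\theta) + [E(\tilde\theta) - \theta]^2$, where the second (squared bias) term is zero for least squares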
Subset selection and coefficient shrinkage
• Two reasons why we are not happy with least squares:
  • Prediction accuracy: LS estimates often provide predictions with low bias, but high variance.
  • Interpretation: when the number of regressors is too high, the model is difficult to interpret. One seeks a smaller set of regressors with the strongest effects.
• We will consider a number of approaches to variable selection and coefficient shrinkage.
Subset selection and shrinkage: Motivation
• Bias-variance trade-off:
  • Goal: choose a model to minimize error
  • Method: sacrifice a little bit of bias to reduce the variance
• Better interpretation: find the strongest factors from the input space
Subset selection
• Goal: to eliminate unnecessary variables from the model.
• We will consider three approaches:
  • Best subset regression: choose the subset of size k that gives the lowest RSS
  • Forward stepwise selection: continually add the feature with the largest F-ratio
  • Backward stepwise selection: remove features with a small F-ratio
• Forward and backward stepwise selection are greedy techniques: they are not guaranteed to find the best model
Best subset regression
• For each $k \in \{0, 1, \dots, p\}$ find the subset of size k that gives the smallest RSS.
• The leaps and bounds procedure is feasible for p ≤ 40.
• How to choose k? Choose the model that minimizes prediction error (not a topic here).
• When p is large, searching through all subsets is not feasible. Can we seek a good path through the subsets instead?
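For small p, best subset regression can be illustrated by brute-force enumeration. Note that this sketch is plain exhaustive search, not the leaps and bounds procedure, and it is only practical for a handful of inputs; function names and data are hypothetical.

```python
import numpy as np
from itertools import combinations

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def best_subsets(X, y):
    """For each size k, return the subset of columns of X (intercept always included)
    with the smallest RSS, as a dict k -> (subset, rss)."""
    n, p = X.shape
    ones = np.ones((n, 1))
    best = {}
    for k in range(p + 1):
        for subset in combinations(range(p), k):
            cols = np.array(subset, dtype=int)
            value = rss(np.hstack([ones, X[:, cols]]), y)
            if k not in best or value < best[k][1]:
                best[k] = (subset, value)
    return best

rng = np.random.default_rng(6)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=n)   # only inputs 0 and 2 matter
best = best_subsets(X, y)
for k, (subset, value) in best.items():
    print(k, subset, round(value, 2))
```

The best RSS can only decrease as k grows, which is why k has to be chosen by an estimate of prediction error rather than by training RSS.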
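Forward stepwise selection
• Method:
  • Start with the intercept-only model.
  • Sequentially include the variable that most improves RSS(β), based on the F statistic $F = \frac{RSS(\hat\beta) - RSS(\tilde\beta)}{RSS(\tilde\beta)/(n-k-2)}$, where β̂ is the current model with k inputs and β̃ the candidate model with k+1 inputs
  • Stop when no new variable improves the fit significantly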
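A sketch of the forward stepwise loop using the F statistic above. The stopping threshold `f_threshold=4.0` is a hypothetical choice (roughly a 5% level for one added variable with moderate degrees of freedom), and the function names and data are illustrative.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_stepwise(X, y, f_threshold=4.0):
    """Greedy forward selection; returns the indices of the selected columns in order."""
    n, p = X.shape
    selected = []
    design = np.ones((n, 1))               # start with the intercept-only model
    rss_current = rss(design, y)
    while len(selected) < p:
        k = len(selected)
        candidates = []
        for j in range(p):
            if j in selected:
                continue
            rss_new = rss(np.hstack([design, X[:, [j]]]), y)
            f = (rss_current - rss_new) / (rss_new / (n - k - 2))
            candidates.append((f, j, rss_new))
        f_best, j_best, rss_best = max(candidates)
        if f_best < f_threshold:           # no remaining variable improves the fit enough
            break
        selected.append(j_best)
        design = np.hstack([design, X[:, [j_best]]])
        rss_current = rss_best
    return selected

rng = np.random.default_rng(7)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=n)
selected = forward_stepwise(X, y)
print(selected)
```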
Backward stepwise selection
• Method:
  • Start with the full model
  • Sequentially delete the predictor that produces the smallest value of the F statistic, i.e. increases RSS(β) least
  • Stop when each predictor in the model produces a significant value of the F statistic
• Hybrids between forward and backward stepwise selection exist
Subset selection
• Produces a model that is interpretable and has possibly lower prediction error
• Sets some coefficients to zero, which probably decreases Var(β̂)
• The optimal subset must be chosen to minimize prediction error (model selection: not a topic here)
Shrinkage methods
• Use additional penalties/constraints to reduce the coefficients
• Shrinkage methods are more continuous than stepwise selection and therefore don't suffer as much from variability
• Two examples:
  • Ridge regression: minimize least squares s.t. $\sum_{j=1}^{p} \beta_j^2 \le s$
  • The lasso: minimize least squares s.t. $\sum_{j=1}^{p} |\beta_j| \le s$
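Ridge regression
• Shrinks the regression coefficients by imposing a penalty on their size
• The complexity parameter λ controls the amount of shrinkage: $\hat\beta^{ridge} = \arg\min_\beta \left\{ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$
• Equivalently: $\hat\beta^{ridge} = \arg\min_\beta \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$
• There is a one-to-one correspondence between s and λ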
Properties of ridge regression
• Solution in matrix notation: $\hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y$
• Adding λ > 0 to the diagonal of $X^T X$ before inversion makes the problem nonsingular even if X is not of full rank.
• The size constraint prevents the coefficient estimates of highly correlated variables from showing high variance.
• The quadratic penalty makes the ridge solution a linear function of y.
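A minimal sketch of the closed-form ridge solution. It assumes the common convention of centring the inputs and the response so that the intercept is not penalized; the function name and the example data (which deliberately contain a duplicated column, so that $X^T X$ is singular) are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge fit; X has no intercept column. Returns (intercept, beta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # centring leaves the intercept unpenalized
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(8)
n = 60
x = rng.normal(size=(n, 2))
X_dup = np.column_stack([x, x[:, 0]])            # third column duplicates the first
y = 1.0 + 2.0 * x[:, 0] - x[:, 1] + 0.1 * rng.normal(size=n)

# X_dup^T X_dup is singular, but the ridge system is solvable for any lam > 0.
b0_small, beta_small = ridge(X_dup, y, lam=1.0)
b0_large, beta_large = ridge(X_dup, y, lam=100.0)
print(np.round(beta_small, 3), np.round(beta_large, 3))
```

In this example the two identical columns receive identical coefficients, and a larger λ gives a coefficient vector with smaller norm.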
Properties of ridge regression cont.
• Can also be motivated through Bayesian statistics by choosing an appropriate prior for β.
• Performs no automatic variable selection
• The ridge existence theorem states that there exists a λ > 0 such that $MSE(\hat\beta^{ridge}_\lambda) < MSE(\hat\beta^{OLS})$
• The optimal complexity parameter must be estimated
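Example
• Complexity parameter of the model: the effective degrees of freedom, $df(\lambda) = \mathrm{tr}\left[X (X^T X + \lambda I)^{-1} X^T\right] = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}$
• The parameters are continuously shrunk towards zero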
Singular value decomposition (SVD)
• The SVD of the matrix X has the form $X = U D V^T$, where U and V have orthonormal columns and D = diag(d1,…,dr)
• d1 ≥ … ≥ dr > 0 are the non-zero singular values of X
• r ≤ min(n,p) is the rank of X
• The columns vi of V (eigenvectors of $X^T X$) are called the principal component directions of X
Linear regression by SVD
• A general solution to y = Xβ can be written as $\hat\beta = \sum_{i=1}^{p} \omega_i \frac{u_i^T y}{d_i} v_i$
• The filter factors ωi determine the extent of shrinking, 0 ≤ ωi ≤ 1, or stretching, ωi > 1, along the singular directions ui
• For the OLS solution ωi = 1, i = 1,…,p, i.e. all the directions ui contribute equally
Ridge regression by SVD
• In ridge regression the filter factors are given by $\omega_i = \frac{d_i^2}{d_i^2 + \lambda}$
• Shrinks the OLS estimator in every direction, depending on λ and the corresponding di
• The directions with low variance (small singular values) are the directions shrunk the most by ridge
• Assumption: y varies most in the directions of high variance
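The filter-factor view can be checked numerically. The sketch below assumes a centred, full-column-rank X and a centred y (so no intercept is needed); `svd_filter_fit` is a hypothetical helper name.

```python
import numpy as np

def svd_filter_fit(X, y, filter_factors):
    """beta = sum_i omega_i * (u_i^T y / d_i) * v_i, with omega = filter_factors(d)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    omega = filter_factors(d)
    return Vt.T @ (omega * (U.T @ y) / d)

rng = np.random.default_rng(9)
n, p, lam = 60, 4, 5.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                                   # centre the inputs
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=n)
y -= y.mean()                                         # centre the response

beta_ols = svd_filter_fit(X, y, lambda d: np.ones_like(d))         # omega_i = 1
beta_ridge = svd_filter_fit(X, y, lambda d: d**2 / (d**2 + lam))   # ridge filter factors

# Effective degrees of freedom of the ridge fit: the sum of the filter factors.
d = np.linalg.svd(X, compute_uv=False)
df_ridge = np.sum(d**2 / (d**2 + lam))
print(np.round(beta_ols, 3), np.round(beta_ridge, 3), round(df_ridge, 3))
```

With ωi = 1 the function reproduces ordinary least squares, and with the ridge factors it matches the closed-form ridge solution for the same λ.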
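The lasso
• A shrinkage method like ridge, but with important differences
• The lasso estimate: $\hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le s$
• The L1 penalty makes the solution nonlinear in y
• Quadratic programming is needed to compute the solutions
• Sufficient shrinkage will cause some coefficients to be exactly zero, so it acts like a subset selection method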
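To illustrate how the L1 penalty produces exact zeros, here is a sketch that uses cyclic coordinate descent with soft-thresholding instead of the quadratic programming approach named on the slide. It solves the penalized form $\frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_j |\beta_j|$ (not the constrained form with s), assumes centred X and y, and uses a fixed number of sweeps rather than a convergence test. All names are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink a towards zero by lam, returning exactly zero when |a| <= lam."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Cyclic coordinate descent for 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X**2, axis=0)                 # x_j^T x_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of coordinate j added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
    return beta

rng = np.random.default_rng(10)
n, p = 60, 4
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=n)
y -= y.mean()

beta_0 = lasso_cd(X, y, lam=0.0)                  # no penalty: least squares
beta_mid = lasso_cd(X, y, lam=20.0)               # moderate penalty
lam_max = np.abs(X.T @ y).max()
beta_max = lasso_cd(X, y, lam=lam_max)            # penalty large enough to zero everything
print(np.round(beta_0, 3), np.round(beta_mid, 3), beta_max)
```

Every λ in the penalized form corresponds to some bound s in the constrained form, so the coefficient paths are the same even though the parametrization differs.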
Example
• [Figure: coefficients plotted against the tuning parameter]
• Note that the lasso profiles hit zero, while those for ridge do not.
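A unifying view
• We can view these linear regression techniques under a common framework: $\tilde\beta = \arg\min_\beta \left\{ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\}$
• λ includes bias, q indicates a prior distribution on β
• λ = 0: least squares
• λ > 0, q = 0: subset selection (counts the number of nonzero parameters)
• λ > 0, q = 1: the lasso
• λ > 0, q = 2: ridge regression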
Methods using derived input directions
• Goal: use linear combinations of the inputs as inputs in the regression
• Includes:
  • Principal components regression: regress on M < p principal components of X
  • Partial least squares: regress on M < p directions of X weighted by y
• The methods differ in the way the linear combinations of the input variables are constructed.
PCR
• Use the linear combinations zm = X vm as new features
• vj is the principal component direction (column of V) corresponding to the jth largest element of D, i.e. the directions of maximal sample variance
• For some M ≤ p form the derived input vectors [z1 … zM] = [Xv1 … XvM]
• Regressing y on z1,…,zM gives the solution $\hat\beta^{pcr}(M) = \sum_{m=1}^{M} \hat\theta_m v_m$, where $\hat\theta_m = \langle z_m, y \rangle / \langle z_m, z_m \rangle$
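PCR continued
• The m'th principal component direction vm solves: $\max_\alpha Var(X\alpha)$ subject to $\|\alpha\| = 1$ and $\alpha^T S v_\ell = 0$ for $\ell = 1, \dots, m-1$, where S is the sample covariance matrix of the xj
• The filter factors become $\omega_i = 1$ for $i \le M$ and $\omega_i = 0$ for $i > M$, i.e. PCR discards the p-M smallest eigenvalue components from the OLS solution.
• If M = p it gives the OLS solution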
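A sketch of principal components regression via the SVD, assuming centred X and y. Because the derived inputs zm are mutually orthogonal, each coefficient comes from a univariate regression; the function name and data are illustrative.

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression on the first M components.
    Returns the coefficient vector on the original (centred) inputs."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T[:, :M]                                  # first M principal component directions
    Z = X @ V                                        # derived inputs z_m = X v_m
    theta = (Z.T @ y) / np.sum(Z**2, axis=0)         # univariate regressions of y on each z_m
    return V @ theta

rng = np.random.default_rng(11)
n, p = 60, 4
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=n)
y -= y.mean()

beta_pcr_2 = pcr(X, y, M=2)      # keeps the two highest-variance directions
beta_pcr_p = pcr(X, y, M=p)      # M = p reproduces least squares
print(np.round(beta_pcr_2, 3), np.round(beta_pcr_p, 3))
```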
Comparison of PCR and ridge
• [Figure: shrinkage and truncation patterns as a function of the principal component index]
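PLS
• Idea: find the directions that have high variance and have high correlation with y
• In the construction of each zm the inputs are weighted by the strength of their univariate effect on y
• Pseudo-algorithm:
  • Set $\hat y^{(0)} = \bar y 1$ and $x_j^{(0)} = x_j$ (standardized inputs)
  • For m = 1,…,p:
    • Find the m'th PLS component: $z_m = \sum_{j=1}^{p} \hat\varphi_{mj} x_j^{(m-1)}$, where $\hat\varphi_{mj} = \langle x_j^{(m-1)}, y \rangle$
    • Regress y on zm: $\hat\theta_m = \langle z_m, y \rangle / \langle z_m, z_m \rangle$
    • Update ŷ: $\hat y^{(m)} = \hat y^{(m-1)} + \hat\theta_m z_m$
    • Orthogonalize each $x_j^{(m-1)}$ with respect to zm: $x_j^{(m)} = x_j^{(m-1)} - \frac{\langle z_m, x_j^{(m-1)} \rangle}{\langle z_m, z_m \rangle} z_m$
• PLS solution: the fitted values $\hat y^{(M)}$ after M ≤ p steps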
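The pseudo-algorithm translates almost line by line into NumPy. The sketch assumes the inputs have already been standardized (mean 0, variance 1) and returns fitted values only; the function name and the simulated data are hypothetical.

```python
import numpy as np

def pls_fit(X, y, M):
    """Partial least squares fitted values after M components.
    X is assumed to be standardized column-wise."""
    Xm = X.astype(float)                             # x_j^(0) = x_j (astype returns a copy)
    y_hat = np.full(len(y), y.mean())                # y_hat^(0) = mean of y
    for _ in range(M):
        phi = Xm.T @ y                               # univariate effect of each input on y
        z = Xm @ phi                                 # m'th PLS component
        zz = z @ z
        theta = (z @ y) / zz                         # regress y on z_m
        y_hat = y_hat + theta * z                    # update the fit
        Xm = Xm - np.outer(z, (z @ Xm) / zz)         # orthogonalize each x_j w.r.t. z_m
    return y_hat

rng = np.random.default_rng(12)
n, p = 60, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)             # standardize the inputs
y = 1.0 + X @ np.array([3.0, 0.5, -2.0, 1.0]) + 0.5 * rng.normal(size=n)

rss_by_M = [np.sum((y - pls_fit(X, y, M)) ** 2) for M in range(1, p + 1)]
y_hat_full = pls_fit(X, y, M=p)
print(np.round(rss_by_M, 3))
```

The training RSS decreases with M, and with M = p the fitted values coincide with those of ordinary least squares with an intercept.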
PLS cont.
• A nonlinear function of y, because y is used to find the linear components
• As with PCR, M = p gives the OLS estimate, while M < p directions produce a reduced regression.
• The m'th PLS direction is the α that maximizes the covariation between the input and output variables: $\max_\alpha Corr^2(y, X\alpha)\, Var(X\alpha)$ subject to $\|\alpha\| = 1$ and $\alpha^T S \hat\varphi_\ell = 0$ for $\ell = 1, \dots, m-1$, where S is the sample covariance matrix of the xj.
PLS cont.
• The filter factors for PLS become $\omega_i = 1 - \prod_{j=1}^{M} \left(1 - \frac{d_i^2}{\theta_j}\right)$, where θ1 ≥ … ≥ θM are the Ritz values (not defined here).
• Note that some ωi > 1, but it can be shown that PLS shrinks the OLS solution: $\|\hat\beta^{PLS}\| \le \|\hat\beta^{OLS}\|$
• It can also be shown that the sequence of PLS components for m = 1,2,…,p represents the conjugate gradient sequence for computing the OLS solution.
Comparison of the methods
• Consider the general solution: $\hat\beta = \sum_{i=1}^{p} \omega_i \frac{u_i^T y}{d_i} v_i$
• Ridge shrinks all directions, but shrinks the low-variance directions most
• PCR leaves the M high-variance directions alone and discards the rest
• PLS tends to shrink the low-variance directions, but can inflate some of the higher-variance directions
Comparison of the methods
• Consider an example with two correlated inputs x1 and x2, ρ = 0.5. Assume true regression coefficients β1 = 4 and β2 = 2.
• [Figure: coefficient profiles in the (β1, β2) plane for ridge, lasso, PCR, PLS and best subset as the tuning parameters are varied, with the least squares solution marked]