Logistic Regression

Logistic Regression
Prof. Navneet Goyal, Department of Computer Science, BITS Pilani, Pilani Campus

Logistic Regression
(Figure omitted; source of figure unknown.)

Logistic Regression
• Linear regression models the relationship between a continuous response variable and a set of predictor variables.
• What if the response variable is categorical rather than continuous? Linear regression is not suitable in such cases.
• Logistic regression is a solution, and it is similar to linear regression in many ways.
• Logistic regression models the relationship between a categorical response variable and a set of predictor variables.
• The response variables considered here are binary (dichotomous).

Logistic Regression
• Predictor variables (categorical or continuous) are used to predict the probabilities of the outcomes of a binary response variable.
• Logistic regression assigns probabilities to the values of the response variable.
• Consider the multiple linear regression model
  yi = β0 + β1xi1 + β2xi2 + … + βkxik + εi
  where the response variable y is dichotomous, taking on only one of two values:
  y = 1 if a success
  y = 0 if a failure

Logistic Regression
The response variable, y, is really just a Bernoulli trial with E(y) = π, where π is the probability of a success on any given trial. π can only take on values between 0 and 1. Thus the multiple regression model
  π = E(y|x) = β0 + β1xi1 + β2xi2 + … + βkxik   (since the error term has mean zero)
is not appropriate for a dichotomous response variable: this model assumes π can take on any value, but in fact it can only take on values between 0 and 1.

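Logistic Regression
When the response variable is dichotomous, a more appropriate model is the logistic regression model:
  π = E(y|x) = exp(β0 + β1xi1 + β2xi2 + … + βkxik) / (1 + exp(β0 + β1xi1 + β2xi2 + … + βkxik))
The ratio π / (1 − π) is called the odds. Under this model the log of the odds is linear in the predictors:
  ln(π / (1 − π)) = β0 + β1xi1 + β2xi2 + … + βkxik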
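The relationship between the linear predictor, the success probability π, and the odds can be checked numerically. The sketch below assumes the standard logistic form π = exp(η) / (1 + exp(η)) with η = β0 + β1x; the function names and the coefficient values `b0` and `b1` are hypothetical, chosen only for illustration.

```python
import math

def logistic(eta):
    """Map a linear predictor eta to a probability strictly between 0 and 1."""
    # Two algebraically equal branches, chosen so exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds, the inverse of the logistic function."""
    return math.log(odds(p))

# Hypothetical coefficients, for illustration only.
b0, b1 = -4.0, 0.5

for x in (0.0, 8.0, 16.0):
    eta = b0 + b1 * x          # linear predictor, can be any real number
    p = logistic(eta)          # always between 0 and 1
    print(f"x={x:5.1f}  eta={eta:+.2f}  pi={p:.4f}  "
          f"odds={odds(p):.4f}  log-odds={logit(p):+.4f}")
```

Whatever value the linear predictor takes, the resulting π stays between 0 and 1, and the printed log-odds reproduce the linear predictor, which is the sense in which the model is linear.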

Logistic Regression
Logistic regression can be used only with two types of target variables:
1. A categorical target variable that has exactly two categories (i.e., a binary or dichotomous variable).
2. A continuous target variable that has values in the range 0.0 to 1.0 representing probability values or proportions.

Example of Logistic Regression
Consider a study whose goal is to model the response to a drug as a function of the dose administered. The target (dependent) variable, Response, has the value 1 if the patient is successfully treated by the drug and 0 if the treatment is not successful. Thus the general form of the model is:
  Response = f(Dose)
In the input data, Response is 1 if the drug is effective and 0 if it is not. The value of Response predicted by the model is the probability of an effective outcome, P(Response=1|Dose). As with all probability values, it lies in the range 0.0 to 1.0.
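A minimal sketch of how such a model could be fitted, assuming a single predictor and maximum-likelihood estimation by Newton's method. The dose and response values are made up for illustration and do not come from any study; `sigmoid`, `score`, and `fit_logistic` are hypothetical helper names, not part of any library.

```python
import math

def sigmoid(eta):
    """Logistic function, written so exp() never overflows."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def score(b0, b1, xs, ys):
    """Gradient of the Bernoulli log-likelihood with respect to (b0, b1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = y - sigmoid(b0 + b1 * x)   # observed minus predicted probability
        g0 += r
        g1 += r * x
    return g0, g1

def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Estimate (b0, b1) by Newton's method, starting from zero."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1 = score(b0, b1, xs, ys)
        # Entries of the 2x2 information matrix (negative Hessian).
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * d = g in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up data (NOT from the source): responses overlap around the middle
# doses so that a finite maximum-likelihood estimate exists.
doses = list(range(1, 17))
responses = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(doses, responses)
for d in (2, 8, 14):
    print(f"dose={d:2d}  P(Response=1|Dose) = {sigmoid(b0 + b1 * d):.3f}")
```

The fitted values are probabilities, so they always fall between 0.0 and 1.0. In practice a statistics package would normally be used instead of hand-written Newton steps; the loop is spelled out here only to show what the fit computes.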

Why not simply use linear regression?
• There are no limits on the values predicted by a linear regression, so the predicted response might be less than 0 or greater than 1, which is clearly nonsensical as a response probability.
• The response usually is not a linear function of the dosage. If a minute amount of the drug is administered, no patients will respond. Doubling the dose to a larger but still minute amount will not yield any positive response. But as the dosage increases, a threshold will be reached where the drug begins to become effective. Incremental increases in the dosage above the threshold usually will elicit an increasingly positive effect. Eventually a saturation level is reached, and beyond that point increasing the dosage does not increase the response.

Dose Response Curve
Notice that all of the Response values are 0 or 1. The Dose varies from 0 to 25. Below a dose of 9, all of the Response values are 0. Above a dose of 10, all of the Response values are 1.
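To see concretely why a straight line is a poor fit for data of this shape, the sketch below fits ordinary least squares to a step-like 0/1 data set. The data are an assumption modeled on the description above (integer doses 0 to 25, with the switch from 0 to 1 placed at dose 10); they are not the data behind the original figure, and `fit_line` is a hypothetical helper name.

```python
# Assumed illustrative data: integer doses 0..25, Response 0 below dose 10
# and 1 from dose 10 upward. NOT the data behind the original figure.
doses = list(range(26))
responses = [0 if d < 10 else 1 for d in doses]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(doses, responses)

# Predictions at the two ends of the dose range.
pred_low = intercept + slope * 0
pred_high = intercept + slope * 25
print(f"fitted line: Response = {intercept:.4f} + {slope:.4f} * Dose")
print(f"prediction at dose  0: {pred_low:.4f}")
print(f"prediction at dose 25: {pred_high:.4f}")
```

On this data the fitted line gives a negative value at the lowest dose and a value above 1 at the highest dose, neither of which can be read as a probability.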

Why not simply use linear regression?
The following table shows the relationship, for 64 infants, between [text missing in source]. Also shown in the table are [text missing in source]. (Table omitted.)
