Objectives of this class

Objectives of this class • By the end of the class you should be able to: • Explain why OLS should not be used when the dependent variable is discrete • Understand and use logit and probit models • Use multinomial, ordered and count-data models in the correct situations • Implement tobit and interval regression models, explaining why OLS should not be used when the dependent variable is truncated • Handle duration data and estimate the Cox proportional hazards model

3. When the dependent variable is not continuous and unbounded • 3.1 Why not OLS? • 3.2 The basic idea underlying logit models • 3.3 Estimating logit models • 3.4 Multinomial models • 3.5 Ordinal dependent variables • 3.6 Count data models • 3.7 Tobit models and interval regression • 3.8 Duration models

3.1 Why not OLS? • A variable is “categorical” if it takes discrete values. • For example, a dummy variable is categorical because it takes two possible values (one, or zero). • In some situations, we may want to estimate a model in which the dependent variable is categorical. • For example, we may want to know why some companies choose large audit firms while other companies choose small audit firms • The dependent variable (big6) is categorical as it takes two discrete values, zero or one. • Why should we not use OLS to estimate the model when the dependent variable is categorical?

3.1 Why not OLS? • We believe that company size is an important determinant of the choice of auditor. • Suppose we try to graph the relationship between big6 and company size. • The observations lie only on two horizontal lines (where big6=0 and big6=1). • If larger companies are more likely to choose big6 auditors, the observations on the 1-line should lie further to the right than those on the 0-line. • use "J:\phd\Fees.dta", clear • gen lnta=ln(totalassets) • scatter big6 lnta, msize(tiny)

3.1 Why not OLS? • This graph is not very informative because the observations lie directly on top of each other, hiding the number of observations. • You can use the jitter() option to provide a more informative graph. • jitter() adds a small random number to each observation, thus showing observations that were previously hidden under other data points. The number in brackets is from 1 to 30 and controls the size of the random number • scatter big6 lnta, msize(tiny) jitter(30) • graph twoway (scatter big6 lnta, msize(tiny) jitter(30)) (lfit big6 lnta)

3.1 Why not OLS? • The graph shows one major problem with using linear regression for dichotomous dependent variables: • the predicted values of big6 can be < 0 or > +1, • yet the big6 probability should lie between 0 and 1, • given sufficiently small or large values of X, a model that uses a straight line to represent probabilities will inevitably produce values that are negative or greater than one.

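3.1 Why not OLS? • A second major problem is that linear regression produces errors that do not have a constant variance. • This violates the OLS assumption that the errors are homoscedastic. • For example, writing the predicted value of big6 as $\hat{p} = a_0 + a_1 X$, the residuals are $e = 1 - \hat{p}$ when big6 = 1 and $e = -\hat{p}$ when big6 = 0.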

3.1 Why not OLS? • The variance of the residuals is $\hat{p}(1 - \hat{p})$, where $\hat{p}$ is the predicted value of big6. • The variance of the residuals becomes larger as the predicted values approach 0.5. • Since the variance of the residuals is a function of the predicted values, the residuals do not have a constant variance. • Because of this heteroscedasticity, the standard errors of the coefficients are biased.
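
A short derivation of why the residual variance depends on the predicted value. This is a sketch that treats the linear prediction $p$ as the true probability that big6 = 1; the symbol $p$ is introduced here only for illustration.

```latex
\begin{align*}
e &= \begin{cases}
       1 - p & \text{with probability } p \quad (\text{big6} = 1) \\
       -p    & \text{with probability } 1 - p \quad (\text{big6} = 0)
     \end{cases} \\
E(e) &= p(1 - p) + (1 - p)(-p) = 0 \\
\operatorname{Var}(e) &= p(1 - p)^2 + (1 - p)p^2
      = p(1 - p)\bigl[(1 - p) + p\bigr] = p(1 - p)
\end{align*}
```

The product $p(1-p)$ equals 0.25 at $p = 0.5$ and falls to 0.09 at $p = 0.1$ or $p = 0.9$, so the variance is largest for predicted values near 0.5.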

Class exercise 3a • Using the regress command, estimate an OLS model where the dependent variable is big6 and the independent variable is lnta. • Using the predict command, obtain the predicted big6 values and the predicted residuals. • Draw a scatterplot of the residuals against the predicted values of big6. • Do you notice any pattern between the residuals and the fitted values? • Why does this pattern exist?

Class exercise 3a • reg big6 lnta • predict big6hat • predict res, resid • scatter res big6hat • There is a clear pattern between the residuals and the predicted values of big6 • The residuals lie on two parallel straight lines because big6 is either zero or one: the residual equals one minus the predicted value when big6 = 1, and minus the predicted value when big6 = 0

3.1 Why not OLS? • To summarize, we would have two statistical problems if we use OLS when the dependent variable is categorical: • The predicted values can be negative or greater than one • The standard errors are biased because the residuals are heteroscedastic. • Instead of OLS, we can use a logit model

3.2 The basic idea underlying logit models • The values calculated with linear regression are not subject to any restrictions, so any values between -∞ and +∞ may emerge. • We need to create a variable that: • has an infinite range, • reflects the likelihood of choosing a big6 auditor versus a non-big6 auditor. • This variable is known as the log odds ratio.

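The “odds ratio” (strictly, the odds of choosing a big6 auditor) is as follows: $\text{odds} = \dfrac{P(\text{big6} = 1)}{1 - P(\text{big6} = 1)}$ • The “log” odds ratio is obtained by taking natural logs of the odds ratio.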

3.2 The basic idea • Col. 1 shows the probability of a company choosing a big6 auditor • Note that the probabilities lie between 0 and +1. • Cols. 2 & 3 show the odds ratios • Note that the odds ratios lie between 0 and +∞. • Using the odds ratio solves the problem that the linear predicted values may exceed +1. • However, we still have the problem that the linear predicted values may be negative • We solve this problem by using the natural log of the odds ratio (these logs are also called “logits”) • Col. 4 shows the logits • Note that the logits lie between -∞ and +∞. • As the big6 probability approaches zero, the logits approach -∞. • As the big6 probability approaches one, the logits approach +∞. • Note also that the logits are symmetric around zero (e.g., when the big6 probability is 0.5, the logit = zero; when the big6 probability is 0.4, the logit = -0.41; when the big6 probability is 0.6, the logit = +0.41).
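
The relationship between probabilities, odds and logits can be checked numerically. The sketch below is in Python rather than Stata, purely for illustration; the function names `odds` and `logit` and the list of probabilities are assumptions chosen for this example and do not come from the course dataset.

```python
import math


def odds(p):
    """Odds of the event: probability of choosing big6 over probability of not."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of the odds; ranges over the whole real line."""
    return math.log(odds(p))


# Illustrative probabilities (assumed values, not taken from the data)
probabilities = [0.01, 0.1, 0.4, 0.5, 0.6, 0.9, 0.99]
rows = [(p, odds(p), logit(p)) for p in probabilities]

print(f"{'prob':>6} {'odds':>8} {'logit':>8}")
for p, o, l in rows:
    print(f"{p:6.2f} {o:8.3f} {l:8.3f}")
```

Probabilities placed symmetrically around 0.5 (such as 0.4 and 0.6) give logits of equal size and opposite sign, which is the symmetry described above.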

3.2 The basic idea • The logit can take any value between -∞ and +∞, and it is symmetric. • The logit is therefore suitable for use as a dependent variable. • Writing the logit (i.e., the log of the odds ratio) as L, it is easy to transform the logits back into probabilities: $P = \dfrac{e^{L}}{1 + e^{L}}$

3.2 The basic idea • The logit model uses a linear combination of independent variables to predict the values of the unobserved logit, L. • L = a0 + a1 X1 + a2 X2 + e • Each value of the continuous L variable maps to exactly one value of the dummy variable • big6 = 1 if 0 < L < +∞ • big6 = 0 if -∞ < L ≤ 0

3.2 The basic idea • The interpretation of the coefficients is the same as in OLS • L = a0+ a1 X1 + a2 X2 + e • For example when X1 increases by one unit, the predicted values of the logit (L) increase by a1 • The coefficients for the logit model are estimated by maximizing the likelihood function.

The likelihood function is as follows: $\mathcal{L} = \prod_{i} P_i^{\,y_i} (1 - P_i)^{1 - y_i}$, where $y_i$ is the value of big6 for observation $i$ and $P_i = \dfrac{e^{a_0 + a_1 X_{1i} + a_2 X_{2i}}}{1 + e^{a_0 + a_1 X_{1i} + a_2 X_{2i}}}$ • The coefficients (a0, a1, a2) are estimated such that they maximize the value of the likelihood function. • Unlike OLS, there is no closed-form formula for the estimated coefficients. • Instead, the likelihood function is maximized using iterative algorithms (this is known as “maximum likelihood” estimation).

3.3 Estimating logit models and interpreting the results • There are two commands in STATA for estimating the logit model where the dependent variable is binary: • logit reports the values of the estimated coefficients • logistic reports the odds ratios • Typically, accounting researchers report the coefficient estimates rather than the odds ratios, so we will be using the logit command.

Example • Suppose we wish to test the effect of the company’s age and size on its choice between a big6 or non-big6 auditor • gen fye=date(yearend, "MDY") • format fye %d • gen year=year(fye) • gen age= year-incorporationyear • sum age, detail • replace age=0 if age<0 • logit big6 lnta age • In many respects, the output from the logit model looks similar to what we obtain from OLS regression.

3.3 Estimating logit models

The coefficient on lnta tells us that the log odds of hiring a big6 auditor increase by 0.58 if lnta increases by one unit. • Usually, we are mainly interested in the signs and statistical significance of the coefficients • We find that larger companies and younger companies are significantly more likely to hire big6 auditors.

When using maximum likelihood estimation, there is typically no closed-form mathematical solution to obtain the coefficient estimates. • Instead, an iterative procedure must be used that tries a sequence of different coefficient values. • As the algorithm gets closer to the solution, the value of the likelihood function increases by smaller and smaller increments.
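
The iterative procedure can be sketched with a small Newton-Raphson routine. This is a minimal illustration in Python, not necessarily the algorithm Stata uses internally. The eight data points, the function names (`inv_logit`, `log_likelihood`, `gradient`, `newton_logit`) and the convergence tolerance are all assumptions made up for this example.

```python
import math


def inv_logit(z):
    """Transform a logit value into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Log of the logit likelihood for an intercept b0 and slope b1."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = inv_logit(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def gradient(b0, b1, xs, ys):
    """First derivatives of the log likelihood with respect to b0 and b1."""
    g0 = sum(y - inv_logit(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - inv_logit(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    return g0, g1


def newton_logit(xs, ys, max_iter=25, tol=1e-10):
    """Maximize the log likelihood by Newton-Raphson, starting from zero."""
    b0, b1 = 0.0, 0.0
    path = [log_likelihood(b0, b1, xs, ys)]
    for _ in range(max_iter):
        g0, g1 = gradient(b0, b1, xs, ys)
        # Information matrix (negative Hessian) for the two parameters
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = inv_logit(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information matrix times the gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
        path.append(log_likelihood(b0, b1, xs, ys))
        if abs(path[-1] - path[-2]) < tol:
            break
    return b0, b1, path


# Made-up data: x plays the role of lnta, y the role of big6
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat, ll_path = newton_logit(xs, ys)
for step, ll in enumerate(ll_path):
    print(f"Iteration {step}: log likelihood = {ll:.6f}")
print(f"intercept = {b0_hat:.4f}, slope = {b1_hat:.4f}")
```

The printed log likelihood values play the same role as the iteration log that Stata displays above the logit output: the first value corresponds to the starting coefficients and the last value to the final estimates.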

The first and last values of the likelihood function are of most interest. The larger the difference between these two values, the greater the power of the independent variables in explaining the dependent variable. • The pseudo-R2 is similar to the R-squared in the OLS model in that it indicates how high the model’s explanatory power is. • pseudo-R2 = (ln(L0) - ln(LN)) / ln(L0) • = (-175224+146215) / -175224 ≈ 0.166

Besides the pseudo-R2, the likelihood-ratio statistic is another indicator of the model’s explanatory power • Chi2 = -2(ln(L0) - ln(LN)) = -2*(-175224+146215) = 58018 • As with the F value in linear regression, you can use the likelihood-ratio statistic to test the hypothesis that the independent variables have no explanatory power (i.e., all coefficients except the intercept are zero). • The p-value for this test is reported in the line “Prob > chi2”; it is the probability of obtaining a chi2 statistic at least this large if the hypothesis were true. In our example we can reject this hypothesis because the p-value is virtually zero.
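
The arithmetic behind the pseudo-R2 and the likelihood-ratio statistic can be reproduced directly from the two log likelihood values quoted above. The sketch is in Python for illustration only; the variable names `ll_null` and `ll_model` are assumptions standing in for ln(L0) and ln(LN).

```python
# Log likelihood values quoted in the text
ll_null = -175224.0   # ln(L0): model with an intercept only
ll_model = -146215.0  # ln(LN): model with lnta and age

# Pseudo-R2: proportional improvement in the log likelihood
pseudo_r2 = (ll_null - ll_model) / ll_null

# Likelihood-ratio chi-squared statistic
lr_chi2 = -2.0 * (ll_null - ll_model)

print(f"pseudo-R2 = {pseudo_r2:.4f}")
print(f"LR chi2   = {lr_chi2:.0f}")
```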

3.3 Estimating logit models • Just as with OLS, we can use the robust option to correct for any heteroscedasticity and we can use the cluster() option to control for correlated errors • logit big6 lnta age • logit big6 lnta age, robust • logit big6 lnta age, robust cluster(companyid)

3.3 Estimating logit models • The predict command generates a new variable that contains the predicted probability of choosing a big6 auditor for every observation in the sample • logit big6 lnta age, robust cluster(companyid) • drop big6hat • predict big6hat • sum big6hat, detail • Note that these predicted probabilities lie within the range 0, +1

3.3 Estimating logit models • Note that the predicted probabilities are not the same as the predicted logit values • gen big6hat1 = _b[_cons]+_b[lnta]*lnta + _b[age]*age • sum big6hat1, detail • The predicted logit values lie in the range - to +. • We can easily obtain the predicted probabilities using the logit values • replace big6hat1=exp(big6hat1)/(1+exp(big6hat1)) • sum big6hat big6hat1

3.3 Estimating logit models • Alternatively, we can predict the logit values using the , xb option • drop big6hat big6hat1 • logit big6 lnta age, robust cluster(companyid) • predict big6hat • predict big6hat1, xb • replace big6hat1=exp(big6hat1)/(1+exp(big6hat1)) • sum big6hat big6hat1

3.3 Estimating logit models • Just as with OLS models, we can report the economic significance of the coefficients. • For example, we can calculate the change in the predicted probability of hiring a big6 auditor as the company’s age increases from 10 to 20 years old: • logit big6 lnta age, robust cluster(companyid) • gen big10 = exp(_b[_cons]+_b[lnta]*lnta + _b[age]*10) / (1+(exp(_b[_cons]+_b[lnta]*lnta + _b[age]*10))) • gen big20 = exp(_b[_cons]+_b[lnta]*lnta + _b[age]*20) / (1+(exp(_b[_cons]+_b[lnta]*lnta + _b[age]*20))) • sum big10 big20

Class exercise 3b • Calculate the change in the predicted probability of hiring a big6 auditor as the company’s age increases by one standard deviation around the mean • Hint #1: remember that you can obtain the mean and standard deviations using the sum command and the return codes r(). To see the list of return codes, type return list • Hint #2: if you get error messages it is likely because you have the wrong number of brackets e.g., ((xx)

Class exercise 3b • logit big6 lnta age, robust cluster(companyid) • sum age, detail • gen big_young = exp(_b[_cons]+_b[lnta]*lnta + _b[age]*(r(mean)-0.5*r(sd))) / (1+(exp(_b[_cons]+_b[lnta]*lnta + _b[age]*(r(mean)-0.5*r(sd))))) • gen big_old = exp(_b[_cons]+_b[lnta]*lnta + _b[age]*(r(mean)+0.5*r(sd))) / (1+(exp(_b[_cons]+_b[lnta]*lnta + _b[age]*(r(mean)+0.5*r(sd))))) • sum big_young big_old

3.3 An alternative to logit • In the logit model, we predict P(Y = 1) using a linear combination of X variables • To ensure the predicted P(Y = 1) values lie between 0 and +1, we used a logit transformation • An alternative is the probit transformation, which is used in probit models • The main difference is that the logit uses a logistic distribution whereas the probit uses a normal distribution

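3.3 An alternative to logit • In the logit model, the likelihood function is $\mathcal{L} = \prod_{i} P_i^{\,y_i} (1 - P_i)^{1 - y_i}$, where $P_i = \dfrac{e^{a_0 + a_1 X_{1i} + a_2 X_{2i}}}{1 + e^{a_0 + a_1 X_{1i} + a_2 X_{2i}}}$ • In the probit model, the likelihood function has the same form with $P_i = \Phi(a_0 + a_1 X_{1i} + a_2 X_{2i})$, where $\Phi$ is the cumulative standard normal distribution function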

3.3 An alternative to logit • The coefficients of the probit model are also estimated using maximum likelihood • Usually, the predicted probabilities of probit models are very close to those of logit models • The coefficients tend to be smaller in absolute value in probit models (logit coefficients are typically about 1.6 to 1.8 times the probit coefficients) but the levels of statistical significance are often similar • capture drop big6hat big6hat1 • logit big6 lnta age, robust cluster(companyid) • predict big6hat • probit big6 lnta age, robust cluster(companyid) • predict big6hat1 • pwcorr big6hat big6hat1
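
Why logit and probit give similar predicted probabilities despite different coefficient magnitudes can be seen by comparing the two distribution functions directly. The sketch below is in Python for illustration; the scaling factor of 1.7 is an assumed rule of thumb, not a value estimated from the course data, and the function names are hypothetical.

```python
import math


def logistic_cdf(z):
    """Cumulative logistic distribution, used by the logit model."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    """Cumulative standard normal distribution, used by the probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


SCALE = 1.7  # assumed rule-of-thumb ratio of logit to probit coefficients

# Evaluate both functions on a grid of index values from -4 to +4
grid = [i / 10.0 for i in range(-40, 41)]

# Gap when the same index value is fed to both functions
unscaled_gap = max(abs(logistic_cdf(z) - normal_cdf(z)) for z in grid)

# Gap when the logit index is SCALE times the probit index
scaled_gap = max(abs(logistic_cdf(SCALE * z) - normal_cdf(z)) for z in grid)

print(f"largest gap, same index:     {unscaled_gap:.4f}")
print(f"largest gap, rescaled index: {scaled_gap:.4f}")
```

Once the index is rescaled, the two curves almost coincide, which is why the two models tend to produce highly correlated predicted probabilities even though their coefficients are not directly comparable.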

3.4 Multinomial models • Multinomial models are used when: • the dependent variable takes on three or more categories and • the categories are not ranked • For example, the dependent variable might be your method of transport to university • bicycle, car, bus, train, walk (here there are five categories) • there is no particular “ranking” from best to worst (bicycle and walking may be cheaper and healthier but car may be quicker; it may not be obvious whether the car or train is quicker and cheaper so these choices cannot be ranked)

3.4 Multinomial models • In our dataset the companytype variable has six different categories • Suppose we are interested in three categories • private, public, and publicly traded • companytype = 1, 6 if private company, • companytype = 4 if public but not traded on a market, • companytype = 2, 3, 5 if company is publicly traded on a market. • gen cotype1=0 if companytype==1 | companytype==6 • replace cotype1=1 if companytype==4 • replace cotype1=2 if companytype==2 | companytype==3 | companytype==5 • Our dependent variable (cotype1) now has three possible values • In a multinomial model, we predict the probability of each of these three outcomes

3.4 Multinomial models • Alternatively, you could create a binary variable for each of the three categories • gen private=0 • replace private=1 if cotype1==0 • gen public_nontraded=0 • replace public_nontraded=1 if cotype1==1 • gen public_traded=0 • replace public_traded=1 if cotype1==2 • And then estimate logit models using each binary variable as the dependent variable • logit private lnta, robust cluster(companyid) • predict private_hat • logit public_nontraded lnta, robust cluster(companyid) • predict public_nontraded_hat • logit public_traded lnta, robust cluster(companyid) • predict public_traded_hat

3.4 Multinomial models • A problem with this approach is that the predicted probabilities from the three logit models do not sum to one • They should sum to one because there are only three categories (private, public non-traded, public traded) • gen sum_prob= private_hat+ public_nontraded_hat+ public_traded_hat • sum sum_prob, detail • This problem arises because the three logit models are estimated in an unconnected way • Instead we need to estimate the models jointly such that the predicted probabilities sum to one

3.4 Multinomial models • Recall that when the dependent variable has two possible outcomes (e.g., big6 = 1, 0), there is one equation estimated • The big6 = 0 observations are used as a benchmark for evaluating why companies choose big6 auditors

3.4 Multinomial models • Similarly, when the dependent variable has three possible outcomes (cotype1 = 0, 1, 2), there are two equations estimated. • One of the three outcomes is used as a benchmark for evaluating what determines the other two outcomes. • More generally, if the dependent variable has N possible outcomes (e.g., 0, 1, 2, …, N-1), there are N-1 equations estimated. • It does not matter which outcome we choose to be the benchmark: the predicted probabilities are the same whichever one is chosen. • By default, STATA chooses the most frequent outcome as the benchmark, but you can override this if you wish.

3.4 Multinomial models (mlogit) • The STATA command for the multinomial logit model is mlogit • In early versions of STATA (e.g., STATA 8) there was no option to estimate a multinomial probit model. A multinomial probit model is now available in STATA 10 (mprobit). • The maximum likelihood algorithms for the multinomial probit model are complicated. As a result, the multinomial probit can be time-consuming to estimate especially when the dependent variable has several categories or the sample is large. • mprobit cotype1 lnta, robust cluster(companyid) • Because mprobit is so time-consuming, I am going to stick with mlogit for the sake of demonstration. • mlogit cotype1 lnta, robust cluster(companyid)

3.4 Multinomial models (mlogit) • There are now two sets of coefficient estimates • The first set contains the coefficients of the equation for public non-traded companies (cotype1=1) • The second set contains the coefficients of the equation for public traded companies (cotype1=2) • The coefficients of the equation for private companies (cotype1=0) are set at zero, because STATA chose this to be the benchmark group.
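
How the two estimated equations and the zero coefficients of the benchmark group combine into three probabilities can be sketched as follows. The example is in Python for illustration only; the coefficient values, the value of lnta and the function name `mlogit_probs` are hypothetical and are not taken from the mlogit output.

```python
import math


def mlogit_probs(x, coefs):
    """Multinomial logit probabilities for one observation.

    coefs maps each outcome to (intercept, slope); the benchmark
    outcome carries (0.0, 0.0).
    """
    scores = {k: math.exp(a + b * x) for k, (a, b) in coefs.items()}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}


# Hypothetical coefficients: outcome 0 (private) is the benchmark
coefs = {
    0: (0.0, 0.0),    # private: coefficients set at zero
    1: (-6.0, 0.4),   # public non-traded (assumed values)
    2: (-9.0, 0.7),   # public traded (assumed values)
}

probs = mlogit_probs(10.0, coefs)  # lnta = 10 is an arbitrary example
for outcome, p in probs.items():
    print(f"P(cotype1 = {outcome}) = {p:.4f}")
print(f"sum = {sum(probs.values()):.4f}")
```

Because every probability shares the same denominator, the three probabilities sum to one by construction, which separately estimated logit models do not guarantee.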

3.4 Multinomial models (mlogit) • The coefficients need to be interpreted carefully because private companies comprise the benchmark group. • The results show that larger companies are significantly more likely to be in the public non-traded category than in the private category (i.e., 1 vs. 0). • Also, larger companies are significantly more likely to be in the public traded category than in the private category (i.e., 2 vs. 0). • Suppose we wish to test whether larger companies are significantly more likely to be in category 2 (public traded) versus category 1 (public non-traded)

3.4 Multinomial models (test, baseoutcome()) • After running the mlogit command, we can test whether the coefficients in the two equations are equal • test [Equation no. = Equation no.]: Variable name • mlogit cotype1 lnta, robust cluster(companyid) • test [1=2]: lnta • test [1=2]: _cons • The results indicate that larger companies are significantly more likely to be in category 2 (public traded) than in category 1 (public non-traded). • Having performed this test it is now valid to conclude that larger companies are significantly more likely to be in the public traded category (i.e., we have compared 2 vs. 1 and 2 vs. 0) • We can easily change the benchmark comparison group using the , baseoutcome() option • mlogit cotype1 lnta, baseoutcome(1) robust cluster(companyid)

Class exercise 3c • Estimate the multinomial model using outcome 2 (public traded) as the base outcome. • Test whether the lnta coefficients are the same for private versus public non-traded companies. • Why are the signs of the lnta coefficients negative whereas they were positive when outcome 0 is the base category? • Does the negative lnta coefficient for outcome 1 imply that larger companies are less likely to be public non-traded?

Class exercise 3c • mlogit cotype1 lnta, baseoutcome(2) robust cluster(companyid) • test [0=1]: lnta • The lnta coefficients are negative because the comparison group (i.e., outcome 2) contains the largest companies. • The negative lnta coefficient for outcome 1 implies that public non-traded companies are smaller than public traded. It does not imply that public non-traded companies are smaller than private (in fact the opposite is true).
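
The sign change can be illustrated by re-expressing a set of coefficients relative to a different base outcome. The sketch is in Python for illustration; the coefficient values and the helper names `mlogit_probs` and `rebase` are hypothetical, chosen only so that larger companies are most likely to be public traded.

```python
import math


def mlogit_probs(x, coefs):
    """Multinomial logit probabilities; coefs maps outcome -> (intercept, slope)."""
    scores = {k: math.exp(a + b * x) for k, (a, b) in coefs.items()}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}


def rebase(coefs, new_base):
    """Re-express coefficients relative to a different base outcome."""
    a_base, b_base = coefs[new_base]
    return {k: (a - a_base, b - b_base) for k, (a, b) in coefs.items()}


# Hypothetical coefficients with outcome 0 (private) as the base
coefs_base0 = {0: (0.0, 0.0), 1: (-6.0, 0.4), 2: (-9.0, 0.7)}

# The same model expressed with outcome 2 (public traded) as the base
coefs_base2 = rebase(coefs_base0, 2)

for outcome, (a, b) in coefs_base2.items():
    print(f"outcome {outcome} vs 2: intercept = {a:.2f}, lnta slope = {b:.2f}")

# The predicted probabilities do not depend on the choice of base outcome
p0 = mlogit_probs(10.0, coefs_base0)
p2 = mlogit_probs(10.0, coefs_base2)
print(all(abs(p0[k] - p2[k]) < 1e-12 for k in p0))
```

With outcome 2 as the base, both lnta slopes are negative, yet the slope for outcome 1 is still larger than the slope for outcome 0. The comparison of 1 versus 0 is therefore unchanged by the choice of base.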

3.5 Ordinal dependent variables • Multinomial models are used when the values of the dependent variable do not have an ordinal ranking • For example, it does not make economic sense to rank public traded companies higher or lower than private companies. • The values of cotype1 are simply used to identify different types of company. • Therefore, we use the multinomial logit model when the dependent variable is cotype1.

3.5 Ordinal dependent variables • In other cases, it may make sense for the dependent variable to have an ordinal ranking • For example, a professor marks an exam taken by five students and ranks the students in order of their marks: • 1 = top, 2 = second place, …, 5 = bottom of the class • Ordinal dependent variables are common when researchers are using survey data about people’s perceptions • How concerned are you about crime in Guangzhou? • 1 = very concerned, 2 = quite concerned, 3 = not concerned. • Credit rating agencies assess the credit worthiness of companies • Moody’s rating scales: Aaa, Aa1, Aa2, Aa3, A1, A2, A3, Baa1, Baa2, Baa3, Ba1, Ba2, Ba3, B1, B2, B3, Caa1, Caa2, Caa3, Ca, C • Blume et al. (1998) use an ordered probit model to examine whether rating agencies are using more stringent standards in assigning ratings.
