TR 555 Statistics “Refresher” Lecture 3: Models

References
• Penn State University, Dept. of Statistics
  • Statistical Education Resource Kit
  • A collection of resources used by faculty in Penn State's Department of Statistics in teaching introductory statistics courses.
  • Page maintained by Laura J. Simon, Sept. 2003
• Tom Maze, stat course prepared for KDOT, 2003
• Statistical and Econometric Methods for Transportation Data Analysis by Washington, Karlaftis and Mannering, Chapman and Hall, 2003
• Online Documentation: Scientific Approaches to Transportation Research - NCHRP 20-45, http://gulliver.trb.org/publications/nchrp/cd-22/start.htm accessed September 23, 2003

Outline
• ANOVA
• Linear Regression Analysis
• Poisson Regression
• Probit and Logit Models

One-Way Analysis of Variance … to compare 2 or more population means

Does learning method affect students' exam scores?
• Consider 3 methods:
  • standard
  • osmosis
  • shock therapy
• Convince 15 students to take part. Assign 5 students randomly to each method.
• Wait eight weeks. Then, test students to get exam scores.

Suppose … Study #1 Is there a reasonable conclusion?

“Analysis of Variance” The variation between the group means and the grand mean is larger than the variation within each of the groups.

“Analysis of Variance” The variation between the group means and the grand mean is smaller than the variation within each of the groups.

Analysis of Variance
• A division of the overall variability in data values in order to compare means.
• Overall (or “total”) variability is divided into two components:
  • the variability “between” groups, and
  • the variability “within” groups
• Summarized in an “ANOVA” table.

Assumptions of ANOVA
• Distributions are normal (see the normality tests from last lecture: plots, Chi2, KS …)
• Variances are approximately equal …
  • For more than 2 factor levels, use Bartlett’s or Hartley’s test
• If assumptions are “significantly” violated, use the Kruskal-Wallis test in lieu of ANOVA

General ANOVA Table
One-way Analysis of Variance
Source   DF    SS            MS    F         P
Factor   t-1   SS(Between)   MSB   MSB/MSE   P-Value
Error    N-t   SS(Error)     MSE
Total    N-1   SS(Total)
• “Source” means “find the components of variation in this column”
• “DF” means “degrees of freedom”
• “SS” means “sums of squares”
• “MS” means “mean squared”
• “F” means “F test statistic”

General ANOVA Table
One-way Analysis of Variance
Source   DF    SS            MS    F         P
Factor   t-1   SS(Between)   MSB   MSB/MSE
Error    N-t   SS(Error)     MSE
Total    N-1   SS(Total)
• “Factor” means “Variability between groups” or “Variability due to the factor of interest”
• “Error” means “Variability within groups” or “unexplained random variation”
• “Total” means “Total variation from the grand mean”

General ANOVA Table
N = number of total data values. t = number of groups (or “factor levels”)
One-way Analysis of Variance
Source   DF    SS            MS    F         P
Factor   t-1   SS(Between)   MSB   MSB/MSE
Error    N-t   SS(Error)     MSE
Total    N-1   SS(Total)
• P is from the F-distribution with t-1 numerator and N-t denominator d.f.
• MSB = SS(Between)/(t-1)
• MSE = SS(Error)/(N-t)
• N-1 = (t-1) + (N-t)
• SS(Total) = SS(Between) + SS(Error)

ANOVA Table for Study #1
One-way Analysis of Variance
Source   DF   SS       MS       F       P
Factor    2   2510.5   1255.3   93.44   0.000
Error    12    161.2     13.4
Total    14   2671.7
• 1255.3 ≈ 2510.5/2
• 13.4 ≈ 161.2/12
• 14 = 2 + 12
• 93.44 ≈ 1255.3/13.4 (computed from the unrounded mean squares)
• 2671.7 = 2510.5 + 161.2

Recall Study #3

ANOVA Table for Study #3
One-way Analysis of Variance
Source   DF   SS       MS     F      P
Factor    2     80.1   40.1   0.46   0.643
Error    12   1050.8   87.6
Total    14   1130.9
The P-value is large (0.643), so we cannot reject the null hypothesis. There is insufficient evidence to conclude that the average exam scores differ for the three learning methods.

Does the distance it takes to stop a car at 60 mph depend on tire brand?
Brand1   Brand2   Brand3   Brand4   Brand5
194      189      185      183      195
184      204      183      193      197
189      190      186      184      194
189      190      183      186      202
188      189      179      194      200
186      207      191      199      211
195      203      188      196      203
186      193      196      188      206
183      181      189      193      202
188      206      194      196      195

Comparison of Five Tire Brands: Stopping Distance at 60 mph

Sample Descriptive Statistics
Brand   N    MEAN     SD
1       10   188.20   3.88
2       10   195.20   9.02
3       10   187.40   5.27
4       10   191.20   5.55
5       10   200.50   5.44

Hypotheses
• The null hypothesis is that the group population means are all the same. That is:
  • H0: μ1 = μ2 = μ3 = μ4 = μ5
• The alternative hypothesis is that at least one group population mean differs from the others. That is:
  • HA: at least one μi differs from the others

Analysis of Variance
Analysis of Variance for comparing all 5 brands
Source   DF   SS       MS      F      P
Brand     4   1174.8   293.7   7.95   0.000
Error    45   1661.7    36.9
Total    49   2836.5
The P-value is small (0.000, to three decimal places) so reject the null hypothesis. There is sufficient evidence to conclude that at least one brand is different from the others.
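The table entries for the tire example can be reproduced by hand from the raw stopping distances. The sketch below is illustrative only: the function name `one_way_anova` and the dictionary layout are assumptions, not part of the lecture, and it stops at the F statistic because the P-value needs an F-distribution table or a statistics library.

```python
# Stopping distances at 60 mph, one list per tire brand (from the slide).
brands = {
    "Brand1": [194, 184, 189, 189, 188, 186, 195, 186, 183, 188],
    "Brand2": [189, 204, 190, 190, 189, 207, 203, 193, 181, 206],
    "Brand3": [185, 183, 186, 183, 179, 191, 188, 196, 189, 194],
    "Brand4": [183, 193, 184, 186, 194, 199, 196, 188, 193, 196],
    "Brand5": [195, 197, 194, 202, 200, 211, 203, 206, 202, 195],
}


def one_way_anova(groups):
    """Return the pieces of a one-way ANOVA table (hypothetical helper)."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)   # N in the general table
    t = len(groups)             # number of factor levels
    grand_mean = sum(all_values) / n_total

    # Between-group variation: group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group variation: values around their own group mean.
    ss_error = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    msb = ss_between / (t - 1)
    mse = ss_error / (n_total - t)
    return {
        "df_factor": t - 1,
        "df_error": n_total - t,
        "ss_between": ss_between,
        "ss_error": ss_error,
        "msb": msb,
        "mse": mse,
        "F": msb / mse,
    }


table = one_way_anova(list(brands.values()))
for name, value in table.items():
    print(f"{name}: {value:.2f}")
```

The printed sums of squares and F statistic should agree with the table above after rounding.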

Another Transportation Example

Another Transportation Example (cont)

Another Transportation Example (cont) important

Regression Analysis

Purpose
• Model a continuous Y (dependent variable) on a vector of Xs (explanatory or independent variables, aka covariates) …
  • What causes Y?
  • What is the future of Y?
  • Can we control Y?
  • How does a change in an X affect Y?
• Note: causation is important in specifying the model

Examples of regression models
• Trips per household per day related to household demographics, land use, access to the network, etc.
• Arterial crashes related to accesses per mile, traffic volume, minutes of delay, stops per vehicle, etc.
• Crashes related to facility design such as lane width, shoulder width, degree of horizontal curvature, etc.
• What other regression relationships are commonly used in transportation?

Assumptions of linear regression
• Dependent variable is continuous (if not, use Poisson, binomial or logistic regression)
1. The dependent variable varies linearly with the independent variable (you can linearize, but this is not always appropriate)
2. The dependent variable is randomly sampled from the population of interest
3. Changes in the dependent variable are caused by changes in the independent variable

Assumptions of linear regression
4. There is uncertainty in the relationships, reflected as error terms
5. The errors must be normally distributed with mean zero and constant variance (homoskedastic), or the distribution must be identified (if you want to use inference)
6. Independent variable is measured without error
• Errors are not autocorrelated (over time, same person, etc.)
• Errors are independent of X values
• Xs are independent (or at least not too co-dependent)
• All relevant variables are in the model (no omitted variables)
• No endogeneity exists (Y influences X, e.g., frequency of ice-related crashes influences presence of ice warning signs on the roadway)

If assumptions are violated …
• Non-normal errors (5)
  • Transform, use Poisson or other, bootstrap or Monte Carlo to define the actual distribution
• Non-linearity (1)
  • Transform (careful of other assumptions!)
• Non-constant variance (heteroskedastic) (5)
  • Use weighted, ridge or generalized regression
• Correlations across time
  • Use time series (e.g., ARIMA)
• Non-random errors
  • Instrumental-variable techniques, proxy, structural models

Regression theory
• First specify a relationship
  Y_i = b_0 + b_1 X_{1,i} + b_2 X_{2,i} + … + b_{p-1} X_{p-1,i} + e_i
• Y_i = the ith observation of the dependent variable
• b_1, b_2, …, b_{p-1} are the partial effects of the independent variables (the coefficients)
• X_1, X_2, …, X_{p-1} are the explanatory variables (covariates)
• e_i is the random error term with mean of zero; e_i and e_j are uncorrelated

http://www.cogs.susx.ac.uk/users/andyf/teaching/pg

First steps
Note: estimates are made based on minimizing squared error in the Y direction
• Propose a model form y = f(x’s)
  • Can include interaction terms (e.g., X1*X2)
• Plot data (Y vs each X)
  • Identify linear relationships
  • Identify data issues
• Transform X data to linearize if needed
• Estimate the model

Estimate the model
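For a single explanatory variable, minimizing squared error in the Y direction has a closed-form answer. The following is a minimal sketch of that calculation; the function name `fit_simple_ols` and the data values are made up for illustration and do not come from the lecture.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # The fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data, roughly following y = 1 + 2x with some noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = fit_simple_ols(x, y)
print(f"y = {b0:.3f} + {b1:.3f} x")
```

With more than one X the same idea applies, but the estimates come from solving the normal equations, which is normally left to statistical software.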

Goodness of fit
• R squared = (SST-SSE)/SST = 1-SSE/SST
  • Always goes up when adding variables
  • AKA the coefficient of determination (in simple regression, the square of the Pearson correlation coefficient)
• Adjusted R squared = 1-(n-1)/(n-p)*SSE/SST
  • Where n is the sample size and p is the number of estimated parameters (the x variables plus the intercept)
  • Goes down when adding insignificant variables
• Only use R squared to compare models
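Both fit measures follow directly from SSE and SST. This sketch assumes p counts every estimated parameter including the intercept (so the denominator is n-p); the function names and the small data set are illustrative, not from the lecture.

```python
def r_squared(y, y_hat):
    """R squared = 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1 - sse / sst


def adjusted_r_squared(y, y_hat, p):
    """Adjusted R squared = 1 - (n-1)/(n-p) * SSE/SST.

    p is assumed to be the number of estimated parameters, intercept included.
    """
    n = len(y)
    return 1 - (n - 1) / (n - p) * (1 - r_squared(y, y_hat))


# Made-up observed values and fitted values from some two-parameter model.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(y, y_hat), 4))
print(round(adjusted_r_squared(y, y_hat, p=2), 4))
```

Because the penalty factor (n-1)/(n-p) grows with p, the adjusted value can fall when an added variable reduces SSE only slightly.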

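Full vs. reduced models
• Full model uses all variables
• Reduced model drops one or more variables thought not to contribute (or to have problems)
• To test the hypothesis that the additional variables in the full model have a beta of zero (i.e., are meaningless), use
  F = [(SSE_R - SSE_F) / (DF_R - DF_F)] / (SSE_F / DF_F)
  where SSE_R, DF_R are the error sum of squares and error degrees of freedom of the reduced model, and SSE_F, DF_F those of the full model
• If F < F from table (1-α, DF_R-DF_F, DF_F) then do not reject H0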
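The full-versus-reduced comparison is a short calculation once both models have been fitted. The sketch below uses the standard form F = [(SSE_R - SSE_F)/(DF_R - DF_F)] / (SSE_F/DF_F); the function name and the numbers are hypothetical, and the critical value still has to come from an F table.

```python
def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for testing whether the extra variables in the full model matter.

    df_reduced and df_full are the error degrees of freedom of each model.
    """
    extra_explained = (sse_reduced - sse_full) / (df_reduced - df_full)
    full_error_variance = sse_full / df_full
    return extra_explained / full_error_variance


# Hypothetical values: dropping two variables raises SSE from 100 to 120.
f_stat = partial_f(sse_reduced=120.0, df_reduced=47, sse_full=100.0, df_full=45)
print(round(f_stat, 2))
# Compare f_stat with the tabled F(1 - alpha; 2, 45) critical value.
```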

Example FARS Data

One possible model
Fatal Crashes = 48.13122 + 0.014259 (VMT)
t statistic = coeff/standard error
How to Read the Output From Simple Linear Regression Analyses: http://www.tufts.edu/~gdallal/slrout.htm
* The Standard Errors are the standard errors of the regression coefficients. They can be used for hypothesis testing and constructing confidence intervals. For example, say the standard error of a coefficient is 0.219. A 95% confidence interval for the regression coefficient is constructed as (estimate ± k × 0.219), where k is the appropriate percentile of the t distribution with degrees of freedom equal to the Error DF from the ANOVA table. If, say, the degrees of freedom is 60, the multiplier is 2.00. Thus, with a coefficient estimate of 3.016, the confidence interval is given by (3.016 ± 2.00 (0.219)). If the sample size were huge, the error degrees of freedom would be larger and the multiplier would become the familiar 1.96.
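The interval arithmetic in the note above can be written out as a short sketch. The numbers 3.016, 0.219 and 2.00 are the ones quoted in the example; the function names are illustrative, and the multiplier is passed in rather than looked up, since it comes from a t table.

```python
def t_statistic(coefficient, std_error):
    """t statistic reported in regression output: coefficient / standard error."""
    return coefficient / std_error


def confidence_interval(estimate, std_error, multiplier):
    """Interval estimate +/- multiplier * standard error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


# Values from the worked example: 60 error d.f. gives a multiplier of about 2.00.
low, high = confidence_interval(3.016, 0.219, 2.00)
print(f"95% CI: ({low:.3f}, {high:.3f})")
print(f"t = {t_statistic(3.016, 0.219):.2f}")
```

An interval that excludes zero corresponds to a coefficient that is significant at the matching level.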

Is the model good enough?
• Although we expect VMT and fatal crashes to be related, we know it's not that simple
• Other factors can include
  • Percentage of VMT on rural highways
  • Percentage of VMT on highways of different classes
  • Weather conditions
  • Response of medical services

Adding another variable
• Include percentage of VMT on rural highways
Fatal Crashes = -113.2 + 0.0142 (VMT) + 297.4 (% of VMT on rural roads)
What does an R Square of 0.93 mean?

Using rural variable alone …
• Fatal crashes and percent of VMT in rural areas
Fatal Crashes = 1526 - 1485 (% of VMT on rural roads)
• What is up with this?

Regression example
• Fatal crash rate vs percent rural VMT
Fatal Crash rate = 0.017 + 0.009 (% of VMT on rural roads)
Why can we only account for 18% of the dependent variable variance?

Percentage of Rural VMT versus Crash Rate

Taking % VMT rural into account
• Add a dummy variable (0 or 1) for each of the following ranges of % rural VMT.
  • 0% to 20%
  • 20% to 40%
  • 40% to 60%
• A dummy for 60% to 100% is not added, to avoid perfect dependence between variables; that range is taken into account by the intercept constant.
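The dummy coding described above can be sketched as follows. The slide does not say which range a boundary value such as exactly 20% belongs to, so the half-open bins used here are an assumption, as is the function name `rural_dummies`.

```python
def rural_dummies(pct_rural):
    """Return (d_0_20, d_20_40, d_40_60) indicator variables for % rural VMT.

    The 60% to 100% range is the reference level: all three dummies are 0,
    and its effect is absorbed by the intercept.
    Bin edges are assumed half-open, e.g. 20% falls in the 20-40 range.
    """
    if not 0 <= pct_rural <= 100:
        raise ValueError("pct_rural must be between 0 and 100")
    return (
        1 if pct_rural < 20 else 0,
        1 if 20 <= pct_rural < 40 else 0,
        1 if 40 <= pct_rural < 60 else 0,
    )


for pct in (10, 35, 55, 75):
    print(pct, rural_dummies(pct))
```

At most one dummy is 1 for any observation, which is why a fourth dummy would be perfectly determined by the other three plus the intercept.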

Result of dummy model
Fatal Crashes = 83.44 + 0.015(total VMT) - 323.9(1 when rural VMT < 20%) - 47.99(1 when rural VMT < 40%) - 41.49(1 when rural VMT < 60%)

Alternative specification
Fatal Crashes = 39.71 + 0.015(total VMT) + 43.92(1 when rural VMT > 60%) - 279.3(1 when rural VMT < 20%)

Final specification
Fatal Crashes = 59.44 + 0.015(total VMT) - 287.9(1 when rural VMT < 20%)
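To see what the dummy term does, the final specification can be evaluated for two cases that differ only in their rural share. This is a sketch: the coefficients are those on the slide, but the units of total VMT are not stated in the transcript, so the input value below is purely hypothetical.

```python
def predict_fatal_crashes(total_vmt, pct_rural):
    """Evaluate the final specification from the slide.

    total_vmt must be in the same (unstated) units used to fit the model.
    """
    low_rural = 1 if pct_rural < 20 else 0  # dummy: 1 when rural VMT < 20%
    return 59.44 + 0.015 * total_vmt - 287.9 * low_rural


# Same hypothetical VMT, different rural shares.
mostly_urban = predict_fatal_crashes(total_vmt=50_000, pct_rural=10)
more_rural = predict_fatal_crashes(total_vmt=50_000, pct_rural=50)
print(round(mostly_urban, 2), round(more_rural, 2))
print(round(more_rural - mostly_urban, 2))  # gap equals the dummy coefficient
```

The dummy shifts the intercept only: the slope on total VMT is the same for both groups.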

Assumption Checks
• Non-linearity of regression function
• Heteroscedasticity of error terms
• Lack of independence of error terms
• Extreme influence of outlying observations
• Non-normality of error terms
• Omission of important variables
• Multi-collinearity of independent variables
• Poorly measured independent variables
