STAT 6360 – Statistical Software Programming
SAS PROCs for Statistical Analysis
Next we will cover several PROCs that implement basic statistical methods:
• PROC UNIVARIATE: computes summary statistics and explores distributional properties for one variable at a time.
• PROC TTEST: conducts paired, one-sample, and two-sample t tests.
• PROC CORR: Pearson, Spearman, and partial correlations.
• PROC REG: simple and multiple linear regression analysis.
• PROC ANOVA: analysis of variance for balanced designs.
• PROC GLM: any classical linear model, including multiple regression, analysis of variance, and analysis of covariance.
We’ll also cover the ODS Graphics from these PROCs. Other PROCs for simple statistical methods include FREQ and MEANS, but we’ve already discussed these.
PROC UNIVARIATE
Syntax:
    PROC UNIVARIATE data=dsname <options>;
      VAR varlist;
      <plot statements>
    RUN;
• A BY or CLASS statement can be used to do separate analyses within groups.
• Univariate analyses are done for each variable in varlist.
• Plot statements include:
  • CDFPLOT: plots the empirical cumulative distribution function, with optional theoretical overlays.
  • HISTOGRAM
  • PPPLOT: empirical CDF versus fitted CDF.
  • PROBPLOT: sample quantiles versus theoretical quantiles (with the theoretical axis labeled in probabilities).
  • QQPLOT: sample quantiles versus theoretical quantiles.
• The last three are closely related; each compares the empirical distribution of the data to a theoretical reference distribution. Typically, QQPLOT is the most useful.
PROC UNIVARIATE
• The plot statements produce high-resolution graphs using ODS Graphics if ODS Graphics is ON. Otherwise, older plotting techniques based on SAS/GRAPH are used.
• The PLOT option on the PROC statement produces a histogram, box plot, and normal Q-Q plot, rendered in hi-res or low-res graphics depending on whether ODS Graphics is ON or OFF.
• Other useful options on the PROC statement:
  • NORMAL: requests tests of normality.
  • MU0=: requests tests of the hypothesis that the population mean equals the specified null value. Yields t, sign, and signed-rank tests.
  • CIBASIC: requests normal-theory CIs for the population mean, SD, and variance.
• There is also an OUTPUT statement whose syntax and purpose are almost identical to those of the OUTPUT statement in PROC MEANS.
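The options above can be combined in a single call. The following is a minimal sketch, assuming the built-in SASHELP.CLASS dataset as a stand-in for the course data; the variable height and the null value 60 are illustrative choices, not taken from the lecture.

```sas
/* Sketch only: SASHELP.CLASS and the null value 60 are stand-ins,
   not the data or hypotheses used in the lecture. */
ods graphics on;

proc univariate data=sashelp.class normal mu0=60 cibasic;
   var height;                                /* one analysis per variable listed */
   histogram height / normal;                 /* histogram with fitted normal overlay */
   qqplot height / normal(mu=est sigma=est);  /* normal Q-Q plot with reference line */
run;
```

NORMAL adds the table of normality tests, MU0=60 adds the t, sign, and signed-rank tests of that null value, and CIBASIC adds the normal-theory confidence limits.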
PROC TTEST
Syntax for paired t-test:
    PROC TTEST data=dsname;
      PAIRED var1*var2;
    RUN;
• For the paired test, the response from each member of a pair is stored in a separate variable (var1, var2).
Syntax for one-sample t-test:
    PROC TTEST data=dsname H0=null_val;
      VAR response_var;
    RUN;
Syntax for two-sample t-test:
    PROC TTEST data=dsname H0=null_val;
      CLASS group_var;
      VAR response_var;
    RUN;
• null_val is usually 0 in the two-sample case.
PROC TTEST
• There are options for one-sided tests, different significance levels (default is 0.05), and equivalence tests.
• The paired t-test is really a one-sample t-test in which the response variable is the difference within a pair.
• Assumptions of the one-sample t-test:
  • The response (the difference, in the paired case) is normally distributed.
  • The response is independent from one subject (pair) to the next.
• Because of the normality assumption, the ODS graphs include a normal Q-Q plot.
• The two-sample t-test assumes:
  • The response is independent from one subject to the next and from one sample to the next.
  • The response is normally distributed.
• If we assume the same variance in the two samples, we use the classical t-test with a pooled estimate of the common variance.
• Without that assumption, we can use Satterthwaite’s version of the t-test.
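As a rough illustration of these options, the sketch below runs a two-sample and a one-sided one-sample test. It assumes SASHELP.CLASS as stand-in data; the grouping variable sex, the response height, and the null value 60 are illustrative assumptions.

```sas
/* Sketch only: SASHELP.CLASS is a stand-in dataset. */
ods graphics on;

/* Two-sample test: the output reports both the pooled and the
   Satterthwaite versions, plus a test of equal variances. */
proc ttest data=sashelp.class h0=0 alpha=0.05;
   class sex;
   var height;
run;

/* One-sample, upper one-sided test (H1: mean > 60) with 90% confidence limits. */
proc ttest data=sashelp.class h0=60 sides=u alpha=0.10;
   var height;
run;
```

SIDES=U requests the upper one-sided alternative; SIDES=L gives the lower one, and the default is two-sided.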
Choice of Test:
T-Test Examples
Example 2 – FEV1 (again), a paired t-test
• We did this previously by analyzing the Post minus Pre difference variable using PROC MEANS.
• The same approach can be used with PROC UNIVARIATE.
• With the PAIRED statement in PROC TTEST, we instead need the Posttest and Pretest responses as separate variables.
• Same results as before (of course). The normal Q-Q plot looks OK.
Example 3 – Pizza Diameters
• First, we treat this as a two-sample problem and test whether pizzas from Domino’s and Eagle Boys have the same mean diameter.
• Both versions of the t-test are highly significant, but the significant difference in variance implies that the Satterthwaite version is more appropriate.
• But look at the histograms and Q-Q plots! For Domino’s especially, these suggest that the t-test assumptions are not met. The naïve conclusion is that the data are not normal, but the real problem is that the data are not i.i.d.!
• Recall that there are three crust types for each chain. Side-by-side box plots show that they have different distributions. Therefore we repeat the t-test separately by crust type.
• Domino’s mean diameter is significantly smaller in each case!
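The by-crust analysis in Example 3 might look roughly like the sketch below. The dataset name pizza and the variable names store, crust, and diameter are hypothetical; the lecture's actual names are not shown in these slides.

```sas
/* Hypothetical names: dataset PIZZA with variables STORE (2 chains),
   CRUST (3 crust types), and DIAMETER. Adjust to the real data. */
proc sort data=pizza out=pizza_sorted;
   by crust;                      /* BY-group processing needs sorted data */
run;

proc sgplot data=pizza_sorted;
   vbox diameter / category=crust group=store;   /* side-by-side box plots */
run;

proc ttest data=pizza_sorted;
   by crust;                      /* one two-sample t-test per crust type */
   class store;
   var diameter;
run;
```

Running the test within each crust type removes the mixture of distributions that made the pooled two-sample analysis look non-normal.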
PROC CORR
Syntax:
    PROC CORR data=dsname <options>;
      VAR varlist1;
      <WITH varlist2;>
      <PARTIAL varlist3;>
    RUN;
• Options include:
  • SPEARMAN: requests Spearman (rank) correlations instead of Pearson correlations (the default).
  • COV: requests covariances (the variance-covariance matrix is given).
  • FISHER<(options)>: conducts inferences (e.g., confidence intervals) for the population correlation coefficient based on Fisher’s Z-transformation.
  • OUTP= / OUTS=: specify output datasets for Pearson and Spearman correlations, respectively.
  • PLOTS<(options)>=SCATTER | MATRIX: requests either a scatter plot for each pair of variables or a scatter plot matrix for all variables. ODS Graphics must be ON.
PROC CORR
• The VAR statement lists the variables to analyze (if it is omitted, PROC CORR uses all numeric variables not named in other statements). If there is no WITH statement, a matrix containing the correlations between each pair of variables is given.
• Entries in the matrix are:
  • the sample correlation,
  • the p-value for the test of the hypothesis that the population correlation equals 0,
  • the number of observations for which both variables are non-missing. Note that this can differ from one pair of variables to another.
• Summary statistics for each variable are also given.
• See Example 4 in Lec7Examps.sas, which generates correlations among life expectancy, people/TV, and people/doctor in the nations with the largest populations.
• The latter two variables are highly non-normal, so Spearman correlations are more appropriate.
PROC CORR
• If there is a WITH statement, a rectangular array of pairwise correlations between the VAR variables and the WITH variables is given.
• See Example 5 in Lec7Examps.sas.
• Here we use a dataset containing 21 body dimension measurements plus age, height, weight, and gender from 507 men and women who were regular exercisers. The first 9 variables are skeletal measurements, while the next 12 variables are “girth” measurements.
• First we compute Pearson correlations between the first 5 skeletal variables and the first 5 girth variables.
• For illustration, I requested a scatter plot of the 1st girth variable (shoulder girth) versus the 1st skeletal variable (biacromial diameter) and included a 95% prediction ellipse. I also requested a scatter plot matrix for all pairs of variables.
• Next we compute partial Pearson correlations, controlling for age, height, and gender.
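The three kinds of PROC CORR call described above can be sketched as follows. This is not the code from Lec7Examps.sas; it assumes the built-in SASHELP.CARS dataset as a stand-in, and the variable choices are purely illustrative.

```sas
/* Sketch only: SASHELP.CARS stands in for the body-measurement data. */
ods graphics on;

/* Rectangular array of correlations (VAR rows by WITH columns), with
   Fisher-z inference and scatter plots with 95% prediction ellipses. */
proc corr data=sashelp.cars pearson spearman fisher
          plots=scatter(ellipse=prediction alpha=0.05);
   var mpg_city mpg_highway;
   with weight horsepower;
run;

/* Square correlation matrix plus a scatter plot matrix. */
proc corr data=sashelp.cars plots=matrix;
   var mpg_city weight horsepower;
run;

/* Partial Pearson correlations, controlling for engine size. */
proc corr data=sashelp.cars;
   var mpg_city weight horsepower;
   partial enginesize;
run;
```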
*(and do inference on)
* Linear effects, to be precise.
PROC REG
• Options on the MODEL statement include:
  • NOINT: omits the intercept from the model.
  • CLB: requests confidence intervals for the regression parameters.
  • CLM / CLI: request confidence intervals for the mean (CLM) or prediction intervals for the response (CLI) at each value of the explanatory variables.
  • INFLUENCE: requests influence and leverage statistics.
  • P / R: request that predicted values / residuals be printed for each observation.
  • DW: requests the Durbin-Watson statistic, which checks for serial correlation (add DWPROB to get its p-value).
  • ALPHA=: sets the significance level for tests and confidence/prediction intervals.
  • SELECTION=: chooses a method for automatic selection of the explanatory variables to include in the model. The default is NONE, meaning all variables on the MODEL statement are included. Other choices include FORWARD, BACKWARD, STEPWISE, MAXR, and MINR, which search for a well-fitting model without trying all possible models, and RSQUARE, ADJRSQ, and CP, which consider all possible subsets and rank them by the corresponding criterion.
  • Other options specify details of the selection method, ask for non-default results to be printed, or ask for certain results to be output to SAS datasets.
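A minimal sketch of these MODEL options in use, assuming SASHELP.CARS as a stand-in for the lecture's car data (its mileage variable is mpg_city rather than city_mpg). The candidate predictors and the entry/stay levels in the second step are illustrative assumptions.

```sas
/* Sketch only: SASHELP.CARS is a stand-in dataset. */
ods graphics on;

/* Simple linear regression with interval estimates and diagnostics.
   CLM, CLI and INFLUENCE print one row per observation. */
proc reg data=sashelp.cars simple;
   model mpg_city = weight / clb clm cli influence dw alpha=0.05;
run;
quit;

/* Stepwise selection among several candidate predictors. */
proc reg data=sashelp.cars;
   model mpg_city = weight horsepower enginesize length
         / selection=stepwise slentry=0.05 slstay=0.05;
run;
quit;
```

PROC REG is interactive, so the QUIT statement ends the procedure after each RUN.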
Example #6 - Car Mileage and Weight
• Heavier cars guzzle more fuel, right? Let’s see if this is true.
• We start with a simple linear regression of city_mpg on weight. The SIMPLE option gives summary statistics for each variable; the output that appears by default is annotated as follows:
  • Overall F test: tests that the coefficients of the predictors are all equal to 0.
  • R-square: the model explains 71.09% of the variance in city_mpg.
  • Intercept: estimated mean mileage when a car weighs 0 lbs (not interesting!).
Example #6 - Car Mileage and Weight
• When there is more than one explanatory variable, the fitted curve plot is not given by default. If the predictors are all functions of a single variable, however, it is still useful and can be requested with the plots=prediction(x= ) plot request.
• This plots the fitted curve versus the variable specified as x. A residual plot (versus x) is also given, with (optionally) a loess curve to help show any pattern.
• This plot looks good for model M2 in our example.
• Predicted values (and residuals) can be requested with the R option on the MODEL statement. This gives a prediction for each observation in the dataset. What if you want a prediction at a new value of the explanatory variables?
• This can be done by including a new observation in the dataset with the values of the x-variable(s) of interest and with y=. (missing).
• Such observations are excluded when fitting the model but included when computing predictions.
Example #6 - Car Mileage and Weight
• We illustrate this technique by predicting the mileage of a car weighing 4500 lbs. This is well outside the range of weights in the dataset, so the illustration also shows the danger of extrapolation.
• We add the data for a 4500 lb car in a new dataset stdcars2, refit model M2 as model M2a, and use the OUTPUT statement to get fitted values, residuals, etc.
Syntax:
    OUTPUT OUT=dsname keyword1=varname1 keyword2=varname2 … ;
• This creates a dataset that includes the original data plus quantities such as predicted values (keyword P), raw residuals (keyword R), internally and externally studentized residuals (keywords STUDENT and RSTUDENT, respectively), lower and upper limits for the estimated mean (keywords LCLM and UCLM), lower and upper limits for the predicted response (keywords LCL and UCL), etc.
• We plot the fitted curve versus weight, including the new prediction.
• Notice that the extrapolated curve goes up! This is an artifact of using a quadratic, whose graph is a parabola. It fits well within the data range but shouldn’t be extrapolated!
• We also identify the extreme outlier as a Honda Civic.
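The missing-response technique and the OUTPUT statement can be sketched as below. This is an illustration rather than the lecture's code: it assumes SASHELP.CARS in place of the course dataset, a quadratic-in-weight model standing in for M2a, and hypothetical names (newcar, cars2, weight_sq, pred, and the output variable names). In SASHELP.CARS a weight of 4500 lbs is inside the observed range, so this stand-in does not reproduce the extrapolation problem described above.

```sas
/* Sketch only: dataset and variable names below are assumptions. */
data newcar;
   weight   = 4500;      /* x-value at which a prediction is wanted */
   mpg_city = .;         /* missing response: excluded from the fit */
run;

data cars2;
   set sashelp.cars(keep=make model weight mpg_city) newcar;
   weight_sq = weight*weight;    /* quadratic term built in the DATA step */
run;

proc reg data=cars2 plots=none;
   M2a: model mpg_city = weight weight_sq;
   output out=pred p=yhat r=resid student=stud rstudent=rstud
          lclm=lo_mean uclm=hi_mean lcl=lo_pred ucl=hi_pred;
run;
quit;

/* Show the prediction and its intervals for the added observation. */
proc print data=pred;
   where mpg_city = .;
   var weight yhat lo_mean hi_mean lo_pred hi_pred;
run;
```

The residual variables are missing for the added row because its response is missing, but the predicted value and both sets of limits are still computed.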
PROC ANOVA
• Use ONLY for the one-way model or for balanced multi-factor models. In fact, this PROC is completely subsumed by PROC GLM, so there is little reason to use PROC ANOVA at all!
Basic syntax for the one-way model:
    PROC ANOVA data=dsname <options>;
      CLASS factor;
      MODEL response = factor </ options>;
    RUN;
• Options on the PROC statement include:
  • PLOTS=: only one plot is available, a graph of side-by-side box plots for each level of factor. This is produced automatically if ODS Graphics is ON. Use PLOTS=none to suppress it.
Example #7 – Pizza Diameters
• When illustrating the t-test, we compared mean pizza diameters between Domino’s and Eagle Boys for each crust type.
• Now we compare mean diameter across 6 different pizza types: D’s Thick, D’s Medium, D’s Thin, EB’s Thick, EB’s Medium, and EB’s Thin.
• That is, we fit a one-way ANOVA model with pizza type (6 levels) as our factor.
• The CLASS statement generates a summary of the factors we have defined (pizzatype in our case, with 6 levels).
• Then comes the ANOVA table, annotated as follows:
  • Model line: all effects in the model are merged in this line and tested simultaneously (akin to the overall regression test), then broken apart by effect below.
  • Conclusion: reject the hypothesis that the mean is the same for the 6 pizza types.
PROC ANOVA
• The MEANS statement requests estimates and other inferences on marginal or joint means:
Syntax:
    MEANS effect </ options>;
• Here, effect specifies a factor (e.g., A) or a combination of factors (e.g., A*B). Means are estimated at each level or combination of levels of the factors. Options include:
  • ALPHA=: sets the significance level for tests and confidence intervals.
  • HOVTEST=: requests one of several tests of the hypothesis that the variance is equal (homoscedasticity) across levels of the factor. I recommend judging homoscedasticity from residual plots, but Brown-Forsythe (BF) is the best test.
  • CLM: asks for CIs for the means.
  • CLDIFF: asks for CIs for pairwise differences among means.
  • T / BON / TUKEY / SCHEFFE (and others): ask for various types of multiple comparison adjustments for pairwise difference tests and intervals.
  • WELCH: requests Welch’s F test, an alternative to the standard overall ANOVA F test of equal means that is valid when the variance differs across treatments.
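A one-way analysis using several of these MEANS options might look like the following sketch. It assumes SASHELP.CARS as stand-in data, with origin (3 levels) playing the role of the pizza-type factor; none of these names come from the lecture.

```sas
/* Sketch only: SASHELP.CARS and the factor ORIGIN are stand-ins. */
ods graphics on;

proc anova data=sashelp.cars;
   class origin;
   model mpg_city = origin;
   /* Brown-Forsythe test of equal variances, Welch's F test, and
      Tukey-adjusted confidence intervals for pairwise differences. */
   means origin / hovtest=bf welch tukey cldiff alpha=0.05;
run;
quit;
```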
PROC GLM
• Can fit ANOVA, ANCOVA, and regression models. Doesn’t require balanced data.
• I recommend always using PROC GLM instead of PROC ANOVA.
• Basic syntax:
    PROC GLM data=dsname <options>;
      CLASS factor1 factor2 … ;
      MODEL response = effects </ options>;
    RUN;
• The effects specification determines the structure of the model for the mean response. It can involve continuous explanatory variables (covariates such as X1, X2, …) and/or factors (A, B, C).
PROC GLM
Types of effects:
• Covariate effects (as in regression): specified as a continuous variable written by itself: X1
• Polynomial effects: specified with an asterisk: X1*X1 or X1*X2
  • Corresponds to including a continuous predictor equal to the product of the covariates.
  • Note that the quadratic effect of X1 is X1*X1, not X1**2.
• Factor main effects: specified as a factor (class variable) written by itself: A
• Factor interactions: A*B or A*B*C
  • Here, the asterisk denotes interaction, not product.
• Continuous-by-factor interaction: X1*A
  • Means a separate continuous effect of X1 at each level of A.
• Nested factor effects: B(A) or C(B*A)
  • The 1st example specifies that the levels of B differ at each level of A (B nested within A).
PROC GLM – Common Models
• * denotes interaction, () denotes nesting. In addition, the vertical bar (|) is a shorthand:
  • A|B means A B A*B.
  • A|B|C means A B C A*B A*C B*C A*B*C.
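To make the effect notation concrete, here is a small sketch with SASHELP.CARS as stand-in data; origin and drivetrain act as factors A and B, and weight as covariate X1. These roles are illustrative assumptions.

```sas
/* Sketch only: SASHELP.CARS is a stand-in dataset. */

/* Two-way factorial: origin|drivetrain expands to
   origin drivetrain origin*drivetrain. */
proc glm data=sashelp.cars;
   class origin drivetrain;
   model mpg_city = origin|drivetrain;
run;
quit;

/* Factor main effect, covariate, quadratic term (weight*weight),
   and covariate-by-factor interaction (a separate slope per origin). */
proc glm data=sashelp.cars;
   class origin;
   model mpg_city = origin weight weight*weight weight*origin;
run;
quit;
```

Because weight is not on the CLASS statement, weight*weight is treated as a product of covariates, whereas origin*drivetrain is an interaction of factors.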
PROC GLM
MODEL statement options:
• NOINT: omits the intercept from the model.
  • E.g., if used in the one-way ANOVA model, it becomes the cell means model instead of the effects model.
• SOLUTION: asks for regression parameter estimates (not printed by default because they are of less interest in ANOVA models).
• SS1 / SS2 / SS3 / SS4: ask for Type I, II, III, and IV sums of squares, respectively. Type I (sequential SSs) and Type III (marginal SSs) are printed by default. The distinction is important in unbalanced multi-way models.
• CLI / CLM: ask for a PI for the response and a CI for the mean response, respectively.
• CLPARM: asks for CIs for the regression parameter estimates.
• ALPHA=: allows a significance level other than 0.05 (the default).
• P: asks for predicted values and residuals.
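A brief sketch of several MODEL options together, again assuming SASHELP.CARS as stand-in data with origin as the single factor:

```sas
/* Sketch only: SASHELP.CARS is a stand-in dataset. */
proc glm data=sashelp.cars;
   class origin;
   /* NOINT gives the cell means parameterization, so SOLUTION prints
      one estimated mean per origin; CLPARM adds 90% CIs (ALPHA=0.10). */
   model mpg_city = origin / noint solution clparm ss1 ss3 alpha=0.10;
run;
quit;
```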
PROC GLM
• In Example #7 we refit the one-way model with PROC GLM. The first three lines of code are identical to those for PROC ANOVA, but we replace the MEANS statement with LSMEANS.
Difference between MEANS and LSMEANS:
• MEANS computes simple sample means irrespective of the model.
• LSMEANS computes means using the least-squares estimators derived from the current model.
• For one-way models and balanced multi-factor models these coincide, but for unbalanced data and/or complex models they do not.
• Always use LSMEANS for estimation and inference on means!
• The syntax of the two statements is similar, but some options differ in syntax, and other options exist for one statement but not the other.
Useful options for LSMEANS:
• CL: computes CIs for the means, and CIs for pairwise mean differences if those are requested with the PDIFF or TDIFF option.
• STDERR: asks for standard errors of the means (always report these, not SDs).
PROC GLM – LSMEANS Statement
Additional useful options for LSMEANS:
• PDIFF=: asks for pairwise differences between all means (ALL), between each mean and a reference mean (CONTROL for a two-sided alternative; CONTROLL or CONTROLU for a one-sided alternative), or between each mean and the average mean (ANOM).
• ADJUST=: each type of PDIFF request has a default multiple comparison adjustment, but the adjustment can be specified explicitly with this option. Choices: BON, DUNNETT, SCHEFFE, TUKEY, T (no adjustment), and others.
• AT: in models with covariates, this allows the mean responses at each level of a factor to be estimated (and compared) at a fixed value of the covariate(s), written as AT covariate=value.
• SLICE=: in models with multiple factors, allows testing of simple effects (e.g., tests for differences across the levels of factor A within a given level of factor B).
• OUT=: specifies an output dataset for the estimated means. Useful for plotting or further analysis.
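The LSMEANS options above can be sketched as follows, assuming SASHELP.CARS as stand-in data. The factors origin and drivetrain, the covariate weight, the value 3500, the control level 'USA', and the output dataset name lsm are all illustrative assumptions.

```sas
/* Sketch only: SASHELP.CARS is a stand-in dataset. */
ods graphics on;

proc glm data=sashelp.cars;
   class origin drivetrain;
   model mpg_city = origin|drivetrain weight;

   /* All pairwise differences, Tukey-adjusted, with CIs and standard errors. */
   lsmeans origin / pdiff=all adjust=tukey cl stderr;

   /* Compare each origin with the USA at a fixed covariate value. */
   lsmeans origin / at weight=3500 pdiff=control('USA') adjust=dunnett;

   /* Simple effects: test origin differences within each drivetrain,
      and save the cell least-squares means to a dataset. */
   lsmeans origin*drivetrain / slice=drivetrain out=lsm;
run;
quit;
```

Without the AT option, covariates are held at their sample means when the least-squares means are computed.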
Other Results from Example #7
• The ESTIMATE statements show that Domino’s pizzas have a significantly smaller mean diameter for each crust type and when averaging over all three crust types.
• With ODS Graphics ON and a one-way ANOVA model, a side-by-side box plot graph is produced (same as in PROC ANOVA).
• The LSMEANS statement produces:
  • Estimates of the mean diameter for each pizza type and an ODS Graphics plot of the six estimated means.
  • A matrix of Tukey HSD-adjusted p-values for each pairwise comparison and a plot of the pairwise differences (significant in blue, non-significant in red). These results come from the PDIFF=ALL and ADJUST=TUKEY options.
  • Confidence intervals for each mean and each pairwise difference (from the CL option).
Example #8 – The Data
• Typically not of much interest. We are more concerned with the individual effects below.
• Notice that the Type I and Type III SSs don’t agree. Use Type III for designed experiments.
• Always start by checking the interaction. In this case it is significant, which means the main effects may not be meaningful. Check the interaction plot to understand the nature of the interaction.
• Exposure duration effects are not all equal for the exposed group. (Duration matters when exposed.)
• The exposed and control groups differ when duration = 14 days.
Example #8 – Creating Output Datasets
• For illustration purposes, I have written the estimated least-squares means to output datasets in two different ways:
  • First, using the out= option on the LSMEANS statement;
  • Second, using the ODS OUTPUT destination.
Syntax:
    ODS OUTPUT table_name=dsname;
• This asks for a particular set of results, specified by the keyword table_name, to be written to a SAS dataset called dsname.
• For each PROC, the SAS documentation provides a list of possible choices for table_name. This list can be found in the Details section of the documentation for the PROC.
• E.g., for PROC GLM, here are some of the possible choices:
  • LSMeans, LSMeanCL, Contrasts, FitStatistics, ClassLevels, ModelANOVA, OverallANOVA, ParameterEstimates, …
• There exist ODS output table names for almost every portion of the results that you could possibly want to write to a dataset.
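Both routes to an output dataset can be sketched as below. This is not the Example #8 code: it assumes SASHELP.CARS as stand-in data, and the output dataset names (lsm_ods, parms, lsm_out) are hypothetical.

```sas
/* Sketch only: SASHELP.CARS and all output dataset names are assumptions. */

ods trace on;    /* writes the name of each output table to the log */

/* Route 2: ODS OUTPUT must be issued before the PROC that makes the tables. */
ods output LSMeans=lsm_ods ParameterEstimates=parms;

proc glm data=sashelp.cars;
   class origin;
   model mpg_city = origin / solution;   /* SOLUTION creates ParameterEstimates */
   lsmeans origin / out=lsm_out;         /* Route 1: OUT= on LSMEANS */
run;
quit;

ods trace off;

proc print data=lsm_out; run;
proc print data=lsm_ods; run;
```

ODS TRACE is a convenient way to discover table names without consulting the documentation; the two printed datasets hold the same least-squares means with different variable layouts.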