Population Health Surveys Bootstrap Hands-on Workshop. Yves Beland, CCHS senior methodologist Larry MacNabb, CCHS dissemination manager developed by François Brisebois CCHS/NPHS senior methodologist francois.brisebois@statcan.ca. Purpose of the presentation.
Population Health Surveys Bootstrap Hands-on Workshop • Yves Beland, CCHS senior methodologist • Larry MacNabb, CCHS dissemination manager • developed by François Brisebois, CCHS/NPHS senior methodologist, francois.brisebois@statcan.ca
Purpose of the presentation • Justify the use of the bootstrap technique, understand the theory behind it, and become familiar with it • Dispel misconceptions about using the bootstrap technique for variance estimation
Outline • Context • NPHS \ CCHS Complex survey design • Variance estimation \ Bootstrap 101 • Data support \ using the bootvar program • Why bootstrap? • CV lookup tables • Historical info about variance estimation for NPHS • Variance estimation with other software programs • Future for STC Health Surveys (re. bootstrap)
Context • A data user is interested in producing some results 1- Compute an estimate (total, ratio, etc.) 2- Compute the precision of the estimate (variance, coefficient of variation (CV), etc.)
Context 1- Compute an estimate • Is not a problem! • Use the provided survey weight with NPHS/CCHS files
Context 1- Compute an estimate (cont’d) • Why use the survey weight? • Conclusion: ALWAYS USE THE WEIGHTS
Context 2- Compute the precision of an estimate • Is a problem!!
Context 2- Compute the precision of the estimate (cont’d) • Scaled weights: • Scaled weight = weight / mean(weight) • Used to overcome problems with the computation of the variance for some statistics in SAS • Reference: paper by G. Roberts et al.
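The scaled-weight formula on this slide can be illustrated with a minimal Python sketch. The weight values and the function name `scale_weights` are hypothetical and chosen only for illustration; the workshop itself uses SAS and SPSS.

```python
# Hypothetical survey weights for five respondents (illustrative values only).
weights = [120.0, 80.0, 200.0, 150.0, 50.0]


def scale_weights(w):
    """Divide each weight by the mean weight, so the scaled weights average 1."""
    mean_w = sum(w) / len(w)
    return [x / mean_w for x in w]


scaled = scale_weights(weights)

# The scaled weights sum to the sample size (5 here), not to the population size.
print(scaled)
print(sum(scaled))
```

Scaling leaves weighted proportions and means unchanged, because the same constant divides the numerator and the denominator. It does not account for stratification or clustering.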
Context 2- Compute the precision of the estimate (cont’d) • Why such a difference? Answer: The complex survey design is the main cause (other factors to be discussed later) Note: CCHS and NPHS have slightly different frames but are both considered as complex survey designs
Complex survey design 1- Each province is divided into strata Province A Stratum #1 Stratum #2
Complex survey design 2- Selection of clusters within each stratum Province A Stratum #1 Stratum #2
Complex survey design 3- Selection of households within each cluster Province A Stratum #1 Stratum #2
Complex survey design • How does the sample design affect the precision of estimates? • Stratification decreases variability (more precise) • Clustering increases variability (less precise) • Overall, the multistage design has the effect of increasing variability (less precise than SRS)
Complex survey design • So why use a multistage cluster sample design anyway? • Pros: • Efficient for interviewing (less travel, less costly) • Better coverage of the entire region of interest • Cons: • Problems for variance estimation
Bootstrap Method • Variance estimation with a complex multistage cluster sample design: • The exact formula for variance estimation is too complex; an approximate approach is required • NOTE: accounting for the design in variance estimation is as crucial as using the sampling weights when estimating a statistic
Bootstrap Method • Approximate methods for variance estimation: • Taylor linearization • Re-sampling methods: • Balanced Repeated Replication • Jackknife • Bootstrap
Bootstrap Method • Principle: • You want to know how precise your estimate of the number of smokers in Canada is • You could draw 500 totally new samples and compare the 500 estimates you would get from these samples. The variance of these 500 estimates would indicate the precision. • Problem: drawing 500 new samples is expensive ($$$) • Solution: Use your sample as a population, and take many smaller subsamples from it.
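Bootstrap 101 • How Bootstrap weights are created (the secret is finally revealed!!!) • T = 40 • $\mathrm{Var} = \sum_{i=1}^{500} (B_i - \bar{B})^2 / 499$, where $B_i$ is the estimate from the $i$-th set of bootstrap weights and $\bar{B}$ is the mean of the 500 bootstrap estimates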
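As a rough illustration of the variance formula on this slide, the Python sketch below recomputes a weighted total once per set of bootstrap weights and takes the variance of those replicate estimates around their mean. The data, the three replicate weight sets (a real file carries 500), and the function names are all hypothetical; this is not the bootvar code.

```python
import math


def weighted_total(values, weights):
    """Weighted total: sum of value * weight over all respondents."""
    return sum(v * w for v, w in zip(values, weights))


def bootstrap_variance(replicate_estimates):
    """Variance of the replicate estimates around their mean, divisor R - 1."""
    r = len(replicate_estimates)
    mean_b = sum(replicate_estimates) / r
    return sum((b - mean_b) ** 2 for b in replicate_estimates) / (r - 1)


# Hypothetical dummy variable (1 = has the characteristic) for four respondents.
y = [1, 0, 1, 0]
# Hypothetical final weights and three hypothetical sets of bootstrap weights.
final_weight = [10.0, 20.0, 30.0, 40.0]
bootstrap_weights = [
    [0.0, 25.0, 35.0, 40.0],
    [20.0, 20.0, 0.0, 60.0],
    [10.0, 0.0, 60.0, 30.0],
]

estimate = weighted_total(y, final_weight)
replicates = [weighted_total(y, bw) for bw in bootstrap_weights]
variance = bootstrap_variance(replicates)
cv = math.sqrt(variance) / estimate  # coefficient of variation

print(estimate, replicates, variance, cv)
```

The point estimate always comes from the final weight; the bootstrap weights are used only to measure how much that estimate would vary.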
Bootstrap 101 • How Bootstrap replicates are built (cont’d) • The “real” recipe 1- Subsampling of clusters (SRS) within strata 2- Apply (initial design) weight 3- Adjust weight for selection of n-1 among n 4- Apply all standard adjustments (nonresponse, share, etc.) 5- Post-stratification to population counts
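Steps 1 to 3 of the recipe might look roughly like the Python sketch below. It assumes a Rao-Wu style rescaling: n - 1 clusters are drawn with replacement from the n clusters of each stratum, and each weight is multiplied by n / (n - 1) times the number of times its cluster was drawn. The slide says only "SRS" and "n-1 among n", so the with-replacement draw and the exact adjustment factor are assumptions, as are all names and values. Steps 4 and 5 (nonresponse and other adjustments, post-stratification) are not shown.

```python
import random
from collections import Counter


def bootstrap_design_weights(strata, design_weight, rng):
    """Return one set of bootstrap design weights (steps 1 to 3 only).

    strata: dict mapping stratum id -> list of cluster ids (at least 2 each)
    design_weight: dict mapping cluster id -> initial design weight
    rng: random.Random instance used for the draws
    """
    new_weight = {}
    for clusters in strata.values():
        n = len(clusters)
        # Step 1: draw n - 1 clusters with replacement (assumed) within the stratum.
        draws = Counter(rng.choice(clusters) for _ in range(n - 1))
        for c in clusters:
            # Steps 2 and 3: apply the design weight, then adjust for n - 1 among n.
            # Clusters that were never drawn get a bootstrap weight of 0.
            new_weight[c] = design_weight[c] * n / (n - 1) * draws[c]
    return new_weight


# Hypothetical design: two strata, equal design weights within each stratum.
strata = {"S1": ["a", "b", "c"], "S2": ["d", "e"]}
design_weight = {"a": 100.0, "b": 100.0, "c": 100.0, "d": 50.0, "e": 50.0}

rng = random.Random(12345)
replicate = bootstrap_design_weights(strata, design_weight, rng)
print(replicate)
```

Repeating the call 500 times, then running steps 4 and 5 on each result, would give 500 sets of bootstrap weights under these assumptions.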
Bootstrap 101 • How Bootstrap replicates are built (cont’d) • The bootstrap method is intended to mimic the approach used for the sampling and weighting processes • Be careful: some software programs say they include the bootstrap technique; what they really do is skip steps #4 and #5 and use the final weight directly in step #2
Bootstrap 101 • STC Methodologists create the bootstrap weight files. • Can you create your own bootstrap wgt file? No. Why? Because to do so you need to know: • The design information, i.e. strata, clusters (to generate the bootstrap subsamples) • The definition of all adjustment classes (including post-stratification)
Bootstrap 101 • The bootstrap wgt files are: • Available for all files (except PUMF - confidentiality) • Distributed with the data files in separate files • The bootstrap wgt files contain: • IDs (REALUKEY/SAMPLEID, PERSONID) • Final sampling weight (WTxx) • 500 Bootstrap weights (BSW1--BSW500)
Bootstrap - Support • NPHS/CCHS provides data users with SAS & SPSS macro programs to compute bootstrap variances • The macros simplify the computation of bootstrap variance estimates for totals, ratios, differences of ratios, regressions (linear and logistic), and basic generalized linear models • Come with documentation & examples • French and English • Referred to as “bootvar”
Example: Step by Step • Let’s get to work! • Goal: Interested in estimating the number of diabetics (total) • NPHS 1998-99 Dummy file (see information sheet)
Example: Step by Step • STEP #1 Create your « analysis data file » • Read NPHS\CCHS data file • Prepare dummy variables necessary for your analysis • Keep only necessary variables (include geography desired) • Run the analysis to get point estimates only (not necessary but recommended) • STEP #2 Compute your variances with bootvar • Location of INPUT files: your « analysis data file » and the bootstrap weights file • Geography desired • Number of bootstrap weights to use • Specify the desired analysis: totals, ratios, diff of ratios; regression (linear & logit); generalized linear modeling
Example: Step by Step • Step #1: On your own (but you can use the examples provided as a starting point) • Step #2: Use the provided Bootvar program
STEP #1 • Read input file • Create dummy variables • Keep only necessary variables • Run the analysis to get point estimates • Create dummy variables • For qualitative/categorical variables, we need to identify which value(s) we are interested in. This is done through the creation of a dummy variable • Dummy variable = 1 for characteristic of interest = 0 otherwise
STEP #1 • Create dummy variable: example #1 • During the past 12 months, how often did you drink alcoholic beverages? (ALC8_2) 1=Less than once a month, 2=Once a month, 3=2 to 3 times a month, 4=Once a week, 5=2 to 3 times a week, 6=4 to 6 times a week, 7=Every day • Interested in categories 1 to 4 (once a week or less) • DRINK = 1 if ALC8_2 is 1, 2, 3 or 4; DRINK = 0 otherwise
STEP #1 • Create dummy variable: example #2 • Diabetes (CCC8_1J): 1=Yes, 2=No, 6=Not applicable, 7=Don’t know, 9=Not stated • Sex (DHC8_SEX): 1=Male, 2=Female • Interested in “males having diabetes” • mdiab = 1 if CCC8_1J = 1 and DHC8_SEX = 1; mdiab = 0 otherwise
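The two dummy-variable examples can be sketched in Python as follows. The variable names and codes come from the slides; the four records and the function names are hypothetical. As on the slides, every value other than the ones of interest (including "not stated" and similar codes) is mapped to 0.

```python
def drink_dummy(alc8_2):
    """DRINK = 1 if ALC8_2 is 1, 2, 3 or 4 (once a week or less), else 0."""
    return 1 if alc8_2 in (1, 2, 3, 4) else 0


def mdiab_dummy(ccc8_1j, dhc8_sex):
    """mdiab = 1 for males (DHC8_SEX = 1) with diabetes (CCC8_1J = 1), else 0."""
    return 1 if ccc8_1j == 1 and dhc8_sex == 1 else 0


# Hypothetical respondent records (illustrative values only).
records = [
    {"ALC8_2": 2, "CCC8_1J": 1, "DHC8_SEX": 1},
    {"ALC8_2": 6, "CCC8_1J": 1, "DHC8_SEX": 2},
    {"ALC8_2": 4, "CCC8_1J": 2, "DHC8_SEX": 1},
    {"ALC8_2": 7, "CCC8_1J": 9, "DHC8_SEX": 1},
]

for rec in records:
    rec["DRINK"] = drink_dummy(rec["ALC8_2"])
    rec["MDIAB"] = mdiab_dummy(rec["CCC8_1J"], rec["DHC8_SEX"])

print([rec["DRINK"] for rec in records])
print([rec["MDIAB"] for rec in records])
```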
STEP #1 • Create dummy variable: example #2 • How to use the dummy variable to get an estimate • Total: In SAS: Proc freq; tables mdiab; weight wt56; run;
STEP #1 • Create dummy variable: example #2 • How to use the dummy variable to get an estimate • Ratio:
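A dummy variable turns both kinds of estimate into weighted sums. The Python sketch below assumes the ratio of interest is the weighted count of males with diabetes divided by the weighted count of males; the slide's own ratio example is not recoverable, so that choice of denominator is an assumption. The data values are hypothetical; `wt56` is the weight name used in the SAS example.

```python
def weighted_total(dummy, weights):
    """Estimated total: sum of the weights of respondents whose dummy equals 1."""
    return sum(d * w for d, w in zip(dummy, weights))


def weighted_ratio(numerator_dummy, denominator_dummy, weights):
    """Estimated ratio: weighted total of the numerator over that of the denominator."""
    return weighted_total(numerator_dummy, weights) / weighted_total(
        denominator_dummy, weights
    )


# Hypothetical data for five respondents.
mdiab = [1, 0, 0, 1, 0]  # 1 = male with diabetes
male = [1, 0, 1, 1, 1]   # 1 = male
wt56 = [100.0, 250.0, 150.0, 50.0, 200.0]  # survey weights

total = weighted_total(mdiab, wt56)
ratio = weighted_ratio(mdiab, male, wt56)

print(total)
print(ratio)
```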
STEP #1 • See example in SPSS
STEP #1 • Now your turn! (exercise #1) • Add asthma (CCC8_1C) to the table • Use existing program (step1.sas) and add SPSS codes to create a dummy variable for asthma; and then get the results
Step #2: Bootvar Program • Created by methodologists in 1997 (first used with NPHS cycle 2 data) • Version 1.0 • one single program (over 1,000 lines of code) • divided into 4 sections • users have to adapt the program to their requests; changes in 3 sections • SAS: bootvar.sas / bootvarf.sas • SPSS: beta version available only on request (bvr_b.sps)
Step #2: Bootvar Program • Version 2.0 • Justifications: • Compatible with SAS 8+ • Centralizes the code that users have to modify • Can be used with both NPHS and CCHS data files • Now consists of 2 programs • One contains the code users need to modify for their requests • The other contains the code users do not have to modify (macros)
Step #2: Bootvar Program • Version 2.0 • SAS version: • bootvare_v20.sas / bootvarf_v20.sas • macroe_v20.sas / macrof_v20.sas • SPSS version: • bootvare_v21.sps / bootvarf_v21.sps • macroe_v21.sps / macrof_v21.sps
STEP #2: Use of bootvar • Point estimates have already been obtained; let us now estimate the sampling variability of those estimates • Go through the bootvar program (bootvare_v21.sps)
STEP #2: Use of bootvar • See example in SPSS
STEP #2 • Now your turn! (exercise #2) • Compute confidence intervals for asthma • Use bootvare_v21.sps and adjust it to obtain desired results (use the already set up step2.sps program for this exercise)
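For readers who want to check a confidence interval by hand, the Python sketch below builds one from a point estimate and its bootstrap variance using the normal approximation (estimate ± 1.96 standard errors for 95%). Whether bootvar builds its intervals in exactly this way is an assumption here, and the numbers and the function name `normal_ci` are hypothetical.

```python
import math


def normal_ci(estimate, variance, z=1.96):
    """Normal-approximation confidence interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical point estimate and bootstrap variance (standard error = 50).
estimate = 1000.0
variance = 2500.0

low, high = normal_ci(estimate, variance)
print(low, high)
```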
Bootstrap - More • Why 500 bootstrap weights? • Size of file (for dissemination) • Time of computation (for an average PC) • Accuracy • Use more bootstrap weights? • Faster PC • Accuracy for small domains and more complex analysis methods
Bootstrap - More • Confidentiality revealed from the bootstrap weights
Bootstrap - More • Confidentiality revealed from the bootstrap weights (cont’d) • How can PUMF users estimate their exact variances? • Remote access • A dummy file is provided (same structure as the master files but containing dummy data) • Test your programs, then send them by e-mail • Research Data Centre • Regional Offices
Why Bootstrap? • Other techniques examined: Taylor, Jackknife • Taylor: • Need to define a linear equation for each statistic examined • Jackknife: • Cannot be disseminated because of confidentiality • Number of replicates depends on the number of strata (the large number of strata in 1996 makes dissemination impossible)
Why Bootstrap? • Bootstrap: • Handles survey designs with many strata more easily • Sets of 500 bootstrap weights can be distributed to data users • Recommended (over the jackknife) for estimating the variance of nonsmooth functions like quantiles, LICO • Reference: “Bootstrap Variance Estimation for the National Population Health Survey”, D. Yeo, H. Mantel, and T.-P. Liu. 1999, ASA Conference.
Bootvar: exercise #3 • Results for diabetes broken down by sex and province
Bootvar: Tricks • If you need to create a dummy variable for a characteristic based on many variables: • Example: Males with diabetes • First, create dummy variables for each individual variable (males, diabetes) • Then, create the dummy variable for the characteristic by multiplying the individual dummy variables
Bootvar: Tricks • Example: • Males = 1,0 (MALES) • Diabetes = 1,0 (DIAB) • Males having diabetes (MDIAB) = MALES * DIAB
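The multiplication trick works because a product of 0/1 dummies equals 1 only when every factor equals 1. A minimal Python sketch with hypothetical values for four respondents:

```python
# Hypothetical 0/1 dummy variables for four respondents.
males = [1, 0, 1, 0]  # MALES: 1 = male
diab = [1, 1, 0, 0]   # DIAB: 1 = has diabetes

# MDIAB = MALES * DIAB: equals 1 only for respondents who are male AND diabetic.
mdiab = [m * d for m, d in zip(males, diab)]

print(mdiab)
```

The same pattern extends to three or more characteristics by multiplying in further dummies.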
Bootvar: Tricks • Use the REGION parameter in bootvar to specify a “stratification” variable (doesn’t have to be a geographic variable!) • Example: REGION = sex will produce results by sex
CV look-up tables • What are they? • Approximate sampling variability tables • Produced for Canada, each province, and by age groups for Canada (also by Health Regions for cycle 2) • Useful only for categorical estimates • Totals & ratios only