
Population Health Surveys Bootstrap Hands-on Workshop

Population Health Surveys Bootstrap Hands-on Workshop. Yves Beland, CCHS senior methodologist; Larry MacNabb, CCHS dissemination manager. Developed by François Brisebois, CCHS/NPHS senior methodologist (francois.brisebois@statcan.ca).

Presentation Transcript


  1. Population Health Surveys Bootstrap Hands-on Workshop • Yves Beland, CCHS senior methodologist • Larry MacNabb, CCHS dissemination manager • Developed by François Brisebois, CCHS/NPHS senior methodologist (francois.brisebois@statcan.ca)

  2. Purpose of the presentation • Justify the use, understand the theory, and get familiar with the bootstrap technique • Dispel misconceptions about using the bootstrap technique for variance estimation

  3. Outline • Context • NPHS / CCHS complex survey design • Variance estimation / Bootstrap 101 • Data support / using the bootvar program • Why bootstrap? • CV lookup tables • Historical info about variance estimation for NPHS • Variance estimation with other software programs • Future for STC Health Surveys (re. bootstrap)

  4. Context • A data user is interested in producing some results 1- Compute an estimate (total, ratio, etc.) 2- Compute the precision of the estimate (variance, coefficient of variation (CV), etc.)

  5. Context 1- Compute an estimate • Is not a problem! • Use the provided survey weight with NPHS/CCHS files

  6. Context 1- Compute an estimate (cont’d) • Why use the survey weight? • Conclusion: ALWAYS USE THE WEIGHTS

  7. Context 2- Compute the precision of an estimate • Is a problem!!

  8. Context 2- Compute the precision of the estimate (cont’d) • Scaled weights: • Scaled weight = weight / mean(weight) • Used to overcome problems with the computation of the variance for some statistics in SAS • Reference: paper by G. Roberts et al.
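
The scaled-weight definition on this slide can be illustrated with a small sketch. The weights below are made up for illustration and are not NPHS/CCHS values.

```python
# Hypothetical survey weights (made-up values, not from NPHS/CCHS).
weights = [120.0, 80.0, 200.0, 100.0]

# Scaled weight = weight / mean(weight), as defined on the slide.
mean_weight = sum(weights) / len(weights)
scaled = [w / mean_weight for w in weights]

# Scaled weights average 1, so they add up to the sample size
# instead of the population size.
print(scaled)
print(sum(scaled))
```

Because the scaled weights add up to the number of respondents, software that treats the sum of weights as a sample size is no longer misled by population-sized totals. Scaling alone does not account for the survey design.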

  9. Context 2- Compute the precision of the estimate (cont’d) • Why such a difference? Answer: The complex survey design is the main cause (other factors to be discussed later) • Note: CCHS and NPHS have slightly different frames, but both are considered complex survey designs

  10. Complex survey design 1- Each province is divided into strata Province A Stratum #1 Stratum #2

  11. Complex survey design 2- Selection of clusters within each stratum Province A Stratum #1 Stratum #2

  12. Complex survey design 3- Selection of households within each cluster Province A Stratum #1 Stratum #2

  13. Complex survey design • How does the sample design affect the precision of estimates? • Stratification decreases variability (more precise) • Clustering increases variability (less precise) • Overall, the multistage design has the effect of increasing variability (less precise than SRS)

  14. Complex survey design • So why use a multistage cluster sample design anyway? • Pros: • Efficient for interviewing (less travel, less costly) • Better coverage of the entire region of interest • Cons: • Problems for variance estimation

  15. Bootstrap Method • Variance estimation with a complex multistage cluster sample design: • The exact formula for variance estimation is too complex; an approximate approach is required • NOTE: accounting for the design in variance estimation is as crucial as using the sampling weights when estimating a statistic

  16. Bootstrap Method • Approximate methods for variance estimation: • Taylor linearization • Re-sampling methods: • Balanced Repeated Replication • Jackknife • Bootstrap

  17. Bootstrap Method • Principle: • You want to estimate how precise your estimate of the number of smokers in Canada is • You could draw 500 totally new samples and compare the 500 estimates you would get from these samples. The variance of these 500 estimates would indicate the precision. • Problem: drawing 500 new samples is $$$ • Solution: Use your sample as a population, and take many smaller subsamples from it.

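  18. Bootstrap 101 • How Bootstrap weights are created (the secret is finally revealed!!!) • T = 40 • Var = $\sum_{i=1}^{500} (B_i - \bar{B})^2 / 499$, where $B_i$ is the estimate from the $i$-th set of bootstrap weights and $\bar{B}$ is the mean of the 500 estimates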
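
The variance formula on this slide (sum of squared deviations of the bootstrap estimates, divided by the number of replicates minus one) can be sketched as follows. The replicate values are made up, only five are used instead of 500, and centring on the mean of the replicates is an assumption about what B stands for on the slide.

```python
import math

def bootstrap_variance(replicate_estimates):
    """Return sum((B_i - B_bar)^2) / (B - 1) over the bootstrap estimates."""
    b = len(replicate_estimates)
    b_bar = sum(replicate_estimates) / b
    return sum((x - b_bar) ** 2 for x in replicate_estimates) / (b - 1)

# Made-up replicate estimates (5 instead of 500 to keep the example short).
estimates = [38.0, 42.0, 40.0, 41.0, 39.0]
variance = bootstrap_variance(estimates)

# Standard error and coefficient of variation, relative to a
# hypothetical full-sample point estimate.
point_estimate = 40.0
std_error = math.sqrt(variance)
cv_percent = 100 * std_error / point_estimate
print(variance, std_error, cv_percent)
```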

  19. Bootstrap 101 • How Bootstrap replicates are built (cont’d) • The “real” recipe 1- Subsampling of clusters (SRS) within strata 2- Apply (initial design) weight 3- Adjust weight for selection of n-1 among n 4- Apply all standard adjustments (nonresponse, share, etc.) 5- Post-stratification to population counts
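
The recipe above can be sketched roughly as follows. This is a simplified illustration, not the procedure used by STC methodologists: it assumes one weight per cluster, sampling of n-1 clusters with replacement, at least two clusters per stratum, and a single population count for post-stratification. Step 4 is left out entirely. All names and values are hypothetical.

```python
import random

def bootstrap_replicate(clusters, pop_total, rng):
    """Build one set of bootstrap weights (steps 1-3 and 5 of the recipe).

    clusters: list of (stratum, cluster_id, design_weight) tuples.
    pop_total: known population count used for post-stratification.
    """
    by_stratum = {}
    for stratum, cluster_id, weight in clusters:
        by_stratum.setdefault(stratum, []).append((cluster_id, weight))

    boot = {}
    for members in by_stratum.values():
        n = len(members)  # assumed to be at least 2
        # Step 1: draw n-1 clusters within the stratum (with replacement, assumed).
        draws = [rng.randrange(n) for _ in range(n - 1)]
        for idx, (cluster_id, weight) in enumerate(members):
            times_drawn = draws.count(idx)
            # Steps 2-3: start from the design weight, then adjust for
            # selecting n-1 clusters among n.
            boot[cluster_id] = weight * times_drawn * n / (n - 1)

    # Step 4 (nonresponse, sharing, etc.) is omitted in this sketch.
    # Step 5: post-stratify so the weights add up to the population count.
    total = sum(boot.values())
    return {cid: w * pop_total / total for cid, w in boot.items()}

# Made-up design: two strata with three and two clusters.
clusters = [("S1", "a", 100.0), ("S1", "b", 150.0), ("S1", "c", 120.0),
            ("S2", "d", 90.0), ("S2", "e", 110.0)]
rng = random.Random(1)
replicate = bootstrap_replicate(clusters, pop_total=570.0, rng=rng)
print(replicate)
```

Repeating the call 500 times with fresh draws would give 500 sets of weights. Clusters that are not drawn in a given replicate get a weight of zero for that replicate.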

  20. Bootstrap 101 • How Bootstrap replicates are built (cont’d) • The bootstrap method is intended to mimic the approach used for the sampling and weighting processes • Be careful: some software programs say they include the bootstrap technique; what they really do is skip steps #4 and #5 and use the final weight directly in step #2

  21. Bootstrap 101 • STC methodologists create the bootstrap weight files. • Can you create your own bootstrap weight file? No. • Why? Because to do so you need to know: • The design information, i.e. strata, clusters (to generate the bootstrap subsamples) • The definition of all adjustment classes (including post-stratification)

  22. Bootstrap 101 • The bootstrap weight files are: • Available for all files (except PUMF, for confidentiality reasons) • Distributed with the data files, as separate files • The bootstrap weight files contain: • IDs (REALUKEY/SAMPLEID, PERSONID) • Final sampling weight (WTxx) • 500 Bootstrap weights (BSW1--BSW500)
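
As a rough illustration of how such a file is used, the sketch below matches a tiny analysis file to a bootstrap weight file on the ID and computes one total per set of weights. The record layout, the values, and the use of three bootstrap weights instead of 500 are assumptions made for illustration; the bootvar macros, not this code, are the supported way to do this.

```python
# Hypothetical analysis file: ID -> dummy variable (1 = has the characteristic).
analysis = {101: 1, 102: 0, 103: 1, 104: 0}

# Hypothetical bootstrap weight file: ID -> (final weight, bootstrap weights).
# Real files carry 500 bootstrap weights (BSW1--BSW500); three are shown here.
bsw_file = {
    101: (100.0, [0.0, 150.0, 110.0]),
    102: (200.0, [210.0, 190.0, 0.0]),
    103: (150.0, [160.0, 0.0, 300.0]),
    104: (50.0, [130.0, 160.0, 90.0]),
}

def weighted_total(dummy_by_id, weight_by_id):
    """Sum of the weights of records whose dummy variable equals 1."""
    return sum(weight_by_id[i] * d for i, d in dummy_by_id.items())

# Point estimate: uses the final sampling weight.
final_weights = {i: rec[0] for i, rec in bsw_file.items()}
point_estimate = weighted_total(analysis, final_weights)

# One estimate per bootstrap weight column.
n_replicates = 3
replicate_totals = [
    weighted_total(analysis, {i: rec[1][r] for i, rec in bsw_file.items()})
    for r in range(n_replicates)
]
print(point_estimate, replicate_totals)
```

The spread of the replicate totals around their mean is what feeds the bootstrap variance.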

  23. Bootstrap - Support • NPHS/CCHS provides data users with SAS & SPSS macro programs to compute bootstrap variances • Macros simplify the computation of bootstrap variance estimates for totals, ratios, differences of ratios, regressions (linear and logistic), and basic generalized linear models • Come with documentation & examples • Available in French and English • Referred to as “bootvar”

  24. Example: Step by Step • Let’s get to work! • Goal: Interested in estimating the number of diabetics (total) • NPHS 1998-99 Dummy file (see information sheet)

  25. Example: Step by Step • STEP #1 Create your “analysis data file” • Read NPHS/CCHS data file • Prepare dummy variables necessary for your analysis • Keep only necessary variables (include geography desired) • Run the analysis to get point estimates only (not necessary but recommended) • STEP #2 Compute your variances with bootvar • Location of INPUT files: your “analysis data file” and the bootstrap weights file • Geography desired • Number of bootstrap weights to use • Specify the desired analysis: totals, ratios, differences of ratios; regression (linear & logit); generalized linear modeling

  26. Example: Step by Step • Step #1: On your own (but can use the examples provided as a starting point) • Step #2: Use the provided Bootvar program

  27. STEP #1 • Read input file • Create dummy variables • Keep only necessary variables • Run the analysis to get point estimates • Create dummy variables: • For qualitative/categorical variables, we need to identify which value(s) we are interested in. This is done through the creation of a dummy variable • Dummy variable = 1 for the characteristic of interest, 0 otherwise

  28. STEP #1 • Create dummy variable: example #1 • During the past 12 months, how often did you drink alcoholic beverages? (ALC8_2) 1=Less than once a month, 2=Once a month, 3=2 to 3 times a month, 4=Once a week, 5=2 to 3 times a week, 6=4 to 6 times a week, 7=Every day • Interested in categories 1 to 4 (once a week or less) • DRINK = 1 if ALC8_2 is 1, 2, 3 or 4; DRINK = 0 otherwise
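
The recode on this slide can be sketched as follows. The function name and the sample codes are hypothetical; the rule itself (1 for categories 1 to 4, 0 otherwise) is the one stated on the slide.

```python
def drink_dummy(alc8_2):
    """Return 1 for ALC8_2 codes 1-4 (once a week or less), 0 otherwise."""
    return 1 if alc8_2 in (1, 2, 3, 4) else 0

# Made-up responses to ALC8_2.
codes = [1, 4, 5, 7, 2]
drink = [drink_dummy(c) for c in codes]
print(drink)
```

Under the "0 otherwise" rule, every code outside 1 to 4 ends up in the 0 group, so it is worth checking which codes actually occur in the file before relying on the dummy.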

  29. STEP #1 • Create dummy variable: example #2 • Diabetes (CCC8_1J): 1=Yes, 2=No, 6=Not applicable, 7=Don’t know, 9=Not stated • Sex (DHC8_SEX): 1=Male, 2=Female • Interested in “males having diabetes” • mdiab = 1 if CCC8_1J = 1 and DHC8_SEX = 1; mdiab = 0 otherwise

  30. STEP #1 • Create dummy variable: example #2 • How to use the dummy variable to get an estimate • Total: In SAS: Proc freq; tables mdiab; weight wt56; run;

  31. STEP #1 • Create dummy variable: example #2 • How to use the dummy variable to get an estimate • Ratio:

  32. STEP #1 • See example in SPSS

  33. STEP #1 • Now your turn! (exercise #1) • Add asthma (CCC8_1C) to the table • Use existing program (step1.sas) and add SPSS codes to create a dummy variable for asthma; and then get the results

  34. Step #2: Bootvar Program • Created by methodologists in 1997 (first used with NPHS cycle 2 data) • Version 1.0 • One single program (over 1,000 lines of code) • Divided into 4 sections • Users have to adapt the program to their requests; changes in 3 sections • SAS: bootvar.sas / bootvarf.sas • SPSS: beta version available only on request (bvr_b.sps)

  35. Step #2: Bootvar Program • Version 2.0 • Justifications: • Compatible with SAS 8+ • Centralizes the code that users have to modify • Can be used with both NPHS and CCHS data files • Now consists of 2 programs: • One contains the code users need to modify for their requests • The other contains the code users do not have to modify (macros)

  36. Step #2: Bootvar Program • Version 2.0 • SAS version: • bootvare_v20.sas / bootvarf_v20.sas • macroe_v20.sas / macrof_v20.sas • SPSS version: • bootvare_v21.sps / bootvarf_v21.sps • macroe_v21.sps / macrof_v21.sps

  37. STEP #2: Use of bootvar • Point estimates have already been obtained; let us now estimate the sampling variability of those estimates → Go through the bootvar program (bootvare_v21.sps)

  38. STEP #2: Use of bootvar • See example in SPSS

  39. STEP #2 • Now your turn! (exercise #2) • Compute confidence intervals for asthma • Use bootvare_v21.sps and adjust it to obtain the desired results (use the already set up step2.sps program for this exercise)
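
For orientation, a confidence interval can be derived from a point estimate and its bootstrap variance as sketched below. The normal approximation with z = 1.96 (a 95% interval) and the input values are assumptions for illustration; the workshop material does not state how bootvar computes its intervals.

```python
import math

def normal_ci(estimate, variance, z=1.96):
    """Normal-approximation interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# Made-up point estimate and bootstrap variance.
lower, upper = normal_ci(estimate=2500.0, variance=10000.0)
print(lower, upper)
```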

  40. Bootstrap - More • Why 500 bootstrap weights? • Size of file (for dissemination) • Time of computation (for an average PC) • Accuracy • Use more bootstrap weights? • Faster PC • Accuracy for small domains and more complex analysis methods

  41. Bootstrap - More • Confidentiality revealed from the bootstrap weights

  42. Bootstrap - More • Confidentiality revealed from the bootstrap weights (cont’d) • How can PUMF users estimate their exact variances? • Remote access • A dummy file is provided (same structure as the master files but containing dummy data) • Users test their programs on it and send them by e-mail • Research Data Centre • Regional Offices

  43. Why Bootstrap? • Other techniques examined: Taylor, Jackknife • Taylor: • Need to define a linear equation for each statistic examined • Jackknife: • Cannot be disseminated because of confidentiality • Number of replicates depends on the number of strata (the large number of strata in 1996 made dissemination impossible)

  44. Why Bootstrap? • Bootstrap: • Handles survey designs with many strata more easily • Sets of 500 bootstrap weights can be distributed to data users • Recommended (over the jackknife) for estimating the variance of nonsmooth functions like quantiles, LICO • Reference: D. Yeo, H. Mantel, and T.-P. Liu, “Bootstrap Variance Estimation for the National Population Health Survey”, ASA Conference, 1999.

  45. Bootvar: exercise #3 • Results for diabetes broken down by sex and province

  46. Bootvar: Tricks • If you need to create a dummy variable for a characteristic based on many variables: • Example: Males with diabetes • First, create dummy variables for each individual variable (males, diabetes) • Then, create the dummy variable for the characteristic by multiplying the individual dummy variables

  47. Bootvar: Tricks • Example: • Males = 1,0 (MALES) • Diabetes = 1,0 (DIAB) • Males having diabetes: MDIAB = MALES * DIAB
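
The multiplication trick works because the product of two 0/1 variables is 1 only when both are 1. A short sketch with made-up records:

```python
# Made-up 0/1 dummy variables for four respondents.
males = [1, 1, 0, 0]
diab = [1, 0, 1, 0]

# MDIAB = MALES * DIAB: 1 only for males who have diabetes.
mdiab = [m * d for m, d in zip(males, diab)]
print(mdiab)
```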

  48. Bootvar: Tricks • Use the REGION parameter in bootvar to specify a “stratification” variable (doesn’t have to be a geographic variable!) • Example: REGION = sex will produce results by sex
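
Conceptually, specifying a grouping variable amounts to computing the same estimate separately for each of its values. The sketch below shows that idea for weighted totals by sex. The records are made up, and this is not how bootvar implements the REGION parameter.

```python
from collections import defaultdict

# Made-up records: (sex, dummy, weight), with sex coded 1=Male, 2=Female.
records = [(1, 1, 100.0), (1, 0, 200.0), (2, 1, 150.0), (2, 1, 50.0)]

# Weighted total of the dummy variable within each value of the grouping variable.
totals_by_sex = defaultdict(float)
for sex, dummy, weight in records:
    totals_by_sex[sex] += dummy * weight
print(dict(totals_by_sex))
```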

  49. CV look-up tables • What are they? • Approximate sampling variability tables • Produced for Canada, each province, and by age groups for Canada (also by Health Regions for cycle 2) • Useful only for categorical estimates • Totals & ratios only

  50. CV look-up tables
