Research on Improvements to Current SIPP Imputation Methods

Research on Improvements to Current SIPP Imputation Methods. ASA-SRM SIPP Working Group September 16, 2008 Martha Stinson. Census Imputation Research Plan. Few changes made to actual production imputation methods in many years

Presentation Transcript


  1. Research on Improvements to Current SIPP Imputation Methods
     ASA-SRM SIPP Working Group
     September 16, 2008
     Martha Stinson

  2. Census Imputation Research Plan
     • Few changes made to actual production imputation methods in many years
     • With redesign of the SIPP, this is an opportunity to consider what changes might be made
     • New committee formed with members from content, data processing, sampling, and statistical methodology divisions
     • Incremental approach: test new methods and consider short list of variables that might be substantially improved

  3. Proposed Improvements
     • Model-based approach
     • Use administrative data to mitigate problems caused when survey data are not “missing at random”
     • Multiple imputation

  4. Model-based Approach
     • Hot-deck depends on a donor matrix with reasonable cell sizes
     • Small cells must sometimes be collapsed
     • Collapsing cells creates a more heterogeneous group of donors
     • Hot-deck can’t take account of variables that are dropped in order to combine cells

  5. Model-based Approach: Research
     • Consider an imputation method that uses a linear regression to impute missing values
     • Stratify sample by set of characteristics, run regressions for each sub-group that is large enough
     • Sub-groups that are too small are combined
     • Variables that are dropped from stratification list are added as explanatory variables in the regression

  6. Example
     • Earnings imputation
     • Stratify by age, gender, race, education, industry, and disability
     • Including disability may cause some small cells
     • Perhaps combine sub-groups of disabled and not-disabled white women in their fifties
     • For this sub-group, include disability status as explanatory variable in regression of earnings on SIPP characteristics
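The approach on slides 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not the Census Bureau's method: the record layout (`disabled`, `educ_years`, `earnings`), the single covariate, the `min_cell` threshold of 30, and the function names are all assumptions made for the example. It imputes the regression prediction only; a production method would typically also add a residual draw.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def design(row, with_disability):
    """Regressors: intercept, education, and (only in a collapsed cell) disability."""
    x = [1.0, float(row["educ_years"])]
    if with_disability:
        x.append(float(row["disabled"]))
    return x


def impute_stratum(rows, min_cell=30):
    """Impute missing earnings within one stratum (e.g. one age/gender/race cell).

    rows: list of dicts with 'disabled' (0/1), 'educ_years', and 'earnings'
    (None when missing). min_cell is a hypothetical minimum donor count.
    """
    by_dis = {0: [], 1: []}
    for r in rows:
        by_dis[r["disabled"]].append(r)
    n_donors = {d: sum(r["earnings"] is not None for r in g)
                for d, g in by_dis.items()}
    if min(n_donors.values()) >= min_cell:
        # Both sub-groups are large enough: run separate regressions.
        groups = [(by_dis[0], False), (by_dis[1], False)]
    else:
        # Collapse the cells; the dropped variable becomes a regressor.
        groups = [(rows, True)]
    for group, with_dis in groups:
        obs = [r for r in group if r["earnings"] is not None]
        beta = ols([design(r, with_dis) for r in obs],
                   [r["earnings"] for r in obs])
        for r in group:
            if r["earnings"] is None:
                x = design(r, with_dis)
                r["imputed"] = sum(b * v for b, v in zip(beta, x))
    return rows
```

In the collapsed case the disability coefficient is estimated from the pooled donors, so the information that a hot-deck would lose by dropping the variable is retained.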

  7. Data Not “Missing At Random”
     • All imputation methods that use survey data exclusively are built on the assumption that the relationships between survey variables are the same for everyone, regardless of missing data
     • Assume relationship between X1, X2, X3 and Y can be estimated
     • Assume if Y is missing, X1, X2, and X3 are good predictors
     • However if the relationship between Y and X1, X2, X3 is different when Y is missing, the imputation will be flawed

  8. Data Not “Missing At Random”: Research
     • We can evaluate the magnitude of this problem and mitigate the impact on imputation using administrative data
     • Information from an outside source can help account for unobservable (in the survey) differences between people

  9. Example: 2004 SIPP panel
     • 2004 annual earnings at two main jobs
     • Earnings at each job are imputed on a monthly basis
     • Sum across jobs and then across months to get annual earnings
     • Create count of number of imputed months in the year (ranges from 0 to 12)
     • If either job has imputed earnings, count the full month as imputed

  10. Example: 2004 SIPP panel (cont.)
      • Split SIPP respondents into groups
        1. No months of imputed or missing data
        2. 1-4 months of imputed data (no missing)
        3. 5-8 months of imputed data (no missing)
        4. 9-12 months of imputed data (no missing)
      • Match each respondent to earnings reported on W-2 records, summed across all employers
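The bookkeeping on slides 9 and 10 can be sketched as below. The monthly record layout (`job1`, `job2`, `imp1`, `imp2`) and the function names are hypothetical, chosen only to illustrate the rules stated on the slides.

```python
def annual_summary(months):
    """Return (annual earnings, number of imputed months) for one respondent.

    months: list of 12 dicts, one per month, with keys
      job1, job2 : monthly earnings at each of the two main jobs
      imp1, imp2 : True if that job's earnings were imputed that month
    """
    # Sum across jobs within each month, then across months.
    annual = sum(m["job1"] + m["job2"] for m in months)
    # A month counts as imputed if either job's earnings were imputed.
    n_imputed = sum(1 for m in months if m["imp1"] or m["imp2"])
    return annual, n_imputed


def imputation_group(n_imputed):
    """Map a count of imputed months (0-12) to groups 1-4 from slide 10."""
    if not 0 <= n_imputed <= 12:
        raise ValueError("n_imputed must be between 0 and 12")
    if n_imputed == 0:
        return 1
    if n_imputed <= 4:
        return 2
    if n_imputed <= 8:
        return 3
    return 4
```

Counting the whole month as imputed when either job is imputed means the groups are ordered by how much of the annual total could be affected by imputation, not by the exact dollar share.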

  11. Example: 2004 SIPP panel (cont.)
      • If earnings are missing at random, relationship between admin. earnings and other SIPP variables should be the same for all four groups
      • Test
        • regress admin. earnings on SIPP demographic variables separately for each group
        • predict earnings for each group using each set of coefficients (four predicted values per group)
        • compare each prediction to actual admin. earnings
        • if coefficients are good predictors, difference should be zero on average
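The test on slide 11 can be illustrated with a simplified sketch. As an assumption made to keep the example short, a single numeric covariate stands in for the full set of SIPP demographic variables, and the function names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


def cross_prediction_errors(groups):
    """Mean (actual - predicted) earnings for every coefficient/data pairing.

    groups: dict mapping a group label to (x, y), where x is the covariate
    and y is administrative earnings for the respondents in that group.
    Returns a dict keyed by (coefficient_group, data_group).
    """
    coefs = {g: fit_line(x, y) for g, (x, y) in groups.items()}
    errors = {}
    for coef_group, (a, b) in coefs.items():
        for data_group, (x, y) in groups.items():
            errors[(coef_group, data_group)] = mean(
                yi - (a + b * xi) for xi, yi in zip(x, y))
    return errors
```

The own-group entries are zero by construction, because a regression with an intercept has mean-zero residuals. The informative entries are the off-diagonal ones: if coefficients estimated on respondents with no imputed months systematically over- or under-predict administrative earnings for the heavily imputed groups, the missing-at-random assumption is in doubt.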

  12. Example: Results

  13. Example: Results

  14. Multiple Imputation
      • Since the 1970s, Donald Rubin has argued that imputation adds variability to user-calculated statistics
      • Traditional methods impute only once
      • User has no way to account for variability
      • Multiple imputation allows the user to calculate variance that includes a piece due to imputation

  15. Multiple Imputation: Example
      • How might variance estimates change when switching from single to multiple imputation?
      • Consider random variable X with mean of 0.5
      • Generate 1000 random samples by taking draws for 80 people
      • 20 people have missing data for X

  16. Multiple Imputation: Example (cont.)
      • Impute missing data using 2 methods:
        • single implicate/hot deck: every observed value has equal prob. of being donor
        • multiple imputation/Bayesian Bootstrap: prob. of being donor changes across implicates but centered around 1/n; create 32 implicates
      • Calculate mean and 95% confidence interval for all 1000 random samples
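The two donor-selection schemes on slide 16 can be sketched as follows. This is an illustrative reading of the slide, not the code used for the presentation: the function names and the seed are assumptions, and the Bayesian bootstrap is implemented by drawing donor probabilities from a flat Dirichlet distribution (normalized Exp(1) draws), so each donor's probability has mean 1/n but varies from one implicate to the next.

```python
import random


def hot_deck(observed, n_missing, rng):
    """Single implicate: every observed value is equally likely to be the donor."""
    return [rng.choice(observed) for _ in range(n_missing)]


def bayesian_bootstrap(observed, n_missing, rng):
    """One implicate: draw donor probabilities, then draw donors with them."""
    # Normalized Exp(1) draws are Dirichlet(1, ..., 1) distributed,
    # so each donor probability is centered on 1/n.
    g = [rng.expovariate(1.0) for _ in observed]
    total = sum(g)
    probs = [v / total for v in g]
    return rng.choices(observed, weights=probs, k=n_missing)


def make_implicates(observed, n_missing, m=32, seed=0):
    """Create m implicates, each with a fresh set of donor probabilities."""
    rng = random.Random(seed)
    return [bayesian_bootstrap(observed, n_missing, rng) for _ in range(m)]
```

Redrawing the donor probabilities for each implicate is what lets the spread across implicates reflect uncertainty about the distribution of the observed values, which a single equal-probability hot-deck draw cannot convey.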

  17. Multiple Imputation: Example (cont.)
      • Case of 1 implicate
        • 95% confidence interval contains the true value 88% of the time
      • Case of multiple implicates
        • Calculate variance of mean using Rubin formula
        • 95% confidence interval contains the true value 96.5% of the time
      • What does this mean?
        • Statistical hypotheses will be rejected too often using single imputation methods because variance estimates are too small
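For reference, Rubin's combining rule for m implicates gives the total variance as T = U-bar + (1 + 1/m) * B, where U-bar is the average of the within-implicate variance estimates and B is the variance of the point estimates across implicates. A minimal sketch, with a hypothetical function name:

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine per-implicate results using Rubin's rules.

    estimates : point estimate (e.g. the sample mean) from each implicate
    variances : estimated sampling variance of that estimate, per implicate
    Returns (combined estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two implicates")
    q_bar = mean(estimates)       # combined point estimate
    u_bar = mean(variances)       # within-imputation variance
    b = variance(estimates)       # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total
```

With a single implicate there is no way to estimate B, so the reported variance is only the within-imputation piece. That omission is consistent with the under-coverage reported on the slide for the single-implicate case.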

  18. Examples of Census Research on Imputation Methods
      • Generalized Additive Model (GAM)
      • Predictive Mean Matching
      • Bayesian Bootstrap
      • Sequential Regression Multiple Imputation (SRMI)

  19. Questions for Panel Discussion
      • General thoughts and suggestions on model-based imputation?
      • Suggest specific models?
      • Which variables should we prioritize?
      • Would SIPP user community be willing/able to handle multiple implicates?
