Learn how to handle missing data in census surveys, including reasons for missing data, types of missing data, and treatment methods. Options include doing nothing, using complete records, weighting methods, imputation techniques, and probability estimates.
Treatment of Missing Data (Pres. 8)
United Nations Workshop on the 2010 World Programme on Population and Housing Censuses: Census Evaluation and Post Enumeration Surveys, Amman, Jordan, 21-24 November, 2010
Why are some data missed?
• Refusals
• Item non-response
• Time constraints
• Paucity of resources
• Lax enumerators
• Units not found
• Insufficient data for matching, etc.
Four types of missing data
• Unit missing data: household non-interview
• Item missing data: some information for a household or person is available and some is not
• Unresolved match or residence status: match or residence status in the P-sample could not be determined for PES estimation
• Unresolved enumeration status: correct or erroneous enumeration status in the E-sample could not be determined for PES estimation
How to treat missing data?
A. Do nothing
B. Use only the complete records
C. Use a weighting method
D. Impute a missing value
E. Estimate a probability for unresolved status
A. Doing nothing
• If missing data are very few, they may have no significant effect on how the data are used and can be ignored
• Requires working with an incomplete dataset that still contains missing values
B. Use only the complete records
• Easy but risky option. The subset of respondents may not be representative of the total population under study
• Estimates may be seriously biased, unless non-response does not depend on any of the variables of interest
• This option can be envisaged only for a rapid descriptive analysis
C. Use a weighting method
• Unit non-response:
  • Increase the respondents' weights to compensate for the non-respondents. The objective is to produce roughly unbiased estimates
• Item non-response:
  • Reweighting is possible, but its main disadvantage is that the same record ends up with different weights (one for each variable). That is why it is generally not used for item non-response
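The unit non-response adjustment can be sketched in a few lines. The slide only says that respondents' weights are increased; the weighting-class approach below (inflating weights within cells so that respondents carry the weight of all sampled units in their cell) is one common way to do that and is an assumption here, as are the function name and the `cell`, `weight` and `responded` keys.

```python
from collections import defaultdict

def adjust_for_unit_nonresponse(units):
    """Return respondents only, with weights inflated within each cell.

    `units` is a list of dicts with hypothetical keys:
    'cell' (weighting class), 'weight' (design weight), 'responded' (bool).
    """
    total_weight = defaultdict(float)       # all sampled units, per cell
    respondent_weight = defaultdict(float)  # responding units, per cell
    for u in units:
        total_weight[u["cell"]] += u["weight"]
        if u["responded"]:
            respondent_weight[u["cell"]] += u["weight"]

    adjusted = []
    for u in units:
        if not u["responded"]:
            continue  # non-respondents drop out; their weight is redistributed
        factor = total_weight[u["cell"]] / respondent_weight[u["cell"]]
        adjusted.append({**u, "weight": u["weight"] * factor})
    return adjusted

sample = [
    {"id": 1, "cell": "urban", "weight": 10.0, "responded": True},
    {"id": 2, "cell": "urban", "weight": 10.0, "responded": True},
    {"id": 3, "cell": "urban", "weight": 10.0, "responded": True},
    {"id": 4, "cell": "urban", "weight": 10.0, "responded": False},
]
adjusted_sample = adjust_for_unit_nonresponse(sample)
```

After the adjustment, the respondents' weights in each cell sum to the weight of all sampled units in that cell. A cell with no respondents at all contributes nothing and would need to be merged with another cell before adjusting.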
D. Imputation
• Imputation changes one or more responses or missing values in one or several records so that the resulting records are internally coherent
• Before using any imputation method, the best strategy is to start with a manual study of the responses; imputation can then handle the remaining unresolved edit failures
• Two methods of imputation: Cold Deck and Hot Deck
• Cold Deck Imputation:
  • Used mainly for missing or unknown values (not for inconsistent/invalid values)
  • Values are imputed on a proportional basis from a distribution of valid responses (e.g., from the previous census)
  • In doing so, cold deck draws values from a fixed (but possibly outdated) distribution of values
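A minimal sketch of cold deck imputation as described above: missing values are drawn in proportion to a fixed distribution of valid responses. The function name, the `tenure` item, the example proportions and the `_imputed` flag suffix are all hypothetical, chosen only for illustration.

```python
import random

def cold_deck_impute(records, item, distribution, seed=0):
    """Fill missing `item` values by drawing from a fixed distribution.

    `distribution` maps each valid value to its share (e.g. taken from a
    previous census). It is never updated during the run, which is what
    makes this a cold deck.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    values = list(distribution)
    weights = [distribution[v] for v in values]
    result = []
    for rec in records:
        rec = dict(rec)  # copy, so the original records stay untouched
        if rec.get(item) is None:
            rec[item] = rng.choices(values, weights=weights, k=1)[0]
            rec[item + "_imputed"] = True  # flag the imputed item
        result.append(rec)
    return result

previous_census_shares = {"owner": 0.7, "renter": 0.3}  # illustrative only
households = [
    {"id": 1, "tenure": "owner"},
    {"id": 2, "tenure": None},
]
completed = cold_deck_impute(households, "tenure", previous_census_shares)
```

Because the distribution is fixed, two runs with the same seed give the same imputations, but the imputed values reflect the older source rather than the current data.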
D. Imputation (contd.)
• Hot Deck or Dynamic Imputation:
  • Used for both missing data and inconsistent/invalid items
  • Uses one or more variables to estimate the likely response, based on data about individuals with similar characteristics
  • The “donor set” (or imputation matrix) is constantly updated; therefore, imputations change dynamically while all the records are being edited
  • Thus, hot deck draws from a distribution that changes with each update and eventually “approaches” the distribution of the current data set
  • Caution: if two or more items in a record have unknown values, hot deck may use a different “donor” for each missing value; in this case, it is preferable to use the same donor for all of them
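A minimal sketch of a sequential hot deck, under stated assumptions: records are processed in file order, the imputation matrix is keyed by a few matching variables, each complete record replaces the stored donor for its cell, and a cold-deck start value is used when a cell has no donor yet. The function name, the variable names and the `imputed_items` flag are hypothetical. All missing items in a record are taken from the same donor, as the caution above recommends.

```python
def hot_deck_impute(records, items, key_vars, start_values):
    """Sequential hot deck over `records` (a list of dicts).

    items        : items that may need imputation
    key_vars     : variables that define "similar" individuals (the cell)
    start_values : fallback donor used before a cell has seen a complete record
    """
    donors = {}  # imputation matrix: cell -> values of the latest complete record
    result = []
    for rec in records:
        rec = dict(rec)
        cell = tuple(rec[k] for k in key_vars)
        missing = [i for i in items if rec.get(i) is None]
        if missing:
            donor = donors.get(cell, start_values)
            for i in missing:
                rec[i] = donor[i]        # one donor for every missing item
            rec["imputed_items"] = missing  # flag what was imputed
        else:
            # A complete record updates the matrix, so later imputations
            # follow the current data rather than a fixed distribution.
            donors[cell] = {i: rec[i] for i in items}
        result.append(rec)
    return result

persons = [
    {"sex": "F", "age_group": "20-29", "marital": "married", "occupation": "teacher"},
    {"sex": "F", "age_group": "20-29", "marital": None, "occupation": None},
    {"sex": "M", "age_group": "30-39", "marital": None, "occupation": "farmer"},
]
edited = hot_deck_impute(
    persons,
    items=["marital", "occupation"],
    key_vars=["sex", "age_group"],
    start_values={"marital": "single", "occupation": "unknown"},
)
```

In this sketch only fully reported records become donors, so imputed values are never passed on to later records. A production edit system would typically also check the donor's values against the edit rules before accepting them.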
E. Probability for unresolved status
• Unresolved match or residence status in P-sample:
  • Estimate probabilities of match (residence) status
  • Form cells/groups to estimate probabilities
  • Each cell should be homogeneous with respect to the probability to be estimated
  • Probabilities should differ (be heterogeneous) between cells/groups
  • Use reasons for field follow-up to form cells
• Unresolved enumeration status in E-sample:
  • Estimate probabilities of correct enumeration
  • Form cells/groups to estimate probabilities
  • Each cell should be homogeneous with respect to the probability to be estimated
  • Probabilities should differ (be heterogeneous) between cells/groups
  • Use reasons for field follow-up to form cells
E. Probability for unresolved status (contd.)
• Example:
  • Estimate probability of match (residence) status for a cell
  • Total cases sent for field follow-up = 100
  • Number of cases resolved after field follow-up = 80
  • Number of cases still unresolved = 100 - 80 = 20
  • Number of matched cases out of 80 = 48
  • Number of nonmatched cases out of 80 = 80 - 48 = 32
  • Probability of match for an unresolved case = 48/80 = 0.60
  • Probability of nonmatch for an unresolved case = 32/80 = 0.40
• Unresolved enumeration status in E-sample:
  • Estimate probabilities of correct enumeration for a cell
  • Same methodology as described for match status
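The arithmetic of the example can be checked with a short sketch. The function names are hypothetical. The last step, counting each unresolved case as a fractional match equal to its cell's probability, is an assumption about how the probabilities would feed into estimation and is not stated on the slide.

```python
def status_probabilities(resolved, matched):
    """Probabilities of match and nonmatch for one cell,
    estimated from the cases resolved in field follow-up."""
    if resolved <= 0:
        raise ValueError("the cell needs at least one resolved case")
    if not 0 <= matched <= resolved:
        raise ValueError("matched must lie between 0 and resolved")
    p_match = matched / resolved
    p_nonmatch = (resolved - matched) / resolved
    return p_match, p_nonmatch

def expected_matches(sent, resolved, matched):
    """Matched cases plus the fractional contribution of unresolved cases
    (assumed use of the probability, for illustration)."""
    p_match, _ = status_probabilities(resolved, matched)
    unresolved = sent - resolved
    return matched + unresolved * p_match

# Figures from the example: 100 cases sent, 80 resolved, 48 matched.
p_match, p_nonmatch = status_probabilities(resolved=80, matched=48)
total_matches = expected_matches(sent=100, resolved=80, matched=48)
```

With the figures above, the 20 unresolved cases each carry a match probability of 0.60, so under the stated assumption they add 12 expected matches to the 48 observed ones. The same functions apply to correct enumeration status in the E-sample, with "matched" read as "correctly enumerated".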
Summary results of missing data operations
• Essential to evaluation, process planning and management:
  i) number of cases of each type of error;
  ii) unit non-response rates;
  iii) non-response rates for each item;
  iv) imputation rates for each item;
  v) unresolved status by type, ….
• It is important to generate an edit trail showing all data changes and substituted values, with their tallies
• If the original value of a data item is changed in any way, a flag should be added to each item that is changed or imputed
• This information is critical for planning future censuses; e.g., as a means to investigate the age threshold below which a female with “children ever born” triggers a query edit, and to decide whether the threshold should be modified for future rounds
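Item-level rates of this kind are simple tallies. The sketch below assumes records are dicts in which a missing item is stored as `None`, and that imputed items are listed under a hypothetical `imputed_items` key; neither convention comes from the slides.

```python
def item_nonresponse_rates(records, items):
    """Share of records with a missing value, per item (before imputation)."""
    n = len(records)
    if n == 0:
        return {i: 0.0 for i in items}
    return {
        i: sum(1 for r in records if r.get(i) is None) / n
        for i in items
    }

def item_imputation_rates(records, items):
    """Share of records whose value was imputed, per item (after imputation)."""
    n = len(records)
    if n == 0:
        return {i: 0.0 for i in items}
    return {
        i: sum(1 for r in records if i in r.get("imputed_items", [])) / n
        for i in items
    }

raw = [
    {"age": 34, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 51, "sex": None},
    {"age": None, "sex": "F"},
]
nonresponse = item_nonresponse_rates(raw, ["age", "sex"])
```

Here age is missing in 2 of 4 records and sex in 1 of 4, giving item non-response rates of 0.5 and 0.25.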
A useful reference
• Handbook on Population and Housing Census Editing, Rev. 1
• Available on the UNSD website and currently being printed
Thank You!