
Data Management Approaches: handling, storage, & missingness

Data Management Approaches: handling, storage, & missingness. Renée El-Gabalawy, PhD. July 24, 2019. Department of Psychiatry Summer School.


Presentation Transcript


  1. Data Management Approaches: handling, storage, & missingness Renée El-Gabalawy, PhD. July 24, 2019. Department of Psychiatry Summer School

  2. Study Example. You are interested in understanding the relationship between cognitive functioning before surgery, psychiatric status, and functional brain abnormalities on post-operative delirium. Prior to running a large trial, you complete a feasibility study. Timeline: cognitive testing, psychiatric self-report, and fMRI before surgery; post-operative delirium (POD) assessment after.

  3. Creating a Successful Data Plan STEP 1: Create specific data collection plan • Aim to recruit 15 pts (no power analysis because feasibility study) from pre-anesthesia clinic who are undergoing high-risk surgery • Provide informed consent, have patients fill out 5 self-report measures, book fMRI time (within 2 weeks prior to surgery) • Patients undergo neuropsychological assessment (1 hour) and fMRI (1 hour) • Perioperative monitoring and management? • Post-operative days 0 thru 5 delirium assessments

  4. Creating a Successful Data Plan STEP 2: Assess feasibility and required personnel • Pre-anesthesia recruitment (1 personnel) • Pre-operative fMRI (2-3 personnel) and neuropsychological assessment (1 hire with clinical expertise) • Perioperative management • Post-operative assessments (1 CAM-trained hire, nursing staff) Considerations: • blinding to avoid bias (neuropsych hire ≠ CAM hire) • required expertise • budget (what funds are available to hire people? pay for MRI? provide honorarium?) • increase validity of post-op assessment (multiple measures? chart review?)

  5. Creating a Successful Data Plan STEP 3: Apply for ethics • Have all testing batteries prepared • Personnel identified • Organization strategy in place • Data management plan in place • Each pt is coded with a unique number. List of pts and numbers kept on password-protected computer • Each pt has a file with their number. All pt materials put in the file. Files kept in locked laboratory (study personnel have key) • Files = consent form, pre-op battery, neuropsych testing, periop info, all CAM assessments Important note: organization can make or break a study

  6. Data Entry and Management STEP 4: Conduct study & select method of data entry and management (ongoing or post-study entry?) Important elements to data entry: • Use Excel/SPSS (or other data management platform) with clear labels • Use numerical values to identify patients • Create data dictionary where variables and values are clearly defined • Decide on a value for missing data (888, 99, 9?) • Only include text in data file when patients can report open-ended response
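The data-dictionary idea above can be sketched as a small lookup that records each variable's type, valid values, and missing code. This is an illustrative Python sketch, not part of the original study; the variable names (`pt_id`, `anxiety_pre`, `moca_total`) and code 999 are hypothetical:

```python
# Minimal data dictionary: each variable maps to its type, valid values,
# and the sentinel code chosen for missing data (all values hypothetical).
data_dictionary = {
    "pt_id":       {"type": "numeric", "range": (1, 15), "missing": None},
    "anxiety_pre": {"type": "categorical", "values": {0: "no", 1: "yes"}, "missing": 999},
    "moca_total":  {"type": "continuous", "range": (0, 30), "missing": 999},
}

def check_value(variable, value):
    """Classify an entered value as ok, missing, or invalid."""
    spec = data_dictionary[variable]
    if value == spec["missing"]:
        return "missing"
    if "values" in spec and value not in spec["values"]:
        return "invalid"
    if "range" in spec and spec["range"] is not None:
        lo, hi = spec["range"]
        if not (lo <= value <= hi):
            return "invalid"
    return "ok"

print(check_value("moca_total", 999))   # declared missing code
print(check_value("moca_total", 27))    # valid score
print(check_value("anxiety_pre", 2))    # not a defined category
```

Note the missing code (999) is deliberately outside the real value range (0-30), matching the rule on the next slide that missing values must not overlap with real values.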

  7. Important points to remember: 1. Make sure variable label matches with label in dataset 2. Clearly identify continuous vs. categorical vs. text data 3. Make sure missing value does not overlap with real values 4. Variable labels should differentiate longitudinal data 5. Identify the question in the questionnaire associated with each variable

  8. Management  Analysis • With a well-organized data file you can: • Perform basic statistics in Excel • Graph your variables in Excel • Transfer the data easily to statistical software (SPSS) for analysis • Clearly identify missings and analyze accordingly • Send the data and data dictionary to a statistician for analysis

  9. Missing data

  10. How do data go missing? • You lose it • Participants not responding to certain questions • Order/fatigue • Sensitivity/refusal • Skip-outs • Attrition • Data entry errors • Statistical coding errors

  11. Why don’t we think/talk about missings in epidemiological analyses? • You should • Many variables have <5% missing (thus, “ignorable”) • Weighting procedures partially account for missing (“missing by design”) • Weights adjust for non-response

  12. Methods to reduce missings in primary data collection • Select shorter scales • Avoid long, confusing questions • Carefully consider alternate ways of asking sensitive questions and include surrogate questions • Change ordering of scales (e.g., have 3 different versions) • Check for completeness immediately • Have second person review data entry

  13. Understanding missingness • Understand reasons and patterns for missing • Understand distribution of missings • Use this information to select the best method of analysis and imputation (if appropriate)

  14. Defining missing values Missing Completely at Random (MCAR): Missings are unrelated to the actual data. Parameter estimates are unbiased, but power is reduced Example: Some survey questions asked in a random sub-sample of a larger sample Missing at Random (MAR): Missings are related to another factor, but not the variable of interest Example: Respondents in service occupations less likely to report income Missing Not at Random (MNAR): Missings related to the variable itself Example: Respondents with high incomes are less likely to report their incomes
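The three mechanisms can be made concrete by simulating how the slide's income example goes missing under each one. This is a sketch with arbitrary probabilities and thresholds, not data from the study:

```python
import random

random.seed(1)

# Simulated respondents: income plus an occupation indicator (hypothetical).
people = [{"income": random.gauss(50_000, 15_000),
           "service_job": random.random() < 0.3} for _ in range(1000)]

def missing_mcar(p):   # MCAR: missingness unrelated to any variable
    return random.random() < 0.2

def missing_mar(p):    # MAR: depends on occupation, not on income itself
    return random.random() < (0.4 if p["service_job"] else 0.1)

def missing_mnar(p):   # MNAR: depends on the income value being reported
    return random.random() < (0.5 if p["income"] > 65_000 else 0.05)

for name, rule in [("MCAR", missing_mcar), ("MAR", missing_mar), ("MNAR", missing_mnar)]:
    observed = [p["income"] for p in people if not rule(p)]
    print(name, "observed mean income:", round(sum(observed) / len(observed)))
```

Running this shows why MNAR is the problematic case: because high earners are dropped preferentially, the observed mean income is biased downward, whereas MCAR only shrinks the sample.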

  15. MCAR vs MAR vs MNAR • It is important to understand missingness as statistical methods to deal with missing often assume MCAR or MAR • It is problematic to have MNAR

  16. Analysis strategies Deletion Methods: • Listwise deletion, pairwise deletion Single Imputation Methods: • Mean/mode substitution, dummy variable method, single regression • Must be MCAR Model-Based Methods: • Maximum likelihood, multiple imputation, inverse probability weighting • Can be MAR

  17. Deletion Methods • Simplest approach • Only analyzes cases with available data on each variable • Reduces power • Increases likelihood of biased findings Listwise deletion: deleting a complete case (e.g., participant) with any missing value Pairwise deletion: excluding a case only from analyses involving its missing variable(s)
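The difference between the two deletion methods can be sketched on a toy dataset (variable names and values are made up):

```python
# Toy dataset: None marks a missing value (hypothetical anxiety/depression scores).
rows = [
    {"anx": 3,    "dep": 2},
    {"anx": None, "dep": 4},
    {"anx": 5,    "dep": None},
    {"anx": 2,    "dep": 1},
]

# Listwise deletion: drop any case with a missing value on any variable.
listwise = [r for r in rows if None not in r.values()]

# Pairwise deletion: each statistic uses every case available for that variable.
anx_vals = [r["anx"] for r in rows if r["anx"] is not None]
dep_vals = [r["dep"] for r in rows if r["dep"] is not None]

print(len(listwise))                  # listwise keeps only 2 of 4 cases
print(len(anx_vals), len(dep_vals))   # pairwise keeps 3 cases per variable
```

Listwise deletion throws away half the sample here, which is the power cost the slide warns about; pairwise deletion retains more data but can make different statistics rest on different subsamples.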

  18. Single Imputation Mean/mode substitution: replace missing value with sample mean or mode • Can specify conditions (e.g., over 80% complete) • Reduces variability of data (regression toward the mean) Regression imputation: replaces missing values with predicted score from regression • Can overestimate model fit • Reduces variance Dummy variable adjustment with imputation: • Create missing variable (1=missing, 0=not missing) • Impute data (e.g., mean substitution) • Add missing variable as covariate in regression

  19. Dummy Variable Adjustment • Treat missing data as a “level” in categorical variables • E.g., If you had 3 levels for income (<$30,000, $30,000-$60,000, $60,000+), you could include a “missing level” and include this 4-level variable as a covariate in your models • This is preferable over deletion, which will impact your sample size
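The "missing level" recode from this slide can be sketched directly on the slide's income example (the category labels are shortened for illustration):

```python
# Income categories with None marking nonresponse (hypothetical data).
income = ["<30k", "30-60k", None, "60k+", None, "30-60k"]

# Treat missingness as its own level so these cases stay in the model
# instead of being listwise-deleted.
income_4level = [v if v is not None else "missing" for v in income]

print(income_4level)
print("levels:", sorted(set(income_4level)))
```

The recoded variable now has four levels and no missing cells, so entering it as a covariate keeps all six cases in the analysis rather than dropping the two nonresponders.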

  20. Model Based Methods Maximum likelihood estimation (EM in SPSS): • Method that identifies a set of parameter values that results in the highest log-likelihood • Unbiased estimates for MCAR and MAR Multiple imputation: • Uses a specified regression model to create multiple datasets with various completed possible values • Analyses are performed within each dataset • Results pooled into one estimate

  21. Preferred Method • Multiple imputation is the gold standard of imputation • Creates most accurate values by taking into account variability due to (1) sampling and (2) imputation • Disadvantage: Time consuming and involves several decisions • MI method, dataset count, iterations between datasets, selection of prior distribution
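The pooling step that makes multiple imputation the gold standard follows Rubin's rules: the pooled point estimate is the mean of the per-dataset estimates, and its total variance combines the average within-imputation variance with the between-imputation variance. A sketch with made-up coefficients from m = 5 imputed datasets:

```python
from statistics import mean

# Hypothetical regression coefficient and its sampling variance
# from each of m = 5 imputed datasets (numbers are invented).
estimates = [0.42, 0.45, 0.40, 0.44, 0.43]
variances = [0.010, 0.011, 0.009, 0.010, 0.012]
m = len(estimates)

pooled_est = mean(estimates)   # Rubin's rules: pooled point estimate
within_var = mean(variances)   # average within-imputation variance
between_var = sum((e - pooled_est) ** 2 for e in estimates) / (m - 1)
total_var = within_var + (1 + 1 / m) * between_var  # total pooled variance

print(round(pooled_est, 3), round(total_var, 4))
```

The `(1 + 1/m) * between_var` term is exactly the "variability due to imputation" the slide mentions: uncertainty about the imputed values inflates the pooled standard error beyond what any single completed dataset would report.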

  22. Missing Data with SPSS Tips and Tricks

  23. What can you do in SPSS? • Understand missingness • Conduct a formal missing value analysis along with single imputation methods • Conduct model-based imputation including MI • Mean value imputation in factor analysis and linear regression • Forecasting add on allows for imputation in time series

  24. Identifying values as missing in SPSS Can define missings in 3 ways: • Identify discrete missing values (up to 3) • A range of values • A range of values and one discrete value **You don’t need to re-code as sysmis
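Inside SPSS the declared codes are handled automatically; if the raw file is later exported elsewhere, the same declaration can be applied by recoding the sentinel codes to a true missing marker. A sketch, assuming hypothetical codes 888 and 999 from the codebook:

```python
MISSING_CODES = {888, 999}  # hypothetical sentinel codes from the data dictionary

raw = [3, 999, 5, 888, 2]   # raw exported values for one variable

# Recode sentinel codes to None so downstream tools treat them as missing.
clean = [None if v in MISSING_CODES else v for v in raw]

print(clean)
```

This is the export-side equivalent of defining discrete missing values in SPSS Variable View; the real values (3, 5, 2) pass through untouched.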

  25. Understanding missings in SPSS • Run frequencies on all primary variables to understand the n that is missing • Identify variables that have high missings: Based on your knowledge of the variable/data collection process/survey, is there a reason why there are more missings? • Understand pattern of missingness and ignorability • Missing Value Analysis module or Multiple Imputation module or Index for Sensitivity to Nonignorability

  26. Missing Value Analysis • Describes pattern of missing data: where they are located, how extensive they are, is missingness random, do pairs of variables tend to have missings • Little’s MCAR test • Estimates means, SDs, correlations etc for missing value methods • Conducts single imputation

  27. Example SPSS dataset Primary Variables: • Anxiety (yes, no) • Anxiety type (phobia, ocd, gad, ptsd, panic, other) • How nervous are you (1 thru 5) • How restless are you (1 thru 5) • Frequency of alcohol use (never to 4 or more times per week) • Quantity of alcohol use (1-2/day to 10+/day) Does anything stand out that may impact missingness?

  28. Running frequencies Most cases are missing here: Suspect skip-outs and go back to the questionnaire If participants answered no to whether they had an anxiety disorder, they would not be asked about type. If they indicated they did not drink, they would not be asked how many drinks they have per day

  29. Recode variables to account for skip-outs
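The recode from this slide distinguishes cells that are blank because of a legitimate skip-out from cells that are truly missing. A sketch using the anxiety example from slide 28 (the label "not applicable" is an illustrative choice):

```python
# Each respondent: anxiety diagnosis and anxiety type (None = no recorded answer).
respondents = [
    {"anxiety": "no",  "anxiety_type": None},   # skip-out: question never asked
    {"anxiety": "yes", "anxiety_type": None},   # truly missing: asked, no answer
    {"anxiety": "yes", "anxiety_type": "gad"},
]

# Recode: a blank type after answering "no" is a legitimate skip, not missing.
for r in respondents:
    if r["anxiety"] == "no" and r["anxiety_type"] is None:
        r["anxiety_type"] = "not applicable"

print([r["anxiety_type"] for r in respondents])
```

After the recode, only the second respondent still counts as missing on anxiety type, so the missingness statistics reflect genuine nonresponse rather than the questionnaire's skip logic.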

  30. Missing Value Analysis

  31. Missing Value Analysis • The continuous measures of nervous and restless have a non-significant Little’s MCAR test • This is a good thing! Fail to reject the null, which means that data are missing in a random way • Can proceed with single imputation (with continuous measures)

  32. Replace missing values • Expectation maximization approach • Ensure that imputation is done on variables that are coded appropriately (e.g. reverse code) and complete by subscale (homogeneous items) • Use multiple items to enhance accuracy

  33. Single Imputation SPSS Analyze  Missing Value Analysis  EM…  Save Completed Data

  34. Multiple Imputation Analyze Multiple imputation  Analyze patterns

  35. Multiple Imputation: Missing Patterns • 100% of the variables have at least some missing data • 86.85% have complete cases (rows complete) • 88.53% of cells have data Should not impute variables with >15% missing
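The per-variable check behind the >15% rule of thumb can be sketched as a simple missing-percentage scan (the dataset below is invented, echoing the nervous/restless items from the example):

```python
# Hypothetical dataset: one list of values per variable, None = missing cell.
data = {
    "nervous":  [1, 2, None, 4, 3, None, 2, 1, 5, 2],
    "restless": [2, None, 3, 1, 2, 4, 2, 3, 1, 2],
}

def pct_missing(values):
    """Percentage of cells that are missing for one variable."""
    return 100 * sum(v is None for v in values) / len(values)

# Flag variables that exceed the 15% rule of thumb before imputing.
for name, vals in data.items():
    flag = "do not impute" if pct_missing(vals) > 15 else "ok to impute"
    print(name, pct_missing(vals), flag)
```

Here `nervous` (20% missing) would be flagged while `restless` (10%) passes, mirroring the screening step the slide describes before any imputation is run.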

  36. Multiple imputation: Missing patterns • 7 patterns of missing: 1 = no missings across all variables; 7 = missings on all variables • Around 80% of cases have no missings, while around 15% have missings across all variables • We cannot impute those with all missings, so remove from dataset (indicates those who enrolled in study but did not complete any measures)

  37. MI: Removal of variables • Should no longer have pattern 7 (all missings) • Missing data dropped from around 11% to less than 2% • Complete Little’s MCAR test

  38. Multiple imputation Model specification # of datasets Can specify method or can occur automatically Constraints: specify min and max of variables in models; scan data to provide descriptives

  39. Multiple imputation: new dataset Yellow cells represent imputed values Imputation variable = dataset number 0 = original dataset 1 thru 5 = imputed datasets

  40. MI: Regression Run linear regression; pull-down menus include a swirl icon indicating the analysis will be conducted on the imputed datasets individually and as a pooled dataset

  41. MI: Pooled Regression

  42. Sensitivity analysis case example

  43. Example using Wave 1 & 2 of NESARC

  44. Revise & Resubmit AJGP “The authors write "Second, we are unable to determine the effect of respondent attrition at Wave 2 on our results." Actually, the authors could describe the range of possible influence of respondent attrition at Wave 2 on their results with the use of a sensitivity analysis. See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2047601/ Selective attrition seems to be a very likely candidate to explain the decreasing trend of lifetime psychiatric disorder with older age, far more compelling than "This finding once again supports the notion of older age generally being associated with good emotional health"

  45. What we did • Re-merged Waves 1 & 2 • Merged NESARC had missings due to attrition removed • Coded in missings by identifying those that had no data for Wave 2 stratum • Ran a number of t-tests and chi-square analyses examining significant differences among missing and non-missing on primary variables and sociodemographics

  46. Second round of revisions • Analysis demonstrated factors for predicting missingness • Not sensitivity analysis • For variables that cannot be imputed, re-run analyses with “best” case and “worst” case scenario

  47. What we did • Replace missing independent variables from Wave 1 with extreme scores (endorsed, not endorsed & 1 SD above or below the mean) • Conducted our same analyses with each scenario to examine whether extreme cases changed results

  48. Bottom Line: Who, What, When, Where, Why Always be transparent about missingness. WHO: Who is missing? WHAT: What proportion is missing? WHEN: When are data missing? (in what context) WHERE: Where are these missing? WHY: Why are these missing?
