
Analysis of Large “Population-based” Databases for Clinical Research

John Kwagyan, PhD, jkwagyan@howard.edu. Design, Biostatistics & Population Studies, Georgetown-Howard Center Clinical Translational Science.


Presentation Transcript


  1. Analysis of Large “Population-based” Databases for Clinical Research • John Kwagyan, PhD • jkwagyan@howard.edu • Design, Biostatistics & Population Studies, Georgetown-Howard Center Clinical Translational Science

  2. ………… ………… That we are in the midst of crisis is now well understood. Our nation is at war, …………. Our economy is badly weakened, ……….. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; .. These are the indicators of crisis, subject to data and statistics. ………… ………… ………… Pres. Barack Obama (Inaugural speech)

  3. Sequence of Steps in a Research Project • Conceptualization • Planning/Design • Execution • Interpretation • Reporting: Abstracts, Presentation, Publication • Data Collection & Processing • Data Analysis

  4. Outline • Types, Uses & Opportunities • National & Institutional Databases • Access • Analysis & Statistical Issues

  5. Types, Uses & Opportunities

  6. Types of Large Databases • (Health) Survey Databases: NHANES • (Health) Administrative Databases: HCUP • Discharge & Mortality Databases • Specialty Databases, e.g. stroke • Clinical Trials: AASK, ALLHAT

  7. Uses of Large Databases • Secondary Analysis ~ publications • Pilot Data for grant proposals • Power Exploration • Hypothesis Generation & Testing • Estimates of Summary Statistics: prevalence, incidence, mortality, etc.

  8. Advantages of using large databases • Large sample • Fast, and some are easily accessible • Provide population estimates • Can test trends over time • Observational, cross-sectional, longitudinal

  9. Limitations & Challenges • Non-experimental (survey & administrative) • Most are cross-sectional • May require special skills: special statistical techniques & software usage • Statistical issues to address • May involve long bureaucracy: written request or proposal, IRB approval • May cost a fee & travel

  10. Funding Opportunities: Secondary Analysis • R03, R21 mechanisms • Obtain data collected by the parent study or by ancillary studies to prepare a scientific manuscript for publication on a topic (aims) that has not yet been addressed. • Receive limited preliminary study data summaries to prepare a proposal for funding of secondary analyses of data. • Obtain specimens (e.g. blood, urine, imaging scans) for new assays or analyses to be conducted using an outside funding source.

  11. Nces.ed.org/nationsreportcard/researchcenter/funding.asp

  12. Funding Opportunities

  13. National Databases

  14. National Health & Nutrition Examination Survey (NHANES): www.cdc.org/nchs/nhanes.html • Population: Adults & children • Method: Face-to-face interview, physical exams • Content: Anthropometry, respiratory disease, chronic & infectious disease, mental health & cognitive functioning, reproductive history & sexual behavior • Data: N ~ 5000/yr since 1999; initiated in 1960 • Notes: Supplemental food survey, online tutorial

  15. National Health Interview Survey (NHIS): www.cdc.org/nchs/nhis.html • Population: Households (families), adults & children • Method: Face-to-face interview, physical exams • Content: Health conditions & behaviors, access to & use of health services, genetic testing • Data: N ~ 35,000 households (~87,500 persons); initiated in 1957 • Notes: Data used widely by the DHHS to monitor trends in illness and disability and to track progress toward achieving national health objectives.

  16. Surveillance, Epidemiology, and End Results (SEER): http://seer.cancer.org • Population: Children to adults • Method: Data collected from cancer registries that cover ~28% of the US population; follow-up with individual cases until death • Content: Cancer incidence, prevalence, and survival data; limited demographics (age, race/ethnicity, region) • Data: Cancer cases in registries, >6 million cases • Notes: Needs specialized software to analyze (SEER*Stat or SEER*Prep), downloaded from the website; must sign a user agreement to obtain.

  17. Healthcare Cost & Utilization Project (HCUP): http://www.ahrq.org/data/hcup • Population: All ages • Method: A family of healthcare databases and tools • Content: Databases enable research on a broad range of health policy issues, including cost and quality of health services, medical practice patterns, access to health care programs, and outcomes of treatments. • Data: Cancer cases in registries, • Notes: Databases are available for purchase through a central distributor

  18. African American Study of Kidney Disease & Hypertension (AASK): www.niddkrepository.org/ • Population: Adult African Americans, 18-70 years • Method: Participants followed for 2 years to measure the long-term effects of blood pressure control in patients with kidney disease attributed to high blood pressure. • Content: BP, markers of kidney function • Data: N = 1094 • Notes: Largest and longest study of chronic kidney disease in African Americans

  19. CDC Wonder: wonder.cdc.org • Wide-ranging online health-related datasets for epidemiologic research • Each data set can be queried using a series of menus • Provides an online tool for retrieving and analyzing data

  20. CDC Wonder

  21. Institutional (GHUCCTS) Databases

  22. Institutional (GHUCCTS) Databases • Obesity Project - HU • Family Genetics Study of Prostate Cancer - HU • HIV in DC - HU • Memory Disorder Study - HU • Spinal Cord Disease Database - MRI • Stroke Database - MRI/NRH • Brain Injury Database - MRI/NRH • National Capital Spinal Cord Injury Model System - MRI/NRH • Strong Heart Study - MRI • The VA Decision Support System Database (DSS) - VA • …….. • ………

  23. Access/Retrieval

  24. Data Access/Retrieval • May require a special request or proposal (aims, etc.) and preparation of detailed analysis plans • Understand the database structure • Extract the requisite data for specific objectives • Apply appropriate linkage techniques for multiple data sources • Processing & storage

  25. Database Structure • Relational structure (1-to-1): represented by a table of rows & columns; attributes are listed in columns (ID, AGE, GENDER, …..); unique identifiers • Hierarchical (nested) structure (1-to-many): allows for multiplicity of attributes while preserving relationships
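A minimal sketch of the two structures on slide 25, assuming made-up records (the field names `id`, `age`, `gender`, `visit`, `sbp` and the helper `nest_visits` are hypothetical, not taken from any of the databases listed): a flat table keyed by a unique ID, and a nested form in which one patient carries many visits.

```python
# Hypothetical records; field names are illustrative only.
patients = [  # relational table: one row per unique ID
    {"id": 1, "age": 54, "gender": "F"},
    {"id": 2, "age": 61, "gender": "M"},
]
visits = [  # repeated measurements: many rows per ID
    {"id": 1, "visit": 1, "sbp": 142},
    {"id": 1, "visit": 2, "sbp": 135},
    {"id": 2, "visit": 1, "sbp": 150},
]


def nest_visits(patients, visits):
    """Attach each patient's visits to the patient record (1-to-many)."""
    by_id = {p["id"]: {**p, "visits": []} for p in patients}
    for v in visits:
        by_id[v["id"]]["visits"].append({"visit": v["visit"], "sbp": v["sbp"]})
    return list(by_id.values())


nested = nest_visits(patients, visits)
for person in nested:
    print(person["id"], len(person["visits"]), "visit(s)")
```

The unique identifier is what makes the linkage possible; the nested form keeps the patient-to-visit relationship explicit instead of repeating patient attributes on every row.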

  26. Data Structure

  27. Data Analysis Methods

  28. Types of Data Endpoints • Continuous Data - BP, BMI, TC, LDL, HDL, Blood Sugar • Categorical Data - Hypertension, Obese, Dyslipidemia, Diabetes • Count Data - 0, 1, 2, 3 • Survival (Time-to-Event) Data - time-to-cardiac event, time-to-death

  29. Partition Data Into Subsets • Core partitioning ~ arises naturally: Race, Gender, Age Group, Geographic Region • Time partitioning: 2000-2010; 1995-2000; 2000-05

  30. Descriptive Analysis By Partition • Measures of Central Tendency: Means, Median, Mode, etc. • Rates: Prevalence, Incidence, Survival, Mortality • Variability: SD, range, IQR
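A small sketch of descriptive analysis by partition, assuming a toy data set (the records, the field names, and the helper `describe_by` are invented for illustration). It splits the records by one partitioning variable and reports a mean, median, SD, and a prevalence for each subset.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical records: systolic BP and a 0/1 hypertension flag.
records = [
    {"gender": "F", "sbp": 120, "hypertensive": 0},
    {"gender": "F", "sbp": 138, "hypertensive": 1},
    {"gender": "F", "sbp": 126, "hypertensive": 0},
    {"gender": "M", "sbp": 130, "hypertensive": 0},
    {"gender": "M", "sbp": 150, "hypertensive": 1},
    {"gender": "M", "sbp": 146, "hypertensive": 1},
]


def describe_by(records, key, value, flag):
    """Summaries of `value` and the prevalence of `flag` within each level of `key`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    out = {}
    for level, rows in groups.items():
        vals = [r[value] for r in rows]
        out[level] = {
            "n": len(rows),
            "mean": mean(vals),
            "median": median(vals),
            "sd": stdev(vals),  # sample standard deviation
            "prevalence": sum(r[flag] for r in rows) / len(rows),
        }
    return out


summary = describe_by(records, "gender", "sbp", "hypertensive")
for level, stats in summary.items():
    print(level, stats)
```

These are unweighted summaries; for survey data the weights discussed later in the deck would have to enter each estimate.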

  31. Visualization Methods: Exploratory Analysis • Apply visualization methods by subsets • Charts • Scatter plot matrix ~ continuous measures • Trellis plot ~ all measures

  32. Trellis Plot

  33. Inference: Statistical Tests • The method used depends on: 1. Outcome measure (univariate, multivariate) 2. Study design

  34. Continuous Data: Parametric Tests and Non-Parametric Equivalents • Paired t-test ~ non-comparative open-label studies (pre-post studies); non-parametric equivalent: Wilcoxon Signed Rank Test • Two-sample t-test ~ comparative studies (e.g. parallel-group designs); non-parametric equivalent: Wilcoxon Rank Sum Test • ANOVA (F-test) ~ comparing multiple groups (e.g. parallel-group designs, factorial designs); non-parametric equivalent: Kruskal-Wallis Test
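As an illustration of the two-sample t-test on slide 34, here is a sketch of the pooled-variance (equal variances assumed) test statistic. The helper name `pooled_t` and the numbers are made up; turning the statistic into a p-value needs the t distribution with the returned degrees of freedom, which is left to a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled estimate of the common variance.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, df


# Hypothetical outcome values in two parallel groups.
group_a = [1, 2, 3]
group_b = [2, 3, 4]
t_stat, df = pooled_t(group_a, group_b)
print(round(t_stat, 4), df)
```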

  35. Categorical Data • What is the question? Compare rates: prevalence, incidence, mortality! • Chi-square Test • McNemar Test (pre-post designs) • Mantel-Haenszel Test ~ heterogeneity
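A sketch of the chi-square test for comparing two rates, assuming a 2x2 table of invented counts (the function name `chi_square_2x2` is hypothetical). It uses the Pearson statistic without continuity correction; with 1 degree of freedom the upper-tail p-value can be written with the complementary error function.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))  # upper tail of a chi-square with 1 df
    return chi2, p


# Hypothetical counts: 30 of 100 cases in group 1 versus 20 of 100 in group 2.
chi2, p = chi_square_2x2(30, 70, 20, 80)
print(round(chi2, 3), round(p, 3))
```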

  36. Survival Data • Question? Compare survival rates! Survival curves, hazard ratios • Kaplan-Meier Estimator • Log-Rank Test • Likelihood Ratio Test
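A compact sketch of the Kaplan-Meier estimator named on slide 36, on invented follow-up times (the function name and the data are hypothetical). At each distinct event time the running survival probability is multiplied by one minus the number of events divided by the number still at risk; censored subjects leave the risk set without changing the curve.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Hypothetical follow-up times with event indicators (0 = censored).
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```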

  37. Regression Methods • Used when it is necessary to adjust for different covariate/confounding effects • Example: Cholesterol level ~ gender, age, diet

  38. Regression Methods • Continuous Data ~ Linear Regression Models • Categorical Data ~ Logistic Regression Models, Conditional Regression Models • Survival Data ~ Proportional Hazards Regression

  39. Multi-Level Models: Hierarchical (Nested) Models • Multilevel Regression • Mixed Effect Models • Nested Models: GEE, Proc Nested • Bayesian Approaches

  40. Multivariable Methods • Used to analyze multiple outcomes jointly • Univariate: TC ~ gender, age, diet (risk factors) • Multivariable: [HDL, LDL, TG] ~ gender, age, diet

  41. Multivariable Methods • MANOVA • Discriminant Analysis • Factor Analysis • Cluster Analysis • Principal Component Analysis

  42. Statistical Issues

  43. Statistical Issues • Sampling error • Missing data • High likelihood of finding a significant difference due to chance alone • Potential for biased results is substantial

  44. Recommendations for Health Survey Data ~ use: • Statistical weights • Stratification • Clustering • Variance estimation

  45. Use of Statistical Weights • The statistical weight of a sampled person is the number of people in the population that the person represents. • If the sampling rate is 1/1000: each sampled person represents 1000 people, so each sampled person has a sample weight of 1000 • Weights are derived from: selection probabilities, response rates, post-stratification adjustments (e.g. gender, education, etc.)
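A minimal sketch of how statistical weights change an estimate, assuming four sampled people with made-up outcomes and weights (the function name `weighted_prevalence` is hypothetical). Each weight is the number of people the respondent represents, so the weighted prevalence is the weighted count of cases divided by the sum of the weights.

```python
def weighted_prevalence(outcomes, weights):
    """Weighted proportion of 1s: sum(w * y) / sum(w)."""
    total = sum(weights)
    return sum(y * w for y, w in zip(outcomes, weights)) / total


# Hypothetical sample: outcome is 1 if the condition is present.
outcomes = [1, 0, 0, 1]
weights = [1000, 1000, 3000, 5000]  # people represented by each respondent

unweighted = sum(outcomes) / len(outcomes)
weighted = weighted_prevalence(outcomes, weights)
population_represented = sum(weights)

print(unweighted, weighted, population_represented)
```

In this toy sample the unweighted prevalence is 0.5 and the weighted prevalence is 0.6, because the respondent with the largest weight has the condition.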

  46. Stratification • Population divided before sampling into disjoint, exhaustive groups (strata) • Members termed primary sampling units (PSUs) • Independent samples are taken in each stratum • Strata formed by similar demographic areas

  47. Clustering: Hierarchical (Nested) Data • Persons residing in a small area (cluster) may have similar characteristics • Responses of subjects in clusters may be correlated • Dependence between subjects leads to inflated variance • Correlation must be accounted for in the analysis
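One common way to quantify the variance inflation from clustering is the design effect, 1 + (m - 1) * rho, for clusters of equal size m and intra-cluster correlation rho. This formula is not on the slide; it is brought in here as an assumption for illustration, and the numbers and function names below are made up.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Number of independent observations carrying the same information as n clustered ones."""
    return n / design_effect(cluster_size, icc)


# Hypothetical survey: 1000 respondents in clusters of 20, intra-cluster correlation 0.05.
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(1000, 20, 0.05)
print(round(deff, 2), round(n_eff, 1))
```

Even a small correlation matters when clusters are large: here the variance is nearly doubled, so the 1000 clustered respondents behave like roughly 513 independent ones.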

  48. Variance Estimation • Use appropriate variance estimation methods • Linearization: uses a Taylor series expansion to estimate the variance of non-linear estimators; default method for most stats programs • Replication methods: calculate different parameter estimates for each replicate and combine these to estimate variance (jackknife, etc.)
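A sketch of the replication idea using the simplest case, the delete-one jackknife on an unweighted sample (the function name and data are invented). Each replicate drops one observation, the estimate is recomputed, and the spread of the replicate estimates gives the variance. Survey software typically deletes whole PSUs within strata and rescales the weights instead; that refinement is not shown here.

```python
from statistics import mean


def jackknife_variance(data, estimator):
    """Delete-one jackknife variance of estimator(data)."""
    n = len(data)
    # One replicate estimate per left-out observation.
    replicates = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)


# Hypothetical measurements.
values = [1, 2, 3, 4, 5]
var_mean = jackknife_variance(values, mean)
print(var_mean)
```

For the sample mean the jackknife reproduces the usual s²/n, which makes it a convenient check before applying the same function to a non-linear estimator such as a ratio.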
