Data Management Systems in Epidemiology Seminar 1 April 25, 2012
Data management is important for: • Documenting observations • Reproducibility • Organizing thoughts • Publication • Presentations • Applying for funding • Communicating with your mentor • Meeting the requirements of funding agencies such as NIH • Analyzing data differently in the future
Steps in creating a database: • Defining your variables • Specifying the format and range of values for each variable • Creating a data dictionary • Planning data entry procedures • Testing data entry procedures • Entering the data • Creating a dataset for analysis • Backing up and archiving the dataset
Defining New Variables • Assign a name to each variable so that it can be identified in the database and during the analysis • A variable's name should: - Clearly identify the survey question or the type of information collected - Be understandable, consistent, and short • Use lower case for all variable names; this eliminates errors when software is case sensitive • Note: Some software limits the length of variable names; for example, older versions of SPSS allow only 8 characters
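The naming rules above can be sketched in a few lines of Python. This is only an illustration: the helper `make_variable_name` and the 8-character default are assumptions made for the example, not part of any statistical package.

```python
import re

MAX_NAME_LENGTH = 8  # assumed limit; set this to whatever your software allows


def make_variable_name(label: str, max_length: int = MAX_NAME_LENGTH) -> str:
    """Turn a free-text label into a short, lower-case variable name."""
    # Lower-case the label and replace runs of other characters with "_".
    name = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    # Many packages reject names that are empty or start with a digit.
    if not name or name[0].isdigit():
        name = "v_" + name
    # Truncate to the allowed length and drop a trailing underscore.
    return name[:max_length].rstrip("_")


print(make_variable_name("Asthma Dx"))           # asthma_d
print(make_variable_name("Chronic Bronchitis"))  # chronic
```

Truncation can make two different labels collide, so it is worth checking the generated names for duplicates before building the database.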
Constructing a Database • A database transforms data into information • Before creating a database, you must identify and clarify its purpose: - Who will use it? - What are their needs? - What will the data be used for? - What do I want to say with this data?
Major types of databases used in public health: • Flat-file database (spreadsheet) • Relational database
Relational Database • A relational database stores data in one or more tables • Each table consists of records (rows) and fields (columns) • Tables can be linked to one another through related fields • Examples of software for creating relational databases: MS Access, Oracle, MySQL, Epi Info
Flat File Database • A flat file database is a single file with rows and columns, with no relationships between records • Choosing between a flat file and a relational database depends on the information you are collecting • If the information includes complex relationships, you should use a relational database; this will reduce data entry time, errors, and redundancy
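As a rough sketch of why linked tables reduce redundancy, the example below uses Python's built-in `sqlite3` module (not one of the packages named on the slides). The table and column names (`participant`, `visit`, `asthma`) and the records are hypothetical. A participant's sex is stored once and linked to each visit, rather than being retyped on every row as it would be in a flat file.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE participant (
        participant_id INTEGER PRIMARY KEY,
        sex            TEXT
    );
    CREATE TABLE visit (
        visit_id       INTEGER PRIMARY KEY,
        participant_id INTEGER REFERENCES participant(participant_id),
        visit_date     TEXT,
        asthma         INTEGER
    );
""")

# One participant record, two linked visit records (made-up values).
conn.execute("INSERT INTO participant VALUES (1, 'female')")
conn.executemany(
    "INSERT INTO visit (participant_id, visit_date, asthma) VALUES (?, ?, ?)",
    [(1, "2012-01-10", 2), (1, "2012-04-25", 1)],
)

# Join the tables on the shared field to get one row per visit.
rows = conn.execute("""
    SELECT p.participant_id, p.sex, v.visit_date, v.asthma
    FROM participant AS p
    JOIN visit AS v ON v.participant_id = p.participant_id
    ORDER BY v.visit_date
""").fetchall()
print(rows)
```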
Data Dictionary • A data dictionary documents the meaning of the collected data • A data dictionary should include: - Variable type (e.g., nominal, string, text) - Variable format (e.g., "Yes", "No", "Missing") - Acceptable values (e.g., a coded response can only be 0, 1, or 9999)
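A data dictionary can be kept in any form (a spreadsheet, a document, a table in the database). As one possible sketch, here it is as a plain Python dictionary. The variable names, labels, and the age range are hypothetical, chosen only to show the three elements listed above.

```python
# Hypothetical data dictionary: one entry per variable in the database.
data_dictionary = {
    "asthma": {
        "label": "Have you ever been diagnosed with asthma?",
        "type": "nominal",
        "format": "integer code",
        "acceptable_values": {1: "Yes", 2: "No", 9: "Don't Know"},
    },
    "age": {
        "label": "Age at interview (years)",
        "type": "numeric",
        "format": "whole number",
        "acceptable_values": range(0, 121),  # assumed plausible range
    },
}


def is_acceptable(variable: str, value) -> bool:
    """Return True if the value is listed as acceptable for the variable."""
    return value in data_dictionary[variable]["acceptable_values"]


print(is_acceptable("asthma", 2))  # True
print(is_acceptable("asthma", 3))  # False
print(is_acceptable("age", 130))   # False
```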
Coding • Coding = translation and summarization • Most statistical analyses require that nonnumeric responses be coded as numeric values • Coding example: "Have you ever been diagnosed with asthma?" 1 = "Yes", 2 = "No", 9 = "Don't Know"
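The asthma coding scheme above could be applied along these lines. The function name `code_response` and the choice to raise an error on unexpected text are assumptions made for this sketch.

```python
# Codes taken from the asthma example: 1 = Yes, 2 = No, 9 = Don't Know.
ASTHMA_CODES = {"yes": 1, "no": 2, "don't know": 9}


def code_response(raw: str, codes: dict = ASTHMA_CODES) -> int:
    """Translate a text response into its numeric code."""
    key = raw.strip().lower()  # ignore case and stray spaces
    if key not in codes:
        # Fail loudly instead of silently inventing a code.
        raise ValueError(f"No code defined for response: {raw!r}")
    return codes[key]


coded = [code_response(r) for r in ["Yes", "no ", "Don't Know"]]
print(coded)  # [1, 2, 9]
```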
Coding Open-Ended Questions • Coding responses to open-ended questions is complicated • Examples: - "What hobbies or other interests do you have?" - "What has been important about your adult life?"
Coding Missing Data • Make sure that the value assigned to missing data is not a possible value for that variable • Example: If missing age is coded as 99, those records will be analyzed as age = 99 years, which is incorrect • Get familiar with the standard missing-value code of the software that you will use • Code "Don't Know" differently from missing data
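A small worked example shows how a missing-value code that is also a plausible age distorts results. The ages are made up, and using Python's `None` for a missing value is just one convention; your statistical package will have its own.

```python
from statistics import mean

# Made-up ages; the third respondent's age is missing.
ages_coded_99 = [34, 51, 99, 28]      # missing value entered as 99
ages_coded_none = [34, 51, None, 28]  # missing value kept distinct

# The 99 is treated as a real age and inflates the mean.
print(mean(ages_coded_99))  # 53

# Excluding the missing value gives the mean of the observed ages.
observed = [age for age in ages_coded_none if age is not None]
print(round(mean(observed), 1))  # 37.7
```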
Validating Data • Check for illogical answers; for example, those reported as "female" should not report that they have had a prostatectomy • Most data management systems can run edit checks that validate your data while it is being entered (set this up) • Most systems let us control the range of acceptable values that can be entered into a field (set this up)
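A logic check like the one above might look as follows when run on records that have already been entered. The field names and the coding (1 = "Yes", 2 = "No") are assumptions for this sketch.

```python
def find_illogical(records: list) -> list:
    """Return the ids of records with an impossible combination of answers."""
    flagged = []
    for rec in records:
        # Assumed coding: prostatectomy 1 = Yes, 2 = No.
        if rec["sex"] == "female" and rec["prostatectomy"] == 1:
            flagged.append(rec["id"])
    return flagged


records = [
    {"id": 101, "sex": "male", "prostatectomy": 1},
    {"id": 102, "sex": "female", "prostatectomy": 2},
    {"id": 103, "sex": "female", "prostatectomy": 1},  # illogical
]
print(find_illogical(records))  # [103]
```

Flagged records should be checked against the original forms rather than corrected by guesswork.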
Controlling Data Range • The database can be set up to permit only the values 1, 2, and 9 to be entered into the field • For example: "Have you ever been diagnosed with chronic bronchitis?" 1 = "Yes", 2 = "No", 9 = "Don't Know"
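How a range restriction is set up depends on the software. As one illustration, SQLite (through Python's built-in `sqlite3` module) supports a `CHECK` constraint that rejects any value outside the permitted set. The table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey (
        participant_id INTEGER PRIMARY KEY,
        -- Only 1 (Yes), 2 (No), and 9 (Don't Know) may be stored.
        bronchitis     INTEGER NOT NULL CHECK (bronchitis IN (1, 2, 9))
    )
""")

conn.execute("INSERT INTO survey VALUES (1, 2)")  # accepted

try:
    conn.execute("INSERT INTO survey VALUES (2, 3)")  # 3 is not permitted
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("Rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM survey").fetchone()[0]
print(row_count)  # 1
```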
Data Entry • Goal: enter the data efficiently and accurately into the database • Reduce data entry time by setting up proper tab orders and hotkey shortcuts • Data entry alternatives: - One database - Multiple databases
Advantages of One Database • While data is being entered, we can instantly run summary statistics and interim analyses • We can easily check for duplicate data entry • Handy if data entry staff are in multiple locations and everyone has access to the same database
Advantages of Multiple Databases • If one database gets corrupted, we only have to re-enter that data entry person's records • Useful if the data entry staff have different levels of computer and data entry skills • We will not lose all of our data (just some) • Data can be merged for the analysis of the project • Note: Monitor who is entering which records into the database!
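Merging several data entry files while keeping track of who entered each record could be sketched as below. The structure (one list of records per staff member) and the names `merge_batches` and `entered_by` are assumptions for the example.

```python
from collections import Counter


def merge_batches(batches: dict) -> tuple:
    """Combine per-person batches and report ids that appear more than once."""
    merged = []
    for entered_by, records in batches.items():
        for rec in records:
            # Tag each record with the person who entered it.
            merged.append({**rec, "entered_by": entered_by})
    counts = Counter(rec["id"] for rec in merged)
    duplicates = sorted(rec_id for rec_id, n in counts.items() if n > 1)
    return merged, duplicates


batches = {
    "clerk_a": [{"id": 1, "asthma": 2}, {"id": 2, "asthma": 1}],
    "clerk_b": [{"id": 2, "asthma": 1}, {"id": 3, "asthma": 9}],
}
merged, duplicates = merge_batches(batches)
print(len(merged))  # 4
print(duplicates)   # [2]
```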
Train data entry staff on: • How to enter the data • How to navigate through the database • What should be entered and how • Provide your staff with protocol guidelines and hard copies of blank surveys and data forms • Back up the database frequently • Document the process and make a note of everything that happens
Documentation of Data Entry Procedure • Keep a notebook and document ALL that happens: - Choices that are made about what will be entered and how - Changes in staff involved in data collection and data entry, and when these changes happened - Problems with data entry and their solutions • Arrange all data collection forms for easy retrieval
Identifying and correcting errors in the database • Double data entry is one method to reduce data entry errors: - Data are entered twice - The two entries are compared for each variable, and a list of values that do not match is created - Data with errors are checked against the original data • Another method is to recheck or re-enter a random fraction of the data
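The comparison step of double data entry can be sketched as follows, assuming each pass is stored as a dictionary keyed by record id. The function name and data layout are illustrative only.

```python
def compare_entries(first: dict, second: dict) -> list:
    """List (record id, variable, first value, second value) for mismatches."""
    mismatches = []
    for rec_id in sorted(first.keys() | second.keys()):
        a, b = first.get(rec_id), second.get(rec_id)
        if a is None or b is None:
            # The record was entered in only one of the two passes.
            mismatches.append((rec_id, "(whole record)", a, b))
            continue
        for var in sorted(a.keys() | b.keys()):
            if a.get(var) != b.get(var):
                mismatches.append((rec_id, var, a.get(var), b.get(var)))
    return mismatches


first_pass = {101: {"age": 34, "asthma": 1}, 102: {"age": 51, "asthma": 2}}
second_pass = {101: {"age": 43, "asthma": 1}, 102: {"age": 51, "asthma": 2}}
print(compare_entries(first_pass, second_pass))  # [(101, 'age', 34, 43)]
```

Each mismatch is then resolved by looking at the original form, not by picking one of the two entries.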
Testing Data Entry Procedures • It is crucial to test your data entry methods every time • How? Check your: - Coding - Data validation - Checking procedures • Revise and check your data dictionary • Finally, take six to ten surveys or forms and enter them into your newly created database
Backing Up and Archiving • Back up the data regularly (e.g., end of each day) • Archive a dataset with its documentation and any important files for interpreting the data
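A regular backup can be as simple as copying the database file under a time-stamped name. The sketch below assumes the database is a single file; the function name `back_up` and the paths in the usage comment are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def back_up(database_path: str, backup_dir: str) -> Path:
    """Copy the database file into backup_dir under a time-stamped name."""
    source = Path(database_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = target_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also keeps file timestamps
    return target


# Example call (hypothetical paths):
# back_up("study.db", "backups")
```

Keeping backups on a different drive or server from the working copy protects against losing both at once.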