Taking Advantage of DDI 3.0. IASSIST 2007: Workshop Montreal. Presenters. Wendy Thomas Minnesota Population Center Arofan Gregory Open Data Foundation Joachim Wackerow GESIS-ZUMA. Afternoon’s Schedule 1:30 – 5:00. 1:30 Introduction and Overview 2:00 Maintainable objects
Taking Advantage of DDI 3.0 IASSIST 2007: Workshop Montreal
Presenters • Wendy Thomas • Minnesota Population Center • Arofan Gregory • Open Data Foundation • Joachim Wackerow • GESIS-ZUMA
Afternoon’s Schedule 1:30 – 5:00 • 1:30 Introduction and Overview • 2:00 Maintainable objects • 2:30 Questions to Variables to Data • 3:00 Break • 3:30 Creating groups • 4:00 URNs and Versioning • 4:30 What is it good for?
Ground Rules • Time periods are general, we’ll adjust to address your interests • Ask questions • There’s a lot to cover so we may suggest continuing a discussion one-on-one during break or after the workshop • Materials from the workshop will be posted on the DDI site
Basic Element Types • Differences from 2.1 • Not every element is identifiable • Many individual elements or complex elements may be versioned • A number of complex elements can be separately maintained
3.0 Modules and Schemas (one is not necessarily the other) • Modules • Reflect closely related sets of information similar to the sections of DDI DTD • Modules can be held as separate XML instances and be included in a large instance by either inclusion or reference • All modules are maintainable, but not all maintainables are modules
3.0 Modules and Schemas (one is not necessarily the other) • Schemas • Each .xsd file is a schema • Some schemas are modules • Some schemas are substitution sets • Some schemas simply contain elements that are used by multiple schemas or may require more frequent updates • Some schemas are “borrowed”
SCHEMAS • archive • comparative • conceptualcomponent • datacollection • dataset • dcelements • DDIprofile • ddi-xhtml11 • ddi-xhtml11-model-1 • ddi-xhtml11-modules-1 • group • inline_ncube_recordlayout • instance • logicalproduct • ncube_recordlayout • organization • physicaldataproduct • physicalinstance • reusable • simpledc20021212 • studyunit • tabular_ncube_recordlayout • xml
Schemas: MODULES [schema list repeated; highlighted subset not recoverable]
Schemas: SUBSTITUTIONS [schema list repeated; highlighted subset not recoverable]
Schemas: REUSE [schema list repeated; highlighted subset not recoverable]
Schemas: BORROWED [schema list repeated; highlighted subset not recoverable]
Schemas: Maintainable Schemas [schema list repeated; highlighted subset not recoverable]
2.1 sections to 3.0 schema
DDI 2.1 sections • 1.0 Document Description: Citation, Guide, Document Status, Source Document • 2.0 Study Description: Citation, Study Information, Methodology, Data Access, Other Material • 3.0 File Description: File Text, Location Map • 4.0 Data Description • 5.0 Other Material
DDI 3.0 schemas • Instance • Archive • Study Unit • Conceptual Components • Abstract / Purpose / Coverage • DataCollection: Methodology, Collection Event, Question Scheme, Instrument, Processing Event • LogicalProduct: Data Relationships, Category Scheme, Code Scheme, Variable Scheme, NCube • PhysicalDataProduct: Gross Record Structure, Base Record Structure • PhysicalInstance: File Identification, Statistics
Why the big change? • Documentation focused on the Codebook remains a value-added commodity • It becomes the de facto responsibility of the archive rather than the producer • Producers capture information that helps them do their work • The Life Cycle model focuses on the flow of data through a system [production oriented] • Codebooks should be an output from the documentation process, not the sole commodity
Tighter control, more required items • In order to support processing we needed tighter control on element and attribute contents • Schemas provide tighter element-level control • Profiles allow for customization of coverage • A large set of required elements provides a more consistent base for programming
Reuse and Replacement • Reuse of elements means that similar actions are handled in similar ways • Questions and Variables use the same category and coding schemes • Instrument flow logic is found in comparison and coding instructions for variables • Identification and referencing are handled in a consistent manner • Replacement by substitution allows for addressing changing technologies without making major changes in existing schemas • Physical data structures • Data types (microdata, aggregate data, future coverage types?)
Maintainable Objects • Publishing persistent parts • Concept lists • Question Banks • Coding schemes • Comparison mapping • Study description
Support for Registries • A metadata registry is a central location in an organization where metadata definitions are stored and maintained in a controlled method. • Metadata registries are used whenever data must be used consistently within an organization or group of organizations. http://en.wikipedia.org/wiki/Metadata_registry
Examples of Registry Users • Organizations that transmit data using structures such as XML, Web Services or EDI • Organizations that need consistent definitions of data across time, between organizations or between processes. For example when an organization builds a data warehouse • Organizations that are attempting to break down "silos" of information captured within applications or proprietary file formats http://en.wikipedia.org/wiki/Metadata_registry
Capturing and Reusing Metadata • Whether captured at inception or created after the fact, some sections must be completed before other sections can be made • Capturing metadata at the point of inception, in a non-proprietary structure that can be transferred out of and into process software, provides an incentive for metadata creation during the life cycle of the data
Metadata flow • DDI is built on the life cycle of the data and some information naturally occurs earlier than other information • Reuse of and reference to certain types of information such as universe, concepts, categories, and coding schemes prescribe a creation order
[Diagram: metadata creation order; the assignment of boxes to steps is not recoverable from the extracted text] • Step labels: STEP 1, STEP 2, STEP 3 (optional), STEP 4, STEP 5, STEP 7, STEP 8 • Boxes: Universe Scheme, Concept Scheme, Organization / Individual, Category Scheme, Coding Scheme, Variable Scheme, NCube, Record Structure, Data Relationships, Remaining Physical Data Product Items, StudyUnit Citation, StudyUnit Coverage, Question Scheme, Instrument, Remaining Logical Product items, Physical Instance, Processing Event (coding), Archive / Group / etc.
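The idea that reuse and reference prescribe a creation order can be sketched as a dependency sort. The dependency map below is a hypothetical illustration (it is not the workshop's step diagram and is not derived from the DDI schemas); only the scheme names come from the slides.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each scheme lists the schemes it references.
# These edges are assumptions made for illustration only.
references = {
    "Concept Scheme": set(),
    "Universe Scheme": set(),
    "Category Scheme": set(),
    "Coding Scheme": {"Category Scheme"},
    "Question Scheme": {"Concept Scheme", "Universe Scheme", "Coding Scheme"},
    "Variable Scheme": {"Concept Scheme", "Universe Scheme", "Coding Scheme"},
}


def creation_order(deps):
    """Return one valid order in which referenced schemes come first."""
    return list(TopologicalSorter(deps).static_order())


order = creation_order(references)
print(order)
```

Any scheme that is referenced by another appears earlier in the result, which is the sense in which references "prescribe" an order.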
Questions to Variables • Question Development Software: Identifying Universe and Concepts; Building or Importing Question Text and Response Domains • Instrument Development Software (CAI): Organizing questions and flow logic; Capturing raw response data and process data • Data Processing Software: Data cleaning and verification; Recoding and/or deriving new data elements using existing or new categories or coding schemes • Diagram labels: DDI, DDI
Mapping Relationships [Preparing data for the user] • Information previously provided in Guide • Now found in the logical product under Data Relationships • Physical expression of the linking relationships is found in the physical data product
What it tells you • Record Type • How many record types you have • How you can tell what record type it is • Does it provide support for a “multi-part” record • How to identify a “unique” record within a record type • How do you link one record type to another
Multiple Parts / Complex ID SF1 050 000101 27053 DATA SF1 050 000102 27053 DATA SF1 160 002201 27 48000 DATA SF1 160 002202 27 48000 DATA
Multiple Parts / Complex ID • SF1 050 000101 27053 DATA • SF1 050 000102 27053 DATA • SF1 160 002201 27 48000 DATA • SF1 160 002202 27 48000 DATA • UNIQUE Within File = LOGRECNO • IF SUMLEV = 050 then STATE and COUNTY • IF SUMLEV = 160 then STATE and PLACE
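The identification rules on this slide can be sketched in a few lines. This is a minimal illustration, assuming the sample rows are whitespace-delimited in the order file ID, SUMLEV, LOGRECNO, geography code(s), data, and that the 5-character county code is STATE (2 characters) followed by COUNTY (3 characters). Those layout details and the function names are assumptions, not part of the slide.

```python
def parse_record(line):
    """Split one sample row into named fields (assumed layout, see above)."""
    tokens = line.split()
    record = {"FILEID": tokens[0], "SUMLEV": tokens[1], "LOGRECNO": tokens[2]}
    if record["SUMLEV"] == "050":
        # County level: one token, assumed to be STATE + COUNTY.
        record["STATE"], record["COUNTY"] = tokens[3][:2], tokens[3][2:]
    elif record["SUMLEV"] == "160":
        # Place level: STATE and PLACE are separate tokens.
        record["STATE"], record["PLACE"] = tokens[3], tokens[4]
    return record


def geographic_key(record):
    """Pick the identifying fields according to the summary level."""
    if record["SUMLEV"] == "050":
        return ("050", record["STATE"], record["COUNTY"])
    if record["SUMLEV"] == "160":
        return ("160", record["STATE"], record["PLACE"])
    raise ValueError("unsupported SUMLEV: " + record["SUMLEV"])


rows = [
    "SF1 050 000101 27053 DATA",
    "SF1 050 000102 27053 DATA",
    "SF1 160 002201 27 48000 DATA",
    "SF1 160 002202 27 48000 DATA",
]
keys = [geographic_key(parse_record(row)) for row in rows]
print(keys)
```

In this sketch the first two rows resolve to the same geographic key, as do the last two, which is how the parts of a multi-part record can be recognized as belonging together.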
Grouping • Grouping allows two or more studies to be grouped together • Grouping can be done • by design • after the fact • virtually
Grouping by design • Uses inheritance • to reuse rather than rewrite • to handle a form of multiple inheritance trees, first by hierarchy and then by reference • Uses comparison to describe changes that take place in a series over time
Grouping after the fact • Uses comparison to describe equivalent or similar objects and how they compare • Questions • Concepts • Variables • Category Schemes • Coding Schemes • including capturing process for recodes
Comparable by design • PERSON LEVEL INFORMATION: All Waves 1997-2003 • All waves inherit person level information from the group
Waves 1997-2003: Satisfaction with life, School Degree • All waves contain common topical data on Satisfaction with life and School Degree
Currency Fields CHANGE between 2001 and 2002 • Wave 1997-2001: Currency DM • Wave 2002-2003: Currency Euro
Question and Data expanded between 1998 and 1999 • Wave 1997-1998: Size of Company • Wave 1999-2003: Size of Company, Concerns about Euro
This set of questions is included PERIODICALLY • Waves 1997, 2000, 2001: Computer Usage
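The wave example can be read as group-level content that every wave inherits, plus wave-specific additions and changes. The following is a rough sketch of that reading in plain Python. It is not DDI XML, and the function and variable names are hypothetical; only the topics, years, and currencies come from the slides.

```python
# Content every wave inherits from the group (per the slides).
GROUP_TOPICS = {
    "Person level information",
    "Satisfaction with life",
    "School Degree",
    "Size of Company",
}


def wave_content(year):
    """Resolve what a single wave contains: inherited topics plus local changes."""
    if not 1997 <= year <= 2003:
        raise ValueError("the example series covers 1997-2003 only")
    topics = set(GROUP_TOPICS)
    if year >= 1999:
        topics.add("Concerns about Euro")  # added from the 1999 wave onward
    if year in (1997, 2000, 2001):
        topics.add("Computer Usage")  # asked only periodically
    currency = "DM" if year <= 2001 else "Euro"  # currency fields change in 2002
    return {"year": year, "currency": currency, "topics": topics}


print(wave_content(2002))
```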
DDI3: GROUPING and COMPARISON • Example • Standard Eurobarometer 1970 ff. • Occupation of respondent • Changing “wave standard” category schemes • Variations across countries - translation, question/variable structure
DDI3: GROUPING and COMPARISON TREND: OCCUPATION R: WHAT IS YOUR OCCUPATION? • Example • Harmonized systematic 3-digit coding • Distinguishing major categories and sub-categories • Avoiding artificial changes in occupational structures over time Source: Jan W. van Deth: Using Published Survey Data, in: Harkness/Van de Vijver, Mohler: Cross-Cultural Survey Methods
DDI3: GROUPING and COMPARISON
TREND: OCCUPATION
VARIABLE NAME: OCCUP
R: WHAT IS YOUR OCCUPATION?
110. FARMER / FISHERMAN (SKIPPERS) <until EB29>
111. FARMER
112. FISHERMAN
120. <SELF EMPLOYED> PROFESSIONAL (LAWYER, MEDICAL PRACTITIONER, ACCOUNTANT, ARCHITECT, ...)
130. OWNER OF A SHOP, CRAFTSMEN, BUSINESS PROPRIETOR
131. OWNER OF A SHOP, CRAFTSMEN, OTHER SELF EMPLOYED PERSON
132. BUSINESS PROPRIETORS, OWNER (FULL OR PARTNER) OF A COMPANY
210. EMPLOYED PROFESSIONAL (LAWYER, MEDICAL PRACTITIONER, ACCOUNTANT, ARCHITECT, ...)
220. EXECUTIVE, TOP MANAGEMENT, DIRECTOR <starting with EB30:> GENERAL MANAGEMENT <starting with EB37:> GENERAL MANAGEMENT, DIRECTOR OR TOP MANAGEMENT
230. MIDDLE MANAGEMENT, OTHER MANAGEMENT (DEPARTMENT HEAD, JUNIOR MANAGER, TEACHER, TECHNICIAN)
310. EMPLOYED POSITION, WORKING MAINLY AT A DESK
311. WHITE COLLAR – OFFICE WORKER <until EB29>
312. OTHER OFFICE EMPLOYEES <EB30 to EB36>
320. NON-OFFICE EMPLOYEES, NON MANUAL WORKERS (SERVICE SECTOR, E.G. SHOP ASSISTANT ETC.)
321. EMPLOYED POSITION, NOT AT A DESK BUT TRAVELLING (SALESMAN, DRIVER, ...)
...
540. UNEMPLOYED <starting with EB30:> TEMPORARILY NOT WORKING, UNEMPLOYED
998. DK / NA
999. INAP
• Resource
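DDI3: GROUPING and COMPARISON • “Internal” standard: • Mannheim Trend File • “External” standards: • ESOMAR • EB-CH / OSF • ISCO-88

Eurobarometer 3 - 29 Wave Standard [Q…. ] OCCUPATION OF SELF:
SELF EMPLOYED
01. FARMERS, FISHERMEN (SKIPPERS)
02. PROFESSIONAL - LAWYERS, ACCOUNTANTS, ETC.
03. BUSINESS - OWNERS OF SHOPS, CRAFTSMEN, PROPRIETORS
EMPLOYED
04. MANUAL WORKER
05. WHITE COLLAR - OFFICE WORKER
06. EXECUTIVE, TOP MANAGEMENT, DIRECTOR
NOT EMPLOYED
07. RETIRED
08. HOUSEWIFE, NOT OTHERWISE EMPLOYED
09. STUDENT, MILITARY SERVICE
10. UNEMPLOYED
00. DK/NA

DE - Eurobarometer ? - 17 [ F… ] Sind Sie persönlich berufstätig? [Are you yourself employed?]
Selbständige [Self-employed]
1. Landwirte [Farmers]
2. Freie Berufe (z.B. Arzt, Anwalt) [Liberal professions (e.g. doctor, lawyer)]
3. Kleine, mittlere, größere Selbständige [Small, medium, larger self-employed]
Berufstätige [Employed]
4. Arbeiter / Facharbeiter [Workers / skilled workers]
5. Angestellte / Beamte [Salaried employees / civil servants]
6. Ltd. Angestellte / ltd. Beamte [Senior salaried employees / senior civil servants]
Nicht berufstätige [Not employed]
7. Rentner / Pensionär [Retired / pensioner]
8. Hausfrauen (nicht anderweitig beschäftigt) [Housewives (not otherwise employed)]
9. Schüler, Studenten, Lehrling [Pupils, students, apprentice]
0. Arbeitslos [Unemployed]

DE - Eurobarometer 18 - 22 [F. ] Sind Sie persönlich berufstätig? [Are you yourself employed?]
1. Voll berufstätig (einschl. vorübergehend arbeitslos) [Employed full time (incl. temporarily unemployed)]
2. Teilweise berufstätig (einschl. vorübergehend arbeitslos) [Employed part time (incl. temporarily unemployed)]
3. Rentner, Pensionär (früher berufstätig) [Retired, pensioner (formerly employed)]
4. Rentner, Pensionär (früher nicht berufstätig) [Retired, pensioner (formerly not employed)]
5. (In Ausbildung) Lehrling [(In training) apprentice]
6. (In Ausbildung) Schüler, Student [(In education) pupil, student]
7. Nicht berufstätig, aber früher berufstätig gewesen [Not employed, but formerly employed]
8. Noch nie berufstätig gewesen [Never been employed]

DE - EB 18 - 23 [F. ] Welchen Beruf üben Sie zur Zeit aus, bzw. haben Sie zuletzt ausgeübt? [What is your current occupation, or what was your last occupation?]
11. Einfache Angestellte [Lower-level salaried employees]
12. Mittlere Angestellte [Mid-level salaried employees]
13. Qualifizierte Angestellte [Qualified salaried employees]
14. Leitende Angestellte [Senior salaried employees]
15. Ungelernte Arbeiter [Unskilled workers]
16. Angelernte Arbeiter [Semi-skilled workers]
17. Einfache Facharbeiter [Skilled workers]
18. Qualifizierte Facharbeiter [Highly skilled workers]
21. Kleinere Selbständige [Smaller self-employed]
22. Mittlere Selbständige [Medium self-employed]
23. Größere Selbständige [Larger self-employed]
24. Freie Berufe (z.B. Arzt, Anwalt) [Liberal professions (e.g. doctor, lawyer)]
25. Beamte einfacher Dienst [Civil servants, lower grade]
26. Beamte mittlerer Dienst [Civil servants, intermediate grade]
27. Beamte gehobener Dienst [Civil servants, upper-intermediate grade]
28. Beamte höherer Dienst [Civil servants, higher grade]
31. Selbständige Landwirte - kleine (unter 5 ha) [Self-employed farmers - small (under 5 ha)]
32. Selbständige Landwirte - mittlere (5- unter 20 ha) [Self-employed farmers - medium (5 to under 20 ha)]
32. Selbständige Landwirte - große (20 ha +) [Self-employed farmers - large (20 ha or more)]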
DDI3: GROUPING and COMPARISON [Diagram: group structure for the occupation example; connections between boxes are not recoverable from the extracted text] • Level labels: Group (standards, inheritance); Resource (standards, comparison to instances); Subgroup (overwriting, additions, inheritance); Subgroup (no overwriting, comparison to instances); Study Unit (overwriting, additions) • Boxes: EB OCCUPATION Trend Standard (DataCollection, LogicalProduct); DE 3-16; DE 17; DE 18-22; DE 18-23; DE 23-29; DE 24-29; DE 30-36; DE 37 ff.; EB 30, EB …, EB 36 (each with DataCollection, LogicalProduct); EBCH 2000-2003 OFS/ISCO-88
Versioning and Maintenance • There are three classes of objects: • Identifiable (has ID) • Versionable (has version and ID) • Maintainable (has agency, version, and ID) • Very often, identifiable items such as Codes and Variables are maintained in parent schemes
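The three classes of objects nest naturally, each adding one piece of identifying information. A minimal sketch using Python dataclasses follows; the class and field names are hypothetical illustrations and are not the DDI schema types.

```python
from dataclasses import dataclass


@dataclass
class Identifiable:
    """Has an ID only."""
    id: str


@dataclass
class Versionable(Identifiable):
    """Has an ID and a version."""
    version: str


@dataclass
class Maintainable(Versionable):
    """Has an ID, a version, and a maintenance agency."""
    agency: str


# Hypothetical example values.
code = Identifiable(id="Code110")
variable_scheme = Maintainable(id="VarScheme1", version="1.0", agency="example.org")
print(variable_scheme)
```

Because each class extends the previous one, anything maintainable is also versionable and identifiable, which matches the slide's description.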
Rationale • Because several organizations are involved in the creation of a set of metadata throughout the lifecycle flow: • Rules for maintenance, versioning, and identification must be universal • Reference to other organization’s metadata is necessary for re-use – and very common
Maintenance Rules • A maintenance agency is identified by its domain name (as for its website and e-mail) • Maintenance agencies own the objects they maintain • Only they are allowed to change or version the objects • Other organizations may reference external items in their own schemes, but may not change those items • You can make a copy which you maintain, but once you do that, you own it!
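These ownership rules can be expressed as a small guard. The sketch below is illustrative only: `MaintainedScheme`, `NotOwnerError`, and the example domain names are hypothetical and do not come from DDI.

```python
from dataclasses import dataclass, field


class NotOwnerError(Exception):
    """Raised when a non-owning agency tries to change a maintained object."""


@dataclass
class MaintainedScheme:
    agency: str  # domain name of the maintenance agency
    id: str
    items: dict = field(default_factory=dict)

    def change_item(self, acting_agency, key, value):
        """Only the owning agency may change the scheme."""
        if acting_agency != self.agency:
            raise NotOwnerError(acting_agency + " does not maintain " + self.id)
        self.items[key] = value

    def copy_for(self, new_agency):
        """Make an independent copy that the new agency owns and maintains."""
        return MaintainedScheme(agency=new_agency, id=self.id, items=dict(self.items))


original = MaintainedScheme("example.org", "OccupationCodes", {"110": "FARMER"})
local_copy = original.copy_for("example.net")
local_copy.change_item("example.net", "110", "FARMER / FISHERMAN")
print(original.items, local_copy.items)
```

The copy can be changed freely by its new owner while the original stays untouched, which is the practical meaning of "once you do that, you own it".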
Versioning Rules • If an object changes in any way, its version changes • This will change the version of any containing maintainable object • Typically, objects grow and are versioned as they move through the lifecycle • Versionable objects inherit their agency from the maintainable scheme they live in
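The propagation rule (a change to an object also changes the version of its containing maintainable) might look like this in outline. Integer version numbers and all class names here are simplifying assumptions for illustration; the slides do not specify a version format.

```python
from dataclasses import dataclass, field


@dataclass
class VersionableItem:
    id: str
    content: str
    version: int = 1  # assumed integer versions, for simplicity


@dataclass
class MaintainableScheme:
    agency: str
    id: str
    version: int = 1
    items: dict = field(default_factory=dict)

    def update_item(self, item_id, new_content):
        """Change an item; bump its version and the containing scheme's version."""
        item = self.items[item_id]
        if item.content == new_content:
            return  # nothing changed, so no new version
        item.content = new_content
        item.version += 1
        self.version += 1


scheme = MaintainableScheme("example.org", "VarScheme1")
scheme.items["OCCUP"] = VersionableItem("OCCUP", "Occupation of respondent")
scheme.update_item("OCCUP", "Occupation of respondent (harmonized)")
print(scheme.version, scheme.items["OCCUP"].version)
```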
Identifiable Rules • Identifiers are assigned to each identifiable object, and are unique within their maintained parent scheme • Identifiable objects inherit their version from their containing versionable parent (if any) • Identifiable objects inherit their maintaining agency from the maintainable object they live in
Referencing • When referencing an object, you must provide: • The maintenance agency • The identifier • The version • Often, these are inherited from a maintained scheme • This is part of their identification
Identification • Identification can be by URN or a series of fields • The fields make up the parts of the URN and can be used to compose it • A number of fields can inherit information from a maintainable parent
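Composing an identifier from fields, with missing fields inherited from the maintainable parent, can be sketched as follows. The URN layout used here (`urn:ddi:agency:id:version`) is a placeholder chosen for illustration and should not be taken as the DDI 3.0 URN syntax; the function names and example values are likewise hypothetical.

```python
def resolve_fields(reference, parent):
    """Fill in agency and version from the maintainable parent when absent."""
    resolved = dict(reference)
    for name in ("agency", "version"):
        if resolved.get(name) is None:
            resolved[name] = parent[name]
    return resolved


def compose_urn(fields):
    """Build a URN string from the three identifying fields (assumed layout)."""
    return "urn:ddi:{agency}:{id}:{version}".format(**fields)


parent_scheme = {"agency": "example.org", "id": "VarScheme1", "version": "1.0"}
variable_ref = {"id": "OCCUP"}  # agency and version are inherited

urn = compose_urn(resolve_fields(variable_ref, parent_scheme))
print(urn)
```

A reference that states its own agency or version keeps those values; only the missing fields are taken from the parent.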