= information not in proceedings or on CD

Presentation Transcript


  1. IPUMS-International: High Precision Population Census Samples: Balancing the Privacy-Quality Tradeoff by Means of Restricted Access Microdata Extracts
     https://www.ipums.org/international
     Robert McCaa, Steven Ruggles, Michael Davern, Tami Swenson, and Krishna Mohan Palipudi
     Minnesota Population Center
     rmccaa@umn.edu
     = information not in proceedings or on CD

  2. Outline of paper (in proceedings, except “0.”)
     0. What’s a historian doing at PSD2006?
     1. Introduction: The Trusted User Approach
     2. The Case for High Precision Samples: The USA Experience
     3. High Precision Samples with Implicit Stratification
     4. Access Disclosure Controls
     5. Technical Disclosure Controls
     6. Fear, Hysteria and Paranoia
     7. Conclusions and Future Work

  3. Help!!! Why am I (a historian) here?
     • To learn from you to enhance IPUMS-International privacy and confidentiality techniques
     • To inform you of our existence and the challenges we face
     • To invite your contributions, as producers, users, and creators of statistical confidentiality methods
     • To advertise opportunities for post-docs, staff
     • To invite statistical agencies to entrust census microdata to the project

  4. Imagine!!! What’s the problem? Confidentializing IPUMS-International, an integrated microdatabase with:
     • 150 census samples of households (50 countries)
     • Containing 300 million person records with hundreds of variables
     • Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work
     • Not a single allegation of violation of privacy or statistical confidentiality. Ever!!

  5. IPUMS-International: a restricted-access, web-based census microdata extraction system
     • Password protected: to make and retrieve extracts
     • Licensed researcher selects:
       • Countries,
       • Censuses,
       • Cases/sub-populations,
       • Variables, and
       • Sample densities
     • Extract engine queues request, generates extract
     • Researcher retrieves extract via web with SSL 128-bit encryption and analyzes using own wares (soft/hard/wet)
     • NO: CDs, original codes, or complete datasets

  6. 6 steps using https://www.ipums.org/international:
     1. Log on with password
     2a. Study documentation
     2b. Design extract
     3. Receive email; log on with password
     4. Download extract (SSL encrypted)
     5. Unzip data
     6. Analyze
     (also SAS, STATA)

  7. IPUMS-International, December 2006
     dark green = disseminating (20 countries, 63 censuses, 185 million person records [mpr])
     green = harmonizing (37 countries, 100 censuses, 200mpr)
     lightest green = negotiating
     69 countries, 58% of the world's population

  8. What has happened since Geneva (xi/05)?
     • NSF-USA renewed funding for 5 years
     • Database grew: 12 countries, 35 censuses, 65mpr
     • More agreements signed, census data acquired
     • New, dynamic metadata system implemented
     • Number of users doubled
     • Publications are taking off
     • Paris Workshop (INED/CEPED): delegates from 14 European countries and 10 non-European, plus academic researchers

  9. IPUMS-Europe, December 2006
     Dark green = Disseminating (5 countries, 15 censuses, 27mpr)
     In Lisbon: Portugal and Hungary will become “dark green” with the launch of samples for 4 censuses each for Argentina and Hungary, 3 each for Portugal and Israel, 2 each for Egypt and Rwanda, and 1 each for Gaza and the West Bank

  10. What will happen by Lisbon (ISI, viii/07)?
     • Confidentiality methods will be enhanced
     • Database will grow: 7 countries, 19 censuses, 25mpr
     • Dynamic metadata system will be expanded
     • Number of users will increase!!!
     • Publications!!!
     • IPUMS Workshop (Sat Aug 25 at INE-Pt) for producers and users (registration required; please email rmccaa@umn.edu)
     • Microdata Session (Fri Aug 24)
     • Free mug!*
     *Special conditions apply

  11. 1. Introduction: The “trusted-user” approach to disseminating integrated, anonymized census microdata samples

  12. MBNA: world’s largest independent credit card issuer, specialist in affinity marketing
     • 1982: MBNA founded by Charles Cawley: instead of competing on price, compete on affinity
     • 1983: Georgetown Univ Alumni Association (Cawley’s alma mater) supplied MBNA with names and addresses of its members in exchange for a percentage of revenues on card usage
     • Big hit! Large number of new accounts, low risk, high spenders
     • 1985: new groups: American Dental Association, Aircraft Owners and Pilots Association, National Education Assoc.
     • 1994: Sierra Club, 45,000 members signed with MBNA, generating $400,000 annually for Sierra Club
     • The rest is history!
     • 2005:

  13. MBNA: world’s largest independent credit card issuer, specialist in affinity marketing
     • 1982: MBNA founded by Charles Cawley: instead of competing on price, compete on affinity
     • 2005: MBNA, with 25,000 employees, acquired by Bank of America, US$35 billion
     • How many credit cards do you have?
     • How many affinity credit cards do you have?

  14. IPUMS-International: world’s largest provider of integrated census microdata to trusted users
     • 1999: Founded by Steven Ruggles and Bob McCaa: restrict access to trusted users, and apply corresponding confidentiality techniques
     • 2002: 1st release of integrated samples for 7 countries; >200 users in first year
     • Big hit! 69 countries signed; 57 entrusted data to IPUMS, datasets for more than 230 censuses, >150 entire datasets
     • 2006,

  15. IPUMS-International: world’s largest provider of integrated census microdata to trusted users
     • 1999: Founded; seeks neither profits nor popularity!
     • 2006, 3rd release:
       • data for 20 countries, samples for 63 censuses,
       • 185 million person records,
       • >1,000 users
     • 2009, 8th release:
       • data for 50 countries, samples for ~150 censuses
       • >300 million person records
       • thousands of users
     • Note: data extracts are provided only to licensed users.

  16. 2. The case of High Precision Samples: The USA Experience

  17. 2. High Precision Samples: The Case of the USA
     • Beginning with the 1980 census, the US Census Bureau released 5% samples of households
     • Not a single allegation of misuse
     • 1988: first articles using high precision samples published in Demography:
       • Language use and fertility in the Mexican origin population
       • Household size and regional outmigration
     • 1996: IPUMS-USA samples available via internet
     • Available at no cost to researchers worldwide
     • 81% of articles in Demography, since 1990, use high precision samples
     • In 2000 & 2001, high precision census microdata used twice as often as the next most common data source
     • Analyze information for small population subgroups
     • Very large census microdata samples are among the most powerful tools available for economic and demographic analysis

  19. 3. High Precision Samples with Implicit Stratification
     Note: almost all NSIs are supplying household samples drawn to IPUMS specifications (every nth household from 100% fine-grained geographically stratified microdata); see table 1

  20. IPUMS-International: High precision samples with implicit stratification
     • Suppress all identifying information: names, id numbers, street addresses, low-level administrative geography (NUTS-5, NUTS-4?, NUTS-3?, NUTS-2?)
     • Sample is stratified by lowest level geography (census tract)
     • Lower standard errors than a classic random sample, to the extent that variables of interest are correlated with geography
     • Implicit geographical stratification is equivalent to extremely fine geographic stratification with proportional weighting
     • Many of our NSI partners have adopted the IPUMS sample design (see table 1).
     • 26 countries provided 100% microdata for the MPC to draw the sample
     • Europe: almost all NSIs have drawn samples to IPUMS specs. for all censuses
     • High precision samples for 57 countries entrusting microdata (12/12/2006)
       • 10% samples: 43 countries
       • 5%: 10 countries
       • <5%: 4 countries
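The sample design on this slide (every nth household drawn from a full file sorted by fine geography) can be sketched roughly as follows. This is only an illustration, not the MPC's or any NSI's actual procedure: the function name `systematic_sample`, the toy `full_file` of (tract, household) pairs, and the 1-in-10 rate are all assumptions made for the example.

```python
import random

def systematic_sample(sorted_households, n, seed=0):
    """Take every n-th household from a file already sorted by geography.

    Sorting by geography first is what makes the stratification "implicit":
    no strata are declared, but each small area contributes in proportion
    to its size.
    """
    rng = random.Random(seed)
    start = rng.randrange(n)          # random start in [0, n)
    return sorted_households[start::n]

# Hypothetical 100% file: 50 census tracts with 200 households each.
full_file = [(tract, hh) for tract in range(50) for hh in range(200)]
full_file.sort()                      # sort by lowest-level geography

sample = systematic_sample(full_file, n=10)   # roughly a 10% sample

# Count sampled households per tract to see the proportional allocation.
per_tract = {}
for tract, _ in sample:
    per_tract[tract] = per_tract.get(tract, 0) + 1
```

In this toy file every tract ends up with the same number of sampled households, which is the proportional allocation the slide describes; a simple random sample of the same size would not guarantee that.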

  22. 4. Access Disclosure Controls
     a. Memorandum with NSI
     b. License with researchers

  23. A. NSI with U of Minnesota

  24. A. NSI with U. of Minnesota(2005+)

  25. IPUMSi LICENSE. B. License with researchers
     Restricted Access web-based system
     Legally-binding license agreement
     • forces would-be snoopers to violate law by which they can be fined and jailed
     • protects privacy and confidentiality
     • assures proper use
     Access limited to:
     • Bona-fide researchers (credentials)
     • With a demonstrated scientific need
     • who agree to abide by license restrictions
       • Confidentiality
       • No redistribution
       • Safely secured
       • Alleging that a person has been identified is prohibited

  26. IPUMSi LICENSE. B. License with researchers
     Restricted Access web-based system
     Legally-binding license agreement
     • forces would-be snoopers to violate law
     • protects privacy and confidentiality
     • assures proper use
     Access limited to:
     • Bona-fide researchers (credentials)
     • With a demonstrated scientific need
     • who agree to abide by license restrictions
       • Confidentiality
       • No redistribution, no commercial use
       • Safely secured
       • Alleging that a person can be or has been identified is illegal

  27. “Apply for Access”

  28. License is for 1 year, renewable. End of application

  29. 5. Technical Disclosure Controls

  30. IPUMSi CONFIDENTIALIZES: technical measures are also applied, in addition to the legal & administrative protections
     » Suppress geographical detail
     » Blur/aggregate sensitive codes
     » Convert dates to ages (blur key vars.)
     » Swap cases between districts
     » Scramble order of records
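A rough sketch of four of the measures listed on this slide (suppressing geographic detail, blurring codes, converting dates to ages, and scrambling record order) is given below. It is an assumption-laden illustration, not the IPUMS implementation: the record fields (`region`, `district`, `birthdate`, `occupation`), the two-digit truncation of the occupation code, and the function name `confidentialize` are all hypothetical. Swapping cases between districts is not shown.

```python
import random
from datetime import date

def confidentialize(records, census_day, seed=0):
    """Return a scrambled copy of records with detail reduced."""
    released = []
    for r in records:
        born = r["birthdate"]
        # Convert date of birth to age in completed years on census day.
        had_birthday = (census_day.month, census_day.day) >= (born.month, born.day)
        age = census_day.year - born.year - (0 if had_birthday else 1)
        released.append({
            "region": r["region"],              # keep only coarse geography; district is dropped
            "age": age,                         # birthdate itself is not released
            "occupation": r["occupation"][:2],  # aggregate detailed code to 2 digits
        })
    random.Random(seed).shuffle(released)       # scramble the order of records
    return released

# Hypothetical input records.
records = [
    {"region": "North", "district": "N-014", "birthdate": date(1970, 6, 15), "occupation": "2341"},
    {"region": "North", "district": "N-002", "birthdate": date(1970, 2, 1), "occupation": "6111"},
    {"region": "South", "district": "S-101", "birthdate": date(1935, 12, 31), "occupation": "2342"},
]
released = confidentialize(records, census_day=date(2001, 3, 1))
```

Shuffling matters because a file left in collection order can leak geography even after the geographic codes themselves are removed.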

  31. EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International
     • 1. Restrict access to samples
     • 2. Limit geographical detail
     • 3. Re-code unique categories--top and bottom
     • 4. Sign non-disclosure agreement
     • 5. Prohibit redistribution to third parties
     • 6. Prohibit attempts to identify individuals or making any claim to that effect
     • 7. Require users to provide copies of publications

  32. EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International
     • 8. Construct age from birthdate, if necessary
     • 9. Do not identify date of birth
     • 10. Do not identify precise place of birth
     • 11. Migration: timing/place not identified in detail
     • 12. Identify place of residence by major civil division (pop>20k, 60k, 100k, 1 million, i.e., national convention)
     • 13. Do sensitivity analysis (not yet)
     • 14. Do confidentiality assessment (not yet)
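Two of the standards above, top and bottom re-coding (item 3) and identifying residence only for sufficiently populous civil divisions (item 12), lend themselves to a small illustration. The sketch below is hypothetical: the function names, the age cap of 90, the place names, and the choice of the 20,000 threshold from the list of national conventions are assumptions for the example, not rules taken from the standards themselves.

```python
def top_bottom_code(value, low, high):
    """Clamp a value into [low, high] so extreme values are not unique."""
    return max(low, min(value, high))

def code_residence(unit, populations, threshold, other="unidentified"):
    """Release a civil division's name only if its population meets the threshold."""
    return unit if populations[unit] >= threshold else other

# Hypothetical data.
ages = [3, 45, 97, 104]
coded_ages = [top_bottom_code(a, low=0, high=90) for a in ages]

populations = {"Metroville": 1_200_000, "Midtown": 64_000, "Smallville": 8_500}
coded_places = {u: code_residence(u, populations, threshold=20_000) for u in populations}
```

Here the two oldest ages collapse into the top category and the smallest division is released only as "unidentified"; raising `threshold` to 100,000 would suppress Midtown as well.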

  33. 6. Countering Fear, Hysteria and Paranoia…with reason
     “There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK], nor in any other countries that disseminate samples of microdata”
     --Elliott and Dale, Journal of the Royal Statistical Society, 1999

  34. No census microdata!! Why Not?
     Companies want linkable data with names, addresses, ID #s, etc.
     Probabilistic linking with 90% of the population missing is not good enough
     ChoicePoint Data Sources and Clients. Source: Washington Post
     http://www.choicepoint.com/

  35. No census microdata!! http://www.aclu.org/pizza/

  36. “…there are no known incidents of researchers using their access to microdata to deliberately identify individuals...”
     --Managing Statistical Confidentiality and Microdata Access: Principles and Guidelines of Good Practice, UNECE, Conference of European Statisticians, Task Force on Census Microdata (October 2006), p. 19
     http://www.unece.org/stats/documents/tfcm/1.e.pdf

  37. “Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard: ‘It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.’”
     --Protocols on Data Access and Confidentiality, pp. 7-8, ONS-UK (2004)
     www.statistics.gov.uk/about_ns/cop/downloads/prot_data_access_confidentiality.pdf

  38. 7. Conclusions and Future Work

  39. IPUMS-International strengths
     • Uniform legal authorization with national statistical authorities
     • Access restricted to academics with need who agree to abide by stringent confidentiality protections
     • Experienced integration teams
     • Proven web-based distribution system
     • High user satisfaction
     • Sustainable: NSF, NIH, FP-6 (7?) funded (Europe only)

  40. Significant weakness: statistical disclosure controls…as a result of PSD2006, we will:
     • Re-consider our portfolio of statistical disclosure controls
     • Implement a uniform set of controls across all samples and countries
     • Do sensitivity analysis
     • Do confidentiality assessment
     • Revise our documentation on the confidentializing of datasets for each country, describing principles, but not the “keys”
     • Cite bibliography for users to confidentialize tables and graphs

  41. IPUMS-International, August 2009???
     dark green = disseminating (50 countries, 150 censuses, 300mpr)
     green = harmonizing (?? countries, ?? censuses, ???mpr)
     lightest green = negotiating
     2009? --ISI

  42. Thank you!
     https://www.ipums.org/international
     Additional information at: www.hist.umn.edu/~rmccaa/
     Contact: rmccaa@umn.edu
