Synthetic Data Generation

Presentation Transcript


  1. Synthetic Data Generation - Darshana Pathak

  2. Synthetic Data • The process of creating a realistic data set. • Realistic means having the characteristics of real-world data: • Errors • Duplicates • Similar entities • Changing data

  3. Types of Errors: • Spelling mistakes • Typographical errors • Insert, replace, delete • Transposition errors • Missing attributes • Computational errors (e.g. year) • …
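A minimal Python sketch of how the typographical and transposition errors listed above might be injected into a clean string field. The function names and the uniform random choice of error type are assumptions made for illustration; the slides do not specify how errors were introduced or in what proportions.

```python
import random
import string


def insert_char(s, rng):
    """Insert one random lowercase letter at a random position."""
    i = rng.randrange(len(s) + 1)
    return s[:i] + rng.choice(string.ascii_lowercase) + s[i:]


def replace_char(s, rng):
    """Replace one character with a random lowercase letter."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(string.ascii_lowercase) + s[i + 1:]


def delete_char(s, rng):
    """Delete one character at a random position."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]


def transpose_chars(s, rng):
    """Swap two adjacent characters (transposition error)."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]


def drop_attribute(record, field):
    """Return a copy of the record with one attribute blanked (missing attribute)."""
    out = dict(record)
    out[field] = ""
    return out


def corrupt(value, rng):
    """Apply one error type chosen uniformly at random (an assumed distribution)."""
    op = rng.choice([insert_char, replace_char, delete_char, transpose_chars])
    return op(value, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    print(corrupt("pathak", rng))
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which helps when the generated data is used to compare record linkage methods.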

  4. Why do we need it? • Availability of data suitable for record linkage and data visualization research • With all required attributes easily available • Privacy concerns • Personally Identifying Information • IRB approvals • Information disclosure laws

  5. Base Data • We made our task easier by using a real data set as the base data from which to generate synthetic data. • The idea of using voter registration data comes from a Vanderbilt University student’s PhD dissertation. • Voter registration data for one of the large counties in NC. • http://www.wakegov.com/elections/8data.htm

  6. Voter Registration Data • Why Voter Registration Data is Available: • According to North Carolina law (General Statute 132), "The public records and public information compiled by the agencies of North Carolina Government or its subdivisions are the property of the people. Therefore, it is the policy of this State that the people may obtain copies of their public records and public information free or at minimal cost unless otherwise specifically provided by law." (Voter registration records are not exempt from this law.)

  7. Data Generation - 1 • Pretty clean data!!! • Introduce realistic errors… Chicken and egg problem. • How do we know the pattern and percentage of different types of errors in real data? • If we knew the answer to this question, we could have easily solved the record linkage problem. • Insert an ID/SSN-like column • Registration number

  8. Please Read: Date of birth is not provided in voter records. Per § 163-82.10, effective June 1, 2005, dates of birth that may be generated in the voter registration process, by either the State Board of Elections or a County Board of Elections, are confidential and shall not be considered public records and subject to disclosure to the general public under Chapter 132 of the General Statutes. No list produced under this section shall contain a voter's date of birth; however, lists may be produced according to voters' ages. • Bingo! We have ages, we can get the birth year!

  9. Data Generation - 2 • Insert a DOB column into the voter data set. • Birth year = current year – age. • Day and month? • Simulate duplicates, twins, couples and families based on last name, address and age, and assign DOBs accordingly.
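The DOB step above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the record keys `last_name`, `address` and `age` are hypothetical, day and month are drawn uniformly at random (the slide leaves them as an open question), and records sharing last name, address and age are treated as twins or duplicates that receive the same DOB. Couples and families are not modeled here. Note that current year minus age can be off by one for people whose birthday has not yet occurred in the current year.

```python
import calendar
import random
from datetime import date


def birth_year(age, current_year):
    """Birth year = current year - age, as stated on the slide."""
    return current_year - age


def random_dob(age, current_year, rng):
    """Pick a uniformly random valid day and month within the derived birth year."""
    year = birth_year(age, current_year)
    month = rng.randint(1, 12)
    last_day = calendar.monthrange(year, month)[1]  # number of days in that month
    return date(year, month, rng.randint(1, last_day))


def assign_dobs(records, current_year, rng):
    """Add a 'dob' field to each record.

    Assumption: records with the same (last_name, address, age) are twins
    or duplicates, so they share one DOB.
    """
    shared = {}
    out = []
    for rec in records:
        key = (rec["last_name"], rec["address"], rec["age"])
        if key not in shared:
            shared[key] = random_dob(rec["age"], current_year, rng)
        out.append({**rec, "dob": shared[key]})
    return out


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed for repeatable output
    voters = [
        {"last_name": "Smith", "address": "1 Oak St", "age": 30},
        {"last_name": "Smith", "address": "1 Oak St", "age": 30},  # twin or duplicate
        {"last_name": "Jones", "address": "9 Elm Rd", "age": 52},
    ]
    for v in assign_dobs(voters, 2005, rng):
        print(v)
```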

  10. Future Plan • Use machine learning techniques to simulate real-world data errors during synthetic data generation?
