1 / 22

Dirty Data Data Cleansing

Dirty Data Data Cleansing. Xxxxxx DSCI 5240 December 4, 2012. Introduction. Real data is dirty Why clean? Eliminate duplicates Smaller database Accurate statistics The problem Merge/Purge of large databases. Preview. Data Cleansing Solutions Real World Data OCAR’s Data Conclusion.

urban
Download Presentation

Dirty Data Data Cleansing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Dirty DataData Cleansing Xxxxxx DSCI 5240 December 4, 2012

  2. Introduction • Real data is dirty • Why clean? • Eliminate duplicates • Smaller database • Accurate statistics • The problem • Merge/Purge of large databases

  3. Preview • Data Cleansing Solutions • Real World Data • OCAR’s Data • Conclusion

  4. Data Cleansing Solutions • Sorted-Neighborhood Method • Equational Theory • Transitive Closure

  5. Sorted-Neighborhood Method • Three phases • 1. create keys • 2. sort the data • 3. merge • Three passes using different key • Multi-pass method

  6. Sorted-Neighborhood Method • Key selection

  7. Sorted-Neighborhood Method • Sort using the key selected

  8. Sorted-Neighborhood Method • A ‘window size’ is created for merging

  9. Merge Phase - Equational Theory • A set of equation rules that defines equivalence • A type of clustering function (pattern recognition) • Rules may require an expert

  10. Merge Phase - Equational Theory English rules: Given two records, r1 and r2. IF (the last names of r1 equals the last name of r2, AND the first names differ slightly, AND the address of r1 equals the address of r2) THEN R1 is equivalent to r2

  11. Merge Phase - Equational Theory Results r1 r2

  12. Merge Phase - Transitive Closure • Applied to a single pass sorted-neighborhood method • Improvement of accuracy • Decreases processing time and cost

  13. Merge Phase - Transitive Closure English rules: Given three records a, b and c. IF (a is similar to b AND b is similar to c) THEN a is similar to c

  14. Real World Data • State of Washington Department of Social and Health Services • Office of Children Administrative Research (OCAR) of the Department of Social and Health Services

  15. OCAR’s Data • 6,000,000 records • Grows by 50,000 per month • 19 fields • First and last name • Birthdate • SSN • Case number • Worker ID • Gender • Race • Service ID • Service dates • Payments

  16. OCAR’s Data - Problems • Names misspelled • Missing birthdates • Missing or wrong SSN • Multiple case numbers • Ghost records

  17. OCAR’s Data - Goals • To answer: • “How many children are in foster care?” • “How long do children stay in foster care?” • “How many different homes do children typically stay in?”

  18. OCAR’s Data - Cleaning • 128,438 records sampled (one service office) • Consulted with expert1 • 24 rules established • Used sorted-neighborhood multi-pass methods • Applied equational theory • Keys • 1. Last name, First name, SSN, and Case number • 2. First name, Last name, SSN, and Case number • 3. Case number, First name, Last name, and SSN 1Timothy Clark, OCAR Computer Information Consultant

  19. OCAR’s Data - Results • Identified 8,504 individuals in sample • 45.8% correctly classified • 86.0% where correctly merged • Multi-pass sorted-neighborhood confirmed

  20. Review • Multi-pass sorted-neighborhood method • Equational method • OCAR’s data

  21. Conclusions • Sort-neighborhood method can be expensive • During the sorting phase • Process time • improved accuracy • Multiple times • Small windows • Computation of the transitive closure

  22. Sources • Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem; Mauricio A. Hernandez and Salvatore J. Stolfo; Department of Computer Science, Columbia University, New York, NY 10027. • Haiguang Li, 2011 class presentation • www.cs.columbia.edu/~sal • http://www.dshs.wa.gov/default.shtm

More Related