
Real-World Data Is Dirty


Presentation Transcript


  1. Real-World Data Is Dirty Data Cleansing and the Merge/Purge Problem M. Hernandez & S. Stolfo, Columbia University, 1998. Class presentation by Chad Foley, April 9, 2014. Inspired by Haiguang Li's slides.

  2. TOPICS • Introduction • A Basic Data Cleansing Solution • Test & Real World Results • Incremental Merge/Purge w/ New Data • Conclusion • Recap

  3. Introduction

  4. The problem: • Some corporations acquire large amounts of information every month • The data is stored in many large databases (DBs) • These databases may be heterogeneous • Variations in schema • The data may be represented differently across the various datasets • Data in these DBs may simply be inaccurate

  5. Requirements of the analysis • The data mining needs to be done • Quickly • Efficiently • Accurately

  6. Examples of real-world applications • Credit card companies • Assess risk of potential new customers • Find false identities • Match disparate records concerning a customer • Mass Marketing companies • Government agencies

  7. A Basic Data Cleansing Solution

  8. Duplicate Elimination • Sorted-Neighborhood Method (SNM) • This is done in three phases • Create a Key for each record • Sort records on this key • Merge/Purge records

  9. SNM: Create key • Compute a key for each record by extracting relevant fields or portions of fields • O(N) • Example: see the sketch below
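The slide's example table is not reproduced in the transcript. Below is a minimal Python sketch of a key of this kind, reverse-engineered from the keys shown on the multi-pass slide (slide 25); the field names and exact extraction rules are illustrative, not the paper's precise scheme:

```python
def consonants(s: str, n: int) -> str:
    """First n consonant letters of s, uppercased."""
    return "".join(c for c in s.upper() if c.isalpha() and c not in "AEIOU")[:n]

def make_key(rec: dict) -> str:
    """Sort key built from fragments of several fields: consonants of
    the last name, first three letters of the first name, street
    number, consonants of the street name, first three SSN digits."""
    number, street = rec["address"].split(" ", 1)    # "48", "North St."
    return (consonants(rec["last"], 3)
            + rec["first"][:3].upper()
            + number
            + consonants(street.split(" ")[0], 4)
            + rec["ssn"][:3])

rec = {"first": "Kathi", "last": "Kason",
       "address": "48 North St.", "ssn": "789912345"}
print(make_key(rec))    # -> KSNKAT48NRTH789
```

The sort phase (next slide) then reduces to sorting the record list on this computed key, e.g. `records.sort(key=make_key)`.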

  10. SNM: Sort Data • Sort the records in the data list using the key in step 1 • O(NlogN) • This can be very time consuming

  11. SNM: Merge records • Move a fixed-size window through the sequential list of records • This limits the comparisons to the records in the window • O(wN) comparisons for window size w (see the sketch below)
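A minimal sketch of the window scan, assuming the records are already sorted by the key and that a matches(a, b) predicate encodes the equational theory discussed on later slides (both names are illustrative):

```python
def snm_merge(sorted_records, window, matches):
    """Sliding-window merge phase of SNM (sketch).

    Each record is compared only against the window - 1 records
    immediately before it in sort order, giving O(wN) comparisons
    instead of the O(N^2) of all-pairs matching.
    """
    pairs = []
    for i in range(len(sorted_records)):
        for j in range(max(0, i - window + 1), i):
            if matches(sorted_records[j], sorted_records[i]):
                pairs.append((j, i))
    return pairs
```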

  12. SNM: Considerations • What window size is optimal for • Maximizing accuracy • Minimizing computational cost • Execution time for large DBs will be bound by • Disk I/O • Number of passes over the data set

  13. Selection of Keys • The effectiveness of the SNM depends heavily on the key selected to sort the records • A key is defined to be a sequence of a subset of attributes • Keys must provide sufficient discriminating power

  14. Example of Records and Keys

  15. Equational Theory • The comparison during the merge phase is an inferential process • Compares much more information than simply the key • The more information there is, the better the inferences that can be made

  16. Equational Theory - Example • Two names are spelled nearly identically and have the same address • It may be inferred that they are the same person • Two social security numbers are the same but the names and addresses are totally different • Could be the same person who moved • Could be two different people and there is an error in the social security number

  17. A simplified rule in English Given two records, r1 and r2:
  IF the last name of r1 equals the last name of r2,
  AND the first names differ slightly,
  AND the address of r1 equals the address of r2,
  THEN r1 is equivalent to r2
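Transcribed into code, the rule might look like the sketch below, assuming an edit_distance function (one appears under the next slide) and a small illustrative threshold for "differ slightly":

```python
def differ_slightly(a: str, b: str, threshold: int = 2) -> bool:
    """'Differ slightly' = small edit distance; the threshold of 2 is
    an illustrative choice, not the paper's tuned value. Identical
    strings (distance 0) also count as a match here."""
    return edit_distance(a, b) <= threshold

def same_person(r1: dict, r2: dict) -> bool:
    """The slide's simplified rule: equal last names, slightly
    differing first names, and equal addresses."""
    return (r1["last"] == r2["last"]
            and differ_slightly(r1["first"], r2["first"])
            and r1["address"] == r2["address"])
```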

  18. The distance function • A “distance function” is used to compare pieces of data (edit distance here) • Apply the distance function to data that “differ slightly” • Select a threshold that captures obvious typographical errors • The threshold impacts the number of successful matches and the number of false positives

  19. The distance function • Edit distance: quantifies how dissimilar two strings are by counting the minimum number of single-character operations (insertions, deletions, substitutions) required to transform one string into the other
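A standard dynamic-programming implementation (Levenshtein distance), shown here as a reference sketch; the paper's actual distance function may weight operations differently:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

print(edit_distance("Kathi", "Kathy"))   # 1 (substitute i -> y)
```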

  20. Examples of matched records

  21. Building an equational theory • The process of creating a good equational theory is similar to the process of creating a good knowledge-base for an expert system • In complex problems, an expert’s assistance is needed to write the equational theory

  22. Transitive Closure • In general, no single pass (i.e. no single key) will be sufficient to catch all matching records • An attribute that appears first in the key has higher discriminating power than those appearing after it • If an employee has two records in a DB with SSNs 193456782 and 913456782, it’s unlikely they will fall within the same window

  23. Transitive Closure • To increase the number of similar records merged • Widen the scanning window size (w), or • Execute several independent runs of the SNM • Use a different key each time • Use a relatively small window • Call this the Multi-Pass approach (see the sketch below)
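A sketch of the multi-pass driver, reworking the window scan from slide 11's sketch to track original record indices so that pairs produced under different sort orders can be combined (all names are illustrative):

```python
def multi_pass(records, key_funcs, window, matches):
    """Run SNM once per key function, collecting pairs of *original*
    record indices; each pass sorts on its own key and scans with a
    small window."""
    pairs = set()
    for keyf in key_funcs:
        order = sorted(range(len(records)), key=lambda i: keyf(records[i]))
        for pos in range(len(order)):
            i = order[pos]
            for j in order[max(0, pos - window + 1):pos]:
                if matches(records[j], records[i]):
                    pairs.add((min(i, j), max(i, j)))
    return pairs
```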

  24. Transitive Closure • Each independent run of the Multi-Pass approach produces a set of pairs of records • Although one field of a record may be in error, another field may not be • Transitive closure is then applied to the union of these pairs to form the clusters of records to merge
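The closure itself can be computed with a union-find structure over the pairs collected from all passes; this is a standard technique, not one the paper prescribes:

```python
def transitive_closure(n, pairs):
    """Union-find over record-index pairs; records that end up with
    the same root form one merged cluster."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# With pair (0, 1) found by passes 1-2 and (1, 2) by pass 3
# (A=0, B=1, C=2 in the next slide's example):
print(transitive_closure(3, [(0, 1), (1, 2)]))   # [[0, 1, 2]]
```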

  25. Multi-pass Matches
  Pass 1 (Last name discriminates):
  KSNKAT48NRTH789 (Kathi Kason 789912345)
  KSNKAT48NRTH879 (Kathy Kason 879912345)
  Pass 2 (First name discriminates):
  KATKSN48NRTH789 (Kathi Kason 789912345)
  KATKSN48NRTH879 (Kathy Kason 879912345)
  Pass 3 (Address discriminates):
  48NRTH879KSNKAT (Kathy Kason 879912345)
  48NRTH879SMTKAT (Kathy Smith 879912345)

  26. Transitive Equality Example
  IF A implies B AND B implies C THEN A implies C
  From the example:
  789912345 Kathi Kason 48 North St. (A)
  879912345 Kathy Kason 48 North St. (B)
  879912345 Kathy Smith 48 North St. (C)

  27. Test Results

  28. Test Environment • Test data was created by a database generator • Names are randomly chosen from a list of 63,000 real names • The database generator provides a large number of parameters: • size of the DB, • percentage of duplicates, • amount of error…

  29. Correct Duplicate Detection

  30. False Positive Rates

  31. Time for each run

  32. Accuracy for each run

  33. Real-World Test • Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington) • OCAR’s goals • How long do children stay in foster care? • How many different homes do children typically stay in?

  34. OCAR’s Database • Most of OCAR’s data is stored in one relation • The DB contains 6,000,000 total records • The DB grows by about 50,000 records per month

  35. Typical Problems in the DB • Names are frequently misspelled • SSNs or birthdays are either missing or clearly wrong • The case number often changes when the child’s family moves to another part of the state • Some records use service provider names instead of the child’s • No reliable unique identifier

  36. OCAR Equational Theory • Keys for the independent runs • Last Name, First Name, SSN, Case Number • First Name, Last Name, SSN, Case Number • Case Number, First Name, Last Name, SSN

  37. OCAR Results

  38. Incremental Merge/Purge w/ New Data

  39. Incremental Merge/Purge • Lists are concatenated for first-time processing • Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space • An incremental merge/purge approach is needed: the Prime-Representatives method (see the sketch below)
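A minimal sketch of one increment under this scheme; merge_purge (the multi-pass SNM plus transitive closure) and select_prs (one of the strategies on slide 42, sketched there) are passed in as assumed helpers, not the paper's exact interfaces:

```python
def process_increment(prime_reps, new_records, merge_purge, select_prs):
    """Handle one batch of new data (sketch).

    Only the prime representatives of existing clusters are
    concatenated with the new records, so each run's input stays far
    smaller than the full accumulated database.
    """
    clusters = merge_purge(prime_reps + new_records)
    return [pr for cluster in clusters for pr in select_prs(cluster)]
```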

  40. Prime-Representative: Definition • A set of records extracted from each cluster, used to represent the information in the cluster • The “cluster centroid”, or base element of the equivalence class

  41. Prime-Representative creation • Initially, no PR exists • After the execution of the first merge/purge, clusters of similar records are created • Correct selection of the PR from a cluster impacts the accuracy of results • For some clusters, selecting no PR at all may be the best choice

  42. 3 Strategies for Choosing PR • Random Sample • Select a sample of records at random from each cluster • N-Latest • The most recent elements entered in the DB • Syntactic • Choose the largest or most complete record
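The three strategies might be sketched as follows; the 'entered' timestamp field and the total-field-length measure of completeness are illustrative assumptions, not the paper's exact criteria:

```python
import random

def select_prs(cluster, strategy="syntactic", k=1):
    """Pick prime representative(s) from one cluster (sketch)."""
    if strategy == "random":     # Random Sample
        return random.sample(cluster, min(k, len(cluster)))
    if strategy == "n-latest":   # N most recently entered records
        return sorted(cluster, key=lambda r: r["entered"])[-k:]
    if strategy == "syntactic":  # largest / most complete record
        return [max(cluster,
                    key=lambda r: sum(len(str(v)) for v in r.values()))]
    raise ValueError(f"unknown strategy: {strategy}")
```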

  43. Important Assumptions • No data previously used to select each cluster’s PR will be deleted • Deleted records could require restructuring of clusters (expensive) • No changes in the rule-set will occur after the first increment of data is processed • Substantial rule changes could invalidate clusters

  44. Results • Cumulative running time for the Incremental Merge/Purge algorithm is greater than that of the classic algorithm • The PR selection methodology could improve cumulative running time • TOTAL running time of the Incremental Merge/Purge algorithm is always LESS than that of the classic algorithm

  45. Conclusion

  46. Cleansing of Data • The Sorted-Neighborhood Method is expensive due to • the sorting phase • the need for large windows for high accuracy • Multiple passes with small windows followed by transitive closure improve accuracy and performance (for a specified level of accuracy) • increasing the number of successful matches • decreasing the number of false positives

  47. Question 1 • What are two major reasons merging large databases becomes a difficult problem? • The databases are heterogeneous • The identifiers or strings differ in how they are represented within each DB

  48. Question 2 • What are the three main steps of the Sorted-Neighborhood Method? • Creation of key(s) • Sorting records on this key • Merge/Purge records

  49. Question 3 • What is edit distance used for? • Quantifying how dissimilar two strings are to one another • by counting the minimum number of operations required to transform one string into the other
