
Data Matching

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. By Peter Christen. Presented by Joseph Park.


Presentation Transcript


  1. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. By Peter Christen. Presented by Joseph Park.

  2. Introduction • “Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases” • Also known as: • Record or data linkage • Entity resolution • Object identification • Field matching

  3. Aims & Challenges • Three tasks: • Schema matching • Data matching • Data fusion • Challenges: • Lack of unique entity identifiers, and poor data quality • Computational complexity • Lack of training data (e.g. gold standards) • Privacy and confidentiality (health informatics & data mining)

  4. Overview of Data Matching • Five major steps: • Data pre-processing • Indexing • Record pair comparison • Classification • Evaluation

  5. Diagram

  6. Data Pre-processing • Remove unwanted characters and words • Expand abbreviations and correct misspellings • Segment attributes into well-defined and consistent output attributes • Verify the correctness of attribute values
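
The first three pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abbreviation table, the attribute names `street_number` and `street_name`, and the "number followed by street name" address layout are all hypothetical, and the verification step is not covered.

```python
import re

# Hypothetical abbreviation table; a real system would use a far larger lookup.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def clean(value):
    """Lowercase the value, replace unwanted characters, collapse whitespace."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)
    return re.sub(r"\s+", " ", value).strip()


def expand(value):
    """Replace each known abbreviation with its full form."""
    return " ".join(ABBREVIATIONS.get(token, token) for token in value.split())


def segment_address(value):
    """Split '<number> <street name>' into two output attributes (assumed layout)."""
    found = re.match(r"^(\d+)\s+(.*)$", value)
    if found:
        return {"street_number": found.group(1), "street_name": found.group(2)}
    return {"street_number": "", "street_name": value}


def preprocess_address(raw):
    """Run the three steps in order: clean, expand, segment."""
    return segment_address(expand(clean(raw)))
```

For example, `preprocess_address("42 Main St.")` yields a street number of `"42"` and a street name of `"main street"`.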

  7. Example of Data Pre-processing

  8. Indexing • Reduces computational complexity • Generates candidate record pairs • Common technique: blocking
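
A minimal sketch of blocking, assuming records are dictionaries with `surname` and `postcode` attributes and that the blocking key is the first letter of the surname plus the postcode. Both the attribute names and the key definition are illustrative choices, not taken from the slides. Only records that share a key are paired, which is how blocking avoids comparing every record with every other record.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Assumed key: first letter of the surname plus the postcode."""
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records):
    """Group record ids by blocking key and pair up ids within each block."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    "r1": {"surname": "Smith", "postcode": "2600"},
    "r2": {"surname": "Smyth", "postcode": "2600"},
    "r3": {"surname": "Jones", "postcode": "2600"},
    "r4": {"surname": "Smith", "postcode": "2912"},
}
```

With these four records, comparing everything would produce six pairs, while `candidate_pairs(records)` keeps only the pair `("r1", "r2")`. The trade-off is that true matches whose keys differ (for instance because of a typo in the postcode) are never compared.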

  9. Example of Blocking

  10. Record Pair Comparison • Comparison vector: a vector of numerical similarity values computed for a candidate record pair
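
As a rough illustration, a comparison vector can be built by applying one similarity function per attribute. The attribute names and the choice of comparators below are assumptions; `difflib.SequenceMatcher` from the Python standard library stands in for an approximate string comparator and is not the measure used in the slides.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Approximate string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a, b):
    """Exact comparison: 1.0 if the values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical attribute-to-comparator assignment.
COMPARATORS = [
    ("given_name", string_sim),
    ("surname", string_sim),
    ("postcode", exact_sim),
]


def comparison_vector(rec_a, rec_b):
    """Return one similarity value per configured attribute."""
    return [compare(rec_a[attr], rec_b[attr]) for attr, compare in COMPARATORS]
```

Each candidate pair produced by the indexing step would be passed through `comparison_vector`, and the resulting vectors are what the classification step consumes.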

  11. Example of Record Pair Comparison

  12. Jaro and Winkler String Comparison • Jaro: • Combines edit distance and q-gram based comparison • Winkler: • Increases Jaro similarity for up to four agreeing initial characters
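
The two comparators above can be sketched as follows. This follows the commonly published definition of Jaro similarity (matching characters within a sliding window, minus half the out-of-order matches) and the Winkler prefix adjustment. The prefix scaling factor of 0.1 is the value usually quoted and is an assumption here, since the slide does not give it.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order, halved.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro similarity for up to max_prefix agreeing initial characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)
```

For the pair "MARTHA" and "MARHTA", all six characters match with one transposition, so the Jaro similarity is about 0.944; the three agreeing initial characters raise the Winkler value to about 0.961.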

  13. Record Pair Classification • Two-class or three-class classification: • Match or non-match • Match or non-match or potential match (requires clerical review) • Supervised and unsupervised • Active learning

  14. Example of Record Pair Classification

  15. Unsupervised Classification • Threshold-based classification • Probabilistic classification • Cost-based classification • Rule-based classification • Clustering-based classification
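
Threshold-based classification is the simplest of these to illustrate. The sketch below sums a comparison vector and applies two thresholds to obtain the three classes (match, non-match, potential match). The threshold values are arbitrary placeholders chosen for a three-attribute vector and would need tuning on real data.

```python
def classify_by_threshold(vector, lower=1.5, upper=2.5):
    """Three-class decision on the summed similarity of a comparison vector.

    The lower and upper values are illustrative assumptions, not recommendations.
    """
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"  # would be sent to clerical review
```

Setting `lower` equal to `upper` collapses this into a two-class decision, with no pairs left for clerical review.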

  16. Probabilistic Classification • Three-class based • Different weights assigned to different attributes • Newcombe & Kennedy – cardinalities • Comparison vectors, binary comparison • Conditionally independent attributes assumed
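
A hedged sketch of how per-attribute weights can be combined, using the standard log-likelihood-ratio weights from probabilistic record linkage (the formulation usually attributed to Fellegi and Sunter). It may differ from the formulae shown in the presentation. Here m is the probability that an attribute agrees given a true match, and u is the probability that it agrees given a true non-match. The m and u values, attribute names, and thresholds are invented for illustration. Summing the weights relies on the conditional independence assumption listed above.

```python
import math

# Assumed (m, u) probabilities per attribute; real values are estimated from data.
M_U = {
    "given_name": (0.9, 0.01),
    "surname": (0.9, 0.01),
    "postcode": (0.8, 0.1),
}


def match_weight(binary_vector):
    """Sum agreement/disagreement weights over a binary comparison vector.

    binary_vector maps attribute name -> True (agrees) or False (disagrees).
    Agreement contributes log2(m / u); disagreement log2((1 - m) / (1 - u)).
    """
    total = 0.0
    for attr, agrees in binary_vector.items():
        m, u = M_U[attr]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify_by_weight(weight, lower=0.0, upper=10.0):
    """Three-class decision on the summed weight (thresholds are assumptions)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"
```

Attributes with a large m/u ratio, such as names in this example, move the total much further on agreement than a less discriminating attribute such as postcode, which is how different attributes receive different weights.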

  17. Formulae

  18. Example of Probabilistic Classification

  19. Active Learning • Trains a model with a small set of seed data • Classifies comparison vectors not in the training set as matches or non-matches • Asks users for help on the vectors that are most difficult to classify • Adds the manually classified vectors to the training data set • Trains the next, improved classification model • Repeats until a stopping criterion is met
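
The loop above can be sketched with a deliberately tiny model. Everything here is an assumption made for illustration: the "model" is a single threshold on the summed similarity, placed midway between the mean scores of the labelled matches and non-matches; "most difficult" means closest to that threshold; the `oracle` callable stands in for the human reviewer; and the stopping criterion is a fixed query budget. The seed data must contain at least one match and one non-match.

```python
def train(labeled):
    """Fit a threshold midway between the mean match and mean non-match scores."""
    match_scores = [sum(vec) for vec, is_match in labeled if is_match]
    non_scores = [sum(vec) for vec, is_match in labeled if not is_match]
    match_mean = sum(match_scores) / len(match_scores)
    non_mean = sum(non_scores) / len(non_scores)
    return (match_mean + non_mean) / 2


def active_learn(seed, unlabeled, oracle, budget=2, batch=1):
    """Query the oracle on the hardest vectors, retraining after each batch."""
    labeled = list(seed)
    pool = list(unlabeled)
    threshold = train(labeled)
    for _ in range(budget):  # assumed stopping criterion: fixed number of rounds
        if not pool:
            break
        # Hardest vectors are those whose score lies closest to the threshold.
        pool.sort(key=lambda vec: abs(sum(vec) - threshold))
        hardest, pool = pool[:batch], pool[batch:]
        labeled.extend((vec, oracle(vec)) for vec in hardest)
        threshold = train(labeled)
    # Classify whatever was never sent for manual review.
    predictions = {tuple(vec): sum(vec) >= threshold for vec in pool}
    return threshold, predictions
```

In practice the threshold model would be replaced by a proper classifier and the oracle by a clerical review interface; the control flow of query, label, retrain stays the same.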
