ChoiceMaker Technologies. The NY Citywide Immunization Registry’s MEDD De-Duplication Project. Andrew Borthwick, PhD § Vikki Papadouka, PhD, MPH* Deborah Walker, PhD*. § ChoiceMaker Technologies, Inc. Andrew.Borthwick@choicemaker.com. *New York City Department of Health


Presentation Transcript


  1. ChoiceMaker Technologies: The NY Citywide Immunization Registry’s MEDD De-Duplication Project
     Andrew Borthwick, PhD§; Vikki Papadouka, PhD, MPH*; Deborah Walker, PhD*
     § ChoiceMaker Technologies, Inc. (Andrew.Borthwick@choicemaker.com)
     * New York City Department of Health (vpapadou@dohlan.cn.ci.nyc.ny.us, dwalker@dohlan.cn.ci.nyc.ny.us)
     Adapted from a presentation at the 34th National Immunization Conference, Washington, DC, July 7, 2000

  2. ChoiceMaker Technologies. New York Citywide Immunization Registry: The MEDD De-duplication Project. The NYC CIR
     • The New York Citywide Immunization Registry (CIR) was mandated in January 1997
     • All health-care providers are required to submit immunizations
     • Goals of the system:
       • Doctors look up kids’ immunization statuses to determine which shots to give
       • Notify parents when their children are due for an appointment
       • Identify citywide immunization trends
     • Similar registries are being built at the state and local level around the country

  3. NYC CIR: Background
     • About 122,000 children are born in NYC every year
     • Each month the CIR receives 50,000-100,000 patient records and 80,000-200,000 immunization records
     • These come from more than 1,100 institutions and private providers
     • Given this volume, hand-matching each new record before it enters the CIR is unrealistic

  4. NYC CIR: Background
     • Contains 1.8 million records
     • Very high duplication rate, estimated at 3 records for every 2 children, because of very strict criteria for automatic merging
     • During April-September 1998, CIR staff reviewed and manually de-duplicated about 260,000 record pairs, spending 1,700 hours

  5. MEDD: What It Is
     • A system for deciding when two records represent the same child
     • Fast and accurate
     • Replicates the human decision-making process

  6. MEDD’s Decision-Making Process
     • For every record pair, MEDD computes a probability between 0 and 100% that the pair should be merged
     • High probabilities → “merge”
     • Low probabilities → “don’t merge”
     • Intermediate probabilities (close to 50%) indicate “don’t know” and require human review
     • Thresholds dividing the merge / don’t know / don’t merge cases are set by the user
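
The three-way decision on this slide can be sketched as a small function. This is only an illustration: the function name `decide` and the default thresholds are assumptions. The 0.98 upper value echoes the 98% cut-off mentioned later for Project Avalanche I, while the 0.02 lower value is made up.

```python
def decide(p_merge, lower=0.02, upper=0.98):
    """Map a merge probability (0.0-1.0) to one of three decisions.

    `lower` and `upper` are user-set thresholds; the defaults here are
    illustrative assumptions, not values prescribed by MEDD.
    """
    if not 0.0 <= p_merge <= 1.0:
        raise ValueError("p_merge must be between 0 and 1")
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if p_merge >= upper:
        return "merge"
    if p_merge <= lower:
        return "don't merge"
    # Everything in between is left for human review.
    return "don't know"
```

Widening the gap between the two thresholds sends more pairs to human review; narrowing it trades review effort for a higher risk of automatic errors.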

  7. Maximum Entropy Modeling
     • MEDD uses “Maximum Entropy Modeling”
     • A new statistical decision-making technique
     • Learns the human judgment process by training from examples
     • Has been used in sentence parsing, computer vision, financial modeling, and proper-name identification
     • Has achieved state-of-the-art results on these problems

  8. Maximum Entropy Modeling: Features
     • Maximum Entropy uses “features”
     • Feature = a function which looks at specific fields in the pair of records to make a “merge” or “don’t merge” decision
     • MEDD has many different features, each of which is assigned a “weight” during training

  9. Sample MEDD Features
     • Mother’s birthday
       • A match of Mom’s birthday predicts “merge”
       • A mismatch of Mom’s birthday predicts “no-merge”
       • Neither feature fires if Mom’s birthday is missing from either record, since we have no evidence in that case
     • Many other features
       • Child’s birthday
       • Child’s first and last name
       • Medicaid number
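
A minimal sketch of how the mother's-birthday features might fire, assuming records are plain dictionaries. The field name `mother_dob` and the feature names are hypothetical and are not taken from MEDD.

```python
def mother_birthday_features(rec_a, rec_b):
    """Return the names of the features that fire for this record pair.

    Records are assumed to be dicts; "mother_dob" is a hypothetical field.
    """
    bday_a = rec_a.get("mother_dob")
    bday_b = rec_b.get("mother_dob")
    # No evidence either way unless the field is filled in on both records.
    if not bday_a or not bday_b:
        return []
    if bday_a == bday_b:
        return ["mother_dob_match"]      # predicts "merge"
    return ["mother_dob_mismatch"]       # predicts "no-merge"
```

Returning an empty list for incomplete records keeps missing data from counting as evidence for either outcome.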

  10. Training the System
      • Inputs: record pairs hand-marked with merge/no-merge decisions, and a set of features
      • Both are fed to the Maximum Entropy Parameter Estimator
      • Output: a weight for each feature
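
To make the training step concrete, here is a toy estimator. The slide does not say which estimation algorithm MEDD uses, so this sketch swaps in plain gradient ascent on the log-likelihood of a two-outcome maximum entropy model. The feature names, the training pairs, and the step size are all invented for illustration.

```python
import math

# Hypothetical features: each one predicts a single outcome when it fires.
FEATURES = {
    "dob_match": "merge",
    "dob_mismatch": "no_merge",
    "name_match": "merge",
    "name_mismatch": "no_merge",
}

def p_merge(fired, log_w):
    """Probability of "merge" given the fired features and log-weights."""
    m = sum(log_w[f] for f in fired if FEATURES[f] == "merge")
    n = sum(log_w[f] for f in fired if FEATURES[f] == "no_merge")
    return 1.0 / (1.0 + math.exp(n - m))

def train(examples, steps=200, rate=0.1):
    """Fit one multiplicative weight per feature from hand-marked pairs."""
    log_w = {f: 0.0 for f in FEATURES}
    for _ in range(steps):
        grad = {f: 0.0 for f in FEATURES}
        for fired, label in examples:
            pm = p_merge(fired, log_w)
            for f in fired:
                # Observed count minus the count the model expects.
                p_outcome = pm if FEATURES[f] == "merge" else 1.0 - pm
                grad[f] += (1.0 if label == FEATURES[f] else 0.0) - p_outcome
        for f in FEATURES:
            log_w[f] += rate * grad[f]
    return {f: math.exp(v) for f, v in log_w.items()}

# Toy hand-marked pairs: (features that fired, human decision).
EXAMPLES = [
    ({"dob_match", "name_match"}, "merge"),
    ({"dob_match", "name_mismatch"}, "merge"),
    ({"dob_mismatch", "name_match"}, "no_merge"),
    ({"dob_mismatch", "name_mismatch"}, "no_merge"),
]
weights = train(EXAMPLES)
```

In this toy data only the date-of-birth features separate the two outcomes, so they end up with weights above 1; the name features carry no information here and should stay near a neutral weight of 1.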

  11. Probability Computation
      • For a pair of records, MEDD computes the probability that the pair should be merged as Merge / (Merge + NoMerge), where:
      • Merge = product of the weights of all features predicting “merge” for the pair
      • NoMerge = product of the weights of all features predicting “no merge” for the pair

  12. High Probability. Human Decision: Merge
      • Merge total = 587.2
      • No-merge total = 2.9
      • MEDD predicts “merge” with 99.5% confidence

  13. Low Probability. Human Decision: No-Merge
      • Merge total = 3.8
      • No-merge total = 181.1
      • MEDD predicts “no-merge” with 97.9% confidence

  14. Intermediate Probability. Human Decision: Merge
      • Merge total = 6.3
      • No-merge total = 5.4
      • MEDD predicts “merge” with 53.9% confidence (human review)
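
The three worked examples are consistent with normalizing the two totals, that is, P(merge) = Merge / (Merge + NoMerge). The sketch below assumes that reading; the function name and the sample weights in the usage note are hypothetical.

```python
import math

def merge_probability(merge_weights, no_merge_weights):
    """Probability that a pair should be merged.

    Each argument lists the weights of the features that fired for the
    pair and predict that outcome. An empty list gives a total of 1,
    so a pair with no evidence at all comes out at 0.5.
    """
    merge_total = math.prod(merge_weights)
    no_merge_total = math.prod(no_merge_weights)
    return merge_total / (merge_total + no_merge_total)

# Re-computing the slides from their reported totals:
p12 = merge_probability([587.2], [2.9])        # about 0.995
p13 = 1.0 - merge_probability([3.8], [181.1])  # no-merge confidence, about 0.979
p14 = merge_probability([6.3], [5.4])          # about 0.538
```

The first two results round to the 99.5% and 97.9% shown on the slides. The third comes out near 53.8% rather than 53.9%, presumably because the displayed totals are themselves rounded. With individual weights, `merge_probability([2.0, 3.0], [1.5])` multiplies 2.0 by 3.0 to get a merge total of 6.0 against a no-merge total of 1.5.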

  15. Sophisticated MEDD Features: Name Frequency
      • “Rodriguez” is 9 times more common than “Walker” in NYC
      • Fewer than 3 kids per year are born with the names “Borthwick” and “Papadouka”
      • Hence we build features categorizing names as “very common”, “somewhat common”, “very rare”, etc.
      • Given that we have a name match, the fact that the names are very common is a feature predicting “don’t merge”
      • A match between rare names is a feature predicting “merge”
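
One way to picture the name-frequency features is to bucket a matched surname by how often it occurs. In this sketch the birth counts, the `common_cutoff` of 500, and all function and field names are invented; only the "fewer than 3 per year" rarity cut-off echoes the slide.

```python
# Hypothetical births per year by surname (illustrative numbers only).
BIRTHS_PER_YEAR = {"rodriguez": 900, "walker": 100, "borthwick": 2}

def frequency_band(name, births_per_year, common_cutoff=500, rare_cutoff=3):
    """Categorize a name by its yearly frequency."""
    count = births_per_year.get(name.lower(), 0)
    if count >= common_cutoff:
        return "very_common"
    if count < rare_cutoff:
        return "very_rare"
    return "somewhat_common"

def last_name_frequency_feature(rec_a, rec_b, births_per_year):
    """Return a frequency-aware feature name when last names match, else None."""
    a = (rec_a.get("last_name") or "").strip().lower()
    b = (rec_b.get("last_name") or "").strip().lower()
    if not a or a != b:
        return None
    return "last_name_match_" + frequency_band(a, births_per_year)
```

Training would then be free to give `last_name_match_very_rare` a strong "merge" weight and to attach `last_name_match_very_common` to the "don't merge" side, as the slide describes.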

  16. Sophisticated MEDD Features: Partial Name Match
      • Soundex: a phonetic representation of names
        • Connor = Conor = Conner = CNR
        • When the Soundex representations of two names match, a feature fires predicting “merge”
      • Edit distance: features fire when two names have an edit distance of 1
        • Borthwich ↔ Borthwick ↔ Bortwick
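
Both partial-match tests can be sketched in a few lines. The Soundex below is the standard American variant, which encodes all three spellings as `C560`; the slide's “CNR” looks like a simplified notation, and the exact variant MEDD uses is not stated. The edit distance is ordinary Levenshtein distance, which is also an assumption.

```python
def soundex(name):
    """Standard American Soundex code (letter plus three digits)."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets the run
    return (result + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

“Borthwich” and “Bortwick” are each one edit away from “Borthwick” but two edits away from each other, so a distance-1 feature would fire for either misspelling against the correct name and not for the two misspellings against each other.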

  17. Special Situation Features
      • Every database has its quirks
      • HMO XYZ always sends its data to the CIR with Day of Birth = “1”
        • Birthday = July 1, 1998, not July 15, 1998
      • We have a special feature:
        • IF Provider = “HMO XYZ” AND Day of Birth = 1 AND the dates differ only on day of birth, THEN predict merge
      • We plan to allow users to define these types of features themselves
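
The special-situation rule translates almost directly into code. This is a sketch under assumed record fields (`provider` as a string, `dob` as a `datetime.date`); the function name is hypothetical.

```python
from datetime import date

def hmo_day_one_feature(rec_a, rec_b, provider="HMO XYZ"):
    """True when the pair differs only on day of birth and one record
    comes from the provider that always reports the day as 1."""
    dob_a, dob_b = rec_a["dob"], rec_b["dob"]
    same_month_year = (dob_a.year, dob_a.month) == (dob_b.year, dob_b.month)
    differ_on_day = dob_a.day != dob_b.day
    placeholder_day = any(
        rec["provider"] == provider and rec["dob"].day == 1
        for rec in (rec_a, rec_b)
    )
    return same_month_year and differ_on_day and placeholder_day

hmo_rec = {"provider": "HMO XYZ", "dob": date(1998, 7, 1)}
clinic_rec = {"provider": "Clinic A", "dob": date(1998, 7, 15)}
fires = hmo_day_one_feature(hmo_rec, clinic_rec)
```

When the feature fires it would contribute its trained weight to the “merge” total like any other feature, rather than forcing a merge on its own.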

  18. Test Procedure
      • MEDD was tested on about 3,000 pairs under NYC DOH supervision
      • Pairs were carefully hand-scored by NYC DOH as merge / don’t merge
      • ChoiceMaker never saw the test data

  19. MEDD Evaluation Results
      • Even with double-checking, the human error rate is no better than 0.3%

  20. Summary: What MEDD Offers
      • Can be trained on just 3,000 record pairs
      • Judges nearly 1,000 record pairs per second
      • Achieves very high accuracy by finding the optimal weighting of the different clues (“features”) indicating merge / don’t merge
      • Says “merge”, “don’t merge”, or “I don’t know”
      • Can be rigorously tested
      • Registry management can make informed judgments regarding the effort vs. accuracy trade-off

  21. The 5 Stages of the De-duplication Process
      • “Blocking”: identify a list of possible duplicates (SmartSearch)
      • “Decision-Making”: identify a definitive list of duplicate records (MEDD)
      • Human review of:
        • Records marked as “don’t know” by MEDD
        • Records held by special filters (twins, scanty records, etc.)
      • Linkage: link records that belong to the same child together (if A=B and B=C then A=C)
      • Update the CIR
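
The linkage stage amounts to taking the transitive closure of the pairwise merge decisions. A union-find structure is one common way to do that; the sketch below is an illustration with hypothetical record IDs and is not a description of how the CIR implements linkage.

```python
def link_records(merge_pairs):
    """Group record IDs into clusters, given pairs judged to be the same child."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in merge_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rec in list(parent):
        clusters.setdefault(find(rec), set()).add(rec)
    return list(clusters.values())

# A=B and B=C put A, B and C in one cluster; D=E forms another.
clusters = link_records([("A", "B"), ("B", "C"), ("D", "E")])
```

Because closure chains decisions together, a single wrong merge can join two otherwise separate clusters, which is one reason to keep the automatic-merge threshold conservative.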

  22. Project Avalanche
      • Project Avalanche: a project to systematically de-duplicate the whole CIR by comparing every record to every record meeting certain criteria
      • Uses our querying tool SmartSearch and our de-duplication tool MEDD
      • Project Avalanche I: February-April 2000
      • Project Avalanche II: May-July 2000

  23. Project Avalanche I
      • Used strict blocking criteria for finding possible duplicates to be passed on to MEDD, such as:
        • Exact match on DOB + medical record number, or
        • Exact match on Medicaid number, or
        • First name + gender + DOB + last name = maiden name (and vice versa), or
        • Last name + first name + DOB
      • Used 98% as the cut-off for automatic merging
      • Hand-reviewed records produced by the filters
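
Strict blocking of this kind is often implemented by indexing records on exact-match keys and pairing only the records that share a key. The sketch below covers three of the four criteria (the maiden-name rule is left out for brevity). The field names are hypothetical, and this is not a description of how SmartSearch works internally.

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(rec):
    """Exact-match keys under which a record is indexed."""
    keys = []
    if rec.get("dob") and rec.get("mrn"):
        keys.append(("dob+mrn", rec["dob"], rec["mrn"]))
    if rec.get("medicaid"):
        keys.append(("medicaid", rec["medicaid"]))
    if rec.get("last") and rec.get("first") and rec.get("dob"):
        keys.append(("name+dob", rec["last"].lower(), rec["first"].lower(), rec["dob"]))
    return keys

def candidate_pairs(records):
    """Return index pairs (i, j), i < j, that share at least one key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].append(idx)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs

records = [
    {"first": "Ana", "last": "Diaz", "dob": "1998-07-01", "medicaid": "M1"},
    {"first": "ANA", "last": "diaz", "dob": "1998-07-01"},
    {"first": "Anna", "last": "Dias", "dob": "1998-07-02", "medicaid": "M1"},
    {"first": "Bo", "last": "Lee", "dob": "1999-01-01", "mrn": "77"},
]
pairs = candidate_pairs(records)
```

Only the candidate pairs are scored, which avoids comparing every record with every other record in a registry of this size.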

  24. Project Avalanche I: Results
      * Estimated

  25. Project Avalanche II
      • In April 2000 we loaded 4 months’ worth of data that had been held due to Y2K problems
      • Used more liberal blocking criteria:
        • Medical record number, plus one of:
          • month and year of DOB, or
          • day and year of DOB, or
          • day and month of DOB, or
          • first name
      • Used 90% as the cut-off for automatic merging
      • Currently hand-reviewing records produced by the filters
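
Read as "same medical record number, plus either two of the three date-of-birth components or the first name", the liberal criteria can be written as a pairwise test. That reading is an interpretation of the slide, and the field names (`mrn`, `dob`, `first`) are hypothetical.

```python
from datetime import date

def liberal_block(rec_a, rec_b):
    """True if the pair passes the (assumed) Avalanche II blocking test."""
    mrn = rec_a.get("mrn")
    if not mrn or mrn != rec_b.get("mrn"):
        return False
    dob_a, dob_b = rec_a.get("dob"), rec_b.get("dob")
    if dob_a and dob_b:
        # Month+year, day+year, or day+month: at least two parts agree.
        agreeing = sum((dob_a.year == dob_b.year,
                        dob_a.month == dob_b.month,
                        dob_a.day == dob_b.day))
        if agreeing >= 2:
            return True
    first_a = (rec_a.get("first") or "").strip().lower()
    first_b = (rec_b.get("first") or "").strip().lower()
    return bool(first_a) and first_a == first_b
```

Looser blocking passes more pairs to MEDD, including pairs with a typo in one date component, at the cost of more comparisons.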

  26. Project Avalanche II: Results
      * Estimated

  27. Project Avalanche: Discussion
      • Using a very conservative cut-off for automatic merging, we reduced the duplicates by about 27.5% each time, and by more than 30% including human review
      • As a result of Project Avalanche, 81% of records now have immunizations, vs. 58% 6 months ago
      • Since MEDD is not yet implemented on the front end of the CIR, you don’t see the total number of duplicates decreasing over time in these early runs

  28. Future of MEDD at the CIR
      • As part of the Lead and CIR integration, MEDD will be inserted on the front end, thus reducing the number of duplicates being created
      • Improving MEDD’s performance will enable us to automatically merge more duplicates with the same error rate
      • We will continue with Project Avalanche until we bring the duplication rate down to an acceptable level

  29. Summary: ChoiceMaker Status
      • Currently have two employees
        • Andrew Borthwick, Ph.D.
        • Prof. Arthur Goldberg
      • Have several major contracts with the New York City Dept. of Health
      • Good prospects of finding similar work with other state and municipal health departments

  30. Summary: De-duplication Marketplace
      • Immunization registries have very difficult duplicate-record problems
      • Many others have similar problems
        • Medical researchers (correlating birth certificate and maternal death records)
        • Banks, phone companies (correlating clients from different lines of business)
        • Direct marketers (merging mailing lists)

  31. Summary: ChoiceMaker’s Plans
      • Do further research to decrease the amount of consulting time needed to deploy MEDD
      • Seeking first-round investors to fund expansion of R&D and marketing
      • Have an opening for someone with an M.S. in C.S. or similar qualifications, starting 10/1/2000, and for a C.S. Ph.D., starting 11/1/2000
