Entity Resolution with Evolving Rules Youzhong Ma 2010-9-25 Lab of WAMDM
Outline • Motivations • ER Related concepts • ER properties • Conclusions
ER Related concepts • Suppose market A merges with market B • They have to combine their customer databases • The same person may appear in both markets' customer DBs, but with some attribute values that differ • How should such records be handled?
ER Rule • Boolean functions: determine whether two records represent the same entity (true or false) • Distance functions: measure how different (or similar) two records are
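The two kinds of rule can be sketched in a few lines of Python. This is only an illustration: the record fields (`name`, `zip`), the function names, and the use of `difflib.SequenceMatcher` as a string similarity are assumptions made here, not something the slides specify.

```python
from difflib import SequenceMatcher

def boolean_rule(r, s):
    """Boolean rule: True if the two records are judged to be the same entity."""
    return r["name"] == s["name"] and r["zip"] == s["zip"]

def distance_rule(r, s):
    """Distance rule: 0.0 for identical names, approaching 1.0 as names diverge."""
    return 1.0 - SequenceMatcher(None, r["name"], s["name"]).ratio()

# Hypothetical customer records, invented for illustration.
r1 = {"name": "John Doe", "zip": "54321"}
r2 = {"name": "John Doe", "zip": "54321"}
r3 = {"name": "Jon Doe", "zip": "11111"}

same_entity = boolean_rule(r1, r2)   # the two records agree on name and zip
name_gap = distance_rule(r1, r3)     # small but non-zero: the names nearly match
```

A distance rule is typically turned into a match decision by comparing it against a threshold, which is one of the things that can change when rules evolve.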
ER procedure • Original record set S = {r1,r2,r3,r4} • ER input Pi = {{r1},{r2},{r3},{r4}} • B1: Pname → E1 = {{r1,r2,r3},{r4}} (6 comparisons) • B2: Pname ∧ Pzip → E2 = {{r1,r2},{r3},{r4}} • Computing E2 from scratch (naïve approach) takes 6 comparisons; the evolving-rule approach, starting from E1, takes 3 • The evolving-rule approach only works if the ER algorithm satisfies certain properties and B2 is stricter than B1 • One contribution of the paper is to determine under what conditions, and for which ER algorithms, incremental approaches are feasible
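The comparison counts on this slide can be reproduced with a small sketch. It assumes a simple match-based clustering (transitive closure over all pairwise matches), which may differ from the algorithms studied in the paper; the attribute values in `RECORDS` and the helper names `resolve` and `evolve` are hypothetical, chosen only so that the partitions match the slide.

```python
from itertools import combinations

# Hypothetical records; the slides give only the IDs r1..r4, not attribute values.
RECORDS = {
    "r1": {"name": "John", "zip": "54321"},
    "r2": {"name": "John", "zip": "54321"},
    "r3": {"name": "John", "zip": "11111"},
    "r4": {"name": "Bob", "zip": "99999"},
}

def b1(a, b):
    """B1: Pname."""
    return a["name"] == b["name"]

def b2(a, b):
    """B2: Pname AND Pzip (stricter than B1)."""
    return a["name"] == b["name"] and a["zip"] == b["zip"]

def resolve(ids, rule):
    """Cluster ids by the transitive closure of pairwise matches.

    Returns (clusters, number_of_pairwise_comparisons).
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    comparisons = 0
    for a, b in combinations(ids, 2):
        comparisons += 1
        if rule(RECORDS[a], RECORDS[b]):
            parent[find(b)] = find(a)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values()), comparisons

def evolve(previous_clusters, stricter_rule):
    """Re-resolve each old cluster on its own; valid only if the new rule is stricter."""
    result, total = [], 0
    for cluster in previous_clusters:
        parts, n = resolve(cluster, stricter_rule)
        result.extend(parts)
        total += n
    return sorted(result), total

ids = sorted(RECORDS)
e1, c1 = resolve(ids, b1)                # ER under B1, from scratch
e2_naive, c_naive = resolve(ids, b2)     # naive: rerun everything under B2
e2_evolved, c_evolved = evolve(e1, b2)   # evolving: only look inside E1's clusters
```

Because B2 is stricter than B1, no B2 match can cross a cluster boundary of E1, so only the three pairs inside {r1,r2,r3} need to be compared.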
Materialization! • Original record set S = {r1,r2,r3,r4} • ER input Pi = {{r1},{r2},{r3},{r4}} • B1: Pname ∧ Pzip → E1 = {{r1,r2},{r3},{r4}} (6 comparisons) • Materialized result for Pname: Ename = {{r1,r2,r3},{r4}} • Materialized result for Pzip: Ezip = {{r1,r2},{r3},{r4}} • B2: Pname ∧ Pphone → E2 = {{r1},{r2,r3},{r4}} (3 comparisons, starting from Ename)
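A rough sketch of the materialization idea follows: ER results are stored per predicate, and when the rule changes to Pname ∧ Pphone, the stored Ename is reused because the new rule is stricter than Pname alone. As before, the attribute values, the `same` and `resolve` helpers, and the transitive-closure clustering are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical records; values are invented to reproduce the slide's partitions.
RECORDS = {
    "r1": {"name": "John", "zip": "54321", "phone": "123-4567"},
    "r2": {"name": "John", "zip": "54321", "phone": "987-6543"},
    "r3": {"name": "John", "zip": "11111", "phone": "987-6543"},
    "r4": {"name": "Bob", "zip": "99999", "phone": "121-1212"},
}

def same(attr):
    """Build a predicate that matches two records on a single attribute."""
    return lambda a, b: a[attr] == b[attr]

def resolve(ids, rule):
    """Transitive closure of pairwise matches; returns (clusters, comparisons)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    comparisons = 0
    for a, b in combinations(ids, 2):
        comparisons += 1
        if rule(RECORDS[a], RECORDS[b]):
            parent[find(b)] = find(a)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values()), comparisons

ids = sorted(RECORDS)

# Materialize one ER result per predicate of the old rule B1 = Pname AND Pzip.
materialized = {attr: resolve(ids, same(attr))[0] for attr in ("name", "zip")}

def b2(a, b):
    """New rule B2: Pname AND Pphone."""
    return same("name")(a, b) and same("phone")(a, b)

# B2 is stricter than Pname, so it suffices to re-resolve inside each Ename cluster.
e2, comparisons = [], 0
for cluster in materialized["name"]:
    parts, n = resolve(cluster, b2)
    e2.extend(parts)
    comparisons += n
e2.sort()
```

Note that E1 itself cannot be reused here, since B2 is not stricter than B1; the materialized Ename is what makes the 3-comparison shortcut possible.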
Two important properties of ER algorithms that enable efficient rule evolution for match-based clustering • Rule Monotonicity (RM) • Context Free (CF)
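Rule Monotonicity (RM) • B1: Pname ∧ Pzip → E1 = {{r1,r2},{r3},{r4}} • B2: Pname → E2 = {{r1,r2,r3},{r4}} • B1 is stricter than B2, and E1 refines E2: every cluster of E1 lies inside some cluster of E2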
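Reading the example as "a stricter rule yields a finer partition", the relationship between E1 and E2 can be checked mechanically. The helper name `refines` is hypothetical; the sketch only tests refinement between two given partitions, it does not prove that an algorithm is rule monotonic in general.

```python
def refines(finer, coarser):
    """Return True if every cluster of `finer` is contained in some cluster of `coarser`."""
    return all(
        any(set(small) <= set(big) for big in coarser)
        for small in finer
    )

e1 = [["r1", "r2"], ["r3"], ["r4"]]   # result under B1: Pname AND Pzip
e2 = [["r1", "r2", "r3"], ["r4"]]     # result under B2: Pname

e1_refines_e2 = refines(e1, e2)
e2_refines_e1 = refines(e2, e1)
```

For an algorithm to be rule monotonic, this refinement would have to hold for every input and every pair of rules where one is stricter than the other, not just for this one example.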
Existing properties in the literature • General Incremental vs. Context Free • Order Independent vs. Rule Monotonicity • An ER algorithm is order independent if the ER result is the same regardless of the order in which the records are processed
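Order independence can be probed by brute force on small inputs: run the algorithm on every ordering and see whether the resulting partition ever changes. The two toy algorithms below (`group_by_bucket` and `greedy_by_representative`) are invented for this sketch and operate on plain integers rather than customer records.

```python
from itertools import permutations

def is_order_independent(er_algorithm, records):
    """Run er_algorithm on every ordering of records and compare the partitions."""
    results = set()
    for ordering in permutations(records):
        partition = er_algorithm(list(ordering))
        results.add(frozenset(frozenset(c) for c in partition))
    return len(results) == 1

def group_by_bucket(values):
    """Toy ER: values with the same bucket key (v // 2) are the same entity."""
    buckets = {}
    for v in values:
        buckets.setdefault(v // 2, []).append(v)
    return list(buckets.values())

def greedy_by_representative(values):
    """Toy ER: join the first cluster whose first member is within distance 1."""
    clusters = []
    for v in values:
        for c in clusters:
            if abs(c[0] - v) <= 1:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

bucket_ok = is_order_independent(group_by_bucket, [1, 2, 3])
greedy_ok = is_order_independent(greedy_by_representative, [1, 2, 3])
```

The greedy variant depends on which record happens to become a cluster's representative, so processing 2 first merges all three values while processing 1 first does not. The check enumerates all permutations, so it is only practical for a handful of records.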
Conclusions • Proposes a new ER approach with evolving rules • Exploits the properties (RM, CF) of ER algorithms that enable efficient rule evolution • Provides guidance to designers of ER algorithms
Some problems • How are the comparison rules generated? • How can ER algorithms be designed so that they satisfy the RM and CF properties? • How can the ER algorithms be implemented in the MapReduce framework?