Query-driven Data Completeness Management. Simon Razniewski, supervised by Werner Nutt. Area: Data Quality/Decision Support. Data Quality research investigates how good data is. Dimensions of Data Quality are: Correctness, Timeliness, Completeness.
Query-driven Data Completeness Management
Simon Razniewski
Supervised by Werner Nutt
Area: Data Quality/Decision Support
• Data Quality research investigates how good data is
• Dimensions of Data Quality are:
  • Correctness
  • Timeliness
  • Completeness
Example Scenario: School Data Management in South Tyrol
[Figure: central school database (notoriously incomplete) feeding statistical reports (completeness important)]
Example: Final Grades
• Vocational schools enter final grades, many others don't
• Query: How many pupils in class 12 have 'A' in Math?
• Answer: 3400
• Can we trust this? No!
• Pupils from high schools could be missing in the result
Example: Final Grades (2)
• Vocational schools enter final grades, many others don't
• Query: How many pupils at vocational schools in class 12 have 'A' in Math?
• Answer: 1700
• Can we trust this? Yes!
• All grades from vocational schools are in the database
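The two grade examples above can be replayed on a tiny scale. The following is a minimal sketch, assuming invented toy pupils and grades (the names, schools and counts are hypothetical and unrelated to the 3400 and 1700 figures on the slides). It compares each query's answer over an "ideal" database, where every grade is recorded, with its answer over the "available" database, where only vocational schools have entered grades.

```python
# Toy data, invented purely for illustration (not taken from the school database).
# Schemas follow the slides:
#   pupil(name, class, school_name, school_type), grade(name, subject, value)
PUPILS = {
    ("Anna", 12, "School A", "vocational"),
    ("Ben", 12, "School B", "high school"),
    ("Carla", 12, "School A", "vocational"),
}

# Grades as they exist in the real world (the "ideal" state).
IDEAL_GRADES = {
    ("Anna", "Math", "A"),
    ("Ben", "Math", "A"),
    ("Carla", "Math", "B"),
}

# Only vocational schools enter their grades, so the stored ("available")
# grade table lacks the grades of all other pupils.
VOCATIONAL_NAMES = {name for (name, _cls, _school, stype) in PUPILS
                    if stype == "vocational"}
AVAILABLE_GRADES = {g for g in IDEAL_GRADES if g[0] in VOCATIONAL_NAMES}


def count_a_in_math(pupils, grades, school_type=None):
    """Count class-12 pupils with an 'A' in Math, optionally for one school type."""
    selected = {
        name
        for (name, cls, _school, stype) in pupils
        if cls == 12 and (school_type is None or stype == school_type)
    }
    return len({
        name
        for (name, subject, value) in grades
        if name in selected and subject == "Math" and value == "A"
    })


# Unrestricted query: the stored data undercounts.
ideal_all = count_a_in_math(PUPILS, IDEAL_GRADES)
available_all = count_a_in_math(PUPILS, AVAILABLE_GRADES)

# Query restricted to vocational schools: both states agree.
ideal_voc = count_a_in_math(PUPILS, IDEAL_GRADES, "vocational")
available_voc = count_a_in_math(PUPILS, AVAILABLE_GRADES, "vocational")
```

In this toy setting the unrestricted query differs between the two states, while the query restricted to vocational schools does not, which mirrors the "No!" and "Yes!" on the slides.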
Research Questions
• How can completeness information be stored in a database?
• How can one find out whether query answers are complete (and correct)?
• Where can completeness information come from?
Where Else is Incompleteness a Problem?
• … when many users contribute to a database: OpenStreetMap, Wikipedia (?)
• … when data submission is optional: data from surveys
• … when data from different sources is integrated: biological databases
• … when the real world changes without informing the database: address data
Abstract Problem
Database, where in some parts data may be missing (grey)
Query to the database
• Can we trust the query answer?
• Does the query only touch the complete (green) parts of the database?
• If not, where does it touch grey parts?
• How could we modify the query to touch only green parts?
Approach
[Figure: dialogue around a database. "This data is complete … and this … And this!" / "I query this part of the database. Can I trust the answer?" / "You cannot, because… But you could.."]
1 Formalism for statements about completeness
2 Reasoning procedures
3 Implementation techniques
4 Derivation of statements from business process analysis
Approach: Describe Complete Parts by Queries
pupil(name, class, school_name, school_type)
grade(name, subject, value)
• Complete: All grades of pupils from vocational schools:
  QCgrades(n,s,v) :- grade(n,s,v), pupil(n,c,sn,'vocational')
• Complete: All pupils
  QCpupils(n,c,sn,st) :- pupil(n,c,sn,st)
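One way to read the two completeness statements above is as a check between an ideal database state and the available one: every tuple the statement's query returns over the ideal state must be present in the corresponding available table. The sketch below assumes that reading (the slides do not spell out the semantics) and uses invented toy data; the function names `qc_grades`, `qc_pupils` and `satisfies_statements` are hypothetical.

```python
# Toy data, invented purely for illustration.
# pupil(name, class, school_name, school_type), grade(name, subject, value)
IDEAL = {
    "pupil": {
        ("Anna", 12, "School A", "vocational"),
        ("Ben", 12, "School B", "high school"),
    },
    "grade": {
        ("Anna", "Math", "A"),
        ("Ben", "Math", "A"),
    },
}


def qc_grades(db):
    """QCgrades(n,s,v) :- grade(n,s,v), pupil(n,c,sn,'vocational')."""
    vocational = {name for (name, _c, _sn, stype) in db["pupil"]
                  if stype == "vocational"}
    return {g for g in db["grade"] if g[0] in vocational}


def qc_pupils(db):
    """QCpupils(n,c,sn,st) :- pupil(n,c,sn,st)."""
    return set(db["pupil"])


def satisfies_statements(ideal, available):
    """Assumed semantics: the tuples each statement selects from the ideal
    state must all be stored in the matching available table."""
    return (qc_grades(ideal) <= available["grade"]
            and qc_pupils(ideal) <= available["pupil"])


# Missing a high-school grade is allowed by the statements ...
missing_high_school_grade = {
    "pupil": set(IDEAL["pupil"]),
    "grade": {("Anna", "Math", "A")},
}
# ... but missing a vocational grade violates the first statement.
missing_vocational_grade = {
    "pupil": set(IDEAL["pupil"]),
    "grade": {("Ben", "Math", "A")},
}

ok_state = satisfies_statements(IDEAL, missing_high_school_grade)
bad_state = satisfies_statements(IDEAL, missing_vocational_grade)
```

Under this reading, the statements constrain which gaps the available database may have, without requiring the whole grade table to be complete.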
Approach: Compare Queries with Complete Queries
Query: All pupils in class 12 with 'A' in Math
Qgoodmath(n) :- grade(n,'Math','A'), pupil(n,12,sn,st)
QCgoodmath(n) :- grade(n,'Math','A'), pupil(n,12,sn,st), pupil(n,c,sn,'vocational')
Reasoning: Are Qgoodmath and QCgoodmath equivalent?
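The equivalence question above can be illustrated with the textbook homomorphism test for conjunctive queries: Q1 is contained in Q2 exactly when the body of Q2 can be mapped into the body of Q1 while preserving the head. The sketch below is an assumption-laden illustration, not the reasoning procedure from the cited papers: it ignores keys and other schema constraints, encodes variables as strings starting with "?", and the second query pair (restricted to vocational schools) is added here for contrast.

```python
def is_var(term):
    """Variables are encoded as strings starting with '?' (an assumed convention)."""
    return isinstance(term, str) and term.startswith("?")


def extend(mapping, src_args, dst_args):
    """Try to extend the mapping so that src_args maps onto dst_args."""
    new = dict(mapping)
    for s, d in zip(src_args, dst_args):
        if is_var(s):
            if new.setdefault(s, d) != d:
                return None  # variable already mapped to something else
        elif s != d:
            return None  # a constant must map to itself
    return new


def find_homomorphism(src_atoms, dst_atoms, mapping):
    """Backtracking search mapping every source atom onto some target atom."""
    if not src_atoms:
        return mapping
    (pred, args), rest = src_atoms[0], src_atoms[1:]
    for dst_pred, dst_args in dst_atoms:
        if dst_pred == pred and len(dst_args) == len(args):
            new = extend(mapping, args, dst_args)
            if new is not None:
                result = find_homomorphism(rest, dst_atoms, new)
                if result is not None:
                    return result
    return None


def contained_in(q1, q2):
    """q1 is contained in q2 iff q2 maps homomorphically into q1 (head to head)."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False
    start = extend({}, head2, head1)
    return start is not None and find_homomorphism(body2, body1, start) is not None


def equivalent(q1, q2):
    return contained_in(q1, q2) and contained_in(q2, q1)


GRADE_A = ("grade", ("?n", "Math", "A"))
VOC_ATOM = ("pupil", ("?n", "?c", "?sn", "vocational"))

# Queries from the slide: any school type.
q_goodmath = (("?n",), [GRADE_A, ("pupil", ("?n", 12, "?sn", "?st"))])
qc_goodmath = (("?n",), q_goodmath[1] + [VOC_ATOM])

# Contrast case (added for illustration): query already restricted to vocational schools.
q_voc = (("?n",), [GRADE_A, ("pupil", ("?n", 12, "?sn", "vocational"))])
qc_voc = (("?n",), q_voc[1] + [VOC_ATOM])

goodmath_equivalent = equivalent(q_goodmath, qc_goodmath)
voc_equivalent = equivalent(q_voc, qc_voc)
```

In this sketch the unrestricted pair is not equivalent, because the constant 'vocational' cannot be mapped onto the unconstrained variable st, while the vocational pair is equivalent, matching the intuition of the two final-grade examples.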
Done so Far
• Problem 1: Statements about database completeness and query completeness [LID2011]
• Problem 2: Reasoning tasks + algorithms + complexity [VLDB2011]
• Problem 3: Map reasoning tasks to satisfiability modulo theories (SMT)
Challenge
• Identify core problems and doable steps
• Schema constraints
• Nulls
• Business Processes
• Probabilistic Reasoning
• XML-databases
Publications
• Checking Query Completeness over Incomplete Databases, Simon Razniewski and Werner Nutt, Workshop on Logic in Databases, 2011
• Completeness of Queries over Incomplete Databases, Simon Razniewski and Werner Nutt, International Conference on Very Large Databases, 2011
• Submitted: Incomplete Databases: Missing Records and Missing Values, Werner Nutt, Simon Razniewski and Gil Vegliach, Workshop on Data Quality in Data Integration Systems, 2012
Thank you.
Related Work
• Motro 1989: Introduced important concepts for describing database completeness
• Levy 1996: Introduced a central reasoning problem about data completeness
• Fan, Geerts 2009: Worked on a similar problem of data completeness for master data