1 / 52

Informationsintegration Information Quality

Informationsintegration Information Quality. 26.1.2006 Felix Naumann. Overview. Motivation: IQ for integrated IS Definition of IQ Optimizing IQ IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration. IIS.

Thomas
Download Presentation

Informationsintegration Information Quality

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. InformationsintegrationInformation Quality 26.1.2006 Felix Naumann

  2. Overview • Motivation: IQ for integrated IS • Definition of IQ • Optimizing IQ • IQ assessment • IQ model • IQ query answering in DBMS • IQ query answering in IIS • IQ-driven integration Felix Naumann, VL Informationsintegration, WS 05/06

  3. IIS Database Management Systems vs. Integrated Information Systems DBMS Felix Naumann, VL Informationsintegration, WS 05/06

  4. Complete (assumed) Accurate Trusted Fast Free Incomplete Inaccurate Untrusted Slow Possible cost Low expectations Low quality High expectations High quality DBMS Quality vs. IIS Quality Felix Naumann, VL Informationsintegration, WS 05/06

  5. Datenqualität vs. Datenfehler • Qualität kann nicht einzig durch Data Cleansing erhöht werden. • Ansehen, Objektivität, … • Accuracy ≠ Quality • Duplicates ≠ Quality Felix Naumann, VL Informationsintegration, WS 05/06

  6. Datenqualität in IIS • Integrierte Informationssysteme besonders anfällig für Qualitätsprobleme • Probleme akkumulieren • Qualität der Ursprungsdaten (Eingabe, Fremdfirmen,...) • Qualität der Quellsysteme (Konsistenz, Constraints, Fehler, ...) • Qualität der Integrationsprozesse • Parsen, Transformieren • Mappings • Probleme treten erst bei integrierter Sicht zu Tage Felix Naumann, VL Informationsintegration, WS 05/06

  7. Example: Customer Relationship Management (CRM) • Probleme im CRM eines Multi-Channel Vertriebs • Kunden doppelt geführt • Kunden falsch bewertet • Falsche Adressen • Haushalte / Konzernstrukturen nicht erkannt • Folgen • „False positives“: Verärgerte Kunden durch mehrere / unpassende Mailings • „False negatives“: Verpasste Gelegenheiten durch fehlende / falsche Zuordnung (Cross-Selling) • Sinnlose Portokosten bei falschen Adressen Quelle: Prof. Ulf Leser (VL Data Warehouses) Felix Naumann, VL Informationsintegration, WS 05/06

  8. Cost of Dirty Data • A.T. Kearny: 25%-40% der operativen Kosten entstehen durch schlechte Datenqualität. • Data Warehouse Institute: Industrie und Verwaltung in den USA verlieren jährlich 600 Milliarden USD. • SAS Studie: Nur 18% der Deutschen Betriebe vertrauen ihren Daten. • AT&T (70er): 20-30% aller Anschlüsse unbenutzt wegen schlechter Daten. • 80% aller Krankenhaus Datensätze enthalten Fehler. Hmmm... • ... Felix Naumann, VL Informationsintegration, WS 05/06

  9. Fixed cost • price, patience, … • Optimize IQ • IQ criteria Optimize IQ! • Fixed quality • complete & correct • Optimize cost • time & throughput Felix Naumann, VL Informationsintegration, WS 05/06

  10. Overview • Motivation: IQ for integrated IS • Definition of IQ • Optimizing IQ • IQ assessment • IQ model • IQ query answering in DBMS • IQ query answering in IIS • IQ-driven integration Felix Naumann, VL Informationsintegration, WS 05/06

  11. Information Quality (IQ) • Was ist Informationsqualität ? • „Fitness for use“ • “User satisfaction” • Anwendungsabhängig • Folgen geringer Datenqualität • Falsche Prognosen • Verpasstes Geschäft • Qualität ist besonders bei integrierten Informationen interessant • Oft keine Kontrolle über Informationsquellen (Autonomie!) • Oft zweifelhafte Qualität • Internet macht Publikation leicht • Vielzahl verfügbarer Quellen Felix Naumann, VL Informationsintegration, WS 05/06

  12. IIS Quality Criteria IQ := “Even though quality cannot be defined, you know what it is.” Robert Pirsig Felix Naumann, VL Informationsintegration, WS 05/06

  13. Information Quality (IQ) IQ := {Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ... } Felix Naumann, VL Informationsintegration, WS 05/06

  14. IQ Classification of [WS96] • Intrinsic IQ • Believability, Accuracy, Objectivity, Reputation • Contextual IQ • Value-added, Relevancy, Timeliness, Completeness, Amount • Representational IQ • Interpretability, Understandability, Repr. Consistency, Repr. conciseness • Accessibility IQ • Accessibility, Security Felix Naumann, VL Informationsintegration, WS 05/06

  15. Content-based IQ Criteria • …concern the actual data. • Accuracy • is the extent to which data is correct, reliable, and certified free of error. [WS96] • Completeness • is the extent to which data is not missing and is of sufficient breadth, depth, and scope for the task at hand. [WS96] • Customer support • is the amount and usefulness of human help via email or telephone. • Documentation • is the amount and usefulness of documents with metadata. • Interpretability • is the extent to which data is in appropriate languages, symbols, and units, and the definitions are clear. [WS96] • Relevancy (or relevance) • is the extent to which data is applicable and helpful for the task at hand. [WS96] • Reliability • is the degree to which the user can trust the information • Value-Added • is the extent to which data is beneficial and provides advantages from its use. [WS96] Felix Naumann, VL Informationsintegration, WS 05/06

  16. Technical IQ Criteria • …concern software and hardware. • Accessibility (or availability) • of a DBMS is the probability that a feasible query is correctly answered in a given time range. • Is the extent to which data are available or easily and quickly receivable [WS96]. • Latency • is the amount of time in seconds from issuing the query until the first data item reaches the user • Price (cost effectiveness) • is the amount of money a user has to pay for a query. • is the extent to which the cost of collecting appropriate data is reasonable [WS96]. • Response time • measures the delay in seconds between submission of a query by the user and reception of the complete response from the IS. • Security • is the extent to which access to data is restricted appropriately to maintain its security [WS96]. • Timeliness • is the extent to which the age of the data is appropriate for the task at hand [WS96]. Felix Naumann, VL Informationsintegration, WS 05/06

  17. Intellectual IQ Criteria • …concern subjective aspects. • Believability • is the extent to which data is regarded as true, real, and credible [WS96]. • Objectivity • is the extent to which data is unbiased, unprejudiced, and impartial [WS96]. • Reputation • is the extent to which data is trusted or highly regarded in terms of its source or content [WS96]. Felix Naumann, VL Informationsintegration, WS 05/06

  18. Instantiation-related IQ Criteria • …concern the presentation of retrieved data. • Amount of data • is the extent to which the quantity or volume of available data is appropriate [WS96]. • Representational conciseness • is the extent to which data is compactly represented without being overwhelming [WS96]. • Representational consistency • is the extent to which data is always represented in the same format and are compatible with previous data [WS96]. • Understandability (ease of understanding) • is the extent to which data are clear without ambiguity and easily comprehended [WS96]. • Verifiability (traceability) • Is the extent to which data are well documented, verifiable, and easily attributed to a source [WS96]. Felix Naumann, VL Informationsintegration, WS 05/06

  19. IQ Criteria (classical) • Accuracy • Definition: • Usually: Percentage of incorrect tuples • For integration: Percentage of incorrect data values • Assessment: • Domain and Constraint Testing • Lookup tables • Scientific measurements • Data-input experience • Improvement: • Often: Deletion • Better: “Data Scrubbing” Felix Naumann, VL Informationsintegration, WS 05/06

  20. IQ Criteria (classical) • Response Time • Definition: • Usually: Time until complete query result is received • For integration: Latency • Assessment: • “Cost Calibration” • Continuous assessment • Improvement: • Source selection • Classical optimization • Federated Optimization Felix Naumann, VL Informationsintegration, WS 05/06

  21. IQ Criteria (new) • Completeness • Definition: • Coverage: Number of real world objects represented • Density: Number of attributes covered • For IIS: NULL-values • Assessment: • Sampling • Existing Metadata • Improvement: • Source selection • “Best k” vs. “k best” Felix Naumann, VL Informationsintegration, WS 05/06

  22. IQ Criteria (new) • Reputation / Trust • Definition: • Reputation: Memory and summary of behavior from past transactions • Trust: Expectation about future behavior • Assessment: • Individual experience • Corporate guidance • Trust-networks • Improvement: • ??? Felix Naumann, VL Informationsintegration, WS 05/06

  23. Overview • Motivation: IQ for integrated IS • Definition of IQ • Optimizing IQ • IQ assessment • IQ model • IQ query answering in DBMS • IQ query answering in IIS • IQ-driven integration Felix Naumann, VL Informationsintegration, WS 05/06

  24. Fixed cost • price, patience, … • Optimize IQ • IQ criteria Optimize IQ! • Fixed quality • complete & correct • Optimize cost • time & throughput Felix Naumann, VL Informationsintegration, WS 05/06

  25. DBMS Cost criteria Cost model Optimization algorithm Integrates IS Quality criteria Quality model Optimization algorithm + Information integration A New Optimization Paradigm – Many Changes Felix Naumann, VL Informationsintegration, WS 05/06

  26. DB Cost Criteria • Response time • Execution time • Latency • Throughput • Cardinality • … Assessed through system parameters and statistics. Felix Naumann, VL Informationsintegration, WS 05/06

  27. IIS Quality Criteria IQ := {Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ... } Assessed in 3 classes… Felix Naumann, VL Informationsintegration, WS 05/06

  28. Prozess Subjekt Objekt Anfrage IQ-Assessment • Relevanz • Glaubwürdigkeit • ... • Vollständigkeit • Zeitnähe • ... • Verfügbarkeit • Antwortzeit • ... Felix Naumann, VL Informationsintegration, WS 05/06

  29. IQ-Assessment Felix Naumann, VL Informationsintegration, WS 05/06

  30. Overview • Motivation: IQ for integrated IS • Definition of IQ • Optimizing IQ • IQ assessment • IQ model • IQ query answering in DBMS • IQ query answering in IIS • IQ-driven integration Felix Naumann, VL Informationsintegration, WS 05/06

  31. DB Cost Models • Operators • + (add) • max • x (multiply) Felix Naumann, VL Informationsintegration, WS 05/06

  32. (?, ?, ?, ?, ?, ?, ?, ?) Merging operators (?, ?, ?, ?, ?, ?, ?, ?) IQ-vector aggregated IQ-vector A Quality Model for Integrated IS • 2 Problems • Many Dimensions • Multidimensional ⊓ ⊔ S3 (95,0,0.7,1,99.95,60,38,3) S1 (95,0,0.7,1,99.95,60,48.2,0) S2 (99,0,1,0.2,99.9,80,52.8,0) Felix Naumann, VL Informationsintegration, WS 05/06

  33. (89.35,0,1,1,99.8,28.8,76.06,3) merge (94.05,0,1,1,99.85,48,54.86,0) merge Merge IQ in many Dimensions IQ Merge Functions Availability: A B Price: A + B Response Time: max[A, B] Coverage: Sylvester ⊓ S3 ⊔ (95,0,0.7,1,99.95,60,38,3) S1 (95,0,0.7,1,99.95,60,48.2,0) S2 (99,0,1,0.2,99.9,80,52.8,0) Felix Naumann, VL Informationsintegration, WS 05/06

  34. IQ-criteria have Different units Different ranges Different importance So... convert scale weight Multidimensional IQ (89.35,0,1,1,99.8,28.8,76.06,3) >(82.35,0,2,1.5,95,32,71.77,2)? MADM methods: SAW, TOPSIS, ELECTRE, AHP, DEA Felix Naumann, VL Informationsintegration, WS 05/06

  35. Overview • Motivation: IQ for integrated IS • Definition of IQ • Optimizing IQ • IQ assessment • IQ model • IQ query answering in DBMS • IQ query answering in IIS • IQ-driven integration Felix Naumann, VL Informationsintegration, WS 05/06

  36. DB-type Optimization • Goal • Minimize response time • Maximize throughput • Restrictions • Complete • Correct (not just accurate: filter conditions…) Find best plan! Felix Naumann, VL Informationsintegration, WS 05/06

  37. (name, company) (name, company) ⋈(compID=ID) (salary > 1000) (salary > 1000) ⋈(compID=ID) emp comp emp comp DB-type Optimization SELECT name, company FROM emp, comp WHERE emp.compID = comp.ID AND emp.salary > 1000 Felix Naumann, VL Informationsintegration, WS 05/06

  38. (name, company) ⋈(compID=ID) (salary > 1000) ... emp_1 emp_n comp DB-type Optimization Felix Naumann, VL Informationsintegration, WS 05/06

  39. (name, company) ⋈(compID=ID) MERGE comp (salary > 1000) (salary > 1000) ... emp_1 emp_n DB-type Optimization (name, company) ⋈(compID=ID) (salary > 1000) MERGE comp ... emp_1 emp_n Felix Naumann, VL Informationsintegration, WS 05/06

  40. (name, company) ⋈(compID=ID) MERGE comp (salary > 1000) (salary > 1000) ... emp_1 emp_n IIS-type Optimization • Change is efficient • But: Result can be incomplete. • Preferences? Felix Naumann, VL Informationsintegration, WS 05/06

  41. Overview • Motivation: IQ for integrated IS • Definition of IQ • Optimizing IQ • IQ assessment • IQ model • IQ query answering in DBMS • IQ query answering in IIS • IQ-driven integration Felix Naumann, VL Informationsintegration, WS 05/06

  42. IIS-type Optimization • Goal • Maximize information quality • (Maximize completeness) • Restrictions • Price • Bandwidth • Time (user patience) Find K best sources – Find best K sources Felix Naumann, VL Informationsintegration, WS 05/06

  43. IIS-type Optimization • K best sources • Simple IQ model, but • Sources may not complement each other • at tuple level (replication) • at attribute level • Best K sources • Finds optimal query result • Uses IQ merging Felix Naumann, VL Informationsintegration, WS 05/06

  44. Output Quality-ranked plans Phase 3 Phase 2 Phase 1 Plan selection Query planning Source selection Naive (3-phase) approach [NLF99] Input Query,views, IQ scores BA • Good: executes only best plans • Bad: still needs to compute all plans Felix Naumann, VL Informationsintegration, WS 05/06

  45. Integrated (single-phase) approach [LN00] Input Output Query, views, and IQ scores Quality-ranked plans HiQ B&B: one phase, quality-based branch & bound algorithm • Good: executes only best plans • Good: computes only a fraction of all plans Felix Naumann, VL Informationsintegration, WS 05/06

  46. Overview • Motivation: IQ for integrated IS • Definition of IQ • Optimizing IQ • IQ assessment • IQ model • IQ query answering in DBMS • IQ query answering in IIS • IQ-driven integration Felix Naumann, VL Informationsintegration, WS 05/06

  47. 0766607194 H. Melville $3.98  0766607194 Herman Melville Moby Dick $5.99   These are IQ considerations! These are IQ considerations! These are IQ considerations! Conflict Resolution amazon.com ID max length MIN CONCAT bn.com Felix Naumann, VL Informationsintegration, WS 05/06

  48. Conflict Resolution null = unknown Internal Conflict-Resolution Function Felix Naumann, VL Informationsintegration, WS 05/06

  49. Conflict Resolution • Numerical: SUM, AVG, MAX, MIN, … • Non-numerical: MAXLENGTH, CONCAT, AnnCONCAT,… • Special: RANDOM, COUNT, CHOOSE, FAVOR, MaxIQ,… • Domain-specific … Human speci-fication of IQ Automated speci-fication of IQ Felix Naumann, VL Informationsintegration, WS 05/06

  50. slow, up-date, complete, … 0766607194 H. Melville $3.98  0766607194 Herman Melville Moby Dick $5.99   fast, outdated, incomplete, … Conflict Resolution amazon.com ID max length MIN CONCAT bn.com Felix Naumann, VL Informationsintegration, WS 05/06

More Related