Trio: A System for Data, Uncertainty, and Lineage

Presentation Transcript


  1. Trio: A System for Data, Uncertainty, and Lineage
     Stanford University

  2. The “Trio” in Trio
     • Data
       Student #123 is majoring in Econ: (123, Econ) ∈ Major
     • Uncertainty
       Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) ∈ Major
       With confidence 60% student #456 is a CS major: (456, CS : 0.6) ∈ Major
     • Lineage
       456 ∈ HardWorker derived from: (456, CS) ∈ Major and “CS is hard” ∈ some web page

  3. Ingredients for Uncertainty
     • 1. Alternatives
     • 2. ‘?’ (Maybe) Annotations
     • 3. Confidences

  4. Our Model for Uncertainty
     • 1. Alternatives: uncertainty about value
     • 2. ‘?’ (Maybe) Annotations
     • 3. Confidences
     Three possible instances

  5. Our Model for Uncertainty
     • 1. Alternatives
     • 2. ‘?’ (Maybe): uncertainty about presence
     • 3. Confidences
     Six possible instances

  6. Our Model for Uncertainty
     • 1. Alternatives
     • 2. ‘?’ (Maybe): uncertainty about presence
     • 3. Confidences
     absent  unknown ?

  7. Our Model for Uncertainty
     • 1. Alternatives
     • 2. ‘?’ (Maybe) Annotations
     • 3. Confidences: weighted uncertainty
     Six possible instances, each with a probability
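The three ingredients can be made concrete by enumerating possible instances. The sketch below is illustrative only: the relation, the third major, and all confidence values are hypothetical, since the slide's tables are not part of the transcript. It assumes that x-tuples are independent and that a tuple whose confidences sum to less than 1 may be absent (the ‘?’ annotation).

```python
from itertools import product

# Hypothetical x-tuples: each maps an alternative to its confidence.
# Confidences summing to less than 1 leave room for absence (the '?' annotation).
majors = [
    {(123, "Econ"): 0.5, (123, "CS"): 0.3, (123, "Math"): 0.2},  # three alternatives, no '?'
    {(456, "CS"): 0.6},                                           # one alternative, absent with 0.4
]

def possible_instances(relation):
    """Return (instance, probability) pairs, assuming independent x-tuples."""
    per_tuple = []
    for alternatives in relation:
        options = list(alternatives.items())
        absent = 1.0 - sum(alternatives.values())
        if absent > 1e-9:  # '?': the tuple may be missing altogether
            options.append((None, absent))
        per_tuple.append(options)
    instances = []
    for choice in product(*per_tuple):
        rows = frozenset(value for value, _ in choice if value is not None)
        prob = 1.0
        for _, p in choice:
            prob *= p
        instances.append((rows, prob))
    return instances

instances = possible_instances(majors)
for rows, prob in instances:
    print(sorted(rows), round(prob, 2))
```

The first x-tuple alone yields three instances; adding the ‘?’ tuple doubles that to six, and the six probabilities sum to 1.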

  8. Our Model is Not Closed
     Suspects = π_person(Saw ⋈ Drives)
     CANNOT: does not correctly capture possible instances in the result (not 12)!

  9. Lineage to the Rescue
     • Lineage (provenance): “where data came from”
     • Internal lineage: from another Trio relation
     • External lineage: from an external source
     • In Trio: a Boolean function λ from data elements to other data elements (or external sources)

  10. Example with Lineage
      Suspects = π_person(Saw ⋈ Drives)
      λ(31) = (11,2) ∧ (21,2)
      λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
      λ(33) = (11,1) ∧ 23

  11. Example with Lineage
      Suspects = π_person(Saw ⋈ Drives)
      λ(31) = (11,2) ∧ (21,2)
      λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
      λ(33) = (11,1) ∧ 23
      Correctly captures possible instances in the result (7)
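The effect of lineage on the instance count can be checked by brute force. In the sketch below the lineage formulas are taken from the slide, but the shape of the base data is an assumption, because the Saw and Drives tables are not in the transcript: tuples 11, 21 and 22 are assumed to have two alternatives each, and tuples 22 and 23 are assumed to carry ‘?’. Under those assumptions the counts agree with the 7 and 12 quoted on the slides.

```python
from itertools import product

# Base x-tuples: id -> (number of alternatives, has '?').
# Which base tuples carry '?' is an assumption, not taken from the slides.
base = {11: (2, False), 21: (2, False), 22: (2, True), 23: (1, True)}

# Lineage from the slide: result alternative -> conjunction of base alternatives.
# Tuple 23 has a single alternative, written here as (23, 1).
lineage = {
    (31, 1): [(11, 2), (21, 2)],
    (32, 1): [(11, 1), (22, 1)],
    (32, 2): [(11, 1), (22, 2)],
    (33, 1): [(11, 1), (23, 1)],
}

def base_worlds(base):
    """Yield every choice of one alternative (or None for absent) per base x-tuple."""
    ids = sorted(base)
    options = []
    for tid in ids:
        count, maybe = base[tid]
        options.append(list(range(1, count + 1)) + ([None] if maybe else []))
    for choice in product(*options):
        yield dict(zip(ids, choice))

def result_instance(world):
    """Keep the result alternatives whose lineage conjunction holds in this world."""
    return frozenset(
        alt for alt, conjuncts in lineage.items()
        if all(world[tid] == chosen for tid, chosen in conjuncts)
    )

with_lineage = {result_instance(w) for w in base_worlds(base)}

# Ignoring lineage, Suspects looks like three independent '?' x-tuples:
# 31 (1 alternative), 32 (2 alternatives), 33 (1 alternative).
without_lineage = (1 + 1) * (2 + 1) * (1 + 1)

print(len(with_lineage), without_lineage)  # 7 12
```

The gap between the two counts comes from correlations: for example, 31 requires alternative 2 of tuple 11 while 32 and 33 require alternative 1, so 31 can never appear together with either of them.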

  12. Trio Data Model
      Uncertainty-Lineage Databases (ULDBs)
      • Alternatives
      • ‘?’ (Maybe) Annotations
      • Confidences
      • Lineage
      ULDBs are closed and complete
      [Databases with Uncertainty and Lineage, VLDB J. Special Issue (2/08)]

  13. ULDBs: Lineage
      • Conjunctive lineage sufficient for most operations (σ, π, ⋈)
      • Disjunctive lineage for duplicate-elimination (δ)
      • Negative lineage for set difference (−)
      • General case after several queries:
        • Boolean formula
      • Beyond just lineage:
        • Can capture arbitrary Boolean constraints among tuples
        • Can represent any finite (sub-)set of possible instances
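To illustrate what a general Boolean lineage formula looks like in practice, here is a minimal sketch that evaluates conjunctive, disjunctive and negative lineage over base tuples and derives a confidence by summing over worlds. The nested-tuple encoding, the tuple names `t1` and `t2`, and their confidences are all hypothetical; the sketch assumes independent base tuples and is exponential in their number, so it is not meant to reflect how Trio actually computes confidences.

```python
from itertools import product

def holds(formula, world):
    """Evaluate a lineage formula against a world mapping base ids to True/False."""
    op = formula[0]
    if op == "var":
        return world[formula[1]]
    if op == "not":
        return not holds(formula[1], world)
    if op == "and":
        return all(holds(f, world) for f in formula[1:])
    if op == "or":
        return any(holds(f, world) for f in formula[1:])
    raise ValueError(f"unknown operator: {op}")

def confidence(formula, base_conf):
    """Sum the probabilities of the worlds in which the formula holds.

    Assumes independent base tuples; enumerates all 2^n worlds.
    """
    ids = sorted(base_conf)
    total = 0.0
    for bits in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, bits))
        if holds(formula, world):
            p = 1.0
            for tid in ids:
                p *= base_conf[tid] if world[tid] else 1.0 - base_conf[tid]
            total += p
    return total

base_conf = {"t1": 0.5, "t2": 0.5}  # hypothetical base tuples and confidences
join = ("and", ("var", "t1"), ("var", "t2"))            # conjunctive (join)
dedup = ("or", ("var", "t1"), ("var", "t2"))            # disjunctive (duplicate elimination)
diff = ("and", ("var", "t1"), ("not", ("var", "t2")))   # negative (set difference)
print(confidence(join, base_conf), confidence(dedup, base_conf), confidence(diff, base_conf))
```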

  14. Formal Semantics
      • Relational (SQL) query Q on ULDB D
      [Diagram: D → D + Result via the implementation of Q (operational semantics); D → D1, D2, …, Dn (possible instances); Q on each instance gives Q(D1), Q(D2), …, Q(Dn); D + Result is a representation of the instances Q(D1), Q(D2), …, Q(Dn)]
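The requirement in this diagram is that both paths agree: running Q on the representation and then expanding must give the same instances as expanding first and running Q on each instance. A small sketch of that check for a selection query follows. The relation, the helper names, and the rule for selection on the representation are assumptions made for illustration; lineage and confidences are left out.

```python
from itertools import product

# A hypothetical uncertain relation: each x-tuple is (alternatives, has '?').
major = [
    ([(123, "Econ"), (123, "CS")], False),
    ([(456, "CS")], True),
]

def instances(relation):
    """All possible instances: one alternative per x-tuple, or none if it has '?'."""
    options = [alts + ([None] if maybe else []) for alts, maybe in relation]
    return {
        frozenset(row for row in choice if row is not None)
        for choice in product(*options)
    }

def select_on_representation(relation, pred):
    """Selection applied directly to the representation (lineage and confidences ignored)."""
    result = []
    for alts, maybe in relation:
        kept = [row for row in alts if pred(row)]
        if kept:
            # Dropping an alternative means the x-tuple may now be absent.
            result.append((kept, maybe or len(kept) < len(alts)))
    return result

def is_cs(row):
    return row[1] == "CS"

# Path 1: run Q on every possible instance of D.
per_instance = {frozenset(filter(is_cs, inst)) for inst in instances(major)}
# Path 2: run Q on the representation, then expand the result.
via_representation = instances(select_on_representation(major, is_cs))

print(per_instance == via_representation)  # True
```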

  15. Applications – SERF
      • Stanford Entity Resolution Framework
      <id>r1 <name>J. Doe <phone>351-2535
      <id>r2 <name>John Doe <phone>357-2635 <email>jdoe@yahoo.com
      <id>r3 <name>John D. <phone>351-2535 <email>jdoe@yahoo.com
      Iteration 1:
      <id>r4 <name>{John Doe||John D.} <phone>{357-2635||351-2535} <email>jdoe@yahoo.com
      Iteration 2:
      <id>r5 <name>{John Doe||John D.||J. Doe} <phone>{357-2635||351-2535} <email>jdoe@yahoo.com
      Swoosh: A Generic Approach to Entity Resolution. Benjelloun, Garcia-Molina, Whang et al.
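The two merge iterations on this slide can be traced with a short sketch in which each attribute holds a list of alternatives. The `match` rule (shared phone or email) and the dictionary layout are assumptions for illustration; in Swoosh the match and merge functions are supplied by the application, and this is not the Swoosh algorithm itself.

```python
def merge(a, b, new_id):
    """Merge two records, keeping every distinct value of an attribute as an alternative."""
    merged = {"id": [new_id]}
    for attr in ("name", "phone", "email"):
        values = []
        for value in a.get(attr, []) + b.get(attr, []):
            if value not in values:
                values.append(value)
        if values:
            merged[attr] = values
    return merged

def match(a, b):
    """Hypothetical match rule: the records share a phone number or an email address."""
    return any(set(a.get(k, [])) & set(b.get(k, [])) for k in ("phone", "email"))

r1 = {"id": ["r1"], "name": ["J. Doe"], "phone": ["351-2535"]}
r2 = {"id": ["r2"], "name": ["John Doe"], "phone": ["357-2635"], "email": ["jdoe@yahoo.com"]}
r3 = {"id": ["r3"], "name": ["John D."], "phone": ["351-2535"], "email": ["jdoe@yahoo.com"]}

assert match(r2, r3)          # iteration 1: shared email
r4 = merge(r2, r3, "r4")
assert match(r4, r1)          # iteration 2: shared phone
r5 = merge(r4, r1, "r5")
print(" || ".join(r5["name"]))   # John Doe || John D. || J. Doe
print(" || ".join(r5["phone"]))  # 357-2635 || 351-2535
```

Note that r1 and r2 do not match directly under this rule; they only end up in the same record because r3 links them.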

  16. Applications – SERF
      • Stanford Entity Resolution Framework
      <id>r1 <name>J. Doe <phone>351-2535
      <id>r2 <name>John Doe <phone>357-2635 <email>jdoe@yahoo.com
      <id>r3 <name>John D. <phone>351-2535 <email>jdoe@yahoo.com
      Iteration 1:
      <id>r4 <name>{John Doe : 0.4 || John D. : 0.3} <phone>{357-2635 : 0.5 || 351-2535 : 0.5} <email>jdoe@yahoo.com
      Iteration 2:
      <id>r5 <name>{John Doe : 0.4 || John D. : 0.3 || J. Doe : 0.3} <phone>{357-2635 : 0.3 || 351-2535 : 0.6} <email>jdoe@yahoo.com
      Swoosh: A Generic Approach to Entity Resolution. Benjelloun, Garcia-Molina, Whang et al.

  17. Applications – PharmGKB
      • Pharmaceutical / Medical Databases
      • PharmGKB.org
      • Relationships between Drugs, Diseases, Genes
      • Evidence from Literature References, Clinical Outcomes, etc.

  18. Applications – PharmGKB.org
      • Relational Schema
        • Drugs(DrugID, Name, AltName)
        • Diseases(DiseaseID, Name, AltName)
        • Genes(GeneID, Name, Symbol, AltSymbol)
        • Relationships(RelID, RelatedTo, Evidence)
      • Independent Base Data
        • P[Effective_{A,B,C}] = P[A] P[B] P[C]
      [Diagram nodes: Drug_A, Disease_B, Gene_C, Effective_{A,B}, Effective_{A,B,C}]
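As a worked instance of the independence formula, the snippet below multiplies three base confidences. The numbers are hypothetical and chosen only to show the arithmetic; the product rule applies only under the slide's assumption that the base data are independent.

```python
import math

# Hypothetical confidences for three independent base facts.
base_conf = {"Drug_A": 0.9, "Disease_B": 0.8, "Gene_C": 0.5}

# With independent base data, a fact derived from all three (conjunctive lineage)
# has confidence equal to the product of the base confidences.
effective_abc = math.prod(base_conf.values())
print(round(effective_abc, 2))  # 0.36
```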

  19. Applications – PharmGKB.org
      E.g.: A = Warfarin, B = High Blood Pressure, C = Heart Disease
      • Relational Schema
        • Drugs(DrugID, Name, AltName)
        • Diseases(DiseaseID, Name, AltName)
        • Genes(GeneID, Name, Symbol, AltSymbol)
        • Relationships(RelID, RelatedTo, Evidence)
      • Non-Independent Base Data
        • P[Effective_{A,B,C}] = P[E|A,B,C] ≠ P[A] P[B] P[C]
      • ULDBs vs. Bayesian Nets: factorized representation w/ conditional probability tables (CPTs)
      [Diagram nodes: Drug_A, Disease_B, Gene_C, Effective_{A,B}, Effective_{A,B,C}]

  20. Search “stanford trio”
      UNCERTAINTY LINEAGE DATA
      Trio contributors, past and present: Parag Agrawal, Omar Benjelloun, Julien Chaumond, Ashok Chandra, Anish Das Sarma, Alon Halevy, Chris Hayworth, Ander de Keijzer, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara, Martin Theobald, Jeff Ullman, Jennifer Widom
