1 / 61

Trio: A System for Data, Uncertainty, and Lineage

UNCERTAINTY. LINEAGE. DATA. Trio: A System for Data, Uncertainty, and Lineage. Jennifer Widom et al Stanford University. Depiction of Trio Project. Some Context. Trio Project They’re building a new kind of DBMS in which: Data Accuracy Lineage

duy
Download Presentation

Trio: A System for Data, Uncertainty, and Lineage

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. UNCERTAINTY LINEAGE DATA Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom et al Stanford University

  2. Depiction of Trio Project

  3. Some Context Trio Project • They’re building a new kind of DBMS in which: • Data • Accuracy • Lineage • are all first-class interrelated concepts

  4. The “Trio” in Trio • Data Student #123 is majoring in Econ:(123,Econ) Major • Uncertainty Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) Major With confidence 60% student #456 is a CS major: (456, CS0.6) Major • Lineage 456HardWorkerderived from: (456, CS) Major “CS is hard” some web page

  5. Depiction Data Uncertainty Lineage (“sourcing”)

  6. Original Motivation for the Project New Application Domains • Many involve data that is uncertain • (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate,...) • Many of the same ones need to track the lineage (provenance) of their data Neither uncertainty nor lineage is supported in current database systems

  7. Importance of Lineage • [technical] Lineage: • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computation over uncertain data more efficient • [fluffy] Applications use lineage to reduce or resolve uncertainty

  8. Sample Applications • Information extraction • Find & label entities in unstructured text • Often probabilistic • Information integration • Combine data from multiple sources • Inconsistencies • Scientific experiments • Inexact/incomplete data • Many levels of “derived data products”

  9. Sample Applications • Sensor data management • Approximate readings • Missing readings • Levels of data aggregation • Deduplication (“data cleaning”) • Object linkage, entity resolution • Often heuristic/probabilistic • Approximate query processing • Fast but inexact answers

  10. The “Usual” DBMS Features (From first lecture of any Intro to Databases class) • Efficient, • Convenient, • Safe, • Multi-User storage of and access to • Massive amounts of • Persistent data

  11. Completeness vs. Closure Completeness: blue=yellow All sets-of-instances Representable sets-of-instances Closure: arrow stays in blue Op2 Op1 Proposition: An incomplete representation is still interesting if it’s expressive enough and closed under all required operations

  12. Operations: Semantics • Easy and natural (re)definition for any standard • database operation (call it Op) D D’ Closure: up-arrow always exists Op’ direct implementation possible instances rep. of instances Op on each instance I1, I2, …, In J1, J2, …, Jm Note: Completeness  Closure

  13. Incompleteness Instance3 Instance2 Instance1 generates 4th instance: empty relation ? ?

  14. Non-Closure Under Join ⋈ Result has two possible instances: Instance2 Instance1 Not representable withor-sets and?

  15. Another “Trio” in Trio • Data Model • Simplest extension to relational model that’s sufficiently expressive • Query Language • Simple extension to SQL with well-defined semantics and intuitive behavior • System • A complete open-source DBMS that people want to use

  16. Another “Trio” in Trio • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • Trio-One— built on top of standard DBMS

  17. Remainder of Talk • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • Trio-One— built on top of standard DBMS 4. Demo

  18. Quote from Jennifer • We are not about machine learning or probabilistic reasoning! • We are about efficient and convenient storage, manipulation, and retrieval of large data sets (with uncertainty and lineage in them)

  19. Running Example: Crime-Solving • Saw (witness, color, car)// may be uncertain • Drives (person,color, car)// may be uncertain • Suspects (person)= πperson(Saw ⋈ Drives)

  20. In Standard Relational DBMS Create Table Suspects as Select person From Saw, Drives Where Saw.color = Drives.color And Saw.car = Drives.car

  21. Data Model: Uncertainty • An uncertain database represents a set of • possible instances • Amy saw either a Honda or a Toyota • Jimmy drives a Toyota, a Mazda, or both • Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 • Hank is a suspect with confidence 0.7

  22. Their Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences

  23. Our Model for Uncertainty • 1. Alternatives:uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences Three possible instances

  24. Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe): uncertainty about presence • 3. Confidences ? Six possible instances

  25. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability

  26. Models for Uncertainty • Our model (so far) is not especially new • We spent some time exploring the space of models for uncertainty • Tension between understandability and expressiveness • Our model is understandable • But it is not complete, or even closed under common operations

  27. Our Model is Not Complete or Closed Suspects= πperson(Saw ⋈ Drives) CANNOT Does not correctly capture possible instances in the result ? ? ?

  28. to the Rescue Lineage • Lineage (provenance): “where data came from” • Internal lineage • External lineage • In Trio: A functionλ from data elements to • other data elements (or external sources)

  29. Example with Lineage Correctly captures possible instances in the result Suspects= πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2) ? λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2) ? λ(33) = (11,1), 23 ?

  30. Example with Lineage Suspects= πperson(Saw ⋈ Drives) ? ? ?

  31. Trio Data Model Uncertainty-Lineage Databases (ULDBs) • Alternatives • ‘?’ (Maybe) Annotations • Confidences • Lineage ULDBs are closed and complete

  32. Formal Definition of ULDB

  33. Formal definition of Completeness

  34. Proof of Completeness • Proof: Construct R with x-relations S1, .., Sn, corresponding to R1, .. ,Rn • Construct an extra relation PW that encodes the possible instances. PW contains exactly one x-tuple: (1)|(2)|...|(m). • Each Si is constructed as follows, • For every Pj , each tuple t in Ri forms a maybe x-tuple with just one alternative with value t. • Duplicates within and across possible instances are preserved in Si. • We add (j) in PW to the lineage of alternatives in tuples copied from Pj . • This now exactly encodes the data in each of the possible instances.

  35. Continuous, Proof … • The correct lineage is obtained as follows, • We look at the lineage j in Pj and mimic it in the x-tuples it contributes in S1 through Sn. • For example, if j(t1) = {t2} in Pj, where t1 is a subset of R1 and t2 is a subset of R2, then the x-tuple that t2 gave in S2 is added to the lineage of the x-tuple from t1 in S1. • As a final step, we remove the extra relation PW but retain its symbols as external lineage. • Therefore, each possible LDB of D now has the same schema as each Pj , and represents exactly the same data and internal lineage.

  36. ULDBs: Minimality • A ULDB relation R represents a set of possible • instances • Does every tuple inRappear in some possible instance? (no extraneous tuples) • Does every maybe-tuple inRnotappear in some possible instance?(no extraneous ‘?’s) • Also Data-minimality Lineage-minimality

  37. Data Minimality Examples • Extraneous ‘?’ λ(20,1)=(10,1); λ(20,2)=(10,2) ? extraneous

  38. Data Minimality Examples • Extraneous tuple ? extraneous ? ?

  39. Querying ULDBs TriQL • Simple extension to SQL • Formal semantics, intuitive meaning • Query uncertainty, confidences, and lineage

  40. Simple TriQL Example Create Table Suspects as Select person From Saw, Drives Where Saw.car = Drives.car λ(31)=(11,2),(21,2) ? λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2) ? ? λ(33)=(11,1),23

  41. Formal Semantics • Relational (SQL) query Qon ULDB D implementation of Q D D’ D + Result operational semantics possible instances representation of instances Qon each instance D1, D2, …, Dn Q(D1), Q(D2), …, Q(Dn)

  42. TriQL: Querying Confidences Built-in function:Conf() • SELECT person • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8

  43. TriQL: Querying Lineage Built-in join predicate:Lineage() • SELECT Suspects.person • FROM Suspects, Saw • WHERE Lineage(Suspects,Saw) • AND Saw.witness = ‘Amy’ X ==> Y shorthand for Lineage(X,Y)

  44. Operational Semantics SELECT attr-list FROM X1, X2, ..., Xn WHERE predicate Over standard relational database: For each tuple in cross-product of X1, X2, ..., Xn • Evaluate the predicate • If true, project attr-list to create result tuple

  45. Operational Semantics SELECT attr-list FROM X1, X2, ..., Xn WHERE predicate Over ULDB: For each tuple in cross-product of X1, X2, ..., Xn • Create “super tuple” T from all combinations of alternatives • Evaluate predicate on each alternative in T; keep only the true ones • Project attr-list on each alternative to create result tuple • Details: ‘?’, lineage, confidences

  46. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

  47. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

  48. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

  49. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

  50. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

More Related