
Principles of Dataspace Systems

Principles of Dataspace Systems. Alon Halevy PODS June 26, 2006. Outline. Example data management challenges Denote by: “dataspaces” [Franklin, H., Maier] Dataspace Support Platforms: “Pay-as-you-go” data management Putting meat to the bones:


Presentation Transcript


  1. Principles of Dataspace Systems Alon Halevy PODS June 26, 2006

  2. Outline • Example data management challenges • Denote by: “dataspaces” [Franklin, H., Maier] • Dataspace Support Platforms: • “Pay-as-you-go” data management • Putting meat to the bones: • Specific research problems, recent progress • Querying, dataspace evolution, reflection • (Possibly) some predictions and subliminal messages.

  3. Shrapnels in Baghdad Story courtesy of Phil Bernstein

  4. Personal Information Management [Semex: Dong et al.] • [Diagram of associations between personal data objects] • AttachedTo, Recipient, Sender, FrequentEmailer, CoAuthor, AddressOf, HomePage • ConfHomePage, PublishedIn, Cites, EarlyVersion, ArticleAbout, PresentationFor • CourseGradeIn, ExperimentOf, OriginatedFrom, BudgetOf

  5. Google Base

  6. The Web is Getting Semantic • Forms (millions) • Vertical search engines (hundreds) • Annotation schemes: Flickr, ESP Game • Google Coop • DB search engine coming soon! “A little semantics goes a long way” [See Madhavan talk on Wednesday afternoon]

  7. “Data is the plural of anecdote” • Digital libraries, enterprises, “smart homes” • Corie Environmental Observation System • Talk to Maier • Circle of Blue • Data about the world’s water sources • The Boeing 777 • [Hanrahan @ Stanford]

  8. Requirements • A system that: • Is defined by boundaries (organizational, physical, logical), not explicit entry. • Provides best-effort services • Little or no setup time • Leverages semantics when possible Manage dataspaces

  9. Other Dataspace Characteristics • All dataspaces contain >20% porn. • The rest has >50% spam.

  10. Dataspaces vs. Data Integration • Data integration requires semantic mappings • [Diagram: schemas of Inventory Database A and Inventory Database B] • BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords) • Books(Title, ISBN, Price, DiscountPrice, Edition) • Authors(ISBN, FirstName, LastName) • BookCategories(ISBN, Category) • CDs(Album, ASIN, Price, DiscountPrice, Studio) • CDCategories(ASIN, Category) • Artists(ASIN, ArtistName, GroupName)

  11. Dataspaces vs. Data Integration • Dataspaces are “pay as you go” • [Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions. Artist: Mike Franklin]

  12. Dataspaces vs. Data Integration • Data integration systems require semantic mappings. • Dataspaces are “pay-as-you-go” • Dataspaces capture a broader class of semantic relationships: • DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …

  13. Why Do This Now? • Fact: • Data management is about people, not enterprises. • Prediction: • In the next few years, DB conferences will be about data management for the masses. • “CMA”: • If not, our community will become largely irrelevant. • Observation: • We’re doing it anyway: e.g., DB&IR, information extraction, uncertainty, …

  14. Outline • Example data management challenges • Denote by: “dataspaces” • Dataspace Support Platforms: • “Pay-as-you-go” data management • Putting meat to the bones: • Specific research problems, recent progress • Querying, dataspace evolution, reflection • (Possibly) some predictions and subliminal messages.

  15. Logical Model: Participants and Relationships • [Diagram] Participants: sensor, WSDL, RDB, SDB, XML, RDF, java • Relationships: snapshot, 1hr updates, manually created, schema mapping, replica, view

  16. Relationships • General form: (Obj1, Rel, Obj2, p) • Obj1, Obj2: instances or sources, • p: degree of certainty • Language for describing relationships? • [Rosati]
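
The general form (Obj1, Rel, Obj2, p) on this slide can be held in a very small data structure. The following is only an illustrative sketch, not the representation used by any actual DSSP: the class name `Relationship`, the helper `related`, the object names in `catalog`, and the assumption that p lies in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One (Obj1, Rel, Obj2, p) statement."""
    obj1: str   # an instance or a source
    rel: str    # relationship name, e.g. "replica" or "view"
    obj2: str   # an instance or a source
    p: float    # degree of certainty; assumed here to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must be between 0 and 1")


def related(catalog, obj, min_p=0.0):
    """Return the relationships that mention obj with certainty >= min_p."""
    return [r for r in catalog if obj in (r.obj1, r.obj2) and r.p >= min_p]


# Hypothetical catalog entries, loosely modeled on the diagram labels.
catalog = [
    Relationship("sensor_feed", "snapshot", "rdb_archive", 0.9),
    Relationship("rdb_archive", "replica", "rdb_mirror", 1.0),
    Relationship("xml_export", "view", "rdb_archive", 0.6),
]

confident = related(catalog, "rdb_archive", min_p=0.8)
```

Filtering on `min_p` is one simple way a catalog could expose only the relationships it is reasonably sure about while still storing the uncertain ones.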

  17. Dataspace Support Platforms (DSSP) • Components: Catalog, Local Store & Index, Search & Query, Discover & Enhance, Administration • [Diagram: the participants (sensor, WSDL, RDB, SDB, XML, RDF, java) and relationships (snapshot, 1hr updates, manually created, schema mapping, replica, view) from slide 15, with the DSSP components layered on top]

  18. Outline • Example data management challenges • Denote by: “dataspaces” • Dataspace Support Platforms: • “Pay-as-you-go” data management • Putting meat to the bones: • Specific research problems, recent progress • Querying, dataspace evolution, reflection • (Possibly) some predictions and subliminal messages.

  19. Technical Outline • Query: queries, answers, query processing • Evolve • Reflect

  20. Dataspace Queries • Keyword queries as starting point • Later may be refined to add structure • Formulated in terms of user’s “schema” • Mostly of the form • Instance*: • “britany spears” • P (instance) • “chicago weather” • “PC chair PODS”

  21. Semantics of Answers • The actual answers: • P(instance), P*(instance)

  22. Weather Seattle

  23. Semantics of Answers • The actual answers: • P(instance), P*(instance) • Sources where answer can be found: • Partially specify the query to the source • Help the user clean the query

  24. Toyota Corolla Palo alto

  25. Toyota Palo alto Volvo Palo alto Volvo Palo alto

  26. Semantics of Answers • The actual answers: • P(instance), P*(instance) • Sources where answer can be found: • Partially specify the query to the source • Help the user clean the query • Supporting facts or sources: • Facts that can be used to derive P(instance) • Rest of derivation may be obvious to user

  27. Related or Partial Answers • In which country was Dan Suciu born? • Bucharest • Latest edition of software X: • 2004 edition • Is the Space Needle higher than the Eiffel Tower? • Height of Seattle Space Needle: 184m • Height of Eiffel Tower: 324m • Rank all types of answers

  28. Query Processing: Data Integration • [Diagram: the query Weather(Chicago) and its reformulations Q1, Q2, Q3, Q4, Q41, Q42, Q43] • PDMS • Active XML • Data exchange

  29. Query Processing: DSSPs • Query: Jan Van den Bussche address • First name: Jan • Middle name: • Last name: Van den Bussche • Address: ?

  30. Query Processing: DSSPs • First name: Jan • Middle name: • Last name: Van den Bussche • Address: ? • Keyword query: J. vd Bussche • [Diagram labels on the sources: Companies address; address: City required; StreetAdr; address city, zip; City?; … … … … … …] • Answers returned with certainties: (t1,p1), (t2,p2), (t1,p3)

  31. Two Principles • Mappings are approximate at best • What do approximate mappings mean? • Answering queries with them? • Composition? • Answering queries = Finding evidence + combining evidence
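
"Finding evidence + combining evidence" can be made concrete with a toy example. The slide does not say how evidence is combined; the sketch below assumes a noisy-or rule (each piece of evidence is treated as independent), which is a swapped-in technique chosen for illustration only. The function name `combine_evidence` and the numeric certainties are hypothetical; the answer labels follow the (t1,p1), (t2,p2), (t1,p3) notation of the preceding slide.

```python
from collections import defaultdict


def combine_evidence(evidence):
    """Combine (answer, p) pairs with noisy-or: score = 1 - prod(1 - p_i).

    Assumes the pieces of evidence for an answer are independent.
    Returns (answer, score) pairs, highest score first.
    """
    miss = defaultdict(lambda: 1.0)  # chance that every piece of evidence is wrong
    for answer, p in evidence:
        miss[answer] *= (1.0 - p)
    scored = {answer: 1.0 - m for answer, m in miss.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# t1 is reached through two approximate mappings, t2 through one.
evidence = [("t1", 0.5), ("t2", 0.6), ("t1", 0.5)]
ranking = combine_evidence(evidence)
```

Here t1 outranks t2 even though each individual piece of evidence for t1 is weaker, because two sources agree on it.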

  32. Technical Outline • Query • Evolve: reuse human attention • Reflect

  33. The Cost of Semantics • Semantic integration modeled by: • {(Obj1, rel, Obj2, p),…} ? • [Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions. Artist: Mike Franklin]

  34. Reusing Human Attention • Principle: • User action = statement of semantic relationship • Leverage actions to infer other semantic relationships • Examples • Providing a semantic mapping • Infer other mappings • Writing a query • Infer content of sources, relationships between sources • Creating a “digital workspace” • Infer “relatedness” of documents/sources • Infer co-reference between objects in the dataspace • Annotating, cutting & pasting, browsing among docs

  35. Past, Present and Future • Leverage past actions & existing structure: • [Dong et al., 2004, 2005], [He & Chang, 2003] • Generalize from current actions • Queries, schema mappings • Beg for extra attention: • ESP [von Ahn], mass collaboration [Doan+], active learning [Sarawagi et al.]

  36. Reuse: Learning Schema Mappings [Doan et al.] • [Diagram labels: Action, Mediated schema, (S1, M, S, p)] • Classifiers for mediated schema • Thousands of web forms mapped in little time • Transformic Inc: deep web search. • [Madhavan et al.]: infer mappings for any schemas in the domain

  37. Technical Outline • Query • Evolve • Reflect: unify lineage, uncertainty, and inconsistency; model them on views

  38. Dataspace Reflection • Answers are uncertain in dataspaces: • data, • mappings, • query answering techniques • Data may be inconsistent • Tracking lineage is crucial

  39. A DSSP Needs to • Process uncertain data • Update uncertainty with new evidence • Be proactive about reducing uncertainty • Live with inconsistency • Leverage lineage to reduce uncertainty • (Obj1, Rel, Obj2, p)

  40. Two Principles • Need a single formalism for modeling: • uncertainty, • inconsistency, and • lineage • Model them on views

  41. Uncertainty & Lineage • Israel population • Trio @ Stanford

  42. Uncertainty and Inconsistency • Inconsistency = uncertainty about the truth • Salary (John Doe, $120,000) • Salary (John Doe, $135,000) • Salary (John Doe, $120,000 | $135,000) • Orchestra, Ives @ U. Penn.
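
The step from two conflicting Salary facts to a single fact with alternatives ($120,000 | $135,000) can be sketched in a few lines. This is a minimal illustration of the idea on the slide, not how Orchestra or any other system implements it; the function name `merge_conflicts` and the second employee are hypothetical.

```python
from collections import defaultdict


def merge_conflicts(facts):
    """Group (key, value) facts by key.

    Keys with several distinct values are kept as one set of alternatives,
    which turns an inconsistency into uncertainty about the true value.
    """
    alternatives = defaultdict(set)
    for key, value in facts:
        alternatives[key].add(value)
    return dict(alternatives)


facts = [
    ("John Doe", 120_000),
    ("John Doe", 135_000),
    ("Jane Roe", 90_000),   # hypothetical consistent fact, for contrast
]
salary = merge_conflicts(facts)
uncertain_keys = sorted(k for k, values in salary.items() if len(values) > 1)
```

After merging, only John Doe has more than one alternative, so only his salary is flagged as uncertain.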

  43. Uncertainty Formalisms 101 • Represent a set of possible worlds • A-tuples - uncertainty on attribute values: • (PODS, 2006, {Chicago, Baltimore}) • Tuples can be optional • Not closed under relational operators

  44. Uncertainty Formalisms 101 (2) • X-tuples – uncertainty on entire tuples: • {(PODS, 2006, Chicago), (PODC, 2006, Baltimore)} • Still not closed under relational operators
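
The difference between A-tuples and X-tuples shows up when their possible worlds are enumerated. The sketch below is illustrative only; the function names `a_tuple_worlds` and `x_tuple_worlds` and the Python encoding (a `set` marks an uncertain attribute) are assumptions, not notation from the talk.

```python
from itertools import product


def a_tuple_worlds(a_tuple):
    """Expand an A-tuple: each attribute is a value or a set of alternatives.

    Attribute choices are independent, so the result is a cross product.
    """
    choices = [sorted(v) if isinstance(v, (set, frozenset)) else [v]
               for v in a_tuple]
    return list(product(*choices))


def x_tuple_worlds(alternatives, optional=False):
    """Expand an X-tuple: exactly one whole-tuple alternative is present.

    If the tuple is optional, the empty world is possible as well.
    """
    worlds = [[alt] for alt in alternatives]
    if optional:
        worlds.append([])
    return worlds


# The A-tuple from the previous slide: the city is uncertain.
a_worlds = a_tuple_worlds(("PODS", 2006, {"Chicago", "Baltimore"}))

# The X-tuple from this slide: conference and city vary together.
x_worlds = x_tuple_worlds([("PODS", 2006, "Chicago"),
                           ("PODC", 2006, "Baltimore")])

# Trying to express the same thing as an A-tuple loses the correlation.
loose = a_tuple_worlds(({"PODS", "PODC"}, 2006, {"Chicago", "Baltimore"}))
```

The X-tuple yields two worlds, while the A-tuple with two uncertain attributes yields four tuples, including pairings such as PODS in Baltimore that the X-tuple rules out.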

  45. Uncertainty 101: C-Tables • { (PODC, 2006, Chicago, X = 1), (PODS, 2006, Chicago, X <> 1), (SIGMOD, 2006, Chicago, X <> 1) } • Closed and Complete! • Understandable? • See [Das Sarma et al., ICDE 2006]: working models for uncertain data
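
A C-table attaches a condition to each tuple, and every assignment of the variables yields one possible world. The following sketch evaluates the example on this slide for a single variable X; the function name `c_table_worlds`, the lambda encoding of conditions, and the domain {1, 2} for X are assumptions made for illustration.

```python
def c_table_worlds(c_table, domain):
    """Return one possible world per value of the variable X in domain.

    c_table is a list of (row, condition) pairs, where condition is a
    predicate over the value of X.
    """
    worlds = {}
    for x in domain:
        worlds[x] = [row for row, condition in c_table if condition(x)]
    return worlds


c_table = [
    (("PODC", 2006, "Chicago"), lambda x: x == 1),
    (("PODS", 2006, "Chicago"), lambda x: x != 1),
    (("SIGMOD", 2006, "Chicago"), lambda x: x != 1),
]

worlds = c_table_worlds(c_table, domain=[1, 2])
```

With X = 1 the world contains only the PODC tuple; with any other value it contains the PODS and SIGMOD tuples together. Conditions that tie several tuples to the same variable are what A-tuples and X-tuples cannot express directly.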

  46. Adding Lineage to X-tuples [ULDBs, Benjelloun et al., VLDB 06] • [Diagram: X-tuple {(PODS, 2006, Ch), (PODC, 2006, Ba)} with alternatives (t) and (t’), lineage annotations l1, l2 and !(l1 & l2), and base X-tuples { (…) (…) }, { (…) (…) }] • No effect on complexity of relational operators

  47. Adding Lineage to A-tuples • (PODS, {2005,2006}, {Chicago, Baltimore}) with lineage l1, l2, l3, l4 • Lineage on views (projections) • Uncertainty should also be on views! • (Halevy, Los Altos, CA, professor), 0.8 0.6 • Answering queries using uncertain views • See [Dalvi & Suciu, 2005] for a great start

  48. Putting it all Together • Query: approximate mappings, evidence combination • Evolve: reuse human attention, create approximate semantic relationships • Reflect: foundation for reasoning about uncertainty, inconsistency and lineage

  49. Conclusion and Outlook • Data management moving to consumers • Dataspaces: key element in this agenda • Pay as you go data management • Reuse human attention • The role of theory: • Reflect, generalize and explain • People, people, people

  50. Some References • SIGMOD Record, December 2005: • Original dataspace vision paper • Maier EDBT 2006 talk • PODS 2006 proceedings: challenges • Data Integration: the Teenage Years • VLDB 2006 • Teaching integration to undergraduates: • SIGMOD Record, September 2003.
