Talk outline For data to be shared, we must administer* • Policies (security, privacy, intellectual property, user interests) • Metadata (interfaces, relationships, mappings, …) • describe, relate, prescribe * Administration: the efforts that don’t look like programming
Data exchange topics • Overview • Messaging approaches • Mapping and matching • Active data management • First class views: services beyond “Get data”: • Agility and metrics
Enterprise Info Integration Tools: Why the CIO is yawning • “It solves a tiny piece of my real problems” • “It requires new organizations, new skills” • “We don’t have cross-organizational schemas” • “Where would an EII group fit with our decentralized Java/messaging approach?”
Basic approaches to data exchange Research history • Virtual db with distributed query (“Enterprise Info Integration”) [EII panel at SIGMOD 05] • Data warehouse • Formatted messages Industrial usage
Metadata to administer • System descriptions • Inter-relationships • Match schemas (also ontologies) • To systems or to “standard” interfaces • OWL is better (long term)! • Map: Generate executable mappings • for data or query or … • CommunitieS’ standardS • Dataflows (workflows) and data supply chains • Done within stovepipes (=silos: tightly coupled group of systems with too little interface to outside)
Compatibility aspects for exchanging / integrating data • Exchange requires getting compatible • Meaning • Structure • Value representation (e.g., units, code sets) • Delivery mode (push or pull; whole or part); Data language • Then must map the instance set or query (“certain answers”) • Integration (combination) adds • Identity (entity resolution) • Value • Each separate item can be described, related • Enables incremental progress, separate skills, “Industrial approach” [RSRM01] • Where the time goes [SRLS02]. How much do tools help [06?]
Message based exchanges - today • Physical message standards address the “n²” problem • Used in NATO, US DoD for 30+ years • Wrap each system to provide interfaces that are plug compatible in meaning, schema, value details, etc. • How to discourage mixing in domain processing, e.g., Select Mission where Status = Approved ? • High cost discourages participation • Standards simplification might be a killer app for automated mapping • Removes need for identical schema, value details • Small excitement but big $$$
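The quadratic growth that message standards address can be shown with simple counting. This is a minimal sketch, assuming that without a standard every ordered pair of systems needs its own translator, and that with a standard each system needs one wrapper to the standard and one from it. The function names are illustrative and do not come from the talk.

```python
def point_to_point_translators(n: int) -> int:
    """Directed translators needed when every system talks to every other one."""
    return n * (n - 1)


def standard_wrappers(n: int) -> int:
    """Wrappers needed when each system maps to and from one shared standard."""
    return 2 * n


# Compare the two approaches for a few community sizes.
for n in (5, 20, 100):
    print(n, point_to_point_translators(n), standard_wrappers(n))
```

The gap widens quickly with the number of systems, which is why the wrapping cost per system, rather than the count of wrappers, becomes the barrier to participation.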
Mapping vs. Matching Analysis: Mapping benefits • Loosen standard: Compatible meanings • Joins, values just need to be derivable by mapper • Results • Save $$ on hand-writing glue wrappers • More systems will join the game • Incremental progress via “describe + generate” • For new needs, proceed by capturing the missing knowledge. Then generate mapping [RSRM01] • Once semantically compatible, need not wrap • Partial mapper (just attribute representations) wasn’t useful [“Influential Paper” writeup, SIGMOD Record June 2004]
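The “describe + generate” idea can be sketched as follows: each system describes its attributes (here, only a name and a unit per shared concept), and a translator is generated from the two descriptions instead of being hand-coded. Everything in the sketch is hypothetical, including the schemas, the unit table, and the function names. It is not the mapper from [RSRM01], and it handles only attribute representations, which the slide notes was not useful on its own; a fuller mapper would also derive joins.

```python
from typing import Callable, Dict, Tuple

# Hypothetical conversion factors to a common base unit (metres).
TO_METRES = {"m": 1.0, "ft": 0.3048, "km": 1000.0}

# Descriptions: shared concept -> (local attribute name, unit). Both schemas are made up.
SOURCE: Dict[str, Tuple[str, str]] = {"altitude": ("alt_ft", "ft"), "range": ("rng_km", "km")}
TARGET: Dict[str, Tuple[str, str]] = {"altitude": ("altitude_m", "m"), "range": ("range_m", "m")}


def generate_mapping(source, target) -> Callable[[Dict[str, float]], Dict[str, float]]:
    """Build a record translator from two sets of attribute descriptions."""
    plan = []
    for concept, (src_attr, src_unit) in source.items():
        if concept not in target:
            continue  # no shared meaning described, so nothing to generate
        tgt_attr, tgt_unit = target[concept]
        factor = TO_METRES[src_unit] / TO_METRES[tgt_unit]
        plan.append((src_attr, tgt_attr, factor))

    def translate(record: Dict[str, float]) -> Dict[str, float]:
        # Apply the generated plan; attributes missing from the record are skipped.
        return {tgt: record[src] * factor for src, tgt, factor in plan if src in record}

    return translate


translate = generate_mapping(SOURCE, TARGET)
print(translate({"alt_ft": 10000.0, "rng_km": 2.5}))
```

When a new need appears, only the descriptions change; the translator is regenerated rather than rewritten.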
Mapping vs. Matching Analysis: Matching • Practicalities for matching • ~7 labor months for two ~200 entity air traffic schemas • Costly process, well worth automating – someday • Great research results are emerging [Halevy, Rahm, Doan, …] • Many MITRE customers found it less urgent • Delivers nothing to end users; just input to coder of mappings • Not full automation – humans must review if correctness matters • Manual is often cheap for small increments • Domain experts can do it without programmers • Management often fails to provide the right domain experts
The principles at work • Principles: • For a big win, drive cost of something to zero. Now you’re really agile • Automate the final stage of delivery pipeline first (if you spend $$$ maintaining changes at that stage) • Explanation: • There’s still a slow, costly manual step before end user gets the smallest delta
You’re all wimps! E*I is mostly passive Researchers and E*I products exchange data that fate creates (EII, warehouse, msgs, P2P) • We also need to influence what gets supplied • To influence the definitions used, develop “standards” in Communities of Interest (COIs) • A user’s guide to standards? • When to use a schema vs. an ontology? • Mandates’ unintended consequences • Manage Data Supply Chains via (data) service agreements
Administer via Communities of Interest (COIs) • Decentralize into Communities of Interest (COIs), to influence the definitions used • Communities have preferences, not rigid standards • “Community” is ill-defined. Who constitutes, what do they produce, what do they promise, … [Rosenthal et al., SIGMOD Record, Dec. ’04]
Administer via Communities (2) • What formalism should a community use to specify “we prefer this”? • What’s a good granularity for a community? a m’data specification? an “influence”? • Metrics: When to form subgroups (or split) • Cohesion, coherence • Experiments? [Castano98?] on units of reuse • Tools? • Air Force Data Strategy (Scott Renner et al.)
Manage a data supply chain • Goal: replace a drawer full of textual Memorandums of Agreement • Create service level agreements from a template [Seligman et al. IIWeb04] • Special for data: must-protect, copying and deriving are free, data quality, … • Our formalisms: Data flow graph + obligations + stream transformers • Reasoning can help: • Quickly establish data supply contracts in a new virtual organization (at integration time) • Determine effects of inadvertent failure (at run time)
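Determining the effects of a failure at run time can be illustrated on a data flow graph alone. This is a minimal sketch, assuming the supply chain is recorded as a mapping from each supplier to its direct consumers. The node names and the function are hypothetical, and the obligations and stream transformers from the slide are left out.

```python
from collections import deque

# Hypothetical data supply chain: each key supplies data to the listed consumers.
DATAFLOW = {
    "radar_feed": ["track_fusion"],
    "flight_plans": ["track_fusion", "billing"],
    "track_fusion": ["ops_display", "archive"],
    "billing": [],
    "ops_display": [],
    "archive": [],
}


def affected_by_failure(graph, failed):
    """Return every node downstream of the failed node (breadth-first search)."""
    affected, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for consumer in graph.get(node, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


print(sorted(affected_by_failure(DATAFLOW, "radar_feed")))
```

With obligations attached to each edge, the same traversal could report which agreements are at risk rather than only which consumers.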
Support many first class views • Messages support “get data” • DBMSs: base tables support all operations (i.e., are first class) • Update, Notify, Annotate, … have limited support on views • Relegated E/R conceptual views to the shadow world (initial design only) • Security needs to map credential (privilege), event, policy • View operation under-constrains semantics. How to help human disambiguate?
N-tier systems: a place to revisit “first class views” • EII wrongly ignores Update, which is • Unavoidable between system tiers • Useful to create staging tables • Glue methods use 3GL: Hand code each data method -- update, notify, … • Testing cost often exceeds programming. Can we reduce? • Paradigm war: Service (imperative) vs. Data (declarative) • Goal paradigm: Service + Data [Figure: tiers of Messages, GUI objects, Bus. objects, and DB, joined by glue code]
Toward first class (business class?) views • Updates use 3GL to specify response to constraint violations • Repair? Abort? Ask client? • 3GL has no automated use of derivation logic in the glue • SQL view update is • Unattractive to OO contractors • Too hard to customize • Devise a framework for update to constrained data [A. Keller 84?]: • capture semantic choices • generate update code [Figure: tiers of Messages, GUI objects, Bus. objects, and DB, joined by glue code]
Access policies inhibit sharing • SQL privileges are normally not derivable to views • Users cross database boundaries, but privileges don’t (even to nickname table). • Admin is redundant, slow, inconsistent • Discourages use of EII and other views • SQL infers privileges OK – for view creator [RS: SDM04] • Need explicit control over visibility of view definition • Open issue: How to provide backward compatibility?
EII and agile systems • “Each year, our incoming message set expands: ~20 new atomic element types, ~2 new entities” • $Millions to maintain integration, among subsystems and especially across tiers • Our N-tier architecture insulates perfectly -- new functionality can’t be seen at other tiers
Improve Agility • You get what you contract for • Hundreds of micro-requirements. “Agility” just in the preamble • How to write “agility” as a contract requirement? • Metrics for agility and interoperability • How much ($$, delay) does a particular size schema change cost? • An opportunity for data in the paradigm war • System test may cost more than changing code • Can EII reduce testing costs?
General speculations about metrics • How to measure agility? • Experiment • Analyze architecture, give points for standards and for tools • Would it help to measure metadata “harmony”? • Choose among standards? Determine weightings? • How to measure “metadata harmony”? • Include application objects, not just msg + db • Points for each compatible aspect of m’data • Need meaning first • Compatibility is directional on many aspects • Lower weight if mapping is slow, lossy, or a hassle to manage (w.r.t. current tools) • Weight by desirability of the partner
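One way to make the “metadata harmony” speculation concrete is a small scoring function. The sketch below is an assumption throughout: the aspect names follow the compatibility aspects listed earlier in the talk, but the point values, the three compatibility levels, and the function name are invented for illustration. It scores one direction only, returns zero when meaning is incompatible, discounts lossy mappings, and weights by partner desirability.

```python
# Hypothetical points per aspect and discount per compatibility level.
ASPECT_POINTS = {"meaning": 4.0, "structure": 2.0, "values": 2.0, "delivery": 1.0}
LEVEL_FACTOR = {"compatible": 1.0, "lossy_mapping": 0.5, "incompatible": 0.0}


def harmony(aspects: dict, partner_desirability: float = 1.0) -> float:
    """Score one direction (source -> partner) of metadata compatibility."""
    if aspects.get("meaning", "incompatible") == "incompatible":
        return 0.0  # meaning comes first: without it nothing else counts
    raw = sum(
        ASPECT_POINTS[aspect] * LEVEL_FACTOR[aspects.get(aspect, "incompatible")]
        for aspect in ASPECT_POINTS
    )
    return partner_desirability * raw


# Compatibility is directional, so each direction gets its own description.
a_to_b = {"meaning": "compatible", "structure": "lossy_mapping",
          "values": "compatible", "delivery": "incompatible"}
b_to_a = {"meaning": "incompatible", "structure": "compatible",
          "values": "compatible", "delivery": "compatible"}

print(harmony(a_to_b), harmony(a_to_b, partner_desirability=0.5), harmony(b_to_a))
```

How to choose the weightings is exactly the open question the slide raises; the numbers here only show the shape of such a metric.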
Acting as consultants • Identify the impossible, repeatedly • No, you won’t get a uniform standard • Without incentives, you won’t get m’data • Confront “wish-based” derived requirements • To get maximum benefit from our data, we need to invest in a standard that will enable everyone to share with everyone else • To reach the stars in 2010, we need a warp drive • Therefore, invest now to develop a warp core! • Ask: Estimate the chance we won’t reach the stars. In that case, what’s the investment’s cost/benefit
Acting as consultants (2) • Clarify the problem • Separate the threads of what’s needed for data sharing, to explain what XML contributes • Suggest bottom up approaches, based on gradual, incentivized provision of m’data • Formalize research problems from the muddle
Semantic Web Comments • OWL • Emphasizes concepts, relationships • Smaller units are more suited to E*I [Castano] • OWL is a standard • OWL is useful between as well as within ontologies • Integration semantics: whatever my inference engine returns • Databases (SQL, XML) emphasize • Records (object and several properties) • Plug-compatible interface format • Sets of instances • Update
Data exchange – semantics! • Related data (not just derived): a constraint that under-determines result • Research base exists for transforming data and query (semantics: intersects all possible worlds) • How to explain semantics to • Ordinary users or developers • Semantic Web mediator folks Does it matter? Who explains closed world assumption?
Idiot-proofing 101: Plan on failure • >50% of large integration efforts fail (Standish) • Often leave behind N-tier, distributed “Hello World” + hardware • The rate for government efforts is probably higher, esp. for crash integration efforts, post 9-11 • Technology futures are unreliable • How to create artifacts “bottom up” that have value • Even if the giant integration effort fails? • In many future architectures [Data Integration Needs an Industrial Revolution, “Decentralized development without a global blueprint” notes]
Idiot-proofing 102: Don’t believe your architecture Design artifacts to be valuable in many possible futures • My 1991 information integration blueprint missed Data Warehouses, Enterprise Java Beans, … • Many Homeland Security development efforts began Fall 2001, with no idea of the final shape • Risks: Your blueprint is wrong, or a competing one is accepted, or needed infrastructure (e.g., repositories) is not funded or populated [“Decentralized development without a global blueprint”, notes on my homepage]
Managing data services: What specific offerings? • Meet a contracted requirement from existing ones (by reasoning about data relationships) • Organize the agreements based on containment of the object affected • E.g., by what columns are affected, or by row-selection predicates • Abstraction: Express agreement in terms of abstract data structures (+ a set of allowable representations) • Protection requirements on results. Data is easier to pass on than services (e.g., can be emailed), and release is permanent – one cannot subsequently un-release it.
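Organizing agreements by containment suggests a simple check: a new requirement is already met if the data it asks for is contained in what an existing agreement supplies. This is a minimal sketch under strong assumptions: row selection is modelled only as a conjunction of equality predicates, and the `Agreement` class, the `covers` function, and the example table are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agreement:
    table: str
    columns: frozenset
    # Row selection as a set of (column, value) equality predicates, all of which must hold.
    row_filter: frozenset = frozenset()


def covers(existing: Agreement, requested: Agreement) -> bool:
    """True if the requested data can be derived from what the existing agreement supplies."""
    extra_predicates = requested.row_filter - existing.row_filter
    return (
        existing.table == requested.table
        and requested.columns <= existing.columns
        # The existing agreement may filter less, but not more, than the request.
        and existing.row_filter <= requested.row_filter
        # Any further filtering must use columns the existing agreement delivers.
        and all(column in existing.columns for column, _ in extra_predicates)
    )


existing = Agreement(
    "Mission",
    frozenset({"id", "status", "start", "region"}),
    frozenset({("status", "Approved")}),
)
requested = Agreement(
    "Mission",
    frozenset({"id", "start"}),
    frozenset({("status", "Approved"), ("region", "North")}),
)

print(covers(existing, requested), covers(requested, existing))
```

Real agreements would need general predicate containment and the protection requirements noted on the slide; the set comparisons here only show how containment can index the agreements.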
Some principles • Assume diversity • Legacy, external, and special need systems will be noncompliant • Wrapping reduces work to O(n) programming tasks, with data loss and just “get” operation • Ideal: automatically generate point to point bridges from available metadata. O(n) admin tasks • Easy incremental maintenance • Principle: Strategy encompasses all your data sharing, internal + external together • Much duplication, loss of function if done separately • Work plan then can prioritize across both