Semantic Mediation of Scientific Data via Logic-Based Data Federation Software
Amarnath Gupta, Bertram Ludäscher, Reagan Moore
San Diego Supercomputer Center, University of California, San Diego
Information Integration / Mediation
• Goal:
- combine data from different sources such that the integrated whole is more than the sum of its isolated parts => SDSC/CSE MIX project (Mediation of Information in XML)
• Standard Scenarios:
- C2B, e.g. comparison shopping: AddAll := IntegratedView(amazon, barnes&noble, ...)
- B2B, e.g. marketplaces: Virt_Market := IntegratedView(supplier_1, ... supplier_n)
- C2M, e.g. home-buyer: Full_Picture := IntegratedView(Realtor, Crime, Schools, ...)
• Callouts on the slide:
- IntegratedView: declarative DB view definition language
- One-World Mediation: e.g. join on ISBN
- Simple Multiple-World Mediation: e.g. join on ZIP
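The "join on ISBN" case can be illustrated with a minimal sketch. It assumes each source has already been wrapped into a list of offer records; the `Offer` type, the store names, the ISBN strings, and the prices are all hypothetical and only stand in for the sources named on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    isbn: str
    title: str
    price: float
    store: str


def integrated_view(*sources):
    """Group the offers of all sources by ISBN, the shared join key."""
    view = {}
    for source in sources:
        for offer in source:
            view.setdefault(offer.isbn, []).append(offer)
    return view


def cheapest(view, isbn):
    """Return the lowest-priced offer for an ISBN, or None if no source has it."""
    offers = view.get(isbn, [])
    return min(offers, key=lambda o: o.price) if offers else None


# Hypothetical sample data; not taken from any real catalog.
store_a = [
    Offer("isbn-1", "Database Systems", 59.0, "store_a"),
    Offer("isbn-2", "XML Basics", 30.0, "store_a"),
]
store_b = [Offer("isbn-1", "Database Systems", 54.5, "store_b")]

add_all = integrated_view(store_a, store_b)
best = cheapest(add_all, "isbn-1")
print(best.store, best.price)
```

The join works here only because both sources agree on what an ISBN is, which is what makes this a one-world problem.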
MIX Mediation Challenges
• MIX Mediator Architecture (middleware)
- wrappers: wrap different data into a common format (XML)
- mediator: combines the sources’ XML views into an IntegratedView
• MIX Mediator Components
- declarative mediator view definition language: XMAS (XML Matching And Structuring); language, algebra, and first prototype ~1999 [SIGMOD99, EDBT00, ...]
- query composition and rewriting, esp. with limited source capabilities
- on-demand (“lazy”) query processing of virtual XML docs (DOM-VXD)
- Blended Browsing and Querying user interface (BBQ)
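The wrapper/mediator split and the idea of on-demand processing can be sketched with Python generators. This is only an illustration of laziness: the slides do not show how DOM-VXD is implemented, and the `wrapper` and `mediator` functions, the `pulled` log, and the sample rows are assumptions made for this example.

```python
import itertools
import xml.etree.ElementTree as ET

pulled = []  # log of the source rows that were actually exported


def wrapper(source_name, rows):
    """Hypothetical wrapper: lazily exports each source row as an XML element."""
    for row in rows:
        pulled.append((source_name, row["id"]))
        elem = ET.Element("item", {"source": source_name})
        for key, value in row.items():
            ET.SubElement(elem, key).text = str(value)
        yield elem


def mediator(*wrapped_sources):
    """Lazily concatenate the sources' XML views into one virtual view."""
    return itertools.chain(*wrapped_sources)


s1 = wrapper("s1", [{"id": 1}, {"id": 2}, {"id": 3}])
s2 = wrapper("s2", [{"id": 10}])
view = mediator(s1, s2)

# The client navigates only to the first two items of the virtual document.
first_two = list(itertools.islice(view, 2))
print(len(first_two), len(pulled))
```

Because nothing is materialized up front, the third row of `s1` and all of `s2` are never touched unless the client keeps navigating.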
New MIX Challenges from Scientific Applications
• Complex Data (S2S)
- SDSC’s Scientific Data Applications (current/planned, e.g. Neurosciences: SciDAC/SDM, NCMIR, NIH BIRN, Earth sciences, ...) show that syntactic/structural integration is insufficient for ...
• Complex Multiple-World Mediation Problems:
- complex, disjoint, seemingly unrelated data
- “hidden semantics” in complex, indirect relationships
=> Semantic (aka Model/Knowledge-Based) Mediation
- lift mediation to the level of conceptual models (CMs)
- use domain experts’ knowledge formalized as rules over CMs
=> Specialized Extensions
- temporal, geospatial, statistical, DQ/accuracy ... operations
=> Extend Mediation Scope and Power via Deductive Rules
A Neuroscience Question
• What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
• [Figure: three wrapped sources, protein localization (NCMIR), morphometry (SYNAPSE), and neurotransmission (SENSELAB), sit below a mediator; the Mediator, the Integrated View Definition, and the Integrated View are each marked "???".]
Example for Formalizing Domain Knowledge: Domain Map (Ontology) for SYNAPSE and NCMIR
• Domain expert knowledge:
- Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines.
- Dendritic spines are ion (calcium) regulating components.
- Spines have ion binding proteins.
- Neurotransmission involves ionic activity (release).
- Ion-binding proteins control ion activity (propagation) in a cell.
- Ion-regulating components of cells affect ionic activity (release).
• [Figure: the domain map and the equivalent Description Logic facts]
• A domain map comprises
- Description Logic facts ...: concepts ("classes") and roles ("associations")
- derived properties ... expressed as logic rules (e.g. F-logic)
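A rough sketch of how such a domain map might be held as facts plus a derived property is given below. The triples paraphrase three of the expert statements above; the role names `has` and `isa`, the concept identifiers, and both helper functions are assumptions for illustration and are not the authors' F-logic encoding.

```python
# (subject, role, object) facts, paraphrasing the expert statements.
FACTS = {
    ("purkinje_cell", "has", "dendrite"),
    ("pyramidal_cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "has", "spine"),
    ("spine", "isa", "ion_regulating_component"),
    ("spine", "has", "ion_binding_protein"),
}


def successors(node, role):
    """Concepts directly linked from node via the given role."""
    return {o for (s, r, o) in FACTS if s == node and r == role}


def has_star(node):
    """Derived property: everything reachable over one or more 'has' links."""
    reached, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for nxt in successors(current, "has"):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return reached


def contains_kind(node, kind):
    """True if some part reachable from node is declared 'isa' kind."""
    return any(kind in successors(part, "isa") for part in has_star(node))


print(sorted(has_star("purkinje_cell")))
print(contains_kind("pyramidal_cell", "ion_regulating_component"))
```

The transitive `has_star` plays the role of a derived property: it is not stored as a fact but follows from the stored facts by a rule.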
Domain Map Refinement
• In addition to registering (“hanging off”) data, a source may also refine the mediator’s domain map ...
• ... a source can register new concepts at the mediator ...
Extended Mediator Architecture for Semantic Mediation
• [Figure: layered architecture, top to bottom]
- USER/Client: CM Queries & Results (exchanged in XML) against the CM (Integrated View)
- Mediator Engine: FL rule proc., LP rule proc., Graph proc., XSB Engine
- Mediator Repository DB: Domain Map (DM), Integrated View Definition (IVD)
- GCM, CM Plug-In, Logic API (capabilities)
- CM-Wrappers exporting CM S1, CM S2, CM S3, each on top of an XML-Wrapper for sources S1, S2, S3
• first results & demos: [SSDBM00] [VLDB00] [ICDE01] [NIH-HBP01]
Integrated View Definition
    DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
    FROM I:protein_label_image[ proteins ->> {Protein};
                                organism -> Organism;
                                anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}],   % from PROLAB
         NAE:neuro_anatomic_entity[name->Anatom;                                                  % from ANATOM
                                   located_in->>{Brain_region}],
         AS..segments..features[name->Feature_name; value->Value].
• provided by the domain expert and mediation engineer
• declarative language (here: F-logic)
• next step: Query Processing
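For readers unfamiliar with F-logic, the view above can be read as a join between the image source (PROLAB) and the anatomy source (ANATOM) on the structure name. The sketch below spells that reading out over plain Python dictionaries; the nested record layout and all sample values are assumptions chosen to mirror the attribute names in the rule, not the sources' actual schemas.

```python
def protein_distribution(images, entities):
    """Yield (protein, organism, brain_region, feature_name, anatom, value)."""
    # Index ANATOM entities by name so the join on Anatom is a lookup.
    regions_by_name = {}
    for entity in entities:
        regions_by_name.setdefault(entity["name"], []).extend(entity["located_in"])

    for image in images:  # I:protein_label_image
        for structure in image["anatomical_structures"]:  # AS
            anatom = structure["name"]
            for region in regions_by_name.get(anatom, []):  # NAE joined on name
                for segment in structure["segments"]:  # AS..segments
                    for feature in segment["features"]:  # ..features
                        for protein in image["proteins"]:
                            yield (protein, image["organism"], region,
                                   feature["name"], anatom, feature["value"])


# Hypothetical sample data.
images = [{
    "proteins": ["protein_x"],
    "organism": "rat",
    "anatomical_structures": [
        {"name": "purkinje_cell",
         "segments": [{"features": [{"name": "amount", "value": 0.7}]}]},
        {"name": "unmapped_structure", "segments": []},
    ],
}]
entities = [{"name": "purkinje_cell", "located_in": ["cerebellum"]}]

rows = list(protein_distribution(images, entities))
print(rows)
```

The structure with no matching anatomy entry drops out of the result, just as an unmatched `Anatom` would fail to satisfy the rule body.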
Mediator System Architecture
• Mediation Services API
• Mediator Layer (components: Deductive Engine, Model Reasoner, Optimizer)
- Source model lifting: domain knowledge reconciliation; model transformation
- Query formulation: user query; integrated view definition
- Source registration: domain knowledge; model & schema; query & computation capabilities
- Query processing: view unfolding; semantic optimization; capability-based rewriting
• Wrapper Layer
- Query interface (down API): SDLIP, SOAP, ...; (subsets of) SQL, X(ML)-Query, CPL, ...; DOM; SRB-based access
- Result delivery interface (up API): SDLIP, SOAP, ...; pull (tuple/set-at-a-time, DOM) vs. push (stream); synchronous/asynchronous; direct data/data reference
• Sources: File Sources, RDB Sources, OODB Sources, HTML Sources, XML Sources, Digital Libraries (Collections)
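Capability-based rewriting, listed under query processing above, can be illustrated as follows: the mediator pushes down only the selections a source has registered as supported and evaluates the rest itself. The `Source` class, its `filterable` attribute set, and the sample rows are hypothetical; the slides do not specify how capabilities are represented.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    rows: list
    filterable: frozenset  # attributes the source itself can select on

    def select(self, conditions):
        """Evaluate equality selections; reject attributes the source cannot filter."""
        unsupported = set(conditions) - self.filterable
        if unsupported:
            raise ValueError(f"{self.name} cannot filter on {sorted(unsupported)}")
        return [r for r in self.rows
                if all(r[k] == v for k, v in conditions.items())]


def rewrite(conditions, source):
    """Split a query into a pushed-down part and a residual mediator-side part."""
    pushed = {k: v for k, v in conditions.items() if k in source.filterable}
    residual = {k: v for k, v in conditions.items() if k not in source.filterable}
    return pushed, residual


def answer(conditions, source):
    """Run the pushed part at the source, then apply the residual filter locally."""
    pushed, residual = rewrite(conditions, source)
    candidates = source.select(pushed)
    return [r for r in candidates
            if all(r[k] == v for k, v in residual.items())]


# Hypothetical source that can only select on 'organism'.
src = Source(
    "s1",
    [{"organism": "rat", "region": "cerebellum"},
     {"organism": "rat", "region": "cortex"},
     {"organism": "mouse", "region": "cerebellum"}],
    frozenset({"organism"}),
)
query = {"organism": "rat", "region": "cerebellum"}
print(rewrite(query, src))
print(answer(query, src))
```

Sending the full query to `src.select` would raise an error, which is the situation the rewriting step is meant to avoid.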
Mediation Services: Source Registration (System Issues)
• Registration dimensions: Source Data Type, Query Capability, Result Delivery, Access Protocol & Transport
• [Figure: values shown across these dimensions: ARC, XPath, XQuery, DOOD, SQL, tree, file, table, SDLIP, Stream, JDBC, SRB, Tuple-at-a-time, Binary for Viewer, Set-at-a-time, HTTP, Selections, SPJ, CORBA, RMI]
Mediation Services: Source Registration (Semantics Issues)
• Domain Map Registration
- provide concept space/ontology
- … as a private object (“myANATOM”)
- … merge with others (give “semantic bridges”)
- … and check for conflicts
• Conceptual Model Registration
- schema: classes, associations, attributes
- domain constraints
- “put data into context” (linking data to the domain map)
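The merge-and-check step of domain map registration might look roughly like the sketch below. It assumes a deliberately simple representation, one `isa` parent per concept, and treats a "semantic bridge" as a mapping from a source concept to a mediator concept; the function name, the conflict rule, and all concept names are hypothetical.

```python
def register_domain_map(mediator_map, source_map, bridges):
    """Merge a source's isa-links into a copy of the mediator's domain map.

    Maps are {concept: parent_concept}; bridges map source concepts to
    mediator concepts. Returns (merged_map, conflicts), where each conflict
    is (concept, existing_parent, proposed_parent).
    """
    def bridged(concept):
        return bridges.get(concept, concept)

    merged = dict(mediator_map)
    conflicts = []
    for child, parent in source_map.items():
        child, parent = bridged(child), bridged(parent)
        existing = merged.get(child)
        if existing is not None and existing != parent:
            conflicts.append((child, existing, parent))  # keep mediator's view
        else:
            merged[child] = parent
    return merged, conflicts


# Hypothetical maps: the source refines 'neuron' and disagrees on one concept.
mediator_map = {"neuron": "cell", "purkinje_cell": "neuron"}
source_map = {"my_spiny_neuron": "my_neuron", "purkinje_cell": "glia"}
bridges = {"my_neuron": "neuron"}

merged, conflicts = register_domain_map(mediator_map, source_map, bridges)
print(merged)
print(conflicts)
```

Returning the conflicts instead of overwriting leaves the decision with whoever curates the mediator's domain map.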
Mediation Services: Client Registration
• [Figure: client types and registered properties: Client; Update Client; Query Client; Thin Result Viewer; Fat Result Viewer; Navigate/Ad-hoc Query Capability; Query on Schema; Derive Before Insert; Check Data; Merge Before Insert; Client-side Processing; Client-side Buffer; Send Full Data; Context Sensitive; Server-side Buffer; Server-Push/Client-Pull]
Other Existing Infrastructure
• Transparent Access to Remote Data Collections: Storage Resource Broker (SRB) and Metadata Catalog (MCAT)
• “Production-Level” Software
• PPDG: interface to LBNL Storage Manager, collection creation, replication management
• Use of manual and automatic wrapper technology (Minerva, Roadrunner, V. Crescenzi, Universita di Roma Tre) => XWrap Elite
SRB and the Particle Physics Data Grid
• [Figure: Wisc Client 1 and Wisc Client 2 issue S-Commands to the SRB Server @Wisc, which talks to the SRB Server @LBL; at LBL, esrb.driver and esrb.server communicate via IPC and pass file caching and file purging requests, Stage(), purge(), and fileStatus(), to HRM and FC, backed by a disk cache and HPSS.]
Year 1 Deliverables
• define interface metadata format (Critchlow)
• extend XWrap to generate wrappers using the interface metadata description instead of requiring human interaction (GT)
• develop a canonical XML-based query and response format as a dynamic interface between query engine and wrappers (Critchlow, GT, SDSC)
- communication via agent protocols? How about using digital library infrastructure (e.g. Simple Digital Library Interoperability Protocol, SDLIP)?
• use extended XWrap to create wrappers for the genomics domain for evaluation (GT)
• extend the SDSC query and metadata architecture to interoperate with the LLNL DataFoundry (SDSC, Critchlow)
- ... interoperation at the wrapper level: Minerva wrappers, XWrap
References
• [ICDE01] Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, April 2001.
• [VLDB00] Model-Based Information Integration in a Neuroscience Mediator System, B. Ludäscher, A. Gupta, M. E. Martone, demonstration track, 26th Intl. Conference on Very Large Databases (VLDB), Cairo, Egypt, September 2000.
• [SSDBM00] Knowledge-Based Integration of Neuroscience Data Sources, A. Gupta, B. Ludäscher, M. E. Martone, 12th Intl. Conference on Scientific and Statistical Database Management (SSDBM), Berlin, Germany, IEEE Computer Society, July 2000.