Future Database Needs SC 32 Study Period February 5, 2007. JTC1 SC32N1633. Bruce Bargmeyer, Lawrence Berkeley National Laboratory University of California Tel: +1 510-495-2905 bebargmeyer@lbl.gov. Topics. Study period purpose New challenges
Future Database Needs SC 32 Study Period February 5, 2007 JTC1 SC32N1633 Bruce Bargmeyer, Lawrence Berkeley National Laboratory University of California Tel: +1 510-495-2905 bebargmeyer@lbl.gov
Topics • Study period purpose • New challenges • A brief tutorial on Semantics and semantic computing • where XMDR fits • Semantic computing technologies • Traditional Data Administration • Some limitations of current relational technologies • Some input from other sources
Future Database Needs Study Period • A one-year study period to identify and understand case studies related to this area. • Bring together a small group of experts in a meeting on “Case Studies on new Database Standards Requirements”. • The workshop would provide input to existing SC32 projects and may provide background material for new proposals for upgrades or for new work within SC32 in time for 2007 SC32 Plenary --Document 32N1451
The Internet Revolution A world wide web of diverse content: The information glut is nothing new. The access to it is astonishing.
Challenge: Find and process non-explicit data • For example: patient data on drugs contains brand names (e.g. Tylenol, Anacin-3, Datril, …); however, we want to study patients taking analgesic agents. • Concepts shown in the hierarchy: Analgesic Agent, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen, and the brand names Datril, Tylenol, Anacin-3
Challenge: Specify and compute across Relations, e.g., within a food web in an Arctic ecosystem An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer. Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
Challenge: Combine Data, Metadata & Concept Systems • Inference search query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003” • Concept system: Contamination (Biological, Radioactive, Chemical); Chemical (mercury, lead, cadmium) • [Figure panels: Concept system, Data, Metadata]
Challenge: Use data from systems that record the same facts with different terms • Registries and repositories holding common content: Dublin Core Registries, Software Component Registries, Database Catalogs, ISO 11179 Registries, UDDI Registries, OASIS/ebXML Registries, CASE Tool Repositories, Ontological Registries • Terms under which the same fact (Country Identifier) is recorded: Table Column, Data Element, Business Specification, XML Tag, Attribute, Business Object, Coverage, Term Hierarchy
Same Fact, Different Terms
Data Element Concept • Name: Country Identifiers • Context: • Definition: • Unique ID: 5769 • Conceptual Domain: • Maintenance Org.: • Steward: • Classification: • Registration Authority: • Others
Conceptual domain: Algeria, Belgium, China, Denmark, Egypt, France, . . . Zimbabwe
Data Elements • Name: • Context: • Definition: • Unique ID: 4572 • Value Domain: • Maintenance Org.: • Steward: • Classification: • Registration Authority: • Others
ISO 3166 English Name: Algeria, Belgium, China, Denmark, Egypt, France, . . . Zimbabwe
ISO 3166 French Name: L'Algérie, Belgique, Chine, Danemark, Egypte, La France, . . . Zimbabwe
ISO 3166 2-Alpha Code: DZ, BE, CN, DK, EG, FR, . . . ZW
ISO 3166 3-Alpha Code: DZA, BEL, CHN, DNK, EGY, FRA, . . . ZWE
ISO 3166 3-Numeric Code: 012, 056, 156, 208, 818, 250, . . . 716
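A minimal sketch of how a registry could let software translate between these code sets, assuming each country is stored once as a value meaning with its permissible value in every value domain. The field names (english, french, alpha2, alpha3, numeric) and the translate function are illustrative assumptions, not part of ISO 11179 or any XMDR interface; the three rows are copied from the slide.

```python
# Rows copied from the slide: one value meaning (a country) per row,
# with its permissible value in five ISO 3166 value domains.
# The dictionary keys are hypothetical names chosen for this sketch.
COUNTRIES = [
    {"english": "Algeria", "french": "L'Algérie", "alpha2": "DZ", "alpha3": "DZA", "numeric": "012"},
    {"english": "Belgium", "french": "Belgique", "alpha2": "BE", "alpha3": "BEL", "numeric": "056"},
    {"english": "China", "french": "Chine", "alpha2": "CN", "alpha3": "CHN", "numeric": "156"},
]


def translate(value, source, target, rows=COUNTRIES):
    """Map a value in one value domain to the same country's value in another."""
    for row in rows:
        if row[source] == value:
            return row[target]
    return None  # value is not registered in the source domain


print(translate("DZ", "alpha2", "numeric"))
print(translate("Chine", "french", "alpha3"))
```

Two systems that record the same fact with different codes can then be joined through the shared value meaning rather than through a hand-built crosswalk for each pair of systems.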
Challenge: Draw information together from a broad range of studies, databases, reports, etc.
Challenge: Gain Common Understanding of meaning between Data Creators and Data Users • [Figure: data creators, information systems (EEA, USGS, DoD, EPA, others) and users each hold text and coded data under subject terms in English (environ, agriculture, climate, human health, industry, tourism, soil, water, air) or Spanish (ambiente, agricultura, tiempo, salud humana, industria, turismo, tierra, agua, aire)] • Goal: a common interpretation of what the data represents
Challenge: Drawing Together Dispersed Data • [Figure: text and coded data dispersed across data creators, information systems (EEA, USGS, DoD, EPA, others) and users, labelled with subject terms in English and Spanish] • Goal: a common interpretation of what the data represents
Semantic Computing • We are laying the foundation to make a quantum leap toward a substantially new way of computing: Semantic Computing • How can we make use of semantic computing? • What do organizations need to do to prepare for and stimulate semantic computing?
Coming: A Semantic Revolution • Searching and ranking • Pattern analysis • Knowledge discovery • Question answering • Reasoning • Semi-automated decision making
The Nub of It • Processing that takes “meaning” into account • Processing based on the relations between things not just computing about the things themselves. • Computing that takes people out of the processing, reducing the human toil • Data access, extraction, mapping, translation, formatting, validation, inferencing, … • Delivering higher-level results that are more helpful for the user’s thought and action
Semantics Challenges • Managing, harmonizing, and vetting semantics is essential to enable enterprise semantic computing • Managing, harmonizing, and vetting semantics is important for traditional data management. • In the past we just covered the basics • Enabling “community intelligence” through efforts similar to Wikipedia, Wiktionary, Flickr
A Brief Tutorial on Semantics • What is meaning? • What are concepts? • What are relations? • What are concept systems? • What is “reasoning”?
Meaning: The Semiotic Triangle • A Symbol (e.g. “Rose”, “ClipArt”) symbolises a Thought or Reference (Concept) • The Thought or Reference refers to a Referent • The Symbol stands for the Referent • Source: C. K. Ogden and I. A. Richards, The Meaning of Meaning.
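Semiotic Triangle: Concepts, Definitions and Signs • A Sign (e.g. “Rose”, “ClipArt”) symbolizes a CONCEPT • The CONCEPT refers to a Referent • The Sign stands for the Referent • [Diagram adds the labels Definition and Sign to the triangle]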
Definitions in the EPA Environmental Data Registry • Mailing Address: http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box • State USPS Code: http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada • Mailing Address State Name: http://www.epa/gov/edr/sw/AdministeredItem#StateName The name of the state where mail is delivered
Computable Meaning • Relations such as rdfs:subClassOf, owl:equivalentClass, and owl:disjointWith can be stated about a CONCEPT in the semiotic triangle (sign “Rose”, “ClipArt”; referent) • If “rose” is owl:disjointWith “daffodil”, then a computer can determine that an assertion is invalid if it states that a rose is also a daffodil (e.g., in a knowledge base).
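The disjointness check described on this slide can be sketched in a few lines. This is only an illustration of the idea, not an OWL reasoner: the class names, the individuals flower1 and flower2, and the find_violations function are all hypothetical.

```python
# Hypothetical disjointness axioms: each frozenset is a pair of classes
# that may not share an individual (the role played by owl:disjointWith).
DISJOINT = {frozenset({"Rose", "Daffodil"})}


def find_violations(assertions, disjoint_pairs):
    """Return (individual, class_a, class_b) for every broken disjointness axiom."""
    types = {}
    for individual, cls in assertions:
        types.setdefault(individual, set()).add(cls)
    violations = []
    for individual, classes in types.items():
        for pair in disjoint_pairs:
            if pair <= classes:  # the individual is asserted to be in both classes
                class_a, class_b = sorted(pair)
                violations.append((individual, class_a, class_b))
    return sorted(violations)


# Hypothetical knowledge base: flower1 is asserted to be both a rose and a daffodil.
assertions = [("flower1", "Rose"), ("flower1", "Daffodil"), ("flower2", "Rose")]
print(find_violations(assertions, DISJOINT))
```

The point is that the invalid assertion is detected from the declared relation between the concepts, without any rule written specifically about flower1.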
What are Relations? WaterBody Relation Merced River Fletcher Creek isA isA Merced Lake Merced Lake Fletcher Creek Concepts and relations can be represented as nodes and edges in formal graph structures, e.g., “is-a” hierarchies.
Concept Systems have Nodes and may have Relations • Nodes represent concepts • Lines (arcs) represent relations • Concept systems are concepts and the relations between them. • Concept systems can be represented & queried as graphs • [Diagram: example graph with nodes A, 1, 2, a, b, c, d]
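As a rough sketch of "represented and queried as graphs", a concept system can be held as labelled edges between concept nodes. The ConceptSystem class, the relation name "narrower", and the particular edges between the slide's nodes A, 1, 2, a, b, c, d are assumptions made for illustration; the slide does not say how its nodes are connected.

```python
from collections import defaultdict


class ConceptSystem:
    """Concepts as nodes, named relations as labelled directed edges."""

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(set)  # (source, relation) -> set of targets

    def add_relation(self, source, relation, target):
        self.nodes.update((source, target))
        self.edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Concepts reachable from `source` by one edge labelled `relation`."""
        return sorted(self.edges.get((source, relation), set()))


# Assumed wiring of the slide's example nodes.
cs = ConceptSystem()
for parent, child in [("A", "1"), ("A", "2"), ("1", "a"), ("1", "b"), ("2", "c"), ("2", "d")]:
    cs.add_relation(parent, "narrower", child)

print(cs.related("A", "narrower"))
print(cs.related("2", "narrower"))
```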
A More Complex Concept Graph • Concept lattice of inland water features • [Figure: lattice relating the attributes Linear, Non-linear, Large, Small, Deep, Shallow, Natural, Artificial, Flowing, Stagnant (and combinations such as Large linear, Small linear, Large non-linear, Small non-linear) to the features River, Stream, Canal, Reservoir, Lake, Marsh, Pond] • From Supervaluation Semantics for an Inland Water Feature Ontology, Paulo Santos and Brandon Bennett http://ijcai.org/papers/1187.pdf#search=%22terminology%20water%20ontology%22
Types of Concept System Graph Structures • Directed Acyclic Graph • Tree • Bipartite Graph • Partial Order Graph • Partial Order Tree • Clique • Powerset of 3 element set • Ordered Tree • Compound Graph • Faceted Classification
Graph Taxonomy • [Diagram: taxonomy of graph types covering Graph, Directed Graph, Undirected Graph, Directed Acyclic Graph, Clique, Bipartite Graph, Partial Order Graph, Faceted Classification, Lattice, Partial Order Tree, Tree, Ordered Tree] • Note: not all bipartite graphs are undirected.
What Kind of Relations are There? Lots! Relationship class: A particular type of connection existing between people related to or having dealings with each other. • acquaintanceOf - A person having more than slight or superficial knowledge of this person but short of friendship. • ambivalentOf - A person towards whom this person has mixed feelings or emotions. • ancestorOf - A person who is a descendant of this person. • antagonistOf - A person who opposes and contends against this person. • apprenticeTo - A person to whom this person serves as a trusted counselor or teacher. • childOf - A person who was given birth to or nurtured and raised by this person. • closeFriendOf - A person who shares a close mutual friendship with this person. • collaboratesWith - A person who works towards a common goal with this person. • …
Example of relations in a food web in an Arctic ecosystem An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer. Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
Ontologies are a type of Concept System • Ontology: explicit formal specifications of the terms in the domain and relations among them (Gruber 1993) • An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them. • Why would someone want to develop an ontology? Some of the reasons are: • To share common understanding of the structure of information among people or software agents • To enable reuse of domain knowledge • To make domain assumptions explicit • To separate domain knowledge from the operational knowledge • To analyze domain knowledge http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html
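What is Reasoning? Inference • Infectious Disease is-a Disease • Chronic Disease is-a Disease • Polio is-a Infectious Disease • Smallpox is-a Infectious Disease • Heart disease is-a Chronic Disease • Diabetes is-a Chronic Disease • [Diagram: a separate arrow style signifies an inferred is-a relationship, e.g., Polio is-a Disease]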
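The inferred is-a links on this slide follow from treating is-a as transitive. A minimal sketch, assuming the stated links are stored as a child-to-parents mapping; the names IS_A, ancestors, and inferred_is_a are chosen here for illustration.

```python
# Stated (explicit) is-a links from the slide, child -> set of direct parents.
IS_A = {
    "Infectious Disease": {"Disease"},
    "Chronic Disease": {"Disease"},
    "Polio": {"Infectious Disease"},
    "Smallpox": {"Infectious Disease"},
    "Heart disease": {"Chronic Disease"},
    "Diabetes": {"Chronic Disease"},
}


def ancestors(concept, is_a):
    """All concepts reachable by following is-a links upward."""
    seen = set()
    stack = list(is_a.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, ()))
    return seen


def inferred_is_a(is_a):
    """is-a pairs that follow by transitivity but were not stated directly."""
    return {(c, a) for c in is_a for a in ancestors(c, is_a) - is_a[c]}


print(sorted(inferred_is_a(IS_A)))
```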
Reasoning: Taxonomies & partonomies can be used to support inference queries • Alameda County part-of California • Santa Clara County part-of California • Berkeley part-of Alameda County • Oakland part-of Alameda County • San Jose part-of Santa Clara County • Santa Clara part-of Santa Clara County • E.g., if a database contains information on events by city, we could query that database for events that happened in a particular county or state, even though the event data does not contain explicit state or county codes.
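The county and state query described above can be sketched by walking the part-of links upward from each event's city. The part-of facts are taken from the slide; the EVENTS records and the function names are hypothetical.

```python
# part-of links from the slide, place -> the region it is part of.
PART_OF = {
    "Alameda County": "California",
    "Santa Clara County": "California",
    "Berkeley": "Alameda County",
    "Oakland": "Alameda County",
    "San Jose": "Santa Clara County",
    "Santa Clara": "Santa Clara County",
}

# Hypothetical event records that store only a city name.
EVENTS = [("event-1", "Berkeley"), ("event-2", "San Jose"), ("event-3", "Oakland")]


def is_within(place, region, part_of):
    """True if `place` equals `region` or lies inside it via part-of links."""
    while place is not None:
        if place == region:
            return True
        place = part_of.get(place)
    return False


def events_in(region, events, part_of):
    """Events whose city lies within `region`, although no region code is stored."""
    return [name for name, city in events if is_within(city, region, part_of)]


print(events_in("Alameda County", EVENTS, PART_OF))
print(events_in("California", EVENTS, PART_OF))
```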
Reasoning: Relationship metadata can be used to infer non-explicit data • For example: • (1) patient data on drugs currently being taken contains brand names (e.g. Tylenol, Anacin-3, Datril, …); • (2) a concept system connects different drug types and names with one another (via is-a, part-of, etc. relationships); • (3) so patient data can be linked and searched by inferred terms like “acetaminophen” and “analgesic” as well as trade names explicitly stored as text strings in the database • Concepts shown in the hierarchy: Analgesic Agent, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen, and the brand names Datril, Tylenol, Anacin-3
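A sketch of steps (1) to (3): expand the query term to every narrower concept, then match the stored brand names against that expanded set. The exact parent and child wiring of the hierarchy is an assumption read from the slide's layout, and the patient records are hypothetical.

```python
# Assumed broader -> narrower links, based on the concepts named on the slide.
NARROWER = {
    "Analgesic Agent": ["Non-Narcotic Analgesic"],
    "Non-Narcotic Analgesic": ["Analgesic and Antipyretic", "Nonsteroidal Antiinflammatory Drug"],
    "Analgesic and Antipyretic": ["Acetaminophen"],
    "Acetaminophen": ["Datril", "Tylenol", "Anacin-3"],
}

# Hypothetical patient records: only brand names are stored.
PATIENT_DRUGS = [("patient-1", "Tylenol"), ("patient-2", "Datril"), ("patient-3", "SomeOtherBrand")]


def expand(term, narrower):
    """The term itself plus every concept below it in the hierarchy."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(narrower.get(current, []))
    return found


def patients_taking(term, records, narrower):
    """Patients whose recorded drug falls under `term`, directly or by inference."""
    terms = expand(term, narrower)
    return sorted({patient for patient, drug in records if drug in terms})


print(patients_taking("Analgesic Agent", PATIENT_DRUGS, NARROWER))
```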
Reasoning: Least Common Ancestor Query • What is the least common ancestor concept in the NCI Thesaurus for Acetaminophen and Morphine Sulfate? (answer = Analgesic Agent) • Concepts shown: Analgesic Agent, Opioid, Opiate, Morphine Sulfate, Codeine Phosphate, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen
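One way to sketch a least common ancestor query: compute each concept's ancestors with their distances, then pick the shared ancestor with the smallest combined distance. The parent links below are an assumed reading of the slide's diagram, not an extract of the NCI Thesaurus, and "smallest combined distance" is one reasonable definition of "least" among several.

```python
from collections import deque

# Assumed child -> parents links for the concepts named on the slide.
PARENTS = {
    "Opioid": ["Analgesic Agent"],
    "Non-Narcotic Analgesic": ["Analgesic Agent"],
    "Opiate": ["Opioid"],
    "Morphine Sulfate": ["Opiate"],
    "Codeine Phosphate": ["Opiate"],
    "Analgesic and Antipyretic": ["Non-Narcotic Analgesic"],
    "Nonsteroidal Antiinflammatory Drug": ["Non-Narcotic Analgesic"],
    "Acetaminophen": ["Analgesic and Antipyretic"],
}


def ancestor_distances(concept, parents):
    """Breadth-first walk upward: ancestor -> number of links from `concept`."""
    dist = {concept: 0}
    queue = deque([concept])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def least_common_ancestor(a, b, parents):
    """Shared ancestor with the smallest combined distance, or None if none exists."""
    dist_a = ancestor_distances(a, parents)
    dist_b = ancestor_distances(b, parents)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    return min(common, key=lambda c: (dist_a[c] + dist_b[c], c))


print(least_common_ancestor("Acetaminophen", "Morphine Sulfate", PARENTS))
```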
Reasoning: Example “sibling” queries: concepts that share a common ancestor • Environmental: • "siblings" of Wetland (in NASA SWEET ontology) • Health • Siblings of ERK1 finds all 700+ other kinase enzymes • Siblings of Novastatin finds all other statins • 11179 Metadata • Sibling values in an enumerated value domain
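A sibling query can be sketched as "every other concept that shares at least one direct parent". The example uses the 11179 case from the last bullet, with sibling values in an enumerated value domain; the codes come from the ISO 3166 slide, while the PARENTS mapping and the siblings function are illustrative assumptions.

```python
# concept -> set of direct parents; here each permissible value's parent
# is the enumerated value domain it belongs to.
PARENTS = {
    "DZ": {"ISO 3166 2-Alpha Code"},
    "BE": {"ISO 3166 2-Alpha Code"},
    "CN": {"ISO 3166 2-Alpha Code"},
    "DZA": {"ISO 3166 3-Alpha Code"},
    "BEL": {"ISO 3166 3-Alpha Code"},
}


def siblings(concept, parents):
    """Other concepts that share at least one direct parent with `concept`."""
    own = parents.get(concept, set())
    return sorted(c for c, ps in parents.items() if c != concept and ps & own)


print(siblings("DZ", PARENTS))
print(siblings("BEL", PARENTS))
```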
Reasoning: More complex “sibling” queries: concepts with multiple ancestors • Health • Find all the siblings of Breast Neoplasm (ancestors shown: site neoplasms, breast disorders; siblings shown: Non-Neoplastic Breast Disorder, Eye neoplasm, Respiratory System neoplasm) • Environmental • Find all chemicals that are a • carcinogen (cause cancer) and • toxin (are poisonous) and • teratogenic (cause birth defects)
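The environmental query amounts to finding concepts that sit under all of several ancestors at once. A minimal sketch with placeholder chemicals; "chemical A", "chemical B", and "chemical C" and their classifications are invented purely to exercise the function and say nothing about real substances.

```python
# Hypothetical classification: concept -> set of categories it belongs to.
CLASSIFIED_AS = {
    "chemical A": {"carcinogen", "toxin", "teratogen"},
    "chemical B": {"carcinogen", "toxin"},
    "chemical C": {"toxin"},
}


def members_of_all(categories, classified_as):
    """Concepts classified under every one of the requested categories."""
    wanted = set(categories)
    return sorted(c for c, cats in classified_as.items() if wanted <= cats)


print(members_of_all(["carcinogen", "toxin", "teratogen"], CLASSIFIED_AS))
print(members_of_all(["carcinogen", "toxin"], CLASSIFIED_AS))
```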
End of Tutorial about concept systems What are the “Database Language” challenges?
Metadata Registries & Database Technologies – Which Does What? Traditional Data Registries (11179 Edition 2) • Register metadata which describes data—in databases, applications, XML Schemas, data models, flat files, paper • Assist in harmonizing, standardizing, and vetting metadata • Assist data engineering • Provide a source of well formed data designs for system designers • Record reporting requirements • Assist data generation, by describing the meaning of data entry fields and the potential valid values • Register provenance information that can be provided to end users of data • Assist with information discovery by pointing to systems where particular data is maintained.
Traditional MDR: Manage Code Sets
Data Element Concept • Name: Country Identifiers • Context: • Definition: • Unique ID: 5769 • Conceptual Domain: • Maintenance Org.: • Steward: • Classification: • Registration Authority: • Others
Conceptual domain: Algeria, Belgium, China, Denmark, Egypt, France, . . . Zimbabwe
Data Elements • Name: • Context: • Definition: • Unique ID: 4572 • Value Domain: • Maintenance Org.: • Steward: • Classification: • Registration Authority: • Others
ISO 3166 English Name: Algeria, Belgium, China, Denmark, Egypt, France, . . . Zimbabwe
ISO 3166 French Name: L'Algérie, Belgique, Chine, Danemark, Egypte, La France, . . . Zimbabwe
ISO 3166 2-Alpha Code: DZ, BE, CN, DK, EG, FR, . . . ZW
ISO 3166 3-Alpha Code: DZA, BEL, CHN, DNK, EGY, FRA, . . . ZWE
ISO 3166 3-Numeric Code: 012, 056, 156, 208, 818, 250, . . . 716
What Can XMDR Do? Support a new generation of semantic computing • Concept system management • Harmonizing and vetting concept systems • Linkage of concept systems to data • Interrelation of multiple concept systems • Grounding ontologies and RDF in agreed upon semantics • Reasoning across XMDR content (concept systems and metadata) • Provision of Semantic Services
We are trying to manage semantics in an increasingly complex content space • Structured data • Semi-structured data • Unstructured data • Text • Pictographic • Graphics • Multimedia • Voice • Video
Case Study • Combining Concept Systems, Data, and Metadata to answer queries.
Linking Concepts: Text Document
Title 40--Protection of Environment CHAPTER I--ENVIRONMENTAL PROTECTION AGENCY PART 141--NATIONAL PRIMARY DRINKING WATER REGULATIONS § 141.62 40 CFR Ch. I (7–1–02 Edition)
§ 141.62 Maximum contaminant levels for inorganic contaminants. (a) [Reserved] (b) The maximum contaminant levels for inorganic contaminants specified in paragraphs (b) (2)–(6), (b)(10), and (b) (11)–(16) of this section apply to community water systems and non-transient, non-community water systems. The maximum contaminant level specified in paragraph (b)(1) of this section only applies to community water systems. The maximum contaminant levels specified in (b)(7), (b)(8), and (b)(9) of this section apply to community water systems; non-transient, noncommunity water systems; and transient non-community water systems.
Contaminant: MCL (mg/l)
(1) Fluoride: 4.0
(2) Asbestos: 7 Million Fibers/liter (longer than 10 μm)
(3) Barium: 2
(4) Cadmium: 0.005
(5) Chromium: 0.1
(6) Mercury: 0.002
(7) Nitrate: 10 (as Nitrogen)
Thesaurus Concept System (From GEMET) • Term: Chemical Contamination • Definition: The addition or presence of chemicals to, or in, another substance to such a degree as to render it unfit for its intended purpose. • Broader Term: contamination • Narrower Terms: cadmium contamination, lead contamination, mercury contamination • Related Terms: chemical pollutant, chemical pollution • Deutsch: Chemische Verunreinigung • English (US): chemical contamination • Español: contaminación química • SOURCE: General Multi-Lingual Environmental Thesaurus (GEMET)
Concept System (Thesaurus) • Contamination: Biological, Radioactive, Chemical • Chemical: cadmium, lead, mercury • Related terms: chemical pollutant, chemical pollution
Chemicals in EPA Environmental Data Registry • [Screenshot: Environmental Data Registry]
Data • [Figure: monitoring stations (labelled X, B, A) and measurements for Merced River, Fletcher Creek, and Merced Lake]
Metadata • [Figure: metadata describing contaminants]
Relations among Inland Bodies of Water • Fletcher Creek feeds into Merced River • [Diagram: Fletcher Creek, Merced River, and Merced Lake connected by “feeds into” and “fed from” relations]
Combining Data, Metadata & Concept Systems • Inference search query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 2 parts per billion between December 2001 and March 2003” • Concept system: Contamination (Biological, Radioactive, Chemical); Chemical (mercury, lead, cadmium) • [Figure panels: Concept system, Data, Metadata]
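The inference search query can be sketched by combining three lookups: a graph walk over "feeds into" relations, an expansion of "Chemical" through the concept system, and a filter over measurement data. All measurement rows are hypothetical, the link from Merced River to Merced Lake is an assumption read from the diagram, and the function names are invented for this sketch.

```python
from datetime import date

# Relations among water bodies. Only the first link is stated in the text;
# the second is an assumed reading of the diagram.
FEEDS_INTO = {"Fletcher Creek": ["Merced River"], "Merced River": ["Merced Lake"]}

# Concept system from the slide.
NARROWER = {
    "Contamination": ["Biological", "Radioactive", "Chemical"],
    "Chemical": ["mercury", "lead", "cadmium"],
}

# Hypothetical measurements: (water body, contaminant, date, parts per billion).
MEASUREMENTS = [
    ("Merced River", "mercury", date(2002, 5, 1), 12.0),
    ("Merced Lake", "lead", date(2002, 7, 15), 1.5),
    ("Merced Lake", "cadmium", date(2004, 1, 10), 25.0),
    ("Fletcher Creek", "mercury", date(2002, 6, 1), 40.0),
]


def reachable(start, edges):
    """Every node reachable from `start` along `edges`, excluding `start` itself."""
    found, stack = set(), list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(edges.get(node, []))
    return found


def contaminated_downstream(start, concept, threshold, first, last):
    """Water bodies downstream of `start` with a qualifying measurement."""
    bodies = reachable(start, FEEDS_INTO)
    contaminants = reachable(concept, NARROWER)
    return sorted({
        body for body, contaminant, day, value in MEASUREMENTS
        if body in bodies and contaminant in contaminants
        and first <= day <= last and value > threshold
    })


print(contaminated_downstream("Fletcher Creek", "Chemical", 2.0,
                              date(2001, 12, 1), date(2003, 3, 31)))
```

Neither "downstream" nor "chemical contamination" appears in the measurement rows; both are resolved through relations and the concept system before the data is filtered, which is the combination the slide describes.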