Ocean Biodiversity Informatics. A Semantic Modelling Approach to Biological Parameter Interoperability. Roy Lowry & Laura Bird, British Oceanographic Data Centre. Pieter Haaring, RIKZ, Rijkswaterstaat, The Netherlands.
Presentation Overview • The nature of the problem • Dictionaries and data models • The starting position • Manual mapping • Automation through semantic matching • From dictionary to semantic model • Mapping semantic models • Semantic model applications • Conclusions and lessons learned
The Nature of the Problem • BODC and Rijkswaterstaat both have marine databases holding a wide range of physical, chemical and biological parameters • Both were to be included in pan-European metadatabases (EDIOS and SEA-SEARCH CDI) using a common discovery vocabulary • BODC set up the vocabulary and so included a mapping to the BODC Parameter Dictionary • The problem then arose of how to provide a similar mapping for Rijkswaterstaat • If the Rijkswaterstaat data markup vocabulary could be mapped to the BODC Parameter Dictionary, then the existing BODC discovery vocabulary mapping could be reused
Dictionaries and Data Models • BODC systems have roots in the GF3 model, which means: • Data values are linked to a parameter code • Parameter code is defined in a Parameter Dictionary • The parameter code specifies more than one metadata item for the data value • For chemical and biological data ‘more than one’ becomes ‘a lot’
Dictionaries and Data Models • Rijkswaterstaat uses data models (DONAR becoming WADI) • Measurements are accompanied by attributes containing specific atomic metadata items • Each attribute is populated from a controlled vocabulary • DONAR constrains attribute term combinations using a ‘parameter dictionary’ concept • WADI reduces maintenance overheads by allowing any combination
The Starting Position • BODC • Parameter Codes defined by two plain-text fields • Related semantic information not necessarily in the same field • Fields would not concatenate sensibly • OK for humans, but not for machines • Rijkswaterstaat • Consistently located semantics • Metadata fields that concatenate sensibly in both Dutch and English
Manual Mapping • Manual mapping protocol • For each entry in the Rijkswaterstaat ‘dictionary’ spreadsheet • Look up code with identical meaning using BODC Dictionary search tools (Access Filter by Form) • If found • Copy BODC code from Access and paste into spreadsheet • Else • Prepare dictionary update record and submit for QA and load • Error prone and 500 entries is pushing the limit of human endurance!
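The protocol on this slide can be sketched as a loop. This is only an illustration: the dictionary lookup stands in for the human judgement of "identical meaning", and the function name, the sample terms and the code `CODE0001` are hypothetical, not real BODC or Rijkswaterstaat content.

```python
def map_entries(rws_entries, bodc_dictionary):
    """Follow the manual protocol: look up each entry, else queue an update."""
    mapped = {}            # Rijkswaterstaat term -> BODC code
    pending_updates = []   # terms needing a new BODC dictionary record
    for term in rws_entries:
        # Stand-in for searching the BODC Dictionary for an identical meaning.
        code = bodc_dictionary.get(term)
        if code is not None:
            mapped[term] = code           # "copy BODC code into spreadsheet"
        else:
            pending_updates.append(term)  # "prepare dictionary update record"
    return mapped, pending_updates


# Hypothetical sample data, for illustration only.
bodc_dictionary = {"Abundance of Calanus": "CODE0001"}
rws_entries = ["Abundance of Calanus", "Biomass of Calanus"]

mapped, pending_updates = map_entries(rws_entries, bodc_dictionary)
print(mapped)
print(pending_updates)
```

Every iteration of the real protocol is a human decision, which is why the slide flags it as error prone beyond a few hundred entries.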
Semantic Matching • When code lists run into thousands, automation is required • Rijkswaterstaat developed a semantic matching tool to pull matching terms (preferably one) from the BODC dictionary • Defeated by the lack of standardisation in the BODC plain-text fields e.g. • Calanus abundance • Abundance of Calanus • Calanus count • Number of Calanus
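A small sketch of why plain-text matching struggles with the four phrasings on this slide. The normalisation and tokenisation rules here are assumptions chosen for illustration, not the behaviour of the Rijkswaterstaat tool.

```python
variants = [
    "Calanus abundance",
    "Abundance of Calanus",
    "Calanus count",
    "Number of Calanus",
]


def normalise(text):
    """Lower-case and trim; the simplest possible text normalisation."""
    return text.strip().lower()


def tokens(text):
    """Order-independent word set, ignoring the filler word 'of'."""
    return frozenset(normalise(text).split()) - {"of"}


# Exact matching sees four different strings for one concept.
distinct_strings = {normalise(v) for v in variants}

# Ignoring word order helps a little, but synonyms still do not match.
distinct_token_sets = {tokens(v) for v in variants}

print(len(distinct_strings))
print(len(distinct_token_sets))
```

Word-order tolerance merges only the first two phrasings; "count" and "number" would need a synonym list or thesaurus before they could be matched to "abundance".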
Dictionary to Semantic Model • Became apparent that the BODC Dictionary required significant improvement if it was to support mapping automation • Development strategy was to model the parameter code in the same way DONAR models a measurement • Semantic model developed to cover all codes in BODC Dictionary
Dictionary to Semantic Model • Semantic model developed from DONAR, with more semantic elements so that information need not be shoe-horned into unsuitable fields • The principle that semantic elements may be combined automatically to produce text descriptions was maintained • Currently implemented as three sub-models • An element superset will ultimately be created as a single model
Dictionary to Semantic Model • Biological sub-model semantic elements • Parameter (Abundance, Biomass) • Taxon_code (ITIS code) • Taxon_name • Taxon_subgroup (gender, size, stage) • Parameter_compartment_relationship (per unit volume of the, per unit area of the) • Compartment (water column, bed, sediment) • Sample_preparation • Analysis • Data_processing • Needs further refinement e.g. subdivide Taxon_subgroup
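As a minimal sketch, the biological sub-model elements listed above could be held in a record whose fields are combined automatically into a text description. The field names follow the slide; the class name, the concatenation template and the placeholder ITIS code `0000000` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BiologicalParameter:
    # Semantic elements named on the slide.
    parameter: str
    taxon_code: str
    taxon_name: str
    parameter_compartment_relationship: str
    compartment: str
    taxon_subgroup: Optional[str] = None
    sample_preparation: Optional[str] = None
    analysis: Optional[str] = None
    data_processing: Optional[str] = None

    def describe(self) -> str:
        """Concatenate elements into a description (template is an assumption)."""
        parts = [f"{self.parameter} of {self.taxon_name} (ITIS: {self.taxon_code})"]
        if self.taxon_subgroup:
            parts.append(f"[{self.taxon_subgroup}]")
        parts.append(f"{self.parameter_compartment_relationship} {self.compartment}")
        if self.analysis:
            parts.append(f"by {self.analysis}")
        return " ".join(parts)


# Placeholder taxon code, not a real ITIS identifier.
example = BiologicalParameter(
    parameter="Abundance",
    taxon_code="0000000",
    taxon_name="Calanus",
    taxon_subgroup="female",
    parameter_compartment_relationship="per unit volume of the",
    compartment="water column",
)
description = example.describe()
print(description)
```

Because every description is generated from the same elements in the same order, the four inconsistent phrasings of "Calanus abundance" cannot arise.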
Mapping Semantic Models • Two-stage process • First map the semantic elements • DONAR Parameter = BODC Parameter + Parameter_compartment_relationship • DONAR Compartment = BODC Compartment • Then map vocabularies for mapped elements • Surface water = water column • Relational database designers will recognise this as normalisation
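The two stages can be sketched as follows. The element correspondences and the "Surface water = water column" entry come from the slide; the DONAR term "Abundance per unit volume" and all names in the code are hypothetical.

```python
# Stage 2 vocabulary maps, one per mapped element.
# A DONAR Parameter term splits into BODC Parameter plus relationship.
PARAMETER_VOCAB = {
    "Abundance per unit volume": ("Abundance", "per unit volume of the"),
}
COMPARTMENT_VOCAB = {
    "Surface water": "water column",  # equivalence given on the slide
}


def donar_to_bodc(donar):
    """Stage 1 fixes which elements correspond; stage 2 translates the terms."""
    parameter, relationship = PARAMETER_VOCAB[donar["parameter"]]
    return {
        "parameter": parameter,
        "parameter_compartment_relationship": relationship,
        "compartment": COMPARTMENT_VOCAB[donar["compartment"]],
    }


bodc_elements = donar_to_bodc(
    {"parameter": "Abundance per unit volume", "compartment": "Surface water"}
)
print(bodc_elements)
```

Each vocabulary map is small and has simple semantics, so it needs far fewer look-ups than matching every full parameter description against every other.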
Mapping Semantic Models • Number of ‘look-ups’ required is reduced by an order of magnitude • Vocabulary elements have simple semantics so automation is possible • Approximately 90% of the Rijkswaterstaat to BODC mapping accomplished by a single SQL statement • Straightforward extension of vocabulary maps (different names for same thing) sorted out most of the rest • Thesauri could help reduce the need for this
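A sketch of how one SQL statement can perform such a mapping once both sides are expressed as semantic elements. The slide does not give the actual statement or schema, so every table, column, code and row below is hypothetical; the example uses an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rws_dictionary (rws_code TEXT, parameter TEXT, compartment TEXT, taxon TEXT);
CREATE TABLE parameter_map (donar_parameter TEXT, bodc_parameter TEXT, bodc_relationship TEXT);
CREATE TABLE compartment_map (donar_compartment TEXT, bodc_compartment TEXT);
CREATE TABLE bodc_dictionary (bodc_code TEXT, parameter TEXT, relationship TEXT,
                              compartment TEXT, taxon_name TEXT);

INSERT INTO rws_dictionary VALUES
    ('RWS1', 'Abundance per unit volume', 'Surface water', 'Calanus'),
    ('RWS2', 'Abundance per unit volume', 'Sea bed', 'Calanus');
INSERT INTO parameter_map VALUES
    ('Abundance per unit volume', 'Abundance', 'per unit volume of the');
INSERT INTO compartment_map VALUES ('Surface water', 'water column');
INSERT INTO bodc_dictionary VALUES
    ('BODC0001', 'Abundance', 'per unit volume of the', 'water column', 'Calanus');
""")

# One statement: translate each element through its vocabulary map, then
# join on the translated elements. LEFT JOINs keep unmatched entries visible.
MAPPING_SQL = """
SELECT r.rws_code, b.bodc_code
FROM rws_dictionary AS r
LEFT JOIN parameter_map AS p ON p.donar_parameter = r.parameter
LEFT JOIN compartment_map AS c ON c.donar_compartment = r.compartment
LEFT JOIN bodc_dictionary AS b
       ON b.parameter = p.bodc_parameter
      AND b.relationship = p.bodc_relationship
      AND b.compartment = c.bodc_compartment
      AND b.taxon_name = r.taxon
ORDER BY r.rws_code
"""
rows = conn.execute(MAPPING_SQL).fetchall()
print(rows)
```

Rows with no BODC code show where the vocabulary maps need extending: here 'Sea bed' has no entry in `compartment_map`, which is the "different names for the same thing" case the slide describes.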
Mapping Semantic Models • ‘Hard Core’ problems required manual resolution • Unclear or ambiguous semantics in Rijkswaterstaat element vocabularies (residual beta) • Problems with Dutch to English translation • Some mapping errors were detected • Caused by homonyms (Branchiura) • Emphasises the need for more than just a name for a taxon (reference or ITIS code)
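The homonym problem can be illustrated with a short sketch. Branchiura is both a group of crustaceans and a genus of oligochaete worms; the codes below are placeholders, not real ITIS identifiers.

```python
# Two different taxa that share one name (codes are hypothetical placeholders).
taxa = [
    {"name": "Branchiura", "code": "PLACEHOLDER-1", "group": "crustacean"},
    {"name": "Branchiura", "code": "PLACEHOLDER-2", "group": "oligochaete worm"},
]

# Keyed by name alone, the second entry silently overwrites the first.
by_name = {taxon["name"]: taxon for taxon in taxa}

# Keyed by an unambiguous code, both taxa survive.
by_code = {taxon["code"]: taxon for taxon in taxa}

print(len(by_name), len(by_code))
```

A name-only join has no way to report this collision, which is why a reference or taxon code needs to travel alongside the name.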
Semantic Model Applications • Semantic modelling is a lowest common denominator approach to metadata • This is what makes it good for mapping • The approach also offers the basis for user-controlled data discovery and interoperability • User chooses the semantic element subset • User data selection interaction based on the subset vocabulary • Automated interoperability requires more sophistication (thesauri, ontologies)
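One way to picture user-controlled discovery is as faceted selection over the semantic elements. This is a sketch under assumed names and sample records, not a description of an existing BODC or Rijkswaterstaat system.

```python
# Hypothetical parameter records expressed as semantic elements.
records = [
    {"parameter": "Abundance", "taxon_name": "Calanus", "compartment": "water column"},
    {"parameter": "Biomass", "taxon_name": "Calanus", "compartment": "water column"},
    {"parameter": "Abundance", "taxon_name": "Branchiura", "compartment": "sediment"},
]


def subset_vocabulary(records, elements):
    """Return the distinct terms available for each element the user chose."""
    return {e: sorted({r[e] for r in records}) for e in elements}


def select(records, **criteria):
    """Keep records whose elements equal every chosen term."""
    return [r for r in records if all(r[k] == v for k, v in criteria.items())]


# The user chooses the element subset, then picks terms from its vocabulary.
vocabulary = subset_vocabulary(records, ["parameter", "compartment"])
hits = select(records, parameter="Abundance", compartment="water column")
print(vocabulary)
print(hits)
```

Exact-term selection like this covers discovery; matching equivalent or broader terms automatically would need the thesauri and ontologies the slide mentions.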
Conclusions • Don’t even think about manual mapping of large parameter dictionaries • 99% of a map is completed in the first 10% of the time • More standardisation means fewer errors and problems • Semantic model vocabularies need ontologies and thesauri to achieve their full interoperability potential
Conclusions • Semantic modelling works for mappings between dictionaries and data models • It also has great potential for parameter discovery and interoperability