
Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach

Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú {dani, casanova, milidiu@inf.puc-rio.br} Pontifical Catholic University of Rio de Janeiro (PUC-Rio) Department of Informatics.



Presentation Transcript


  1. Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú {dani, casanova, milidiu@inf.puc-rio.br} Pontifical Catholic University of Rio de Janeiro (PUC-Rio) Department of Informatics

  2. Summary • Motivation • Gazetteers & Thesauri • Gazetteer Integration • Instance-based Thesauri Mapping • Conceptual and Statistical Model • Experiments with Geographic Data • Conclusions

  3. Motivation • Goal: gazetteer integration • how to migrate entries from gazetteer GB to gazetteer GA • Problems • Duplicate entry elimination: gazetteers may “overlap”, which requires detecting and eliminating duplicates • Reclassification of migrated entries: gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA

  4. Summary • Motivation • Gazetteers & Thesauri • Gazetteer Integration • Instance-based Thesauri Mapping • Conceptual and Statistical Model • Experiments with Geographic Data • Conclusions

  5. Gazetteers & Thesauri • Gazetteer • a gazetteer is “a geographical dictionary (as at the back of an atlas) containing a list of geographic names, together with their geographic locations and other descriptive information” [WordNet 2005]. • a gazetteer is a catalog of geographic features, where each entry has as attributes: • a unique ID • a unique type: a term taken from a feature type thesaurus • a name • optionally, a location: an approximation of the feature footprint • Reference: WordNet (2005), “WordNet - a lexical database for the English language”. Cognitive Science Laboratory, Princeton University, Princeton, NJ, USA. Available at: http://wordnet.princeton.edu
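
The entry attributes listed on this slide can be pictured as a small record type. The following is only a sketch: the class name, the field names, the choice of a (latitude, longitude) point as the footprint approximation, and the sample values are all assumptions made for illustration, not something the slides specify.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """One gazetteer entry with the attributes listed on the slide."""
    entry_id: str        # unique ID
    feature_type: str    # the entry's type: a term from a feature type thesaurus
    name: str            # place name
    # Optional footprint approximation. Using a (latitude, longitude) point
    # is an assumption of this sketch; a bounding box would work as well.
    location: Optional[Tuple[float, float]] = None


# Illustrative entries; the IDs, terms, and coordinates are made up.
entry = GazetteerEntry("a-0001", "islands", "Ilha Grande", (-23.15, -44.23))
unlocated = GazetteerEntry("a-0002", "bays", "Baía de Guanabara")
```

Making the location optional mirrors the slide: an entry without a footprint is still valid, but it cannot take part in footprint-based duplicate detection.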

  6. Gazetteers & Thesauri • Thesauri • a thesaurus is “a structured and defined list of terms which standardizes words used for indexing” [UNESCO 1995] • thesaurus relationships • NT: narrower term • BT: broader term • RT: related term • ... • Reference: UNESCO (1995), “UNESCO Thesaurus”. United Nations Educational, Scientific and Cultural Organization, 1995. http://www.ulcc.ac.uk/unesco
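
The NT, BT, and RT relationships can be held in a few dictionaries. This is a minimal sketch under two assumptions that the slide does not state: each term has at most one broader term, and RT is symmetric. The class, its method names, and the sample terms are hypothetical and are not quoted from any real thesaurus.

```python
from collections import defaultdict


class Thesaurus:
    """Terms linked by BT (broader), NT (narrower), and RT (related)."""

    def __init__(self):
        self.broader = {}                  # term -> its single BT (assumption)
        self.narrower = defaultdict(set)   # term -> set of NTs
        self.related = defaultdict(set)    # term -> set of RTs

    def add_narrower(self, broader_term, narrower_term):
        # Recording an NT link also records the inverse BT link.
        self.broader[narrower_term] = broader_term
        self.narrower[broader_term].add(narrower_term)

    def add_related(self, term_a, term_b):
        # RT is stored in both directions (symmetry is an assumption).
        self.related[term_a].add(term_b)
        self.related[term_b].add(term_a)

    def ancestors(self, term):
        """Return the chain of broader terms, nearest first."""
        chain = []
        while term in self.broader:
            term = self.broader[term]
            chain.append(term)
        return chain


thesaurus = Thesaurus()
thesaurus.add_narrower("water bodies", "bays")
thesaurus.add_narrower("bays", "coves")
thesaurus.add_related("bays", "capes")
```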

  7. Gazetteers & Thesauri [Figure: ADL Gazetteer; example: ADL Feature Type Thesaurus] • Reference: ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer

  8. Gazetteers & Thesauri ADL Feature Type Thesaurus (sample terms rooted at ‘regions’)

  9. Gazetteers & Thesauri ADL Feature Type Thesaurus (sample entry)

  10. Gazetteers & Thesauri [Figure: architecture diagram with mediators, wrappers, and data sources; labelled components: reference gazetteer, local catalogue, external catalogue, external gazetteer, local data source]

  11. Summary • Motivation • Gazetteers & Thesauri • Gazetteer Integration • Instance-based Thesauri Mapping • Conceptual and Statistical Model • Experiments with Geographic Data • Conclusions

  12. Gazetteer Integration • Gazetteer Integration Problem • how to migrate entries from gazetteer GB to gazetteer GA [Figure: GA (ADL Gazetteer) with thesaurus TA, and GB (GEOnet) with thesaurus TB] • References: ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer; GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS

  13. Gazetteer Integration • Duplicate entry elimination: • gazetteers GA and GB may have entries that represent the same real-world features • use footprints to detect possible duplicates [Figure: entries fa in FA and fb in FB linked as duplicates, fa ≡ fb; GA (ADL Gazetteer) with thesaurus TA, GB (GEOnet) with thesaurus TB]
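
The slide only says that footprints are used to detect possible duplicates; it does not say how. One simple reading, sketched below, treats each footprint as a single (latitude, longitude) point and flags two entries as candidate duplicates when the points lie within a distance cutoff. The point representation, the haversine distance, the 1 km default, and all function and variable names are assumptions of this sketch.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def candidate_duplicates(footprints_a, footprints_b, max_km=1.0):
    """Return (id_a, id_b) pairs whose point footprints lie within max_km.

    Both arguments map an entry ID to a (lat, lon) point. The nested loop
    is quadratic; a spatial index would be needed for large gazetteers.
    """
    pairs = []
    for id_a, loc_a in footprints_a.items():
        for id_b, loc_b in footprints_b.items():
            if haversine_km(loc_a, loc_b) <= max_km:
                pairs.append((id_a, id_b))
    return pairs


# Made-up footprints: b1 is a few metres from a1, b2 is far away.
footprints_a = {"a1": (-22.9, -43.2)}
footprints_b = {"b1": (-22.9001, -43.2001), "b2": (-3.1, -60.0)}
pairs = candidate_duplicates(footprints_a, footprints_b)
```

The result is a list of candidates, matching the slide's wording "possible duplicates"; a real pipeline would likely confirm them with further evidence such as names.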

  14. Gazetteer Integration • Reclassification of migrated entries: • gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA [Figure: mapping m(tb) = ta from thesaurus TB of GB (GEOnet) to thesaurus TA of GA (ADL Gazetteer)]
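
Taken together, duplicate elimination and reclassification suggest a migration step along the following lines. This is a sketch, not the authors' implementation: entries are represented as plain dictionaries with hypothetical "id" and "type" keys, the mapping m is a dictionary from terms of TB to terms of TA, and setting unmapped entries aside for later review is an assumption.

```python
def migrate(entries_b, duplicate_ids_b, mapping):
    """Migrate entries of GB into GA's classification scheme.

    entries_b:        list of dicts with (hypothetical) keys "id" and "type"
    duplicate_ids_b:  IDs of GB entries already present in GA
    mapping:          dict from a TB term to a TA term, i.e. m(tb) = ta
    Returns (migrated, unmapped).
    """
    migrated, unmapped = [], []
    for entry in entries_b:
        if entry["id"] in duplicate_ids_b:
            continue                      # duplicate: GA already has it
        ta = mapping.get(entry["type"])
        if ta is None:
            unmapped.append(entry)        # no mapping known for this term
        else:
            migrated.append({**entry, "type": ta})
    return migrated, unmapped


# Illustrative data only.
entries_b = [
    {"id": "b1", "type": "ISL"},
    {"id": "b2", "type": "BAY"},
    {"id": "b3", "type": "XYZ"},
]
migrated, unmapped = migrate(entries_b, {"b1"}, {"ISL": "islands", "BAY": "bays"})
```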

  15. Gazetteer Integration

  16. Gazetteer Integration • Aligning terms does not work... ...

  17. Gazetteer Integration • Aligning term definitions is even worse... • (ADL) bay: indentations of a coastline or shoreline enclosing a part of a body of water; bodies of water partly surrounded by land. • (GNS) bay: a coastal indentation between two capes or headlands, larger than a cove but smaller than a gulf. • (GNS) island: tracts of land, smaller than a continent, surrounded by water at high water.

  18. Gazetteer Integration • Formal approaches (based on description logic, DL) are hopeless... [SWEET 2006]

      ...
      <owl:Class rdf:ID="Island">
        <rdfs:subClassOf>
          <owl:Restriction>
            <owl:onProperty rdf:resource="http://sweet.jpl.nasa.gov/ontology/space.owl#surroundedBy_2D" />
            <owl:allValuesFrom>
              <owl:Class>
                <owl:unionOf rdf:parseType="Collection">
                  <owl:Class rdf:about="#OceanRegion" />
                  <owl:Class rdf:about="#LandwaterRegion" />
                </owl:unionOf>
              </owl:Class>
            </owl:allValuesFrom>
          </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:subClassOf rdf:resource="#LandRegion" />
      </owl:Class>
      ...
      </rdf:RDF>

      Reference: SWEET (2006) The Semantic Web for Earth and Environmental Terminology (SWEET). Jet Propulsion Laboratory, California Institute of Technology. Available at: http://sweet.jpl.nasa.gov/index.html

  19. Gazetteer Integration

  20. Gazetteer Integration • Instance-based Thesauri Mapping: • use duplicates to figure out how to map the classification scheme of GB to that of GA [Figure: duplicates fa ≡ fb between FA and FB drive the mapping m(tb) = ta from thesaurus TB of GB (GEOnet) to thesaurus TA of GA (ADL Gazetteer)]

  21. Summary • Motivation • Gazetteers & Thesauri • Gazetteer Integration • Instance-based Thesauri Mapping • Conceptual and Statistical Model • Experiments with Geographic Data • Conclusions

  22. Instance-based Thesauri Mapping Approach: Conceptual and Statistical Model • $n(t_a, t_b)$ = number of occurrences of pairs of objects $f_a$ and $f_b$ such that: • $f_a \in G_A$ and $f_b \in G_B$ • $f_a \equiv f_b$ • $t_a$ and $t_b$ are the types of $f_a$ and $f_b$, respectively • $n(t_a)$ = the number of entries in $F_A$ classified as $t_a$ [Figure: $F_A$ and $F_B$ in $G_A$ and $G_B$, with thesauri $T_A$ and $T_B$]
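
The two counts defined on this slide can be computed directly from a list of duplicate pairs. The sketch below assumes each gazetteer is given as a dictionary from entry ID to type term and that duplicates are given as (ID in GA, ID in GB) pairs; these representations, the function names, and the sample data are illustrative assumptions. Which entries make up FA is taken as given: the count n(ta) is simply taken over whatever entries are passed in.

```python
from collections import Counter


def count_pairs(duplicate_pairs, type_a, type_b):
    """n(ta, tb): number of duplicate pairs (fa, fb) typed ta and tb."""
    n_pair = Counter()
    for fa, fb in duplicate_pairs:
        n_pair[(type_a[fa], type_b[fb])] += 1
    return n_pair


def count_terms(type_a):
    """n(ta): number of the given entries classified as ta."""
    return Counter(type_a.values())


# Made-up entries and duplicates.
type_a = {"a1": "islands", "a2": "islands", "a3": "bays"}
type_b = {"b1": "ISL", "b2": "ISL", "b3": "BAY"}
duplicates = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

n_pair = count_pairs(duplicates, type_a, type_b)
n_a = count_terms(type_a)
```

A `Counter` returns 0 for pairs of terms that never co-occur, which is convenient when every pair in TA × TB has to be scored.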

  23. Instance-based Thesauri Mapping Approach: Conceptual and Statistical Model • $P(t_a, t_b)$ = Mapping Rate Estimator • an estimate of the frequency with which the term $t_a$ maps to $t_b$, for each pair of terms $t_a \in T_A$ and $t_b \in T_B$: $$P(t_a, t_b) = \frac{n(t_a, t_b) + \Delta}{n(t_a) + 1}, \quad \text{where } \Delta = \frac{1}{|T_B|}$$
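
Reading the estimator on this slide as P(ta, tb) = (n(ta, tb) + Δ) / (n(ta) + 1) with Δ = 1 / |TB|, it can be evaluated in a couple of lines. The function and parameter names below are chosen for this sketch, and the counts in the example are made up; only the thesaurus size 642 is taken from the data-collection slide.

```python
def mapping_rate(n_pair, n_a, size_tb):
    """Mapping rate estimator P(ta, tb).

    n_pair:  n(ta, tb), co-occurrence count for the pair of terms
    n_a:     n(ta), number of entries classified as ta
    size_tb: |TB|, number of terms in the thesaurus TB
    """
    delta = 1.0 / size_tb
    return (n_pair + delta) / (n_a + 1)


# With no observations at all, the estimate falls back to 1 / |TB|.
no_evidence = mapping_rate(0, 0, 642)

# Made-up counts: 8 co-occurrences out of 9 entries typed ta.
strong_evidence = mapping_rate(8, 9, 642)
```

With these inputs `no_evidence` equals 1/642 and `strong_evidence` is roughly 0.80. The Δ term keeps every pair of terms at a small non-zero rate, and the +1 in the denominator avoids division by zero for terms with no entries.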

  24. Instance-based Thesauri Mapping Approach: Conceptual and Statistical Model • $\tau$ = Threshold Mapping Rate • $m(t_b) = t_a$ iff $P(t_a, t_b) \geq \tau$ • Problem: what is the value of $\tau$?
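
Once a threshold mapping rate is fixed, the mapping m follows from the table of estimated rates. The sketch below assumes the rule is that m(tb) = ta when P(ta, tb) is at or above the threshold. The slide does not say what happens when several terms ta clear the threshold for the same tb; keeping the highest-rated one is an assumption of this sketch, as are the function name and the sample rates.

```python
def build_mapping(rates, threshold):
    """Derive m from estimated mapping rates.

    rates:     dict from (ta, tb) to the estimated rate P(ta, tb)
    threshold: the threshold mapping rate
    Returns a dict from tb to ta.
    """
    mapping = {}
    for (ta, tb), rate in rates.items():
        if rate < threshold:
            continue
        current = mapping.get(tb)
        # Tie-breaking assumption: keep the candidate with the highest rate.
        if current is None or rate > rates[(current, tb)]:
            mapping[tb] = ta
    return mapping


# Made-up rates.
rates = {
    ("islands", "ISL"): 0.80,
    ("bays", "ISL"): 0.10,
    ("bays", "BAY"): 0.35,
}
mapping = build_mapping(rates, threshold=0.4)
```

Here only the pair ("islands", "ISL") clears the threshold, so "BAY" is left unmapped. A lower threshold proposes more alignments at the risk of more wrong ones, which is why its value has to be tuned.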

  25. Summary • Motivation • Gazetteers & Thesauri • Gazetteer Integration • Instance-based Thesauri Mapping • Conceptual and Statistical Model • Experiments with Geographic Data • Conclusions

  26. Experiments with Geographic Data: Data collection • ADL Gazetteer (ADL Feature Type Thesaurus, TA) • Instances: 16783 • Thesaurus terms: 210 • GEOnet Names Server (GEOnet Thesaurus, TB) • Instances: 87608 • Thesaurus terms: 642 • References: ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer; GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS

  27. Experiments with Geographic Data: Model Evaluation & Test • The collected data was partitioned into 7 datasets • 6 for tuning • 1 for testing [Figure: tuning sets and testing set]

  28. Experiments with Geographic Data [Figure: collected data, testing set, 6-fold cross-validation]

  29. Experiments with Geographic Data [Figure: collected data, testing set, 6-fold cross-validation; training set (Tk)]

  30. Experiments with Geographic Data ...

  31. Experiments with Geographic Data [Figure: collected data, testing set, 6-fold cross-validation; validation set (Vk)]

  32. Experiments with Geographic Data: Validation Step [Figure: collected data, testing set, 6-fold cross-validation; training set (Tk) and validation set (Vk)]

  33. Experiments with Geographic Data [Figure: collected data, testing set, 6-fold cross-validation]

  34. Experiments with Geographic Data: Estimated Threshold Mapping Rate [Figure: collected data, testing set, 6-fold cross-validation]

  35. Experiments with Geographic Data [Figure: collected data, testing set, 6-fold cross-validation]
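
The evaluation setup on the preceding slides (7 datasets, 6 of them used for 6-fold cross-validation, 1 held out for testing) can be sketched as follows. The slides do not say how the partition was drawn or how the threshold was derived from the folds. The random shuffle, the averaging of a score over folds, and the `score` callback are assumptions of this sketch, and all names are hypothetical.

```python
import random


def partition(items, parts=7, seed=0):
    """Shuffle the items and split them into `parts` near-equal datasets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::parts] for i in range(parts)]


def cross_validation_splits(tuning_sets):
    """Yield (training, validation) pairs, holding out each tuning set once."""
    for k, validation in enumerate(tuning_sets):
        training = [x for j, s in enumerate(tuning_sets) if j != k for x in s]
        yield training, validation


def pick_threshold(tuning_sets, candidates, score):
    """Return the candidate threshold with the best mean score over folds.

    score(training, validation, threshold) is a caller-supplied function
    (hypothetical here) that trains on one part and scores on the other.
    """
    splits = list(cross_validation_splits(tuning_sets))

    def mean_score(threshold):
        return sum(score(tr, va, threshold) for tr, va in splits) / len(splits)

    return max(candidates, key=mean_score)


datasets = partition(range(70))
tuning_sets, testing_set = datasets[:6], datasets[6]
splits = list(cross_validation_splits(tuning_sets))
```

The testing set never enters `pick_threshold`, so the threshold chosen on the tuning sets can be evaluated on data it has not seen.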

  36. Experiments with Geographic Data: Testing Step • Threshold: 0.4 • Legend: • C: correct term alignments • P: proposed term alignments • Example: aligned terms ... [Figure: collected data, testing set, 6-fold cross-validation]
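
The legend on this slide names two counts: C, the correct term alignments, and P, the proposed term alignments. A sketch of how they could be computed against a reference alignment follows. The ratio C / P is what is commonly called precision; the slide does not show which measure was reported, so treating it as the measure of interest is an assumption, as are the function name and the sample alignments.

```python
def evaluate_alignments(proposed, reference):
    """Count correct and proposed alignments and return (C, P, C / P).

    proposed:  dict from tb to the ta proposed by the mapping
    reference: dict from tb to the ta considered correct
    """
    p = len(proposed)
    c = sum(1 for tb, ta in proposed.items() if reference.get(tb) == ta)
    ratio = c / p if p else 0.0
    return c, p, ratio


# Made-up alignments: two of the three proposals agree with the reference.
proposed = {"ISL": "islands", "BAY": "bays", "STM": "lakes"}
reference = {"ISL": "islands", "BAY": "bays", "STM": "streams"}
c, p, ratio = evaluate_alignments(proposed, reference)
```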

  37. Summary • Motivation • Gazetteers & Thesauri • Gazetteer Integration • Instance-based Thesauri Mapping • Conceptual and Statistical Model • Experiments with Geographic Data • Conclusions

  38. Conclusions • Conclusions: • duplicates help reclassification! • a “semantic approach” may work when “syntactic approaches” fail (badly) • If you buy the idea, you also get... • a strategy to gradually learn how to reclassify gazetteer entries (as in a mediator) • a strategy to mediate access to object catalogs in general (as long as it is possible to detect duplicates) • (Gazetteer for the Brazilian territory: • extracted from the ADL Gazetteer • entries classified according to 4 different (aligned) schemes • encapsulated by Web services)

  39. Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú {dani, casanova, milidiu@inf.puc-rio.br} Pontifical Catholic University of Rio de Janeiro (PUC-Rio) Department of Informatics
