
Publish-Time Data Integration for Open Data Platforms

WOD 2013. Publish-Time Data Integration for Open Data Platforms. Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden).


Presentation Transcript


  1. WOD 2013 • Publish-Time Data Integration for Open Data Platforms • Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)

  2. Motivation

  3. Premise • Continuous publishing without standardization will continuously increase heterogeneity on the platform. • Is there a solution without predefined schemata / ontologies?

  4. Problem • Different names for attributes of the same meaning • Different meanings for attributes with the same values

  5. System Overview

  6. Offline • Domain Clustering • Bottom-up clustering on schema level • Used online to limit search space • But also to improve accuracy • Domain Statistics • Create different forms of value set synopses • Used to save comparison work online

  7. Online • Input • New dataset ds+ with value sets vs+ • Output • Attribute name suggestions • Constraint • Instantaneous response time (Publish-Time!) • Basic Approach • Assign ds+ to domain based on schema information • Generate recommendations based on values

  8. Naiv-C • Most Naive Approach: • Iterate over corpus C • Return the names of all attributes with sufficiently similar value sets • Order them by overall frequency in the corpus • Properties: • Finds all similar value sets • Generates the largest possible number of recommendations • Extremely long run time • Might generate too many recommendations
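
The Naiv-C steps on this slide can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the slides do not name a similarity measure or a threshold, so Jaccard similarity and the cut-off of 0.5 are assumptions, as are the function names and the toy corpus.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not named in the slides)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_c(corpus, new_values, threshold=0.5):
    """Suggest attribute names for new_values by scanning the whole corpus.

    corpus: list of datasets, each a dict mapping attribute name -> set of values.
    threshold: hypothetical cut-off for "sufficiently similar".
    """
    # Overall frequency of every attribute name in the corpus.
    name_freq = Counter(name for dataset in corpus for name in dataset)
    # Names of all attributes whose value set is sufficiently similar.
    matches = {
        name
        for dataset in corpus
        for name, values in dataset.items()
        if jaccard(values, new_values) >= threshold
    }
    # Most frequent names first; ties are broken alphabetically.
    return sorted(matches, key=lambda n: (-name_freq[n], n))


# Toy corpus, invented for illustration.
corpus = [
    {"city": {"Berlin", "Dresden", "Leipzig"}, "year": {"2011", "2012"}},
    {"town": {"Berlin", "Dresden", "Hamburg"}, "year": {"2012", "2013"}},
    {"city": {"Dresden", "Leipzig", "Munich"}},
]
suggestions = naive_c(corpus, {"Berlin", "Dresden", "Leipzig"})
print(suggestions)
```

Every value set in the corpus is compared against the new one, which is why the slide lists an extremely long run time as a property of this approach.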

  9. Naiv-D • Domain-based Approach: • Classify incoming dataset into domain D • Iterate over domain D • Continue as in Naiv-C • Properties: • Finds fewer similar value sets • Shorter run time • Only generates recommendations from one domain
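
A rough Python sketch of the domain-based variant follows. The slides only say that the incoming dataset is classified by schema information, so the classifier here (pick the domain whose attribute-name vocabulary overlaps most with the new schema) is an assumption, as are Jaccard similarity, the 0.5 threshold, counting name frequency within the domain, and all names and data.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(domains, new_schema):
    """Hypothetical classifier: domain whose attribute names overlap most with new_schema."""
    def vocabulary(datasets):
        return {name for dataset in datasets for name in dataset}

    return max(sorted(domains), key=lambda d: jaccard(vocabulary(domains[d]), new_schema))


def naive_d(domains, new_dataset, column, threshold=0.5):
    """Suggest names for one column, scanning only the domain the dataset falls into."""
    domain = classify(domains, set(new_dataset))
    members = domains[domain]
    # Name frequency is counted inside the domain here (an assumption).
    name_freq = Counter(name for dataset in members for name in dataset)
    matches = {
        name
        for dataset in members
        for name, values in dataset.items()
        if jaccard(values, new_dataset[column]) >= threshold
    }
    return domain, sorted(matches, key=lambda n: (-name_freq[n], n))


# Toy domains, invented for illustration.
domains = {
    "geography": [
        {"city": {"Berlin", "Dresden", "Leipzig"}, "population": {"100", "200"}},
        {"town": {"Berlin", "Dresden", "Hamburg"}, "population": {"200", "300"}},
    ],
    "finance": [
        {"municipality": {"Berlin", "Dresden", "Leipzig"}, "budget": {"1M", "2M"}},
    ],
}
new_dataset = {"place": {"Berlin", "Dresden", "Leipzig"}, "population": {"100", "300"}}
domain, names = naive_d(domains, new_dataset, "place")
print(domain, names)
```

In this toy data the "municipality" attribute has an identical value set but sits in another domain, so it is never suggested. That is the trade-off the slide describes: less comparison work, but recommendations from one domain only.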

  10. Cluster / Analysis-D • Synopsis-based Approaches: • Create representative value sets (RVS) for datasets in domain • Match only against RVS • Clustering-D • Cluster VS in domain, create RVS • Pre-compute recommendation list as all attribute names of value sets participating in final cluster • Online: find single most similar RVS in D • Analysis-D • Create RVS directly for sets of VS with equal name • Online: find set of similar RVS in D
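
To make the Analysis-D idea concrete, here is a small Python sketch. The slides do not say how a representative value set is formed, so taking the union of all value sets that share an attribute name is an assumption, as are Jaccard similarity, the 0.5 threshold, and all names and data. Clustering-D is not covered by this sketch.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_rvs(domain_datasets):
    """Offline step: one representative value set (RVS) per attribute name.

    Assumption: the RVS is the union of all value sets with that name.
    """
    rvs = defaultdict(set)
    for dataset in domain_datasets:
        for name, values in dataset.items():
            rvs[name] |= values
    return dict(rvs)


def analysis_d(rvs, new_values, threshold=0.5):
    """Online step: names of all sufficiently similar RVS, most similar first."""
    scored = [(jaccard(values, new_values), name) for name, values in rvs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score >= threshold]


# Toy domain, invented for illustration.
domain_datasets = [
    {"city": {"Berlin", "Dresden"}, "year": {"2012"}},
    {"city": {"Dresden", "Leipzig"}, "town": {"Berlin", "Hamburg"}},
]
rvs = build_rvs(domain_datasets)
result = analysis_d(rvs, {"Berlin", "Dresden", "Leipzig", "Hamburg"})
print(result)
```

The online step compares the new value set against one RVS per attribute name (three here) instead of against every value set in the domain (four here), which is where the saving over Naiv-D would come from.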

  11. Evaluation

  12. Quality I

  13. Quality II

  14. Runtimes

  15. Cluster Size

  16. Conclusion • We need statistics-based data integration at publish time to limit the growth of heterogeneity in large public dataset corpora. • Lots of work to do: clustering, matching, statistics, indexing, performance.
