A Semantic Web Search and Metadata Engine



  1. A Semantic Web Search and Metadata Engine. Roi Adadi, David Ben-David

  2. Glossary
     Example SWD (RDF/XML):
       <rdf:RDF> …
         <rdfs:Class rdf:ID="Department" />
         <rdfs:Class rdf:ID="Course" />
         <rdf:Property rdf:ID="name">
           <rdfs:domain>
             <owl:Class>
               <owl:unionOf rdf:parseType="Collection">
                 <rdfs:Class rdf:about="#Department" />
                 <rdfs:Class rdf:about="#Course" />
               </owl:unionOf>
             </owl:Class>
           </rdfs:domain>
           <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
         </rdf:Property>
         <rdf:Property rdf:ID="number">
           <rdfs:domain rdf:resource="#Course" />
           <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
         </rdf:Property>
         <rdf:Property rdf:ID="department">
           <rdfs:domain rdf:resource="#Course" />
           <rdfs:range rdf:resource="#Department" />
         </rdf:Property>
         <rdf:Property rdf:ID="creditPts">
           <rdfs:domain rdf:resource="#Course" />
           <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
         </rdf:Property>
         <Department rdf:ID="dept_cs">
           <name>Computer Science</name>
         </Department>
         <Course rdf:ID="cs236703">
           <name>Object Oriented Programming</name>
           <department rdf:resource="#dept_cs" />
           <creditPts>3.0</creditPts>
         </Course>
       … </rdf:RDF>
     In the example, the document as a whole is an SWD; the two classes and four properties it defines are SWTs.
     • Semantic Web Document (SWD)
       • A web page that serializes an RDF graph.
       • Uses one of the recommended RDF syntax languages, i.e. RDF/XML, N-Triples or N3.
     • Semantic Web Term (SWT)
       • An RDF resource that represents an instance of rdfs:Class or rdf:Property, and can be universally referenced by its URI reference (URIref).
     • Semantic Web Ontology (SWO)
       • An SWD is considered to be an SWO when a significant proportion of the statements it makes define new SWTs.
     • Semantic Web Database (SWDB)
       • An SWD that does not define or extend a significant number of terms.
       • Introduces individuals and makes assertions about them.
       • Makes assertions about individuals defined in other SWDs.

  3. SWO example: the FOAF ontology (http://xmlns.com/foaf/spec/index.rdf)
     • Contains 12 classes and 51 properties (in 466 triples) and no individuals.
     • Highlighted terms: class Document, class Organization, property mbox.

  4. SWDB example: the FOAF description for Tim Finin (www.cs.umbc.edu/~finin//foaf.rdf)
     • Defines three individuals and makes statements about them; defines no classes or properties.
     • Highlighted statements: a name statement and a nickname statement.

  5. Motivation
     • Current form of the Semantic Web
       • A web of Semantic Web Documents (SWDs).
     • Navigating the Semantic Web is difficult
       • Paucity of explicit hyperlinks (beyond namespaces in URIrefs).
       • Relations such as rdfs:seeAlso and owl:imports are rare.
     • There is a need for a search engine customized for SWDs
       • Find and analyze SWDs on the web.
       • Suggest a measure of an SWD's importance (ranking).

  6. Who needs it?
     • Semantic Web researchers
       • Search for SWTs and SWOs for publishing their knowledge.
       • Example: find the most popular ontology for publishing a personal profile.
     • Software agents
       • Search SWDs for external knowledge.
       • Retrieve SWOs to fully understand SWTs.

  7. Why not just use Google?
     • Conventional web navigation and ranking models are not suitable for the Semantic Web.
       • They do not differentiate SWDs from other web pages.
       • They do not parse or use the internal structure of an SWD or the external semantic links among SWDs.
       • They are designed to work with natural language and unstructured text.
     • Example: the FOAF ontology is not among the first 10 Google results for “person ontology”.

  8. Swoogle Objectives
     • Finding appropriate ontologies
       • Qualified search (terms + types).
       • Ontologies are sorted by their popularity.
     • Finding instance data
       • Querying SWDs with constraints on the classes and properties they use.
       • Helps to integrate Semantic Web data on the web.
     • Characterizing the Semantic Web
       • Structural properties.

  9. Related Work
     • Ontology-based annotation systems
       • SHOE, Ontobroker, WebKB, QuizRDF, CREAM, …
       • Annotate online documents.
       • Documents are indexed based on the annotations, not on the entire document.
       • Use their own ontologies, which might not suit some SWDs.

  10. Related Work – cont.
      • Ontology repositories
        • DAML Ontology Library, SemWebCentral, Schema Web, …
        • Collect ontologies (simply store the entire RDF document).
        • Do not automatically discover SWDs but rather require people to submit URLs.
        • Constitute a small portion of the Semantic Web.

  11. Related Work – cont.
      • Semantic Web browsers
        • W3C's Ontaria
          • Searchable and browsable directory of RDF documents developed by the W3C.
          • Does not automatically discover SWDs.
          • Stores the full RDF graphs.
          • Indexes individuals of well-known classes, e.g. foaf:Person, rss:Item.
      • Experiments show that Swoogle outperforms all of these systems.

  12. Swoogle
      • Crawler-based indexing and retrieval system for the Semantic Web.
        • Discovers Semantic Web documents.
        • Computes relations between documents.
        • Stores and reasons over extracted metadata.
      • The system is designed to scale up to handle tens of millions of documents.
      • Enables rich query constraints on semantic relations.

  13. Swoogle Architecture

  14. Swoogle Architecture – Discovery
      • Collects candidate URLs to find and cache SWDs, using:
        • Submitted URLs.
        • A Web crawler.
        • A customized meta-crawler (using conventional search engines).
        • SwoogleBot, a Semantic Web crawler that analyzes SWDs to produce new candidates.
      • To date, Swoogle has found over 1.7M SWDs containing more than 1G triples.

  15. Swoogle Architecture – Indexing
      • Analyzes the discovered SWDs.
      • Generates the bulk of Swoogle's metadata about the Semantic Web.
        • Characterizes features associated with SWDs and SWTs.
        • Tracks relations among SWDs and SWTs.
      • Example questions: How do SWDs use, define or populate a given SWT? How are two SWTs associated? …

  16. Swoogle Architecture – Analysis
      • Analyzes the generated metadata.
        • Classification of SWOs and SWDBs.
      • Hosts the modular ranking mechanisms.
        • OntologyRank.

  17. Swoogle Architecture – Services
      • Provides search services to software agents and users, allowing them to access metadata and navigate the Semantic Web.
        • Swoogle Search: searches SWDs using constraints on URLs, SWTs being used or defined, etc.
        • Ontology Dictionary: searches ontologies at the term level and offers more navigational paths.

  18. SWD Metadata
      • SWD metadata is collected to make SWD search more efficient and effective.
      • It is derived from the content of an SWD as well as the relations among SWDs.
      • Three categories of metadata:
        • Basic metadata.
        • Relations among SWDs.
        • Analytical results.

  19. Basic Metadata
      • Language Features – properties describing the syntactic or semantic features of an SWD.
        • Encoding – the syntactic encoding of an SWD: “RDF/XML”, “N-TRIPLE” or “N3”.
        • Language – the language used by an SWD: “OWL”, “DAML+OIL”, “RDFS” or “RDF”.
        • OWL Species – the language species of an SWD written in OWL: “OWL-LITE”, “OWL-DL” or “OWL-FULL”.

  20. Basic Metadata – cont.
      (Example SWD: the same Department/Course RDF/XML document shown in slide 2.)
      • RDF Statistics – properties summarizing the node distribution of the RDF graph of an SWD.
        • Describe how an SWD defines new classes, properties and individuals.
        • Let foo be an SWD and let C(foo), P(foo) and I(foo) be the sets of classes, properties and individuals defined in foo, respectively. The ontology-ratio R(foo) is calculated by:
          $$R(\mathit{foo}) = \frac{|C(\mathit{foo})| + |P(\mathit{foo})|}{|C(\mathit{foo})| + |P(\mathit{foo})| + |I(\mathit{foo})|}$$
        • R(foo) ranges from 0 to 1, where 0 implies that foo is a pure SWDB and 1 implies that foo is a pure SWO.
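
As a rough illustration of the ontology-ratio, the sketch below assumes R(foo) is the share of classes and properties among everything the document defines, i.e. (|C| + |P|) / (|C| + |P| + |I|). The function name and the hand-counted inputs are illustrative assumptions, not part of Swoogle; the anonymous owl:Class used as a union domain in the course example is deliberately not counted.

```python
def ontology_ratio(num_classes: int, num_properties: int, num_individuals: int) -> float:
    """Share of defined nodes that are terms (classes or properties)."""
    terms = num_classes + num_properties
    total = terms + num_individuals
    if total == 0:
        raise ValueError("the document defines no classes, properties or individuals")
    return terms / total


# Course example: classes Department and Course; properties name, number,
# department and creditPts; individuals dept_cs and cs236703.
course_example = ontology_ratio(2, 4, 2)

# FOAF ontology figures quoted on slide 3: 12 classes, 51 properties, no individuals.
foaf_ontology = ontology_ratio(12, 51, 0)

# A document that only introduces three individuals, like the SWDB on slide 4.
pure_database = ontology_ratio(0, 0, 3)

print(course_example, foaf_ontology, pure_database)
```

Under these counts the course example falls between the two extremes, while the FOAF ontology and the individuals-only document sit at the SWO and SWDB ends of the scale.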

  21. Basic Metadata – cont.
      • Ontology Annotations – properties that describe an SWD as an ontology.
        • Apply when the SWD has an instance of owl:Ontology.
        • Swoogle records the following properties:
          • label (rdfs:label)
          • comment (rdfs:comment)
          • versionInfo (owl:versionInfo / daml:versionInfo)

  22. Relations Among SWDs
      • Capturing and analyzing relations at the RDF node level is hard.
      • Swoogle generalizes RDF node-level relations and focuses on SWD-level relations.
      • Swoogle captures the following SWD-level relations:
        • TM/IN – an SWD uses terms defined by some other SWDs.
        • IM – an ontology imports another ontology.
        • EX – an ontology extends another ontology.
        • PV – an ontology is a prior version of another.
        • CPV – an ontology is a prior version of another and is compatible with it.
        • IPV – an ontology is a prior version of another and is incompatible with it.

  23. Inter-Ontology Relations
      [Table: indicators of inter-ontology relations]
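
The indicator table itself is not recoverable from the transcript. The sketch below only illustrates the idea of lifting node-level triples to SWD-level relations; the predicate-to-relation mapping is an assumption built from standard OWL and RDFS vocabulary, not Swoogle's actual table, and `INDICATORS`, `document_of` and `swd_relations` are hypothetical names. Taking the document to be the part of a URI before `#` is a simplification that does not cover slash namespaces.

```python
OWL = "http://www.w3.org/2002/07/owl#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Assumed indicators: predicate URI -> SWD-level relation code.
INDICATORS = {
    OWL + "imports": "IM",
    RDFS + "subClassOf": "EX",
    RDFS + "subPropertyOf": "EX",
    OWL + "priorVersion": "PV",
    OWL + "backwardCompatibleWith": "CPV",
    OWL + "incompatibleWith": "IPV",
}


def document_of(uri: str) -> str:
    """Simplification: the document is whatever precedes the fragment."""
    return uri.split("#", 1)[0]


def swd_relations(source_doc: str, triples: list[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    """Return (relation, source_doc, target_doc) for every indicator triple
    whose object lives in a different document."""
    relations = set()
    for _subject, predicate, obj in triples:
        code = INDICATORS.get(predicate)
        if code is None:
            continue
        target = document_of(obj)
        if target != source_doc:
            relations.add((code, source_doc, target))
    return relations


# Hypothetical documents, for illustration only.
SRC = "http://example.org/course.owl"
example_triples = [
    (SRC, OWL + "imports", "http://example.org/university.owl"),
    (SRC + "#Course", RDFS + "subClassOf", "http://example.org/university.owl#Activity"),
    (SRC + "#Seminar", RDFS + "subClassOf", SRC + "#Course"),  # same document: ignored
    (SRC, OWL + "priorVersion", "http://example.org/course-v1.owl"),
]
found = swd_relations(SRC, example_triples)
print(sorted(found))
```

For the version predicates, the object of the triple is the older document, so the target of a PV, CPV or IPV tuple is the prior version of the source.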

  24. Ranking SWDs
      • OntologyRank is inspired by Google's PageRank algorithm.
      • Underlying random surfing model:
        • The surfer jumps to a random URL.
        • With probability d, the surfer randomly chooses a link to follow.
        • With probability 1-d, the surfer jumps to another random URL.

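  25. PageRank
      • Given a document A, A's PageRank is computed by:
        $$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$
        where T_1, …, T_n are the web documents that link to A; C(T_i) is the total number of outlinks of T_i; and d is a damping factor, typically set to 0.85.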
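
A minimal sketch of how this rank can be computed by repeated substitution, assuming the classic non-normalized form PR(A) = (1-d) + d·Σ PR(T)/C(T) over the pages T that link to A. The three-page graph, the fixed iteration count and the function name are illustrative assumptions.

```python
def pagerank(out_links: dict[str, set[str]], d: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page T passes on an equal share of its rank
            # to every page it links to: PR(T) / C(T).
            incoming = sum(
                rank[source] / len(targets)
                for source, targets in out_links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Toy graph: A links to B and C, B links to C, C links back to A.
toy_graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph C collects rank from both A and B, so it ends up ahead of A, which in turn is ahead of B.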

  26. PageRank

  27. The SW Navigation Model
      • The graph formed by SWDs has a richer set of relations.
      • The edges have explicit semantics.
      • Users can navigate the Semantic Web within or across the web and the RDF graph through 7 groups of navigational paths.

  28. The SW Navigation Model

  29. OntologyRank
      • The semantics of links lead to a non-uniform probability of following a particular outgoing link.
      • Given SWDs A and B, Swoogle classifies inter-SWD links into four categories:
        • imports(A,B) – A imports all content of B.
        • uses-term(A,B) – A uses some of the terms defined by B (without importing B).
        • extends(A,B) – A extends the definitions of terms defined by B.
        • asserts(A,B) – A makes assertions about the individuals defined by B.
      • Each category is assigned a different weight, which represents the probability of following that kind of link.

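  30. OntologyRank – cont.
      • Given an SWD a, Swoogle computes its raw rank by:
        $$\mathrm{rawPR}(a) = (1-d) + d \sum_{x \in L(a)} \mathrm{rawPR}(x)\,\frac{f(x,a)}{f(x)}$$
        $$f(x,a) = \sum_{l \in \mathrm{links}(x,a)} \mathrm{weight}(l), \qquad f(x) = \sum_{a' \in T(x)} f(x,a')$$
        where L(a) is the set of SWDs that link to a, T(x) is the set of SWDs that x links to, links(x,a) is the set of links from x to a, weight(l) is the weight assigned to the category of link l, and d is the damping factor.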

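  31. OntologyRank – cont.
      • Then, Swoogle computes the rank for SWDBs and SWOs by:
        $$PR_{SWDB}(a) = \mathrm{rawPR}(a)$$
        $$PR_{SWO}(a) = \sum_{x \in TC(a)} \mathrm{rawPR}(x)$$
        where rawPR(x) is the raw rank of x and TC(a) is the transitive closure of SWOs imported by a.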
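
The two steps can be sketched as a weighted variant of PageRank followed by a sum over imported ontologies. This is an illustrative sketch, not Swoogle's implementation: the weight values are hypothetical (the slides do not give them), the document names are made up, and including a document in its own import closure is an assumption.

```python
# Hypothetical link weights; the slides do not state the values Swoogle uses.
WEIGHTS = {"imports": 4.0, "extends": 3.0, "uses-term": 2.0, "asserts": 1.0}

Link = tuple[str, str, str]  # (source SWD, link category, target SWD)


def raw_rank(links: list[Link], d: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Weighted rank: a document passes on its rank in proportion to link weights."""
    docs = {s for s, _, _ in links} | {t for _, _, t in links}
    # f[x][a] is the summed weight of all links from x to a.
    f: dict[str, dict[str, float]] = {doc: {} for doc in docs}
    for source, category, target in links:
        f[source][target] = f[source].get(target, 0.0) + WEIGHTS[category]
    rank = {doc: 1.0 for doc in docs}
    for _ in range(iterations):
        new_rank = {doc: 1 - d for doc in docs}
        for x, targets in f.items():
            total = sum(targets.values())  # f(x): total outgoing weight of x
            for a, weight in targets.items():
                new_rank[a] += d * rank[x] * weight / total
        rank = new_rank
    return rank


def imported_closure(doc: str, links: list[Link]) -> set[str]:
    """doc plus every SWD reachable from it through imports links."""
    closure, stack = {doc}, [doc]
    while stack:
        current = stack.pop()
        for source, category, target in links:
            if source == current and category == "imports" and target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def ontology_rank(doc: str, links: list[Link], raw: dict[str, float]) -> float:
    """Rank of an SWO: sum of raw ranks over its import closure."""
    return sum(raw[x] for x in imported_closure(doc, links))


links = [
    ("profile", "uses-term", "foaf"),
    ("profile", "asserts", "other"),
    ("ext", "imports", "foaf"),
    ("ext", "extends", "foaf"),
]
raw = raw_rank(links)
print(raw["foaf"], raw["other"], ontology_rank("ext", links, raw))
```

Because `profile` spends two thirds of its outgoing weight on `foaf` and `ext` spends all of it there, `foaf` receives a larger raw rank than `other`.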

  32. Indexing and Retrieval of SWDs
      • The problem of indexing and searching SWDs
        • Significant semantic information is encoded in marked-up documents.
        • Reasoning over a large collection of documents can be expensive.
      • Traditional information retrieval techniques
        • Faster (they take a coarse view of the text).
        • Can quickly retrieve a set of SWDs based on similarities of the source text alone.

  33. Applying IR Techniques
      • SWDs are not entirely markup.
        • Search should be applied to both the structured and unstructured components of the document.
      • We may want SWDs to be available to commonly used search engines.
        • Documents must be transformed into a form that a standard IR engine can understand and manipulate.
      • IR offers well-researched methods for ranking matches, computing similarities between documents and employing relevance feedback.

  34. Applying IR Techniques – cont.
      • Look at a document as a collection of either tokens or N-Grams.
        • URIrefs of classes, properties and individuals correspond to words in natural languages.
      • Apply the following process to an SWD:
        • Reduce it to triples.
        • Extract URIrefs (with duplicates).
        • Discard URIrefs of blank nodes.
        • Hash each URI to a token.
        • Index the document.
      • Example: matching “time” to
        • http://foo.com/timeont.owl#timeInterval
        • http://foo.com/timeont.owl#calendarClockInterval
        • http://purl.org/upper/temporal/t13.owl#timeThing
      • Swoogle indexes by either N-Grams or URIrefs.
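
The indexing process above can be sketched with two small inverted indexes: one keyed by hashed URIrefs, one keyed by character N-Grams. Everything here is an illustrative assumption rather than Swoogle's implementation: triples are plain string tuples in N-Triples-like notation (blank nodes start with `_:`, literals with a double quote), SHA-1 stands in for the unspecified hash, and N is fixed at 4.

```python
import hashlib
from collections import defaultdict

Triple = tuple[str, str, str]


def uriref_tokens(triples: list[Triple]) -> list[str]:
    """Extract URIrefs (with duplicates), skipping blank nodes and literals."""
    return [
        node
        for triple in triples
        for node in triple
        if not node.startswith("_:") and not node.startswith('"')
    ]


def uri_token(uri: str) -> str:
    """Hash a URIref to a fixed-length token."""
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


def ngrams(text: str, n: int = 4) -> set[str]:
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def index_documents(documents: dict[str, list[Triple]], n: int = 4):
    by_token = defaultdict(set)  # hashed URIref -> documents that use it
    by_ngram = defaultdict(set)  # character n-gram -> URIrefs containing it
    for doc, triples in documents.items():
        for uri in uriref_tokens(triples):
            by_token[uri_token(uri)].add(doc)
            for gram in ngrams(uri, n):
                by_ngram[gram].add(uri)
    return by_token, by_ngram


def match_urirefs(query: str, by_ngram, n: int = 4) -> set[str]:
    """URIrefs that contain every n-gram of the query (candidate matches)."""
    grams = ngrams(query, n)
    if not grams:
        return set()
    return set.intersection(*(by_ngram.get(gram, set()) for gram in grams))


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
documents = {
    "http://foo.com/timeont.owl": [
        ("http://foo.com/timeont.owl#timeInterval", RDF_TYPE, OWL_CLASS),
        ("http://foo.com/timeont.owl#calendarClockInterval", RDF_TYPE, OWL_CLASS),
        ("_:b0", RDF_TYPE, OWL_CLASS),
    ],
    "http://purl.org/upper/temporal/t13.owl": [
        ("http://purl.org/upper/temporal/t13.owl#timeThing", RDF_TYPE, OWL_CLASS),
    ],
}
by_token, by_ngram = index_documents(documents)
matches = match_urirefs("time", by_ngram)
print(sorted(matches))
```

The N-Gram index is what lets a partial word such as “time” reach URIrefs like `#calendarClockInterval`, which only contain it inside the namespace; the hashed-URIref index supports exact lookups of a known term. For queries longer than N characters, sharing all N-Grams is a necessary but not sufficient condition for containing the query, so the result should be treated as a candidate set.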

  35. Swoogle Demo…
