Data on the (Semantic) Web

Agenda (75 min) • Data on the Web • Extracting data • Publishing data • Linked Data • Metadata in HTML • SPARQL endpoints • Crawling and extraction • Indexing RDF data • Database-style indexing • IR-style indexing

IR view of the Web • Web accessible resources • Documents (typically HTML) • Multimedia • Search engines index NL text • Most of the structure in HTML is discarded • Multimedia is indexed by surrounding text • Additional information on web graph, usage • See Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Data on the Web • Most pages on the Web are generated from structured data • Data is stored in relational databases (typically) • Queried through web forms • Presented as tables or simply as unstructured text • The structure and semantics (meaning) of the data are not directly accessible to search engines • Two solutions • Extraction using Information Extraction (IE) techniques (implicit metadata) • Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)

Information Extraction methods • Named Entity Recognition (NER) and disambiguation • OpenCalais, Zemanta • Extraction of triples • TextRunner, NELL • Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. WWW 2007 • Wu and Weld. Autonomously Semantifying Wikipedia. CIKM 2007 • Filling web forms automatically (form-filling) • Madhavan et al. Google's Deep-Web Crawl. VLDB 2008 • Extraction from HTML tables • Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008 • Wrapper induction • Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997

Information Extraction • A tale of many trade-offs • Less or no training data, lower quality • The more complex the model to learn, the more training data needed • The deeper the analysis, the slower the processing • The more narrowly trained, the more likely to break • Populating a Knowledge Base is easier than ad-hoc extraction • However, a complete and correct semantic representation of the content may not be needed for all tasks

Publishing data on the Web • Pre-Semantic Web technologies have been inadequate • Existing formats are not appropriate for serendipitous reuse • HTML: structure is lost due to a mix of presentation and content • XML: captures structure, but not semantics • Lack of protocols to talk to databases over the Web • Motivation has been lacking • Publishers are interested to the extent that they benefit from sharing data, e.g. because it drives traffic back to their site

What the Semantic Web provides • Data format: RDF • Designed for object-relationship data • Identification of objects by URIs • Multiple serializations: RDF/XML, Turtle, N3, N-Triples, TriX etc. • Schema language: OWL • Description Logic based • Extensible using rule languages such as RIF • Query language and protocol: SPARQL • The principles of Linked Data
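To make the RDF data model concrete, here is a minimal Python sketch that represents object-relationship data as triples identified by URIs and writes them in the N-Triples serialization. The `Triple` type, the helper names and the `example.org` URIs are assumptions made for illustration; only the FOAF property URIs are real vocabulary terms. Datatypes, language tags and blank nodes are left out.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str                  # URI of the object being described
    predicate: str                # URI of the attribute or relationship
    obj: str                      # URI of a related object, or a literal value
    obj_is_literal: bool = False  # True when obj is a plain string literal


def escape_literal(value: str) -> str:
    # Escape the characters that may not appear raw in an N-Triples literal.
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(triple: Triple) -> str:
    # One triple per line: <subject> <predicate> object .
    if triple.obj_is_literal:
        obj = f'"{escape_literal(triple.obj)}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


# Hypothetical data: the example.org URIs are invented for this sketch.
triples = [
    Triple("http://example.org/person/neil",
           "http://xmlns.com/foaf/0.1/name", "Neil", True),
    Triple("http://example.org/person/neil",
           "http://xmlns.com/foaf/0.1/knows",
           "http://example.org/person/joe"),
]
lines = [to_ntriples(t) for t in triples]
print("\n".join(lines))
```

The same triples could equally be written as RDF/XML or Turtle; the serialization changes, the underlying graph does not.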

Methods for publishing RDF data • Multiple ways of publishing RDF data • SPARQL endpoints • Linked Data • Metadata in HTML documents • Data feeds • GRDDL • Automated tools • Each requires different treatment in crawling and extraction

SPARQL endpoints • SPARQL is a standard query language and protocol for accessing RDF stores via HTTP • Also possible to expose a traditional RDBMS via a wrapper • Advantages: • Most flexible and best performing access from a consumer perspective • Disadvantages: • Higher maintenance • Discovery is problematic • Tools: • Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.) • RDB-to-RDF mappers such as D2RQ and Triplify • SPARQL query builders
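As a rough illustration of "query language and protocol over HTTP", the sketch below builds (but does not send) a SPARQL protocol request: the query text travels in the `query` parameter of a GET request, and the `Accept` header asks for JSON results. The endpoint URL and the function name `build_sparql_request` are hypothetical; a real endpoint may also require POST for long queries.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?person ?name WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
} LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    # SPARQL protocol, query via GET: the query is URL-encoded into
    # the "query" parameter of the endpoint URL.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would execute the query against a live endpoint; that step is omitted here because the endpoint is invented.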

Linked Data • A web of interlinked RDF documents • Each document describes the characteristics of a single object, and links to related objects • Most important: links to the same object in different data sets (sameAs) • Guidelines for proper configuration of web servers to serve such documents • Rapidly growing community • Focus on public datasets (government, scientific) • see linkeddata.org
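The role of sameAs links can be sketched as follows: a consumer that has crawled triples from several data sets groups the URIs that are declared to denote the same object. The function name `same_as_clusters` and all data below are assumptions for illustration (the `example.org` and `example.com` URIs are invented); the grouping uses a plain union-find structure, which is one possible choice rather than a prescribed method.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_clusters(triples):
    """Group URIs connected by owl:sameAs links (union-find)."""
    parent = {}

    def find(uri):
        # Return the representative of uri's cluster, compressing paths.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            root_s, root_o = find(subject), find(obj)
            if root_s != root_o:
                parent[root_s] = root_o  # merge the two clusters

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


# Hypothetical crawl result: links between three invented data sets.
triples = [
    ("http://dbpedia.org/resource/Berlin", OWL_SAME_AS,
     "http://example.org/geo/berlin"),
    ("http://example.org/geo/berlin", OWL_SAME_AS,
     "http://example.com/places/42"),
    ("http://dbpedia.org/resource/Berlin",
     "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
    ("http://example.org/a", OWL_SAME_AS, "http://example.org/b"),
]
clusters = same_as_clusters(triples)
print(clusters)
```

Once the clusters are known, descriptions of the same object from different data sets can be merged under a single identifier.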

The even larger picture: entire datasets connected

Linked Data • Advantages: • No change to the publishing of the HTML documents • Data can be published by third party (e.g. DBpedia) • Disadvantages: • Web servers need to be configured to properly handle URIs that identify concepts instead of documents • Search engines need to be extended to crawl linked data • Data is not always linked to documents • Tools • Linked Data browsers (Tabulator, Marbles etc.) • RDB-to-RDF mappers (D2RQ, Triplify)

Metadata in HTML • Microformats, RDFa, Microdata • Advantages: • Data and document are always in sync • Browser plug-in friendly • Search engine friendly • Copy-paste friendly • Tools: • XML editors (e.g. Oxygen) • RDFa Distiller • RDFa bookmarklet, Ubiquity RDFa plugin • Optimus microformat parser • Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…

Microformats (μf) • Agreements on the way to encode certain kinds of data in HTML • Reuse of semantic-bearing HTML elements • Based on existing standards • Minimality: designed to solve particular problems • Microformats exist for a limited set of objects • hCard, hResume, hProduct, hRecipe • Varying degrees of support and stability • hCard and rel-tag are widely supported • Community centered around microformats.org • Specifications and discussions are hosted there

Example: the hCard microformat
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
<cite class="vcard"> <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a> </cite>
wrote a post (<cite> <a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>)
about an unintentionally humorous letter he received from the
<span class="vcard"> <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a> </span>.
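How a consumer might read such markup can be sketched with Python's standard `html.parser`. This is a simplified illustration, not a conforming microformat parser: it assumes well-formed HTML, no nested vcards, and handles only a handful of properties. The class name `HCardParser` and the constants are hypothetical.

```python
from html.parser import HTMLParser

TEXT_PROPS = {"fn", "tel", "title", "org"}   # small subset of hCard properties
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # no end tag


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_classes = []   # class list of every currently open element
        self.card_depth = None   # stack depth of the open vcard root, if any
        self.cards = []          # one dict per vcard found

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        self.open_classes.append(classes)
        if self.card_depth is None and "vcard" in classes:
            self.card_depth = len(self.open_classes)
            self.cards.append({})
        if self.card_depth is not None and tag == "a" and attrs.get("href"):
            href = attrs["href"]
            # For url and email, the value is taken from the link target.
            if "url" in classes:
                self.cards[-1]["url"] = href
            if "email" in classes:
                prefix = "mailto:"
                self.cards[-1]["email"] = (
                    href[len(prefix):] if href.startswith(prefix) else href)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_classes:
            return
        self.open_classes.pop()
        if self.card_depth is not None and len(self.open_classes) < self.card_depth:
            self.card_depth = None  # left the vcard root

    def handle_data(self, data):
        text = " ".join(data.split())
        if self.card_depth is None or not text:
            return
        # Assign the text to every text property open inside the vcard.
        for classes in self.open_classes[self.card_depth - 1:]:
            for prop in TEXT_PROPS.intersection(classes):
                self.cards[-1][prop] = text


HTML = """
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""
parser = HCardParser()
parser.feed(HTML)
print(parser.cards)
```

Note that the property names are hard-coded in the parser: this reflects the point that each microformat carries its own vocabulary-specific conventions.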

Microformats: limitations • No shared syntax • Each microformat has a separate syntax tailored to the vocabulary • No formal schemas • Limited reuse, extensibility of schemas • Unclear which combinations are allowed • No datatypes • No namespaces, unique identifiers (URIs) • no interlinking • mapping between instances is required

RDFa • W3C standard for embedding RDF data in HTML documents • A set of new HTML attributes • Despite the extension of HTML, RDFa does not require XHTML • A specification of how to extract the data from these attributes • RDFa can be used to embed data in HTML headers or to annotate parts of the body of HTML documents • RDFa is just a syntax, you have to choose a vocabulary separately

Differences in usage • Microformats are the first choice for most publishers because they are simple • If you find none that perfectly fits your needs then you need RDFa • Microformats have a fixed schema: you cannot add your own attributes • Example: a social networking site with user profiles • hCard (based on vCard) is a good candidate, but for example it doesn’t have a way to express the user’s social connections • You either live without this, or go with RDFa

Example: Facebook’s Open Graph Protocol • Open Graph Protocol • RDF vocabulary to be used in conjunction with RDFa • Simplify the work of developers by restricting the freedom in RDFa • Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment • Only HTML <head> accepted • http://opengraphprotocol.org/ • Facebook as consumer • Facebook indexes OGP data whenever someone ‘likes’ a page with OGP data • Social recommendation (‘like’ button) provides publishers with a way to promote their content on Facebook • Shows up in profiles and news feed, the user is subscribing to a channel of future feeds from the web page they liked • Facebook Graph API allows 3rd party developers to access the data • http://developers.facebook.com/docs/api

Example: Facebook’s Open Graph Protocol
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  …
</head>
...
</html>
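Because OGP data is restricted to `<meta>` elements in the HTML `<head>`, a consumer can extract it with very little code. The sketch below uses Python's standard `html.parser`; the class name `OpenGraphParser` is hypothetical, and the code makes a simplifying assumption by matching the literal `og:` prefix instead of resolving the namespace declaration as a full RDFa processor would.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.properties = {}   # OGP property name -> content value

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            content = attrs.get("content")
            # Simplification: match the literal "og:" prefix.
            if prop.startswith("og:") and content is not None:
                self.properties[prop[len("og:"):]] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False   # only <head> is accepted


HTML = """
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
</head>
<body></body>
</html>
"""
parser = OpenGraphParser()
parser.feed(HTML)
print(parser.properties)
```

Restricting the data to flat property/content pairs in the header is what keeps such a consumer this simple, at the cost of the expressiveness of general RDFa.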

Microdata • HTML5 is currently under standardization at the W3C • Introduces Microdata • Similar to microformats • Some predefined vocabularies with central registration • Some of the flexibility of RDFa • Introduce new terms using reverse domain names or full URIs • Semantic HTML elements such as <time>, <video>, <article>…

Microdata example
<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me"></p>
</div>
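A sketch of how property values might be read from Microdata markup, again using Python's standard `html.parser`. It assumes a single, flat item whose properties contain no nested elements; the class name `MicrodataParser`, the `VALUE_ATTRS` table and the `"@id"` key are hypothetical choices for this illustration. For some elements the value comes from an attribute (for example `datetime` on `<time>` or `src` on `<img>`) and otherwise from the element's text.

```python
from html.parser import HTMLParser

# Elements whose property value is read from an attribute, when present.
VALUE_ATTRS = {"img": "src", "a": "href", "time": "datetime", "meta": "content"}


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.item = {}          # property name -> value
        self.text_prop = None   # property currently collecting text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemid" in attrs:
            self.item["@id"] = attrs["itemid"]   # global identifier of the item
        prop = attrs.get("itemprop")
        if not prop:
            return
        value_attr = VALUE_ATTRS.get(tag)
        if value_attr and attrs.get(value_attr) is not None:
            self.item[prop] = attrs[value_attr]
        else:
            self.text_prop = prop
            self.item[prop] = ""

    def handle_endtag(self, tag):
        # Simplification: properties are assumed to contain no child elements.
        self.text_prop = None

    def handle_data(self, data):
        if self.text_prop:
            self.item[self.text_prop] += data


HTML = """
<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me"></p>
</div>
"""
parser = MicrodataParser()
parser.feed(HTML)
print(parser.item)
```

Unlike the microformat case, the parser does not need to know the vocabulary in advance: any `itemprop` name is picked up generically.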

The state of metadata in HTML • 5-10% of webpages contain some explicit metadata • Depending on how you count… • Too many competing approaches • Too many formats: microformats vs. RDFa vs. Microdata • Too many schemas: publishers may need to use multiple different vocabularies or microformats to satisfy everyone
