Dublin Core Metadata Tutorial July 9, 2007 Stuart Weibel Senior Research Scientist OCLC Programs and Research. Tutorial Roadmap. Principles of Metadata Dublin Core Metadata Basics The Dublin Core Abstract Model Syntax Alternatives for DC Metadata Mixing and Matching Metadata
Dublin Core Metadata Tutorial • July 9, 2007 • Stuart Weibel, Senior Research Scientist, OCLC Programs and Research
Tutorial Roadmap • Principles of Metadata • Dublin Core Metadata Basics • The Dublin Core Abstract Model • Syntax Alternatives for DC Metadata • Mixing and Matching Metadata • History and workings of the Dublin Core Metadata Initiative • Acknowledgements: I have borrowed liberally from tutorial slides sets from Tom Baker, Diane Hillman, Andy Powell, and Marty Kurth, available at Dublincore.org
Basic Principles of Metadata • The Web as an information system • The Internet Commons • Interoperability is key • MARC lives • The varieties of metadata • Modularity • Some Challenges
State of the Web as an Information System • Search systems are motivated by business models, not functionality • Index coverage is broad, but unpredictable • Too much recall, too little precision • Index spam abounds • Resources (and their names) are volatile • What about versions, editions, back issues? • Archiving is presently unsolved • Authority and quality of service are spotty • Managing Intellectual Property Rights is difficult
Metadata: Part of a Solution • Structured data about other data • helps to impose order on chaos • enables automated discovery/manipulation • Full Text Web indexing is the dominant idiom for search • Metadata is more useful in structured collections, used in combination with applications designed to take advantage of structured descriptions
Internet Commons includes Multiple Communities • Commerce • Home Pages • Geo • Library • Scientific Data • Museums • Whatever...
Interoperability requires conventions about: • Semantics • The meaning of the elements • Structure • human-readable • machine-parseable • Syntax • grammars to convey semantics and structure
Haven’t we done metadata already? The MARC family of standards is the single most successful resource description standard in the world
MARC Cataloging… • Is really MARC-AACR2 cataloging • MARC is the communications format • AACR2 (Anglo-American Cataloging Rules) defines the cataloging rules (semantics) • MARC and AACR2 are evolving • Closer alignment with XML as a syntax option • RDA is an effort to modernize AACR2 and align it with networked environments • RDA and Dublin Core are cooperating to align on a common underlying data model.
What’s wrong with this model on the Web? • Expensive • Complex • Professional Catalogers required • Bias towards bibliographic artifacts • Fixed resources • Incomplete handling of resource evolution and other resource relationships • Anglo-centric • MARC 21 accounts for ¾ of MARC records, but there are many other varieties
Warwick Framework: Modular Metadata • Conceptual Architecture for metadata from the Warwick Metadata Workshop (DC-2) • Conceptual architecture to support the specification, collection, encoding, and exchange of modular metadata • Provide context for metadata efforts (including Dublin Core) • avoids the “black-hole” of comprehensive element sets • focuses interoperability issues at package level • A conceptual framework, NOT an application
Modularity and Extensibility: the Lego metaphor • DC is a beginning, not an end • An architecture for modular, extensible metadata • The simplest common denominator • Add stuff you need for • Local requirements • Domain specific functionality • Other dimensions of description • Eg cloud cover… management… structural metadata….
Descriptive Metadata Standards • IEEE LOM (Learning Object Metadata) • Descriptive and structural metadata to support instructional systems • ONIX (Online Information Exchange) – bookseller metadata • FGDC – Federal Geographic Data Committee: rich descriptive and structural metadata for GIS applications • Encoded Archival Description – description of archival collections • MPEG Multimedia Metadata – large, complicated, still in progress – descriptive, structural, rights management • Dublin Core – core descriptive metadata
Metadata Creation • Metadata is expensive and error prone • Creating one MARC record costs about $100 USD at the Library of Congress • Competes with indexing at… $ 00.001 ??? • Capture it as close to point of creation as possible • Capture as much automatically as possible • Should be designed with close attention to the functional requirements it serves • Re-use existing standards whenever possible • There is always tension between completeness of description, intended purpose, and cost
Metadata Challenges • Accommodate multiple varieties of metadata • Tension: functionality and simplicity • Tension: extensibility and interoperability • Human and machine creation and use • Community-specific functionality, creation, administration, access work at cross purposes to global interoperability
Interoperability barriers cost time and money • A common data model helps avoid this
Dublin Core Basics • Design Philosophy – useful metaphors • Language and pidgins • Characteristics of DC metadata • The simple bucket (properties) • Resource Types • Metadata grammar • Dublin Core Principles • One-to-one • Dumb-down rule • Context appropriate values • Translations
Dublin Core: Starting Assumptions and Essential Features • Simple • true to a point: the elements are simple, the underlying model is not • Consensus-based • Crucial to early success, both in attracting expertise and deployment. Bottom up • Based on the experience of practitioners, but hard to capture and capitalize on lessons learned • Cross-disciplinary and International • Central success factor
Essential Features (continued) • The Web is the strategic application • On the mark • International • Also central success factor, but hard (20 languages in the Registry) • Lego-like modularity & extensibility • Partially realized promise • Application Profiles are the means • Syntax independence • An ongoing nightmare (HTML…XML…RDF/XML) • Authors will describe their own works • Laughably naïve
A Pidgin for Digital Tourists • Metadata is language • Dublin Core is a small and simple language -- a pidgin -- for finding resources across domains • Speakers of different languages naturally "pidginize" to communicate • E.g., tourists using simple phrases to order beer ("zwei Bier bitte" "dva pivo" "biru o san bai"...) • We are all "tourists" on the Internet.
A Grammar of Dublin Core • By design not as rich as mother tongues, but easy to learn and useful in practice • Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives) • Simple grammars: sentences (statements) follow a simple fixed pattern... • http://www.dlib.org/dlib/october00/baker/10baker.html
Basic Structures in Dublin Core Metadata [Diagram: resource, statement, property, value] • The basic unit of metadata is a statement • Statements consist of a property (a metadata element) and a value • Metadata statements describe resources • More about the Dublin Core Abstract Model later
What are the properties and values in the following metadata statements? 245 00 $a Amores perros $h [videorecording] <title> Nueve reinas </title> <type> MovingImage </type> • Different models for conveying related information • Dublin Core syntax fits in more naturally with the structure of the Web
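One way to work through the question above is to pull the property/value pairs out of both notations. The following is a minimal Python sketch, not a real MARC or XML parser: the function names are hypothetical, and the regular expressions only handle the simple display forms shown on the slide.

```python
import re

def parse_marc_field(line):
    """Split a display-form MARC field into tag, indicators, and subfields."""
    tag, indicators, rest = line.split(None, 2)
    # Each subfield starts with "$" plus a one-character code.
    subfields = [(code, value.strip())
                 for code, value in re.findall(r"\$(\w)\s+([^$]*)", rest)]
    return tag, indicators, subfields

def parse_tagged(text):
    """Extract (property, value) pairs from simple <name> value </name> markup."""
    return re.findall(r"<(\w+)>\s*(.*?)\s*</\1>", text)

marc = parse_marc_field("245 00 $a Amores perros $h [videorecording]")
dc = parse_tagged("<title> Nueve reinas </title> <type> MovingImage </type>")
print(marc)  # ('245', '00', [('a', 'Amores perros'), ('h', '[videorecording]')])
print(dc)    # [('title', 'Nueve reinas'), ('type', 'MovingImage')]
```

In the MARC field the property is carried by a numeric tag plus subfield codes (245 $a is the title proper), while the tagged form names the property directly.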
[Diagram: "Resource has property X". The resource is the implied subject and "has" the implied verb; the property is one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date...); the property value is an appropriate literal; optional qualifiers (adjectives) may be attached.]
Varieties of Qualifiers: Element Refinements • Make the meaning of an element narrower or more specific. • a Date Created versus a Date Modified • an IsReplacedBy Relation versus a Replaces Relation • If your software does not understand the qualifier, you can safely ignore it.
Varieties of Qualifiers: Value Encoding Schemes • Says that the value is • a term from a controlled vocabulary (e.g., Library of Congress Subject Headings) • a string formatted in a standard way (e.g., "2001-05-02" means May 2, not February 5) • Even if a scheme is not known by software, the value should be "appropriate" and usable for resource discovery.
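The effect of a syntax encoding scheme can be sketched in Python. This is an illustration under assumptions: the statement layout (a dict with `property`, `value`, and `scheme` keys) and the `SCHEME_PARSERS` registry are hypothetical, and `date.fromisoformat` is used as a stand-in that covers only full YYYY-MM-DD strings.

```python
from datetime import date

# Hypothetical registry: scheme name -> parser for value strings.
SCHEME_PARSERS = {"W3CDTF": date.fromisoformat}

def interpret(statement):
    """Parse the value if the scheme is known; otherwise keep the plain string."""
    parser = SCHEME_PARSERS.get(statement.get("scheme"))
    return parser(statement["value"]) if parser else statement["value"]

dated = interpret({"property": "date", "value": "2001-05-02", "scheme": "W3CDTF"})
print(dated.month, dated.day)  # 5 2 (year-month-day order)

# A scheme the software does not know still leaves a usable string.
subject = interpret({"property": "subject",
                     "value": "Languages -- Grammar", "scheme": "LCSH"})
print(subject)  # Languages -- Grammar
```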
[Diagram: Resource has Subject "Languages -- Grammar" (encoding scheme: LCSH) • Resource has Date, refined as Revised, "2000-06-13" (encoding scheme: ISO8601)]
Dumb-Down Principle for Qualifiers • Simple DC does not use element refinements or encoding schemes – statements contain only value strings • Qualified DC uses features of the DCMI Abstract Model, including element refinements and encoding schemes • Dumbing-down is translating Qualified DC to simple DC • Qualifiers refine meaning (but may be harder to understand)
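Dumbing-down can be sketched as a small function. This is a minimal illustration, assuming a hypothetical statement layout and a hand-written `REFINES` table; a real application would take the refinement relationships from the DCMI term declarations.

```python
# Hypothetical lookup: element refinement -> the element it refines.
REFINES = {
    "created": "date",
    "modified": "date",
    "isReplacedBy": "relation",
    "replaces": "relation",
}

def dumb_down(statement):
    """Reduce one qualified statement to a simple (element, value string) pair."""
    prop = statement["property"]
    element = REFINES.get(prop, prop)   # fall back to the parent element
    return element, statement["value"]  # the encoding scheme is dropped

qualified = [
    {"property": "created", "value": "2000-06-13", "scheme": "W3CDTF"},
    {"property": "subject", "value": "Languages -- Grammar", "scheme": "LCSH"},
]
simple = [dumb_down(s) for s in qualified]
print(simple)  # [('date', '2000-06-13'), ('subject', 'Languages -- Grammar')]
```

The simple form is less precise (a creation date becomes just a date), but the value string remains usable for discovery.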
The One to One Principle • Each metadata description should describe one, and only one, resource • For example, do not describe a digital image of the Mona Lisa as if it were the original painting • Group related descriptions into description sets • Describe an artist and his or her work separately, not in a single description
Appropriate Values • There are generally tradeoffs between local requirements and global requirements • Use elements and qualifiers to meet the needs of your local context, but… • Keep in mind that machines and people use and interpret metadata, so… • Consider whether the values used will help discovery outside your local context
Dublin Core as a multilingual metadata language • Dublin Core has been translated into 20+ languages • machine-readable tokens are shared by all • human-readable labels are defined in different languages • translations are distributed, maintained in many countries • eventually linked in DCMI registry
One token – labels in many languages [Diagram: the token dc:creator has the label “Creator” on the DCMI Server, “Verfasser” on a server in Germany, and “Pencipta” on a server in Jakarta]
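The split between one shared token and many labels can be sketched as a lookup table. This is an assumption-laden illustration: the `LABELS` table and `label_for` function are hypothetical, and the language codes (`en`, `de`, `id`) are my reading of the three servers shown.

```python
# Hypothetical label table: one machine-readable token, many human-readable labels.
LABELS = {
    "dc:creator": {"en": "Creator", "de": "Verfasser", "id": "Pencipta"},
}

def label_for(token, lang, default_lang="en"):
    """Return the label in `lang`, falling back to the default language, then the token."""
    labels = LABELS.get(token, {})
    return labels.get(lang) or labels.get(default_lang) or token

print(label_for("dc:creator", "de"))  # Verfasser
print(label_for("dc:creator", "fi"))  # Creator (no Finnish label in this table)
```

Software exchanges only the token; the label is chosen at display time.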
Metadata languages are "multilingual" • Metadata is not a spoken language • The words of metadata -- "elements" -- are symbols that stand for concepts expressible in multiple natural languages • Standards may have dozens of translations • Are concepts like "title", "author", or "subject" used the same way in English, Finnish, and Korean?
DCMI Open Metadata Registry • Managing vocabularies defined by the DCMI • Languages • Versioning • Controlled vocabularies • Foundation for modular, incremental integration and evolution • The Registry working group is a Dublin Core Community with participants around the world
The Dublin Core Abstract Model • Terminology • Simple versus Qualified DC • Resources • Descriptions • Description sets • Value Strings • Element refinements • Encoding Schemes • Graphical representation of the Abstract Model • Summary of general ideas
Important DCMI Documents concerning the Abstract Model and Syntax Alternatives • DCMI Abstract Model http://dublincore.org/documents/abstract-model/ • Expressing Dublin Core in HTML/XHTML meta and link elements http://dublincore.org/documents/dcq-html/ • Expressing Dublin Core metadata using the Resource Description Framework (RDF) http://dublincore.org/documents/dc-rdf/ • Expressing Dublin Core metadata using XML http://dublincore.org/documents/dc-xml/
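As a rough illustration of one of these syntaxes, the sketch below renders simple DC statements as HTML meta and link elements. It assumes the commonly used convention of a `schema.DC` link plus `DC.`-prefixed meta names; the function name is hypothetical, and the dcq-html document above remains the authority on the details.

```python
from html import escape

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

def dc_to_html_meta(statements):
    """Render (element, value string) pairs as HTML head elements."""
    lines = [f'<link rel="schema.DC" href="{DC_NAMESPACE}" />']
    for element, value in statements:
        # escape() protects quotes and ampersands inside the attribute value.
        lines.append(f'<meta name="DC.{element}" content="{escape(value)}" />')
    return "\n".join(lines)

html_head = dc_to_html_meta([("title", "Nueve reinas"), ("type", "MovingImage")])
print(html_head)
```

The same statements could be serialized as XML or RDF/XML instead; only the syntax changes, not the statements.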
Simple versus Qualified DC • Simple DC supports single descriptions using the 15 base elements and value strings • Qualified DC supports the richer features of the Abstract Model, and allows the use of all DCMI terms as well as other, non-DCMI terms. • An application profile is used to specify a metadata application that includes DCMI terms in combination with non-DCMI terms (mix & match metadata).
The DCMI Abstract Model [Diagram: resource, statement, property, value] • A data model for Dublin Core • Agreed upon underlying structure for metadata statements • Many years in the making -- long term contention • Describes the structure of statements about resources that we make in our metadata language
What is a resource? • W3C definition: • “anything that has identity… electronic document, an image, a service” • “not all resources are network retrievable; e.g. human beings, corporations, and bound books can also be considered resources” • In other words, a resource is anything we can identify: • Physical things (books, people, airplanes….) • Digital things (Images, web pages, services….) • Concepts (colors, subjects, eras, places) • In the DC context, the DCMI Type list describes the stuff we describe with DC metadata
Resource types for which DC is often used DCMI TYPE Vocabulary
Abstract Model: Descriptions • A description is composed of: • One or more statements about a single resource • Optionally, the URI of the resource being described • Each statement is made up of • A property URI (that identifies a property) • A value URI (that identifies a value) and/or one or more representations of the value (a value string)
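The structure above maps naturally onto a pair of record types. The following Python sketch is one possible reading, not the normative model: the class and field names are hypothetical, and the example.org URI is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Statement:
    property_uri: str                     # identifies the property
    value_uri: Optional[str] = None       # identifies the value, if it has a URI
    value_strings: List[str] = field(default_factory=list)

    def __post_init__(self):
        # A statement needs a value URI and/or at least one value string.
        if self.value_uri is None and not self.value_strings:
            raise ValueError("statement has neither a value URI nor a value string")

@dataclass
class Description:
    statements: List[Statement]           # one or more, all about one resource
    resource_uri: Optional[str] = None    # optional URI of the described resource

    def __post_init__(self):
        if not self.statements:
            raise ValueError("a description needs at least one statement")

film = Description(
    statements=[Statement("http://purl.org/dc/elements/1.1/title",
                          value_strings=["Nueve reinas"])],
    resource_uri="http://example.org/films/nueve-reinas",  # hypothetical URI
)
print(film.statements[0].value_strings[0])  # Nueve reinas
```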
Terminology: Value Strings • A value string is a human-readable string that represents the value of the property • Each value string may have an associated value string language that is an ISO language tag (e.g., pt-BR)
Terminology: Element Refinements • Elements are the same as properties • Element refinements are the same as sub-properties • An element refinement is a special case of an element that shares the meaning of its ‘parent’, but has narrower semantics • Paulo is illustrator of a book, therefore he is also a contributor to the book • Illustrator is an element refinement of contributor
Terminology: Encoding Schemes • Values and value strings can be ‘qualified’ by encoding schemes in order to clarify their meaning • A vocabulary encoding scheme is used to indicate a terminology set from which a value is taken: • Stem cells--Research is a value from LCSH • 616.02774 is a value from DDC-22 • A syntax encoding scheme is used to indicate the structure of a value string: • 2004-10-12 is structured according to the W3CDTF rules for date encoding
Terminology: Description Sets • The 1:1 principle dictates that each description describes one, and only one, resource • We often need to describe grouped sets of descriptions, which are known in the abstract model as description sets • An article and its authors • A painting and its artist • When description sets are exchanged between software applications, they are generally encoded according to a particular syntax in a metadata record
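A description set for "a painting and its artist" might be sketched as follows. Everything here is illustrative: the dict layout, the function names, and the example.org URIs are assumptions, `foaf:name` is a non-DCMI property used only as an example, and rejecting a second description of the same resource is a simplification of mine rather than a rule of the model.

```python
def build_description_set(descriptions):
    """Index descriptions by resource URI, rejecting duplicate descriptions."""
    by_uri = {}
    for description in descriptions:
        uri = description["resource"]
        if uri in by_uri:
            raise ValueError(f"more than one description for {uri}")
        by_uri[uri] = description
    return by_uri

def related(description_set, uri, prop):
    """Follow `prop` statements on `uri` to other descriptions in the same set."""
    return [description_set[value]
            for p, value in description_set[uri]["statements"]
            if p == prop and value in description_set]

PAINTING = "http://example.org/works/mona-lisa"   # hypothetical URIs
ARTIST = "http://example.org/people/leonardo"

dset = build_description_set([
    {"resource": PAINTING,
     "statements": [("dc:title", "Mona Lisa"), ("dc:creator", ARTIST)]},
    {"resource": ARTIST,
     "statements": [("foaf:name", "Leonardo da Vinci")]},
])
artists = related(dset, PAINTING, "dc:creator")
print(artists[0]["statements"])  # [('foaf:name', 'Leonardo da Vinci')]
```

The painting and the artist keep separate descriptions, linked by a value URI rather than merged into one record.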
Abstract Model summary (after Andy Powell) [Diagram: a record (encoded as HTML, XML, or RDF/XML) contains a description set; the description set contains descriptions, each about one resource (URI); each description contains statements; each statement has a property (URI), a value URI with an optional vocabulary encoding scheme, and a value string with an optional syntax encoding scheme and language (e.g., pt-BR)]
General Ideas • DC is not just the 15 elements, though they comprise the foundation for simple DC • 50+ properties (elements) have been approved by DCMI • The model supports local declarations of additional properties • The model supports application profiles (mixing DC elements with those of other sets) • The model allows the grouping of descriptions to create more complex description entities