A metadata infrastructure using ISO standards

lee-glover

Presentation Transcript


  1. A metadata infrastructure using ISO standards

  2. Introduction
     • ISO becoming more open?
     • ISO 1.0: Top-down, expensive, cobwebs?
     • ISO 2.0: ISO 1.0 plus Bottom-up, free, webs?
     • Standards on Wikis (wikification)
     • Open systems of metadata
     • Outline:
       • Use of extant standards (11179) for new (12620, 639)
       • OmegaWiki as exemplar O/S project
       • Peacekeeping forces: ISO & WLDC

  3. Introduction
     • Application area: human languages
       • 50% of languages are endangered (UNESCO);
       • a large proportion of languages have no “resources” and no web presence;
       • discontinuity and fragmentation of research;
       • sustainability and curation issues
     • And yet…
       • Capability for capturing data like never before;
       • Expansion of capacity of the Internet and growing pressure for an inclusive multilingual internet;
       • OLPC programme;
       • Language experts and non-experts are prepared to contribute time and resources
     • So, how to create an infrastructure in which to form communities around languages and harmonize results?

  4. Introduction
     • Language experts may identify linguistic content in a highly precise manner
     • What are non-experts (the user community) capable of?
     • Providing more specific sets of labels may help in discovery of written or spoken languages in all kinds of media, and help to harmonize research activities, so long as people know what they are looking at.
     • Currently tagged content is often inaccurate; the problem needs to be taken away from end users
     • More precise identification improves the chances of getting what you wanted: consider “coffee” vs. “coffee + TYPE + COLOUR …” vs. “strong black coffee, in a mug, with 2 sugars”.
     • Beyond documentation of names and representations, documentary information for each language might be helpful.
     • Working towards a machine-readable representation for all such information is a longer-term goal.

  5. ISO standards

  6. ISO standards
     • Metadata registry according to the ISO 11179 series of standards (see also ISO 19763).
     • According to ISO 11179:
       • A Value Domain is associated with a Conceptual Domain: a Value Domain provides a representation for the Conceptual Domain.
       • An example of a Conceptual Domain and a set of Value Domains is ISO 3166, Codes for the representation of names of countries.
       • ISO 3166 describes a set of seven Value Domains: short name in English, official name in English, short name in French, official name in French, alpha-2 code, alpha-3 code, and numeric code.
       • Each representation contains a set of values that may be used in the value domain associated with the DEC (data element concept); each one of the seven associations is a data element.
       • For each representation of the data, the permissible values, the datatype, the representation class, and possibly the units of measure, are altered.

     Conceptual domain name: Countries of the world
     Conceptual domain definition: Lists of current countries of the world represented as names or codes.

     Value domain name (1): Country codes – 2 character alpha
     Permissible values:
       <AF, The primary geopolitical entity known as "Democratic Republic of Afghanistan">
       <AL, The primary geopolitical entity known as "People's Socialist Republic of Albania">
       . . .
       <ZW, The primary geopolitical entity known as "Republic of Zimbabwe">

     Value domain name (2): Country codes – 3 character alpha
     Permissible values:
       <AFG, The primary geopolitical entity known as "Democratic Republic of Afghanistan">
       <ALB, The primary geopolitical entity known as "People's Socialist Republic of Albania">
       . . .
       <ZWE, The primary geopolitical entity known as "Republic of Zimbabwe">
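The relationship between one conceptual domain and several value domains can be sketched as a small data structure. This is only an illustrative sketch, not the ISO 11179 metamodel itself: the class names `ValueMeaning`, `ValueDomain` and `ConceptualDomain`, the `datatype` strings, and the `translate` helper are assumptions introduced here; the country entries are the ones quoted on the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ValueMeaning:
    """A member of the conceptual domain, independent of any code (hypothetical class)."""
    description: str


@dataclass
class ValueDomain:
    """One representation of a conceptual domain: permissible value -> meaning."""
    name: str
    datatype: str
    permissible_values: dict[str, ValueMeaning] = field(default_factory=dict)


@dataclass
class ConceptualDomain:
    name: str
    definition: str
    value_domains: list[ValueDomain] = field(default_factory=list)

    def translate(self, value: str, source: str, target: str) -> str:
        """Find the value in `target` that shares its meaning with `value` in `source`."""
        by_name = {vd.name: vd for vd in self.value_domains}
        meaning = by_name[source].permissible_values[value]
        for candidate, candidate_meaning in by_name[target].permissible_values.items():
            if candidate_meaning == meaning:
                return candidate
        raise KeyError(f"{value!r} has no counterpart in {target!r}")


# Meanings as quoted on the slide.
afghanistan = ValueMeaning('The primary geopolitical entity known as "Democratic Republic of Afghanistan"')
albania = ValueMeaning('The primary geopolitical entity known as "People\'s Socialist Republic of Albania"')
zimbabwe = ValueMeaning('The primary geopolitical entity known as "Republic of Zimbabwe"')

alpha2 = ValueDomain("Country codes - 2 character alpha", "string",
                     {"AF": afghanistan, "AL": albania, "ZW": zimbabwe})
alpha3 = ValueDomain("Country codes - 3 character alpha", "string",
                     {"AFG": afghanistan, "ALB": albania, "ZWE": zimbabwe})

countries = ConceptualDomain(
    "Countries of the world",
    "Lists of current countries of the world represented as names or codes.",
    [alpha2, alpha3],
)

print(countries.translate("ZW", alpha2.name, alpha3.name))  # prints ZWE
```

Because both value domains point at the same meanings, moving between representations needs no pairwise mapping table; adding a third representation (for example the numeric code) would only require one more `ValueDomain`.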

  7. ISO standards
     [Diagram relating four ISO 11179 notions: conceptual domain, data element concept, value domain, data element]
     • Conceptual domains: /masculine/ /feminine/ /neuter/ …; /Afghanistan/ …; /English/ /French/ …
     • Data element concepts: /Gender/, /Language identifier/, /Country/
     • Value domains: m, f, n…; en, fr..; GB, FR, CN,
     • Data elements: gen, lang, country [Implemented as an XML attribute named ‘…’]
     • Anchors: <xml country= FR > <w lemme=vert lang=fr gen=…>verte</w>
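As a rough illustration of data elements realised as XML attributes, the sketch below checks attribute values against the value domains shown on the slide. It rests on several assumptions: the root element name `doc` and the function `invalid_attributes` are hypothetical, the value sets are only the fragments visible on the slide, and the `gen` attribute is left off the example element because the slide elides its value.

```python
import xml.etree.ElementTree as ET

# Value domains as far as the slide shows them; a real registry would hold the full sets.
VALUE_DOMAINS = {
    "lang": {"en", "fr"},
    "country": {"GB", "FR", "CN"},
    "gen": {"m", "f", "n"},
}


def invalid_attributes(element):
    """Return (tag, attribute, value) triples whose value falls outside its value domain.

    Attributes with no registered value domain (e.g. 'lemme') are not checked.
    """
    problems = []
    for node in element.iter():
        for name, value in node.attrib.items():
            allowed = VALUE_DOMAINS.get(name)
            if allowed is not None and value not in allowed:
                problems.append((node.tag, name, value))
    return problems


# Well-formed counterpart of the slide's fragment ('doc' is a hypothetical root name).
doc = ET.Element("doc", {"country": "FR"})
word = ET.SubElement(doc, "w", {"lemme": "vert", "lang": "fr"})
word.text = "verte"

print(ET.tostring(doc, encoding="unicode"))
print(invalid_attributes(doc))
```

Keeping the permissible values in one table keyed by attribute name mirrors the idea that each attribute is a data element bound to exactly one value domain.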

  8. ISO standards • 12620 metamodel - ISO standard in preparation

  9. Languages Infrastructure
     • OmegaWiki: a collaborative project to produce a free, multilingual resource in every language, with lexicological, terminological and thesaurus information; built on a relational database
     • World Language Documentation Centre (WLDC): currently comprising 22 experts in language technologies, linguistics, terminology standardisation, and localisation
     • ISO: provision of the ISO 639 series of standards; focus here on 639-4 and 639-6; the standards provide the structure

  10. Languages Infrastructure
      • Model for ISO 639 proposed and developed by LIRICS project participants (Gillam, Romary); recently accepted for inclusion and review in the current iteration of the developing ISO 639 part 4.
      • Intended to be fully compatible with models being developed in ISO TC 37 in general, compatible with the Data Category Interchange Format defined in ISO 12620, and to provide a means for interlinking the collection of identifiers provided across the 639 series.
      • ISO TC 37 standards for computational use of terminology collections, specifically ISO 16642 and its combination with ISO 12620, emphasize a metamodel in combination with metadata identifiers, referred to as data categories.
      • Language identifiers of ISO 639 shall be compatible, interoperable, mutually understandable, and usable to the degree of precision needed by the user up to the limitations of these identifiers.
      • Language identifiers themselves need to be described by metadata.
      • All of these metadata items can be submitted to the metadata registry specified according to ISO 12620.

  11. Languages Infrastructure
      • ISO 639 model based on:
        • need to replicate simplistic structure of ISO 639-1 and 639-2
        • inferred model of the Ethnologue as published
        • ISO 12620 / ISO 11179
        • emergent model through BSI for ISO 639-6 adapted, generalized and cross-validated from encyclopædic and other sources including:
          • Gordon Jr, R. G. (Ed.) (2005). Ethnologue: Languages of the World, 15th Edn. SIL International.
          • Voegelin, C.F. and F.M. (1977). Classification and index of the world's languages. New York, NY: Elsevier North Holland, Inc.
          • Ruhlen, M. (1987). A guide to the world's languages. Vol. 1: Classification. London: Edward Arnold.
          • Comrie, B. (Ed.) (1987). The World's major languages. New York: Oxford University Press.
          • Chambers, J.K. and Trudgill, P. (1998). Dialectology. Cambridge: Cambridge University Press.
          • Dalby, D. (1999). Linguasphere Register of the world’s languages and speech communities. Linguasphere Press.
      • Development of ISO 639-6 initially assisted by a fund made available by the Department of Trade and Industry of the UK and administered by BSI; subsequent efforts in standardization and validation have been funded, and supported, by BSI and ICT Marketing Ltd.

  12. Languages Infrastructure
      [Diagram: “standards as databases”. Labels: ISO 12620 (data categories); ISO 11179 (metadata registries); ISO 639-4; ISO 639-X standard and ISO 639-X data; ISO 639-6 standard and ISO 639-6 data; expert review; community review & infrastructure; co-ordination: SIL, LoC, Infoterm, “UN”]

  13. Languages Infrastructure
      • The right organizational model? c/w Citizendium
      • Larry Sanger, a co-founder of Wikipedia who left to become one of its most vocal critics.
      • "Wikipedia has accomplished great things, but the world can do even better," Dr Sanger said. "By engaging expert editors, eliminating anonymous contribution, and launching a more mature community under a new charter, a much broader and more influential group of people and institutions will be able to improve upon Wikipedia’s extremely useful, but often uneven work. The result will be not only enormous and free, but reliable."
      • A vetted set of editors, dubbed "constables", developing a set of rules for contributors to abide by.
      • Times Online, 7 September 2007

  14. Languages Infrastructure

  15. ISO 3166-1

  16. ISO 639-1

  17. ISO 639-6

  18. Wikis for Languages

  19. http://lux12.mpi.nl/isocat/

  20. ISOcat architecture
      [Architecture diagram (Kemps-Snijders, Windhouwer, Wittenburg and Wright). Labels: client tool; web interface; REST API; WS API; core DCR services (control access, manage session, manage user profile, manage comments, manage balloting, manage access, access data, manage system); administrator; mirror; DBMS]

  21. Languages Infrastructure
      • Language documentation via ISO 639-4: association of metadata descriptors to a model interoperable with DCIF (12620) (639-4 section 9)

  22. Languages Infrastructure • Eventual inclusion of all “available” metadata

  23. Languages Infrastructure
      • Language Codes Standards are growing in number and complexity
        • From 2 to 6
        • From 400 identifiers to upwards of 30000
        • From lists to databases
        • From tables to metadata registries
        • From published text documents to “published” databases
        • From IETF RFC to RFCs to RFCs
        • From a closed membership committee to an open Community initiative (OmegaWiki)
        • … with accompanying (web) services and products

  24. Languages Infrastructure
      • Language Codes Standards are growing in number and complexity
        • From 2 to 6 – eventually back to 1?
        • From 400 identifiers to upwards of 30000 – plus supporting metadata
        • From lists to databases – multiple metadata registers
        • From tables to metadata registries – registers + policies + “auditors”
        • From published text documents to “published” databases – “SAD”
        • From IETF RFC to RFCs to RFCs – consume, consume, consume
        • From a closed membership committee to an open Community initiative (OmegaWiki) – supporting infrastructure, expert review of community contributions (e-Voting?)
        • … with accompanying (web) services and products – Open Source and bespoke, and secured funding as necessary

  25. ISO standards

  26. Next steps
      • ISO: efforts with ISOcat (TC 37)
      • OmegaWiki: support for community building
      • WLDC: verification and validation in an on-going fashion
      • Connecting the whole thing… and evaluating at scale
        • a simple catalogue of names of all languages in ISO 639 parts 1-3 has potential for, at least, 7500 x 7500 entries (> 56 million) plus associated status information…
      • Further connectivity: SRB (MCAT)? OMII (Data 2.0)?
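The scale estimate on this slide follows from naming every language in every language. A minimal arithmetic check, assuming the slide's approximate figure of 7500 languages (the function name `catalogue_entries` is introduced here for illustration only):

```python
def catalogue_entries(language_count: int) -> int:
    """Entries needed if each language has a name in each language."""
    return language_count * language_count


entries = catalogue_entries(7500)  # 7500 is the slide's approximate figure
print(f"{entries:,}")  # prints 56,250,000
```

The count grows quadratically, so it excludes the status information the slide mentions and would rise further if more than one name per language pair were recorded.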

  27. Acknowledgements
      • EU eContent project LIRICS (22236)
      • British Standards Institution
      • OmegaWiki
      • WLDC
      • Department of Trade and Industry’s Knowledge Transfer Partnerships scheme (KTP 1739).
      • Contributions and efforts of colleagues and peers in ISO, BSI, IETF, in the projects identified, and in the wider community also.
      And thank you for listening…
