The TEI Gaiji module: Representing non-standard characters and glyphs

M. J. Driscoll, Arnamagnæan Institute

Using Unicode • Unicode already covers most of the characters that scholars need for transcribing texts in most writing systems. • There are, however, many characters and uncommon glyphs that have yet to make it into Unicode. • Moreover, one may wish to record variants of a single character in order to facilitate scribal identification or for statistical purposes, or simply to reproduce the original as closely as possible. • The TEI ‘Gaiji’ module provides a means of doing this.

Phonemes, characters and glyphs • The phoneme /a/ can be represented in different ways, but in most cases by the character <a> (in lower case), Unicode code point U+0061 (Latin alphabet) or U+0430 (Cyrillic alphabet). • The following are all glyphs of <a>: • Characters like <a> can also be referred to as graphemes, and glyphs as allographs of those graphemes.

Characters vs. glyphs • Particularly in older documents, phonemes can be represented by several different characters or combinations of characters. • In medieval Icelandic manuscripts, for example, the following characters can be used to represent /á/ (long /a/): • For each of these characters, various glyphs may occur:

Variant letter forms • Variant letter forms (glyphs) are often distinguished in diplomatic transcriptions of manuscripts (and early printed materials). For Icelandic sources, such variant forms include: • high and round s • ordinary and round r (r-rotunda) • ordinary and round d • ordinary and insular forms of f and v • dotted and dotless i • small capitals, used originally to denote geminates (principally N and R, but occasionally also D, G, M, S and T)

Retaining special characters and glyphs • In a strictly diplomatic transcription variant letter forms such as high s, undotted i, small capital r and so on are retained.

Semi- and fully-normalised transcriptions • In a semi-normalised transcription most – and in a fully normalised transcription all – of these variant letter forms are replaced with their standard equivalents.

Defining characters and glyphs • Using the ‘Gaiji’ module one can encode characters or glyphs by defining them in one or more <charDecl> (‘character declarations’) elements in the TEI header and then referring to them using the <g> element in the body of the text. • Within <charDecl> one uses either the <char> element to define a new character, or <glyph> to define a glyph of an existing character. • Within these, several sub-elements are available, including: • <charName>/<glyphName> contains the name of the character or glyph, expressed following Unicode conventions. • <charProp> provides a name and value for some property of the character or glyph, in keeping with Unicode conventions and/or according to some locally defined scheme. • <mapping> contains one or more mappings for the character or glyph, in accordance with some typology, specified by the type attribute. • <graphic> can be used to provide a picture, in some suitable format, of the character or glyph.

Defining characters • A new character can be defined and assigned to a position in the Unicode Private Use Area (PUA), and/or described in terms of Unicode combining characters: • The use of entities (e.g. &aacute; for <á>) is now deprecated in XML, but a human-readable entity-like name can be used as the value of the @xml:id attribute, rather than, say, the Unicode code point. This is then referred to in the body of the text as the value of the @ref attribute on <g>. • <g ref="#vdot"/>
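As an illustration, a declaration for the character referenced by #vdot might look roughly like the sketch below. It uses the element names listed on these slides (later releases of the TEI Guidelines may name some of them differently). The character name, the reading of "vdot" as v with a dot above, and the PUA position U+E000 are all assumptions made for the example, not values taken from the original slides.

```xml
<charDecl xmlns="http://www.tei-c.org/ns/1.0">
  <char xml:id="vdot">
    <!-- Name written in the style of Unicode character names (assumed) -->
    <charName>LATIN SMALL LETTER V WITH DOT ABOVE</charName>
    <!-- Locally defined property recording the old entity-like name -->
    <charProp>
      <localName>entity</localName>
      <value>vdot</value>
    </charProp>
    <!-- Base letter v followed by U+0307 COMBINING DOT ABOVE -->
    <mapping type="composed">v&#x0307;</mapping>
    <!-- Hypothetical Private Use Area position, chosen arbitrarily -->
    <mapping type="PUA">U+E000</mapping>
  </char>
</charDecl>
```

The xml:id value is what <g ref="#vdot"/> points at, so it has to be unique within the document into which the declarations are placed.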

Defining glyphs • Glyphs are defined in the same way:
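A glyph declaration could be sketched along the following lines. It assumes that long s is treated as a glyph of <s> and that the mapping types ‘dipl’ and ‘norm’ are the ones used for the transcription levels; the image file name is hypothetical.

```xml
<charDecl xmlns="http://www.tei-c.org/ns/1.0">
  <glyph xml:id="slong">
    <glyphName>LATIN SMALL LETTER LONG S</glyphName>
    <!-- Diplomatic level: keep the long s (U+017F) -->
    <mapping type="dipl">&#x017F;</mapping>
    <!-- Normalised level: replace it with ordinary s -->
    <mapping type="norm">s</mapping>
    <!-- Hypothetical image of the glyph as it appears in the source -->
    <graphic url="slong.png"/>
  </glyph>
</charDecl>
```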

Using <g> in the text • The characters and glyphs are then invoked in the text using <g>:

Generating multi-level transcriptions • Using mark-up like this, multi-level transcriptions – from strictly diplomatic to fully normalised – can easily be generated from a single encoded text by choosing the <reg> or <orig> form along with the relevant mapping (‘dipl’ or ‘norm’). • <w><choice><reg>sat</reg><orig><g ref="#slong"/>at</orig></choice></w> • Default values can also be built into the encoding: • <w><g ref="#slong">s</g>at</w>
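One way to generate the different levels is sketched in the XSLT below. It is only an outline under stated assumptions: the parameter name level and the key name gaiji are invented for the example, the character declarations are assumed to sit in the header of the same document, and the mapping types are assumed to be ‘dipl’ and ‘norm’.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: 'dipl' (default) or 'norm' -->
  <xsl:param name="level" select="'dipl'"/>

  <!-- Index the declared characters and glyphs by their xml:id -->
  <xsl:key name="gaiji" match="tei:char | tei:glyph" use="@xml:id"/>

  <!-- Keep the header out of the transcription output -->
  <xsl:template match="tei:teiHeader"/>

  <!-- For each <choice>, follow only the reading for the requested level -->
  <xsl:template match="tei:choice[tei:orig and tei:reg]">
    <xsl:choose>
      <xsl:when test="$level = 'norm'">
        <xsl:apply-templates select="tei:reg"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="tei:orig"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Resolve <g> through the mapping for the requested level -->
  <xsl:template match="tei:g[@ref]">
    <xsl:variable name="mapped"
        select="key('gaiji', substring-after(@ref, '#'))/tei:mapping[@type = $level]"/>
    <xsl:choose>
      <xsl:when test="$mapped">
        <xsl:value-of select="$mapped[1]"/>
      </xsl:when>
      <!-- No mapping found: fall back to the default inside <g> -->
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Running the same stylesheet twice, once with level set to 'dipl' and once with 'norm', would then give the two transcriptions from the one encoded text.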

Including character declarations in the header • Character declarations may be integrated into <encodingDesc> in two separate ways: • Directly as XML elements. • XIncluded from another location. • Which of these methods is used depends on the circumstances of a particular project. The advantage of using XInclude is that character declarations are always drawn from a single, external source which is distinct from any individual XML document; they need not be added manually to each document to which they apply. In this way, character declarations function essentially as an authority file. For a project with many documents, this can facilitate management, reduce redundancy and reduce the incidence of errors.
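The XInclude approach might be sketched as follows. The file name charDecl.xml is hypothetical, and the sketch assumes that the root element of that file is a <charDecl> in the TEI namespace.

```xml
<encodingDesc xmlns="http://www.tei-c.org/ns/1.0"
              xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- Replaced by the root element of charDecl.xml when XInclude is resolved -->
  <xi:include href="charDecl.xml"/>
</encodingDesc>
```

The inclusion only takes effect if the parser or a pre-processing step resolves XInclude (for example xmllint with its --xinclude option); otherwise <g> references will point at declarations that are not present in the document.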

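Processing <g> elements
<xsl:template match="tei:g[@ref]">
  <!-- Load the shared character declarations from the external file -->
  <xsl:variable name="charDecls" select="doc('encodingDesc_fasnl_mss.xml')/descendant::tei:charDecl"/>
  <!-- Strip the leading '#' from the pointer to get the xml:id value -->
  <xsl:variable name="href" select="substring-after(@ref, '#')"/>
  <!-- Look up the <char> or <glyph> with that ID in the declarations -->
  <xsl:variable name="g" select="$charDecls//id($href)"/>
  <xsl:choose>
    <xsl:when test="$g">
      <!-- Both mappings are written out here; a stylesheet for a single transcription level would keep only the relevant line -->
      <xsl:value-of select="$g/tei:mapping[@type = 'dipl']"/>
      <xsl:value-of select="$g/tei:mapping[@type = 'norm']"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Unknown reference: output a placeholder -->
      <xsl:text>?</xsl:text>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>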
