Types and Tokens Distribution in TITUS (Distribution of word forms in the TITUS corpus) Dr. Svetlana Ahlborn Institut für Empirische Sprachwissenschaft Universität Frankfurt am Main E-Mail: l.ahlborn@em.uni-frankfurt.de
Outline • TITUS Resource Data • Peculiarities of TITUS texts • Tokens and Types calculation in TITUS Resources • Metadata for Tokens and Types distribution 26.06.2013
TITUS Resource Data • TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) http://titus.uni-frankfurt.de • TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens. • A token represents a concrete occurrence of a linguistic unit; a type bundles the tokens that belong together (for example, all occurrences of the same word form).
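The token/type distinction can be illustrated with a minimal Perl sketch. It assumes that tokens are separated by whitespace and that two tokens belong to the same type when they are identical after lowercasing; the helper name `count_tokens_types` is hypothetical, and TITUS itself may group tokens differently.

```perl
use strict;
use warnings;

# Hypothetical helper: count tokens (running words) and types (distinct forms).
# Assumption: whitespace-separated tokens, compared case-insensitively.
sub count_tokens_types {
    my ($text) = @_;
    my @tokens = grep { length } split /\s+/, lc $text;
    my %types;
    $types{$_}++ for @tokens;
    return (scalar @tokens, scalar keys %types);
}

my ($tokens, $types) =
    count_tokens_types("in principio erat verbum et verbum erat apud deum");
print "$tokens Tokens | $types Types\n";   # prints "9 Tokens | 7 Types"
```

Here "erat" and "verbum" each occur twice, so nine tokens collapse into seven types.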
TITUS Data http://www.clarin.eu/node/1512 Added by J. Gippert, R. Mittmann
TITUS Search Engine • The TITUS Search Engine does not report the number of tokens in a specific text, but the number of quotations of the word.
Peculiarities of TITUS texts: Gothic • The Biblia Gothica contains additional parallel passages in Latin and Greek. Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm).
Peculiarities of TITUS texts: Old Church Slavonic • Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, the original form of the text, and in the Cyrillic alphabet. Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm).
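When the same text is present in two scripts, a count over the whole file would include every word twice. One possible way to keep the renderings apart is to sort tokens by Unicode script property, as in the sketch below. The helper `tokens_by_script` and the sample string are hypothetical; the sample consists of arbitrary code points from the Glagolitic and Cyrillic blocks and is not taken from the Codex Marianus.

```perl
use strict;
use warnings;

# Hypothetical helper: sort whitespace-separated tokens into buckets by script,
# so that two renderings of the same text can be counted separately.
sub tokens_by_script {
    my ($text) = @_;
    my %by_script = (Glagolitic => [], Cyrillic => [], Other => []);
    for my $tok (grep { length } split /\s+/, $text) {
        my $script = $tok =~ /\p{Glagolitic}/ ? 'Glagolitic'
                   : $tok =~ /\p{Cyrillic}/   ? 'Cyrillic'
                   :                            'Other';
        push @{ $by_script{$script} }, $tok;
    }
    return \%by_script;
}

# Assumed sample: one Glagolitic token, one Cyrillic token, one number.
my $sample  = "\x{2C30}\x{2C38}\x{2C4F} \x{0430}\x{0437}\x{044A} 12";
my $buckets = tokens_by_script($sample);
printf "%s: %d\n", $_, scalar @{ $buckets->{$_} } for sort keys %$buckets;
```

Tokens that mix scripts are assigned to the first matching bucket, which is a simplification.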
Peculiarities of TITUS texts: Old Polish • Old Polish texts display editions from different periods side by side. Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm).
Peculiarities of TITUS texts: Ossetian • The Ossetian Nart epic is represented in Latin script and in extended Cyrillic. Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm).
Peculiarities of TITUS texts: Russian-Low German • Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variants.
Peculiarities of TITUS texts: Old Prussian • The Old Prussian corpus consists of at least 21 different languages or language variants (including Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).
Creation • A digitized source consists not only of words in the source language; it also contains various information that does not originally belong to the document: numbers, tags, punctuation marks, edition information, etc. Such material is stripped with substitutions like these:
$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;            # removes <†‡„> and a preceding number
$zeile =~ s/\d*\s+<\W<?ConvertCheck:\s+LevelNameTooLong>//g;   # removes <?ConvertCheck: LevelNameTooLong>
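A generic version of such a cleanup step might look like the sketch below. The helper `clean_line` and its rules are illustrative assumptions, not the substitutions used in the TITUS scripts: they drop anything that looks like a tag, remove numbers, and replace ASCII punctuation with spaces before the text is split into tokens.

```perl
use strict;
use warnings;

# Hypothetical cleanup step; the rules below are illustrative assumptions.
sub clean_line {
    my ($zeile) = @_;
    $zeile =~ s/<[^>]*>//g;          # drop anything that looks like a tag
    $zeile =~ s/\d+//g;              # drop verse and line numbers
    $zeile =~ s/[[:punct:]]+/ /g;    # replace punctuation with a space
    $zeile =~ s/^\s+|\s+$//g;        # trim both ends
    $zeile =~ s/\s+/ /g;             # collapse runs of whitespace
    return $zeile;
}

print clean_line('3 <lb/> in principio, erat verbum.'), "\n";
# prints "in principio erat verbum"
```

Tags are removed before punctuation so that the angle brackets are still intact when the tag pattern runs.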
Examples: Gothic
Examples: Tönnies Fenne's Manual (17th century) The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.
Examples: further application
Metadata • DC – Dublin Core • TEI – Text Encoding Initiative • CEI – Corpus Encoding Initiative • IMDI – ISLE Meta Data Initiative • OLAC – Open Language Archives Community • CMDI – Component MetaData Infrastructure
CMDI – Component MetaData Infrastructure http://www.clarin.eu/cmdi
TITUS Metadata: HTML Format
New Metadata Set for TITUS
Metadata Example for TITUS – XML CMDI
<ResourcePublicationTimeElectronic>16.6.2002</ResourcePublicationTimeElectronic>
<ResourceWordcountGeneral>
  <Tokens>1629 Tokens</Tokens>
  <Types>893 Types</Types>
</ResourceWordcountGeneral>
<ResourceWordcountTT>
  <Language></Language>
  <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 1_General</Language>
  <LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 2_Gothic</Language>
  <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 4_Latin</Language>
  <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 5_Greek</Language>
  <LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes>
</ResourceWordcountTT>
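In this example the general figures equal the sums of the per-language rows (10 + 420 + 572 + 627 = 1629 tokens; 9 + 240 + 325 + 319 = 893 types). A per-language row could be produced along the following lines. The element names are taken from the example; the helper `wordcount_element` and its input (a language label followed by that language's tokens) are assumptions, and the label is assumed to contain no characters that need XML escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: render one <ResourceWordcountTT> element from a
# language label and the list of tokens assigned to that language.
sub wordcount_element {
    my ($language, @tokens) = @_;
    my %types;
    $types{$_}++ for @tokens;
    return sprintf "<ResourceWordcountTT>\n"
        . "  <Language>%s</Language>\n"
        . "  <LanguageTokensTypes>%d Tokens | %d Types</LanguageTokensTypes>\n"
        . "</ResourceWordcountTT>\n",
        $language, scalar @tokens, scalar keys %types;
}

# Assumed toy input: four Latin tokens, three distinct forms.
print wordcount_element('Language 4_Latin', qw(et dixit et fecit));
```

For the toy input this yields a row reading "4 Tokens | 3 Types".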
Metadata for TITUS – Browser
Thank you for your attention! Links • ARBIL (metadata editor) http://tla.mpi.nl/tools/tla-tools/arbil/ • CLARIN http://www.clarin.eu • CMDI http://www.clarin.eu/cmdi • Dublin Core http://dublincore.org/documents/dcmi-terms/ • IMDI http://www.mpi.nl/IMDI/ • OLAC http://www.language-archives.org/ • TEI http://www.tei-c.org/index.xml • TITUS http://titus.uni-frankfurt.de