Types and Tokens Distribution in TITUS (Distribution of Word Forms in the TITUS Corpus)

Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main. Email: l.ahlborn@em.uni-frankfurt.de



  

  2. Outline • TITUS Resource Data • Peculiarities of TITUS texts • Tokens and Types calculation in TITUS Resources • Metadata for Tokens and Types distribution 26.06.2013

  3. TITUS Resource Data • TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) http://titus.uni-frankfurt.de • TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens. A token represents a concrete occurrence of a linguistic unit; a type bundles the tokens that are associated with each other.
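The token/type distinction on this slide can be illustrated with a minimal Python sketch. It assumes that tokens are separated by whitespace and that word forms differing only in case belong to the same type; both are simplifying assumptions for illustration, not the procedure used for TITUS, and the function name and sample sentence are made up.

```python
from collections import Counter


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of types) for a text."""
    # Assumption: each whitespace-separated word form is one token,
    # i.e. one concrete occurrence; case is folded before comparing.
    tokens = text.casefold().split()
    # Identical word forms are bundled into a single type.
    types = Counter(tokens)
    return len(tokens), len(types)


sample = "in the beginning was the word and the word was with god"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # prints "12 8"
```

The sample has 12 running words, but "the", "word" and "was" repeat, so only 8 distinct types remain.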

  4. TITUS Data http://www.clarin.eu/node/1512 Added by J. Gippert, R. Mittmann

  5. TITUS Search Engine • The TITUS Search Engine does not determine the number of tokens in a given text, but the number of quotations of the word.

  6. Peculiarities of TITUS texts: Gothic • The Biblia Gothica contains additional parallel passages in Latin and Greek. Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm).

  7. Peculiarities of TITUS texts: Old Church Slavonic • Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet (the original form of the text) and in Cyrillic. Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm).
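When the same text is present in two scripts, counting every word form together would count each word twice. One way to keep the two representations apart is sketched below in Python. It assumes that the script can be read from the Unicode name of a word's first letter; this heuristic, the function names and the sample string are illustrative assumptions, not the method used for TITUS and not a quotation from the Codex Marianus.

```python
import unicodedata
from collections import defaultdict


def script_of(word: str) -> str:
    """Guess the script of a word from the Unicode name of its first letter."""
    for ch in word:
        if ch.isalpha():
            # Unicode character names begin with the script name, e.g.
            # "GLAGOLITIC SMALL LETTER BUKY" or "CYRILLIC SMALL LETTER BE".
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def counts_by_script(text: str) -> dict[str, tuple[int, int]]:
    """Return {script: (tokens, types)} for a whitespace-separated text."""
    forms = defaultdict(list)
    for word in text.split():
        forms[script_of(word)].append(word)
    return {script: (len(ws), len(set(ws))) for script, ws in forms.items()}


# Hypothetical sample: one word in Glagolitic, the same word twice in Cyrillic.
sample = "\u2c31\u2c41\u2c33\u2c4f богъ богъ"
print(counts_by_script(sample))
```

Each script then gets its own token and type count, so the Glagolitic and Cyrillic versions of a text can be reported separately.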

  8. Peculiarities of TITUS texts: Old Polish • Old Polish texts display editions from different periods side by side. Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm).

  9. Peculiarities of TITUS texts: Ossetian • The Ossetian Nart epic is represented in Latin script and in extended Cyrillic. Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm).

  10. Peculiarities of TITUS texts: Russian-Low German • Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variations.

  11. Peculiarities of TITUS texts: Old Prussian • The Old Prussian corpus consists of at least 21 different languages or language variants (Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).

  12. Creation • A digitized source consists not only of words in the source language; it also contains various information that does not originally belong to the document: numbers, tags, punctuation marks, edition information, etc. Perl substitutions such as the following remove this material from each line ($zeile):
    $zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;  # optional number, whitespace, then the tag <†‡„>
    $zeile =~ s/\d*\s+<\W<?ConvertCheck:\s+LevelNameTooLong>//g;  # optional number, whitespace, then <?ConvertCheck: LevelNameTooLong>
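The clean-up step on this slide can be sketched in Python. The three patterns below are deliberately simplified assumptions (a leading number, anything in angle brackets, any punctuation mark) and are not the Perl expressions used for TITUS; the function name and the sample line are likewise made up for illustration.

```python
import re

# Hypothetical patterns for material that does not belong to the source text.
LINE_NUMBER = re.compile(r"^\s*\d+\s+")  # leading line or verse number
TAG = re.compile(r"<[^>]*>")             # anything in angle brackets
PUNCTUATION = re.compile(r"[^\w\s]")     # every non-word, non-space character


def clean_line(line: str) -> str:
    """Strip numbers, tags and punctuation; return the remaining word forms."""
    line = LINE_NUMBER.sub("", line)
    line = TAG.sub(" ", line)
    line = PUNCTUATION.sub(" ", line)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(line.split())


print(clean_line("12 <?ConvertCheck: LevelNameTooLong> jah qaþ, du im:"))
# prints "jah qaþ du im"
```

Only the cleaned line would then be passed on to token and type counting. Note that the punctuation pattern also removes combining diacritics, so texts that rely on them would need a narrower pattern.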

  13. Examples: Gothic

  14. Examples: Gothic

  15. Examples: Tönnies Fenne's Manual (17th century) • The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.

  16. Examples: further application

  17. Metadata • DC – Dublin Core • TEI – Text Encoding Initiative • CEI – Corpus Encoding Initiative • IMDI – ISLE Meta Data Initiative • OLAC – Open Language Archives Community • CMDI – Component MetaData Infrastructure

  18. CMDI – Component MetaData Infrastructure http://www.clarin.eu/cmdi

  19. TITUS Metadata: HTML Format

  20. New Metadata Set for TITUS

  21. Metadata Example for TITUS – XML CMDI
  <ResourcePublicationTimeElectronic>16.6.2002</ResourcePublicationTimeElectronic>
  <ResourceWordcountGeneral>
    <Tokens>1629 Tokens</Tokens>
    <Types>893 Types</Types>
  </ResourceWordcountGeneral>
  <ResourceWordcountTT>
    <Language></Language>
    <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes>
  </ResourceWordcountTT>
  <ResourceWordcountTT>
    <Language>Language 1_General</Language>
    <LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes>
  </ResourceWordcountTT>
  <ResourceWordcountTT>
    <Language>Language 2_Gothic</Language>
    <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes>
  </ResourceWordcountTT>
  <ResourceWordcountTT>
    <Language>Language 4_Latin</Language>
    <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes>
  </ResourceWordcountTT>
  <ResourceWordcountTT>
    <Language>Language 5_Greek</Language>
    <LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes>
  </ResourceWordcountTT>
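The per-language entries in this fragment can be read back and checked against the general word count. The Python sketch below uses the standard library's `xml.etree.ElementTree`. The wrapping `<root>` element, the function name and the regular expression are assumptions made for illustration: the slide shows only a fragment, not the full CMDI record or any TITUS tooling.

```python
import re
import xml.etree.ElementTree as ET

# Per-language entries copied from the slide (the empty entry is a template row).
fragment = """
<ResourceWordcountTT><Language></Language>
  <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 1_General</Language>
  <LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 2_Gothic</Language>
  <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 4_Latin</Language>
  <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 5_Greek</Language>
  <LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes></ResourceWordcountTT>
"""

COUNTS = re.compile(r"(\d+)\s+Tokens\s*\|\s*(\d+)\s+Types")


def read_counts(xml_fragment: str) -> dict[str, tuple[int, int]]:
    """Return {language label: (tokens, types)} from ResourceWordcountTT entries."""
    # The fragment has several top-level elements, so wrap it in a dummy root.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    counts = {}
    for entry in root.iter("ResourceWordcountTT"):
        language = (entry.findtext("Language") or "").strip()
        match = COUNTS.search(entry.findtext("LanguageTokensTypes") or "")
        if language and match:  # skips the empty template entry
            counts[language] = (int(match.group(1)), int(match.group(2)))
    return counts


counts = read_counts(fragment)
total_tokens = sum(tokens for tokens, _ in counts.values())
total_types = sum(types for _, types in counts.values())
print(total_tokens, total_types)  # prints "1629 893"
```

The sums agree with the `<ResourceWordcountGeneral>` values on the slide (1629 tokens, 893 types), which suggests that the general figures are the totals of the per-language counts.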

  22. Metadata for TITUS – Browser

  23. Metadata for TITUS – Browser

  24. Metadata for TITUS – Browser

  25. Thank you for your attention! Links • ARBIL (metadata editor) http://tla.mpi.nl/tools/tla-tools/arbil/ • CLARIN http://www.clarin.eu • CMDI http://www.clarin.eu/cmdi • Dublin Core http://dublincore.org/documents/dcmi-terms/ • IMDI http://www.mpi.nl/IMDI/ • OLAC http://www.language-archives.org/ • TEI http://www.tei-c.org/index.xml • TITUS http://titus.uni-frankfurt.de


