Competing demands of size, speed, and annotation with historical corpora
Mark Davies, Brigham Young University
http://davies-linguistics.byu.edu
Digital Historical Corpora, Dagstuhl, Dec 2006
Outline
• Relational database architecture
• Examples of annotation via relational databases
• VIEW/BNC (100 million words)
• Corpus del Español (2001-02, 100 million words)
• Corpus of Historical English (OED, 37 million)
• Corpus do Português (2004-06, 45 million)
• Proposed 200 million word corpus of historical English
Relational databases
Tables: SEQUENTIAL WORDS (seqWords) and DOCUMENT-LEVEL METADATA (metadata)
Target the “weakest” slot first, e.g. in [AJ*] [ship] the slot [ship] is searched first (1) and [AJ*] second (2)
SQL 1:
    insert into temptable (ID)
    select a.ID from seqwords as a, metadata as b
    where a.lemma = 'ship' and b.date between 1620 and 1640
      and b.genre = 'fict_adv' and a.textID = b.textID
SQL 2:
    select count(*), a.lemma from seqwords as a, temptable as b
    where <condition>
    group by a.lemma
where <condition> is one of:
• ADJ ship: (a.ID = b.ID - 1 and a.POS like 'aj%')
• ship near ADJ: (a.ID between b.ID - 5 and b.ID + 5 and a.POS like 'nn%')
• KWIC for ship: (a.ID between b.ID - 20 and b.ID + 20)
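The two-step query on this slide can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not the corpus's actual implementation: the table and column names follow the slide, but the column types, the toy rows, and the helper names `build_db` and `adjectives_before_ship` are assumptions made for illustration.

```python
import sqlite3

# Toy rows, invented for illustration: (ID, textID, word, lemma, POS)
TOKENS = [
    (1, 1, "the", "the", "at0"),
    (2, 1, "tall", "tall", "aj0"),
    (3, 1, "ships", "ship", "nn2"),
    (4, 1, "sailed", "sail", "vvd"),
    (5, 2, "a", "a", "at0"),
    (6, 2, "tall", "tall", "aj0"),
    (7, 2, "ship", "ship", "nn1"),
    (8, 3, "old", "old", "aj0"),
    (9, 3, "ship", "ship", "nn1"),
]
# (textID, date, genre); text 3 lies outside the 1620-1640 range
TEXTS = [(1, 1625, "fict_adv"), (2, 1638, "fict_adv"), (3, 1750, "fict_adv")]


def build_db():
    """Create an in-memory database with the two tables named on the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table seqwords (ID integer primary key, textID integer, "
        "word text, lemma text, POS text)"
    )
    conn.execute(
        "create table metadata (textID integer primary key, date integer, genre text)"
    )
    conn.executemany("insert into seqwords values (?, ?, ?, ?, ?)", TOKENS)
    conn.executemany("insert into metadata values (?, ?, ?)", TEXTS)
    return conn


def adjectives_before_ship(conn):
    """Run SQL 1 and SQL 2 for the query [AJ*] [ship]."""
    cur = conn.cursor()
    cur.execute("create temp table temptable (ID integer primary key)")
    # SQL 1: store the positions of the anchor slot, limited by metadata.
    cur.execute(
        """insert into temptable (ID)
           select a.ID from seqwords as a, metadata as b
           where a.lemma = 'ship' and b.date between 1620 and 1640
             and b.genre = 'fict_adv' and a.textID = b.textID"""
    )
    # SQL 2: join back to the token table at offset -1 and count adjectives.
    cur.execute(
        """select a.lemma, count(*) from seqwords as a, temptable as b
           where a.ID = b.ID - 1 and a.POS like 'aj%'
           group by a.lemma order by count(*) desc"""
    )
    return cur.fetchall()


result = adjectives_before_ship(build_db())
print(result)
```

Because the anchor positions are found first, the second query only touches rows next to those positions, which is the point of targeting the rarest slot before the others.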
Relational databases
Disadvantage:
• Hierarchical information
Advantages:
• Speed: (clustered) indexes; under 1-2 seconds on 100 million words
• Annotation: unlimited columns, links to other databases (e.g. semantic); no performance hit
• Size: 200m words nearly the same speed as 10m
• Not just searching for “known” strings
Lemmatization (historical)
• Use frequency to target the most frequent forms not in the modern lexicon
• Using KWIC, enter the modern equivalent (3 students, one summer, 100,000 words; 99.8% of all tokens, 1300s-1800s)
• Spelling replacements: iustiça / justiça ‘justice’
• Can also use collocational info (a iustiça ‘the justice’)
• Use frequency to see which lemmas are overly frequent in older periods (e.g. 1600s > 2x 1900s; but cf. cavaleiro ‘knight’, Deus ‘God’, etc.)
• Check forms of the top n lemmas in each century, with non-modern forms highlighted; e.g. Pt avião (a form of haver ‘to have Ved’) lemmatized as ‘airplane’ (??) in Old Portuguese
• Can also check the top n words for any POS (vvp_1s, etc.)
• Can also use POS context, e.g. um/uma __ PREP/que ‘a __ PREP/that’
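The first steps of this workflow (rank unknown forms by frequency, map them to modern equivalents, measure token coverage) can be sketched as follows. The function names, the tiny lexicon, and the sample sentence are hypothetical; only the iustiça / justiça pair comes from the slide.

```python
from collections import Counter


def unknown_forms_by_frequency(tokens, modern_lexicon):
    """Return (form, count) pairs for forms missing from the lexicon, most frequent first."""
    counts = Counter(t.lower() for t in tokens)
    return [(form, n) for form, n in counts.most_common() if form not in modern_lexicon]


def modernize(tokens, replacements):
    """Replace each old form with its hand-entered modern equivalent, if there is one."""
    return [replacements.get(t.lower(), t.lower()) for t in tokens]


def coverage(tokens, modern_lexicon):
    """Share of tokens whose form is found in the modern lexicon."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() in modern_lexicon) / len(tokens)


# Invented sample text and lexicon, for illustration only.
TOKENS = "a iustiça do rei e a iustiça de Deus".split()
LEXICON = {"a", "do", "rei", "e", "de", "deus", "justiça"}

unknown = unknown_forms_by_frequency(TOKENS, LEXICON)
# A human annotator would supply this mapping after looking at KWIC lines.
modern_tokens = modernize(TOKENS, {"iustiça": "justiça"})
print(unknown, coverage(TOKENS, LEXICON), coverage(modern_tokens, LEXICON))
```

Working down the frequency list means each manual decision covers as many tokens as possible, which is how a small team can reach high token coverage quickly.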
VIEW / BNC: view.byu.edu (British National Corpus; 100m words)
VIEW / BNC features
http://view.byu.edu
• Search for any substring, word, phrase, POS: *hood, green [nn*], [vb*] [vvn] (= passive)
• See chart by genre / sub-genre: whom by (sub-)genre
• Compare collocates of words: {utter/sheer/total} [nn*]; utter dejection, sheer luck, total population
• Limit by genre: [AJ*] in tabloid vs broadsheet, [vvi] in FICT vs ACAD; white [NN*] in NEWS vs ACAD
• Collocates: 10 words L / R, compare collocates by register (chair in FICT/ACAD), compare collocates of two words ([AJ*] near man/woman)
• Integration with WordNet: [<walk].[v*] frequency of more specific verbs related to walk; [=eat] the [<food]
• Fast (less than 1-2 seconds for nearly all queries)
• Free
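One way to picture how such search strings could map onto a relational table is a small translator from query slots to SQL conditions. This is only a sketch under assumptions: the functions `slot_to_sql` and `query_to_sql`, the column names `word` and `POS`, and the aliases `w1`, `w2` are hypothetical, and the real interface may work quite differently. Only plain words, wildcards, and bracketed POS tags are handled; WordNet-style slots such as [=eat] are left out.

```python
def slot_to_sql(slot, alias="a"):
    """Translate one query slot into a (condition, parameters) pair."""
    if slot.startswith("[") and slot.endswith("]"):
        inner = slot[1:-1]
        if inner[:1] in ("<", "="):
            raise NotImplementedError("WordNet-style slots are outside this sketch")
        column, pattern = "POS", inner
    else:
        column, pattern = "word", slot
    if "*" in pattern:
        # The search wildcard * corresponds to % in a SQL LIKE pattern.
        return f"{alias}.{column} like ?", [pattern.replace("*", "%")]
    return f"{alias}.{column} = ?", [pattern]


def query_to_sql(query):
    """Translate a whitespace-separated query, one table alias per slot."""
    return [
        slot_to_sql(slot, alias=f"w{i}")
        for i, slot in enumerate(query.split(), start=1)
    ]


print(slot_to_sql("*hood"))
print(query_to_sql("green [nn*]"))
```

Each slot becomes a condition on its own alias of the token table, so a multi-word phrase turns into a self-join on consecutive IDs.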
Corpus del Español (www.corpusdelespanol.org); 100m words, 1200s-1900s
Corpus of Historical English: (view.byu.edu/che); 37m words, OE-PDE
Corpus of Historical English: Size and distribution (cf. Helsinki Corpus, 1.6 million words)
Spelling changes (table): rank-frequency listing by 1500s (display = “per million”)
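A “per million” display divides each raw count by the size of its period and multiplies by one million, so that periods of different sizes can be compared. The sketch below shows the arithmetic and a ranking by one period; the spelling variants, counts, and period sizes are invented for illustration, and the function names are hypothetical.

```python
def per_million(count, period_size):
    """Normalize a raw count to occurrences per million words."""
    if period_size <= 0:
        raise ValueError("period_size must be positive")
    return count * 1_000_000 / period_size


def rank_for_period(counts, sizes, period):
    """Rank forms by normalized frequency in one period, highest first."""
    rows = [
        (form, per_million(by_period.get(period, 0), sizes[period]))
        for form, by_period in counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented counts for two spelling variants and invented period sizes.
COUNTS = {
    "vp": {"1500s": 120, "1900s": 10},
    "up": {"1500s": 60, "1900s": 400},
}
SIZES = {"1500s": 2_000_000, "1900s": 10_000_000}

print(rank_for_period(COUNTS, SIZES, "1500s"))
print(rank_for_period(COUNTS, SIZES, "1900s"))
```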
Chart display (by century): to * [davies:up] [CUSTOMIZED LISTS]
Semantics (wide-range collocates): market [5L/5R] (+1900s -1600s)
Corpus do Português: 45m words, 1200s-1900s: www.corpusdoportugues.org
Frequency tables and chart displays (difícil.* de [vr*] ‘hard to VVI’)
Frequency tables and chart displays (difícil.* de [vr*] ‘hard to VVI’) [1700s]
Lemmatized: fazer.* (fazer ‘to make, do’): +1300s-1600s, -1900s
Tagged for part of speech: mulheres [aj*] (ADJ women) (sorted by 1800s)
Collocates: mulheres // [aj:fs] (ADJ women) (sorted by 1900s)
Collocates: mulheres // [aj:fs] (ADJ women) (1900s vs. 1800s)
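A period-by-period collocate comparison of this kind can be sketched in plain Python. The tagged tokens, the simplified tags, and the function names below are assumptions for illustration, not data from the corpus; the counts are raw, so a real comparison would also normalize by period size.

```python
from collections import Counter


def collocates(tokens, node, pos_prefix, left=5, right=5):
    """Count words with a given POS prefix within a window around each node word.

    tokens is a list of (word, pos) pairs.
    """
    counts = Counter()
    for i, (word, _) in enumerate(tokens):
        if word != node:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1 : i + 1 + right]
        counts.update(w for w, pos in window if pos.startswith(pos_prefix))
    return counts


def only_in_first(counts_a, counts_b):
    """Collocates attested in the first period but not in the second."""
    return sorted(set(counts_a) - set(counts_b))


# Invented, simplified tagged text for two periods.
P1800 = [("as", "at"), ("mulheres", "nn"), ("virtuosas", "aj"), ("e", "cc"),
         ("as", "at"), ("mulheres", "nn"), ("honestas", "aj")]
P1900 = [("as", "at"), ("mulheres", "nn"), ("modernas", "aj"), ("e", "cc"),
         ("mulheres", "nn"), ("honestas", "aj")]

c1800 = collocates(P1800, "mulheres", "aj", left=0, right=1)
c1900 = collocates(P1900, "mulheres", "aj", left=0, right=1)
print(only_in_first(c1900, c1800), only_in_first(c1800, c1900))
```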
Synonyms: [=falar]: synonyms of ‘to speak’; comparison 1800s vs 1900s
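A synonym search fits the relational model naturally: a separate synonyms table is joined to the token table, and frequencies are grouped by period. The sketch below uses Python's built-in sqlite3 module; the `synonyms` table, the `century` column stored directly on each token row, the listed synonyms, and all rows are assumptions made for illustration.

```python
import sqlite3


def build_db():
    """Create toy token and synonym tables in memory (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table seqwords (ID integer primary key, lemma text, century text)")
    conn.execute("create table synonyms (lemma text, synonym text)")
    conn.executemany(
        "insert into seqwords values (?, ?, ?)",
        [(1, "falar", "1800s"), (2, "dizer", "1800s"), (3, "dizer", "1800s"),
         (4, "falar", "1900s"), (5, "conversar", "1900s"), (6, "casa", "1900s")],
    )
    conn.executemany(
        "insert into synonyms values (?, ?)",
        [("falar", "falar"), ("falar", "dizer"), ("falar", "conversar")],
    )
    return conn


def synonym_frequencies(conn, lemma):
    """Count each listed synonym of a lemma per century via a join."""
    return conn.execute(
        """select s.synonym, w.century, count(*)
           from synonyms as s join seqwords as w on w.lemma = s.synonym
           where s.lemma = ?
           group by s.synonym, w.century
           order by s.synonym, w.century""",
        (lemma,),
    ).fetchall()


rows = synonym_frequencies(build_db(), "falar")
print(rows)
```

The token table itself is untouched; the semantic information lives in its own table and is only consulted when a query asks for it.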
Customized lists: User-created lists (can correct our “errors”): sair ‘to leave’
Proposed 200 million word historical corpus / NEH, March 2007
Relational databases
Advantages:
• Greatly facilitates the annotation of earlier texts
• Speed: (clustered) indexes
• Annotation: unlimited columns (e.g. original, modern, lemma, POS), other databases (e.g. thesauruses, personal lists); no performance hit
• Size: 200m words nearly the same speed as 10m
• Not just searching for “known” strings (any old program can do that); include frequency in the query; useful for genre variation and historical change