Competing demands of size, speed, and annotation with historical corpora

Mark Davies, Brigham Young University, http://davies-linguistics.byu.edu
Digital Historical Corpora, Dagstuhl, Dec 2006

Outline
• Relational database architecture
• Examples of annotation via relational databases
• VIEW/BNC (100 million words)
• Corpus del Español (2001-02, 100 million words)
• Corpus of Historical English (OED, 37 million)
• Corpus do Português (2004-06, 45 million)
• Proposed 200 million word corpus of historical English

Relational databases
Tables: SEQUENTIAL WORDS (seqWords); DOCUMENT-LEVEL METADATA
Target “weakest” slot first, e.g. [AJ*] (2nd) [ship] (1st)
SQL 1:
    insert into temptable (ID)
    select a.ID from seqwords as a, metadata as b
    where a.lemma = 'ship' and b.date between 1620 and 1640
      and b.genre = 'fict_adv' and a.textID = b.textID
SQL 2:
    select count(*), a.lemma from seqwords as a, temptable as b
    where (one of the conditions below)
    group by a.lemma
Conditions for SQL 2:
    ADJ ship:       a.ID = b.ID - 1 and a.POS like 'aj%'
    ship near ADJ:  a.ID between b.ID - 5 and b.ID + 5 and a.POS like 'nn%'
    KWIC for ship:  a.ID between b.ID - 20 and b.ID + 20
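The two-step query on this slide can be tried end to end on a toy database. The following is a minimal sketch, assuming SQLite (via Python's `sqlite3`) as a stand-in for whatever database engine the corpora actually run on; the table and column names follow the slide, while the sample tokens, tags, and the second metadata row are invented for illustration.

```python
import sqlite3

# In-memory toy database; the schema mirrors the slide's seqwords / metadata tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table metadata (textID integer primary key, date integer, genre text);
    create table seqwords (ID integer primary key, textID integer,
                           word text, lemma text, POS text);
    create table temptable (ID integer primary key);
""")

# Invented sample data: one text from 1630, one from 1850.
conn.executemany("insert into metadata values (?, ?, ?)",
                 [(1, 1630, "fict_adv"), (2, 1850, "fict_adv")])
tokens = [  # (textID, word, lemma, POS)
    (1, "the", "the", "at0"), (1, "great", "great", "aj0"), (1, "ships", "ship", "nn2"),
    (1, "sailed", "sail", "vvd"),
    (1, "a", "a", "at0"), (1, "great", "great", "aj0"), (1, "ship", "ship", "nn1"),
    (1, "a", "a", "at0"), (1, "small", "small", "aj0"), (1, "ship", "ship", "nn1"),
    (2, "the", "the", "at0"), (2, "old", "old", "aj0"), (2, "ship", "ship", "nn1"),
]
# IDs are assigned sequentially so that ID - 1 is always the preceding token.
conn.executemany("insert into seqwords values (?, ?, ?, ?, ?)",
                 [(i, *tok) for i, tok in enumerate(tokens, start=1)])

# SQL 1: store the positions of the slot targeted first (lemma 'ship'),
# restricted by document-level metadata.
conn.execute("""
    insert into temptable (ID)
    select a.ID from seqwords as a, metadata as b
    where a.lemma = 'ship' and b.date between 1620 and 1640
      and b.genre = 'fict_adv' and a.textID = b.textID
""")

# SQL 2 ("ADJ ship"): join back to seqwords by position offset and count lemmas.
adj_before_ship = conn.execute("""
    select a.lemma, count(*) from seqwords as a, temptable as b
    where a.ID = b.ID - 1 and a.POS like 'aj%'
    group by a.lemma
    order by count(*) desc, a.lemma
""").fetchall()

print(adj_before_ship)
```

The 1850 text is excluded by the date filter in the first statement, so "old ship" does not contribute to the counts. Swapping the `where` clause of the second statement for one of the other conditions on the slide gives the nearby-word and KWIC variants.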

Relational databases
Disadvantage:
• Hierarchical information
Advantages:
• Speed: (clustered) indexes; under 1-2 seconds on 100 million words
• Annotation: unlimited columns, links to other databases (e.g. semantic); no performance hit
• Size: 200m words nearly same speed as 10m
• Not just searching for “known” strings

Lemmatization (historical)
• Use frequency to target most frequent forms not in modern lexicon
• Using KWIC, enter modern equivalent (3 students, one summer, 100,000 words; 99.8% of all tokens, 1300s-1800s)
• Spelling replacements: iustiça / justiça ‘justice’
• Can also use collocational info (a iustiça ‘the justice’)
• Use frequency to see which lemmas are overly frequent in older periods (e.g. 1600s > 2x 1900s, but cavaleiro ‘knight’, Deus ‘God’, etc.)
• Check forms of top n lemmas in each century: non-modern forms highlighted; e.g. Pt avião (haver ‘to have Ved’) as ‘airplane’ (??) in Old Portuguese
• Can also check top n words for any POS (vvp_1s, etc.)
• Also POS, e.g. um/uma __ PREP/que ‘a __ PREP/that’
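The first and third bullets above (rank unknown forms by frequency, then try spelling replacements) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the replacement rules (`i`→`j`, `u`→`v`), the token list, and the tiny "modern lexicon" are all hypothetical and are not the procedure actually used for the corpus.

```python
from collections import Counter

# Hypothetical replacement rules; a real project would curate a longer list.
REPLACEMENTS = [("i", "j"), ("u", "v")]

def unknown_forms_by_frequency(tokens, modern_lexicon, top_n=10):
    """Return the most frequent forms that are missing from the modern lexicon."""
    counts = Counter(t.lower() for t in tokens)
    unknown = [(form, n) for form, n in counts.most_common()
               if form not in modern_lexicon]
    return unknown[:top_n]

def suggest_modern(form, modern_lexicon, replacements=REPLACEMENTS):
    """Try each single-position replacement; return the first candidate found
    in the modern lexicon, or None if no rule produces a known form."""
    for old, new in replacements:
        for i in range(len(form)):
            if form.startswith(old, i):
                candidate = form[:i] + new + form[i + len(old):]
                if candidate in modern_lexicon:
                    return candidate
    return None

# Toy data, invented for illustration only.
modern_lexicon = {"a", "do", "rei", "grande", "e", "cavaleiro", "justiça"}
tokens = ("a iustiça do rei he grande e a iustiça he do "
          "cavaleiro iustiça").split()

targets = unknown_forms_by_frequency(tokens, modern_lexicon)
suggestions = {form: suggest_modern(form, modern_lexicon) for form, _ in targets}
print(targets)
print(suggestions)
```

Forms for which no rule fires (here `he`) are the ones that would still need a modern equivalent entered by hand from the KWIC display.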

VIEW / BNC: view.byu.edu (British National Corpus; 100m words)

VIEW / BNC features (http://view.byu.edu)
• Search for any substring, word, phrase, POS: *hood, green [nn*], [vb*] [vvn] (= passive)
• See chart by genre / sub-genre: whom by (sub-)genre
• Compare collocates of words: {utter/sheer/total} [nn*]; utter dejection, sheer luck, total population
• Limit by genre: [AJ*] in tabloid vs broadsheet; [vvi] in FICT vs ACAD; white [NN*] in NEWS vs ACAD
• Collocates: 10 words L / R; compare collocates by register (chair in FICT/ACAD); compare collocates of two words ([AJ*] near man/woman)
• Integration with WordNet: [<walk].[v*] gives the frequency of more specific verbs related to walk; [=eat] the [<food]
• Fast (less than 1-2 seconds for nearly all queries)
• Free

Corpus del Español (www.corpusdelespanol.org); 100m words, 1200s-1900s

Corpus of Historical English (view.byu.edu/che); 37m words, OE-PDE

Corpus of Historical English: Size and distribution (cf. Helsinki Corpus, 1.6 million words)

Simple frequency of word or phrase over time: turn on

Spelling changes: vn* (chart)

Spelling changes (table): rank-frequency listing by 1500s (display = “per million”)

Chart display (by century): to * [davies:up] [CUSTOMIZED LISTS]

Table display (1500s): to * [davies:up] [CUSTOMIZED LISTS]

Morphology: *ly (+1900s -1800s -1700s)

Morphology/lexicon: word roots: *light* (+1900s -1600s)

Lexicon: New words in 1900s: * (+1900s -1800s)
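A period comparison such as `(+1900s -1800s)` can be approximated by normalizing counts to a per-million rate and keeping words that are clearly more frequent in the "plus" period. The sketch below is an assumption about what such a comparison involves, not the site's actual algorithm: the function names, the ratio threshold, the period sizes, and the counts are all invented for illustration.

```python
def per_million(count, period_size):
    """Normalize a raw count to a rate per million words."""
    return count * 1_000_000 / period_size

def favoured_in(counts, sizes, plus, minus, min_ratio=2.0):
    """Return (word, plus_rate, minus_rate) for words that occur in the `plus`
    period and are either absent from the `minus` period or at least
    `min_ratio` times as frequent per million words."""
    rows = []
    for word, by_period in counts.items():
        p = per_million(by_period.get(plus, 0), sizes[plus])
        m = per_million(by_period.get(minus, 0), sizes[minus])
        if p > 0 and (m == 0 or p / m >= min_ratio):
            rows.append((word, p, m))
    # Most frequent in the `plus` period first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# Toy figures, invented for illustration only.
sizes = {"1800s": 2_000_000, "1900s": 4_000_000}
counts = {
    "television": {"1900s": 400},
    "seat": {"1800s": 100, "1900s": 200},
    "market": {"1800s": 20, "1900s": 200},
}

result = favoured_in(counts, sizes, plus="1900s", minus="1800s")
for word, p, m in result:
    print(f"{word}: {p:.1f} vs {m:.1f} per million")
```

Normalizing first matters because the periods differ in size: `seat` has twice the raw count in the 1900s here but the same per-million rate, so it is not reported.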

Relevancy / Z-score like listing: * seat

Lexical bundles: * * (+1900s -1800s)

Semantics (collocates): hard * (+1900s -1500s)

Semantics (collocates): * meat (+1900s -1500s)

Semantics (wide-range collocates): market [5L/5R] (+1900s -1600s)

More information….

Corpus do Português: 45m words, 1200s-1900s: www.corpusdoportugues.org

Frequency tables and chart displays (cujo ‘whose’)

Frequency tables and chart displays (difícil.* de [vr*] ‘hard to VVI’)

Frequency tables and chart displays (difícil.* de [vr*] ‘hard to VVI’) [1700s]

Lemmatized: fazer.* (fazer ‘to make, do’): +1300s-1600s, -1900s

Tagged for part of speech: mulheres [aj*] (ADJ women) (sorted by 1800s)

Collocates: mulheres // [aj:fs] (ADJ women) (sorted by 1900s)

Collocates: mulheres // [aj:fs] (ADJ women) (1900s vs. 1800s)

Collocates: cadeia // [nn*] (ADJ string) (FICT vs. ACAD)

Word comparisons: [nn*] {agudo/aguçado/afilado} ‘sharp N’

Synonyms: [=gritar]: synonyms of ‘to shout’

Synonyms: [=falar]: synonyms of ‘to speak’; comparison 1800s vs 1900s

Customized lists: estou + emotions (I am + emotions): 1900s

Customized lists: User-created lists (can correct our “errors”): sair ‘to leave’

Proposed 200 million word historical corpus / NEH, March 2007

Relational databases
Advantages:
• Greatly facilitates the annotation of earlier texts
• Speed: (clustered) indexes
• Annotation: unlimited columns (e.g. original, modern, lemma, POS), other databases (e.g. thesauruses, personal lists); no performance hit
• Size: 200m words nearly same speed as 10m
• Not just searching for “known” strings (any old program can do that); include frequency in query; useful for genre variation and historical change
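The annotation bullet above (extra columns per token plus links to other tables) can be illustrated with a small example. This is a sketch under assumptions: SQLite stands in for the real database engine, an ordinary index stands in for the clustered indexes mentioned on the slide, and the `synonyms` table, its contents, and the historical spellings are invented toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per token, with one column per annotation layer.
    create table seqwords (ID integer primary key, original text,
                           modern text, lemma text, POS text);
    -- A separate, hypothetical thesaurus table linked by lemma.
    create table synonyms (lemma text, synonym text);
    create index idx_seqwords_lemma on seqwords (lemma);
""")

# Invented toy tokens: original spelling, modernized form, lemma, POS tag.
conn.executemany("insert into seqwords values (?, ?, ?, ?, ?)", [
    (1, "he", "he", "he", "pnp"),
    (2, "cryed", "cried", "cry", "vvd"),
    (3, "and", "and", "and", "cjc"),
    (4, "shouted", "shouted", "shout", "vvd"),
    (5, "shee", "she", "she", "pnp"),
    (6, "cryes", "cries", "cry", "vvz"),
])
conn.executemany("insert into synonyms values (?, ?)", [
    ("shout", "shout"), ("shout", "cry"), ("shout", "yell"),
])

# Synonym-style query: frequency of every verb listed as a synonym of 'shout'.
synonym_freq = conn.execute("""
    select a.lemma, count(*) from seqwords as a, synonyms as s
    where s.lemma = 'shout' and a.lemma = s.synonym and a.POS like 'v%'
    group by a.lemma
    order by count(*) desc, a.lemma
""").fetchall()

# The same table answers spelling questions: which original forms map to 'cried'?
originals = [row[0] for row in conn.execute(
    "select original from seqwords where modern = 'cried'")]

print(synonym_freq, originals)
```

Adding another annotation layer means adding a column or another linked table, and the existing queries are unaffected.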

Corpora: created and proposed
