
Competing demands of size, speed, and annotation with historical corpora

Competing demands of size, speed, and annotation with historical corpora Mark Davies Brigham Young University http://davies-linguistics.byu.edu Digital Historical Corpora Dagstuhl, Dec 2006. Outline. Relational database architecture Examples of annotation via relational databases


Presentation Transcript


  1. Competing demands of size, speed, and annotation with historical corpora. Mark Davies, Brigham Young University, http://davies-linguistics.byu.edu. Digital Historical Corpora, Dagstuhl, Dec 2006

  2. Outline • Relational database architecture • Examples of annotation via relational databases • VIEW/BNC (100 million words) • Corpus del Español (2001-02, 100 million words) • Corpus of Historical English (OED, 37 million) • Corpus do Português (2004-06, 45 million) • Proposed 200 million word corpus of historical English

  3. Relational databases. Two tables: SEQUENTIAL WORDS (seqWords) and DOCUMENT-LEVEL METADATA. Target the “weakest” slot first, e.g. for [AJ*] [ship], search [ship] first (1) and [AJ*] second (2).
     SQL 1: insert into temptable (ID) select a.ID from seqwords as a, metadata as b where a.lemma = 'ship' and b.date between 1620 and 1640 and b.genre = 'fict_adv' and a.textID = b.textID
     SQL 2: select count(*), a.lemma from seqwords as a, temptable as b where <condition> group by a.lemma, with <condition> one of:
     • ADJ ship: (a.ID = b.ID - 1 and a.POS like 'aj%')
     • ship near ADJ: (a.ID between b.ID - 5 and b.ID + 5 and a.POS like 'nn%')
     • KWIC for ship: (a.ID between b.ID - 20 and b.ID + 20)
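The two-step query on slide 3 can be tried out on a toy corpus. The following is a minimal sketch using Python's built-in sqlite3 module, not the system behind the talk: the table and column names follow the slide, while the sample tokens, the `word` column, and the helper names `build_demo_corpus` and `adjectives_before` are assumptions made for illustration.

```python
import sqlite3


def build_demo_corpus():
    """Create an in-memory database with the two tables named on the slide.

    The rows are invented sample data (an assumption, not corpus content).
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table metadata (textID integer primary key, date integer, genre text);
        create table seqwords (ID integer primary key, textID integer,
                               word text, lemma text, pos text);
        create table temptable (ID integer primary key);
        """
    )
    conn.executemany(
        "insert into metadata values (?, ?, ?)",
        [(1, 1630, "fict_adv"), (2, 1750, "fict_adv")],
    )
    tokens = [
        (1, "the", "the", "at0"), (1, "tall", "tall", "aj0"), (1, "ships", "ship", "nn2"),
        (1, "a", "a", "at0"), (1, "tall", "tall", "aj0"), (1, "ship", "ship", "nn1"),
        (1, "old", "old", "aj0"), (1, "ship", "ship", "nn1"),
        (2, "great", "great", "aj0"), (2, "ship", "ship", "nn1"),
    ]
    # ID is assigned automatically, so tokens are numbered 1..10 in sequence.
    conn.executemany(
        "insert into seqwords (textID, word, lemma, pos) values (?, ?, ?, ?)", tokens
    )
    return conn


def adjectives_before(conn, lemma, start, end, genre):
    """Step 1: store positions of the node lemma; step 2: count adjectives at ID - 1."""
    conn.execute("delete from temptable")
    conn.execute(
        """
        insert into temptable (ID)
        select a.ID from seqwords as a, metadata as b
        where a.lemma = ? and b.date between ? and ?
          and b.genre = ? and a.textID = b.textID
        """,
        (lemma, start, end, genre),
    )
    rows = conn.execute(
        """
        select count(*) as n, a.lemma from seqwords as a, temptable as b
        where a.ID = b.ID - 1 and a.pos like 'aj%'
        group by a.lemma
        order by n desc, a.lemma
        """
    )
    return rows.fetchall()
```

Calling `adjectives_before(build_demo_corpus(), "ship", 1620, 1640, "fict_adv")` returns (count, lemma) pairs for adjectives immediately preceding "ship" in the 1620-1640 adventure fiction texts only. Widening the offset condition to a `between` range gives the "near" and KWIC variants.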

  4. Relational databases Disadvantage: • Hierarchical information Advantages: • Speed: (clustered) indexes; < 1-2 sec, 100 mw • Annotation: unlimited columns, links to other databases (e.g. semantic); no performance hit • Size: 200m words nearly same speed as 10m • Not just searching for “known” strings

  5. Lemmatization (historical) • Use frequency to target the most frequent forms not in the modern lexicon • Using KWIC, enter the modern equivalent (3 students, one summer, 100,000 words; 99.8% of all tokens, 1300s-1800s) • Spelling replacements: iustiça / justiça ‘justice’ • Can also use collocational info (a iustiça ‘the justice’) • Use frequency to see which lemmas are overly frequent in older periods (e.g. 1600s > 2x 1900s, but cavaleiro ‘knight’, Deus ‘God’, etc) • Check forms of the top n lemmas in each century: non-modern forms highlighted; e.g. Pt avião (haver ‘to have Ved’) as ‘airplane’ (??) in Old Portuguese • Can also check top n words for any POS (vvp_1s, etc) • Also POS, e.g. um/uma __ PREP/que ‘a __ PREP/that’
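The first and third bullets of slide 5 (target frequent forms missing from the modern lexicon, then try spelling replacements) can be sketched as follows. This is an illustrative sketch only: the function names, the tiny lexicon, and the replacement rules (i→j, u→v, y→i) are assumptions, not the rules actually used for the corpora.

```python
from collections import Counter


def unknown_forms_by_frequency(tokens, modern_lexicon, top_n=10):
    """Return the most frequent (form, count) pairs whose form is not in the modern lexicon."""
    counts = Counter(t.lower() for t in tokens)
    unknown = [(form, n) for form, n in counts.most_common() if form not in modern_lexicon]
    return unknown[:top_n]


def suggest_modern(form, modern_lexicon,
                   replacements=(("i", "j"), ("u", "v"), ("y", "i"))):
    """Apply one hypothetical spelling rule at one position; return the first lexicon hit.

    Returns None when no single replacement yields a modern form, in which
    case the form would be left for manual review via KWIC.
    """
    for old, new in replacements:
        for i in range(len(form)):
            if form.startswith(old, i):
                candidate = form[:i] + new + form[i + len(old):]
                if candidate in modern_lexicon:
                    return candidate
    return None
```

For example, with a lexicon containing `justiça`, `suggest_modern("iustiça", lexicon)` proposes `justiça`. Suggestions of this kind would still need a human check, as the avião example on the slide shows.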

  6. VIEW / BNC: view.byu.edu (British National Corpus; 100m words)

  7. VIEW / BNC features http://view.byu.edu • Search for any substring, word, phrase, POS: *hood, green [nn*], [vb*] [vvn] (=passive) • See chart by genre / sub-genre: whom by (sub-)genre • Compare collocates of words: {utter/sheer/total} [nn*]; utter dejection, sheer luck, total population • Limit by genre: [AJ*] in tabloid vs broadsheet, [vvi] in FICT vs ACAD; white [NN*] in NEWS vs ACAD • Collocates: 10 words L / R, compare collocates by register (chair in FICT/ACAD), compare collocates of two words ([AJ*] near man/woman) • Integration with WordNet: [<walk].[v*] frequency of more specific verbs related to walk; [=eat] the [<food] • Fast (less than 1-2 seconds for nearly all queries) • Free

  8. Corpus del Español (www.corpusdelespanol.org) ; 100m words 1200s-1900s

  9. Corpus of Historical English: (view.byu.edu/che); 37m words, OE-PDE

  10. Corpus of Historical English: Size and distribution (cf. Helsinki Corpus, 1.6 million words)

  11. Simple frequency of word or phrase over time: turn on

  12. Spelling changes: vn* (chart)

  13. Spelling changes (table): rank-frequency listing by 1500s (display = “per million”)

  14. Chart display (by century): to * [davies:up] [CUSTOMIZED LISTS]

  15. Table display (1500s): to * [davies:up] [CUSTOMIZED LISTS]

  16. Morphology: *ly (+1900s -1800s -1700s)

  17. Morphology/lexicon: word roots: *light* (+1900s -1600s)

  18. Lexicon: New words in 1900s: * (+1900s -1800s)
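The period-comparison queries on slides 16-18 can be approximated in a few lines. The sketch below assumes that `+1900s -1800s` means "attested in the 1900s but absent from the 1800s"; the actual interface may instead compare relative frequencies. The function names and the normalization to a per-million rate are illustrative assumptions.

```python
from collections import Counter


def per_million(count, period_size):
    """Normalize a raw count to a rate per million tokens in the period."""
    return count * 1_000_000 / period_size


def new_in_period(tokens_by_period, plus, minus):
    """List (word, per-million rate) for words found in `plus` but never in `minus`.

    tokens_by_period maps a period label (e.g. "1900s") to its list of tokens.
    Results are sorted by descending rate, then alphabetically.
    """
    plus_counts = Counter(tokens_by_period[plus])
    minus_forms = set(tokens_by_period[minus])
    size = len(tokens_by_period[plus])
    rows = [
        (word, per_million(n, size))
        for word, n in plus_counts.items()
        if word not in minus_forms
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Normalizing by period size matters because the centuries are unevenly represented in the corpus, so raw counts alone would favor the larger periods.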

  19. Relevancy / Z-score like listing: * seat

  20. Lexical bundles: * * (+1900s -1800s)

  21. Semantics (collocates): hard * (+1900s -1500s)

  22. Semantics (collocates): * meat (+1900s -1500s)

  23. Semantics (wide-range collocates): market [5L/5R] (+1900s -1600s)
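A wide-range collocate search such as `market [5L/5R]` amounts to counting every token within a fixed window around each occurrence of the node word. The sketch below shows that idea over a plain token list; the function name `collocates` and its parameters are assumptions for illustration, and the real system works over database rows rather than Python lists.

```python
from collections import Counter


def collocates(tokens, node, left=5, right=5):
    """Count tokens occurring within `left`/`right` positions of each node occurrence.

    The node position itself is skipped; other occurrences of the node word
    that fall inside a window are counted like any other token.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts
```

Running this separately on the tokens of two periods and comparing the resulting counts gives a rough equivalent of the `(+1900s -1600s)` collocate comparison.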

  24. More information….

  25. Corpus do Português: 45m words, 1200s-1900s: www.corpusdoportugues.org

  26. Frequency tables and chart displays (cujo ‘whose’)

  27. Frequency tables and chart displays (difícil.* de [vr*] ‘hard to VVI’)

  28. Frequency tables and chart displays (difícil.* de [vr*] ‘hard to VVI’) [1700s]

  29. Lemmatized: fazer.* (fazer ‘to make, do’): +1300s-1600s, -1900s

  30. Tagged for part of speech: mulheres [aj*] (ADJ women) (sorted by 1800s)

  31. Collocates: mulheres // [aj:fs] (ADJ women) (sorted by 1900s)

  32. Collocates: mulheres // [aj:fs] (ADJ women) (1900s vs. 1800s)

  33. Collocates: cadeia // [nn*] (ADJ string) (FICT vs. ACAD)

  34. Word comparisons: [nn*] {agudo/aguçado/afilado} ‘sharp N’

  35. Synonyms: [=gritar]: synonyms of ‘to shout’

  36. Synonyms: [=falar]: synonyms of ‘to speak’; comparison 1800s vs 1900s

  37. Customized lists: estou + emotions (I am + emotions): 1900s

  38. Customized lists: User-created lists (can correct our “errors”): sair ‘to leave’

  39. Proposed 200 million word historical corpus / NEH, March 2007

  40. Relational databases Advantages: • Greatly facilitates the annotation of earlier texts • Speed: (clustered) indexes • Annotation: unlimited columns (e.g. original, modern, lemma, POS), other databases (e.g. thesauruses, personal lists); no performance hit • Size: 200m words nearly same speed as 10m • Not just searching for “known” strings (any old program can do that); include frequency in query; useful for genre variation and historical change

  41. Corpora: created and proposed
