
Comparable Corpora BootCaT (CCBC) (or: In Praise of BootCaT)

Adam Kilgarriff, Jan Pomikalek, Avinesh PVS. Lexical Computing Ltd. Work supported by EU FP7 Project PRESEMT.


Presentation Transcript


  1. Comparable Corpora BootCaT (CCBC) (or: In Praise of BootCaT) • Adam Kilgarriff, Jan Pomikalek, Avinesh PVS • Lexical Computing Ltd. • Work supported by EU FP7 Project PRESEMT

  2. Just-in-time corpora • Krista Varantola • Translators, terminologists • In-domain terminology: • Domain dictionaries • Don’t exist • Out of date • Not accessible • Collect in-domain web pages • Instant corpus

  3. BootCaT (Bootstrapping Corpora and Terms) • Baroni and Bernardini 2004 • User: input ‘seed terms’ • Send 3-at-a-time to a search engine • Returns search hits page • Retrieve those pages • A corpus! • Cleaning, deduplicating, linguistic processing • Extract terms • Can use extracted terms as seeds, iterate
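The loop on slide 3 can be sketched roughly as follows. This is a minimal illustration, not the BootCaT implementation: `build_queries` and `bootstrap` are hypothetical names, and `search`, `fetch`, and `extract_terms` are caller-supplied stand-ins for the search engine, the page downloader, and the term extractor. Cleaning and linguistic processing are left out.

```python
import itertools
import random


def build_queries(seeds, tuple_size=3, max_queries=10, rng=None):
    """Combine seed terms into search queries, tuple_size terms at a time."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    combos = list(itertools.combinations(seeds, tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_queries]]


def bootstrap(seeds, search, fetch, extract_terms, rounds=1):
    """Run BootCaT-style rounds: query, retrieve pages, extract new seeds.

    search(query) -> list of URLs, fetch(url) -> page text, and
    extract_terms(texts) -> list of terms are assumptions supplied by the caller.
    """
    corpus = {}
    for _ in range(rounds):
        for query in build_queries(seeds):
            for url in search(query):
                if url not in corpus:  # crude de-duplication by URL only
                    corpus[url] = fetch(url)
        # Extracted terms become the seeds for the next round.
        seeds = extract_terms(corpus.values())
    return corpus, seeds
```

With four seeds and `tuple_size=3`, `build_queries` yields four three-term queries; real seed lists are longer, so `max_queries` caps how many combinations are sent.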

  4. Very successful • Widely used • More implementations • SkE has WebBootCaT, web front end • Secret: • piggybacks on search engines • They do the donkey-work • on-domain, text-rich pages, no spam, …

  5. Also use for • General language corpus • Long list of general seed words • Pioneer: Sharoff • LCL: Corpus Factory • ‘Varieties of Learner English’ • General English, same queries except • Region=UK, US, Canada, Aus, China, Japan, Korea • Validation under way

  6. Sketch Engine

  7. Corpus query tool, since 2003 • Widely used by lexicographers • Commercial • OUP, CUP, Collins, Macmillan, Le Robert, Cornelsen, Shogakukan • National dictionary projects • Bulgaria, Czech Republic, Estonia, Netherlands, Slovakia, Slovenia • Universities • Linguistics, language research, NLP, language teaching

  8. 44 languages and counting • Large corpora ready-to-use for: Arabic, Bengali, Bulgarian, Chinese, Czech, Croatian, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Indonesian, Irish, Italian, Japanese, Korean, Latin, Malay, Malayalam, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Setswana, Slovak, Slovene, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Urdu, Vietnamese

  9. Handles large corpora • Largest to date: 8 billion words • Fast • Web-based: no software to install • Build ‘instant corpora’ from the web • Load your own corpus • Quota of space on SkE server • Word sketches • One-page, automatic accounts of a word’s grammatical and collocational behaviour • Free 30-day trial: sketchengine.co.uk

  10. Adam Kilgarriff Lexical Computing Ltd.

  11. WebBootCaT • BootCaT integrated in SkE • BootCaT a corpus • Clean, de-dupe, POS-tag, then • Load into Sketch Engine

  12. Observation • Specialist domain, L1 • Specialist domain, L2 • Matching terminology

  13. Going multilingual • Translate seeds • English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic • French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques • Thanks again Google • BootCaT for French

  14. CCBC • Input: L1, L1 seeds, L2 • Choose dictionary • Google as default • Google dictionary (25 language pairs, limited API) • Google translate (1225 language pairs, only 1 translation) • Option: edit translations • BootCaT 2 corpora • Bilingual word sketches

  15. Bilingual word sketches (very first pass) • For L1 nodeword n • For each of its translations n1, n2, … • For each collocate c in word sketch • For each of its translations c1, c2, … • Does ci occur as collocate in word sketch for ni? • If yes: output <c; ni, ci> • Add L1 and L2 example sentences
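The nested loops on slide 15 might look like this in code. It is a sketch under stated assumptions, not the Sketch Engine implementation: word sketches are reduced to plain collocate collections keyed by headword (grammatical relations already discarded), `translate` is a hypothetical dictionary lookup returning a list of translations, and the toy data is invented for illustration. In the code, `n_i` and `c_i` correspond to ni and ci on the slide.

```python
def bilingual_sketch(node, sketch_l1, sketch_l2, translate):
    """Return (c, n_i, c_i) triples where a translated collocate c_i of the
    L1 node word also appears in the L2 word sketch of the translation n_i."""
    matches = []
    for n_i in translate(node):                 # each translation of the node word
        l2_collocates = sketch_l2.get(n_i, set())
        for c in sketch_l1.get(node, ()):       # each L1 collocate
            for c_i in translate(c):            # each translation of that collocate
                if c_i in l2_collocates:
                    matches.append((c, n_i, c_i))
    return matches


# Toy, hand-made data (hypothetical, for illustration only).
sketch_en = {"eruption": ["volcanic", "violent"]}
sketch_fr = {"éruption": {"volcanique", "cutanée"}}
dictionary = {
    "eruption": ["éruption"],
    "volcanic": ["volcanique"],
    "violent": ["violent"],
}


def lookup(word):
    """Hypothetical dictionary lookup; unknown words have no translations."""
    return dictionary.get(word, [])


print(bilingual_sketch("eruption", sketch_en, sketch_fr, lookup))
```

What counts as being "in a word sketch" depends on a salience or frequency threshold, which this sketch leaves to whoever builds `sketch_l1` and `sketch_l2`.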

  16. Notes • Grammatical relations • Used to find collocations • Then thrown away • Thresholds: what is “in a word sketch” • Which dictionary • Issue: as for seeds • Live (just)
