
Corpus Linguistics with and for Computational Linguistics

This article provides an overview of corpus linguistics, including the definition of a corpus, its history, types of corpora, and their applications in computational linguistics. It also discusses the construction and annotation of corpora, as well as the differences between corpora and test suites. The article aims to help readers understand and analyze linguistic data using corpus linguistics.


  1. Corpus Linguistics with and for Computational Linguistics • Martin Volk, Universität Zürich, Eurospider Information Technology AG

  2. Sources for linguistic information • Introspection (own usage and judgement) • Usage and judgement by others • Questioning (goal-driven) • interview • questionnaire • Observation ('involuntary' utterances) • spoken utterances (→ corpora) • written utterances (→ corpora)

  3. What is a corpus? • a text collection • a representative text collection • a representative and structured text collection • a representative, structured and annotated text collection • ...

  4. Example: Is 'ob' used as a preposition in German? • Introspection • Rothenburg ob der Tauber • Dictionary (Wahrig. Deutsches Wörterbuch. 1996): Präp. mit Dativ; veraltet; ob dem Wasserfall • Web: Google 'ob dem' • Legend: Der Wilde Jäger ob dem Neuenburgersee • Corpus

  5. Corpus Examples • CZ94: ... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ... • CZ94: Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf. • CZ94: ... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs. • 'ob' is a preposition with genitive! • in the CZ corpus: 'ob' is tagged as a preposition 21 times (some of these tags are obviously incorrect)
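
A count like the one on this slide could be sketched as follows. This is only an illustration under assumptions: the corpus is taken to be stored as `word/TAG` tokens, the tags follow the STTS tagset (`APPR` for prepositions, `KOUS` for subordinating conjunctions), and the two sample sentences are invented rather than taken from the CZ corpus.

```python
from collections import Counter

# Hypothetical sample in "word/TAG" format; STTS tags are an assumption
# (APPR = preposition, KOUS = subordinating conjunction).
TAGGED = (
    "Er/PPER fiel/VVFIN vom/APPRART Stuhl/NN ob/APPR der/ART Äusserung/NN ./$.\n"
    "Sie/PPER fragt/VVFIN ,/$, ob/KOUS er/PPER kommt/VVFIN ./$.\n"
)

def tag_counts(tagged_text, word):
    """Count how often `word` occurs with each PoS tag."""
    counts = Counter()
    for token in tagged_text.split():
        # Split on the last slash so punctuation tokens like "./$." work too.
        form, _, tag = token.rpartition("/")
        if form.lower() == word:
            counts[tag] += 1
    return counts

counts = tag_counts(TAGGED, "ob")
print(counts)
```

Listing the sentences behind the `APPR` count, rather than only the number, is what would let a reader spot the incorrectly tagged cases.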

  6. History of Corpus Linguistics • collections of text were widely used in the 19th century and in the first half of the 20th century • language acquisition • orthography (letter frequency) • field linguistics → American Structuralism (influential until 1960)

  7. History of Corpus Linguistics • Chomsky's criticism: Speakers produce and understand infinitely many new sentences/words. • therefore the new research goal is: to describe the underlying language faculty of a speaker (= universal grammar), competence rather than performance

  8. History of Corpus Linguistics • Chomsky's criticism: every collection of texts is a collection of performance data and so many factors contribute to it that it cannot be used to model competence. • A corpus is necessarily skewed. Some sentences won't occur because they are obvious, false or impolite.

  9. History of Corpus Linguistics • Theoretical linguistics: competence (what is grammatical?); introspection; indefinitely many types, productivity; grammatical vs. ungrammatical • Corpus linguistics: performance (what is attested?); instances; finite number of types; degrees of grammaticality

  10. Corpus research in Linguistics • Lexicography (Dictionaries) • Grammaticography (Reference grammars) • Learner corpora: Language acquisition • Parallel corpora: Translation

  11. Construction of Corpora • Written text is easier to obtain than spoken text. Some examples: • Newspapers • Fiction (e.g. fairy tales) • Technical literature (e.g. manuals, medicine) • Personal letters: email • Advertising (incl. political propaganda) • Belief and thought (e.g. the Bible)

  12. Corpora of spoken language • Spontaneous spoken language • recording of dialogues (e.g. telephone conversation) • Prepared spoken language • Public speeches (e.g. in parliament) • Radio or TV news • Spoken utterances must be transcribed for linguistic research.

  13. Size of corpora • Brown Corpus for English (1964, 1 million words) • LIMAS-Corpus for German (1970, 1 million words) • British National Corpus (1995, 100 million words) • Cosmas corpus (2002, > 100 million words)

  14. Brown Corpus (1964) • 500 texts • out of 15 different text types • with 2000 words each

  15. British National Corpus • 90% written English, 10% spoken English • 3209 texts • out of 10 different text types written and • 6 text types spoken • with < 40'000 words each → multi-purpose corpus

  16. Other considerations • Time frame of the corpus • Native and non-native speakers • Sociolinguistic variables • Gender • Age • Education • Dialect • Social context and relationships

  17. Types of corpora • Raw texts • Automatically annotated corpora • Texts with Part-of-Speech tags • Partially parsed texts • Manually annotated corpora • Treebank • FrameNet

  18. Types of Corpora • Balanced Corpora vs. special corpora • Spoken vs. written language • Monolingual vs. Multilingual Corpora • Parallel vs. comparable corpora

  19. Corpora in Computational Linguistics • [Diagram relating corpora, annotation and learning to facts, rules and preferences]

  20. My Motivation for Corpus Linguistics • Attempt to build a parser for German • But: problems with ambiguities!! • Therefore: Learn attachment preferences from a corpus!
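
One simple way to learn attachment preferences from a corpus is to count, in already disambiguated examples, how often a preposition attaches to a given noun versus a given verb, and to prefer the more frequent option. The sketch below only illustrates that idea and is not necessarily the method used by the author; the training tuples, the function names and the fallback to verb attachment are all assumptions made for the example.

```python
from collections import Counter

# Invented, hand-disambiguated examples: (verb, noun, preposition, attachment),
# where attachment is "V" (attaches to the verb) or "N" (attaches to the noun).
TRAINING = [
    ("sehen", "Mann", "mit", "V"),
    ("sehen", "Mann", "mit", "V"),
    ("kaufen", "Buch", "über", "N"),
    ("lesen", "Buch", "über", "N"),
    ("lesen", "Buch", "mit", "V"),
    ("kennen", "Mann", "mit", "N"),
]

def learn_preferences(examples):
    """Count noun+preposition and verb+preposition attachments separately."""
    noun_counts = Counter()
    verb_counts = Counter()
    for verb, noun, prep, attachment in examples:
        if attachment == "N":
            noun_counts[(noun, prep)] += 1
        else:
            verb_counts[(verb, prep)] += 1
    return noun_counts, verb_counts

def predict(verb, noun, prep, noun_counts, verb_counts):
    """Prefer the attachment site seen more often with this preposition."""
    n = noun_counts[(noun, prep)]
    v = verb_counts[(verb, prep)]
    # Assumption: ties and unseen pairs fall back to verb attachment.
    return "N" if n > v else "V"

noun_counts, verb_counts = learn_preferences(TRAINING)
print(predict("lesen", "Buch", "über", noun_counts, verb_counts))
print(predict("sehen", "Mann", "mit", noun_counts, verb_counts))
```

Raw counts like these become unreliable for rare words, so a realistic version would need some form of smoothing or backoff.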

  21. Corpora vs. Test suites A test suite • is a collection of manually constructed and selected sentences. • is used for testing computational grammars and parsers. • reduces the amount of testing. • points to specific problems of the NLP system.

  22. Basic problems in CL • Knowledge is missing (too little information) • e.g. unknown words • Ambiguities (too much information) • e.g. in syntax: attachment preferences

  23. Corpora in Computational Linguistics • Widespread use of (manually) annotated material for measuring progress! • Some examples from COLING 2002: • Treebanks to train and test probabilistic grammars • Enriching treebanks with dependency information • Automatic error detection in PoS-Tagged Corpora • SENSEVAL data to train and test word sense disambiguation programs
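
Measuring progress against manually annotated material usually comes down to comparing system output with the gold annotation. A minimal sketch for PoS tagging accuracy follows; the tag sequences are invented and the function name `tagging_accuracy` is hypothetical.

```python
def tagging_accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold-standard tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Invented example: the tagger mislabels the last token.
gold = ["APPR", "ART", "NN", "KOUS"]
predicted = ["APPR", "ART", "NN", "APPR"]
accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
```

For a fair measurement, the sentences used for testing must be kept separate from those used for training.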

  24. Possible Student Tasks • Which German prepositions take a noun without a determiner? (e.g. pro, via) • When is mit used as an adverb? (e.g. ) • What is the distribution of separable verb prefixes in German? • How often are relative clauses introduced with welche(r) ? • How often are present participle forms used in German? • What kind of foreign language material is in the corpus?
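
As an illustration of how one of these tasks might be approached, the sketch below estimates how often relative clauses are introduced with a form of welche(r). It assumes a PoS-tagged corpus in `word/TAG` format with STTS tags, where `PRELS` marks a relative pronoun; the three sample sentences are invented.

```python
# Hypothetical tagged sample; STTS tags are an assumption (PRELS = relative pronoun).
TAGGED = (
    "Das/ART Buch/NN ,/$, welches/PRELS er/PPER liest/VVFIN ,/$, ist/VAFIN alt/ADJD ./$.\n"
    "Die/ART Frau/NN ,/$, die/PRELS dort/ADV steht/VVFIN ,/$, lacht/VVFIN ./$.\n"
    "Der/ART Mann/NN ,/$, der/PRELS kommt/VVFIN ,/$, winkt/VVFIN ./$.\n"
)

def relative_pronoun_share(tagged_text):
    """Return (welch-forms, all relative pronouns) among PRELS tokens."""
    total = 0
    welch = 0
    for token in tagged_text.split():
        form, _, tag = token.rpartition("/")
        if tag == "PRELS":
            total += 1
            if form.lower().startswith("welch"):
                welch += 1
    return welch, total

welch, total = relative_pronoun_share(TAGGED)
print(f"{welch} of {total} relative pronouns are forms of welche(r)")
```

Relying on the tags rather than on the word form alone avoids counting interrogative uses of welche(r), provided the tagging itself is correct.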

  25. Possible Student Tasks • Create a small parallel corpus (e.g. with various versions of 'Alice in Wonderland' or National Geographic). • Create a small corpus of spoken language (e.g. by transcribing one episode of 'Big Brother'). • Create a small treebank with the ANNOTATE tool.

  26. What corpora do we have for German? Raw text • ComputerZeitung 1993-97 (about 1.3 million words per year) • ComputerZeitung iX • Tages-Anzeiger 2000

  27. Information in Tages-Anzeiger • Date • Category (Sport, Politics, Culture, Economics etc.) • Author • Title vs. Text

  28. What corpora do we have for German? Syntactically Annotated Text (Treebanks) • NEGRA treebank (20'000 sentences) • ComputerZeitung treebank (3'000 sentences) Text with manually corrected PoS tags • 50'000 sentences from University speeches • others

  29. The goal If you can walk, you can dance. If you can talk, you can sing. If you can parse, you can understand. (Hans Uszkoreit, COLING 2002)

  30. Acknowledgement Some slides were highly influenced by or even copied from Anke Lüdeling's course "Introduction to Corpus Linguistics" at http://www.cl-ki.uni-osnabrueck.de/~aluedeli/Corpuslinguistik.html
