
Introduction to Corpora and Corpus Linguistics




  1. Introduction to Corpora and Corpus Linguistics COGS 523-Lecture 1 General Introduction COGS 523 - Bilge Say

  2. Related Readings Course Pack: • Meyer (2002). Corpus Analysis and Linguistic Theory. Ch 1 • Abney (1996). Statistical Methods and Linguistics Extra Material (entirely optional; part of the presentation draws on this material): • McEnery and Wilson (2001). Ch 1 • McEnery et al. (2006). A1 and B2 • Tognini-Bonelli (2001). Corpus Linguistics at Work. Ch 3 • Corpora Discussion List Archives: Corpora: Chomsky/Harris Discussion, April 2001 • Borsley & Ingham vs Stubbs Discussion. Lingua 112 (2002) • Schönefeld (1999). Corpus Linguistics and Cognitivism. International Journal of Corpus Linguistics 4(1)

  3. What is a Corpus? • Turkish: derlem (alt. bütünce) • Text/Speech/Video + Annotation • Digital media • Written/Spoken Language • Design Criteria

  4. Questions of the Week • Is working with corpora a methodology within linguistics or a distinctive subfield (corpus linguistics)? • What potential is there for empirical analysis of corpora to contribute to linguistic theory? • What are the dangers involved in corpus-based linguistics? How can these dangers be reduced?

  5. What is a Corpus, again? • A body of written text or transcribed speech which can serve as a basis for linguistic analysis or description, designed or required for a particular “representative” function. • An electronic collection of texts in a uniform representation. • Corpus vs text archive vs database

  6. Sinclair’s definition • A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

  7. Should a Corpus Necessarily Be • Large? • Authentic? • Compiled for linguistic analysis? • Saturated in terms of lexical growth? • Representative? • Machine-readable?
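
The "saturated in terms of lexical growth" question on slide 7 can be made concrete with a vocabulary growth curve: the number of distinct word types seen after each successive token. If the curve flattens, new text is adding few new words. The following is a minimal sketch and not part of the original lecture; the function name `vocabulary_growth`, the whitespace tokenization, the lowercasing, and the toy sentence are all illustrative assumptions.

```python
def vocabulary_growth(tokens):
    """Return a list whose i-th entry is the number of distinct
    word types among the first i + 1 tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token.lower())   # assumption: case-insensitive types
        growth.append(len(seen))
    return growth


# Toy text (an assumption for illustration, not real corpus data).
text = "the cat sat on the mat and the dog sat on the cat"
tokens = text.split()             # assumption: naive whitespace tokenizer
growth = vocabulary_growth(tokens)
print(growth)
```

On a real corpus the same curve would be computed over millions of tokens, and how quickly it levels off depends heavily on tokenization and on the language; morphologically rich languages tend to keep producing new word forms for longer.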

  8. A History of Corpora • Pre-computer era (pre-1960s) • Transition era (1960s to the beginning of the 1990s) • Maturation era (1990s onwards) • What did technology bring? • Increased accuracy, speed, accountability, and replicability, plus large volumes of better-annotated data.

  9. [Diagram] • Levels of analysis: Phonology, Morphology, Lexicon, Syntax, Semantics, Discourse, Pragmatics • Methods: Introspection, Experimental Methods, Formal Linguistic Analysis, Computational Modeling, Corpus-Based Methods? • Fields: Linguistics, Computational Linguistics, Psycholinguistics, Sociolinguistics, Historical Linguistics, Applied Linguistics, Corpus Linguistics?

  10. Corpus Linguistics • The term emerged in the 1980s, although the use of corpora has a long history. • Modern perspectives contain a number of opposing positions.

  11. Linguistic Subdisciplines with a Tradition of Using Corpora • Historical Linguistics • Phonetics • Language Acquisition • Statistical Natural Language Processing/Language Engineering/Computational Linguistics

  12. Corpus Linguistics: a Methodology, Theory, or Subfield of Linguistics? • Rationalism vs Empiricism • Formalists vs Functionalists • Competence vs Performance • Core vs Periphery • Applied Linguistics vs Theoretical Linguistics • Corpus-Based vs Corpus-Driven Approaches (Tognini-Bonelli)

  13. False Assumptions • All corpus linguists are descriptivists, interested only in counting and categorizing occurrences in a corpus. • All generative grammarians are theoreticians unconcerned with the data on which their theories are based. • The complexity of linguistic structure is of no interest to corpus linguists. (Meyer, 2002)

  14. Evaluating Linguistic Theories • Observational vs explanatory vs descriptive adequacy • Falsifiability, completeness, simplicity, objectivity, etc.

  15. Chomskyan quotes: • “The corpus could never be a useful tool for the linguist, as the linguist must seek to model language” • “Corpus Linguistics does not exist.” • “Any natural corpus will be skewed and incomplete. Some sentences won’t occur, because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.” Indeed, through formal language theory, Chomsky contributed to the modern view of corpus linguistics, both by advancing language technology and by helping to overcome the structuralist-behaviourist view of language as something that could be enumerated.

  16. Why Do Statistics Help? (Abney) • Language Acquisition • Language Change • Language Variation • Grammaticality, Ambiguity, Computation • Modularity is not in isolation

  17. Grammaticality Judgements • *He shines Tony books. • He gives Tony books. If intuitions will do, why bother with corpus analysis? • Artificial data is artificial and creates another kind of skew. • “Yes, I could say that, but I never would”: gradedness in grammaticality judgements • Intuitions are perceptions...

  18. Alternative Views • Leech (1992): “Computer Corpus Linguistics” is a new research enterprise, a new philosophical approach that: • Concentrates on linguistic performance • Leads to a more empirical view of scientific inquiry • Exploits qualitative as well as quantitative methodology to produce quantitatively oriented language models, such as Bayesian language models • Not everyone agrees!

  19. Further Remarks • Corpus Linguistics contributed to blurring the distinction between grammar and lexicon. • Sinclair’s open choice vs idiom principle • Cognitive linguists can accommodate data and facts revealed by corpus linguistic analysis

  20. Corpus Linguistics vs Corpus-Based Linguistics • There is no inherent incompatibility between theoretical generative linguistics and corpus linguistics. (Seegmiller) • Generative and corpus linguistics are two approaches to the same problem, and must meet somewhere. Generative theories should match or be backed up by real data. (Schiffrin) • What is possible and what is probable? Corpus linguistics offers a way of describing things that we *do* regularly and frequently with greater confidence and reliability than by using introspection alone. (Krishnamurthy)

  21. Corpus-Based Linguistics vs Corpus-Driven Linguistics • Corpus-based: take existing theory as a starting point, then correct and revise the theory in light of corpus evidence. • Corpus-driven: favour very large, full-text corpora, with the idea of cumulative representativeness and no annotation, so as to free oneself of preconceived theories; e.g. collocations rather than colligations. • Without a corpus, there is no meaningful work to be done (attributed to Sinclair and Stubbs, but see their own writings)
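
Slide 21 mentions collocations, i.e. words that habitually co-occur. A crude first approximation, usable on raw unannotated text, is to count adjacent word pairs. The sketch below is an illustration only and not part of the original lecture: the function name `adjacent_pairs`, the whitespace tokenization, and the toy sentence are assumptions, and raw pair counts stand in for the association measures (such as mutual information or log-likelihood) normally used in collocation studies.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """Count each pair of neighbouring tokens (word bigrams)."""
    return Counter(zip(tokens, tokens[1:]))


# Toy text (an assumption for illustration, not real corpus data).
text = "strong tea and strong coffee but powerful computers with strong tea"
tokens = text.split()             # assumption: naive whitespace tokenizer
pairs = adjacent_pairs(tokens)

# Show the most frequent adjacent pairs as collocation candidates.
for pair, count in pairs.most_common(3):
    print(pair, count)
```

Raw frequency favours pairs made of very common words, which is one reason association measures are preferred over plain counts on real data.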

  22. Reconciling Views • Corpora are excellent resources for verifying the falsifiability, completeness, simplicity, strength, and objectivity of linguistic hypotheses (Meyer, 2002). • They can provide additional linguistic perspectives which improve our knowledge of language and our ability to use it (a weaker position)

  23. The Rise of Corpora [figure] (McEnery and Wilson, 2001)

  24. Range of Activities in Corpus-based Linguistics • (1) Corpus Design, Compilation and Annotation • (2) Developing Tools for (1) or for Analysis of Corpora • (3) Linguistic Studies or Applications using corpora developed in (1) and tools developed in (2)

  25. Types of Corpora • General (typically balanced and made available for general linguistic use) vs Specialized (dialect corpora, language acquisition corpora, learner corpora) • Core Corpora • Written vs Spoken Corpora • Full-text vs Sample-text Corpora

  26. More Typology • Finite-size (Static) vs Dynamic/Monitor Corpora • Monolingual vs Multilingual Corpora (Parallel Corpora, Comparable Corpora) • Rather Graded Distinctions: • Raw vs Annotated • Balanced vs Pyramidal vs Opportunistic Corpora • Synchronic vs Diachronic

  27. Some Examples of Corpora • Pre-electronic corpora: • Biblical and Literary Studies • Lexicographical • Dialect Studies • Language Education • Grammatical • Quirk’s Survey of English Usage Corpus (later computerized) had 200 samples of 5000 words each, half spoken, half written, tagged manually with 65 grammatical features.

  28. More Examples • Major Electronic Corpora • Brown Corpus (Francis and Kucera, 1965): Brown University Standard Corpus of Present-Day American English; 1 million words, 1961-64, 500 samples of 2000 words each • Lancaster-Oslo-Bergen Corpus (LOB corpus): a comparable corpus of British English (fewer westerns exist, though!) • Frown and FLOB: comparable corpora of the 1990s

  29. Major Electronic Corpora • Also modeled after Brown: • Kolhapur Corpus of Indian English • Wellington Corpus of New Zealand English... • London-Lund Corpus (1975): 100 samples of 5000 words each of spoken data; the major spoken corpus until the mid 1990s; predominantly highly educated adult speakers • Lancaster/IBM Spoken English Corpus (SEC): better balance, 11 categories, detailed prosodic annotation

  30. Major Electronic Corpora • Longman Dictionary of Contemporary English (LDOCE); COBUILD Project, Bank of English: 524 million words as of 2004 • International Corpus of English • International Corpus of Learner English: 2M words, 500-word essays, different English backgrounds • Longman Learner’s Corpus, HKUST Learner’s Corpus • CHILDES (Child Language Data Exchange System) • European Corpus Initiative (ECI): 93 million words • Many corpora are available from LDC and ELDA/ELRA.

  31. Major Natural Language Processing Corpora • Penn Treebank (1993): 4.9 million words, tagged and parsed, not balanced (optional paper in course pack) • TIPSTER corpus: AP Newswire and Wall Street Journal, mainly used for Information Retrieval • More variety is offered by national corpora and dependency treebanks

  32. National Corpora • British National Corpus (BNC): 100 million words, 90% written, 10% spoken; BNC Baby, a 2-million-word sampler; SARA and Xaira, its own corpus query tools; wholly tagged by the CLAWS tagger • American National Corpus (ANC): in progress, preliminary releases available • Czech National Corpus (optional paper in course pack): 12 full-time staff working for 5 years in a specialized institute; 100 million words; partially tagged and parsed in the Prague Dependency School tradition • See METU Online links

  33. Lecture 2 • Corpus Design Issues • Readings: • Tognini-Bonelli (2001). Corpus Issues. Ch 3 • McEnery et al. (2006). Units A7-A9, B1 (all appear to be one article in the course pack) • Meyer (2002). Planning the Construction of a Corpus. Ch 2
