Introduction to Corpora and Corpus Linguistics
COGS 523 - Lecture 1: General Introduction
COGS 523 - Bilge Say
Related Readings
Course Pack:
• Meyer (2002). Corpus Analysis and Linguistic Theory. Ch 1
• Abney (1996). Statistical Methods and Linguistics
Extra Material (entirely optional; part of the presentation draws on this material):
• McEnery and Wilson (2001) Ch 1
• McEnery et al. (2006) A1 and B2
• Tognini-Bonelli (2001). Corpus Linguistics at Work. Ch 3
• Corpora Discussion List Archives: Corpora: Chomsky/Harris Discussion, April 2001
• Borsley & Ingham vs Stubbs Discussion. Lingua 112 (2002)
• Schönefeld (1999). Corpus Linguistics and Cognitivism. International Journal of Corpus Linguistics 4(1)
What is a Corpus?
• Turkish: derlem (alt. bütünce)
• Text/Speech/Video + Annotation
• Digital media
• Written/Spoken Language
• Design Criteria
Questions of the Week
• Is working with corpora a methodology within linguistics or a distinctive subfield (corpus linguistics)?
• What potential is there for empirical analysis of corpora to contribute to linguistic theory?
• What are the dangers involved in corpus-based linguistics? How can these dangers be reduced?
What is a Corpus, again?
• A body of written text or transcribed speech which can serve as a basis for linguistic analysis or description, designed or required for a particular “representative” function.
• An electronic collection of texts in a uniform representation
• Corpus vs text archive vs database
Sinclair’s definition
• A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
Should a Corpus Necessarily Be
• Large?
• Authentic?
• Compiled for linguistic analysis?
• Saturated in terms of lexical growth?
• Representative?
• Machine readable?
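One way to make the "saturated in terms of lexical growth" question concrete is to track how many distinct word types have been seen as more tokens are read; a curve that flattens out suggests the sample is approaching saturation. The sketch below is illustrative only: the function name `vocabulary_growth`, the whitespace tokenization, the lowercasing, and the toy sentence are all assumptions made for this example, not part of the lecture.

```python
from typing import List, Tuple


def vocabulary_growth(tokens: List[str], step: int = 5) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types_seen) pairs, sampled every `step` tokens.

    Hypothetical helper: types are simply lowercased surface forms.
    """
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        # Record a point at every `step` tokens and at the very end.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


# Toy data (assumption): a real study would use a full corpus and a proper tokenizer.
toy_text = "the cat sat on the mat and the dog sat on the cat"
curve = vocabulary_growth(toy_text.split(), step=4)
for n_tokens, n_types in curve:
    print(n_tokens, n_types)
```

On real data the interesting part is the shape of the curve: new types keep arriving in a general corpus, while a narrow specialized corpus may level off much sooner.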
A History of Corpora
• Pre-computer era (pre-1960s)
• Transition era (1960s to the beginning of the 1990s)
• Maturation era (1990s onwards)
• What did technology bring?
  • Increased accuracy, speed, accountability, replicability, large volumes of better annotated data.
Phonology, Morphology, Lexicon, Syntax, Semantics, Discourse, Pragmatics
Introspection, Experimental Methods, Formal Linguistic Analysis, Computational Modeling, Corpus-Based Methods?
Linguistics: Computational Linguistics, Psycholinguistics, Sociolinguistics, Historical Linguistics, Applied Linguistics, Corpus Linguistics?
Corpus Linguistics
• The term emerged in the 1980s, although the use of corpora has a long history.
• Modern perspectives contain a number of opposing positions.
Linguistic Subdisciplines with a Tradition of Corpus Use
• Historical Linguistics
• Phonetics
• Language Acquisition
• Statistical Natural Language Processing/Language Engineering/Computational Linguistics
Corpus Linguistics: a Methodology, Theory, or Subfield of Linguistics?
• Rationalism vs Empiricism
• Formalists vs Functionalists
• Competence vs Performance
• Core vs Periphery
• Applied Linguistics vs Theoretical Linguistics
• Corpus-Based vs Corpus-Driven Approaches (Tognini-Bonelli)
False Assumptions
• All corpus linguists are descriptivists, interested only in counting and categorizing occurrences in a corpus, and all generative grammarians are theoreticians unconcerned with the data on which their theories are based.
• Corpus linguists are not interested in the complexity of linguistic structure.
(Meyer, 2002)
Evaluating Linguistic Theories
• Observational vs explanatory vs descriptive adequacy
• Falsifiability, Completeness, Simplicity, Objectivity, etc.
Chomskyan quotes:
• “The corpus could never be a useful tool for the linguist, as the linguist must seek to model language”
• “Corpus Linguistics does not exist.”
• “Any natural corpus will be skewed and incomplete. Some sentences won’t occur, because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.”
Indeed, Chomsky contributed to the modern view of corpus linguistics: by way of formal language theory, he helped improve language technology and overcome the structuralist-behaviourist view of language as something that could be enumerated.
Why Do Statistics Help? (Abney)
• Language Acquisition
• Language Change
• Language Variation
• Grammaticality, Ambiguity, Computation
• Modularity is not in isolation
Grammaticality Judgements
*He shines Tony books.
He gives Tony books.
If intuitions do, why bother with corpus analysis?
• Artificial data is artificial and creates another kind of skew.
• “Yes, I could say that, but I never would”: gradedness in grammaticality judgements
• Intuitions are perceptions...
Alternative Views
• Leech (1992)
• “Computer Corpus Linguistics” is a new research enterprise, a new philosophical approach that
  • Concentrates on linguistic performance
  • Leads to a more empirical view of scientific inquiry
  • Exploits qualitative as well as quantitative methodology to produce a quantitatively oriented language model such as Bayesian language models.
• Not everyone agrees!
Further Remarks
• Corpus Linguistics contributed to blurring the distinction between grammar and lexicon.
• Sinclair’s open-choice principle vs idiom principle
• Cognitive linguists can accommodate data and facts revealed by corpus linguistic analysis
Corpus Linguistics vs Corpus-Based Linguistics
• There is no inherent incompatibility between theoretical generative linguistics and corpus linguistics (Seegmiller)
• Generative and corpus linguistics are two approaches to the same problem, and must meet somewhere. Generative theories should match or be backed up by real data. (Schiffrin)
• What is possible and what is probable?
• Corpus linguistics offers a way of describing things that we *do* regularly and frequently with greater confidence and reliability than by using introspection alone. (Krishnamurthy)
Corpus-Based Linguistics vs Corpus-Driven Linguistics
• Corpus-based: take existing theory as a starting point, and correct and revise the theory in light of corpus evidence.
• Corpus-driven: favour very large, full-text corpora, with the idea of cumulative representativeness and no annotation, so as to free oneself of preconceived theories, e.g. collocations rather than colligations.
• Without a corpus, there is no meaningful work to be done (attributed to Sinclair, Stubbs, but see their own writings)
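The preference for collocations over colligations can be illustrated with a count that needs no annotation at all: collocation evidence comes straight from raw word co-occurrence, whereas colligation would require grammatical categories (for instance part-of-speech tags). The following is a minimal sketch under stated assumptions: the helper name `adjacent_pairs`, the restriction to directly adjacent words, and the toy sentence are invented for illustration, and real collocation studies normally use wider windows and association measures rather than raw counts.

```python
from collections import Counter
from typing import List


def adjacent_pairs(tokens: List[str]) -> Counter:
    """Count directly adjacent word pairs (a crude stand-in for collocation)."""
    return Counter(zip(tokens, tokens[1:]))


# Toy, unannotated "corpus" (assumption): plain lowercase words split on whitespace.
toy_tokens = (
    "strong tea and strong coffee but powerful computers and strong tea"
).split()

pair_counts = adjacent_pairs(toy_tokens)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Nothing in the input is tagged or parsed, which is the point of the corpus-driven position sketched on this slide.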
Reconciling Views
• Corpora are excellent resources for verifying the falsifiability, completeness, simplicity, strength, and objectivity of linguistic hypotheses (Meyer, 2002).
• They can provide additional linguistic perspectives which improve our knowledge of language and our ability to use it (a weaker position)
The Rise of Corpora (McEnery and Wilson, 2001)
Range of Activities in Corpus-based Linguistics
(1) Corpus Design, Compilation and Annotation
(2) Developing Tools for (1) or for the Analysis of Corpora
(3) Linguistic Studies or Applications using corpora developed in (1) and tools developed in (2)
Types of Corpora
• General (typically balanced and made available for general linguistic use) vs Specialized (dialect corpora, language acquisition corpora, learner corpora)
• Core Corpora
• Written vs Spoken Corpora
• Full-text vs Sample-text Corpora
More Typology
• Finite-size (Static) vs Dynamic/Monitor Corpora
• Monolingual vs Multilingual Corpora (Parallel Corpora, Comparable Corpora)
• Rather Graded Distinctions:
  • Raw vs Annotated
  • Balanced vs Pyramidal vs Opportunistic Corpora
  • Synchronic vs Diachronic
Some Examples of Corpora
• Pre-electronic corpora
  • Biblical and Literary Studies
  • Lexicographical
  • Dialect Studies
  • Language Education
  • Grammatical
• Quirk’s Survey of English Usage Corpus (later computerized) had 200 samples of 5000 words each, half spoken, half written, tagged manually with 65 grammatical features.
More Examples
• Major Electronic Corpora
• Brown Corpus (Francis and Kucera, 1965): Brown University Standard Corpus of Present-Day American English; 1 million words, 1961-64, 500 samples of 2000 words each
• Lancaster-Oslo-Bergen Corpus (LOB corpus): a comparable corpus of British English (fewer westerns exist, though!)
• Frown and FLOB: comparable corpora of the 1990s
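The sample-based designs quoted on these slides imply a nominal corpus size that is simply the number of samples times the words per sample. The short sketch below only redoes that arithmetic with the figures given for the Brown Corpus and the Survey of English Usage; the dictionary layout and the helper name `nominal_size` are assumptions made for illustration.

```python
# Sampling designs as stated on the slides: (number of samples, words per sample).
designs = {
    "Brown Corpus": (500, 2000),
    "Survey of English Usage": (200, 5000),
}


def nominal_size(n_samples: int, words_per_sample: int) -> int:
    """Nominal word count implied by a fixed-size sampling design."""
    return n_samples * words_per_sample


sizes = {name: nominal_size(*design) for name, design in designs.items()}
for name, size in sizes.items():
    print(f"{name}: {size:,} words")
```

Both designs come out at the same nominal size, which fits the "1 million words" figure quoted for Brown.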
Major Electronic Corpora
• Also modeled after Brown:
  • Kolhapur Corpus of Indian English
  • Wellington Corpus of New Zealand English...
• London-Lund Corpus (1975): 100 samples of 5000 words each of spoken data; the major spoken corpus until the mid-1990s; predominantly highly educated adult speakers
• Lancaster/IBM Spoken English Corpus (SEC): better balance, 11 categories, detailed prosodic annotation
Major Electronic Corpora
• Longman Dictionary of Contemporary English (LDOCE); COBUILD Project (Bank of English): 524 million words as of 2004.
• International Corpus of English
• International Corpus of Learner English: 2 million words; 500-word essays, different English backgrounds
• Longman Learner’s Corpus, HKUST Learner’s Corpus
• CHILDES (Child Language Data Exchange System)
• European Corpus Initiative (ECI): 93 million words
• Many corpora are available from LDC and ELDA/ELRA.
Major Natural Language Processing Corpora
• Penn Treebank (1993): 4.9 million words, tagged and parsed, not balanced (optional paper in course pack)
• TIPSTER corpus: AP Newswire and Wall Street Journal; mainly used for Information Retrieval
• More variety is provided by national corpora and dependency treebanks
National Corpora
• British National Corpus (BNC)
  • 100 million words, 90% written, 10% spoken
  • BNC Baby: 2-million-word sampler
  • SARA and Xaira: its own corpus query tools
  • Wholly tagged by the CLAWS tagger
• American National Corpus (ANC)
  • In progress, preliminary releases available
• Czech National Corpus (optional paper in course pack)
  • 12 full-time staff working for 5 years in a specialized institute
  • 100 million words
  • Partially tagged and parsed in the Prague Dependency School tradition
• See METU Online links
Lecture 2
• Corpus Design Issues
• Readings:
  • Tognini-Bonelli (2001). Corpus Issues. Ch 3
  • McEnery et al. (2006) Units A7-A9, B1 (all appear as one article in the course pack)
  • Meyer (2002). Planning the Construction of a Corpus. Ch 2