Topics in computational morphology and phonology John Goldsmith LSA Institute 2003
Basics • Focus of the course is on algorithmic learning of natural language. • This is both very concrete and very abstract… • We’ll get our hands dirty (with data) and our minds clean (with theory). • We’re really interested in systems that work from real data and derive real linguistic analyses.
Changes in our views… • Naturally, this activity affects our view of what the ideal linguistic analysis is…our view of linguistics does not emerge unscathed and unchanged.
Basics • Everything that’s important is accessible through the Internet – • The syllabus for this course • The software that we’ll be exploring • Linguistica: automatic learning of morphology • Phonological complexity: learning of phonological complexity (tactics?) • Word Breaker • Dynamic Computational network
Software • It’s important to download the software (especially Linguistica) and to run it, and see what it does and what it is telling us. Try running it on different corpora and different languages.
Evolution of some ideas • Late 1980s – trying to develop a view of phonological theory in which we could employ a quantitative notion of well-formedness, along with a simplified set of rule-changes, plus the over-arching principle: • A rule applies iff its application improves the well-formedness of the representation it applies to.
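The principle above can be sketched in a few lines of Python. This is only a toy illustration: the scoring function `toy_well_formedness` and the rule `epenthesis` are hypothetical stand-ins invented for the example, not the quantitative measure or the rules developed in the course.

```python
from typing import Callable

VOWELS = set("aeiou")

def toy_well_formedness(rep: str) -> float:
    """Hypothetical score: each pair of adjacent consonants costs one point."""
    clusters = sum(
        1 for a, b in zip(rep, rep[1:]) if a not in VOWELS and b not in VOWELS
    )
    return -float(clusters)

def epenthesis(rep: str) -> str:
    """Hypothetical rule: insert 'i' inside the first consonant cluster."""
    for i, (a, b) in enumerate(zip(rep, rep[1:])):
        if a not in VOWELS and b not in VOWELS:
            return rep[: i + 1] + "i" + rep[i + 1 :]
    return rep

def apply_if_improves(
    rep: str,
    rule: Callable[[str], str],
    score: Callable[[str], float],
) -> str:
    """Apply the rule iff its output scores strictly better than its input."""
    candidate = rule(rep)
    return candidate if score(candidate) > score(rep) else rep

print(apply_if_improves("kitb", epenthesis, toy_well_formedness))   # kitib
print(apply_if_improves("kitab", epenthesis, toy_well_formedness))  # kitab
```

The rule itself carries no condition on when it applies; the decision is made entirely by comparing scores before and after.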
Connection to Well-formedness Condition of early autosegmental phonology (Goldsmith 1976)
But how to come up with a general characterization (especially a language-particular one) of well-formedness? • Interest in neural nets, in which a natural notion of “complexity” of a state exists, and in which neural nets could be understood to “relax” into the least complex state consistent with the input.
This idea will come back later. • At the same time, phonology (as much else in linguistics) seemed to show the hallmark of competition: complexities arise when (otherwise simple) generalizations conflict with each other, and the language has to decide which wins. This is a very natural notion in the context of neural network calculation, but not from the point of view of serial rule application.
In addition, neural networks deal with competition by using numbers rather than ordering or ranking. This feels like a big change.
MS Shock • Late 1990s • Lack of interest in linguistic theory • Interesting tools: HMM, decision trees, etc. • Is a linguist someone who cares about all that involves language? (Jakobson: Linguista sum: linguistici nihil a me alienum puto)
Language Identification • Why Language ID (LID)? Practical applications. Text / speech. • We typically define the challenge as choosing one language from within a small universe of languages (e.g., French, German, Spanish, Dutch, English): customers of MS Word? • First guess: “dead-ringers”: sounds that uniquely identify their language.
Dead-ringers Problems: • Fewer dead-ringers with written text (compared with sound): à? è? â? • Dead-ringers don’t work: we often have “borrowings” from the wrong language. The biggest case is names (persons, places, companies, etc.), but others exist. • Not a good enough strategy. Why?
What’s not good enough? • We demand of ourselves that we quantify our success. In particular, we want 98% correct identification from a universe of 8 languages within 5 words.
Language ID • “Dead ringer” approach inadequate because it overlooks the enormous amount of information that is present. • But what is that information? • It lies in the frequencies of the letters and the frequencies of the letter combinations. • But then:
How do we combine all of the information we have regarding the frequencies of the letters and letter combinations in a single test sentence (e.g., a 5-word sequence)? • There is a way.
Probability theory …is exactly what we need. Each language provides a set of probabilities for letters (etc.). Each language then analyzes the test string and tells us how good an example the string is as an example of that language.
A string comes from a language L (out of our set) if language L assigns the highest probability to that string (compared to the probabilities assigned by the other languages). • All we need to do is collect the relevant information about letters (combinations) from a decent sample of each language.
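The procedure just described can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system discussed in the course: it uses letter bigrams only, add-one smoothing with an assumed shared alphabet size of 64, and two tiny made-up training samples (a real system would need a decent sample of each language). The names `train_bigram_model`, `log_prob`, and `identify` are hypothetical.

```python
import math
from collections import Counter

def train_bigram_model(sample: str) -> dict:
    """Collect letter-bigram and context counts from one language sample."""
    text = " " + sample.lower() + " "
    return {
        "bigrams": Counter(zip(text, text[1:])),
        "contexts": Counter(text[:-1]),
    }

def log_prob(model: dict, string: str, vocab_size: int = 64) -> float:
    """Log2 probability of the string under the model (add-one smoothing).

    vocab_size is an assumed upper bound on the number of distinct
    characters, shared by all languages so that scores are comparable.
    """
    text = " " + string.lower() + " "
    total = 0.0
    for a, b in zip(text, text[1:]):
        numerator = model["bigrams"][(a, b)] + 1
        denominator = model["contexts"][a] + vocab_size
        total += math.log2(numerator / denominator)
    return total

def identify(models: dict, string: str) -> str:
    """Return the language whose model assigns the highest probability."""
    return max(models, key=lambda lang: log_prob(models[lang], string))

# Toy training samples, far too small for real use.
models = {
    "english": train_bigram_model(
        "the quick brown fox jumps over the lazy dog and then "
        "the cat sat on the mat with the other things"
    ),
    "french": train_bigram_model(
        "le chat est sur la table et le chien est dans la maison "
        "avec les enfants qui mangent une pomme"
    ),
}

print(identify(models, "the dog and the cat"))    # english
print(identify(models, "le chien et la maison"))  # french
```

Working in log probabilities turns the product of many small numbers into a sum, which avoids numerical underflow on longer strings.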
Impact • The problem is solved using numerical methods and a probabilistic model. • The solution requires more computation (but simple computation) than traditional linguistic analyses: we pretty much need to have a computer. • The computer is definitely necessary in order to collect the frequencies.
But ! • Once the program is written to collect frequencies, it will work perfectly well for any language. It’s a rough Language-Data Acquisition Device: for learning a very specific, particular linguistic task.
How particular is this task? • Eventually it occurred to me that we were asking a very linguistic question, even if we were focusing on standard orthography. • Suppose we had easy access to phonological transcriptions. • We’d be asking, for a given utterance, how good is it as an utterance in English? French? German? How “well-formed” is it?
And we can test different models (theories) and see instantaneously how good they are relative to each other.
Next Event • Word-breaking (word-segmentation) in Asian languages • Carl de Marcken’s dissertation • Word breaking in English, Chinese • Using Minimum Description Length analysis
Unsupervised learning • There’s a practical side to this… • And a theoretical side. • Work to date: • Morphology learning: work on European languages, morphologically rich languages. • Phonology • Syntax
General themes: What (the heck) are we doing? • Mediationalist / distributionalist views of language (Huck and Goldsmith 1995): • This is strictly distributionalist; but it does not need to be (work on MT: machine translation).
Schools of explanation • Historical • Psychological • Social • Algorithmic • Computational linguistics is the embodiment of the algorithmic point of view.
Strengths of computational approaches to linguistics: (a) They allow for testing of ideas against data. (b) They allow for better exploration of data: you get your hands dirty (and your mind clean). (c) They suggest models to linguists.
Contrasts between the automatic-learning and the hand-crafted, rule-based approaches to computational linguistics: the great debate of the 1990s in computational linguistics.
Machine learning • The study of the extraction of regularities from raw data
Linguistic theory... The strongest requirement that could be placed on the relation between a theory of linguistic structure and particular grammars is that the theory must provide a practical and mechanical method for actually constructing the grammar, given a corpus of utterances. Let us say that such a theory provides us with a discovery procedure.
[Diagram: corpus → discovery procedure → grammar]
A weaker requirement would be that the theory must provide a practical and mechanical method for determining whether or not a grammar proposed for a given corpus is, in fact, the best grammar of the language from which the corpus is drawn (a decision procedure).
[Diagram: grammar, corpus → decision procedure → yes/no]
An even weaker requirement would be that given a corpus and given two proposed grammars G1 and G2, the theory must tell us which is the better grammar....an evaluation procedure.
[Diagram: G1, G2, corpus → evaluation procedure → "G1" or "G2"]
The point of view adopted here is that it is unreasonable to demand of linguistic theory that it provide anything more than a practical evaluation procedure for grammars. That is, we adopt the weakest of the three positions described above...
I think that it is very questionable that this goal is attainable in any interesting way, and I suspect that any attempt to meet it will lead into a maze of more and more elaborate and complex analytic procedures that will fail to provide answers for many important questions about the nature of linguistic structure. I believe that by lowering our sights to the more modest goal of developing an evaluation procedure for grammars we can focus attention more clearly on truly crucial problems...The correctness of this judgment can only be determined by the actual development and comparison of theories of these various sorts.
Notice, however, that the weakest of these three requirements is still strong enough to guarantee significance for a theory that meets it. There are few areas of science in which one would seriously consider the possibility of developing a general, practical, mechanical method for choosing among several theories, each compatible with the available data. …Noam Chomsky, Syntactic Structures (1957)
That’s precisely one of the concerns of modern machine learning.
Paradigms of machine learning (ML) today • Minimum Description Length • Neural networks • Support vector machines • Decision trees
Machine learning provides a way (perhaps the only way) to test the claim that an information-rich UG (Universal Grammar) is necessary to account for the acquisition of human language. • Such a view carries with it the claim that UG is learned on an evolutionary scale by an uncharacterized species-level learning mechanism. This is not a very plausible idea, but not impossible: something like that has indeed happened for, e.g., vision. Then again, languages vary more than vision systems do, and mammals had longer to evolve vision than Homo sapiens has had to evolve language.
Probabilistic approaches A better name would be QTE: the quantitative theory of evidence. Probability theory is not fuzzy. More importantly, it is not in any fashion inimical to structurally-based theories (a wrong-headed notion that many linguists have picked up). Probabilistic models (or their supporters) sometimes appear to be skeptical about structure, but only because probabilistic modelers want to wring every last bit of quantitative result out of a simpler model before investing themselves in a more complex model, one which may require a great deal of mathematical work to get up and running.
Contrast between generative and probabilistic approach Principal difference: • Generative grammar: given an alphabet, specify explicitly those combinations (representations; strings;…) that are in and those that are out. • Probabilistic model: given an alphabet, specify a distribution over all possible combinations. (terms: distribution; support).
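The contrast can be stated compactly in symbols. The notation here is an assumption made for illustration: $\Sigma$ is the alphabet, $\Sigma^{*}$ the set of all finite strings over it, $L_G$ the set of strings a generative grammar $G$ admits, and $P$ the distribution assigned by a probabilistic model.

```latex
\begin{align*}
\text{Generative grammar:} \quad & L_G \subseteq \Sigma^{*},
  \qquad s \text{ is ``in'' iff } s \in L_G \\
\text{Probabilistic model:} \quad & P : \Sigma^{*} \to [0, 1],
  \qquad \sum_{s \in \Sigma^{*}} P(s) = 1 \\
\text{Support:} \quad & \operatorname{supp}(P) = \{\, s \in \Sigma^{*} : P(s) > 0 \,\}
\end{align*}
```

On this reading, the support of $P$ plays a role comparable to the "in" set of a generative grammar, while the probabilities add a graded ranking among the strings inside it.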
Working with real people using language, it’s often of little interest to distinguish between the In and the Out. • For example, …
Information theory • Information theory's usefulness, and its central tool, the notion of information content: • positive log probability (that is, -1 * log of the probability), or • average positive log probability.
Math? The only math you’ll need to feel comfortable with is logarithms: Mainly: -1 * log (x), where x is a number between 0 and 1.
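A short worked example of this quantity in Python may help. Two assumptions are made here: the logarithm is taken base 2 (so the result is in bits), and "average" is read as a plain per-symbol mean over a sequence of probabilities. The function names `plog` and `average_plog` are chosen for the illustration.

```python
import math

def plog(p: float) -> float:
    """Positive log probability (information content) in bits: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be greater than 0 and at most 1")
    return -math.log2(p)

def average_plog(probs: list) -> float:
    """Mean positive log probability over a sequence of probabilities."""
    return sum(plog(p) for p in probs) / len(probs)

print(plog(0.5))   # 1.0
print(plog(0.25))  # 2.0
print(average_plog([0.5, 0.25, 0.25]))  # about 1.667, i.e. (1 + 2 + 2) / 3
```

The smaller the probability, the larger the positive log probability: halving a probability adds exactly one bit.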
Basic maxim of intelligence in the Universe: • A general goal of quantitative modeling, not just for linguistics, but for intelligence in the universe: • Minimize the complexity of the perceived universe. • This maxim leads to the framework of Minimum Description Length (MDL)
Minimum Description Length You must simultaneously maximize the ability of your theory to accurately describe (model) the data and minimize the complexity of the theory (so it won’t overfit the data: so you won’t always be looking at the same data over and over and over…sound familiar?)
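In the standard two-part formulation of MDL, this trade-off is written as a single quantity to minimize. The symbols are assumptions introduced for illustration: $G$ is a candidate theory (grammar), $D$ the data, $L(G)$ the length in bits of the theory, and $L(D \mid G)$ the length in bits of the data when encoded with the help of the theory.

```latex
\begin{align*}
\hat{G} &= \arg\min_{G} \bigl[ L(G) + L(D \mid G) \bigr] \\
L(D \mid G) &= -\log_2 P(D \mid G)
\end{align*}
```

A theory that models the data well makes $P(D \mid G)$ large and so the second term small; a theory that is itself complex makes the first term large. Memorizing the data drives the second term down only by inflating the first.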