
Day 5: Entropy reduction models; connectionist models; course wrap-up



  1. Day 5: Entropy reduction models; connectionist models; course wrap-up • Roger Levy, University of Edinburgh & University of California – San Diego

  2. Today • Other information-theoretic models: Hale 2003 • Connectionist models • General discussion & course wrap-up

  3. Hale 2003, 2005: Entropy reduction • Another attempt to measure reduction in uncertainty arising from a word in a sentence • The information contained in a grammar can be measured by the grammar’s entropy • But grammars usually denote infinite sets of trees! • Not a problem: Grenander 1967 shows how to calculate entropy of any* stochastic context-free branching process • [skip details of matrix algebra…] *a probability distribution over an infinite set is not, however, guaranteed to have finite entropy (Cover & Thomas 1991, ch. 2)
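The matrix algebra the slide skips can be illustrated with a minimal sketch (not the lecture's own implementation, and using an invented toy grammar): the per-nonterminal rule-choice entropies go into a vector h, the expected nonterminal counts into a matrix M, and the derivation entropy solves H = h + MH, i.e. H = (I - M)^(-1) h, which is finite for a subcritical branching process.

```python
import numpy as np

# A minimal sketch of a Grenander-style entropy calculation for a PCFG.
# The grammar below is a made-up toy example, not one from the lecture.
# rules: nonterminal -> list of (probability, right-hand side).
rules = {
    "S":  [(0.7, ["NP", "VP"]), (0.3, ["VP"])],
    "NP": [(0.8, ["n"]),        (0.2, ["NP", "PP"])],
    "PP": [(1.0, ["p", "NP"])],
    "VP": [(0.6, ["v", "NP"]),  (0.4, ["v"])],
}
nts = list(rules)
idx = {nt: i for i, nt in enumerate(nts)}

# h[A] = entropy (in bits) of the one-step rule choice at nonterminal A
h = np.array([-sum(p * np.log2(p) for p, _ in rules[nt]) for nt in nts])

# M[A, B] = expected number of B's introduced by one expansion of A
M = np.zeros((len(nts), len(nts)))
for nt in nts:
    for p, rhs in rules[nt]:
        for sym in rhs:
            if sym in idx:
                M[idx[nt], idx[sym]] += p

# Derivation entropy solves H = h + M H, i.e. H = (I - M)^{-1} h,
# which is finite as long as the branching process is subcritical.
H = np.linalg.solve(np.eye(len(nts)) - M, h)
print(dict(zip(nts, H.round(3))))  # H["S"] is the entropy of the whole grammar
```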

  4. Entropy reduction, cont. • A partial input w1…n rules out some of the possibilities in a grammar • Uncertainty after a partial input can be characterized as the entropy of the grammar conditioned on w1…n • We can calculate this uncertainty efficiently by treating a partial input as a function from grammars to grammars (Lang 1988, 1989) • [skip details of parsing theory…]

  5. Entropy reduction, punchline • Hale’s entropy reduction hypothesis: it takes effort to decrease uncertainty • The amount of processing work associated with a word is the uncertainty-reduction it causes (if any) • Note that entropy doesn’t always go down! • more on this in a moment • ER idea originally attributed to Wilson and Carroll (1954)
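As a concrete restatement of the hypothesis: ER(wi) = max(0, H(G | w1…i-1) - H(G | w1…i)), so increases in entropy cost nothing. A minimal sketch; the prefix-entropy trajectory in the example is made up for illustration.

```python
def entropy_reduction(prefix_entropies):
    """Per-word entropy reduction: the drop in conditional grammar entropy
    caused by each word, floored at zero (entropy increases cost nothing).
    prefix_entropies[i] = H(G | w_1..i), with prefix_entropies[0] = H(G)."""
    return [max(0.0, before - after)
            for before, after in zip(prefix_entropies, prefix_entropies[1:])]

# Hypothetical prefix-entropy trajectory (made-up numbers, in bits):
print(entropy_reduction([4.0, 3.1, 3.5, 1.2, 0.0]))
# -> [0.9, 0.0, 2.3, 1.2]: the second word raised entropy, so its ER is 0
```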

  6. Intuitions for entropy reduction • Consider a string composed of n independently identically distributed (i.i.d.) variables, each carrying 1.5 bits of entropy • The entropy of the “grammar” is 1.5n, for any n • When you see wi, the entropy goes from 1.5(n-i+1) to 1.5(n-i) • So the ER of any word is 1.5 bits • Note that ER(wi) is not really related to P(wi) (each Xi has 1.5 bits of entropy)
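A worked version of this intuition. The slide's own per-symbol distribution is not reproduced in the transcript, so the sketch assumes a three-way distribution with probabilities 0.5/0.25/0.25, chosen only because it has exactly 1.5 bits of entropy per symbol.

```python
import math

# Assumed per-symbol distribution with 1.5 bits of entropy (placeholder example)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
h_per_symbol = -sum(q * math.log2(q) for q in p.values())   # = 1.5 bits

n = 4
for i in range(1, n + 1):
    h_before = h_per_symbol * (n - i + 1)   # entropy before seeing w_i
    h_after = h_per_symbol * (n - i)        # entropy after seeing w_i
    print(f"w_{i}: ER = {h_before - h_after:.2f} bits")
    # always 1.5 bits, regardless of whether w_i was the likely 'a'
    # or one of the unlikely 'b'/'c' outcomes
```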

  7. Intuitions for entropy reduction (2) • Cases when entropy doesn’t go down: • Consider the following PCFG: ROOT -> a b    0.9 ROOT -> x y    0.05 ROOT -> x z    0.05, with H(G) ≡ H(ROOT) ≈ 0.57 bits • If you see a, the only possible resulting tree is the one for ROOT -> a b, so entropy goes to 0, for an ER of 0.57 • If you see x, there are two equiprobable continuations (x y and x z), so entropy goes to 1 bit, for an ER of 0 • Entropy goes up, and what’s more, the more likely continuation is predicted to be harder
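The same numbers can be recovered by brute force, enumerating the three trees of this grammar and renormalizing over those consistent with the observed first word (the prefix-conditioning idea from slide 4, done naively rather than with Lang's parsing machinery). A minimal sketch:

```python
import math

def H(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three-rule grammar from the slide: ROOT -> a b | x y | x z
trees = {("a", "b"): 0.9, ("x", "y"): 0.05, ("x", "z"): 0.05}

h_grammar = H(trees.values())                       # ≈ 0.57 bits

def er_after(prefix):
    """Entropy reduction after seeing a one-word prefix."""
    consistent = {t: p for t, p in trees.items() if t[0] == prefix}
    z = sum(consistent.values())
    h_conditional = H(p / z for p in consistent.values())
    return max(0.0, h_grammar - h_conditional)

print(round(er_after("a"), 2))   # 0.57: entropy drops to 0
print(round(er_after("x"), 2))   # 0.0:  entropy rises to 1 bit, so ER is 0
```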

  8. Entropy reduction applied • Hale 2003 looked at object relative clauses versus subject relative clauses • One important property of tree entropy is that it is very high before recursive categories • NP → NP RC is recursive because a relative clause can expand into just about any kind of sentence • So transitions past the end of an NP have a big entropy reduction • Example items: “the reporter who sent the photographer to the editor” (subject RC) vs. “the reporter who the photographer sent to the editor” (object RC)

  9. Subject RCs (Hale 2003)

  10. Object RCs (Hale 2003)

  11. Entropy reduction in RCs • Crossing past the embedded subject NP is a big entropy reduction • This means that reading the embedded verb is predicted to be harder in the object RC than in the subject RC • No other probabilistic processing theory seems to get this data point right • Pruning, attention shift, competition not applicable: no prominent ambiguity about what has been said • Surprisal probably predicts the opposite: the embedded subject NP makes the context more constrained • Some non-probabilistic theories do get this right (e.g., Gibson 1998, 2000)

  12. Surprisal & Entropy reduction compared • Surprisal and entropy reduction share some features… • High-to-low uncertainty transitions are hard • …and make some similar predictions… • Constrained syntactic contexts are lower-entropy, thus will tend to be easier under ER • ER also predicts facilitative ambiguity, since resulting state is higher-entropy

  13. Surprisal & Entropy reduction, cont. • …but they are also different… • ER compares summary statistics of different probability distributions • Surprisal implicitly compares two distr.’s point-by-point • …and make some different predictions • Surprisal predicts that more predictable words will be read more quickly in a given constrained context • ER penalizes transitions across recursive states such as nominal postmodification (useful for some results in English relative clauses; see Hale 2003)

  14. Future ideas • Deriving some of these measures from more mechanistic views of sentence processing • Information-theoretic measures on conditional probability distributions • Connection with competition models: entropy of a c.p.d. as an index of amount of competition • Closer links between specific theoretical measures and specific types of observable behavior • Readers might use local entropy (not reduction) to gauge how carefully they must read

  15. Today • Other information-theoretic models: Hale 2003 • Connectionist models • General discussion & course wrap-up

  16. Connectionist models: overview • Connectionists have also had a lot to say about sentence comprehension • Historically, there has been something of a sociological divide between the connectionists and the “grammars” people • It’s useful to understand the commonalities and differences between the two approaches • (Caveat: I know less about the connectionist literature than about the “grammars” literature!)

  17. What are “connectionist models”? • A connectionist model is one that uses a neural network to represent knowledge of language and its deployment in real time • Biological plausibility often taken as motivation • Mathematically, a generalization of the multiclass generalized linear models covered on Wednesday • Two types of generalization: • additional layers (gives flexibility of nonlinear modeling) • recurrence (allows modeling of time-dependence)

  18. Review: (generalized) linear models • A (generalized) linear model expresses a predictor as the dot product of a weight vector with the inputs • For a categorical output variable, the class predictors can be thought of as drawing a decision boundary • (In the slide’s notation, i ranges from 1 to m for an m-class categorization problem)
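A minimal sketch of the multiclass model being reviewed, with made-up weights and input: each class has its own weight vector, the class score is a dot product with the input, and a softmax turns scores into probabilities. The decision boundary between two classes lies where their scores (hence probabilities) are equal.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())    # subtract max for numerical stability
    return e / e.sum()

W = np.array([[ 1.2, -0.5],              # weight vector for class 1
              [-0.3,  0.8],              # weight vector for class 2
              [ 0.1,  0.1]])             # weight vector for class 3
x = np.array([0.4, 1.0])                 # input feature vector (placeholder)

scores = W @ x                           # one linear predictor per class
print(softmax(scores))                   # P(class i | x)
```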

  19. [Figure: a non-linear decision boundary; probabilistic interpretation: being close to the boundary means P(z1) ≈ 0.5]

  20. Non-linear classification • Extra network layers → non-linear decision boundaries • Boundaries can be curved, discontinuous, … • A double-edged sword: • Expressivity can mean good fit using few parameters • But difficult to train (local maxima in likelihood surface) • Plus danger of overfitting

  21. Schematic picture of multilayer nets • [Figure: input layer → hidden layer 1 → … → hidden layer K → output layer]
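A minimal forward-pass sketch matching the schematic, with assumed layer sizes and random placeholder weights; the tanh nonlinearity in the hidden layers is what makes non-linear decision boundaries possible.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    """Random placeholder weights and biases for one layer (untrained)."""
    return rng.normal(size=(n_out, n_in)), rng.normal(size=n_out)

sizes = [2, 8, 8, 3]                                  # input, 2 hidden layers, output
weights = [layer(a, b) for a, b in zip(sizes, sizes[1:])]

def forward(x):
    h = x
    for W, b in weights[:-1]:
        h = np.tanh(W @ h + b)                        # nonlinear hidden layers
    W, b = weights[-1]
    scores = W @ h + b
    e = np.exp(scores - scores.max())
    return e / e.sum()                                # softmax over output classes

print(forward(np.array([0.5, -1.0])))
```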

  22. Recurrence and online processing • The networks shown thus far are “feed-forward” – there are no feedback cycles between nodes • contrast with the CI model of McRae et al. • How to capture the incremental nature of online language processing in a connectionist model? • Proposal by Elman 1990, 1991: let part of the internal state of the network feed back into input later on • Simple Recurrent Network (SRN)

  23. Simple recurrent networks • [Figure: input layer → hidden layer 1 … hidden layer J-1 → hidden layer J → … → hidden layer K → output layer; a context layer holds a direct copy of hidden layer J and feeds back into it at the next step]

  24. Recurrence “unfolded” • Unfolded over time, the hidden layer at each step becomes the context for the next step • Hidden layer has new role as “memory” of sentence • [Figure: the network unrolled over time, with inputs w1, w2, w3, … predicting outputs w2, w3, w4, …]
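A minimal sketch of one Elman-style SRN step, with assumed vocabulary and hidden-layer sizes and random placeholder weights: the previous hidden state is copied into the context layer and combined with the current (one-hot) input word, and the new hidden state becomes the next step's context.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden_size = 10, 16                           # assumed sizes

W_xh = rng.normal(size=(hidden_size, vocab))          # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))    # context -> hidden
W_hy = rng.normal(size=(vocab, hidden_size))          # hidden -> output

def srn_step(word_id, context):
    x = np.zeros(vocab); x[word_id] = 1.0             # one-hot current word
    hidden = np.tanh(W_xh @ x + W_hh @ context)       # mix input with memory
    scores = W_hy @ hidden
    probs = np.exp(scores - scores.max()); probs /= probs.sum()
    return probs, hidden                              # hidden becomes next context

context = np.zeros(hidden_size)
for w in [3, 7, 2]:                                   # a toy word-id sequence
    next_word_probs, context = srn_step(w, context)   # probs predict the next word
```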

  25. Connectionist models: overview (2) • Connectionist work typically draws a closer connection between the acquisition process and results in “normal adult” sentence comprehension • Highly evident in Christiansen & Chater 1999 • The “grammars” modelers, in contrast, tend to assume perfect acquisition & representation • e.g., Jurafsky 1996 & Levy 2005 assumed PCFGs derived directly from a treebank

  26. Connectionist models: overview (3) • Two major types of linking hypothesis between the online comprehension process and observables: • Predictive hypothesis: next-word error rates should match reading times (starting from Elman 1990, 1991) • Gravitational hypothesis (Tabor, Juliano, & Tanenhaus; Tabor & Tanenhaus): more like a competition model • In both cases, latent outcomes of the acquisition process are of major interest • That is, how does a connectionist network learn to represent things like the MV/RR ambiguity?

  27. Connectionist models for psycholinguistics • Words are presented to the network in sequence, e.g. n1 v5 # N3 n8 v2 V4 # … • The model is trained to minimize next-word prediction error • That is, there is no overt syntax or tree structure built into the model • Performance of the model is evaluated on how well it predicts next words • This metric turns out to be closely related to surprisal
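Under this linking hypothesis, the per-word "error" can be read off the network's output distribution as a surprisal: the negative log probability assigned to the word that actually came next. A minimal sketch with made-up output probabilities:

```python
import math

def surprisal(next_word_probs, actual_next_word):
    """Surprisal (in bits) of the word that actually occurred next."""
    return -math.log2(next_word_probs[actual_next_word])

next_word_probs = {"v2": 0.6, "v4": 0.3, "#": 0.1}    # hypothetical SRN output
print(surprisal(next_word_probs, "v4"))               # ≈ 1.74 bits of "error"
```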

  28. Christiansen & Chater 1999 • Bach et al. 1986: cross-dependency recursion seems easier to process than nesting recursion • Investigation: do connectionist models learn cross-dependency recursion better than nesting recursion? Aand has Jantje the teacher the marbles let help collect up. Aand has Jantje the teacher the marbles up collecthelp let.

  29. [Figure: learning results; annotation: “poor learning with 5 HUs (hidden units)”] • These were small artificial languages, so both networks ultimately learned well • But the center-embedding recursions were learned more slowly

  30. Connectionism summary • Stronger association posited between constraints on acquisition and constraints on on-line processing among adults • Evidence that cross-serial dependencies are easier for a network architecture to learn • N.B. Steedman 1999 has critiqued these neural networks for essentially learning n-gram surface structure • a surface n-gram learner would indeed do better with cross-serial dependencies • Underlying issue: do networks learn hierarchical structure? If not, what learning biases would make them do so?

  31. Today • Other information-theoretic models: Hale 2003 • Connectionist models • General discussion & course wrap-up

  32. Course summary: Probabilistic models in psycholinguistics • Differences in technical details of models should not obscure major shared features & differences • Shared feature: use of probability to model incremental disambiguation • computational standpoint: achieves broad coverage with ambiguity management • psycholinguistic standpoint: evidence is strong that people do deploy probabilistic information online • the complex houses vs. the corporation fires • the {crook/cop} arrested by the detective • the group (to the mountain) led

  33. Course summary (2) • Differences: connection with observable behavior • Pruning: difficulty when a needed analysis is lost • Competition: difficulty when multiple analyses have substantial support • Reranking: difficulty when the distribution over analyses changes • Next-word prediction (surprisal, connectionism) • Note the close relationships among some of these approaches • Pruning and surprisal are both special types of reranking • Competition has conceptual similarity to attention shift

  34. Course wrap-up: general ideas • In most cases, the theories presented here are not mutually exclusive, e.g.: • Surprisal and entropy-reduction are presented as full-parallel, but could also be limited-parallel • Attention-shift effects could coexist with competition, surprisal, or entropy reduction • In some cases, theories say very different things, e.g.: • Ambiguity very different under competition vs. surprisal • the {daughter/son} of the colonel who shot himself… • But even in these cases, different theories may have different natural domains of explanation

  35. Course wrap-up: unresolved issues • Serial vs. parallel sentence parsing • More of a continuum than a dichotomy • Empirically distinguishing serial vs. parallel is difficult • Practical limitations of probabilistic models • Our ability to estimate surface (n-gram) models is probably beyond that of humans (we have more data!) • But our ability to estimate structured (syntactic, semantic) probabilistic models is poor! • Less annotated data, no real-world semantics • Unsupervised methods still poorly understood • We’re still happy if monotonicities match human data

  36. Course wrap-up: unexplored territory • Best-first search strategies for modeling reading times • possible basis for heavily serial computational parsing models • Entropy as a direct measure of various types of uncertainty • Entropy of P( Tree | String ) as a measure of uncertainty as to what has been said • Entropy of P( wi | w1…i-1 ) as a measure of uncertainty as to what may yet be said • Models of processing load combining probabilistic and non-probabilistic factors
