The Harmonic Mind Paul Smolensky Cognitive Science Department Johns Hopkins University with: Géraldine Legendre Donald Mathis Melanie Soderstrom Alan Prince Peter Jusczyk†
Advertisement The Harmonic Mind: From neural computation to optimality-theoretic grammar Paul Smolensky & Géraldine Legendre • Blackwell 2002 (??) • Develop the Integrated Connectionist/Symbolic (ICS) Cognitive Architecture • Apply to the theory of grammar • Present a case study in formalist multidisciplinary cognitive science; show inputs/outputs of ICS
Talk Outline ‘Sketch’ the ICS cognitive architecture, pointing to contributions from/to traditional disciplines • Connectionist processing as optimization • Symbolic representations as activation patterns • Knowledge representation: Constraints • Constraint interaction I: Harmonic Grammar, Parser • Explaining ‘productivity’ in ICS (Fodor et al. ‘88 et seq.) • Constraint interaction II: Optimality Theory (‘OT’) • Nativism I: Learnability theory in OT • Nativism II: Experimental test • Nativism III: UGenome
Processing I: Activation • Computational neuroscience → ICS • Key sources: • Hopfield 1982, 1984 • Cohen and Grossberg 1983 • Hinton and Sejnowski 1983, 1986 • Smolensky 1983, 1986 • Geman and Geman 1984 • Golden 1986, 1988
Processing I: Activation • Processing (spreading activation) is optimization: Harmony maximization • [Figure: two-unit network; units a1 and a2 joined by a connection of weight –λ (–0.9), with external inputs i1 (0.6) and i2 (0.5)]
Processing II: Optimization • Cognitive psychology → ICS • Key sources: • Hinton & Anderson 1981 • Rumelhart, McClelland, & the PDP Group 1986 • Processing (spreading activation) is optimization: Harmony maximization • [Figure: two-unit network; units a1 and a2 joined by a connection of weight –λ (–0.9), with external inputs i1 (0.6) and i2 (0.5)]
Processing II: Optimization • Processing (spreading activation) is optimization: Harmony maximization • Harmony maximization is satisfaction of parallel, violable constraints: • a1 must be active (strength: 0.6) • a2 must be active (strength: 0.5) • a1 and a2 must not be simultaneously active (strength: λ = 0.9) • Optimal compromise: a1 = 0.79, a2 = –0.21 • [Figure: two-unit network; connection weight –λ (–0.9), inputs i1 (0.6) and i2 (0.5)]
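A minimal sketch of the two-unit example, assuming continuous activations and a quadratic Harmony function H(a) = i1·a1 + i2·a2 – λ·a1·a2 – ½(a1² + a2²). The –½(a1² + a2²) term, the step size, and the function names are assumptions made for illustration; the slide does not give the formula. Under this assumption, gradient ascent on H settles at the optimal compromise quoted on the slide.

```python
# Assumed Harmony function (not given on the slide):
#   H(a) = I1*a1 + I2*a2 - LAM*a1*a2 - 0.5*(a1**2 + a2**2)
LAM, I1, I2 = 0.9, 0.6, 0.5  # connection strength and inputs from the slide


def harmony(a1, a2):
    """Harmony of an activation pattern under the assumed quadratic form."""
    return I1 * a1 + I2 * a2 - LAM * a1 * a2 - 0.5 * (a1 * a1 + a2 * a2)


def settle(steps=5000, rate=0.05):
    """Spread activation by gradient ascent on H, starting from rest."""
    a1 = a2 = 0.0
    for _ in range(steps):
        g1 = I1 - LAM * a2 - a1  # dH/da1
        g2 = I2 - LAM * a1 - a2  # dH/da2
        a1 += rate * g1
        a2 += rate * g2
    return a1, a2


a1, a2 = settle()
print(round(a1, 2), round(a2, 2))
```

Neither "must be active" constraint is fully satisfied: the stronger input wins, and the weaker unit is pushed slightly below zero by the inhibitory connection.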
Two Fundamental Questions • Harmony maximization is satisfaction of parallel, violable constraints • 1. (Prior question) What are the activation patterns (data structures, mental representations) evaluated by these constraints? • 2. What are the constraints? (Knowledge representation)
Representation • Symbolic theory → ICS • Complex symbol structures • Generative linguistics → ICS • Particular linguistic representations • PDP connectionism → ICS • Distributed activation patterns • ICS: • realization of (higher-level) complex symbolic structures in distributed patterns of activation over (lower-level) units (‘tensor product representations’ etc.)
Representation • [Figure: syllable tree σ dominating k, æ, t, i.e. [σ k [æ t]], realized as a sum of filler/role bindings: σ/rε, k/r0, æ/r01, t/r11]
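A minimal sketch of how the filler/role bindings σ/rε, k/r0, æ/r01, t/r11 could be realized as a tensor product representation. The particular filler and role vectors below are hypothetical; the only property relied on is that the role vectors are orthonormal, which makes exact unbinding possible.

```python
# Hypothetical orthonormal role vectors for the tree positions on the slide.
ROLES = {
    "r_eps": [0.5, 0.5, 0.5, 0.5],
    "r0": [0.5, -0.5, 0.5, -0.5],
    "r01": [0.5, 0.5, -0.5, -0.5],
    "r11": [0.5, -0.5, -0.5, 0.5],
}
# Hypothetical filler vectors (they need not be orthogonal).
FILLERS = {
    "sigma": [1.0, 0.0, 0.0],
    "k": [0.0, 1.0, 0.0],
    "ae": [0.0, 0.0, 1.0],
    "t": [1.0, 1.0, 0.0],
}
BINDINGS = [("sigma", "r_eps"), ("k", "r0"), ("ae", "r01"), ("t", "r11")]


def bind(filler, role):
    """Outer product filler (x) role: one filler/role binding."""
    return [[f * r for r in role] for f in filler]


def superimpose(bindings):
    """Sum the bindings into a single distributed activation pattern."""
    total = [[0.0] * 4 for _ in range(3)]
    for filler_name, role_name in bindings:
        outer = bind(FILLERS[filler_name], ROLES[role_name])
        for i in range(3):
            for j in range(4):
                total[i][j] += outer[i][j]
    return total


def unbind(tensor, role):
    """Recover the filler bound to `role` (exact when roles are orthonormal)."""
    return [sum(t * r for t, r in zip(row, role)) for row in tensor]


CAT = superimpose(BINDINGS)
print(unbind(CAT, ROLES["r01"]))  # filler vector of 'ae'
```

Every unit of the pattern takes part in several bindings at once, yet each constituent remains recoverable from the whole.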
Constraints • Linguistics (markedness theory) → ICS • ICS → Generative linguistics: Optimality Theory • Key sources: • Prince & Smolensky 1993 [ms.; Rutgers TR] • McCarthy & Prince 1993 [ms.] • Texts: Archangeli & Langendoen 1997, Kager 1999, McCarthy 2001 • Electronic archive: http://roa.rutgers.edu
Constraints • NOCODA: A syllable has no coda • [Figure: activation pattern a for ‘cat’, [σ k [æ t]], evaluated by weight matrix W; the coda t incurs a violation (*) of NOCODA] • H(a[σ k [æ t]]) = –s_NOCODA < 0
Constraint Interaction I • ICS → Grammatical theory • Harmonic Grammar • Legendre, Miyata, Smolensky 1990 et seq.
Constraint Interaction I • The grammar generates the representation that maximizes H: this best satisfies the constraints, given their differential strengths • Any formal language can be so generated. • [Figure: H of the syllable [σ k [æ t]] as a sum of local contributions: H(k, σ) > 0 (ONSET: Onset/k) and H(σ, t) < 0 (NOCODA: Coda/t)]
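A minimal sketch of generation by Harmony maximization: the Harmony of a structure is the sum of the contributions of its parts, and the grammar outputs the candidate with the highest total. The numerical strengths, the penalty for an unparsed segment, and the candidate set are hypothetical; only the signs H(k, σ) > 0 and H(σ, t) < 0 come from the slide.

```python
# Hypothetical Harmony contributions of (segment, position) pairs.
PAIR_HARMONY = {
    ("k", "onset"): 1.0,      # ONSET satisfied: H(k, sigma) > 0
    ("a", "nucleus"): 0.0,
    ("t", "coda"): -0.5,      # NOCODA violated: H(sigma, t) < 0
    ("t", "unparsed"): -2.0,  # assumed cost of leaving a segment out
}

# Hypothetical candidate structures for the input 'cat'.
CANDIDATES = {
    ".kat.": [("k", "onset"), ("a", "nucleus"), ("t", "coda")],
    ".ka.<t>": [("k", "onset"), ("a", "nucleus"), ("t", "unparsed")],
}


def harmony(structure):
    """Total Harmony: sum of the contributions of the structure's parts."""
    return sum(PAIR_HARMONY[pair] for pair in structure)


def generate(candidates):
    """Return the label of the candidate that maximizes Harmony."""
    return max(candidates, key=lambda label: harmony(candidates[label]))


print(generate(CANDIDATES), harmony(CANDIDATES[".kat."]))
```

With these strengths the coda is tolerated because dropping t costs more than violating NOCODA; different strengths would reverse the outcome.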
Harmonic Grammar Parsing • Simple, comprehensible network • Simple grammar G: • X → A B • Y → B A • Language Parsing • [Figure: top-down processing from X, Y to the strings A B, B A; bottom-up processing from A B, B A to X, Y]
Simple Network Parser • Fully self-connected, symmetric network • Like the previously shown network, except with 12 units; representations and connections shown below • [Figure: weight matrix W]
Explaining Productivity • Approaching full-scale parsing of formal languages by neural-network Harmony maximization • Have other networks that provably compute recursive functions ⇒ productive competence • How to explain?
Proof of Productivity • Productive behavior follows mathematically from combining • the combinatorial structure of the vectorial representations encoding inputs & outputs and • the combinatorial structure of the weight matrices encoding knowledge
Constraint Interaction II: OT • ICS → Grammatical theory • Optimality Theory • Prince & Smolensky 1993
Constraint Interaction II: OT • Differential strength encoded in strict domination hierarchies: • Every constraint has complete priority over all lower-ranked constraints (combined) • Approximate numerical encoding employs special (exponentially growing) weights • “Grammars can’t count” — question period
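A minimal sketch contrasting strict domination with its approximate numerical encoding. Violation profiles are listed from the highest-ranked constraint to the lowest; the profiles and the weight bases are hypothetical. Strict domination is a lexicographic comparison, and exponentially growing weights reproduce it only when the base exceeds the largest violation count.

```python
def ot_better(viols_a, viols_b):
    """True if a beats b under strict domination (lexicographic comparison)."""
    return viols_a < viols_b


def weighted_harmony(viols, base):
    """Harmony with exponentially growing weights base**(n-1), ..., base, 1."""
    n = len(viols)
    return -sum(v * base ** (n - 1 - k) for k, v in enumerate(viols))


A = [1, 0, 0]  # one violation of the top-ranked constraint
B = [0, 3, 3]  # several violations, all of lower-ranked constraints

print(ot_better(B, A))                                    # strict domination
print(weighted_harmony(B, 10) > weighted_harmony(A, 10))  # base 10 agrees
print(weighted_harmony(B, 2) > weighted_harmony(A, 2))    # base 2 disagrees
```

With base 2 the lower-ranked violations add up and overturn the top-ranked constraint, which is exactly the counting behavior that strict domination rules out.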
Constraint Interaction II: OT • Constraints are universal (Con) • Candidate outputs are universal (Gen) • Human grammars differ only in how these constraints are ranked • ‘factorial typology’ • First true contender for a formal theory of cross-linguistic typology • 1st innovation of OT: constraint ranking • 2nd innovation: ‘Faithfulness’
The Faithfulness / Markedness Dialectic • ‘cat’: /kat/ → kæt violates NOCODA (*NOCODA): why? • FAITHFULNESS requires pronunciation = lexical form • MARKEDNESS often opposes it • Markedness-Faithfulness dialectic → diversity • English: FAITH ≫ NOCODA • Polynesian: NOCODA ≫ FAITH (~French) • Another markedness constraint M: • Nasal Place Agreement [‘Assimilation’] (NPA): • velar: ŋg ≻ ŋb, ŋd • coronal: nd ≻ md, ŋd • labial: mb ≻ nb, ŋb
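A minimal sketch of how re-ranking the same universal constraints yields different languages. The two-candidate set for /kat/ (faithful "kat" versus coda-less "ka") and its violation counts are a hypothetical simplification; enumerating all rankings gives a toy factorial typology.

```python
from itertools import permutations

# Hypothetical candidates for the input /kat/ with their violation counts.
CANDIDATES = {
    "kat": {"FAITH": 0, "NOCODA": 1},  # faithful, but has a coda
    "ka": {"FAITH": 1, "NOCODA": 0},   # no coda, but unfaithful
}


def optimal(ranking):
    """Winner under strict domination: compare violations in ranking order."""
    return min(CANDIDATES, key=lambda c: [CANDIDATES[c][con] for con in ranking])


# Factorial typology: one grammar per ranking of the constraints.
typology = {r: optimal(r) for r in permutations(["FAITH", "NOCODA"])}
for ranking, winner in typology.items():
    print(" >> ".join(ranking), "->", winner)
```

The ranking FAITH ≫ NOCODA keeps the coda, as in English; NOCODA ≫ FAITH gives it up, as in the Polynesian pattern.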
Optimality Theory • Diversity of contributions to theoretical linguistics • Phonology • Syntax • Semantics • Here: New connections between linguistic theory & the cognitive science of language more generally • Learning • Neuro-genetic encoding
Nativism I: Learnability • Learning algorithm • Provably correct and efficient (under strong assumptions) • Sources: • Tesar 1995 et seq. • Tesar & Smolensky 1993, …, 2000 • If you hear A when you expected to hear E, increase the Harmony of A above that of E by minimally demoting each constraint violated by A below a constraint violated by E
Constraint Demotion Learning • If you hear A when you expected to hear E, increase the Harmony of A above that of E by minimally demoting each constraint violated by A below a constraint violated by E • Correctly handles difficult case: multiple violations in E
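A minimal sketch of one constraint demotion step in the spirit of the rule above. The data structures (a dict mapping each constraint to a stratum number, with 0 the highest) and the example forms are assumptions for illustration, not the published algorithm's notation. Shared violations are cancelled first; each constraint still violated by the heard form A is then demoted to just below the highest-ranked constraint violated by the expected form E.

```python
def demote(strata, heard, expected):
    """One demotion step.

    strata:   dict constraint -> stratum number (0 is the highest rank)
    heard:    dict constraint -> violations of the form actually heard (A)
    expected: dict constraint -> violations of the form the learner expected (E)
    """
    # Cancel shared marks: only net violations matter.
    heard_marks = [c for c in strata if heard.get(c, 0) > expected.get(c, 0)]
    expected_marks = [c for c in strata if expected.get(c, 0) > heard.get(c, 0)]
    if not expected_marks:
        return dict(strata)  # no ranking can make A beat E
    # Highest-ranked constraint violated by E.
    top = min(strata[c] for c in expected_marks)
    new_strata = dict(strata)
    for c in heard_marks:
        if new_strata[c] <= top:
            new_strata[c] = top + 1  # minimal demotion: just below `top`
    return new_strata


# Hypothetical initial state with markedness over faithfulness.
initial = {"NOCODA": 0, "FAITH": 1}
learned = demote(initial, heard={"NOCODA": 1}, expected={"FAITH": 1})
print(learned)
```

Taking the minimum over all constraints violated by E is what lets the step cope with multiple violations in E: a single dominating constraint suffices to make A the more harmonic form.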
Nativism I: Learnability • M ≫ F is learnable with /in+possible/ → impossible • ‘not’ = in- except when followed by … • “exception that proves the rule, M = NPA” • M ≫ F is not learnable from data if there are no ‘exceptions’ (alternations) of this sort, e.g., if the lexicon produces only inputs with mp, never np: then both M and F are satisfied, so there is no M vs. F conflict and no evidence for their ranking • Thus must have M ≫ F in the initial state, ℌ0
Nativism II: Experimental Test • Collaborators • Peter Jusczyk • Theresa Allocco • (Elliott Moreton, Karen Arnold)
Nativism II: Experimental Test • Linking hypothesis: More harmonic phonological stimuli ⇒ Longer listening time • More harmonic: • M ≻ *M, when equal on F • F ≻ *F, when equal on M • When one must choose one or the other, it is more harmonic to satisfy M: M ≫ F • M = Nasal Place Assimilation (NPA)
Nativism III: UGenome • Can we combine • Connectionist realization of harmonic grammar • OT’s characterization of UG to examine the biological plausibility of UG as innate knowledge? • Collaborators • Melanie Soderstrom • Donald Mathis
Nativism III: UGenome • The game: take a first shot at a concrete example of a genetic encoding of UG in a Language Acquisition Device — no commitment to its (in)correctness • Introduce an ‘abstract genome’ notion parallel to (and encoding) ‘abstract neural network’ • Is connectionist empiricism clearly more biologically plausible than symbolic nativism? No!
The Problem • No concrete examples of such a LAD exist • Even highly simplified cases pose a hard problem: How can genes, which regulate production of proteins, encode symbolic principles of grammar? • Test preparation: Syllable Theory
Basic syllabification: Function • ƒ: /underlying form/ → [surface form] • Plural form of dish: • /dɪš+s/ → [.dɪ.šəz.] • /CVCC/ → [.CV.CVC.]
Basic syllabification: Function • ƒ: /underlying form/ → [surface form] • Plural form of dish: • /dɪš+s/ → [.dɪ.šəz.] • /CVCC/ → [.CV.CVC.] • Basic CV Syllable Structure Theory • Prince & Smolensky 1993: Chapter 6 • ‘Basic’: No more than one segment per syllable position: .(C)V(C).
Basic syllabification: Function • ƒ: /underlying form/ → [surface form] • Plural form of dish: • /dɪš+s/ → [.dɪ.šəz.] • /CVCC/ → [.CV.CVC.] • Basic CV Syllable Structure Theory • Correspondence Theory • McCarthy & Prince 1995 (‘M&P’) • /C1V2C3C4/ → [.C1V2.C3 V C4.]
Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output — “no deletion” [M&P 95: ‘MAX’]
Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output • FILLV/C: Every output V/C segment corresponds to an input V/C segment [every syllable position in the output is filled by an input segment] — “no insertion/epenthesis” [M&P 95: ‘DEP’]
Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output • FILLV/C: Every output V/C segment corresponds to an input V/C segment • ONSET: No V without a preceding C
Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output • FILLV/C: Every output V/C segment corresponds to an input V/C segment • ONSET: No V without a preceding C • NOCODA: No C without a following V
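A minimal sketch of how the four constraints could be evaluated on candidate syllabifications of /CVCC/. The candidate encoding is an assumption: each syllable is a string in which uppercase C/V are input segments and lowercase c/v are epenthetic, and deleted input segments are counted separately. The candidate set and the two rankings are hypothetical illustrations.

```python
def violations(syllables, n_deleted):
    """Count violations of the four CV-theory constraints."""
    segments = "".join(syllables)
    return {
        "PARSE": n_deleted,                                     # no deletion
        "FILL": sum(s.islower() for s in segments),             # no epenthesis
        "ONSET": sum(syl[0].upper() != "C" for syl in syllables),
        "NOCODA": sum(syl[-1].upper() == "C" for syl in syllables),
    }


# Hypothetical candidates for /CVCC/; <C> marks a deleted input segment.
CANDIDATES = {
    ".CV.CvC.": (["CV", "CvC"], 0),
    ".CVC.<C>": (["CVC"], 1),
    ".CV.Cv.Cv.": (["CV", "Cv", "Cv"], 0),
    ".CV.<C><C>": (["CV"], 2),
}


def optimal(ranking):
    """Winner under strict domination for the given constraint ranking."""
    def profile(label):
        marks = violations(*CANDIDATES[label])
        return [marks[con] for con in ranking]
    return min(CANDIDATES, key=profile)


print(optimal(("PARSE", "FILL", "ONSET", "NOCODA")))
print(optimal(("NOCODA", "PARSE", "FILL", "ONSET")))
```

Under the first (assumed) ranking the winner has the shape .CV.CVC. of the dish example, with one epenthetic vowel; promoting NOCODA to the top instead forces a second epenthetic vowel.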
Network Architecture • /C1 C2/ → [C1 V C2] • [Figure: network of C and V units mapping the input /C1 C2/ to the output [C1 V C2]]
Connection substructure • Local (fixed, genetically determined): content of constraint 1 • Global (variable during learning): strength of constraint 1 • Network weight: • Network input: ι = WΨ a • [Figure: a single connection with its substructure; labels s1, s2, i, 1, 2]
PARSE • All connection coefficients are +2 • [Figure: PARSE connections among the C and V units]
ONSET • All connection coefficients are 1 • [Figure: ONSET connections among the C and V units]
Network Dynamics • Activation: stochastic, binary Boltzmann Machine/Harmony Theory dynamics (T → 0); maximizes Harmony • Learning: gradient descent in error • During the processing of training data in phase P, whenever unit φ (of type Φ) and unit ψ (of type Ψ) are simultaneously active, modify s_i by ε
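A minimal sketch of stochastic, binary Boltzmann-style activation dynamics with the temperature lowered toward zero. The two-unit network, the logistic update rule, and the annealing schedule are assumptions chosen for illustration; they are not the network of the talk. Each unit turns on with a probability that grows with its net input, and as T falls the updates become nearly deterministic Harmony ascent.

```python
import math
import random

INPUT = [0.6, 0.5]  # hypothetical external inputs
WEIGHT = -0.9       # hypothetical inhibitory connection between the units


def harmony(a):
    """Harmony of a binary state (a[0], a[1])."""
    return INPUT[0] * a[0] + INPUT[1] * a[1] + WEIGHT * a[0] * a[1]


def anneal(sweeps=200, t_start=1.0, t_end=0.01, seed=0):
    """Stochastic binary updates while the temperature decays geometrically."""
    rng = random.Random(seed)
    a = [0, 0]
    for s in range(sweeps):
        temp = t_start * (t_end / t_start) ** (s / (sweeps - 1))
        for u in (0, 1):
            net = INPUT[u] + WEIGHT * a[1 - u]  # Harmony gain of turning u on
            p_on = 1.0 / (1.0 + math.exp(-net / temp))
            a[u] = 1 if rng.random() < p_on else 0
    return tuple(a)


states = [(0, 0), (0, 1), (1, 0), (1, 1)]
best = max(states, key=harmony)
final = anneal()
print(final, harmony(final), "global maximum:", best)
```

Annealing makes it likely, but does not guarantee, that the final state is the global Harmony maximum; a run can also settle in a local maximum such as (0, 1) in this toy network.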
Crucial Open Question (Truth in Advertising) • Relation between strict domination and neural networks? • Apparently not a problem in the case of the CV Theory