Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA
Lecture 10. Phonology
• Foundations of distinctive feature theory
  • Writing, phonemes, and the place-manner distinction
  • A universal encoding of phonemes: Bell
  • A universal binary encoding of phonemes: Jakobson
• Distinctive features and speech perception
  • Information theory: Shannon
  • Independence of feature errors: Miller
  • Dependence of distinctive feature acoustics!!
• Distinctive features and speech articulation
  • Rule-based modification of distinctive features: Halle
  • Syllable boundaries and word boundaries
  • Psychological processing, the lexicon, and surface phonology
  • Spreading of autosegmental distinctive features: Goldsmith
• Unification of perceptual and articulatory accounts: Stevens
  • The quantal theory of speech production
  • Articulator-free features, articulator-bound features, and landmarks
  • Language-specific features
  • Redundant features
A Brief History of Writing (Diamond, Guns, Germs, and Steel, 1999)
• Writing developed 5000 years ago, independently in at least three places:
  • Mesopotamia (modern Iraq): recording commercial transactions
  • Egypt (first dynasty, 3150 BCE): recording imperial conquest
  • China (Shang dynasty): recording divinations
• Symbols were meaning-based (word glyphs)
  • A word glyph could also be used to record similar-sounding nonsense words or foreign words
  • "Narmer," first pharaoh of Egypt, 3150 BCE
  • "Meiguo" = "America" in modern Chinese
• The Phoenicians (3000 years ago) were the first to abandon word glyphs in favor of a purely sound-based writing system
  • The Phoenicians built trading empires based in Lebanon (Tyre and Sidon) and Carthage; their empires were independent until the Roman conquest
  • "Phoenician" is the Greek name for them
  • The 22 symbols of their alphabet were based on word glyphs (e.g., "aleph" = "ox"), but they kept only the 22 symbols necessary to write all the sounds of their language and discarded the remainder
• "Phoneme" means, roughly, "Phoenician unit"
A Few Phonemic & Syllabic Writing Systems
• Scripts related to the Phoenician symbols:
  • Hebrew (א, ב, ג, ד, ה – alef, bet, gimel, dalet, he)
  • Greek (α, β, γ, δ, ε – alpha, beta, gamma, delta, epsilon)
  • Cyrillic (а, б, в, г, д – a, be, ve, ghe, de)
  • Latin (a, b, c, d, e, …)
  • Arabic (ا, ب, ت, ج, ح – alef, beh, teh, jeem, hah)
• Nubian (modern Sudan; independently developed in response to Egyptian writing)
• Indian: Hindi, Punjabi, Bengali, Gujarati, Grantha, …
• East Asian: Korean (hangul), Japanese (hiragana, katakana), Tagalog (baybayin)
• American Indian: Cherokee (independently designed in response to English)
Manner/Place Distinction: Hangul (King Sejong of Joseon, Hunmin jeongeum, 1446)
• Origin of Hangul
  • Attributed to King Sejong of Joseon (Korea), 15th century
  • Goal: literacy for commoners who could not read Chinese
  • Therefore the system should be as easy as possible to memorize
• Innovation: the manner/place distinction
  • Place of articulation is encoded using symbols representing the shape of the articulator
  • 5 places of articulation: labial (/m/: ㅁ), alveolar (/n/: ㄴ), dental? (/s/: ㅅ), velar (/g/: ㄱ), glottal (/ŋ/: ㅇ)
  • Manner of articulation is changed from nasal/fricative to stop/affricate by adding one stroke: /m/→/b/, /n/→/d/, /s/→/ǰ/
  • An obstruent is made unvoiced by adding a second stroke: /b/→/p/, /d/→/t/, /ǰ/→/č/, /ŋ/→/h/
  • Exception: /g/ = the base symbol for velar place; /k/ has just one extra stroke
Universal Encoding of Phonemes Using a Manner/Place Encoding (Bell, Visible Speech, 1867)
• Goal: an international phonetic alphabet
  • If any language in the world distinguishes two sounds, then Bell's alphabet should distinguish them
• Proposed encoding:
  • Base symbol = half-moon or "C" shape
  • Place encoding: the angle of the symbol
  • Manner and voicing: extra strokes across the symbol
• (End result: the system was scrapped because it was too expensive to typeset with movable-type printing. The modern IPA was developed instead.)
Universal Binary Encoding of Phonemes (Jakobson, Fant, and Halle, 1952)
• Jakobson's minimalist program:
  • An 8-way place distinction is composed of exactly 3 binary distinctive features
  • Distinctive features are not arbitrary; they are based on physical properties of the sound (its articulation or its acoustics)
• Contribution of the engineer (Fant):
  • Distinctive features are based on spectral shape
• Jakobson's consonantal distinctive features (in most cases, the first feature is the more frequently used):
  • Acute vs. Grave (spectrum tilts up, like /θ,t,s,č,š/, or down, like /p,f,ţ,ş,k,h/)
  • Diffuse vs. Compact (broad distribution of energy, like /m,p,f,θ,n,t,s/, or one narrow peak, like /č,š,ţ,ş,k,h/)
  • Strident vs. Mellow (high-energy frication, like /s,š,ş/, or low-energy frication, like /f,θ,h/)
  • Voiced vs. Unvoiced (/b,v,ð,d,z,ǰ,ž,g/ vs. /p,f,θ,t,s,č,š,k/)
  • Sonorant vs. Obstruent (unobstructed voicing, like /m,n,ŋ/, vs. obstructed voicing, like /p,b,f,v,t,d,s,z,k,g/)
Binary Encoding of Phonemes (Jakobson, Fant, and Halle, 1952)
• Jakobson's minimalist program:
  • All phoneme distinctions are binary. For example, if there are 10 places of articulation in the world's languages, there must be ceil(log2(10)) = 4 binary distinctive features to encode place of articulation.
  • Distinctive features are not arbitrary; they are based on physical properties of the sound (its articulation or its acoustics)
• Apparent contribution of the engineer (Fant):
  • Distinctive features are based on spectral shape
• Jakobson's consonantal distinctive features:
  • Acute vs. Grave (spectrum tilts up, like /θ,t,s,č,š/, or down, like /p,f,ţ,ş,k,h/)
  • Diffuse vs. Compact (broad distribution of energy, like /m,p,f,θ,n,t,s/, or one narrow peak, like /č,š,ţ,ş,k,h/)
  • Strident vs. Mellow (high-energy frication, like /s,š,ş/, or low-energy frication, like /f,θ,h/)
  • Voiced vs. Unvoiced (/b,v,ð,d,z,ǰ,ž,g/ vs. /p,f,θ,t,s,č,š,k/)
  • Nasal vs. Non-nasal (unobstructed voicing, like /m,n,ŋ/, vs. obstructed voicing, like /p,b,f,v,t,d,s,z,k,g/)
  • … etcetera. There were 12 features in the original set.
Speech Perception: Distinctive Features as an Encoding for a Communications Channel
Information Theory (Shannon, 1948)
• A mathematical definition of the word "information"
• "Entropy" = the degree to which the outcome of some random process is impossible to predict
  • H = −∫ p(x) log2 p(x) dx = −E[log2 p(x)] (see the numeric sketch below)
  • Why it makes sense: the entropy of a coin thrown twice is 2× the entropy of a coin thrown once (p(x1,x2) = p(x1)p(x2), so H(x1,x2) = H(x1) + H(x2))
  • One "bit" = the amount of entropy in one unbiased coin toss
• "Information" = the amount by which your uncertainty about x is reduced if you know y:
  • I(x,y) = H(x) − H(x|y)
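A minimal Python sketch (not from the lecture) of these definitions: the entropy of a discrete distribution, confirming the coin-toss argument that independent tosses add entropy.

```python
import numpy as np

def entropy(p):
    """Entropy in bits, H = -sum p(x) log2 p(x), ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# One unbiased coin toss = one bit of entropy.
print(entropy([0.5, 0.5]))                                  # 1.0

# Two independent tosses: p(x1,x2) = p(x1)p(x2), so H(x1,x2) = H(x1) + H(x2).
print(entropy(np.outer([0.5, 0.5], [0.5, 0.5]).ravel()))    # 2.0
```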
Information Theory (Shannon, 1948)
• "Channel capacity" = the rate at which information can be conveyed through a noisy channel, in bits/second (a worked example follows below)
  • C_AB = max_A I(x, x*), where x = input to the channel per second, A(x) = symbols transmitted over the channel, B(x) = symbols recovered at the receiver, and x* = the maximum posterior probability estimate of x given B(x)
• Shannon's theorem:
  • Given a decoding machine with enough memory, it is always possible to achieve channel capacity using an A(x) of the following form:
    • Encode x using the smallest possible number of bits
    • Add redundancy until the information rate is reduced to C_AB, then transmit
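To make capacity concrete, here is a hedged example using the textbook closed form for a binary symmetric channel, C = 1 − H(p) bits per channel use; this toy channel illustrates the definition and is not the acoustic channel discussed later.

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_capacity(crossover):
    """Capacity (bits per use) of a binary symmetric channel: C = 1 - H(p)."""
    return 1.0 - binary_entropy(crossover)

print(bsc_capacity(0.0))    # 1.0: a noiseless channel carries one full bit per use
print(bsc_capacity(0.11))   # ~0.50: at an 11% bit-flip rate, half a bit per use
```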
Miller's Linguistic Interpretation of Information Theory (Miller and Nicely, 1955)
• The "channel" for speech = the acoustic channel
  • Information can be conveyed over 32 critical bands of the human ear
  • The rate at which information can be transmitted in each band depends on the SNR in that band
• The minimal binary encoding of speech = distinctive features
• The "encoding" is the speech production mechanism: slow movements of the tongue and lips add redundancy matched to the noise conditions of the channel
  • When the channel is bad, the speaker slows down and speaks clearly
  • When the channel is good, the main source of error is pronunciation variability, i.e., the talker leaves out phonemes that are only important for redundancy
In the Perceptual Space, Distinctive Feature Errors are Independent (Miller and Nicely, 1955)
• Experimental method:
  • Subjects listen to nonsense syllables mixed with noise (white noise or band-pass filtered)
  • Subjects write the consonant they hear
• Results: p(q*|q, SNR, BPF) ≈ ∏i p(fi*|fi, SNR, BPF), where (see the sketch below)
  • q* = consonant label heard by the listener
  • q = true consonant label
  • F* = [f1*, …, f6*] = perceived distinctive feature labels
  • F = [f1, …, f6] = true distinctive feature labels: [±nasal, ±voiced, ±fricated, ±strident, ±lips, ±blade]
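The independence model translates directly into code. The per-feature error rates below are invented for illustration (the measured rates are in Miller and Nicely's confusion matrices); the point is that six per-feature probabilities determine the probability of every one of the 64 possible heard feature vectors.

```python
import itertools

FEATURES = ["nasal", "voiced", "fricated", "strident", "lips", "blade"]
P_ERROR = {"nasal": 0.05, "voiced": 0.15, "fricated": 0.10,   # hypothetical
           "strident": 0.08, "lips": 0.12, "blade": 0.12}     # error rates

def p_heard(true_f, heard_f):
    """p(q*|q) under the independence model: a product over features of
    p(f_i*|f_i), each feature flipping independently of the others."""
    prob = 1.0
    for name, t, h in zip(FEATURES, true_f, heard_f):
        prob *= P_ERROR[name] if t != h else 1.0 - P_ERROR[name]
    return prob

truth = (1, 1, 0, 0, 0, 1)    # one consonant's six true feature values
print(p_heard(truth, truth))  # probability all six features are heard correctly
# The model is a proper distribution: the 64 probabilities sum to one.
print(sum(p_heard(truth, h) for h in itertools.product((0, 1), repeat=6)))
```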
Consonant Confusions at −6 dB SNR (confusion-matrix figure; distinctive features: ±nasal, ±voiced, ±fricative, ±strident)
In the Acoustic Space, Distinctive Features are Not Independent (Volaitis and Miller, 1992)
• [±voiced] for English stops is mainly cued by voice onset time (VOT)
• VOT is also a cue for place of articulation: velar > alveolar > labial
• p(place, voicing|VOT) ≠ p(place|VOT) p(voicing|VOT)
Speech Production: Distinctive features explain the related pronunciations of morphologically related words
The English Plural
• Three standard plural endings:
  • Cat → cats (/s/)
  • Dog → dogs (/z/)
  • Dish → dishes (/əz/)
• Observation: you can predict the correct plural by observing the last phoneme of the word
• Algorithm #1: create a lookup table, as sketched below
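A minimal sketch of Algorithm #1 (my Python, with ad hoc ASCII phoneme symbols; "@z" stands for /əz/): the table must list every phoneme of English explicitly.

```python
# Lookup table from the word's final phoneme to its plural ending.
PLURAL_TABLE = {
    "s": "@z", "z": "@z", "sh": "@z", "zh": "@z", "ch": "@z", "jh": "@z",
    "p": "s", "t": "s", "k": "s", "f": "s", "th": "s",
    "b": "z", "d": "z", "g": "z", "v": "z", "m": "z", "n": "z", "ng": "z",
    # ...and so on for every remaining phoneme of English...
}

def plural_lookup(final_phoneme):
    return PLURAL_TABLE[final_phoneme]

print(plural_lookup("t"))    # "s"  (cat -> cats)
print(plural_lookup("g"))    # "z"  (dog -> dogs)
print(plural_lookup("sh"))   # "@z" (dish -> dishes)
```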
The English Plural (Chomsky and Halle, Sound Pattern of English, 1968)
• Algorithm #2: create 3 rules (sketched in code below)
  • Use /əz/ if the last phoneme is [+strident]: PLURAL → /əz/ / [+strident] _
  • Otherwise, use /s/ if the last phoneme is [−voiced]: PLURAL → /s/ / [−voiced] _
  • Otherwise, use /z/: PLURAL → /z/
• General form of phonological rules: α becomes β in the context γ_δ:
  • α → β / γ _ δ
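The same behavior from Algorithm #2, in the same toy notation: three ordered rules keyed to two distinctive features of the final phoneme replace the exhaustive table. The feature assignments below are a small illustrative subset, not a full inventory.

```python
FEATURES = {  # final phoneme -> (strident, voiced); illustrative subset
    "t": (False, False), "k": (False, False),
    "g": (False, True),  "n": (False, True),
    "sh": (True, False), "z": (True, True),
}

def plural_rules(final_phoneme):
    strident, voiced = FEATURES[final_phoneme]
    if strident:        # PLURAL -> /@z/ / [+strident] _
        return "@z"
    if not voiced:      # PLURAL -> /s/  / [-voiced] _
        return "s"
    return "z"          # PLURAL -> /z/  elsewhere

for p in ("t", "g", "sh"):
    print(p, "->", plural_rules(p))   # t -> s, g -> z, sh -> @z
```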
Stress Shift in English
• Every English word has one stressed syllable. If the word has 3 or more syllables, stress usually (but not always) falls on the antepenultimate syllable (3rd from the end):
  • Professor (an exception to the antepenult rule)
  • Establishment
• Some suffixes cause the stress to shift:
  • Professorial
  • Establishmentarian
• Some don't:
  • Professorship
  • Establishmentarianism (stress stays where it was before "ism")
• Prefixes never cause a stress shift:
  • Unprofessorial
  • Antidisestablishmentarianism
Morphemes: Roots, Suffixes, and Prefixes
• Suffixes and prefixes are examples of "bound morphemes"
• "Morpheme" = a phoneme sequence with a meaning:
  • Root words (establish)
  • Words that are parts of compounds, e.g., rain|fall
  • Suffixes and prefixes
• "Bound morpheme" = a morpheme that never occurs on its own:
  • In English: most suffixes and prefixes
  • In Chinese: perhaps some words that used to be independent but now appear only as bound morphemes, e.g., "hu" in "hutong"?
Morpheme and Word Boundaries (Chomsky and Halle, Sound Pattern of English, 1968)
• Chomsky & Halle's model of speech production:
  • The morphemes of a sentence are sequenced:
    the | anti | dis | establish | ment | arian | s | went | home
  • Boundaries between roots and their type-1 suffixes are erased:
    the | anti | dis | establishmentarian | s | went | home
  • Type-1 rules, like the "antepenultimate stress rule," are applied:
    the | anti | dis | establishmentarian | s | went | home
  • Boundaries between roots and type-2 suffixes are erased. A recent proposal: some function words like "the" are also attached at this point, creating a unit called a "prosodic word":
    the-antidisestablishmentarians | went | home
  • Type-2 rules apply:
    theyantidisestablishmentarians | went | home
  • All remaining boundaries are erased, and speech is produced:
    the-antidisestablishmentarians | went | home
Psychological Processing and the Lexicon
• Experiment: play a word. The subject hits "W" if it's a meaningful word ("empower"), "N" if it's a nonsense word ("empriffle").
  • Measure the subject's reaction time, adjusted for word length.
• Result: subjects recognize two-morpheme words ("dis+like") just as fast as one-morpheme words ("power").
• Apparent conclusion: Chomsky & Halle's rules are usually "precompiled," not applied in real time.
  • If a listener has heard a word frequently enough, it is stored in her mental lexicon as a whole word.
  • If it's not stored in her mental lexicon, she tries to figure it out by morphological parsing ("zookeepership").
  • If morphological parsing fails, she concludes it's a nonsense word.
Inter-Word Phonology
• Many words change form in particular contexts:
  • "this ship" → "thish ship"
• Usually, only a very small number of phonological rule types can apply across word boundaries:
  • Place assimilation: /s/ in "this" takes the palatal place of /sh/ in "ship"
  • Manner assimilation: "in the" → "in ne," with a dental /n/
  • In Chinese: tone sandhi (tones change because of tone context)
• These changes occur very frequently
• Perhaps MOST of the changed forms are very uncommon, so they are not stored in the mental lexicon, but SOME of the most common and most severely reduced forms may actually get their own lexical entry so the listener can respond more quickly:
  • "did you" → "didja"
  • "I don't know" → "ǣəo"
  • These may be comparable to "multiword" lexical entries in a speech recognition lexicon
Autosegmental Phonology (Goldsmith, 1975)
• Inter-word phonological rules all have a simple form: manner or place assimilation
• Hypothesis: instructions to the speech articulators are arranged in "autosegmental tiers," i.e., on a kind of musical score with asynchronous rows
• Assimilation = feature spreading (a code sketch follows the diagram):

  Before spreading:              After [-anterior] spreads leftward:
    /s/          /sh/              /sh/         /sh/
    [-nasal]     [-nasal]          [-nasal]     [-nasal]
    [+strident]  [+strident]       [+strident]  [+strident]
    [+blade]     [+blade]          [+blade] (shared)
    [+anterior]  [-anterior]       [-anterior] (shared)
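Feature spreading is easy to mimic in code. This toy model (my sketch, not Goldsmith's formalism) stores one value per segment on each tier; assimilation relinks the first segment's [anterior] value to its right neighbor's, turning /s/+/sh/ into /sh/+/sh/.

```python
# Two segments ("this" final /s/, "ship" initial /sh/), four tiers.
tiers = {
    "nasal":    ["-", "-"],
    "strident": ["+", "+"],
    "blade":    ["+", "+"],
    "anterior": ["+", "-"],   # /s/ is [+anterior]; /sh/ is [-anterior]
}

def spread_left(tiers, tier):
    """Spread the right segment's value leftward on one tier (assimilation)."""
    out = {name: list(vals) for name, vals in tiers.items()}
    out[tier][0] = out[tier][1]
    return out

after = spread_left(tiers, "anterior")
print(after["anterior"])   # ['-', '-']: both segments now share [-anterior]
```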
Quantal Theory: Distinctive features are not "just perceptual" or "just articulatory": they arise from the relationship between articulation and perception
The Speech Chain
Discrete inputs (words, phonemes, distinctive features)
→ Speech production planning
→ Speech production: continuous muscle activation levels a(t)
→ Acoustic signal x(t)
→ Auditory perception: auditory nerve signals y(t)
→ Speech perception
→ Discrete outputs (words, nonsense)
Nonlinearities in the Speech Chain
• The mappings P: a(t)→x(t) (speech production) and R: x(t)→y(t) (perception) are highly nonlinear
• We have very good models, going back to the 1940s
  • Pick a particular vector a(t); we can usually estimate ∇P(a(t)), the local gradient of P (likewise R)
• There are many sigmoidal nonlinearities in both P and R (see the numeric sketch below)
  • Articulator position a(t) can vary quite a bit without affecting the acoustics, as long as a(t) stays within a stable region (a < a1* or a > a2*)
  • If a(t) crosses the transition region, the acoustics change a lot!
(Figure: sigmoidal plot of acoustics x(t) vs. articulation a(t), with stable regions a < a1* and a > a2*)
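A numerical sketch of such a sigmoid, with invented constants rather than measured articulatory data: the same 0.1 step in articulation barely moves the acoustics inside a stable region, but swings it dramatically across the transition region.

```python
import numpy as np

def P(a, a_mid=0.5, steepness=40.0):
    """Toy sigmoidal articulation-to-acoustics map (illustrative constants)."""
    return 1.0 / (1.0 + np.exp(-steepness * (a - a_mid)))

for a in (0.10, 0.20, 0.45, 0.55, 0.80, 0.90):
    print(f"a = {a:.2f}   x = {P(a):.4f}")
# Within a stable region (0.10 -> 0.20 or 0.80 -> 0.90), x barely changes;
# across the transition region (0.45 -> 0.55), x jumps from ~0.12 to ~0.88.
```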
The Quantal Theory of Speech Production (Stevens, 1989)
• The distinction between a > a2* and a < a1* is a robust distinction
  • In the case of P: robust to minor pronunciation variability
  • In the case of R: robust also to lots of added noise
• Therefore: put this distinction into Shannon's communication alphabet, in order to maximize the mutual information I(y,a) between produced speech and perceived speech
The Quantal Theory of Speech Production (Stevens, 1989)
• Hypothesis: every binary distinctive feature, in every language of the world, is a distinction (a > a2* vs. a < a1*) near a sigmoidal nonlinearity of P, or a distinction (x > x2* vs. x < x1*) near a sigmoidal nonlinearity of R
• Different languages choose different nonlinearities to focus on, but the number of useful sigmoids in P and R is finite; thus the number of distinctive features in the world's languages is finite
Examples
• Feature [anterior]: a nonlinearity of P
  • The alveolar ridge is a sigmoid-shaped bump in the hard palate
  • Moving the tongue tip back 1 cm, over the alveolar ridge, increases the front-cavity length by 2 cm, causing a big change in the front-cavity resonance frequency
  • Result: /s/ → /š/, [+anterior] → [−anterior]
• Feature [sonorant]: a nonlinearity of P
  • Opening the soft palate by just 2 mm during /d/ allows sonorant voicing to continue throughout the closure
  • Result: energy increases 20-30 dB during the closure
  • /d/ → /n/, [−sonorant] → [+sonorant]
• Feature [back]: a nonlinearity of R (checked numerically below)
  • When |F2−F1| < 3 Bark, both formants excite the same neurons, causing the perception of a single broad formant peak
  • When |F2−F1| > 3 Bark, 2 distinct formant peaks are perceived
  • |F2−F1| < 3 Bark: [+back] vowels /a,o,u/
  • |F3−F2| < 3 Bark: [−low,−back] vowels /i,e/
  • All formant peaks distinct: [+low,−back] vowels /æ,ɛ/
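The [back] example can be checked numerically. The sketch below assumes the Zwicker & Terhardt Hz-to-Bark approximation (other Bark formulas exist) and invented formant values, not measurements.

```python
import math

def hz_to_bark(f):
    """Zwicker & Terhardt (1980) approximation of the Bark scale."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def formants_merge(f_low, f_high):
    """True if two formants fall within 3 Bark, i.e., are predicted to be
    perceived as a single broad spectral peak."""
    return hz_to_bark(f_high) - hz_to_bark(f_low) < 3.0

# Illustrative formant values:
print(formants_merge(700, 1100))   # /a/-like F1, F2 -> True  ([+back])
print(formants_merge(300, 2300))   # /i/-like F1, F2 -> False (two peaks heard)
```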
What About Unused Sigmoids?
• If a language doesn't use a particular sigmoid, its listeners can do one of two things:
  • Learn to ignore that distinction
  • Use that sigmoid to "enhance" some other distinction
1. Perceptual Magnet Effect (Kuhl, 1992)
• The perceptual magnet effect shows that, from infancy, babies learn a neural map that enhances linguistically useful sigmoids and smooths out the less useful sigmoids
• Experiment: ask listeners to determine whether vowels are "same" or "different"
• Result: accuracy is best near the boundary between phonemes in your own language. The location of the boundary is native-language-dependent.
2. Enhancement Features (Stevens and Keyser, Language, 1989)
• Example: [voiced] in English
  • Based on the [voiced] distinction in Latin, which is a distinction between stops with and without voicing during closure
  • Long VOT enhances the perceptual sense that a stop is devoiced; therefore, over some period of history, [−voiced] stops became [+aspirated]
  • Some languages distinguish [+voiced,+aspirated], [+voiced,−aspirated], [−voiced,+aspirated], and [−voiced,−aspirated] stops. Such languages could not use [+aspirated] to enhance the perception of [−voiced].
  • In modern English, the "enhancing" feature is so strong that the "primary" feature (closure voicing) is often dropped
Articulator-Bound and Articulator-Free Features
• Articulator-bound features can only be implemented by one articulator:
  • [anterior] is bound to the tongue blade
  • [front] is bound to the tongue body
  • [voiced] is bound to the vocal folds
  • [nasal] is bound to the soft palate
• Articulator-free features can be implemented by the lips, tongue blade, or tongue body
• Key articulator-free features: [sonorant, continuant] (see the lookup sketch below)
  • [+sonorant,+continuant] = vowel or glide
  • [+sonorant,−continuant] = nasal
  • [−sonorant,−continuant] = stop
  • [−sonorant,+continuant] = fricative
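The four combinations are literally a small lookup table; here is a one-to-one transcription in Python.

```python
# (sonorant, continuant) -> manner class, exactly as listed above.
MANNER = {
    (True,  True):  "vowel or glide",
    (True,  False): "nasal",
    (False, False): "stop",
    (False, True):  "fricative",
}

print(MANNER[(False, True)])   # "fricative", e.g., /s/
print(MANNER[(True, False)])   # "nasal",     e.g., /n/
```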
Landmarks
• The "primary articulator" of a consonant is the articulator that implements [−sonorant] or [−continuant] for that phoneme
  • Lips, tongue blade, or tongue body
  • "Implements [−continuant]" = the articulator closes completely
• At the moment when the primary articulator closes, there is a BIG change in the acoustics (a detector sketch follows):
  • [+sonorant] → [−sonorant]: 10 dB at low frequencies
  • [+continuant] → [−continuant]: 10 dB at high frequencies
• This particular nonlinear change is called an "acoustic landmark"
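A minimal landmark-detector sketch along these lines (my simplification with assumed parameters, not the detector built in this course): compute per-frame band energy in dB and mark any frame where it jumps by about 10 dB.

```python
import numpy as np

def band_energy_db(frames):
    """frames: (n_frames, frame_len) array of band-passed samples."""
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

def landmarks(frames, threshold_db=10.0):
    """Indices of frames where band energy rises or falls abruptly."""
    jumps = np.diff(band_energy_db(frames))
    return np.where(np.abs(jumps) >= threshold_db)[0] + 1

# Toy signal: five quiet "closure" frames, then five loud "release" frames.
rng = np.random.default_rng(0)
quiet = 0.01 * rng.standard_normal((5, 160))
loud = 0.50 * rng.standard_normal((5, 160))
print(landmarks(np.vstack([quiet, loud])))   # [5]: the abrupt onset frame
```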
Summary
• Brief history:
  • Writing: ~5000 years; phonemes: ~3000 years
  • Manner-place notation: ~500 years; binary encoding: ~50 years
• Speech perception:
  • Errors in the perception of different distinctive features are independent…
  • …even though the acoustic correlates of different distinctive features are NOT independent
• Speech production:
  • Phonemic relationships among morphologically related words can be explained using distinctive features
  • Inter-word phonology is simple but universal: spreading of manner or place features on autosegmental tiers
• Quantal theory:
  • Distinctive features arise from sigmoids in the mappings from articulation to acoustics and from acoustics to perception
  • Each language chooses a subset of sigmoids
  • Other sigmoids are ignored, or else used for feature enhancement
  • A very important sigmoid: primary articulator closure produces landmarks