Speech Generation: From Concept and from Text • Julia Hirschberg • CS 6998
Today • TTS • CTS
Traditional TTS Systems • Monologue • News articles, email, books, phone directories • Input: plain text • How to infer intention behind text?
Human Speech Production Levels • World Knowledge • Semantics • Syntax • Word • Phonology • Motor Commands, articulator movements, F0, amplitude, duration • Acoustics
TTS Production Levels: Back End and Front End • Orthographic input: The children read to Dr. Smith • World Knowledge → text normalization • Semantics • Syntax → word pronunciation • Word • Phonology → intonation assignment • F0, amplitude, duration • Acoustics → synthesis
Text Normalization • Context-independent: • Mr., 22, $N, NAACP, MAACO, VISA • Context-dependent: • Dr., St., 1997, 3/16 • Abbreviation ambiguities: how to resolve? • Application restrictions – all names? • Rule- or corpus-based decision procedure (Sproat et al. '01)
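A minimal sketch of context-dependent abbreviation expansion in Python. The expansion table and the "next word is capitalized" heuristic are invented for illustration; real rule- or corpus-based systems (e.g. Sproat et al. '01) use far richer features.

```python
# Hypothetical abbreviation expander: context-free entries expand
# unconditionally; ambiguous ones ("Dr.", "St.") look at local context.
CONTEXT_FREE = {"Mr.": "mister", "NAACP": "N A A C P"}

def expand(token, next_token=None):
    """Expand one token, optionally using the following token."""
    if token in CONTEXT_FREE:
        return CONTEXT_FREE[token]
    if token == "Dr.":
        # "Dr. Smith" -> doctor; "Maple Dr." (no capitalized name after) -> drive
        return "doctor" if next_token and next_token[0].isupper() else "drive"
    if token == "St.":
        # "St. John" -> saint; "Main St." -> street
        return "saint" if next_token and next_token[0].isupper() else "street"
    return token
```

The point of the sketch is the split the slide draws: a lookup table suffices for context-independent items, while ambiguous abbreviations need a decision procedure over local context.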
• Part-of-speech ambiguity: • The convict went to jail / They will convict him • Said said hello • They read books / They will read books • Use: local lexical context, POS tagger, parser? • Sense ambiguity: I fish for bass / I play the bass • Use: decision lists (Yarowsky '94)
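A toy decision-list sketch (in the spirit of Yarowsky-style methods) for the homograph "read", using only the preceding word as local lexical context. The cue sets, their ordering, and the ARPAbet-style pronunciations are illustrative assumptions, not a real system's rules.

```python
# Ordered rules: the first matching cue decides the pronunciation.
def pronounce_read(prev_word):
    """Guess present /rid/ vs past /rEd/ from the previous word."""
    rules = [
        (lambda w: w in {"will", "to", "can", "must"}, "r iy d"),  # present tense
        (lambda w: w in {"has", "had", "was"}, "r eh d"),          # past participle
    ]
    for cue, pron in rules:
        if cue(prev_word.lower()):
            return pron
    return "r eh d"  # default when no cue fires
```

Decision lists order cues by reliability and stop at the first match, which is exactly what the `for` loop does here.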
Word Pronunciation • Letter-to-Sound rules vs. large dictionary • O: _{C}e$ /o/ hope • O /a/ hop • Morphological analysis • Popemobile • Hoped • Ethnic classification • Fujisaki, Infiniti
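The two ordered letter-to-sound rules on the slide can be sketched directly: 'o' followed by a single consonant and a word-final 'e' yields /o/ (hope), otherwise /a/ (hop). Real rule sets contain hundreds of such context patterns; this covers only the one example.

```python
import re

def o_phoneme(word):
    """Apply the slide's two rules for the letter 'o', in order."""
    # Rule 1: o before one consonant plus word-final e -> /o/ (hope)
    if re.search(r"o[^aeiou]e$", word):
        return "/o/"
    # Rule 2 (default): o -> /a/ (hop)
    return "/a/"
```

Rule ordering matters: the more specific context is tried first, and the bare-letter rule acts as a fallback, which is the standard organization of letter-to-sound rule sets.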
Rhyming by analogy • Meronymy/metonymy • Exception Dictionary • Beethoven • Goal: phonemes+syllabification+lexical stress • Context-dependent too: • Give the book to John. • To John I said nothing.
Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation: 234-5682 • Context/function word: no breaks after a function word: He went to dinner • Parse? She favors the nuts and bolts approach • Current: statistical analysis of large labeled corpora • Punctuation, POS window, utterance length, …
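A sketch of the two hand-built phrasing rules above: always break at punctuation, never break immediately after a function word. The candidate break positions are assumed to come from some upstream module (a hypothetical chunker); the function-word list is a toy subset.

```python
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "and", "in"}

def phrase_breaks(tokens, candidates):
    """Filter candidate break positions (indices of the token a break
    would follow) with the two rules from the slide."""
    breaks = set()
    for i in candidates:
        if tokens[i] in {",", ".", "?", "!"}:
            breaks.add(i)            # always break at punctuation
        elif tokens[i].lower() not in FUNCTION_WORDS:
            breaks.add(i)            # but never right after a function word
    return breaks
```

For "He went to dinner ." with candidate breaks after "to" and after ".", only the punctuation break survives, matching the rule that blocks a boundary after "to".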
Intonation Assignment: Accent • Hand-built rules • Function/content distinction: He went out the back door / He threw out the trash • Complex nominals: • Main Street / Park Avenue • city hall parking lot • Statistical procedures trained on large corpora • Contrastive stress, given/new distinction?
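The classic content/function accenting rule is a one-liner once POS tags are available: pitch-accent content words (nouns, verbs, adjectives, adverbs), deaccent function words. The tag set here is an illustrative assumption; note the rule fails on exactly the slide's hard cases (particles like "out", complex nominals, given/new).

```python
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def assign_accents(tagged):
    """tagged: list of (word, pos) pairs -> list of (word, accented?)."""
    return [(word, pos in CONTENT_TAGS) for word, pos in tagged]
```
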
Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know?
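The contour rules above amount to a small conditional: '?' selects a yes-no rise unless a wh-word appears at or near the front, in which case the sentence gets a declarative fall. The "near the front" window of three tokens is an invented parameter.

```python
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def choose_contour(tokens):
    """Pick an intonation contour from final punctuation and wh-words."""
    if tokens[-1] == "?":
        # wh-questions ("How did he do it?") typically fall, not rise
        if any(t.lower() in WH_WORDS for t in tokens[:3]):
            return "declarative-fall"
        return "yes-no-rise"
    return "declarative-fall"
```
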
The TTS Front End Today • Corpus-based statistical methods instead of hand-built rule-sets • Dictionaries instead of rules (but fall-back to rules) • Modest attempts to infer contrast, given/new • Text analysis tools: pos tagger, morphological analyzer, little parsing
TTS Back End: Phonology to Acoustics • Goal: • Produce a phonological representation from segmentals (phonemes) and suprasegmentals (accent and phrasing assignment) • Convert to an acoustic signal (spectrum, pitch, duration, amplitude) • From phonetics to signal processing
Phonological Modeling: Duration • How long should each phoneme be? • Identity of context phonemes • Position within syllable and # of syllables • Phrasing • Stress • Speaking rate
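These factors can be combined in a Klatt-style rule model: start from an intrinsic duration per phoneme and multiply by factors for stress, phrase-final lengthening, and speaking rate. All the numbers below are invented stand-ins, not measured values.

```python
# Intrinsic durations in milliseconds (illustrative only).
INTRINSIC_MS = {"ae": 120, "t": 60, "k": 70}

def duration_ms(phone, stressed=False, phrase_final=False, rate=1.0):
    """Multiplicative rule duration model, Klatt-style."""
    d = INTRINSIC_MS.get(phone, 80)   # default for unlisted phones
    if stressed:
        d *= 1.3                      # lexical stress lengthens
    if phrase_final:
        d *= 1.4                      # phrase-final lengthening
    return d / rate                   # faster rate -> shorter phones
```
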
Phonological Modeling: Pitch • How to create F0 contour from accent/phrasing/contour assignment plus duration assignment and phonemes? • Contour or target models for accents, phrase boundaries • Rules to align phoneme string and smooth • How does F0 align with different phonemes?
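A target model can be sketched as anchors plus interpolation: place F0 targets (Hz) at accented syllables and phrase boundaries, then interpolate linearly between anchors to get a frame-by-frame contour. Real systems also smooth and handle segmental alignment; this shows only the interpolation step, with invented values.

```python
def f0_contour(targets, n_frames):
    """targets: sorted (frame_index, hz) anchors covering frames 0..n-1.
    Returns a per-frame F0 track by linear interpolation."""
    track = []
    for f in range(n_frames):
        prev = max((t for t in targets if t[0] <= f), key=lambda t: t[0])
        nxt = min((t for t in targets if t[0] >= f), key=lambda t: t[0])
        if prev[0] == nxt[0]:
            track.append(prev[1])                       # exactly on an anchor
        else:
            w = (f - prev[0]) / (nxt[0] - prev[0])      # position between anchors
            track.append(prev[1] + w * (nxt[1] - prev[1]))
    return track
```
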
Phonetic Component: Segmentals • Phonemes have different acoustic realizations depending on nearby phonemes, stress • To/to, butter/tail • Approaches: • Articulatory synthesis • Formant synthesis • Concatenative synthesis • Diphone or unit selection
Articulatory Synthesis-by-Rule • Model articulators: tongue body, tip, jaw, lips, velum, vocal folds • Rules control timing of movements of each articulator • Easy to model coarticulation since articulators modeled separately • But: sounds very unnatural • Transform from vocal tract to acoustics not well understood • Knowledge of articulator control rules incomplete
Formant (Acoustic) Synthesis by Rule • Model of acoustic parameters: • Formant frequencies, bandwidths, amplitude of voicing, aspiration… • Phonemes have target values for parameters • Given a phonemic transcription of the input: • Rules select sequence of targets • Other rules determine duration of target values and transitions between
But: • Speech quality not natural • Acoustic model incomplete • Human knowledge of linguistic and acoustic control rules incomplete
Concatenative Synthesis • Pre-recorded human speech • Cut up into units, code, store (indexed) • Diphones typical • Given a phonemic transcription • Rules select unit sequence • Rules concatenate units based on some selection criteria • Rules modify duration, amplitude, pitch, source – and smooth spectrum across junctures
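For diphone synthesis, the unit sequence follows mechanically from the phonemic transcription: a diphone spans from the middle of one phone to the middle of the next, so the transcription maps to a chain of overlapping phone pairs, padded with silence at the edges. A minimal sketch (the "sil" label and `a-b` naming are conventions assumed here):

```python
def to_diphones(phones):
    """Map a phoneme string to the diphone units needed to say it."""
    padded = ["sil"] + phones + ["sil"]          # silence at both edges
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
```

Cutting units mid-phone, where the spectrum is most stable, is what makes diphone joins less audible than joins at phone boundaries.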
Issues • Speech quality varies based on • Size and number of units (coverage) • Rules • Speech coding method used to decompose acoustic signal into spectral, F0, amplitude parameters • How much the signal must be modified to produce the output
Coding Methods • LPC (Linear Predictive Coding): • Decomposes waveform into vocal tract/formant frequencies, F0, amplitude; simple model of glottal excitation • Sounds robotic • More elaborate variants (MPLPC, RELP) sound less robotic but distort when F0 or duration is changed • PSOLA (pitch-synchronous overlap-add): • No waveform decomposition
Delete/repeat pitch periods to change duration • Overlap pitch periods to change F0 • Distortion if large F0, durational change • Sensitive to definition of pitch periods • No coding (use natural speech) • Avoid distortions of coding methods • But how to change duration, F0, amplitude?
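The duplicate/delete step of PSOLA-style duration change can be sketched on its own: given speech already segmented into pitch periods, repeat or skip periods to hit a target length. Real TD-PSOLA also windows and overlap-adds the periods; that part is omitted here.

```python
def stretch(periods, factor):
    """periods: list of pitch-period sample buffers.
    factor 2.0 doubles duration (repeat each period twice);
    factor 0.5 halves it (keep every other period)."""
    out = []
    acc = 0.0
    for p in periods:
        acc += factor
        while acc >= 1.0:      # emit this period as many times as owed
            out.append(p)
            acc -= 1.0
    return out
```

The slide's caveat shows up directly: large factors repeat or drop many adjacent periods, which is where audible distortion comes from.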
Corpus-based Unit Selection • Units determined case-by-case from large hand or automatically labeled corpus • Amount of concatenation depends on input and corpus • Algorithms for determining best units to use • Longest match to phonemes in input • Spectral distance measures • Matching prosodic, amplitude, durational features???
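The standard formulation of "best units" combines a target cost (mismatch between a candidate unit and the desired phonemes/prosody) with a join cost (e.g. spectral distance to the previous unit), minimized over the whole utterance by dynamic programming. A sketch with stand-in cost functions supplied by the caller:

```python
def select_units(candidates, target_cost, join_cost):
    """candidates: per-position lists of unit ids from the database.
    Returns the unit sequence minimizing total target + join cost
    (Viterbi over candidate lattices)."""
    # best[u] = (cumulative cost, path) for each unit at current position
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        new = {}
        for u in candidates[i]:
            cost, path = min(
                ((pc + join_cost(p, u), pp) for p, (pc, pp) in best.items()),
                key=lambda x: x[0],
            )
            new[u] = (cost + target_cost(i, u), path + [u])
        best = new
    return min(best.values(), key=lambda x: x[0])[1]
```

With a zero join cost for units that were adjacent in the recorded corpus, this search naturally prefers the longest contiguous stretches of natural speech, which is why coverage matters so much.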
TTS Back End: Summary • Speech most natural when least signal processing: corpus-based unit selection and no coding….but….
TTS: Where are we now? • Natural sounding speech for some utterances • Where good match between input and database • Still…hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:
Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices?
TTS vs. CTS • Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure,… information explicitly available to NLG • Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how • But….generating prosody for CTS isn’t so easy
Next Week • Read • Discussion questions • Write an outline of your class project and what you’ve done so far