Speech Generation: From Concept and from Text • Julia Hirschberg • CS 6998
Today • TTS • CTS
Traditional TTS Systems • Monologue • News articles, email, books, phone directories • Input: plain text • How to infer intention behind text?
Human Speech Production Levels • World Knowledge • Semantics • Syntax • Word • Phonology • Motor Commands, articulator movements, F0, amplitude, duration • Acoustics
TTS Production Levels: Back End and Front End • Orthographic input: The children read to Dr. Smith • World Knowledge → text normalization • Semantics • Syntax → word pronunciation • Word • Phonology → intonation assignment • F0, amplitude, duration • Acoustics → synthesis
Text Normalization • Context-independent: • Mr., 22, $N, NAACP, MAACO, VISA • Context-dependent: • Dr., St., 1997, 3/16 • Abbreviation ambiguities: how to resolve? • Application restrictions – all names? • Rule- or corpus-based decision procedure (Sproat et al. '01)
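A minimal sketch of context-dependent abbreviation expansion in Python. The expansion table and the "next word is capitalized" heuristic are invented for illustration; real rule- or corpus-based systems (e.g. Sproat et al. '01) use far richer features.

```python
# Hypothetical abbreviation expander: context-free entries expand
# unconditionally; ambiguous ones ("Dr.", "St.") look at local context.
CONTEXT_FREE = {"Mr.": "mister", "NAACP": "N A A C P"}

def expand(token, next_token=None):
    """Expand one token, optionally using the following token."""
    if token in CONTEXT_FREE:
        return CONTEXT_FREE[token]
    if token == "Dr.":
        # "Dr. Smith" -> doctor; "Maple Dr." (no capitalized name after) -> drive
        return "doctor" if next_token and next_token[0].isupper() else "drive"
    if token == "St.":
        # "St. John" -> saint; "Main St." -> street
        return "saint" if next_token and next_token[0].isupper() else "street"
    return token
```

The point of the sketch is the split the slide draws: a lookup table suffices for context-independent items, while ambiguous abbreviations need a decision procedure over local context.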
• Part-of-speech ambiguity: • The convict went to jail / They will convict him • Said said hello • They read books / They will read books • Use: local lexical context, POS tagger, parser? • Sense ambiguity: I fish for bass / I play the bass • Use: decision lists (Yarowsky '94)
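A toy decision-list sketch (in the spirit of Yarowsky-style methods) for the homograph "read", using only the preceding word as local lexical context. The cue sets, their ordering, and the ARPAbet-style pronunciations are illustrative assumptions, not a real system's rules.

```python
# Ordered rules: the first matching cue decides the pronunciation.
def pronounce_read(prev_word):
    """Guess present /rid/ vs past /rEd/ from the previous word."""
    rules = [
        (lambda w: w in {"will", "to", "can", "must"}, "r iy d"),  # present tense
        (lambda w: w in {"has", "had", "was"}, "r eh d"),          # past participle
    ]
    for cue, pron in rules:
        if cue(prev_word.lower()):
            return pron
    return "r eh d"  # default when no cue fires
```

Decision lists order cues by reliability and stop at the first match, which is exactly what the `for` loop does here.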
Word Pronunciation • Letter-to-Sound rules vs. large dictionary • O: _{C}e$ /o/ hope • O /a/ hop • Morphological analysis • Popemobile • Hoped • Ethnic classification • Fujisaki, Infiniti
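The two ordered letter-to-sound rules on the slide can be sketched directly: 'o' followed by a single consonant and a word-final 'e' yields /o/ (hope), otherwise /a/ (hop). Real rule sets contain hundreds of such context patterns; this covers only the one example.

```python
import re

def o_phoneme(word):
    """Apply the slide's two rules for the letter 'o', in order."""
    # Rule 1: o before one consonant plus word-final e -> /o/ (hope)
    if re.search(r"o[^aeiou]e$", word):
        return "/o/"
    # Rule 2 (default): o -> /a/ (hop)
    return "/a/"
```

Rule ordering matters: the more specific context is tried first, and the bare-letter rule acts as a fallback, which is the standard organization of letter-to-sound rule sets.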
Rhyming by analogy • Meronymy/metonymy • Exception Dictionary • Beethoven • Goal: phonemes+syllabification+lexical stress • Context-dependent too: • Give the book to John. • To John I said nothing.
Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation: 234-5682 • Context/function word: no breaks after a function word: He went to dinner • Parse? She favors the nuts and bolts approach • Current: statistical analysis of large labeled corpora • Punctuation, POS window, utterance length, …
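A sketch of the two hand-built phrasing rules above: always break at punctuation, never break immediately after a function word. The candidate break positions are assumed to come from some upstream module (a hypothetical chunker); the function-word list is a toy subset.

```python
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "and", "in"}

def phrase_breaks(tokens, candidates):
    """Filter candidate break positions (indices of the token a break
    would follow) with the two rules from the slide."""
    breaks = set()
    for i in candidates:
        if tokens[i] in {",", ".", "?", "!"}:
            breaks.add(i)            # always break at punctuation
        elif tokens[i].lower() not in FUNCTION_WORDS:
            breaks.add(i)            # but never right after a function word
    return breaks
```

For "He went to dinner ." with candidate breaks after "to" and after ".", only the punctuation break survives, matching the rule that blocks a boundary after "to".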
Intonation Assignment: Accent • Hand-built rules • Function/content distinction: He went out the back door / He threw out the trash • Complex nominals: • Main Street / Park Avenue • city hall parking lot • Statistical procedures trained on large corpora • Contrastive stress, given/new distinction?
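The classic content/function accenting rule is a one-liner once POS tags are available: pitch-accent content words (nouns, verbs, adjectives, adverbs), deaccent function words. The tag set here is an illustrative assumption; note the rule fails on exactly the slide's hard cases (particles like "out", complex nominals, given/new).

```python
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def assign_accents(tagged):
    """tagged: list of (word, pos) pairs -> list of (word, accented?)."""
    return [(word, pos in CONTENT_TAGS) for word, pos in tagged]
```
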
Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know?
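The contour rules above amount to a small conditional: '?' selects a yes-no rise unless a wh-word appears at or near the front, in which case the sentence gets a declarative fall. The "near the front" window of three tokens is an invented parameter.

```python
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def choose_contour(tokens):
    """Pick an intonation contour from final punctuation and wh-words."""
    if tokens[-1] == "?":
        # wh-questions ("How did he do it?") typically fall, not rise
        if any(t.lower() in WH_WORDS for t in tokens[:3]):
            return "declarative-fall"
        return "yes-no-rise"
    return "declarative-fall"
```
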
The TTS Front End Today • Corpus-based statistical methods instead of hand-built rule-sets • Dictionaries instead of rules (but fall-back to rules) • Modest attempts to infer contrast, given/new • Text analysis tools: pos tagger, morphological analyzer, little parsing
TTS Back End: Phonology to Acoustics • Goal: • Produce a phonological representation from segmentals (phonemes) and suprasegmentals (accent and phrasing assignment) • Convert to an acoustic signal (spectrum, pitch, duration, amplitude) • From phonetics to signal processing
Phonological Modeling: Duration • How long should each phoneme be? • Identity of context phonemes • Position within syllable and # of syllables • Phrasing • Stress • Speaking rate
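These factors can be combined in a Klatt-style rule model: start from an intrinsic duration per phoneme and multiply by factors for stress, phrase-final lengthening, and speaking rate. All the numbers below are invented stand-ins, not measured values.

```python
# Intrinsic durations in milliseconds (illustrative only).
INTRINSIC_MS = {"ae": 120, "t": 60, "k": 70}

def duration_ms(phone, stressed=False, phrase_final=False, rate=1.0):
    """Multiplicative rule duration model, Klatt-style."""
    d = INTRINSIC_MS.get(phone, 80)   # default for unlisted phones
    if stressed:
        d *= 1.3                      # lexical stress lengthens
    if phrase_final:
        d *= 1.4                      # phrase-final lengthening
    return d / rate                   # faster rate -> shorter phones
```
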
Phonological Modeling: Pitch • How to create F0 contour from accent/phrasing/contour assignment plus duration assignment and phonemes? • Contour or target models for accents, phrase boundaries • Rules to align phoneme string and smooth • How does F0 align with different phonemes?
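A target model can be sketched as anchors plus interpolation: place F0 targets (Hz) at accented syllables and phrase boundaries, then interpolate linearly between anchors to get a frame-by-frame contour. Real systems also smooth and handle segmental alignment; this shows only the interpolation step, with invented values.

```python
def f0_contour(targets, n_frames):
    """targets: sorted (frame_index, hz) anchors covering frames 0..n-1.
    Returns a per-frame F0 track by linear interpolation."""
    track = []
    for f in range(n_frames):
        prev = max((t for t in targets if t[0] <= f), key=lambda t: t[0])
        nxt = min((t for t in targets if t[0] >= f), key=lambda t: t[0])
        if prev[0] == nxt[0]:
            track.append(prev[1])                       # exactly on an anchor
        else:
            w = (f - prev[0]) / (nxt[0] - prev[0])      # position between anchors
            track.append(prev[1] + w * (nxt[1] - prev[1]))
    return track
```
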
Phonetic Component: Segmentals • Phonemes have different acoustic realizations depending on nearby phonemes, stress • To/to, butter/tail • Approaches: • Articulatory synthesis • Formant synthesis • Concatenative synthesis • Diphone or unit selection
Articulatory Synthesis-by-Rule • Model articulators: tongue body, tip, jaw, lips, velum, vocal folds • Rules control timing of movements of each articulator • Easy to model coarticulation since articulators modeled separately • But: sounds very unnatural • Transform from vocal tract to acoustics not well understood • Knowledge of articulator control rules incomplete
Formant (Acoustic) Synthesis by Rule • Model of acoustic parameters: • Formant frequencies, bandwidths, amplitude of voicing, aspiration… • Phonemes have target values for parameters • Given a phonemic transcription of the input: • Rules select sequence of targets • Other rules determine duration of target values and transitions between
But: • Speech quality not natural • Acoustic model incomplete • Human knowledge of linguistic and acoustic control rules incomplete
Concatenative Synthesis • Pre-recorded human speech • Cut up into units, code, store (indexed) • Diphones typical • Given a phonemic transcription • Rules select unit sequence • Rules concatenate units based on some selection criteria • Rules modify duration, amplitude, pitch, source – and smooth spectrum across junctures
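For diphone synthesis, the unit sequence follows mechanically from the phonemic transcription: a diphone spans from the middle of one phone to the middle of the next, so the transcription maps to a chain of overlapping phone pairs, padded with silence at the edges. A minimal sketch (the "sil" label and `a-b` naming are conventions assumed here):

```python
def to_diphones(phones):
    """Map a phoneme string to the diphone units needed to say it."""
    padded = ["sil"] + phones + ["sil"]          # silence at both edges
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
```

Cutting units mid-phone, where the spectrum is most stable, is what makes diphone joins less audible than joins at phone boundaries.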
Issues • Speech quality varies based on • Size and number of units (coverage) • Rules • Speech coding method used to decompose acoustic signal into spectral, F0, amplitude parameters • How much the signal must be modified to produce the output
Coding Methods • LPC (Linear Predictive Coding): • Decomposes waveform into vocal tract/formant frequencies, F0, amplitude; simple model of glottal excitation • Sounds robotic • More elaborate variants (MPLPC, RELP) sound less robotic but distort when F0 or duration is changed • PSOLA (pitch-synchronous overlap-add): • No waveform decomposition
Delete/repeat pitch periods to change duration • Overlap pitch periods to change F0 • Distortion if large F0, durational change • Sensitive to definition of pitch periods • No coding (use natural speech) • Avoid distortions of coding methods • But how to change duration, F0, amplitude?
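The duplicate/delete step of PSOLA-style duration change can be sketched on its own: given speech already segmented into pitch periods, repeat or skip periods to hit a target length. Real TD-PSOLA also windows and overlap-adds the periods; that part is omitted here.

```python
def stretch(periods, factor):
    """periods: list of pitch-period sample buffers.
    factor 2.0 doubles duration (repeat each period twice);
    factor 0.5 halves it (keep every other period)."""
    out = []
    acc = 0.0
    for p in periods:
        acc += factor
        while acc >= 1.0:      # emit this period as many times as owed
            out.append(p)
            acc -= 1.0
    return out
```

The slide's caveat shows up directly: large factors repeat or drop many adjacent periods, which is where audible distortion comes from.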
Corpus-based Unit Selection • Units determined case-by-case from large hand or automatically labeled corpus • Amount of concatenation depends on input and corpus • Algorithms for determining best units to use • Longest match to phonemes in input • Spectral distance measures • Matching prosodic, amplitude, durational features???
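The standard formulation of "best units" combines a target cost (mismatch between a candidate unit and the desired phonemes/prosody) with a join cost (e.g. spectral distance to the previous unit), minimized over the whole utterance by dynamic programming. A sketch with stand-in cost functions supplied by the caller:

```python
def select_units(candidates, target_cost, join_cost):
    """candidates: per-position lists of unit ids from the database.
    Returns the unit sequence minimizing total target + join cost
    (Viterbi over candidate lattices)."""
    # best[u] = (cumulative cost, path) for each unit at current position
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        new = {}
        for u in candidates[i]:
            cost, path = min(
                ((pc + join_cost(p, u), pp) for p, (pc, pp) in best.items()),
                key=lambda x: x[0],
            )
            new[u] = (cost + target_cost(i, u), path + [u])
        best = new
    return min(best.values(), key=lambda x: x[0])[1]
```

With a zero join cost for units that were adjacent in the recorded corpus, this search naturally prefers the longest contiguous stretches of natural speech, which is why coverage matters so much.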
TTS Back End: Summary • Speech most natural when least signal processing: corpus-based unit selection and no coding….but….
TTS: Where are we now? • Natural sounding speech for some utterances • Where good match between input and database • Still…hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:
Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices?
TTS vs. CTS • Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure,… information explicitly available to NLG • Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how • But….generating prosody for CTS isn’t so easy
Next Week • Read • Discussion questions • Write an outline of your class project and what you’ve done so far