
FLST: Text-to-Speech Synthesis

FLST: Text-to-Speech Synthesis. Bernd Möbius moebius@coli.uni-saarland.de http://www.coli.uni-saarland.de/courses/FLST/2013/. Speech synthesis: Ambition and dilemma. Ambition of speech synthesis: modeling the production side of the most complex human cognitive ability


Presentation Transcript


  1. FLST: Text-to-Speech Synthesis Bernd Möbius moebius@coli.uni-saarland.de http://www.coli.uni-saarland.de/courses/FLST/2013/

  2. Speech synthesis: Ambition and dilemma • Ambition of speech synthesis: • modeling the production side of the most complex human cognitive ability • Dilemma of speech synthesis: • emulate a human speaker or reader, without • world knowledge • language comprehension • speech organs • achieve optimal intelligibility and naturalness • Speech synthesis: an impossible task!?

  3. Human-machine dialog (1)

  4. Human-machine dialog (2)

  5. Mechanical systems Wolfgang von Kempelen (1770)

  6. Mechanical systems Wolfgang von Kempelen (1791): speaking machine http://www.youtube.com/watch?v=zYRVqrfY3tQ

  7. Electrical systems Dudley (1939): the Voder

  8. Formant synthesis Gunnar Fant (1953): OVE I, serial filters John Holmes (1973): parallel filters

  9. Formant synthesis • Acoustic-parametric synthesis • modeling the acoustic properties of speech sounds

  10. Formant synthesis Prof. Stephen Hawking and speech synthesizer (DECtalk DTC01) DecTalk Infovox http://www.youtube.com/watch?v=J-8a55jeR-A (1:13 – 1:32) http://www.youtube.com/watch?v=wlrOKpQ6UBI

  11. Source-filter model of speech production

  12. Articulatory synthesis Vocal Tract Lab (2007) http://www.vocaltractlab.de/ IP Köln (1995) • Articulatory synthesis • modeling components of the speech production system • voice source, articulators, 3D vocal tract, etc.

  13. Synthesis methods • Acoustic-parametric synthesis • Articulatory synthesis • Concatenative synthesis • uses segments of natural speech, concatenated and resequenced to synthesize the intended utterance • e.g. diphone synthesis, unit selection synthesis, statistical parametric (HMM-based) synthesis

  14. Concatenative synthesis • Data-based, concatenative synthesis • offline: extraction of units from recordings of natural speech • online: selection and sequential concatenation of units • Which units are appropriate? • phones? [Ger: 45]

  15. Phone synthesis

  16. Concatenative synthesis • Data-based, concatenative synthesis • offline: extraction of units from recordings of natural speech • online: selection and sequential concatenation of units • Which units are appropriate? • phones? [Ger: 45] • diphones? [Ger: 2025]

  17. Hadifix Festival SVOX Bell Labs Diphone synthesis

  18. Concatenative synthesis • Data-based, concatenative synthesis • offline: extraction of units from recordings of natural speech • online: selection and sequential concatenation of units • Which units are appropriate? • phones? [Ger: 45] • diphones? [Ger: 2025] • triphones? [Ger: 91,125] • syllables? [Ger: 12,500+]
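The bracketed inventory sizes for phones, diphones and triphones are consistent with simple combinatorics over a 45-phone inventory. The sketch below reproduces them as upper bounds, assuming every phone combination is possible; the function name `max_inventory` is illustrative only. The syllable figure is not covered, since it looks like a count of attested syllables rather than a power of 45.

```python
# Upper bound on the number of distinct units of a given length,
# assuming every combination of phones can occur (an idealisation:
# real phonotactics exclude many combinations).
def max_inventory(num_phones: int, unit_length: int) -> int:
    return num_phones ** unit_length


GERMAN_PHONES = 45  # phone count quoted on the slide

for name, length in [("phones", 1), ("diphones", 2), ("triphones", 3)]:
    print(f"{name}: {max_inventory(GERMAN_PHONES, length):,}")
```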

  19. Concatenative, diphone synthesis • Synthesis by re-sequencing and concatenating selected units of natural speech (typically: diphones) + units comprise dynamic phone-to-phone transitions + units cover local coarticulatory effects - longer-range coarticulation not covered - signal processing at least for smoothing concatenation - signal processing for prosodic modifications - compromise between coverage and inventory size • Standard synthesis technique in the 1990s • suboptimal naturalness • stable, predictable quality

  20. Unit selection synthesis • Dynamic selection of units at synthesis run-time • "The best solution to the synthesizer problem is to avoid it." [Carlson & Granström, 1991] • overcome restrictions by a fixed unit inventory • unit inventory: large corpus of recorded natural speech • select the smallest number of the longest units covering the target phone sequence • variable unit size (segments, syllables, words, ...) • reduce perceptual impression of lack of naturalness caused by number of concatenations and signal processing
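The strategy "select the smallest number of the longest units covering the target phone sequence" can be sketched as a small dynamic program. This is an illustration under simplifying assumptions, not the procedure of any particular system: phones are plain strings, the corpus is a hypothetical list of phone sequences, and the names `occurs_in` and `fewest_units` are made up for this example.

```python
from typing import List, Optional, Sequence, Tuple

Unit = Tuple[str, ...]


def occurs_in(unit: Unit, corpus: Sequence[Sequence[str]]) -> bool:
    """True if `unit` is a contiguous run of phones in some corpus utterance."""
    n = len(unit)
    return any(
        tuple(utt[i:i + n]) == unit
        for utt in corpus
        for i in range(len(utt) - n + 1)
    )


def fewest_units(target: Sequence[str],
                 corpus: Sequence[Sequence[str]]) -> Optional[List[Unit]]:
    """Cover `target` with the smallest number of corpus units, or None."""
    n = len(target)
    # best[i] holds a shortest cover of target[:i], or None if none exists.
    best: List[Optional[List[Unit]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is None:
                continue
            unit = tuple(target[start:end])
            if occurs_in(unit, corpus):
                candidate = prefix + [unit]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical toy corpus of phone sequences.
toy_corpus = [["h", "a", "l"], ["l", "o", "s"], ["a", "l", "o"]]
cover = fewest_units(["h", "a", "l", "o"], toy_corpus)
print(cover)
```

Minimizing the number of units also minimizes the number of concatenation points, which is the motivation given on the slide. A real system would additionally weigh how well each unit fits and how smoothly it joins.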

  21. Unit selection synthesis • Inventory construction off-line and run-time unit selection • preserve natural speech as much as possible • ideal world: target utterance available in corpus • unfortunately: ideal case is extremely improbable, due to complexity/combinatorics of language and speech • however, longer units may be available in corpus • most extreme strategy (CHATR, Black & Taylor 1994 …) • no modification by signal processing • listener will tolerate occasional glitches, if overall synthesis quality approaches natural speech

  22. Unit selection based on cost functions • Minimize two cost functions, simultaneously and globally (viz. for the entire utterance) • target cost (unit distortion): how suitable is the candidate? • concatenation cost (join cost, continuity distortion): how smooth is the concatenation with adjacent units?

  23. Selection algorithm target costs concatenation costs Minimize Ct and Cc[Hunt & Black 1996]

  24. Selection algorithm sequence of target units lattice of candidate units Minimize Ct and Cc[Hunt & Black 1996]
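Written out, the quantity minimized over the whole utterance is usually given as the sum of the target costs of all selected units plus the concatenation costs of all adjacent pairs. The notation below follows the slide's Ct and Cc; the symbols t_i (target unit specification) and u_i (candidate unit) and the omission of utterance-boundary terms are assumptions made for this sketch.

```latex
\begin{align}
C(t_1^n, u_1^n) &= \sum_{i=1}^{n} C^{t}(t_i, u_i)
                 + \sum_{i=2}^{n} C^{c}(u_{i-1}, u_i) \\
\hat{u}_1^n &= \operatorname*{arg\,min}_{u_1, \dots, u_n} C(t_1^n, u_1^n)
\end{align}
```

Because each concatenation cost depends only on two neighbouring units, the minimum over the candidate lattice can be found with a Viterbi-style search rather than by enumerating every path.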

  25. Example: Word-based unit selection • Target utterance: I have time on Monday. • Step 1: tabulate all candidate words for target utterance • Candidates per target word: I (4), have (3), time (2), on (4), Monday (3)

  26. Example: Word-based unit selection • Target utterance: I have time on Monday. • Step 1: tabulate all candidate words for target utterance • Candidate lattice between start node S and end node E; nodes are ordered in the direction of time • Candidates per target word: I (4), have (3), time (2), on (4), Monday (3)

  27. Example: Word-based unit selection • Target utterance: I have time on Monday. • Step 1: tabulate all candidate words for target utterance • Candidate lattice between start node S and end node E; nodes are ordered in the direction of time • Candidates per target word: I (4), have (3), time (2), on (4), Monday (3)
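The word-based example above can be turned into a toy search over a candidate lattice. The sketch below is a minimal Viterbi search that minimizes summed target and concatenation costs. Everything specific in it is an assumption for illustration: the `Unit` fields, the two cost functions (a stress mismatch for the target cost, corpus adjacency for the join cost) and the toy lattice with two candidates per word.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Unit:
    word: str
    utt: int        # index of the corpus utterance the unit was cut from
    pos: int        # word position inside that utterance
    stressed: bool  # hypothetical stand-in for a prosodic feature


def target_cost(want_stressed: bool, unit: Unit) -> float:
    """Toy target cost: penalise a stress mismatch."""
    return 0.0 if unit.stressed == want_stressed else 1.0


def join_cost(left: Unit, right: Unit) -> float:
    """Toy join cost: free if the units were adjacent in the recording."""
    adjacent = left.utt == right.utt and right.pos == left.pos + 1
    return 0.0 if adjacent else 1.0


def select_units(targets: Sequence[bool],
                 lattice: Sequence[Sequence[Unit]]) -> Tuple[List[Unit], float]:
    """Viterbi search for the cheapest path through the candidate lattice."""
    cost = [[target_cost(targets[0], u) for u in lattice[0]]]
    back = [[-1] * len(lattice[0])]
    for i in range(1, len(lattice)):
        row_cost, row_back = [], []
        for u in lattice[i]:
            options = [cost[i - 1][k] + join_cost(prev, u)
                       for k, prev in enumerate(lattice[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[k_best] + target_cost(targets[i], u))
            row_back.append(k_best)
        cost.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path: List[Unit] = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Hypothetical lattice for "I have time on Monday".
toy_lattice = [
    [Unit("I", 0, 0, False), Unit("I", 2, 0, True)],
    [Unit("have", 0, 1, False), Unit("have", 3, 4, False)],
    [Unit("time", 0, 2, True), Unit("time", 3, 5, False)],
    [Unit("on", 1, 0, False), Unit("on", 2, 1, False)],
    [Unit("Monday", 1, 1, True), Unit("Monday", 4, 2, True)],
]
toy_targets = [False, False, True, False, True]  # desired stress per word

best_path, best_cost = select_units(toy_targets, toy_lattice)
print([(u.word, u.utt) for u in best_path], best_cost)
```

With these toy costs the search prefers "I have time" from one recording and "on Monday" from another, so only one join is needed.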

  28. Speech corpus design and size • Quality of speech corpus (recordings, annotation, coverage) has tremendous effect on synthesis quality • Corpus size is single most important quality factor • Some data points: • IBM/Cambridge: ~60 min. (ASR corpora) • CHATR: phonetically balanced sentences, radio news, isolated sentences: 40 min. (Eng.), 20 min. (Jap.) • "bring a novel … of their own choice" [Campbell 1999] • AT&T 1999: news stories and system prompts, ~2 hrs. • SmartKom: open+closed domains, 160 min. • typical corpus size today: 10++ hrs.

  29. Unit Selection: SmartKom

  30. Unit Selection: demos • example speech output from several systems: • CHATR (1996) • AT&T (2001) • Festival (2004) • SmartKom (2005) • Loquendo (2010) • BOSS (pol., 2009)

  31. Unit selection synthesis: Summary • Synthesis by re-sequencing and concatenating units selected at run-time from corpus of natural speech + facilitates long units without concatenation + reduces need for signal processing + preserves natural speech waveforms - tends to produce unstable, unpredictable quality - inflexible w.r.t. speaking style and speaker voice • Standard synthesis technique in the 2000s • in competition with HMM-based synthesis (statistical parametric speech synthesis, HMM = Hidden Markov models)

  32. Unit selection vs. HMM-based synthesis • Unit selection approach • high-quality speech synthesized by concatenation of natural waveforms • building several voices requires large amount of speech data • HMM-based approach • probabilistic formulation of corpus-based synthesis • generate speech from a model • speech parameters generated from statistics • change of voice quality or speaker ID by transforming HMM parameters based on small amount of data

  33. Statistical parametric synthesis: Summary • HMM-based synthesis system + trainable and flexible + small footprint + smooth and stable speech generation (too smooth?) - vocoder-based, buzzy "voice" quality • Research questions • how to parameterize speech waveforms? • how to model extracted parameter trajectories? • how to recover speech parameter trajectories? • how to improve voice source modeling?

  34. TTS: System components text linguistic text analysis prosody control speech synthesis synthetic speech

  35. TTS: Processing tasks Will this course on TTS end on 02-09-2014 at 5:45pm? Endet dieser Kurs, TTS, am 9.2.1999 um 17.45 Uhr? properties of text properties of voice

  36. TTS: Processing tasks Will this course, on TTS, end on 02-09-2014 at 5:45pm? Will this course [comma] on TTS [comma] end on the ninth of February two thousand and fourteen at five forty-five p m [question mark] _ wIl DIs kors On ti: ti: Es End On D@ naInT @v fEbru@ri: At faIv forti: faIv pi: Em _ ((_ wIl DIs kors) (?On ti: ti: Es) (?End On D@ naInT @v fEbru@ri:) (?At faIv forti: faIv pi: Em _))
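The first step shown above, expanding "5:45pm" into "five forty-five p m", is a typical text normalization task. The following is a minimal sketch for 12-hour clock times only, assuming English number names and a simple regular expression. The names `number_to_words` and `expand_times` are illustrative, and dates, currency and other alphanumeric expressions are deliberately left untouched.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..59."""
    if not 0 <= n <= 59:
        raise ValueError("only 0..59 supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


# 12-hour clock times such as "5:45pm" or "11:05 am".
TIME_RE = re.compile(r"\b(1[0-2]|0?[1-9]):([0-5]\d)\s*([ap])m\b", re.IGNORECASE)


def expand_times(text: str) -> str:
    """Replace clock times by their spelled-out form; leave the rest as is."""
    def spell(match: "re.Match[str]") -> str:
        hour, minute = int(match.group(1)), int(match.group(2))
        words = [number_to_words(hour)]
        if minute:
            # Minutes below ten are read as "oh five" in this sketch.
            prefix = "oh " if minute < 10 else ""
            words.append(prefix + number_to_words(minute))
        words += [match.group(3).lower(), "m"]
        return " ".join(words)

    return TIME_RE.sub(spell, text)


print(expand_times("Will this course end on 02-09-2014 at 5:45pm?"))
```

A full normalizer would also have to decide whether "02-09-2014" is the ninth of February or the second of September, which depends on the language and locale of the text.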

  37. TTS: Processing tasks ((_ wIl DIs kors) (?On ti: ti: Es) (?End On D@ naInT @v fEbru@ri:) (?At faIv forti: faIv pi: Em _)) * H- * H- * * * H- * * * H% F0

  38. TTS: Linguistic text analysis text normalization • lexical & morphological analysis • lexicon lookup • morphological analysis • syntactic analysis • prosodic analysis • phrasing • accenting • phonological analysis • pronunciation • syllabification

  39. Morphology: Derivation and Compounding • Problem for TTS: unknown words (i.e. words not explicitly listed in the system's dictionary) • unlimited vocabulary • practically unlimited lists of (e.g.) names • productive word formation processes • productive compounding (e.g. German) • Donaudampfschiffahrtsgesellschaftskapitän • Unerfindlichkeitsunterstellung • Oberweserdampfschiffahrtsgesellschaftskapitänsmützen-beratungsteekränzchen • Morphological analysis of compounds and other "unknown" words is indispensable in TTS

  40. Morphological word model: WFST • Example: decomposing Unerfindlichkeitsunterstellung • (correct) morphological decomposition: un[pref] + er[pref] + f'ind[root] + lich[suff] + keit[suff] + s[fuge] + unter[pref] + st'ell[root] + ung[suff] [#] <3.2> • WFST:
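The decomposition above can be imitated with a weighted morph lexicon and a cheapest-segmentation search. This is only a rough stand-in for a WFST, not the Bell Labs implementation: the lexicon contents and costs are hypothetical, stress marks are dropped, the input is lower-cased, and no grammar constrains the order of morph categories.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: morph -> (category, cost).
MORPHS: Dict[str, Tuple[str, float]] = {
    "un": ("pref", 0.3), "er": ("pref", 0.3), "unter": ("pref", 0.3),
    "find": ("root", 0.5), "stell": ("root", 0.5),
    "lich": ("suff", 0.2), "keit": ("suff", 0.2), "ung": ("suff", 0.2),
    "s": ("fuge", 0.4),
}

Analysis = Tuple[float, List[Tuple[str, str]]]


def decompose(word: str) -> Optional[Analysis]:
    """Cheapest segmentation of `word` into lexicon morphs, or None."""
    word = word.lower()
    n = len(word)
    # best[i] holds (total cost, analysis) for word[:i], or None.
    best: List[Optional[Analysis]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            prev = best[start]
            morph = word[start:end]
            if prev is None or morph not in MORPHS:
                continue
            category, cost = MORPHS[morph]
            total = prev[0] + cost
            if best[end] is None or total < best[end][0]:
                best[end] = (total, prev[1] + [(morph, category)])
    return best[n]


result = decompose("Unerfindlichkeitsunterstellung")
if result is not None:
    print(" + ".join(f"{m}[{c}]" for m, c in result[1]))
```

The costs only matter when a word has several competing segmentations. In a weighted finite-state model they play the same role as the path weight shown in angle brackets on the slide.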

  41. Segment of finite-state grammar for decomposing morphologically complex words in German

  42. Syllable model • Approx. 12,500 distinct syllables in English, German (some say >40k) • despite phonotactic restrictions on phone combinations • most syllables are lexically accounted for (names!) • Implementation of syllable model as finite-state grammar (Bell Labs TTS) • syllabification of phone sequences in phonological component • syllable model as part of morphological word model, operating on annotated orthography • (application: hyphenation of orthographic words)

  43. Syllable structure (German)

  44. TTS: Prosody control • duration modeling • segmental durations • syllable durations • pause durations • local speaking rate • intonation modeling • phrasing • accenting • amplitude modeling

  45. TTS: Synthesis • concatenative synthesis • unit selection • unit concatenation • or rule-based synthesis • acoustic trajectories • articulatory trajectories • signal generation synthetic speech signal

  46. The tone of voice

  47. Required and suggested reading • TTS overview paper: • Robert Clark, Korin Richmond, Simon King (2007): "MultiSyn: Open-domain unit selection for the Festival speech synthesis system". Speech Communication 49, 317-330. • TTS text book (not required for this class): • Paul Taylor (2009): Text-to-Speech Synthesis. Cambridge University Press.

  48. Exercises (to prepare for Dec 13) TTS systems: overview and quality assessment Look for demo pages of commercial and non-commercial TTS systems on the Web, in particular systems offering interactive demos. Try to assess the overall quality of these TTS systems. Select two systems for a side-by-side comparison of their performance. Alternatively, select two or three languages rendered by the same TTS system. Try to perform a diagnostic evaluation of TTS system components. Design test sentences to test the performance on different tasks, such as: resolution of complex alphanumeric expressions (e.g. dates, time, currency), pronunciation of names, pronunciation of complex words (e.g. compounds), prosodic phrasing and accenting, sentence mode detection, etc. Take notes of strengths and weaknesses of the systems and try to determine which system component is responsible for certain mistakes.

  49. Thanks!
