Perspectives for Articulatory Speech Synthesis

Bernd J. Kröger, Department of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen, Germany. bkroeger@ukaachen.de


Presentation Transcript


  1. Perspectives for Articulatory Speech Synthesis • Bernd J. Kröger, Department of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen, Germany

  2. University Hospital Aachen Germany Bernd J. Kröger bkroeger@ukaachen.de

  3. Examples: ASS (articulatory speech synthesis) of Peter Birkholz (2003-2007), Univ. of Rostock • application: railway announcement system • application: dialog system

  4. Outline • Introduction • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  5. Outline • Introduction: Perspectives • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  6. Perspectives for Articulatory Speech Synthesis? • Commercial or technical vs. scientific: Is high-quality articulatory speech synthesis a realistic goal? Yes! If we have it, its advantage over current corpus-based synthesis methods is variability: • Different voices simply by parameter variation (sex, age, voice quality) → no need for different corpora • Individual differences in articulation, e.g. degree of nasalization, individual sound / syllable realizations → no need for different corpora • Different languages → no need for different corpora

  7. Perspectives for Articulatory Speech Synthesis? • Commercial or technical vs. scientific goals: • audiovisual speech synthesis: modeling 3D talking heads • towards “the virtual human” (avatars) • towards “humanoid robots” • Need for more natural talking heads [Figure: talking head, Engwall, KTH Stockholm (1995-2001)]

  8. Perspectives for Articulatory Speech Synthesis? • Scientific perspectives: ASS may help to collect and condense knowledge of speech production: • … of articulation (sound geometries, speech movements, coarticulation) • … of vocal tract acoustics • … of control concepts: different approaches exist: • neural control: self-organization, training algorithms (Kröger et al. 2007) • gestural control: concept for articulatory movements (Birkholz et al. 2006, Kröger 1998) • segmental control (Kröger 1998) • corpus-based control (Birkholz et al. this meeting)

  9. Outline • Introduction: Components of ASS-Systems • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  10. Components of articulatory speech synthesis • control module • vocal tract and glottis model • area function → tube model • aerodynamic-acoustic simulation • speech signal

  11. Outline • Introduction • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  12. Outline • Introduction • Vocal tract models: types • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  13. Different types of vocal tract models • Statistical models: parameters derived by statistical analysis, e.g. of MR/CT/X-ray image corpora (Maeda 1990, Badin et al. 2003) • Geometrical models: vocal tract shape is described by a-priori defined parameters; area-function related: Stevens & House 1955; articulator related: Mermelstein 1973, Birkholz 2007 • Biomechanical models: modeling of articulators using finite element methods (Dang 2004, Engwall 2003, Wilhelms-Tricarico 1997) • 1D, 2D, 2D+, 3D models [Figures: Stevens & House (1955); Flanagan et al. (1980); Badin et al. (2003): gridline system; Dang et al. (2004); Engwall (2003); panels labeled 1D, 2D, 2D+, 3D, with one label reading “preferred for ASS”]

  14. Geometrical 3D vocal tract model: Birkholz (2007) • Examples: [a], [i], [schwa] • Based on MRI data of one speaker (and CT data of a replica of the teeth and hard palate)

  15. Vocal tract parameters (a priori) 23 basis parameters: • Lips (2 DOF), • Mandible (3 DOF), • Hyoid (2 DOF), • Velum (1 DOF), • Tongue (12 DOF), !!! • Minimal cross-sectional areas (3 DOF) • …

  16. Meshes of the vocal tract model (Birkholz 2005)

  17. Figure of the complete vocal tract model (Birkholz 2005)

  18. Variation of individual parameters • Variation of the lower jaw, leaving all other parameters constant → co-movement of dependent articulators: lips and tongue

  19. Outline • Introduction • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  20. Aero-acoustic simulation • Four major types of models • Reflection type line analog: forward and backward traveling partial flow or pressure waves are calculated; time domain simulation (e.g. Kelly & Lochbaum 1962, Strube et al. 1989, Kröger 1993); Problem: no variation of vocal tract length; constant tube length • Transmission line circuit analog (preferred): digital simulation of electrical circuit elements; time domain simulation (e.g. Flanagan 1975, Maeda 1982, Birkholz 2005); Problem: modeling frequency-dependent losses (radiation, …) • Hybrid time-frequency domain models (Sondhi & Schroeter 1987, Allen & Strong 1985); Problem: flow calculation → modeling aerodynamics; sound sources • Three-dimensional FE modeling of acoustic wave propagation and aerodynamics (e.g. ElMasri et al. 1996, Matsuzaki and Motoki 2000): helpful for exact formulation of aero-acoustics in the vicinity of noise sources (glottis, frication); Problem: complexity and high computational effort; real-time synthesis cannot be achieved
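
The reflection type line analog can be illustrated with a minimal sketch. It assumes pressure waves and the common textbook convention r = (A_k - A_{k+1}) / (A_k + A_{k+1}) for the junction between two tube sections; the function names are hypothetical and this is not the implementation used in the cited work.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections.

    areas: cross-sectional areas, ordered from glottis to lips.
    Assumed convention (pressure waves): r = (A_k - A_{k+1}) / (A_k + A_{k+1}).
    """
    if any(a <= 0 for a in areas):
        raise ValueError("all areas must be positive")
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def scatter(r, p_fwd_in, p_bwd_in):
    """Scatter the two partial pressure waves arriving at one junction.

    p_fwd_in: forward wave arriving from the glottis side.
    p_bwd_in: backward wave arriving from the lip side.
    Returns (forward wave leaving towards the lips,
             backward wave leaving towards the glottis).
    """
    p_fwd_out = (1.0 + r) * p_fwd_in - r * p_bwd_in
    p_bwd_out = r * p_fwd_in + (1.0 - r) * p_bwd_in
    return p_fwd_out, p_bwd_out


# Example: a uniform stretch followed by a narrowing.
coefficients = reflection_coefficients([2.0, 2.0, 1.0])
```

In such a scheme every section stands for the same propagation delay, which is why the slide lists the constant tube length as the drawback of this model type.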

  21. Extraction of the area function: Birkholz (2007) • Cross-sections are taken perpendicular to the airflow, along the midline of the VT, and vary in shape! • Cross-sectional area values from glottis to mouth → area function • Note: The area function cannot be calculated from 2D vocal tract models: cross-sectional data cannot be deduced from midsagittal data!

  22. Illustration: Calculation of the area function for KL synthesis, Kröger (1993): needs constant VT length • Begin: VT model; end: discrete area function • Green: midsagittal view • White: gridline system for calculation of the area function • Continuous area function (varying vocal tract length) • Discrete area function → defining tube sections for the acoustic model • … for a complete sentence: “Das ist mein Haus” (“That’s my house”)
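
The step from a continuous area function to a discrete one can be sketched as follows. This is a minimal illustration that assumes the continuous area function is available as sampled (position, area) pairs along the midline and that all tube sections get the same length; the function names and the midpoint-sampling rule are assumptions, not the procedure of Kröger (1993).

```python
def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of the samples (xs, ys) at x (xs ascending)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])


def discretize_area_function(positions, areas, n_sections):
    """Split the tract into n_sections equal-length tubes.

    positions: distance from the glottis of each sample (ascending).
    areas: cross-sectional area at each sample.
    Returns (section_length, list of one area per section), where each
    section takes the interpolated area at its midpoint.
    """
    section_length = (positions[-1] - positions[0]) / n_sections
    midpoints = [positions[0] + (k + 0.5) * section_length
                 for k in range(n_sections)]
    return section_length, [interpolate(positions, areas, m) for m in midpoints]


# Example: a 16 cm tract that is widest in the middle, cut into 4 sections.
length, section_areas = discretize_area_function([0.0, 8.0, 16.0],
                                                 [1.0, 5.0, 1.0], 4)
```

If the vocal tract length changes over time, either the number of sections or the section length has to change with it, which is the point where the two acoustic model types on these slides differ.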

  23. Second example: From VT over area function to vocal tract tubes, now for a transmission line circuit model (Birkholz 2005): vocal tract tubes can vary in length [Figure labels: trachea / subglottal system; glottis; vocal tract: pharyngeal and oral part; teeth; mouth opening; branch; nasal cavity and sinuses (i.e. indirectly coupled nasal cavities: Dang & Honda 1994)]

  24. Next step: From tube model to acoustic signal using the transmission line circuit analog (Birkholz 2005) • Geometrical parameters of an (elliptical) tube section: length l, cross-sectional area A, perimeter S (elliptical, from narrow to round) • On this basis: acoustic parameters of a tube section, i.e. mass (inertia), compressibility and losses of the air column → parameters of lumped elements of the electrical transmission line → calculation of pressure and flow for each tube section
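
As a hedged illustration of how geometry turns into lumped elements, the sketch below uses the standard lossless textbook relations for one tube section: acoustic inertance L = ρ·l/A and acoustic compliance C = A·l/(ρ·c²). The loss elements, which on the slide depend on the perimeter S, are left out on purpose. The constants and the function name are assumptions, not values taken from Birkholz (2005).

```python
RHO = 1.14        # air density in kg/m^3 (assumed value)
C_SOUND = 350.0   # speed of sound in m/s (assumed value)


def lumped_elements(length, area):
    """Lossless lumped elements of one tube section (SI units).

    length: section length l in m.
    area: cross-sectional area A in m^2.
    Returns (inertance, compliance): the mass (inertia) and the
    compressibility of the air column in the section.
    """
    inertance = RHO * length / area
    compliance = area * length / (RHO * C_SOUND ** 2)
    return inertance, compliance


# Example: a section 1 cm long with an area of 1 cm^2.
inertance, compliance = lumped_elements(0.01, 1e-4)
```

Two sanity checks follow from these definitions: the product L·C equals (l/c)², the squared propagation delay of the section, and the ratio L/C equals (ρ·c/A)², the squared characteristic impedance.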

  25. Illustration: Calculation of the acoustic speech signal, Kröger (1993), for a complete sentence: “Das ist mein Haus” (“That’s my house”) [Figure panels: lung pressure, vocal fold tension, glottal aperture for the whole utterance (white: progress bar of the calculation); tube section model (area function) with oral part and nasal part (red arrows: insertion of Bernoulli pressure drop and of noise source); time line for the complete utterance; instantaneous acoustic signal (20 ms window)]

  26. Display of air flow and air pressure calculated along the transmission line, Kröger (1993), for one glottal cycle within a complete sentence: “Das ist mein Haus” (“That’s my house”) • Strong acoustic excitation at the time instant of glottal closure (after the glottal closing phase) • High flow values during glottal opening [Figure panels: lung pressure, vocal fold tension, glottal aperture for the whole utterance (white: progress bar of the calculation); tube section model (current area function) with current pressure values of each tube section; acoustic signal just calculated (20 ms window). Colors: magenta: pressure; blue: flow; red: glottal mass pair; light blue: force on the mass pair]

  27. Summarizing: Vocal tract models and acoustic simulation Vocal tract models: • The area function is the basis for the calculation of the acoustic signal. The area function cannot be calculated from 2D models → this disqualifies 2D VT models for articulatory-acoustic speech synthesis • Parametric VT models should currently be preferred for building high-quality articulatory speech synthesizers. Advantages: • low computational effort for calculating vocal tract geometries • high flexibility for obtaining auditorily satisfying sound targets • In future these models should be replaced by statistically based models and by biomechanical models

  28. Summarizing: Vocal tract models and acoustic simulation Acoustic simulation: • Problems occurring with different acoustic models: • Variation of length of tube sections • Modeling frequency-dependent losses • Computational effort • Conclusion: The transmission line circuit analog (TLCA, e.g. Birkholz et al. 2007) allows a compromise between quality and computational effort: real-time synthesis should be possible in the near future on normal PCs using TLCA

  29. Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  30. Glottis models • Self-oscillating models (preferred; e.g. Ishizaka & Flanagan 1972 two-mass model and derivatives) • physiological control parameters: vocal fold tension, glottal aperture … • calculation of glottal area waveform (low, up) and glottal flow • Parametric glottal area models (e.g. Titze 1984 and derivatives) • glottal waveform (opening-closing movement) is given • calculation of glottal flow • Parametric glottal flow models (e.g. LF-model 1985) • acoustically relevant control parameters: F0, open quotient, maximum negative peak flow (time derivative of glottal flow), return phase … → direct control of acoustic voice quality
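
To make the idea of a parametric glottal flow model concrete, here is a deliberately simplified sketch. It is not the LF-model: it uses a plain raised-cosine pulse controlled only by F0 and open quotient, and the function name and pulse shape are assumptions chosen for brevity.

```python
import math


def glottal_flow(t, f0, open_quotient, peak_flow=1.0):
    """Glottal flow at time t (s) for a simple raised-cosine pulse train.

    f0: fundamental frequency in Hz.
    open_quotient: fraction of each period during which the glottis is open.
    peak_flow: maximum flow, reached in the middle of the open phase.
    The flow is zero during the closed phase of every period.
    """
    period = 1.0 / f0
    open_time = open_quotient * period
    phase = t % period
    if phase >= open_time:
        return 0.0
    return peak_flow * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase / open_time))


# Example: one period at F0 = 100 Hz, sampled at 10 kHz.
pulse = [glottal_flow(n / 10000.0, 100.0, 0.5) for n in range(100)]
```

A real parametric flow model adds further shape parameters, such as the return phase named on the slide, to control voice quality more directly.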

  31. Different phonation types using a self-oscillating model (Kröger 1997) • The model: extended by a chink (leak) • Able to produce: • normal phonation • loud phonation • breathy phonation • creaky phonation • Able to produce: • F0 contours • voiced-voiceless contrast • Only two control parameters: vocal fold tension and glottal aperture [Examples: normal, loud, breathy, creaky]

  32. Mechanisms for the generation of noise Noise is produced at narrow passages within the VT: • Separate: • volume velocity sources (no obstacle case) • pressure sources (obstacle case, Stevens 1998)

  33. Occurrence of noise sources in the VT (Birkholz 2007) • Noise sources occur simultaneously at different places within the VT: • pressure sources: lung section (no noise), epiglottis, at obstacles (e.g. teeth) • volume velocity sources: at the exit of each VT constriction • Controlled by degree of VT constriction and amplitude of air flow
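
How a noise source can be controlled by constriction degree and air flow is sketched below. The rule is an assumption borrowed from common textbook practice, not the model of Birkholz (2007): the source amplitude grows with the squared Reynolds number at the constriction once a critical value is exceeded. All constants, the gain, and the function names are hypothetical.

```python
import math

RHO = 1.14        # air density in kg/m^3 (assumed value)
MU = 1.86e-5      # dynamic viscosity of air in Pa*s (assumed value)
RE_CRIT = 1800.0  # critical Reynolds number (assumed, to be tuned)


def reynolds_number(flow, area):
    """Reynolds number at a constriction.

    flow: volume velocity in m^3/s.
    area: cross-sectional area of the constriction in m^2,
          treated as a circular opening to get a diameter.
    """
    velocity = flow / area
    diameter = 2.0 * math.sqrt(area / math.pi)
    return RHO * velocity * diameter / MU


def noise_amplitude(flow, area, gain=1.0):
    """Noise source amplitude: zero below the critical Reynolds number."""
    re = reynolds_number(abs(flow), area)
    return gain * max(0.0, re * re - RE_CRIT * RE_CRIT)


# Example: the same flow through a wide passage and through a narrow one.
wide = noise_amplitude(2e-4, 3e-4)     # 200 cm^3/s through 3 cm^2
narrow = noise_amplitude(2e-4, 1e-5)   # 200 cm^3/s through 0.1 cm^2
```

The gain and the spectral shape of the source are exactly the kind of parameters that have to be optimized per place of articulation.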

  34. Voiceless excitation of the vocal tract • Noise is produced at narrow passages within the VT • The mechanisms of noise generation are not completely understood (no satisfying 3D FE models solving the Navier-Stokes equations) • Current solution: parameter optimization • The art of constructing a good noise source model is to • find the right places for insertion of noise sources • optimize parameters (spectral shape, strength, …) of the source noise

  35. Noise source parameter optimization: examples • Synthesis examples (real – synthetic - /aCa/), Birkholz (2005): /f/ /s/ /sh/ /ch/ /x/ • But compare with Mawass, Badin & Bailly (2000): /f/ /s/ /sh/

  36. Summarizing: glottis models and noise source models • Use self-oscillating vocal fold models: they can be used for high-quality articulatory speech synthesis • vocal fold tension → mainly determines F0 • glottal aperture → voice qualities: pressed – normal – breathy • glottal aperture → segmental changes: glottal stop – voiced – voiceless • Use simple noise models (pressure and velocity sources) • 3D acoustic noise source models (solving the Navier-Stokes equations) are currently not satisfying.

  37. Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  38. Generation of speech movements • Starting with segmental input: text → sound chain, phoneme chain (text-to-phoneme conversion) • But: How to convert a chain of segments (phones) → articulatory movements? • Theoretically and practically elegant solution: the concept of articulatory gestures, which have a bivalent character: • discrete phonological units • quantitative units for controlling articulatory movements

  39. Phonological plan → motor plan (discrete). Example: „Kompass“ (“compass”), segments: k O m p a s [Figure: discrete gestures (vocal organ, target) arranged in a V-row, C-row 1, C-row 2, and a row for glottis and velopharyngeal port; gesture labels: O, a, do, la, la, ap, fc, fc, fc, nc, og, ov, og, og; the quantitative control units form the motor plan]

  40. From discrete to quantitative realisation of a gesture • Example: dorsal full closing gesture: {fcdo} • Quantitative gestural parameters: Ton, Toff, Ttarg, voc_org, loctarg [Figure: activation interval; time function for articulator movement]
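
One way to turn such gestural parameters into a time function is sketched below. The field names mirror the slide (Ton, Toff, Ttarg, voc_org, loctarg), but their interpretation is an assumption: Ttarg is read as the time at which the target is nearly reached, and the movement is modeled as a critically damped approach that is held after Toff. The class and the constant 4.74 (which leaves about 5 % of the distance at Ttarg) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Gesture:
    vocal_organ: str  # voc_org on the slide, e.g. "dorsum"
    target: float     # loctarg on the slide, here a 1-D target value
    t_on: float       # Ton: start of the activation interval in s
    t_off: float      # Toff: end of the activation interval in s
    t_targ: float     # Ttarg: assumed time at which the target is nearly reached

    def position(self, t, start):
        """Articulator position at time t, starting from `start`.

        Critically damped approach towards the target while the gesture
        is active; the position reached at t_off is held afterwards.
        """
        if t <= self.t_on:
            return start
        tau = min(t, self.t_off) - self.t_on
        omega = 4.74 / (self.t_targ - self.t_on)
        remaining = (1.0 + omega * tau) * math.exp(-omega * tau)
        return self.target + (start - self.target) * remaining


# Example: dorsal full closing gesture, constriction degree 1.0 (open) to 0.0 (closed).
fcdo = Gesture("dorsum", target=0.0, t_on=0.10, t_off=0.30, t_targ=0.20)
trajectory = [fcdo.position(n / 100.0, start=1.0) for n in range(40)]
```

The same discrete gesture label can thus be realized with different timing simply by changing the quantitative parameters.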

  41. Modeling reduction is easily possible. Example: “mit dem Boot” (“with the boat”) in 9 steps (Kröger 1993) • increase in speech rate → increase in gestural overlap (quantitative) → segmental changes (qualitative) • In the motor plan all gestures still exist! [Figure: motor plan, from not reduced to fully reduced]
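
The link between speech rate and gestural overlap can be illustrated with a toy calculation. The timing rule is an assumption made only for this sketch: gesture onsets are compressed by the rate factor while each gesture keeps its own duration. The function names are hypothetical and the numbers are not taken from Kröger (1993).

```python
def rescale(gestures, rate):
    """Compress gesture onsets by `rate`, keeping each gesture's duration.

    gestures: list of (t_on, t_off) activation intervals in s.
    rate: speech rate factor, 1.0 = unchanged, 2.0 = twice as fast.
    """
    return [(on / rate, on / rate + (off - on)) for on, off in gestures]


def overlap(a, b):
    """Temporal overlap in s between two activation intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


# Example: two gestures that just touch at normal rate.
normal = [(0.0, 0.2), (0.2, 0.4)]
fast = rescale(normal, 2.0)
overlap_normal = overlap(normal[0], normal[1])
overlap_fast = overlap(fast[0], fast[1])
```

No gesture is deleted in this sketch: both intervals still exist at the faster rate, they merely overlap more.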

  42. Connected speech using gestural control: Examples (1) „Der Zug...“ (“The train...”) „Guten Tag...“ (“Good day...”)

  43. Connected speech using gestural control: Examples (2) „Nächster Halt...“ (“Next stop...”) „Nächster Halt...“ (“Next stop...”)

  44. Summarizing: control concepts • A gesture-based control concept can be used: • It links phonemes to articulation through the bivalent character of gestures: • discrete phonological units • quantitative units for motor control (activation interval, targets, transition velocities, …) • Gestures quantitatively comprise • a description of the target-directed movement • a definition of the target itself (not incompatible with target concepts) • Gestures model the segmental changes (assimilations, elisions) occurring in reduction through an increase in the temporal overlap of gestures • Open question: How to deduce rules for the coordination of speech gestures for syllables, words, complete utterances?

  45. Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

  46. Note • We have a lot of knowledge concerning the plant (Birkholz et al. 2007): • articulatory geometries • speech acoustics • We have much less knowledge concerning neural control of speech articulation

  47. Note • We have a lot of knowledge concerning the plant: • articulatory geometries • speech acoustics • We have much less knowledge concerning neural control of speech articulation [Figure: Homer Simpson; labels: “problem”, “no problem”]

  48. Idea • Copy or mimic speech acquisition: • Start like a toddler with babbling, i.e. explore your vocal apparatus and associate motor states with the resulting sensory states (auditory, somatosensory) • Imitation: copying the mother’s (caretaker’s) speech signals is now possible, since auditory-to-motor relations are already trained • Idea: build up a corpus of trained speech items (known as the mental syllabary, postulated by Levelt and Wheeldon 1994) → corpus-based neuro-articulatory speech synthesis • This is based purely on acoustic data; articulatory data (EMA …) are not needed: toddlers are able to learn to speak from acoustic stimulation
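
The babbling-then-imitation idea can be sketched with a toy example. Everything in it is hypothetical: the "plant" is a made-up one-dimensional mapping from a motor state to a formant-like value, babbling is plain random exploration, and imitation is a nearest-neighbour lookup in the stored pairs. It stands in for, and is far simpler than, the self-organizing neural maps referred to in the talk.

```python
import random


def forward_model(motor):
    """Hypothetical plant: maps a motor state in [0, 1] to a 'formant' in Hz."""
    return 300.0 + 500.0 * motor


def babble(n_items, seed=0):
    """Explore the vocal apparatus: store (motor state, auditory state) pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_items):
        motor = rng.random()
        pairs.append((motor, forward_model(motor)))
    return pairs


def imitate(pairs, target_auditory):
    """Return the babbled motor state whose auditory result is closest to the target."""
    motor, _ = min(pairs, key=lambda pair: abs(pair[1] - target_auditory))
    return motor


# Example: babble 200 items, then imitate an external 550 Hz stimulus.
experience = babble(200)
motor_state = imitate(experience, 550.0)
produced = forward_model(motor_state)
```

Only the auditory result of the external speaker is needed for the imitation step; the motor state is recovered from the learner's own babbling experience.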

  49. Neurophonetic model of speech production (DFG grant KR1439/13-1, 2007-2010) [Figure: block diagram. Processing labels: from mental lexicon and syllabification: phonological plan; phonemic map; phonetic map; motor planning (frequent syllables, infrequent syllables); prosody; motor plan; motor execution (control and corrections); primary motor map; neuro-muscular processing; cortical motor state; articulatory state (muscles and articulators: tongue, lips, jaw, velum …); articulatory signal; acoustic signal; auditory receptors and preprocessing; auditory map; auditory-phonetic processing; cortical auditory state; somatosensory receptors and preprocessing; somatosensory map; somatosensory-phonetic processing; external speaker (mother); to comprehension. Anatomical labels: frontal lobe (premotor, primary motor); temporal lobe and parietal lobe (high-order, primary); subcortical: cerebellum, basal ganglia, thalamus; peripheral: skin, ears and sensory pathways]

  50. Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions
