
Maximum-Likelihood Dynamic Intonation Model for Concatenative Text to Speech System

Slava Shechtman. IBM Haifa Research Laboratory.


Presentation Transcript


  1. Maximum-Likelihood Dynamic Intonation Model for Concatenative Text to Speech System • Slava Shechtman • IBM Haifa Research Laboratory

  2. Outline • CART intonation modeling • Maximal Likelihood Dynamic intonation model • Dynamic observations • Maximum-likelihood solution • Microprosody preservation technique • Implementation and preliminary results • Future research directions

  3. CART prosody modeling • [Diagram: language data (semantic data, syntactic data, phonetic context, syllable location) and a speech corpus with pitch data feed the "grow duration tree" and "grow pitch tree" steps, which produce a duration tree and a pitch tree]

  4. Basic CART intonation model • [Diagram: decision tree with question nodes Q1, Q2, Q3] • Rough, but simple and automatic • Extract semantic, syntactic and phonetic features from the TTS front end (per syllable) • POS, word stress, syllable stress • Sentence type, phrase type • Syllable location • Phonetic context • 3 log-pitch observations per syllable (in the sonorant part of the syllable) • Mean pitch values are associated with tree leaves to represent the target intonation (implicit i.i.d. assumption)
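The leaf-mean step above can be sketched as follows. This is a minimal illustration, not the system's code: it assumes the tree has already been grown and every training syllable already carries a leaf label. The function name `leaf_means` and the toy data are hypothetical.

```python
from collections import defaultdict


def leaf_means(leaf_ids, observations):
    """Average the per-syllable log-pitch vectors that fall into each leaf.

    leaf_ids     -- one leaf label per training syllable (assumed given)
    observations -- one list of log-pitch samples per training syllable
    """
    sums = {}
    counts = defaultdict(int)
    for leaf, obs in zip(leaf_ids, observations):
        if leaf not in sums:
            sums[leaf] = [0.0] * len(obs)
        for i, value in enumerate(obs):
            sums[leaf][i] += value
        counts[leaf] += 1
    return {leaf: [s / counts[leaf] for s in sums[leaf]] for leaf in sums}


# Hypothetical toy data: two leaves, 3 log-pitch observations per syllable.
leaf_ids = ["A", "B", "A"]
observations = [[4.0, 5.0, 5.0], [4.0, 4.5, 4.5], [5.0, 5.0, 4.0]]
means = leaf_means(leaf_ids, observations)
```

At synthesis time, the mean vector of the leaf reached by each syllable would serve as that syllable's target log-pitch.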

  5. Basic application of CART intonation model • Use mean log-pitch values to estimate the target pitch for concatenated segments • Use the distance from the target pitch as an additive cost factor in the overall segment selection cost • Optionally, use the above target pitch curve for speech synthesis (after smoothing and/or combination with the actual pitch from the selected segments)
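The additive pitch cost described above might look like the sketch below. The absolute log-pitch distance, the weight `pitch_weight`, and the candidate fields `other_cost` and `f0_hz` are all assumptions made for illustration; the slides do not specify the distance measure or the weighting.

```python
import math


def pitch_target_cost(candidate_f0_hz, target_log_f0):
    """Distance between a candidate's log-pitch and the target (assumed: absolute)."""
    return abs(math.log(candidate_f0_hz) - target_log_f0)


def total_cost(other_cost, candidate_f0_hz, target_log_f0, pitch_weight=1.0):
    """Add the weighted pitch distance to the remaining selection cost."""
    return other_cost + pitch_weight * pitch_target_cost(candidate_f0_hz, target_log_f0)


def pick_best(candidates, target_log_f0, pitch_weight=1.0):
    """Return the candidate segment with the lowest combined cost."""
    return min(
        candidates,
        key=lambda c: total_cost(c["other_cost"], c["f0_hz"], target_log_f0, pitch_weight),
    )


# Hypothetical candidates for one unit; the target is the log of 120 Hz.
candidates = [
    {"name": "seg_a", "other_cost": 0.0, "f0_hz": 200.0},
    {"name": "seg_b", "other_cost": 0.1, "f0_hz": 120.0},
]
best = pick_best(candidates, math.log(120.0))
```

Here `seg_b` wins: its small extra cost is outweighed by the large log-pitch distance of `seg_a` from the target.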

  6. Maximal Likelihood Dynamic intonation model • [Diagram: consecutive syllables S1, S2, S3] • Model cross-syllable dynamic observations as well as intra-syllable observations • Maximum-likelihood solution, based on the HMM synthesis approach (Tokuda et al.) • Convenient framework for combining both instantaneous and differential observations in order to obtain the most likely smooth parameter contour for a given clustering • May be applied over the regular CART trees

  7. Dynamic features for CART intonation modeling • [Figure: pairs of observation points for difference calculation] • Extend the static observation vectors for the n-th syllable • Add four time-normalized differences of static observations • Guarantee a non-zero time interval between the observation instances • New observation vector
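A rough sketch of building such an extended vector is given below. The figure that defined the observation pairs did not survive in this transcript, so the pairing in `ASSUMED_PAIRS` (two intra-syllable and two cross-syllable differences) is purely an assumption, as are the `scale` parameter and the choice to return `None` where a neighbouring syllable does not exist.

```python
# Each pair is ((syllable offset, point index), (syllable offset, point index)).
# ASSUMED pairing; the original figure with the actual pairs is not available.
ASSUMED_PAIRS = [
    ((-1, 2), (0, 0)),  # last point of previous syllable -> first point of this one
    ((0, 0), (0, 1)),   # intra-syllable
    ((0, 1), (0, 2)),   # intra-syllable
    ((0, 2), (1, 0)),   # last point of this syllable -> first point of next one
]


def dynamic_vector(times, logf0, n, pairs=ASSUMED_PAIRS, scale=1.0):
    """Return 3 static log-pitch values plus 4 time-normalized differences."""
    static = list(logf0[n])
    deltas = []
    for (da, ia), (db, ib) in pairs:
        a, b = n + da, n + db
        if not (0 <= a < len(logf0) and 0 <= b < len(logf0)):
            deltas.append(None)  # neighbour missing at an utterance edge
            continue
        dt = times[b][ib] - times[a][ia]
        if dt <= 0:
            raise ValueError("observation instants must be strictly increasing")
        deltas.append(scale * (logf0[b][ib] - logf0[a][ia]) / dt)
    return static + deltas


# Hypothetical two-syllable utterance: observation times (s) and log-pitch values.
times = [[0.0, 0.25, 0.5], [1.0, 1.25, 1.5]]
logf0 = [[5.0, 5.25, 5.5], [5.0, 4.75, 4.5]]
vec0 = dynamic_vector(times, logf0, 0)
vec1 = dynamic_vector(times, logf0, 1)
```

The `ValueError` branch mirrors the requirement of a non-zero time interval between the two observation instants of every pair.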

  8. Maximal Likelihood Dynamic intonation model • Assume a cluster sequence Q is predetermined by CART • Each cluster is modeled by a single 7-dim Gaussian • Concatenated observations: $O = [o_1^\top, \ldots, o_N^\top]^\top$ • Concatenated static observations: $C = [c_1^\top, \ldots, c_N^\top]^\top$ • Sparse (block diagonal) linear transformation: $O = WC$

  9. Maximal Likelihood Dynamic intonation model • The log-likelihood of the observation sequence O is given by • Where
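The expression on this slide did not survive in the transcript. Under the stated model (one 7-dimensional Gaussian per cluster, cluster sequence Q fixed by CART), the standard form from the HMM-synthesis literature would be the following. The symbols M, U, N, \mu_{q_n} and \Sigma_{q_n} are assumed notation, not taken from the slide: N is the number of syllables, and \mu_{q_n}, \Sigma_{q_n} are the mean and covariance of the cluster assigned to syllable n.

```latex
\log P(O \mid Q)
  = -\tfrac{1}{2}\,(O - M)^{\top} U^{-1} (O - M)
    - \tfrac{1}{2}\log\lvert U\rvert
    - \tfrac{7N}{2}\log(2\pi),
\qquad
M = \begin{bmatrix}\mu_{q_1}^{\top} & \cdots & \mu_{q_N}^{\top}\end{bmatrix}^{\top},
\quad
U = \operatorname{diag}\!\left(\Sigma_{q_1}, \ldots, \Sigma_{q_N}\right)
```

Only the quadratic term depends on O, so it alone determines the most likely contour once O is expressed through the static observations.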

  10. Maximal Likelihood Dynamic intonation model • Likelihood maximization with respect to the static observations C • An efficient time-recursive solution exists (Tokuda et al., 1996) • Jointly determines the full-utterance pitch curve • The solution depends both on the individual CART cluster models and on their sequence in the synthesized sentence
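As a hedged illustration of this maximization: if the full observations are a linear function O = WC of the static ones and the cluster covariances are diagonal (both assumptions here; W, `means` and `variances` are illustrative names), setting the gradient of the Gaussian log-likelihood to zero gives the linear system (WᵀU⁻¹W)C = WᵀU⁻¹M. The sketch solves it with plain dense Gaussian elimination, not the efficient time-recursive algorithm cited on the slide, and uses unit time spacing in the toy example.

```python
def solve_ml_contour(W, means, variances):
    """Solve (W^T U^-1 W) c = W^T U^-1 m for a diagonal U (dense, for illustration)."""
    n = len(W[0])
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    # Accumulate the normal equations one observation row at a time.
    for row, m, var in zip(W, means, variances):
        for i in range(n):
            if row[i] == 0.0:
                continue
            b[i] += row[i] * m / var
            for j in range(n):
                A[i][j] += row[i] * row[j] / var
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    # Back substitution.
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (b[i] - s) / A[i][i]
    return c


# Toy example: 3 static points with a bump in the middle, plus 2 difference rows.
static_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
delta_rows = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0]]
static_means = [5.0, 5.6, 5.0]

mean_only = solve_ml_contour(static_rows, static_means, [1.0, 1.0, 1.0])
smoothed = solve_ml_contour(
    static_rows + delta_rows,
    static_means + [0.0, 0.0],      # assumed zero-mean differences
    [1.0, 1.0, 1.0, 0.1, 0.1],      # assumed tight variance on the differences
)
```

With static rows only, the solution reduces to the cluster means. Adding the difference rows pulls neighbouring points together, which is why the result depends on the whole cluster sequence rather than on each cluster in isolation.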

  11. Maximal Likelihood Dynamic intonation model • Smooths abrupt changes existing in the mean solution • Controlled by the scaling factor inside the dynamic observations • Allows usage of larger CART trees for fine clustering

  12. Microprosody preservation • [Figure: calculated target pitch, natural pitch and combined pitch curves] • Improve rough pitch curve resolution • Keep the original fine pitch structure inside each contiguous portion of speech to increase naturalness, but align it with the target intonation curve • Compensate for the imperfection of the CART model and feature extraction
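The slides do not say how the natural and target curves are combined. One simple possibility, shown purely as an assumption and not as the author's method, is to add a linearly interpolated offset to the natural log-pitch of each contiguous chunk so that its endpoints land on the target curve while the local detail is kept. The function name `align_to_target` is hypothetical.

```python
def align_to_target(natural, target):
    """Shift a contiguous chunk's natural log-pitch onto the target curve.

    A linear-in-index offset is added so the chunk's first and last samples
    match the target; the fine structure in between is preserved.
    (Assumed scheme for illustration only.)
    """
    n = len(natural)
    if n != len(target):
        raise ValueError("natural and target must have the same length")
    if n == 0:
        return []
    if n == 1:
        return [target[0]]
    start = target[0] - natural[0]
    end = target[-1] - natural[-1]
    return [natural[i] + start + (end - start) * i / (n - 1) for i in range(n)]


# Hypothetical chunk: natural log-pitch with local wiggles, and a smooth target.
natural = [5.0, 5.2, 5.1, 5.3]
target = [4.8, 4.9, 5.0, 5.1]
combined = align_to_target(natural, target)
```

In this toy case the offset is constant, so the local rise and fall of the natural contour carries over unchanged while both endpoints sit on the target.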

  13. Mean solution vs. ML dynamic intonation model • Mean solution: • ML dynamic solution:

  14. Incorporation within CTTS system • Applied to an embedded version of the IBM CTTS system with a sub-phoneme basic concatenation unit (regularly one third of a phoneme) • (A): CART mean solution as the target pitch, smoothed original pitch curve as the synthesis pitch • (B): dynamic ML CART solution as the target pitch, with the microprosody preservation technique used to combine the original and target pitches • Subjective evaluation by TTS experts and native speakers

  15. Summary and further research directions • A dynamic ML CART intonation model was proposed and shown to perform better than the baseline CART intonation model • It was successfully combined with the original pitch curve using the microprosody preservation technique • Further research • Alternative dynamic features • Statistical microprosody modeling for very-small-footprint voices • Adaptive microprosody incorporation
