On Use of Temporal Dynamics of Speech for Language Identification

Presentation Transcript


  1. On Use of Temporal Dynamics of Speech for Language Identification
     Andre Adami, Pavel Matejka, Petr Schwarz, Hynek Hermansky
     Anthropic Signal Processing Group, http://www.asp.ogi.edu

  2. OGI-4 – ASP System
     [Block diagram: speech signal → segmentation → units → scoring against a target language model (+) and a background model (−)]
     • Goal
       • Convert the speech signal into a sequence of discrete sub-word units that can characterize the language
     • Approach
       • Use temporal trajectories of speech parameters to obtain the sequence of units
       • Model the sequence of discrete sub-word units with an N-gram language model (a minimal scoring sketch follows this slide)
     • Sub-word units
       • TRAP-derived American English phonemes
       • Symbols derived from prosodic cue dynamics
       • Phonemes from OGI-LID
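The scoring scheme in the diagram (unit sequences scored by a target-language model against a background model) amounts to a log-likelihood ratio between two N-gram models. A minimal sketch in Python, assuming add-alpha smoothing and a per-unit-normalized score; the slide does not specify the smoothing or the normalization:

```python
import math
from collections import defaultdict

def train_trigram(sequences, alpha=1.0):
    """Add-alpha smoothed trigram model over discrete sub-word units."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in sequences:
        padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            counts[(padded[i - 2], padded[i - 1])][padded[i]] += 1
    v = len(vocab)

    def logprob(hist, unit):
        c = counts[hist]
        return math.log((c[unit] + alpha) / (sum(c.values()) + alpha * v))

    return logprob

def llr_score(units, target_lp, background_lp):
    """Average per-unit log-likelihood ratio: target (+) minus background (-)."""
    padded = ["<s>", "<s>"] + list(units) + ["</s>"]
    score = 0.0
    for i in range(2, len(padded)):
        hist = (padded[i - 2], padded[i - 1])
        score += target_lp(hist, padded[i]) - background_lp(hist, padded[i])
    return score / (len(padded) - 2)
```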

  3. American English Phoneme Recognition
     [Diagram: short-term analysis vs. the temporal patterns paradigm; the classifier decides the phone from a trajectory in time within a frequency band rather than from a single short-term spectral slice]
     • Phoneme set
       • 39 American English phonemes (CMU-like)
     • Phoneme recognizer
       • Trained on NTIMIT
       • TRAP (Temporal Patterns) based; see the trajectory-extraction sketch below
       • Speech segments for training obtained from energy-based speech/non-speech segmentation
     • Modeling
       • 3-gram language model
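The temporal patterns paradigm replaces a short-term spectral slice with a roughly 1 s log-energy trajectory in each frequency band. The sketch below extracts such trajectories with a simplified triangular mel filterbank; the 8 kHz sampling rate, the 100 frames/s step, and the FFT sizes are assumptions, not the system's exact front end:

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def imel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, nfft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_bands + 2))
    bins = np.floor(edges / (sr / 2.0) * (nfft // 2)).astype(int)
    fb = np.zeros((n_bands, nfft // 2 + 1))
    for i in range(n_bands):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)
        if r > c:
            fb[i, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)
    return fb

def logenergy_trajectories(signal, sr=8000, n_bands=23, frame=200, hop=80, nfft=256):
    """Per-band log-energy trajectories at ~100 frames/s (10 ms hop at 8 kHz)."""
    fb = mel_filterbank(n_bands, nfft, sr)
    window = np.hamming(frame)
    frames = np.stack([signal[i:i + frame] * window
                       for i in range(0, len(signal) - frame, hop)])
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    return np.log(fb @ power.T + 1e-10)        # shape: (n_bands, n_frames)

# a 1 s TRAP input for one band is then a 101-frame slice of one row, e.g.
# logenergy_trajectories(x)[band, t - 50:t + 51]
```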

  4. English Phoneme System
     [Block diagram: frequency × time trajectories feed band classifiers 1..N; a merger combines their outputs, followed by a Viterbi search]
     • Temporal trajectories
       • 23 mel-scale frequency bands
       • 1 s segments of the log-energy trajectory
     • Band classifiers
       • MLP (101x300x39)
       • Hidden unit nonlinearities: sigmoids
       • Output nonlinearities: softmax
     • Merger
       • MLP (897x300x39)
     • Viterbi search
       • Penalty factor tuning: deletions = insertions
     • Training
       • NTIMIT
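A shape-level sketch of this architecture: one MLP per band maps a 101-frame log-energy trajectory to 39 phoneme posteriors, and a merger MLP combines all band outputs (23 × 39 = 897 inputs). The weights below are random placeholders standing in for the trained networks, and the Viterbi search is omitted:

```python
import numpy as np

N_BANDS, TRAJ_LEN, HIDDEN, N_PHONES = 23, 101, 300, 39

def mlp(x, w1, b1, w2, b2):
    """One hidden layer: sigmoid hidden units, softmax outputs."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    z = h @ w2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()

def random_net(n_in, n_hidden, n_out, rng):
    """Placeholder weights; a real system would train these networks."""
    return (rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(scale=0.1, size=(n_hidden, n_out)), np.zeros(n_out))

rng = np.random.default_rng(0)
band_nets = [random_net(TRAJ_LEN, HIDDEN, N_PHONES, rng) for _ in range(N_BANDS)]  # 101x300x39
merger = random_net(N_BANDS * N_PHONES, HIDDEN, N_PHONES, rng)                     # 897x300x39

trajectories = rng.normal(size=(N_BANDS, TRAJ_LEN))       # 1 s log-energy per band
band_out = np.concatenate([mlp(t, *net) for t, net in zip(trajectories, band_nets)])
phone_posteriors = mlp(band_out, *merger)                 # 39 phoneme posteriors
```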

  5. Prosodic Cues Dynamics
     • Technique
       • Use prosodic cues (intensity and pitch trajectories) to derive the sub-word units
     • Approach
       • Segment the speech signal at the inflection points of the trajectories (zero-crossings of the derivative) and at the onsets and offsets of voicing
       • Label each segment by the direction of change of the parameter within the segment (a minimal segmentation sketch follows this slide)
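A minimal sketch of this segmentation rule under two assumptions: the trajectory is given as per-frame values, and voicing is supplied as a boolean mask (the slide does not say how voicing is detected):

```python
import numpy as np

def prosodic_segments(f0, voiced):
    """f0: per-frame pitch (or intensity) values; voiced: per-frame boolean mask."""
    d = np.diff(f0)
    cuts = {0, len(f0)}
    # inflection points: zero-crossings of the derivative
    cuts.update((np.where(np.diff(np.sign(d)) != 0)[0] + 1).tolist())
    # onsets and offsets of voicing
    cuts.update((np.where(np.diff(voiced.astype(int)) != 0)[0] + 1).tolist())
    bounds = sorted(cuts)
    segments = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if not voiced[a:b].any():
            label = "unvoiced"
        else:
            # direction of change of the parameter within the segment
            label = "rising" if f0[b - 1] >= f0[a] else "falling"
        segments.append((a, b, label))
    return segments
```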

  6. Prosodic Cues Dynamics
     • Duration
       • The duration of a segment is characterized as "short" (less than 8 frames) or "long"
       • 10 symbols
     • Broad phonetic category (BFC)
       • Finer labeling achieved by estimating the broad phonetic category (vowel+diphthong+glide, schwa, stop, fricative, flap, nasal, and silence) coinciding with each prosodic segment; see the symbol-construction sketch below
       • BFC TRAPs trained on NTIMIT are used to derive the broad phonetic categories
       • 61 symbols
     • Modeling
       • 3-gram language model
     • BFC TRAPs setup
       • Input temporal vectors: 15 bark-scale frequency band energies, 1 s segments of the log-energy trajectory, mean- and variance-normalized, dimensionality reduced by DCT
       • Band classifiers: MLP (15x100x7), sigmoid hidden units, softmax output units
       • Merger: MLP (105x100x7)
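How the direction, duration, and BFC labels combine into the 10- and 61-symbol inventories is not spelled out on the slide; the sketch below shows one plausible construction, with the BFC classifier stubbed out as a caller-supplied function (on the slide it is a TRAP classifier trained on NTIMIT):

```python
def duration_tag(n_frames, threshold=8):
    """Quantize segment duration: "short" below 8 frames, else "long"."""
    return "short" if n_frames < threshold else "long"

def symbolize(segments, bfc_of_segment=None):
    """segments: (start, end, direction) triples, e.g. from a prosodic segmenter.
    Without BFC labels this gives the coarse inventory; adding the broad
    phonetic category per segment gives the finer (61-symbol) inventory."""
    symbols = []
    for a, b, direction in segments:
        sym = f"{direction}-{duration_tag(b - a)}"
        if bfc_of_segment is not None:
            # e.g. "vowel", "schwa", "stop", "fricative", "flap", "nasal", "silence"
            sym += "-" + bfc_of_segment(a, b)
        symbols.append(sym)
    return symbols
```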

  7. OGI-4 – ASP System
     [Results plot, 30 s condition: EER30s = 17.8%, 41.4%, 19.3%, and 32.1%]

  8. OGI-4 – ASP System
     [Results plot, 30 s condition: EER30s = 17.8%]

  9. Post-Evaluation – Phoneme System
     • Speech/non-speech segmentation using silence classes from the TRAP-based classification
     • TRAPs classifier
       • Temporal trajectory duration: 400 ms
       • 3 bands as the input trajectory for each band classifier, to exploit the correlation between adjacent bands
       • The trajectories of the 3 bands are projected onto a DCT basis (20 coefficients); a feature sketch follows this slide
     • Viterbi search tuned for language identification
     • Training data
       • CallFriend training and development sets
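A sketch of the revised band-classifier input: a ~400 ms log-energy trajectory from a band and its two neighbours, each reduced to its first 20 DCT coefficients. The 100 frames/s rate (hence a 41-frame trajectory) and the clamping at the spectrum edges are assumptions:

```python
import numpy as np
from scipy.fft import dct

def band_triplet_features(log_energy, band, center, n_coef=20, half_span=20):
    """Input features for one band classifier at one frame position.
    log_energy: (n_bands, n_frames) array; at an assumed 100 frames/s,
    half_span=20 gives a ~400 ms (41-frame) trajectory."""
    n_bands, _ = log_energy.shape
    feats = []
    for b in (band - 1, band, band + 1):        # the band and its two neighbours
        b = min(max(b, 0), n_bands - 1)         # clamp at the spectrum edges
        traj = log_energy[b, center - half_span:center + half_span + 1]
        feats.append(dct(traj, norm="ortho")[:n_coef])
    return np.concatenate(feats)                # 3 bands x 20 DCT coefficients
```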

  10. Post-Evaluation – Phoneme System
      [Results plot, 30 s condition: EER30s = 12.7%, a 34% relative improvement]

  11. Post-Evaluation – Prosodic Cues System
      • No energy-based segmentation; instead, unvoiced segments longer than 2 seconds are considered non-speech
      • No broad phonetic category labeling applied; tokens encode the rate of change plus the quantized duration (10 tokens), as in the sketch below
      • Training data
        • CallFriend training and development sets
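A small sketch of these two changes, reusing the (start, end, direction) segments from the earlier segmentation sketch; the 100 frames/s rate and the 8-frame duration threshold carried over from the earlier system are assumptions:

```python
FRAMES_PER_SEC = 100   # assumed frame rate; the slide gives the threshold in seconds

def tokenize(segments):
    """segments: (start, end, direction) triples from a prosodic segmenter."""
    tokens = []
    for a, b, direction in segments:
        # unvoiced stretches longer than 2 s are treated as non-speech and dropped
        if direction == "unvoiced" and (b - a) > 2 * FRAMES_PER_SEC:
            continue
        duration = "short" if (b - a) < 8 else "long"   # quantized duration
        tokens.append(f"{direction}-{duration}")
    return tokens
```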

  12. Post-Evaluation – Prosodic Cues System
      [Results plot, 30 s condition: EER30s = 22.2%, a 30% relative improvement]

  13. Fusion – 30 sec Condition
      • Fusing the scores from the prosodic cues system (a score-fusion sketch follows this slide)
        • with TRAP-derived phonemes: EER30s = 10.5% (17% relative improvement)
        • with OGI-LID-derived phonemes: EER30s = 6.6% (14% relative improvement)
      • TRAP-derived phoneme system fused with OGI-LID: EER30s = 6.2% (19% relative improvement)
      • Fusing all three systems: EER30s = 5.7% (26% relative improvement)
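The slide does not specify the fusion back-end; a common choice, sketched here, is a weighted sum of z-normalized per-system scores, with hypothetical weights:

```python
import numpy as np

def fuse_scores(scores, weights):
    """scores: (n_systems, n_trials) matrix of per-system detection scores.
    Z-normalize each system's scores, then take a weighted sum."""
    z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)
    return np.asarray(weights) @ z

# e.g. fusing prosodic, TRAP-phoneme, and OGI-LID phoneme system scores
# (the weights are hypothetical placeholders, not the evaluated system's):
# fused = fuse_scores(np.vstack([s_prosodic, s_trap, s_ogilid]), [0.2, 0.4, 0.4])
```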

  14. Conclusions
      • Sequences of discrete symbols derived from speech dynamics provide useful information for characterizing the language
      • Two techniques for deriving the sequences of symbols were investigated:
        • segmentation and labeling based on prosodic cues
        • segmentation and labeling based on TRAP-derived phonetic labels
      • The introduced techniques combine well with each other as well as with more conventional language ID techniques
