
Informative Dialect Identification

Informative Dialect Identification. Nancy Chen Oct. 31, 2008. Dialects, Accents, and Languages. Language Recognizer or L1 detector?. Language Recognizer. Indian English. Hindi. Automatic Speech Recognizers. I only understand English. You are speaking a foreign language. Indian English.


Presentation Transcript


  1. Informative Dialect Identification Nancy Chen Oct. 31, 2008

  2. Dialects, Accents, and Languages

  3. Language Recognizer or L1 detector? Language Recognizer Indian English Hindi

  4. Automatic Speech Recognizers I only understand English. You are speaking a foreign language. Indian English

  5. Traditional Automatic Recognizers speech 18, 53, … • Big black box • Input features not intuitive • Not F0, F1, F2 • Thousands of Gaussians, each with 40+ dimensions • Efficiently process lots of data • Hard to interpret models and results • Training data ~ 100+ hrs Traditional automatic recognizers

  6. Linguistic Studies Spread the peanut butter • Few speakers • 20-30 at most • Perceptual analysis takes much time and effort • Phonological rules Linguistic studies

  7. American English Speaker Spread the peanut butter • Voiceless stop consonants are unaspirated when preceded by fricatives • “p” in spread sounds more like “b” • Intervocalic /t/ flapped when followed by unstressed syllable • “t” in butter does not produce intra-oral pressure Linguistic studies

  8. Indian English Speaker I can’t spread the peanut butter with Harr • Voiceless stop consonants are always unaspirated • /p/, /t/, /k/ sound like /b/, /d/, /g/ • Inter-dental fricatives become stop-like • “the” sounds like “de” • Alveolar consonants /t/, /d/, /n/ are retroflex • /w/ /v/ • British English influence • Rhoticity gone when “vowel + /r/” • /ae/ → /a/; e.g., bath, can’t Linguistic studies

  9. Goal Spread the peanut butter speech 18, 53, … Traditional automatic recognizers Linguistic studies Informative dialect identification …… [t][er][dx][er] …… speech American English

  10. Potential Applications • Forensic phonetics • Speaker recognition and characterization • Automated speech recognition and synthesis • Accent training education • Articulatory and phonological disorder diagnosis

  11. Challenges • Automatic phone recognition limitations • State-of-the-art “phone recognition” accuracy is only 50-60% • Commercial speech recognizers rely heavily on grammar and social context • They inadequately capture dialect differences • e.g. retroflex [t] recognized as typical [t], [r], [ax], … • Sub-dialects within Indian English

  12. Related Research • Automatic speech recognition for non-native speech (Fung 2005, Livescu 2000) • Accent classification (Angkititrakul, Hansen 2006) • Language identification (Li, Ma, Lee 2007)

  13. Techniques • Acoustic modeling (e.g., Torres-Carrasquillo et al. 2004) • Gaussian mixture models, hidden Markov models • N-grams of phonetic units (e.g., Zissman 1995) • Models the “grammar” of phones • PRLM (Phone Recognition followed by Language Modeling) • Our approach: acoustic modeling of dialect-discriminating phonetic contexts
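The PRLM idea listed above (modeling the "grammar" of phones) can be illustrated with a toy phone-bigram scorer. This is only a sketch under stated assumptions: the add-one smoothing, the function names `train_bigram` and `score`, and the toy phone strings are hypothetical and do not come from the slides or from Zissman (1995).

```python
import math
from collections import Counter

def train_bigram(phone_seqs, vocab):
    """Return a log-probability function for an add-one smoothed phone bigram model."""
    bigrams, unigrams = Counter(), Counter()
    for seq in phone_seqs:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    v = len(vocab)
    # log P(b | a) with add-one smoothing (an assumption, chosen for simplicity)
    return lambda a, b: math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))

def score(seq, logp):
    """Log likelihood of a decoded phone sequence under one dialect's bigram model."""
    return sum(logp(a, b) for a, b in zip(seq, seq[1:]))

# Toy, made-up phone strings for "butter" in each dialect.
vocab = {"b", "ah", "dx", "t", "er"}
american = train_bigram([["b", "ah", "dx", "er"]] * 5, vocab)
indian = train_bigram([["b", "ah", "t", "er"]] * 5, vocab)

test_seq = ["b", "ah", "dx", "er"]
american_wins = score(test_seq, american) > score(test_seq, indian)
```

In a real PRLM system the phone sequences would come from a phone recognizer rather than being written by hand.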

  14. Terminology & Notation • Monophone • e.g. [t], [a] • Biphone: a monophone in the context of other phones • Phonetic notation: • e.g. [k-r] is an [r] preceded by [k] • e.g. [t+a] is [t] followed by [a] • Mathematical notation: biphone variable b is phone α followed by phone β; α, β ∈ {monophone set} • Only consider two dialects d = {d1, d2} • d1: American English • d2: Indian English
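The bracket notation above can be made concrete with a small parser. This is a minimal sketch; the `Biphone` type and `parse_biphone` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Optional

class Biphone(NamedTuple):
    phone: str                    # the phone being modeled
    left: Optional[str] = None    # preceding phone, if the context is on the left
    right: Optional[str] = None   # following phone, if the context is on the right

def parse_biphone(label: str) -> Biphone:
    """Parse '[k-r]' (an [r] preceded by [k]) or '[t+a]' ([t] followed by [a])."""
    body = label.strip().strip("[]")
    if "-" in body:
        left, phone = body.split("-", 1)
        return Biphone(phone=phone, left=left)
    if "+" in body:
        phone, right = body.split("+", 1)
        return Biphone(phone=phone, right=right)
    return Biphone(phone=body)  # a plain monophone such as '[t]'
```

For example, `parse_biphone("[dx+er]")` yields the flap `dx` with right context `er`.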

  15. Finding Dialect-Specific Phonological Rules • Supervised Learning • If phone transcriptions are available • Unsupervised Learning • If no phone transcriptions are available

  16. Supervised Classification • Extract phonological rules • Adapt biphone models • Dialect recognition task via likelihood ratio test

  17. Supervised Rule Extraction: Example 1 Indian English: Phone recognizer → wine; American English: Phone recognizer → vine • Recognition accuracy of the recognizer-hypothesized [v] is 0% for Indian English, but 100% for American English. • Recognition accuracy of [v] differs across dialects

  18. Supervised Rule Extraction: Example 2 Indian English: Phone recognizer → pat; American English: Phone recognizer → bat • Recognition accuracy of the recognizer-hypothesized [b] is 0% for Indian English, but 100% for American English. • Recognition accuracy of [b] differs across dialects

  19. Supervised Rule Extraction: Example 3 Indian English: Phone recognizer → beats; American English: Phone recognizer → butter • Recognition accuracy of the recognizer-hypothesized [dx+er] is 0% for Indian English, but 100% for American English. • Recognition accuracy of [dx+er] differs across dialects

  20. Rule Extraction Criteria • Biphone b is dialect-discriminating for dialects d1 and d2 if • The recognition accuracy of biphone b in dialect d1 is different from that in dialect d2 • The occurrence frequency of biphone b is sufficient
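The two criteria above can be sketched as a simple filter over (hypothesized, reference) biphone pairs. The slide's actual equations are not preserved in the transcript, so the thresholds `min_acc_gap` and `min_count`, the function names, and the toy data below are all assumptions made for illustration.

```python
from collections import Counter

def phone_accuracy(pairs):
    """pairs: iterable of (hypothesized_biphone, reference_biphone).
    Returns {biphone: (accuracy, count)}, keyed by the hypothesized biphone."""
    total, correct = Counter(), Counter()
    for hyp, ref in pairs:
        total[hyp] += 1
        if hyp == ref:
            correct[hyp] += 1
    return {b: (correct[b] / total[b], total[b]) for b in total}

def discriminating_biphones(pairs_d1, pairs_d2, min_acc_gap=0.3, min_count=2):
    """Keep biphones whose accuracy differs across dialects and that occur often enough.
    Both thresholds are hypothetical placeholders, not values from the slides."""
    acc1, acc2 = phone_accuracy(pairs_d1), phone_accuracy(pairs_d2)
    selected = []
    for b in sorted(acc1.keys() & acc2.keys()):
        (a1, n1), (a2, n2) = acc1[b], acc2[b]
        if abs(a1 - a2) >= min_acc_gap and min(n1, n2) >= min_count:
            selected.append(b)
    return selected

# Toy data: the hypothesized flap [dx+er] is always right for American English
# and always wrong for Indian English, while [t+a] is right in both.
american_pairs = [("dx+er", "dx+er")] * 3 + [("t+a", "t+a")] * 3
indian_pairs = [("dx+er", "t+er")] * 3 + [("t+a", "t+a")] * 3
rules = discriminating_biphones(american_pairs, indian_pairs)
```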

  21. Adapt Biphone Models Dialect-neutral monophone model → (adapt) → American-English-specific monophone model

  22. Adapt Biphone Models Dialect-neutral monophone model → (adapt) → American-English-specific monophone model → (adapt) → American-English-specific biphone model

  23. Dialect Recognition: likelihood scores Test utterance → American-English biphone models → log likelihood; test utterance → Indian-English biphone models → log likelihood

  24. Dialect Recognition: likelihood ratio test Test utterance → American-English and Indian-English biphone models → two log likelihoods → log likelihood ratio test

  25. Dialect Recognition: decision making Test utterance → American-English and Indian-English biphone models → two log likelihoods → log likelihood ratio test → detection error analysis → threshold determination → dialect decision
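The recognition pipeline in the three slides above reduces to scoring a test utterance under both dialect model sets and thresholding the log likelihood ratio. The sketch below stands in a single one-dimensional Gaussian for each dialect's biphone models, which is a drastic simplification of the HMM/GMM models the talk describes; all names and parameter values are hypothetical.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log density of a 1-D Gaussian, used here as a stand-in dialect model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def utterance_llr(frames, american, indian):
    """Average per-frame log likelihood ratio: American model minus Indian model.
    Each model is a (mean, variance) pair."""
    ll_american = sum(gaussian_loglik(x, *american) for x in frames)
    ll_indian = sum(gaussian_loglik(x, *indian) for x in frames)
    return (ll_american - ll_indian) / len(frames)

def decide(llr, threshold=0.0):
    """The threshold would normally be tuned via detection error analysis."""
    return "American English" if llr > threshold else "Indian English"

american_model = (0.0, 1.0)   # made-up parameters
indian_model = (3.0, 1.0)     # made-up parameters
decision_a = decide(utterance_llr([0.1, -0.2, 0.3], american_model, indian_model))
decision_b = decide(utterance_llr([3.1, 2.9], american_model, indian_model))
```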

  26. Unsupervised Classification • Unsupervised rule extraction • Adapt all biphone models • Prune out non-dialect-specific biphone models • Dialect recognition via likelihood ratio test

  27. Retaining Biphone Models: Example Dialect-neutral monophone model American English

  28. Retaining Biphone Models: Example Dialect-neutral monophone model American Biphone Model American English

  29. Retaining Biphone Models: Example Dialect-neutral monophone model American Biphone Model American English Indian Biphone Model

  30. Retaining Biphone Models: Example Dialect-neutral monophone model American Biphone Model American English Indian Biphone Model

  31. Retaining Biphone Models: Example Dialect-neutral monophone model American Biphone Model American English Indian Biphone Model

  32. Retaining Biphone Models: Example Dialect-neutral monophone model American Biphone Model American English Indian Biphone Model The larger the log likelihood ratio of biphone [dx+er], the more specific [dx+er] is to American English

  33. Quantifying Dialect Discriminability American English data → American biphone models → log likelihood; American English data → Indian biphone models → log likelihood; both → log likelihood ratio

  34. Quantifying Dialect Discriminability American English data → American biphone models → log likelihood; American English data → Indian biphone models → log likelihood; both → log likelihood ratio

  35. Quantifying Dialect Discriminability American English data → American and Indian biphone models → two log likelihoods → log likelihood ratio; Indian English data → American and Indian biphone models → two log likelihoods → log likelihood ratio

  36. Quantifying Dialect Discriminability American English data → American and Indian biphone models → two log likelihoods → log likelihood ratio; Indian English data → American and Indian biphone models → two log likelihoods → log likelihood ratio

  37. Quantifying Dialect Discriminability American English data → American and Indian biphone models → two log likelihoods → log likelihood ratio; Indian English data → American and Indian biphone models → two log likelihoods → log likelihood ratio

  38. Quantifying Dialect Discriminability American English data → American and Indian biphone models → two log likelihoods → log likelihood ratio; Indian English data → American and Indian biphone models → two log likelihoods → log likelihood ratio; Keep [retention criterion equations not preserved in the transcript]
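One way to turn per-biphone log likelihood ratios into a retain/prune decision is sketched below. The retention rule on the slide is not preserved in the transcript, so the score used here (mean LLR on American English data minus mean LLR on Indian English data) and the `keep_fraction` parameter are assumptions, not the author's criterion. The LLR convention assumed is American log likelihood minus Indian log likelihood.

```python
def discriminability(llrs_on_american, llrs_on_indian):
    """Hypothetical score: how far apart a biphone's LLRs are on the two dialects."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(llrs_on_american) - mean(llrs_on_indian)

def retain_biphones(dev_llrs, keep_fraction=0.75):
    """dev_llrs: {biphone: (llrs_on_american_data, llrs_on_indian_data)},
    measured on a development set. Returns the biphones to keep, best first."""
    scores = {b: discriminability(am, ind) for b, (am, ind) in dev_llrs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[:n_keep]

# Made-up development-set LLRs for four biphones.
dev_llrs = {
    "dx+er": ([2.0, 3.0], [-2.0, -1.0]),
    "ae+s": ([1.0], [-1.0]),
    "r+s": ([0.5], [-0.5]),
    "zh+ax": ([0.1], [0.1]),
}
kept = retain_biphones(dev_llrs)
```

With `keep_fraction=0.75` the least discriminating quarter of the biphone models is dropped, mirroring the 25% reduction reported later in the talk.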

  39. Experimental Setup • Training set: • 104 hrs of dialect-marked data without transcriptions • Test set: • 1298 American English trials • 200 Indian English trials • Each trial is 30 seconds • Dialect-neutral monophone HMM models: • trained on 23 hrs of transcribed data • 47 English monophones

  40. Pilot Study: Dialect-Specific [r] Biphones • Recognizer-decoded [r] instances were manually labeled in both dialects

  41. Detection Error Trade-off Curve EER = Equal Error Rate
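The equal error rate is the operating point on the detection error trade-off curve where the miss rate equals the false alarm rate. A minimal sketch of estimating it from trial scores follows; the threshold sweep over observed scores and the function name are illustrative choices, and the scores are made up.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds over the observed scores and return (eer, threshold)
    at the point where miss rate and false alarm rate are closest."""
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted as the target dialect when its score >= thr.
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            best = (gap, (miss + false_alarm) / 2, thr)
    return best[1], best[2]

# Made-up log likelihood ratio scores for target and non-target trials.
targets = [2.0, 1.5, 0.2, 1.0]
nontargets = [-1.0, 0.5, -0.5, -2.0]
eer, thr = equal_error_rate(targets, nontargets)
```

On finite data the two error rates rarely cross exactly, so this sketch reports their average at the closest point.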

  42. Discussion • [r]-biphones perform at least as well as monophones • [r]-biphones perform better when false alarms are penalized more • [r]-biphones not necessarily interpretable • Phone recognition errors • Rules only learned from minimal transcriptions (~ 1 min of speech) • Sub-dialect issues with Indian English. Rules derived from speakers with Hindi as first language, but distribution of first language of speakers in test data is unknown. → Study more data with unsupervised algorithm

  43. Unsupervised Learning Experiment • A development set (instead of the test set) was used to determine the biphone models to retain • The proposed filtered-biphone system uses 25% fewer biphone models, while EER performance is still comparable to the baseline unfiltered-biphone system

  44. Equal Error Rate (EER) Results Biphone Models Fusion Experiments • Biphone systems are all superior to the baseline monophone system • Filtered-biphone system is comparable with unfiltered-biphone system, with or without fusion with PRLM • 29.3% relative gain obtained when proposed unfiltered-biphone system fuses with PRLM.
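For reference, a relative gain in EER is conventionally computed as below. The slide does not state which system serves as the reference, so the subscripts "ref" and "fused" are placeholders, and this is the usual definition rather than one confirmed by the talk.

```latex
\[
\text{relative gain} =
\frac{\mathrm{EER}_{\text{ref}} - \mathrm{EER}_{\text{fused}}}{\mathrm{EER}_{\text{ref}}}
\times 100\%
\]
```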

  45. Detection Error Trade-off

  46. Discussion of Learned Rules • Dialect-discriminating biphones • Flap biphones [dx+r], [dx+axr], [dx+er] • Biphones [ae+s], [ae+th] occurring in “class”, “bath” • Biphones learned in supervised method, e.g. [r+s] • Non-dialect-discriminating biphones • No-speech sounds (e.g., filled pauses, coughing) • /zh/ biphones

  47. What if more biphones are pruned? [Plot: EER of test set (%) versus amount of pruned biphone models (%), as determined by the development set]

  48. Contributions • We present systematic approaches to discovering dialect-discriminating biphones, with and without using phone transcriptions • The proposed filtered-biphone system achieves comparable performance to a baseline unfiltered-biphone system despite using 25% fewer biphone models • Our approach complements other systems. When the filtered-biphone system is fused with a PRLM system, we obtain 29% relative gains • This is a first step towards a linguistically-informative dialect recognition system

  49. Future Work • Investigate corpora with transcriptions to enhance interpretability of phonological rules • Model dialect-specific biphones in other dialects to ensure approach is language/dialect independent • Incorporate more sophisticated techniques to enhance recognition performance • Potential clinical applications: diagnosing articulatory and phonological disorders

  50. Additional Slides
