
Application of Shifted Delta Cepstral Features for GMM Language Identification

Masters Thesis Defense. Jonathan J. Lareau, Rochester Institute of Technology, Department of Computer Science. Tuesday, Oct. 24, 2006, 12:00 pm. Full report can be found at: www.jonlareau.com/JonathanLareau.pdf.


Presentation Transcript


  1. Application of Shifted Delta Cepstral Features for GMM Language Identification. Masters Thesis Defense. Jonathan J. Lareau, Rochester Institute of Technology, Department of Computer Science. Tuesday, Oct. 24, 2006, 12:00 pm. Full report can be found at: www.jonlareau.com/JonathanLareau.pdf

  2. What’s the use of LID? • Pre-Processor for Automatic Speech Recognition Algorithms • Routing speech signals for: • Telecommunications • Multimedia • Human-Computer interfaces • Security applications.

  3. Overview • Study the use of different types of Shifted Delta (SD) feature vectors for telephone speech language identification (LID). • 6 types of feature vectors • Uniform GMM pattern recognition algorithm.

  4. Additionally… • Heuristic speech enhancement pre-processor • No phonemically labeled training data • Original code written in MATLAB 7 • Also Uses: • NETLAB [Nabney, 2002] • RASTA-MAT [Ellis, 2005]

  5. Methods for LID • Phonemic Recognition followed by Language Modeling (PRLM) [Zissman, 1993] • Predominant method for ASR/LID • Works well, but laborious and time-consuming • Difficult to extend

  6. Why use Gaussian Mixture Models? • Alternative to PRLM methods • Avoids the laborious phonemic labeling required by PRLM techniques • Comparatively easy to extend

  7. Method - Feature Vectors • Pre-Processing • Pre-Emphasis • Cepstral Speech Enhancement • Base Feature Extraction • Post-Processing • Cepstral Mean Subtraction • Shifted Delta Operation • Silence Removal

  8. Pre-Emphasis Filtering (Pre-Processor) • Speech has a natural spectral attenuation of approximately 20 dB/decade [Picone, 1993] • A pre-emphasis filter, H_pre(z), flattens the speech spectrum: H_pre(z) = 1 + a_pre·z^(-1)
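The filter above is a single first-order FIR difference. A minimal Python sketch (the thesis code was written in MATLAB 7; this port, the function name, and the coefficient value -0.97 are illustrative conventions, not taken from the slides):

```python
def pre_emphasis(signal, a_pre=-0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] + a_pre * x[n-1].

    With a_pre close to -1 this boosts high frequencies, offsetting
    the roughly 20 dB/decade natural attenuation of speech.
    """
    out = [signal[0]]  # first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] + a_pre * signal[n - 1])
    return out
```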

  9. Pre-Emphasis Filtering

  10. Heuristic Speech Enhancement (Pre-Processor)

  11. Base Feature Vector Extraction • All based on Cepstral Coefficients • Linear Predictive (LP-CC) • All-pole filter models formant envelope • Mel-Frequency (MF-CC) • Psycho-Acoustic frequency scaling • Mimic Response of human ear • Perceptual Linear Prediction (PLP-CC) • Psycho-Acoustic scaling followed by Linear Prediction

  12. Cepstral Mean Subtraction (Post-Processor) • A simplistic method to reduce channel effects. • Once Cepstral feature vectors are calculated using MF, LP, or PLP: • mean feature vector for the entire utterance is subtracted off
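The subtraction step above can be sketched in a few lines of Python (illustrative port; the thesis implementation was MATLAB):

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean from each cepstral frame.

    `frames` is a list of equal-length feature vectors (one per
    analysis frame).  The returned frames have zero mean in every
    coefficient, which removes a stationary channel offset.
    """
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]
```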

  13. Shifted Delta Cepstra • Pseudo-prosodic feature vectors from acoustic feature vectors • Quick approximation to true prosodic modeling • ‘Stacks’ blocks of evenly spaced and differenced feature vectors.
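The stack-of-differences operation described above is commonly parameterized as N-d-P-k (N base coefficients, delta spacing d, block shift P, k stacked blocks). A minimal Python sketch (function name and default parameter values are illustrative, not the thesis settings):

```python
def shifted_delta(cepstra, d=1, P=3, k=3):
    """Shifted Delta Cepstra: for each frame t, stack the k delta
    vectors c[t + i*P + d] - c[t + i*P - d], i = 0..k-1.

    Frames too close to either edge of the utterance are dropped.
    """
    out = []
    T = len(cepstra)
    for t in range(d, T - (k - 1) * P - d):
        stacked = []
        for i in range(k):
            plus = cepstra[t + i * P + d]
            minus = cepstra[t + i * P - d]
            stacked.extend(a - b for a, b in zip(plus, minus))
        out.append(stacked)
    return out
```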

  14. Shifted Delta Calculation

  15. Silence Removal

  16. Method - Classification Task • Each language uses a different distribution over the feature space. • Difficult because: • Shape alone doesn’t distinguish between languages • Density information along feature space surface needs to be included

  17. No Simple Decision Boundary …

  18. …so we use Gaussian Mixture Models
The multivariate normal (M-V-N) PDF is:
p(x|j) = (2π)^(−D/2) |Σj|^(−1/2) exp( −½ (x − μj)ᵀ Σj⁻¹ (x − μj) )
A Gaussian Mixture Model (GMM) is then:
p(x) = Σ_{j=1..M} w(j) p(x|j)
With conditions: w(j) ≥ 0 and Σ_{j=1..M} w(j) = 1
• w(j) is the mixture weight (prior) for component j • M is the model order • p(x|j) is the M-V-N PDF for the jth component
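The mixture density above can be evaluated directly. A sketch for the diagonal-covariance case (illustrative Python, not the NETLAB implementation the thesis actually used):

```python
import math

def gmm_pdf(x, weights, means, variances):
    """Evaluate p(x) = sum_j w_j * N(x; mu_j, diag(var_j)) for a
    diagonal-covariance GMM.

    Weights must be non-negative and sum to one; x, each mean, and
    each variance vector share the same dimension D.
    """
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        # log of the normalizing constant (2*pi)^(-D/2) |Sigma|^(-1/2)
        log_norm = -0.5 * sum(math.log(2 * math.pi * v) for v in var)
        # log of the exponent -1/2 (x-mu)^T Sigma^{-1} (x-mu)
        log_exp = -0.5 * sum((xi - m) ** 2 / v
                             for xi, m, v in zip(x, mu, var))
        total += w * math.exp(log_norm + log_exp)
    return total
```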

  19. GMM Density Estimation

  20. Gaussian Mixture Models model distributions as well as shape

  21. Software Architecture - Training

  22. Software Architecture - Testing

  23. Results • OGI Multilanguage Telephone Speech Database [Muthusamy et al., 1992] • Mutually exclusive training and test sets • SD-LP-CC performed best on the 3-Language and 10-Language tasks • SD coefficients increased accuracy by ~10% over standard LP and MF feature vectors • Results were consistent and repeatable

  24. Speech Enhancement Results

  25. 3-Language Task (accuracy by feature type)
Pre/Post Disabled vs. Pre/Post Enabled:
LP-CC 59.40% | SD-LP-CC 71.95% | MF-CC 54.86% | SD-MF-CC 67.92% | PLP-CC 61.51% | SD-PLP-CC 63.31%
LP-CC 47.15% | SD-LP-CC 65.49% | MF-CC 52.50% | SD-MF-CC 63.26% | PLP-CC 47.64% | SD-PLP-CC 61.20%

  26. 10-Language Task SD-LP-CC

  27. Consistency of Results

  28. Accuracy vs. Amount of Training Data (approximate trend line shown) • Outliers are due to high mixture order, a low amount of training data, and the stochastic nature of data selection and the NETLAB training algorithm • The cutoff point for the amount of training data needed to assure accurate results increases with the mixture order

  29. Comparisons • Results agree with previous work on SDC • Reported accuracies between 70%–75% [Deller et al., 2002] [Kohler, 2002] [Reynolds, 2002] • This thesis specifically addresses the effects of different derivations of SDC on LID

  30. Comparisons, cont’d… • Also [Wang & Qu, 2003] • Used GMBM-UBBM • 70.128% accuracy • 128 Mixtures • Our algorithm - 71.13% with: • Reduced mixture order • Reduced training data • Without using bigram or universal background modeling

  31. Conclusions • SDC Features improved LID performance over standard features across all categories • SD-LP-CC perform the best overall in both 3 and 10 language tasks.

  32. Future Work • Gender Specific Modeling • Perform hill climbing • Add UBM • KL - Divergence

  33. Bibliography
• Daniel P. W. Ellis. PLP and RASTA (and MFCC, and inversion) in Matlab. http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2005.
• Ian T. Nabney. NETLAB: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer, 2002.
• H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738–1752, Apr. 1990.
• Y. K. Muthusamy, R. A. Cole, and B. T. Oshika. The OGI multi-language telephone speech corpus. Proc. 1992 International Conference on Spoken Language Processing (ICSLP 92), Alberta, 1992.
• Dan Qu and Bingxi Wang. Automatic language identification based on GMBM-UBBM. Proc. 2003 International Conference on Natural Language Processing and Knowledge Engineering, 26–29 Oct. 2003, pp. 722–727.
• J. R. Deller Jr., P. A. Torres-Carrasquillo, and D. A. Reynolds. Language identification using Gaussian mixture model tokenization. Proc. International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, IEEE, 2002, pp. 757–760.
• M. K. Kohler and M. A. Kennedy. Language identification using shifted delta cepstra. Proc. 45th Midwest Symposium on Circuits and Systems (MWSCAS-2002), 2002, vol. 3, pp. 69–72.
• D. A. Reynolds, M. A. Kohler, R. J. Greene, J. R. Deller Jr., E. Singer, and P. A. Torres-Carrasquillo. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. Proc. International Conference on Spoken Language Processing, Denver, CO, ISCA, pp. 33–36, 82–92, Sept. 2002.
• M. Zissman. Automatic language identification using Gaussian mixture and hidden Markov models. Proc. ICASSP, 1993.
• J. Picone. Signal modeling techniques in speech recognition. Proc. IEEE, vol. 81, pp. 1215–1247, Sept. 1993.

  34. End of Presentation.

  35. Supplemental Slides • Speech Production • Cepstral Coefficients • Feature Calculation • Linear Prediction • Psycho-Acoustic Scaling (Mel-Frequency) • Perceptual Linear Prediction • 3-Language Confusion Matrices

  36. Source-Filter Model http://www.spectrum.uni-bielefeld.de/~thies/HTHS_WiSe2005-06/source-filter.jpg

  37. Cepstral Coefficients • A compact way of representing the formant envelope of a speech signal • Formant envelope information is encoded in the first few Cepstral coefficients • We use the first 12 coefficients but omit the very first (c0, the overall-energy term)

  38. Linear Prediction • Models Formant Envelope as all pole filter. Where S(z) is the Speech waveform, E(z) is the excitation signal and A(z) is the all-pole filter:

  39. Linear Prediction • Linear Predictive coefficients can be found by solving the normal equations, a = R⁻¹ r∗, where the matrix R is built (Toeplitz) from the auto-correlation vector [R1, R2, . . . , RP+1], a = [a1, a2, . . . , aP+1] is the Linear Predictive coefficient vector, P denotes the model order, [· · ·]⁻¹ denotes the matrix inverse, and ∗ denotes the complex conjugate operation.
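For real-valued speech the normal equations above are usually solved with the Levinson-Durbin recursion rather than an explicit matrix inverse. A sketch for the real-valued case (conjugates drop out; function name is illustrative):

```python
def levinson_durbin(r, order):
    """Solve the LP normal equations via Levinson-Durbin recursion.

    `r` is the autocorrelation sequence [R0, R1, ..., R_order].
    Returns coefficients a[1..order] of the predictor
    s[n] ~ sum_k a[k] * s[n-k].
    """
    a = [0.0] * (order + 1)
    e = r[0]  # prediction error energy
    for i in range(1, order + 1):
        # reflection coefficient for stage i
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)
    return a[1:]
```

For an AR(1) process s[n] = 0.5·s[n-1] + e[n], the autocorrelation decays as 0.5^k, and the recursion recovers the single coefficient 0.5 with the higher-order terms near zero.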

  40. Linear Prediction

  41. Calculation of LP-CC’s • Create an N-point frequency spectrum by evaluating the all-pole model 1/A(z) on the unit circle, z = e^(jω) • Then find the Cepstral Coefficients from the inverse transform of the log of that spectrum

  42. LP - CC’s

  43. Mel Frequency Scale
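The Mel scale is commonly computed with the mapping mel = 2595·log10(1 + f/700). A small sketch (this formula is the widespread convention; the thesis may use a variant via RASTA-MAT):

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the (approximately perceptual) Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The scale is roughly linear below 1 kHz and logarithmic above, which is what lets Mel-spaced filters mimic the frequency resolution of the human ear.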

  44. Calculation of MF-CC’s • Filter bank of Mel-scaled filters • Sum the energy in each channel • Power spectrum of input signal: P(k) = |X(k)|² • Energy in channel m: E(m) = Σk Hm(k) P(k), where Hm(k) is the mth Mel-scaled filter

  45. Mel-Frequency Scaling

  46. Cont’d • Then take the inverse Discrete Cosine Transform of the log10 of the channel energies to obtain the Mel-Frequency Cepstral Coefficients
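That final step, log10 followed by a cosine transform across channels, can be sketched as follows (illustrative: the unnormalized DCT-II form and the n_ceps default are assumptions, not thesis settings):

```python
import math

def mfcc_from_energies(channel_energies, n_ceps=12):
    """Cosine transform of the log10 Mel channel energies.

    Returns the first n_ceps cepstral coefficients
    c[k] = sum_m log10(E[m]) * cos(pi * k * (m + 0.5) / M).
    """
    log_e = [math.log10(e) for e in channel_energies]
    M = len(log_e)
    return [sum(log_e[m] * math.cos(math.pi * k * (m + 0.5) / M)
                for m in range(M))
            for k in range(n_ceps)]
```

A quick sanity check: equal energy in every channel puts everything into c[0] and leaves the higher coefficients at zero, since the cosine basis vectors for k > 0 sum to zero over a constant input.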

  47. MF-CC’s

  48. Perceptual Linear Prediction • First use perceptual scaling, such as Mel, on the spectrum • Then use Linear Prediction to derive the Cepstral Coefficients • In studies by [Hermansky, 1990], PLP was shown to reduce speaker dependence

  49. PLP Scaling

  50. PLP-CC’s
