
18.0 Some Recent Developments in NTU

18.0 Some Recent Developments in NTU. Reference: 1. “Segmental Eigenvoice with Delicate Eigenspace for Improved Speaker Adaptation”, IEEE Transactions on Speech and Audio Processing, Vol.13, No.3, May 2005, pp.399-411.


Presentation Transcript


  1. 18.0 Some Recent Developments in NTU
     References:
     1. “Segmental Eigenvoice with Delicate Eigenspace for Improved Speaker Adaptation”, IEEE Transactions on Speech and Audio Processing, Vol.13, No.3, May 2005, pp.399-411.
     2. “Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada, May 2004, pp.197-200.
     3. “Extension and Further Analysis of Higher Order Cepstral Moment Normalization (HOCMN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
     4. “Powered Cepstral Normalization (P-CN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
     5. “Improved Spontaneous Mandarin Speech Recognition by Disfluency Interruption Point (IP) Detection Using Prosodic Features”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp.1621-1624.
     6. “Prosodic Modeling in Large Vocabulary Mandarin Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
     7. “Latent Prosodic Modeling (LPM) for Speech with Applications in Recognizing Spontaneous Mandarin Speech with Disfluencies”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.

  2. 18.0 Some Recent Developments in NTU
     References (continued):
     8. “Entropy-based Feature Parameter Weighting for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
     9. “A New Framework for System Combination Based on Integrated Hypothesis Space”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
     10. “Improved Spoken Document Summarization Using Probabilistic Latent Semantic Analysis (PLSA)”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
     11. “Analytical Comparison between Position Specific Posterior Lattices and Confusion Networks Based on Words and Subword Units for Spoken Document Indexing”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.
     12. “A Multi-Modal Dialogue System for Information Navigation and Retrieval across Spoken Document Archives with Topic Hierarchies”, Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, San Juan, Nov-Dec 2005.
     13. “Efficient Interactive Retrieval of Spoken Documents with Key Terms Ranked by Reinforcement Learning”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
     14. “Type-Ⅱ Dialogue Systems for Information Access from Unstructured Knowledge Sources”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.

  3. 18.0 Some Recent Developments in NTU
     References (continued):
     15. “Histogram-Based Quantization (HQ) for Robust and Scalable Distributed Speech Recognition”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp.957-960.
     16. “Joint Uncertainty Decoding (JUD) with Histogram-Based Quantization (HQ) for Robust and/or Distributed Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.

  4. Role of Spoken Language Processing under Network Environment
     Diagram labels: Internet; User Interface; Content Analysis; User-Content Interaction
     • User Interface: when keyboards/mice are inadequate
     • Content Analysis: helps in browsing/retrieval of multimedia content
     • User-Content Interaction: all text-based interaction can be accomplished by spoken language

  5. Hierarchy of Research Areas (diagram)
     Layer labels: Applications; Applied Technologies; Integrated Technologies; Basic Technologies
     Areas, in the order they appear in the diagram:
     • Spoken Document Understanding and Organization; Multimedia Technologies; Speech-based Information Retrieval; Spoken Dialogue; Dictation & Transcription; Distributed Speech Recognition and Wireless Environment; Multilingual Speech Processing
     • Speech Recognition Core; Text-to-speech Synthesis; Speech/Language Understanding; Linguistic Processing & Language Modeling; Decoding & Search Algorithms; Acoustic Processing: features, modeling, etc.; Wireless Transmission & Network Environment; Information Indexing & Retrieval
     • Robustness: noise/channel, feature/model; Keyword Spotting; Hands-free Interaction: acoustic reception, microphone array, etc.; Speaker Adaptation & Recognition; Spontaneous Speech Processing: pronunciation modeling, disfluencies, etc.; Prosodic Modeling
     Numeric labels in the diagram: 13, 12, 14, 4, 15; 11, 10, 2, 3, 1; 9, 8, 7, 5, 6

  6. Segmental Eigenvoice: decompose the supervectors into sub-supervectors, from which sub-eigenspaces can be constructed, so that better performance can be obtained with more adaptation data

  7. Segmental Eigenvoice (1/3)

  8. Segmental Eigenvoice (2/3)

  9. Segmental Eigenvoice (3/3)

  10. Hierarchy of Research Areas (same diagram as slide 5)

  11. Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition: to reduce the mismatch between the statistical characteristics of training and testing corpora by normalizing the cepstral moments

  12. Cepstral Moment Normalization
     • Moment Estimation
       - Time average: N-th moment of MFCC parameters about the origin
     • Cepstral Normalization
       - For odd order L (example: CMS for L=1)
       - For even order N (example: CMVN for L=1 and N=2)
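As a minimal sketch of the two examples named on this slide, the code below estimates moments as time averages over one cepstral coefficient track and applies CMS (first moment forced to zero) and CMVN (first moment zero, second moment one). The function names and the toy track `c1` are illustrative assumptions, not taken from the slides, and the general HOCMN procedure for higher orders is not reproduced here.

```python
import math

def moment_about_origin(track, order):
    """Time average of x(t)**order over all frames of one cepstral coefficient."""
    return sum(x ** order for x in track) / len(track)

def cms(track):
    """Cepstral mean subtraction: force the first moment to zero."""
    mean = moment_about_origin(track, 1)
    return [x - mean for x in track]

def cmvn(track):
    """Mean and variance normalization: first moment 0, second moment 1."""
    centered = cms(track)
    std = math.sqrt(moment_about_origin(centered, 2))
    if std == 0.0:
        return centered  # constant track: nothing to scale
    return [x / std for x in centered]

# Hypothetical values of one coefficient (e.g. C1) over four frames.
c1 = [1.0, 2.0, 3.0, 6.0]
normalized = cmvn(c1)
print(moment_about_origin(normalized, 1), moment_about_origin(normalized, 2))
```

In practice the normalization would be applied to every cepstral dimension separately, usually per utterance.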

  13. Higher Order Cepstral Moment Normalization (HOCMN)
     • Aurora 2, clean-condition training; word accuracy averaged over 0~20 dB and all types of noise (sets A, B, C)
     • Curves shown in the chart: CTN=HOCMN[1,3,2]; CTN=HOCMN[1,2,3]; CMVN (l=86); CN (l=86); CMVN; CN

  14. Skewness and Kurtosis (1)
     • Skewness
       - Third moment about the mean and normalized to the standard deviation
       - Departure of pdf from symmetry
       - Positive/negative indicates skew to right/left
       - Zero indicates symmetric
     • Kurtosis
       - Fourth moment about the mean and normalized to the standard deviation
       - Peaked or “flat with tails of large size” as compared to standard Gaussian
       - “3” is the fourth moment of N(0,1)
       - Positive/negative indicates flatter/more peaked
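The two definitions above can be sketched directly from sample moments. This is an illustrative computation only; the function names and the sample values are assumptions, and the kurtosis is returned without subtracting the Gaussian reference value 3.

```python
def central_moment(values, order):
    """Sample moment of the given order about the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third central moment normalized by the standard deviation cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment normalized by the variance squared.

    A standard Gaussian gives 3; subtract 3 to compare against it.
    """
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 10.0]
print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```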

  15. Skewness and Kurtosis (2)
     • Define: Generalized Skewness of Odd Order L
       - L not necessarily 3
       - Similar meaning as skewness (skew to right or left) except in the sense of L-th moment
     • Define: Generalized Kurtosis of Even Order N
       - N not necessarily 4
       - Similar meaning as kurtosis (peaked or flat) except in the sense of N-th moment

  16. Skewness and Kurtosis (3)
     • Normalizing an odd order moment is to constrain the pdf to be symmetric about the origin, except in the sense of the L-th moment
     • Normalizing an even order moment is to constrain the pdf to be “equally flat with tails of equal size” as compared to a standard Gaussian, except in the sense of the N-th moment

  17. Generalized Moments with Non-integer Orders
     • The orders of the normalized moments are not necessarily integers
     • Generalized Moments
       - Type 1: reduces to an odd order moment when u is an odd integer L (example: L=1 or 3)
       - Type 2: reduces to an even order moment when u is an even integer N (example: N=2 or 4)
     • HOCMN with Non-integer Moment Orders

  18. PDF Analysis
     Figure panels: Original, HEQ, HOCMN (for C0 & C1)
     • HEQ
       - Overfitted to Gaussian
       - Original statistics lost
     • HOCMN
       - Fitting the generalized skewness and kurtosis of a few orders only
       - Retains more original characteristics

  19. Hierarchy of Research Areas (same diagram as slide 5)

  20. Use of Prosody in Recognition and Handling Disfluencies in Spontaneous Speech: prosody may be useful in recognition, and in particular in handling disfluencies in spontaneous speech

  21. Prosodic Features (І): Pitch-related Features
     • The average pitch value within the syllable
     • The maximum difference of pitch value within the syllable
     • The average of absolute values of pitch variations within the syllable
     • The magnitude of pitch reset for boundaries
     • The difference of such feature values of adjacent syllable boundaries (P1-P2, d1-d2, etc.)
     • A total of 54 pitch-related features were obtained
     Figure labels: P1, P2, d1, d2

  22. Prosodic Features (Ⅱ): Duration-related Features
     Figure labels: begin of utterance; end of utterance; syllable boundary; pause; A, B, C, D, E; a, b
     • Pause duration: b
     • Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2
     • Average syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C
     • Combination of pause & syllable features (ratio or product): C*b, D*b, C/b, D/b
     • Lengthening: C / ((A+B)/2)
     • Standard deviation of feature values
     • A total of 38 duration-related features were obtained
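The duration formulas listed on this slide can be evaluated as in the sketch below. It assumes that A to E are syllable durations around the boundary and that b is a non-zero pause duration, all in seconds; that reading of the figure labels, the function name, the dictionary keys, and the sample values are assumptions made for illustration.

```python
def duration_features(A, B, C, D, E, b):
    """Evaluate the duration-related formulas for one syllable boundary.

    Assumes A..E are syllable durations and b is a non-zero pause duration.
    """
    return {
        "pause_duration": b,
        "avg_syllable_duration": (B + C + D + E) / 4,
        "avg_syllable_duration_alt": ((D + E) / 2 + C) / 2,
        "avg_duration_ratio": (D + E) / (B + C),
        "avg_duration_ratio_alt": ((D + E) / 2) / C,
        "pause_times_C": C * b,
        "pause_times_D": D * b,
        "C_over_pause": C / b,
        "D_over_pause": D / b,
        "lengthening": C / ((A + B) / 2),
    }

# Hypothetical durations in seconds.
features = duration_features(A=0.2, B=0.2, C=0.3, D=0.25, E=0.25, b=0.1)
for name, value in features.items():
    print(name, round(value, 3))
```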

  23. Recognition Framework with Prosodic Modeling
     • Two-pass Recognition
     • Rescoring Formula: includes the prosodic model; λl, λp: weighting coefficients

  24. Prosodic Feature Extraction from Paths in the Word Graph
     • Define Tone variable LW (LW boundaries)
     • Directly take the LW boundaries as a prosodic cue
     • Lj: the length of the j-th word

  25. Prosodic Modeling
     • GMM
     • Decision Tree
     • Hybrid
     Diagram labels: fjk; GMM; Bayes’ rule; fjk’; classifier

  26. Examples of Disfluencies in Spontaneous Speech
     • Structure of a disfluency: reparandum, disfluency interruption point (IP, marked *), optional editing term, resumption
       Example: Do you import * uhn export products?
     • Overt Repair
       是(shi4) 進口(jin4kou3) 嗯(EN) 出口(chu1kou3) 嗎(ma1)
       is / import / [discourse particle] / export / [interrogative particle]
     • Abandoned Utterances
       它(ta1) 有(you3) 一個(yi2ge5) 呃(E) 那邊(ne4bian1) 有個(you3ge5) 度假村(du4jia4cun1) 嘛(MA)
       it / has / one / [discourse particle] / there / has a / resort / [discourse particle]
       It has a *eh there is a resort there.

  27. Spontaneous Speech Recognition with Disfluency Interruption Point (IP) Detection
     • Rescoring with IP information
       - word n-grams when crossing IP boundaries (c: IP class)
       - IP probability given by detection models
     • Recognition Results

  28. Hierarchy of Research Areas (same diagram as slide 5)

  29. Entropy-based Weighted Viterbi Decoding: the contribution of each feature parameter in Viterbi decoding is weighted by its entropy with respect to different phone classes

  30. Entropy-based Weighting: Basic Idea
     • If a feature parameter is discriminative, its entropy value is low
     • If it is not discriminative, its entropy value is high
     • Notation: t: frame index; x(t): feature vector; d: index of feature parameter in x(t); c: class index
     Figure: observation probability distributions of different classes

  31. Entropy Estimation by GMMs
     • GMMs for different classes c: “GMM c” is developed for the acoustic class “c” (c = 1, 2, …)
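To make the entropy idea concrete, the sketch below scores one feature parameter value against a model per acoustic class, turns the likelihoods into class posteriors, and takes the entropy of those posteriors. It is only an illustration under stated assumptions: a single 1-D Gaussian stands in for each "GMM c", class priors are taken as equal, and the mapping from entropy to a decoding weight is not shown because the slides do not give it.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian (stand-in for a class GMM in this sketch)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def class_entropy(x_d, class_models):
    """Entropy of the class posteriors for one feature parameter value x_d.

    class_models is a list of (mean, variance) pairs, one per class c.
    Equal class priors are assumed.
    """
    likelihoods = [gaussian_pdf(x_d, mean, var) for mean, var in class_models]
    total = sum(likelihoods)
    posteriors = [lik / total for lik in likelihoods]
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)

# Well separated classes: the parameter is discriminative, entropy is low.
separated = [(-5.0, 1.0), (5.0, 1.0)]
# Identical classes: the parameter carries no class information, entropy is high.
overlapping = [(0.0, 1.0), (0.0, 1.0)]
print(class_entropy(5.0, separated), class_entropy(0.0, overlapping))
```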

  32. Entropy-based Weighted Viterbi Decoding • Testing • Viterbi decoding

  33. Experimental Results • MFCC • Consistent improvements for all types of noise and SNR conditions • Similar Results for PLP and Other Features

  34. Hierarchy of Research Areas (same diagram as slide 5)

  35. System Combination by Integrated Hypothesis Space and Delicate Rescoring: properly integrating useful information from different approaches

  36. Conventional System Combination Approaches
     Diagram labels: Input Speech; Decoder 1 … Decoder N; N-Best; Inner Word graph; Confusion Network; Alignment Module; Voting Module; result
     Issues noted: 1. Alignment Algorithms 2. Distortion introduced

  37. Proposed Approach
     • Direct integration of individual hypothesis spaces
     • Produce an integrated hypothesis space with detailed time information
     • Perform delicate rescoring on the integrated hypothesis space
     Diagram labels: Input Speech; Decoder 1 … Decoder N; Integrated Hypothesis Space; Delicate Rescoring; result

  38. Hypothesis Space Integration
     • Merged Word Graph
     • If two word arcs from different systems are equal
       - Define: q1=q2 ≡ pw1=pw2, w1=w2, ts1=ts2, te1=te2
       - S(q=q1+q2)=combine(S(q1), S(q2)) if q1=q2
     • Others
       - S(q=qi)=S(qi)
     Figure: word graphs of System 1 and System 2 (arcs W1 to W10) and the merged word graph
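The merging rule above can be sketched with dictionaries keyed by arc identity. This is a minimal illustration, assuming each arc is identified by the tuple (preceding word pw, word w, start time ts, end time te) exactly as in the equality definition; the `combine` function is passed in because the slides do not say how scores are combined, and the addition used in the example is only a placeholder.

```python
from typing import Callable, Dict, Tuple

# (preceding word pw, word w, start time ts, end time te)
Arc = Tuple[str, str, int, int]

def merge_word_graphs(
    graph1: Dict[Arc, float],
    graph2: Dict[Arc, float],
    combine: Callable[[float, float], float],
) -> Dict[Arc, float]:
    """Merge two sets of scored word arcs into one hypothesis space.

    Arcs that are equal in pw, w, ts and te are merged with combine();
    all other arcs keep their original score.
    """
    merged = dict(graph1)
    for arc, score in graph2.items():
        if arc in merged:
            merged[arc] = combine(merged[arc], score)
        else:
            merged[arc] = score
    return merged

# Hypothetical arcs and scores from two systems.
system1 = {("W1", "W4", 0, 10): 0.6, ("W4", "W10", 10, 20): 0.5}
system2 = {("W1", "W4", 0, 10): 0.3, ("W4", "W10", 11, 20): 0.4}
merged = merge_word_graphs(system1, system2, combine=lambda a, b: a + b)
for arc, score in sorted(merged.items()):
    print(arc, round(score, 2))
```

Note that the second arc of each system is not merged, since the start times differ; keeping exact time information is what distinguishes this from alignment-based combination.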

  39. Delicate Rescoring Example (Ⅰ): Expected Phone Accuracy Score (EPA)
     • Borrowing the concept of expected phone accuracy in MPE training
     • Decoding procedure: y: word sequence for a path; qk: the k-th word in the path y
     • Example (figure): competing words 豪雨 and 陶藝 with phone arcs h_a, t_a, au, u, sic_i, sic_iu
       - au: 5/6 = 0.83, −1 + 0.83 = −0.17
       - t_a: 1/6 = 0.17, −1 + 2×0.17 = −0.66
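The arithmetic in this example matches the phone accuracy function commonly used in MPE training: for a hypothesis phone and each reference phone, the score is −1 + 2e when the labels agree and −1 + e otherwise, where e is the fraction of the reference phone's duration that the hypothesis phone overlaps, and the best score over the reference phones is kept. The sketch below reproduces that computation; the frame boundaries are made-up values chosen to give the overlaps 1/6 and 5/6, and whether the slides use exactly this form is an assumption.

```python
def overlap_ratio(hyp, ref):
    """Fraction of the reference phone's duration overlapped by the hypothesis phone.

    Each phone is a tuple (label, start_frame, end_frame).
    """
    start = max(hyp[1], ref[1])
    end = min(hyp[2], ref[2])
    return max(0, end - start) / (ref[2] - ref[1])

def phone_accuracy(hyp, reference_phones):
    """MPE-style accuracy of one hypothesis phone against a reference phone sequence."""
    best = -1.0  # score when nothing overlaps
    for ref in reference_phones:
        e = overlap_ratio(hyp, ref)
        score = -1.0 + 2.0 * e if ref[0] == hyp[0] else -1.0 + e
        best = max(best, score)
    return best

# Hypothetical timing: the hypothesis t_a overlaps reference t_a by 1/6
# and reference au by 5/6 of their durations.
reference = [("t_a", 0, 6), ("au", 6, 12)]
hypothesis = ("t_a", 5, 11)
print(round(phone_accuracy(hypothesis, reference), 2))
```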

  40. Delicate Rescoring Example (Ⅱ): Time Frame Error Score (TFE)
     • Borrowing the concept from Minimum Time Frame Error decoding
     • Frame-level loss function
     • P(q’) is available from the process of calculating consensus scores
     • Decoding procedure: y: word sequence for a path; qk: the k-th word in the path

  41. Experimental Results
     • For the Chinese language, SER and CER make better sense due to the word segmentation problem
     • For SER (for syllables) and CER (for characters), the proposed approach is significantly better than the ROVER upper bound
     • TFE has the best performance
     Figure annotations: Alignment distortion; Discriminative Decoding

  42. Hierarchy of Research Areas (same diagram as slide 5)

  43. Multimedia Content Analysis for Efficient Browsing and Retrieval: automatic generation of titles, summaries and semantic structures for multimedia documents

  44. Difficulties in Browsing Multimedia/Spoken Documents
     • Written documents are better structured and easier to browse
       - in paragraphs with titles
       - easily summarized and shown on the screen
       - easily decided at a glance whether it is what the user is looking for
     • Multimedia/spoken documents are just video/audio signals
       - not easy to summarize and show on the screen
       - the user can’t go through each one from the beginning to the end during browsing
       - better approaches for efficient browsing and retrieval are needed

  45. Integration Relationships among the Involved Technology Areas
     Diagram labels: Semantic Analysis; Information Indexing, Retrieval and Browsing; Key Term Extraction from Spoken Documents; Keyterms/Named Entity Extraction from Spoken Documents

  46. Hierarchy of Research Areas (same diagram as slide 5)

  47. Improved and Interactive Spoken Document Retrieval: improved spoken document retrieval with higher accuracy and better user-content interaction

  48. Lattices, Position Specific Posterior Lattices (PSPL), Confusion Networks (CN)
     • Lattice (figure): word arcs W1 to W10 between a start node and an end node, laid out along a time index
     • All paths: W1W2, W3W4W5, W6W8W9W10, W7W8W9W10
     • PSPL: locate a word in a segment according to the order of the word in a path
     • CN: cluster several words in a segment according to similar time spans and word pronunciation
     • PSPL structure and CN structure (figure): each groups the word hypotheses, with their probabilities, into clusters 1 to 4
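The PSPL idea of locating a word by its order in a path can be sketched by enumerating the four paths listed on this slide. The path probabilities below are invented for illustration, and a real system would compute the position-specific posteriors on the lattice itself rather than by listing paths; the function name is also an assumption.

```python
from collections import defaultdict

def position_specific_posteriors(paths):
    """Accumulate, for each position in a path, the posterior mass of each word.

    paths is a list of (word_sequence, path_probability) pairs.
    """
    total = sum(prob for _, prob in paths)
    clusters = defaultdict(lambda: defaultdict(float))
    for words, prob in paths:
        for position, word in enumerate(words, start=1):
            clusters[position][word] += prob / total
    return {position: dict(words) for position, words in clusters.items()}

# The four paths from the slide, with hypothetical probabilities.
paths = [
    (["W1", "W2"], 0.4),
    (["W3", "W4", "W5"], 0.3),
    (["W6", "W8", "W9", "W10"], 0.2),
    (["W7", "W8", "W9", "W10"], 0.1),
]
pspl = position_specific_posteriors(paths)
for position in sorted(pspl):
    print(position, sorted(pspl[position]))
```

With these paths, W8 appears at position 2 in two paths, so its posterior at that position is the sum of both path probabilities.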

  49. Subword-based PSPL (S-PSPL)
     • OOV/Rare Word Problem
       - OOV word W=w1w2w3w4 and a lattice L of document D (wi: subword units)
       - W never appears in L, so D is never found under PSPL
       - But W=w1w2w3w4 is hidden in L at the subword level
     Figure: word lattice L along a time index, with word arcs aw1w2, w1w2, w2w3, w3w4b, w3w4bcd, w3w4e
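The OOV situation described above can be illustrated with a toy lattice path: the query word never matches a whole word arc, but its subword units appear contiguously once the arcs are broken into subwords. The path below uses two of the arc labels from the figure; treating them as consecutive arcs on one path, and the helper names, are assumptions for this sketch.

```python
def contains_sequence(sequence, query):
    """True if query occurs as a contiguous run inside sequence."""
    n = len(query)
    return any(sequence[i:i + n] == query for i in range(len(sequence) - n + 1))

def word_level_hit(path, query):
    """Word-level matching: the query must equal one whole word arc."""
    return any(list(word) == query for word in path)

def subword_level_hit(path, query):
    """Subword-level matching: flatten the arcs into subword units first."""
    flat = [unit for word in path for unit in word]
    return contains_sequence(flat, query)

# One hypothetical path through the lattice: word arcs "aw1w2" then "w3w4b".
path = [("a", "w1", "w2"), ("w3", "w4", "b")]
query = ["w1", "w2", "w3", "w4"]  # the OOV word W = w1w2w3w4
print(word_level_hit(path, query), subword_level_hit(path, query))
```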

  50. Subword-based PSPL and CN
     • Lattice represented by subword arcs (figure): each word arc Wi is split into subword arcs wi_1, wi_2, … along the time index
     • S-PSPL structure and S-CN structure (figure): clusters 1, 2, …, 8 of subword units with their probabilities (e.g. w1_1: prob, w1_2: prob, …)
