

  1. An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training Mr. Yik-Cheung Tam Dr. Brian Mak

  2. Outline • Motivation • Overview of MCE training • Problem using N-best hypotheses • Alternative: 1-nearest hypothesis • What? • Why? • How? • Evaluation • Conclusion

  3. MCE Overview • The MCE loss function l(d(X)) and its distance (misclassification) measure d(X). • G(X), the score of the competing hypotheses, may be computed using the N-best hypotheses. • l(.) is a 0-1 soft error-counting function (Sigmoid). • Gradient descent is applied to obtain a better estimate.
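The equations on this slide were images and did not survive the transcript. As a sketch, the standard MCE formulation (Juang & Katagiri) that the deck appears to follow, with γ the Sigmoid slope and θ the offset referred to on slide 13, is:

\[
d(X) \;=\; -\,g(X; S_0, \Lambda) \;+\; G(X), \qquad
G(X) \;=\; \frac{1}{\eta}\log\!\Bigg[\frac{1}{N}\sum_{j=1}^{N} e^{\eta\, g(X; S_j, \Lambda)}\Bigg],
\]

where g(X; S_j, Λ) is the log-likelihood of hypothesis S_j, S_0 is the correct transcription, and η weights the N-best competitors. The soft error count and the parameter update are

\[
\ell\big(d(X)\big) \;=\; \frac{1}{1 + e^{-\gamma\, d(X) + \theta}}, \qquad
\Lambda \;\leftarrow\; \Lambda \;-\; \epsilon\,\nabla_{\Lambda}\,\ell\big(d(X)\big).
\]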

  4. Problem Using N-best Hypotheses • When d(X) gets large enough, it falls out of the steep trainable region of the Sigmoid.
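The slide's point follows directly from the Sigmoid's derivative: since

\[
\frac{\partial \ell}{\partial d} \;=\; \gamma\,\ell(d)\,\big(1-\ell(d)\big) \;\longrightarrow\; 0
\quad\text{as } |d(X)| \to \infty,
\]

an utterance whose distance measure is large contributes almost no gradient, and is effectively lost to training.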

  5. What is the 1-nearest Hypothesis? • The competing hypothesis whose score is closest to that of the correct transcription, so that d(1-nearest) <= d(1-best). • The idea can be generalized to N-nearest hypotheses.
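A minimal sketch of the two competitor choices, assuming "nearest" means closest in score to the correct transcription; the helper below is illustrative, not the authors' code:

    def pick_competitors(correct_score, competitor_scores):
        """Choose the 1-best and 1-nearest competitors from an N-best list.

        correct_score:     log-score g(X; S_0) of the correct transcription.
        competitor_scores: log-scores g(X; S_j) of the incorrect hypotheses.
        """
        one_best = max(competitor_scores)   # highest-scoring competitor
        one_nearest = min(competitor_scores,
                          key=lambda s: abs(s - correct_score))
        return one_best, one_nearest

    # d(X) = -g(X; S_0) + g(X; S_j), so d(1-nearest) <= d(1-best) always holds:
    g0 = -120.0
    best, nearest = pick_competitors(g0, [-118.0, -119.5, -150.0])
    assert (nearest - g0) <= (best - g0)    # 0.5 <= 2.0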

  6. Using the 1-nearest Hypothesis • Keeps the training data inside the steep trainable region of the Sigmoid.

  7. How to Find the 1-nearest Hypothesis? • Method 1 (exact approach): stack-based N-best decoder. • Drawback: N may be very large => memory problem; the size of N must be limited. • Method 2 (approximate approach): modify the Viterbi algorithm with a special pruning scheme (sketched after the next slide).

  8. Approximated 1-nearest Hypothesis • Notation: • V(t+1, j): accumulated score at time t+1 and state j • a_ij: transition probability from state i to state j • b_j(o_{t+1}): observation probability at time t+1 and state j • V_c(t+1): accumulated score of the Viterbi path of the correct string at time t+1 • Beam(t+1): beam width applied at time t+1
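A minimal sketch of the modified Viterbi pass in this notation. This is one reading of the pruning scheme, assuming the beam is applied symmetrically around the correct-string score V_c(t+1); it is illustrative, not the authors' implementation:

    import numpy as np

    def nearest_viterbi(log_a, log_b, v_c, beam):
        """Viterbi search pruned around the correct-string path score.

        log_a: (S, S) log transition probabilities log a_ij
        log_b: (T, S) log observation probabilities log b_j(o_t)
        v_c:   (T,)   V_c(t), Viterbi score of the correct string at time t
        beam:  (T,)   Beam(t), beam width applied at time t
        """
        T, S = log_b.shape
        V = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        V[0] = log_b[0]                       # assume uniform initial state probs
        for t in range(T - 1):
            for j in range(S):
                # V(t+1, j) = max_i [ V(t, i) + log a_ij ] + log b_j(o_{t+1})
                scores = V[t] + log_a[:, j]
                back[t + 1, j] = int(np.argmax(scores))
                V[t + 1, j] = scores[back[t + 1, j]] + log_b[t + 1, j]
            # Pruning: keep only states whose score stays within Beam(t+1) of
            # the correct string's score V_c(t+1); surviving paths stay "near"
            # the correct path, and the best one at the final frame
            # approximates the 1-nearest hypothesis.
            outside = np.abs(V[t + 1] - v_c[t + 1]) > beam[t + 1]
            V[t + 1, outside] = -np.inf
        return V, back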

  9. Approximated 1-nearest Hypothesis (cont.) • There exists some "nearest" path in the pruned search space (the shaded area of the slide's figure).

  10. System Evaluation

  11. Corpus: Aurora • Noisy connected digits derived from TIDIGITS. • Multi-condition training (train on noisy conditions): {subway, babble, car, exhibition} × {clean, 20, 15, 10, 5 dB SNR} (5 noise levels); 8,440 training utterances. • Testing (matched noisy conditions): same as above plus 0 and −5 dB SNR (7 noise levels); 28,028 testing utterances.

  12. System Configuration • Standard 39-dimensional MFCCs (cepstra + Δ + ΔΔ). • 11 whole-word digit HMMs (0-9, "oh"): 16 states, 3 Gaussians per state. • 3-state silence HMM, 6 Gaussians per state. • 1-state short-pause HMM tied to the 2nd state of the silence model. • Baum-Welch training to obtain the initial HMMs. • Corrective MCE training on the HMM parameters.
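A quick arithmetic check of the model size these choices imply (assuming the tied short-pause state adds no new Gaussians):

    digit_gaussians = 11 * 16 * 3    # 11 whole-word HMMs x 16 states x 3 Gaussians = 528
    silence_gaussians = 3 * 6        # 3 states x 6 Gaussians = 18
    total = digit_gaussians + silence_gaussians
    print(total)                     # 546 Gaussians in all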

  13. System Configuration (cont.) • Compare 3 kinds of competing hypotheses: 1-best hypothesis; exact 1-nearest hypothesis; approx. 1-nearest hypothesis. • Sigmoid parameters: various γ (controls the slope of the Sigmoid); offset θ = 0.

  14. Experiment I: Effect of the Sigmoid Slope • Learning rate = 0.05, with different γ: 0.1 (best test performance), 0.5 (steeper), 0.02 and 0.004 (flatter). • WER: baseline 12.71%; 1-best 11.01%; approx. 1-nearest 10.71%; exact 1-nearest 10.45%.

  15. Effective Amount of Training Data • A soft error < 0.95 is defined to be "effective". • The 1-nearest approach has more effective training data when the Sigmoid slope is relatively steep: exact 1-nearest 67%; approx. 1-nearest 51%; 1-best 40%.
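How the "effective" percentages could be computed from per-utterance soft errors; a hypothetical sketch (effective_fraction is not from the paper):

    def effective_fraction(soft_errors, threshold=0.95):
        """Fraction of utterances whose soft error l(d(X)) is below the
        threshold, i.e. still inside the Sigmoid's trainable region."""
        return sum(1 for e in soft_errors if e < threshold) / len(soft_errors)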

  16. Experiment II: Compensation With More Training Iterations • With 100% effective training data, apply more training iterations: γ = 0.004, learning rate = 0.05. • Result: slow improvement compared to the best case (exact 1-nearest with γ = 0.1).

  17. Experiment II: Compensation Using a Larger Learning Rate • Use a larger learning rate (0.05 → 1.25). • Fix γ = 0.004 (100% effective training data). • Result: the 1-nearest approach is better than the 1-best approach after compensation.

  18. Using a Larger Learning Rate (cont.) • Training performance: MCE loss versus the number of training iterations, plotted for the 1-best, approx. 1-nearest, and exact 1-nearest methods.

  19. Using a Larger Learning Rate (cont.) • Test performance: WER versus the number of training iterations: 1-best 11.55%; approx. 1-nearest 10.70%; exact 1-nearest 10.79%.

  20. Conclusion • The 1-best and 1-nearest methods were compared in MCE training: the effect of the Sigmoid slope, and compensation when using a flat Sigmoid. • The 1-nearest method is better than the 1-best approach. • More trainable data are available in the 1-nearest approach. • The approximate and exact 1-nearest methods yield comparable performance.

  21. Questions and Answers
