
Presentation Transcript


  1. Incorporating Acoustic Context in Automatic Speech Recognition Nelson Morgan, ICSI and UC Berkeley

  2. Talk Outline • What I mean by “Acoustic Context” • Dynamic Features • Temporal Filtering • Multi-frame analysis • Multi-stream analysis • Hidden Dynamic Models • Conclusions

  3. 1952 Bell Labs Digits • Possibly the first word (digit) recognizer • Approximated energy in formants over word • Insensitive to amplitude, timing variation • Clumsy technology

  4. Digit Patterns [Figure: block diagram of the 1952 recognizer. The spoken digit feeds two paths: a low-pass filter (800 Hz) and a high-pass filter (1 kHz), each followed by a limiting amplifier and an axis-crossing counter. The two counts (roughly 200-800 Hz and 1-3 kHz) are plotted against each other to form the digit pattern.]

  5. Processing the Streams • Hard limiting and counting for each • Simple form of acoustic context: evaluating features for entire word • Replaced by framewise analysis

  6. Framewise Analysis of Speech [Figure: the waveform is divided into successive frames; Frame 1 yields feature vector X1, Frame 2 yields feature vector X2, and so on.]

  7. Acoustic Context • Evaluating costs based on observations over time intervals greater than the “typical” frame (20-30 ms) • Not the same as phonetic context (models for sound units in particular contexts)

  8. Statistical ASR • i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i) • P(X | M_i) ≈ P(X | Q_i) [Viterbi approx.], where Q_i is the best state sequence in M_i • P(X | Q_i) approximated by a product of local likelihoods (Markov, conditional independence assumptions)
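  To make the Viterbi approximation concrete, here is a minimal log-domain sketch, assuming precomputed HMM parameters; the function and variable names are mine, not from the talk:

```python
import numpy as np

def viterbi_log_likelihood(log_pi, log_A, log_B):
    """Score an observation sequence by its single best state path,
    i.e. the Viterbi approximation P(X|M) ~ P(X|Q_best).

    log_pi: (S,)   log initial-state probabilities
    log_A:  (S, S) log transitions, log_A[i, j] = log P(q_j | q_i)
    log_B:  (T, S) log local likelihoods, log_B[t, s] = log P(x_t | q_s)
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]  # best log score ending in each state at t=0
    for t in range(1, T):
        # Extend every path by one frame, keeping only the best predecessor.
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return delta.max()  # log P(X | Q_best) under this model

# i_best then follows the slide's argmax: score X under each model M_i,
# add log P(M_i), and pick the best.
```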

  9. Markov model • Two states q1, q2: P(x1, x2 | q1, q2) ≈ P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)

  10. Markov model (graphical form) [Figure: a chain of states q1 → q2 → q3 → q4, each emitting an observation x1, x2, x3, x4.]

  11. Beyond a single frame Acoustic context can be increased beyond the single frame by: • Temporal processing before frame analysis • Multiple frames used in observation • Both mechanisms used together

  12. Spectral vs Temporal Processing [Figure: on a time-frequency plane, spectral processing (e.g., cepstral analysis) operates across frequency within a frame, while temporal processing (e.g., mean removal) operates across time within a frequency band.]

  13. Dynamic Speech Features • temporal dynamics useful for ASR • local time derivatives of cepstra • “delta” features estimated over multiple frames (typically 5) • usually augment static features • can be viewed as a temporal filter

  14. “Delta” impulse response [Figure: the 5-frame delta filter’s impulse response, an antisymmetric ramp through zero at frame 0, ranging between -0.2 and +0.2 over frames -2 to +2.]
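  A minimal sketch of the 5-frame delta computation, assuming (T, D) static cepstra; the regression weights for K=2 are the ramp values shown above:

```python
import numpy as np

def delta_features(cepstra, K=2):
    """Delta features via the standard regression formula over
    2K+1 frames (K=2 gives the usual 5-frame window).

    cepstra: (T, D) array of static cepstral features.
    """
    # Regression weights: [-2, -1, 0, 1, 2] / 10 for K=2 -- the
    # antisymmetric ramp of the impulse response above.
    k = np.arange(-K, K + 1)
    w = k / (2.0 * np.sum(np.arange(1, K + 1) ** 2))
    # Replicate edge frames so the output has the same length T.
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode="edge")
    T = cepstra.shape[0]
    return np.stack([w @ padded[t:t + 2 * K + 1] for t in range(T)])
```

  The delta vectors are typically appended to the static features, as the slide notes.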

  15. Temporal Filtering • Filtering of short term spectral or cepstral components • Simple noncausal case: mean removal • Generalization: bandpass or highpass filter (e.g., RASTA) • Data-driven design: LDA -> filters
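  Two of the filters named on this slide, sketched under simple assumptions; the first-order recursion below is a crude stand-in for RASTA-style bandpass filtering, not the published RASTA transfer function:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Noncausal mean removal over the whole utterance (CMS):
    a zero-frequency notch applied to each cepstral trajectory."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def highpass_trajectory(cepstra, alpha=0.98):
    """A simple first-order high-pass along time for each coefficient:
    y[t] = x[t] - x[t-1] + alpha * y[t-1].  (Illustrative only; the
    actual RASTA filter is a specific bandpass IIR design.)"""
    y = np.zeros_like(cepstra, dtype=float)
    y[0] = cepstra[0]
    for t in range(1, len(cepstra)):
        y[t] = cepstra[t] - cepstra[t - 1] + alpha * y[t - 1]
    return y
```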

  16. Linear Discriminant Analysis (1D) [Figure: two classes of points in the (x1, x2) plane, projected onto the single direction that best separates them.]

  17. Linear Discriminant Analysis (multi-D) [Figure: inputs x1..x5 are linearly transformed (y = Xx) into outputs y1..y3. The transformation is chosen to maximize the ratio of between-class variance to within-class variance.]

  18. LDA for Temporal Filter Design [Figure: the same y = Xx transformation, but the inputs x1..x5 are successive values of a single variable over time, so each discriminant row acts as a temporal filter.]
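  A sketch of the maximization behind these LDA slides, assuming labeled training windows; each leading eigenvector, read as FIR taps, is a data-driven temporal filter like those on the next slide:

```python
import numpy as np

def lda_directions(X, labels):
    """Directions maximizing between-class over within-class variance.

    X: (N, d) windows, e.g. d consecutive values of one cepstral
       coefficient's trajectory; labels: (N,) phone class at each
       window's center frame.
    """
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Generalized eigenproblem Sb v = lambda Sw v, assuming Sw invertible.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order]   # columns sorted by discriminability
```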

  19. LDA-based temporal filters

  20. Multi-frame analysis • Incorporate multiple frames as a single observation • LDA the most common approach • Neural networks • Bayesian networks (graphical models, including Buried Markov Models)
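  A minimal sketch of the first bullet, stacking neighboring frames into one observation vector; the 9-frame window is an assumed value, not a figure from the talk:

```python
import numpy as np

def stack_frames(features, context=4):
    """Stack +/- `context` neighboring frames into one long observation
    vector per frame (a 9-frame window for context=4).

    features: (T, D) -> returns (T, (2*context+1) * D).
    """
    T = features.shape[0]
    # Replicate edge frames so every position has full context.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel()
                     for t in range(T)])
```

  These stacked vectors are what the LDA transformation on the next slide, or the multi-layer perceptron on slide 22, takes as input.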

  21. LDA for Multiple Frame Transformation [Figure: the y = Xx transformation again, now with all feature variables from several consecutive frames stacked as the input x1..x5.]

  22. Multi-layer perceptron
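  A bare-bones sketch of the kind of network shown here, mapping a stacked multi-frame input to phone-class posteriors; the weights are hypothetical, and in hybrid HMM/MLP systems these posteriors, divided by class priors, stand in for the local likelihoods of slide 8:

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: stacked multi-frame input -> phone posteriors.

    x: (..., d_in) stacked frames; W1/b1, W2/b2: hypothetical trained
    weights for the hidden and output layers.
    """
    h = np.tanh(x @ W1 + b1)                     # hidden layer
    z = h @ W2 + b2                              # output logits
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax posteriors
```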

  23. Buried Markov Models

  24. Multi-stream analysis • Multi-band systems • Multiple temporal properties • Multiple data-driven temporal filters

  25. Multi-band analysis

  26. Temporally distinct features

  27. Combining streams
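  One common way to merge the streams, sketched with assumed per-stream phone posteriors and hypothetical weights; the talk's figure does not pin down a specific combination rule:

```python
import numpy as np

def combine_stream_posteriors(posteriors, weights=None):
    """Combine per-stream phone posteriors (a list of (T, C) arrays)
    by a weighted average in the log domain, then renormalize."""
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    log_p = sum(w * np.log(p + 1e-10)            # floor avoids log(0)
                for w, p in zip(weights, posteriors))
    p = np.exp(log_p - log_p.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)
```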

  28. Another context model: articulators • Natural representation of context • Production apparatus has mass, inertia • Difficult to accurately model • Can approximate with simple dynamics

  29. Hidden Dynamic Models “We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of right ought to be free and context independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor.” John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson ... (See http://www.clsp.jhu.edu/ws98/projects/dynamic/)

  30. Hidden Dynamic Models [Figure: block diagram. A segmentation drives a target switch that selects per-segment target values; the target sequence passes through a filter to give a smooth hidden trajectory, and a neural network maps that trajectory to the observed speech pattern.]
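  An illustrative sketch of the target-pursuit idea in the diagram, assuming per-phone target vectors and a simple first-order filter; the actual workshop system's dynamics and its neural-network mapping to the speech pattern are not reproduced here:

```python
import numpy as np

def hidden_trajectory(segments, targets, alpha=0.9):
    """Smooth hidden trajectory pursuing per-phone target values:
    a first-order filter over a piecewise-constant target track.

    segments: list of (phone_id, n_frames); targets: dict mapping
    phone_id -> target vector.  (Hypothetical interface; the filter
    block here stands in for the one in the diagram above.)
    """
    track = np.concatenate([np.tile(targets[p], (n, 1))
                            for p, n in segments])
    x = np.zeros_like(track, dtype=float)
    x[0] = track[0]
    for t in range(1, len(track)):
        # Move a fraction of the way toward the current target.
        x[t] = alpha * x[t - 1] + (1 - alpha) * track[t]
    return x
```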

  31. Sources of Optimism • Comparatively new research lines • Many examples of improvements • Moore’s Law → much more processing • Points toward joint development of front end and statistical components

  32. Summary • Acoustic context is already incorporated in simple forms of temporal signal processing (CMS, deltas) • Generalized forms can help more • Study of the interaction between traditional ASR strata may help in design of this component
