
Speech Enhancement EE 516 Spring 2009



Presentation Transcript


  1. Speech Enhancement. EE 516, Spring 2009. Alex Acero

  2. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  3. Additive noise • Stationary noise: properties don’t change over time • White noise x[n]: flat power spectrum, samples are uncorrelated • White Gaussian noise: PDF is Gaussian (see chapter 10) • Typical noise is colored • Pink noise: low-pass in nature • Non-stationary noise: properties change over time • Babble noise • Cocktail party effect
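As a quick sketch (not part of the slides), the two defining properties of white noise, a flat power spectrum and uncorrelated samples, can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)          # white Gaussian noise, unit variance

# Uncorrelated samples: autocorrelation at lag 1 should be near zero
r1 = np.dot(x[:-1], x[1:]) / (n - 1)

# Flat power spectrum: average power is similar in low and high bands
X = np.abs(np.fft.rfft(x)) ** 2 / n
low = X[1 : len(X) // 2].mean()
high = X[len(X) // 2 :].mean()

print(abs(r1) < 0.05, 0.5 < low / high < 2.0)
```

Colored noise (e.g. pink noise) would fail the flat-spectrum check: its low band carries more power than its high band.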

  4. Reverberation • Impulse response of an average office

  5. Model of the Environment • Clean speech x[m] passes through a channel h[m] and picks up additive noise n[m]: y[m] = x[m] * h[m] + n[m]
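The environment model above can be sketched in a few lines; the signal lengths, the 3-tap channel, and the noise level are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.standard_normal(1000)         # "clean" source signal (toy)
h = np.array([1.0, 0.5, 0.25])        # short channel impulse response
n = 0.1 * rng.standard_normal(1002)   # additive noise

# Model of the environment: y[m] = x[m] * h[m] + n[m]
y = np.convolve(x, h) + n

print(y.shape)
```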

  6. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  7. Cepstral Mean Normalization • Compute the mean of the cepstrum and subtract it from the input • CMN is robust to channel distortion • Normalizes the average vocal tract or short filters • The average must include > 2 sec of speech
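A minimal CMN sketch on toy data; the constant offset stands in for a channel's cepstral contribution (a convolutional channel becomes additive in the cepstral domain, so subtracting the mean removes it):

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization: subtract the per-utterance mean
    of each cepstral coefficient, computed over all frames."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Toy example: 200 frames of 13-dim cepstra with a constant channel offset
rng = np.random.default_rng(2)
c = rng.standard_normal((200, 13)) + 3.0   # channel adds a fixed offset
c_norm = cmn(c)
print(np.allclose(c_norm.mean(axis=0), 0.0))   # mean removed
```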

  8. RASTA • CMN is a low-pass filter with rectangular window • Can use other low-pass filters too • RASTA filter is band-pass
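One common band-pass choice is the classic RASTA filter of Hermansky and Morgan, H(z) = 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴)/(1 − 0.98z⁻¹), applied to each feature trajectory over time. The slide gives no coefficients, so this particular filter is an assumption; a causal sketch:

```python
import numpy as np

def rasta_filter(traj):
    """Apply the classic RASTA band-pass IIR filter (assumed coefficients)
    H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
    to one feature trajectory (a single coefficient over time)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # FIR numerator taps
    a = 0.98                                           # pole of the integrator
    y = np.zeros_like(traj)
    for t in range(len(traj)):
        acc = sum(b[k] * traj[t - k] for k in range(5) if t - k >= 0)
        y[t] = acc + (a * y[t - 1] if t > 0 else 0.0)
    return y

# Band-pass behavior: a constant (DC) trajectory decays toward zero,
# which is what makes RASTA robust to fixed channel offsets, like CMN
out = rasta_filter(np.ones(500))
print(abs(out[-1]) < 1e-3)
```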

  9. Retrain with noisy data • Mismatches between training and testing are bad for pattern recognition systems • Retrain with noisy data • Approximation: add noise to clean data and retrain

  10. Multi-condition training • Very hard to predict exactly the type of noise we’ll encounter at test time • Too expensive to retrain the system for each noise condition • Train system offline with several noise types and levels

  11. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  12. Condenser Microphone • [Figure: condenser microphone with preamplifier: microphone impedance ZM, load resistance RL, output voltage v(t), gain G]

  13. Omnidirectional microphones • Polar response • [Figure: mic opening and diaphragm]

  14. Bidirectional microphones • Speech sound wave from the front, noise sound wave from the side • [Figure: source at distances r1 and r2 from openings at (–d, 0) and (d, 0), at angle θ]

  15. Bidirectional microphones • Response of a bidirectional microphone with d = 1 cm at 0° • Solid line corresponds to far-field conditions ( ) and the dotted line to near-field conditions ( )

  16. Unidirectional microphones Speech sound wave from the front Noise sound wave from the side

  17. Dynamic microphones • [Figure: diaphragm, magnet, coil, and output voltage]

  18. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  19. Acoustic Echo Cancellation • [Block diagram: the loudspeaker signal x[n] passes through the acoustic path H and reaches the microphone together with the speech signal s[n] and local noise r[n]; an adaptive filter driven by x[n] produces an echo estimate that is subtracted, yielding the error e[n]]

  20. Line echo cancellation • [Block diagram: one speaker's signal x[n] leaks through the hybrid circuit H; an adaptive filter estimates the echo from x[n] and subtracts it from the other speaker's signal s[n] plus noise r[n], yielding the error e[n]]

  21. Least Mean Squares (LMS) • Given input x[n] • Estimate output y[n] = w^T x[n] • Compute error e[n] = d[n] − y[n] • Update filter w ← w + μ e[n] x[n] • Need to tune the step size μ
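The four LMS steps above can be sketched as follows; the 3-tap echo path h and all signal values are made up for illustration:

```python
import numpy as np

def lms(x, d, order=8, mu=0.01):
    """Least Mean Squares adaptive filter.
    x: reference input, d: desired signal, mu: step size (must be tuned)."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for t in range(order, len(x)):
        u = x[t - order + 1 : t + 1][::-1]   # most recent samples first
        y = w @ u                             # estimate output
        e[t] = d[t] - y                       # compute error
        w += mu * e[t] * u                    # update filter weights
    return w, e

# Identify a known 3-tap echo path from white-noise input
rng = np.random.default_rng(3)
x = rng.standard_normal(20_000)
h = np.array([0.5, -0.3, 0.1])
d = np.convolve(x, h)[: len(x)]
w, e = lms(x, d, order=3, mu=0.01)
print(np.allclose(w, h, atol=0.01))
```

If mu is too large the weights diverge; if too small, convergence is slow. That is the tuning problem the next slide addresses.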

  22. Normalized LMS • Make the step size adaptive to ensure convergence: w ← w + (μ / Ex[n]) e[n] x[n] • where Ex[n] tracks the input energy
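A sketch of the normalized update; eps is a small regularizer (an assumption, not from the slides) that guards against division by zero when the input is silent:

```python
import numpy as np

def nlms(x, d, order=8, mu=0.5, eps=1e-6):
    """Normalized LMS: the step size is divided by the current input
    energy, keeping the update stable regardless of input level."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for t in range(order, len(x)):
        u = x[t - order + 1 : t + 1][::-1]
        e[t] = d[t] - w @ u
        w += (mu / (eps + u @ u)) * e[t] * u   # normalized step
    return w, e

# Same toy echo path as before, but with input scaled up 100x:
# plain LMS with a fixed mu would diverge, NLMS still converges
rng = np.random.default_rng(4)
x = 100.0 * rng.standard_normal(20_000)
h = np.array([0.5, -0.3, 0.1])
d = np.convolve(x, h)[: len(x)]
w, _ = nlms(x, d, order=3)
print(np.allclose(w, h, atol=0.01))
```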

  23. Recursive Least Squares (RLS) • Newton-Raphson iteration • New weights • Faster convergence, but more CPU intensive • [Figure: Newton-Raphson step from x0 to x1 on f(x)]

  24. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  25. Microphone arrays: delay & sum • 5 microphones (M−2 … M2) spaced 5 cm apart, source S at 5 m, angle 0° • [Figure: array response at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]

  26. Microphone arrays: delay & sum • 5 microphones (M−2 … M2) spaced 5 cm apart, source S at 5 m, angle 30° • [Figure: array response at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]
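The responses shown on these two slides can be recomputed from the far-field delay-&-sum formula. Broadside steering and the speed of sound are assumptions here:

```python
import numpy as np

C = 343.0   # speed of sound in air (m/s), assumed

def delay_sum_response(freq, angle_deg, n_mics=5, spacing=0.05):
    """Far-field magnitude response of a delay-&-sum beamformer
    steered at broadside (0 deg), for a plane wave arriving at angle_deg.
    Each mic's phase comes from its position along the array axis."""
    m = np.arange(n_mics) - (n_mics - 1) / 2   # mic index, centered on M0
    phase = 2 * np.pi * freq * m * spacing * np.sin(np.radians(angle_deg)) / C
    return abs(np.exp(1j * phase).sum()) / n_mics

# An on-axis source passes at unity gain at any frequency;
# an off-axis source is attenuated, and more so at higher frequencies
on_axis = delay_sum_response(4400.0, 0.0)
off_axis = delay_sum_response(4400.0, 60.0)
print(round(on_axis, 3), off_axis < on_axis)
```

At 400 Hz the 20 cm aperture is small relative to the wavelength, so the attenuation of off-axis sources is weak, matching the flat low-frequency patterns in the figures.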

  27. WITTY: Who Is Talking To You?

  28. Bone microphone for noise robust ASR • Conventional microphones are sensitive to noise • Bone microphones are more noise resistant, but distort the signal • Not enough data to retrain recognizer with bone microphone • Fusion between acoustic microphone and bone microphone

  29. Acoustic Microphone

  30. Bone Microphone

  31. Microphone fusion

  32. Relationship between acoustic mic and bone mic • [Figure: acoustic vs. contact (bone) microphone signals]

  33. Relationship between acoustic mic and bone mic

  34. WITTY: Who is talking to you?

  35. Blind source separation • Linear mixing • Estimate filter • Separate signals • Using assumption signals are independent • Do gradient descent:

  36. Blind source separation • [Block diagram: sources y1[n] and y2[n] mixed through filters h11[n], h12[n], h21[n], h22[n] to produce z1[n] and z2[n]] • Idea: estimate filters h11[n] and h12[n] that maximize p(z1[n] | λ) where λ is an HMM • Approximate the HMM by a Gaussian mixture model with LPC parameters => EM algorithm with a linear set of equations

  37. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  38. Spectral subtraction • Corrupted signal y[m] = x[m] + n[m] • Power spectrum |Y(f)|² = |X(f)|² + |N(f)|² + 2 Re{X(f)N*(f)}, but E{X(f)N*(f)} = 0, so |Y(f)|² ≈ |X(f)|² + |N(f)|² • Estimate the noise power spectrum |N̂(f)|² from noise-only frames • Estimate the clean power spectrum as |X̂(f)|² = |Y(f)|² − |N̂(f)|²

  39. Spectral subtraction • Keep the original (noisy) phase when resynthesizing • Ensure the power estimate is positive (floor it before taking the square root)
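A single-frame sketch combining the last two slides (subtract, floor, keep the noisy phase); the flooring constant and the test signals are illustrative choices:

```python
import numpy as np

def spectral_subtract(y_frame, noise_psd, floor=0.01):
    """Single-frame spectral subtraction: subtract the estimated noise
    power spectrum, floor the result so it stays positive, and rebuild
    the signal using the original (noisy) phase."""
    Y = np.fft.rfft(y_frame)
    power = np.abs(Y) ** 2 - noise_psd            # subtract noise power
    power = np.maximum(power, floor * noise_psd)  # ensure it's positive
    X_hat = np.sqrt(power) * np.exp(1j * np.angle(Y))  # keep noisy phase
    return np.fft.irfft(X_hat, n=len(y_frame))

# Toy usage: a sinusoid in white noise; the noise PSD is estimated
# from a separate noise-only segment, as the slides suggest
rng = np.random.default_rng(5)
N = 512
t = np.arange(N)
clean = np.sin(2 * np.pi * 0.05 * t)
noise = 0.5 * rng.standard_normal(N)
noise_psd = np.abs(np.fft.rfft(0.5 * rng.standard_normal(N))) ** 2
enhanced = spectral_subtract(clean + noise, noise_psd)
print(enhanced.shape)
```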

  40. Aurora2 • ETSI STQ group • TIDigits • Added noise at SNRs: −5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB • Set A: subway, babble, car, exhibition • Set B: restaurant, airport, street, station • Set C: one noise from set A and one noise from set B • Aurora 3 recorded in a car (no digital mixing!) • Aurora 4 for large vocabulary • Advanced Front-End (AFE) standard (2001) uses a variant of spectral subtraction

  41. Aurora 2 (Clean training) Using SPLICE algorithm

  42. Aurora 2 (multi-condition training) Using SPLICE algorithm

  43. Wiener Filtering • Find linear estimate of clean signal • MMSE (Minimum Mean Squared Error) • Wiener-Hopf equation • In Freq domain • If noise and signal are uncorrelated

  44. Wiener Filtering • Find linear estimate of clean signal • If noise and signal are uncorrelated • With • Compare with Spectral Subtraction
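For uncorrelated signal and noise, the frequency-domain Wiener gain reduces to Sx(f)/(Sx(f) + Sn(f)); a minimal sketch:

```python
import numpy as np

def wiener_gain(signal_psd, noise_psd):
    """Frequency-domain Wiener filter H(f) = Sx(f) / (Sx(f) + Sn(f)),
    valid when signal and noise are uncorrelated."""
    return signal_psd / (signal_psd + noise_psd)

# The gain approaches 1 where the signal dominates
# and 0 where the noise dominates
H_high_snr = wiener_gain(signal_psd=100.0, noise_psd=1.0)
H_low_snr = wiener_gain(signal_psd=1.0, noise_psd=100.0)
print(round(H_high_snr, 3), round(H_low_snr, 3))
```

Compare with spectral subtraction: both attenuate low-SNR frequency bins, but the Wiener gain follows from the MMSE criterion rather than from subtracting power spectra directly.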

  45. Spectral Subtraction

  46. Vector Taylor Series (VTS) • Acero, Moreno • The power spectrum, on the average • Taking logs • Cepstrum is DCT (matrix C) of log power spectrum

  47. Vector Taylor Series (VTS) • x, h, and n are Gaussian random vectors with means μx, μh, and μn and covariance matrices Σx, Σh, and Σn • Expand y in a first-order Taylor series
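A standard way to write the VTS model is sketched below; the exact equations on the slide did not survive extraction, so this form (from the usual Acero/Moreno formulation) is an assumption. C is the DCT matrix from the previous slide:

```latex
% Corrupted cepstrum as a nonlinear function of clean speech x,
% channel h, and noise n:
y = x + h + C \log\left( 1 + \exp\left( C^{-1}\,(n - x - h) \right) \right)

% First-order Taylor expansion around the means:
y \approx \mu_x + \mu_h + g(\mu_n - \mu_x - \mu_h)
        + (I - G)\,(x - \mu_x) + (I - G)\,(h - \mu_h) + G\,(n - \mu_n)

% with g(z) = C \log(1 + \exp(C^{-1} z)) and
% G = \partial y / \partial n evaluated at the means.
```

Because the expansion is linear in x, h, and n, the corrupted cepstrum y stays (approximately) Gaussian, which is what makes the model tractable.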

  48. Vector Taylor Series • Distribution of corrupted log-spectra • Noise with mean of 0 dB and std dev of 2 dB • Speech with mean of 25 dB • Monte Carlo simulation • Std dev: 25 dB, 10 dB, 5 dB

  49. Phase matters • Corrupted signal y[m] = x[m] + n[m] • Spectrum Y(f) = X(f) + N(f), but |Y(f)|² ≈ |X(f)|² + |N(f)|² is only an approximation: the cross-term depends on the relative phase of speech and noise

  50. Non-stationary noise • Speech/noise decomposition (Varga et al.) • [Figure: a speech HMM and a noise HMM jointly accounting for the observations]
