
Speech Enhancement EE 516 Spring 2009



Presentation Transcript


  1. Speech Enhancement. EE 516, Spring 2009. Alex Acero

  2. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  3. Additive noise • Stationary noise: properties don’t change over time • White noise x[n]: flat power spectrum, samples are uncorrelated • White Gaussian noise: PDF is Gaussian (see chapter 10) • Typical noise is colored • Pink noise: low-pass in nature • Non-stationary noise: properties change over time • Babble noise • Cocktail party effect
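As a quick sketch (not part of the slides), the two defining properties of white noise, a flat power spectrum and uncorrelated samples, can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)          # white Gaussian noise, unit variance

# Uncorrelated samples: autocorrelation at lag 1 should be near zero
r1 = np.dot(x[:-1], x[1:]) / (n - 1)

# Flat power spectrum: average power is similar in low and high bands
X = np.abs(np.fft.rfft(x)) ** 2 / n
low = X[1 : len(X) // 2].mean()
high = X[len(X) // 2 :].mean()

print(abs(r1) < 0.05, 0.5 < low / high < 2.0)
```

Colored noise (e.g. pink noise) would fail the flat-spectrum check: its low band carries more power than its high band.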

  4. Reverberation • Impulse response of an average office

  5. Model of the Environment • Clean speech x[m] passes through a channel h[m] and picks up additive noise n[m]: y[m] = x[m] * h[m] + n[m]
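The environment model above can be sketched in a few lines; the signal lengths, the 3-tap channel, and the noise level are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.standard_normal(1000)         # "clean" source signal (toy)
h = np.array([1.0, 0.5, 0.25])        # short channel impulse response
n = 0.1 * rng.standard_normal(1002)   # additive noise

# Model of the environment: y[m] = x[m] * h[m] + n[m]
y = np.convolve(x, h) + n

print(y.shape)
```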

  6. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  7. Cepstral Mean Normalization • Compute the mean of the cepstrum and subtract it from the input • CMN is robust to channel distortion • Normalizes the average vocal tract or short filters • The average must include > 2 sec of speech
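A minimal CMN sketch on toy data; the constant offset stands in for a channel's cepstral contribution (a convolutional channel becomes additive in the cepstral domain, so subtracting the mean removes it):

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization: subtract the per-utterance mean
    of each cepstral coefficient, computed over all frames."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Toy example: 200 frames of 13-dim cepstra with a constant channel offset
rng = np.random.default_rng(2)
c = rng.standard_normal((200, 13)) + 3.0   # channel adds a fixed offset
c_norm = cmn(c)
print(np.allclose(c_norm.mean(axis=0), 0.0))   # mean removed
```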

  8. RASTA • CMN is a low-pass filter with rectangular window • Can use other low-pass filters too • RASTA filter is band-pass
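One common band-pass choice is the classic RASTA filter of Hermansky and Morgan, H(z) = 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴)/(1 − 0.98z⁻¹), applied to each feature trajectory over time. The slide gives no coefficients, so this particular filter is an assumption; a causal sketch:

```python
import numpy as np

def rasta_filter(traj):
    """Apply the classic RASTA band-pass IIR filter (assumed coefficients)
    H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
    to one feature trajectory (a single coefficient over time)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # FIR numerator taps
    a = 0.98                                           # pole of the integrator
    y = np.zeros_like(traj)
    for t in range(len(traj)):
        acc = sum(b[k] * traj[t - k] for k in range(5) if t - k >= 0)
        y[t] = acc + (a * y[t - 1] if t > 0 else 0.0)
    return y

# Band-pass behavior: a constant (DC) trajectory decays toward zero,
# which is what makes RASTA robust to fixed channel offsets, like CMN
out = rasta_filter(np.ones(500))
print(abs(out[-1]) < 1e-3)
```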

  9. Retrain with noisy data • Mismatches between training and testing are bad for pattern recognition systems • Retrain with noisy data • Approximation: add noise to clean data and retrain

  10. Multi-condition training • Very hard to predict exactly the type of noise we’ll encounter at test time • Too expensive to retrain the system for each noise condition • Train system offline with several noise types and levels

  11. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  12. Condenser Microphone • [Figure: condenser microphone with preamplifier: microphone impedance ZM, load resistance RL, output voltage v(t), gain G]

  13. Omnidirectional microphones • Polar response • [Figure: mic opening and diaphragm]

  14. Bidirectional microphones • Speech sound wave from the front, noise sound wave from the side • [Figure: source at distances r1 and r2 from openings at (–d, 0) and (d, 0), at angle θ]

  15. Bidirectional microphones • Response of a bidirectional microphone with d = 1 cm at 0° • Solid line corresponds to far-field conditions ( ) and the dotted line to near-field conditions ( )

  16. Unidirectional microphones Speech sound wave from the front Noise sound wave from the side

  17. Dynamic microphones • [Figure: diaphragm, magnet, coil, and output voltage]

  18. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  19. Acoustic Echo Cancellation • [Block diagram: the loudspeaker signal x[n] passes through the acoustic path H and reaches the microphone together with the speech signal s[n] and local noise r[n]; an adaptive filter driven by x[n] produces an echo estimate that is subtracted, yielding the error e[n]]

  20. Line echo cancellation • [Block diagram: one speaker's signal x[n] leaks through the hybrid circuit H; an adaptive filter estimates the echo from x[n] and subtracts it from the other speaker's signal s[n] plus noise r[n], yielding the error e[n]]

  21. Least Mean Squares (LMS) • Given input x[n] • Estimate output y[n] = w^T x[n] • Compute error e[n] = d[n] − y[n] • Update filter w ← w + μ e[n] x[n] • Need to tune the step size μ
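The four LMS steps above can be sketched as follows; the 3-tap echo path h and all signal values are made up for illustration:

```python
import numpy as np

def lms(x, d, order=8, mu=0.01):
    """Least Mean Squares adaptive filter.
    x: reference input, d: desired signal, mu: step size (must be tuned)."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for t in range(order, len(x)):
        u = x[t - order + 1 : t + 1][::-1]   # most recent samples first
        y = w @ u                             # estimate output
        e[t] = d[t] - y                       # compute error
        w += mu * e[t] * u                    # update filter weights
    return w, e

# Identify a known 3-tap echo path from white-noise input
rng = np.random.default_rng(3)
x = rng.standard_normal(20_000)
h = np.array([0.5, -0.3, 0.1])
d = np.convolve(x, h)[: len(x)]
w, e = lms(x, d, order=3, mu=0.01)
print(np.allclose(w, h, atol=0.01))
```

If mu is too large the weights diverge; if too small, convergence is slow. That is the tuning problem the next slide addresses.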

  22. Normalized LMS • Make the step size adaptive to ensure convergence: w ← w + (μ / Ex[n]) e[n] x[n] • where Ex[n] tracks the input energy
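A sketch of the normalized update; eps is a small regularizer (an assumption, not from the slides) that guards against division by zero when the input is silent:

```python
import numpy as np

def nlms(x, d, order=8, mu=0.5, eps=1e-6):
    """Normalized LMS: the step size is divided by the current input
    energy, keeping the update stable regardless of input level."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for t in range(order, len(x)):
        u = x[t - order + 1 : t + 1][::-1]
        e[t] = d[t] - w @ u
        w += (mu / (eps + u @ u)) * e[t] * u   # normalized step
    return w, e

# Same toy echo path as before, but with input scaled up 100x:
# plain LMS with a fixed mu would diverge, NLMS still converges
rng = np.random.default_rng(4)
x = 100.0 * rng.standard_normal(20_000)
h = np.array([0.5, -0.3, 0.1])
d = np.convolve(x, h)[: len(x)]
w, _ = nlms(x, d, order=3)
print(np.allclose(w, h, atol=0.01))
```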

  23. Recursive Least Squares (RLS) • Newton-Raphson iteration • New weights • Faster convergence, but more CPU intensive • [Figure: Newton-Raphson step from x0 to x1 on f(x)]

  24. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  25. Microphone arrays: delay & sum • 5 microphones (M−2 … M2) spaced 5 cm apart, source S at 5 m, angle 0° • [Figure: array response at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]

  26. Microphone arrays: delay & sum • 5 microphones (M−2 … M2) spaced 5 cm apart, source S at 5 m, angle 30° • [Figure: array response at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]
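The responses shown on these two slides can be recomputed from the far-field delay-&-sum formula. Broadside steering and the speed of sound are assumptions here:

```python
import numpy as np

C = 343.0   # speed of sound in air (m/s), assumed

def delay_sum_response(freq, angle_deg, n_mics=5, spacing=0.05):
    """Far-field magnitude response of a delay-&-sum beamformer
    steered at broadside (0 deg), for a plane wave arriving at angle_deg.
    Each mic's phase comes from its position along the array axis."""
    m = np.arange(n_mics) - (n_mics - 1) / 2   # mic index, centered on M0
    phase = 2 * np.pi * freq * m * spacing * np.sin(np.radians(angle_deg)) / C
    return abs(np.exp(1j * phase).sum()) / n_mics

# An on-axis source passes at unity gain at any frequency;
# an off-axis source is attenuated, and more so at higher frequencies
on_axis = delay_sum_response(4400.0, 0.0)
off_axis = delay_sum_response(4400.0, 60.0)
print(round(on_axis, 3), off_axis < on_axis)
```

At 400 Hz the 20 cm aperture is small relative to the wavelength, so the attenuation of off-axis sources is weak, matching the flat low-frequency patterns in the figures.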

  27. WITTY: Who Is Talking To You?

  28. Bone microphone for noise robust ASR • Conventional microphones are sensitive to noise • Bone microphones are more noise resistant, but distort the signal • Not enough data to retrain recognizer with bone microphone • Fusion between acoustic microphone and bone microphone

  29. Acoustic Microphone

  30. Bone Microphone

  31. Microphone fusion

  32. Relationship between acoustic mic and bone mic • [Figure: acoustic vs. contact (bone) microphone signals]

  33. Relationship between acoustic mic and bone mic

  34. WITTY: Who is talking to you?

  35. Blind source separation • Linear mixing • Estimate filter • Separate signals • Using assumption signals are independent • Do gradient descent:

  36. Blind source separation • [Block diagram: sources y1[n] and y2[n] mixed through filters h11[n], h12[n], h21[n], h22[n] to produce z1[n] and z2[n]] • Idea: estimate filters h11[n] and h12[n] that maximize p(z1[n] | λ) where λ is an HMM • Approximate the HMM by a Gaussian mixture model with LPC parameters => EM algorithm with a linear set of equations

  37. Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression

  38. Spectral subtraction • Corrupted signal y[m] = x[m] + n[m] • Power spectrum |Y(f)|² = |X(f)|² + |N(f)|² + 2 Re{X(f)N*(f)}, but E{X(f)N*(f)} = 0, so |Y(f)|² ≈ |X(f)|² + |N(f)|² • Estimate the noise power spectrum |N̂(f)|² from noise-only frames • Estimate the clean power spectrum as |X̂(f)|² = |Y(f)|² − |N̂(f)|²

  39. Spectral subtraction • Keep the original (noisy) phase when resynthesizing • Ensure the power estimate is positive (floor it before taking the square root)
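A single-frame sketch combining the last two slides (subtract, floor, keep the noisy phase); the flooring constant and the test signals are illustrative choices:

```python
import numpy as np

def spectral_subtract(y_frame, noise_psd, floor=0.01):
    """Single-frame spectral subtraction: subtract the estimated noise
    power spectrum, floor the result so it stays positive, and rebuild
    the signal using the original (noisy) phase."""
    Y = np.fft.rfft(y_frame)
    power = np.abs(Y) ** 2 - noise_psd            # subtract noise power
    power = np.maximum(power, floor * noise_psd)  # ensure it's positive
    X_hat = np.sqrt(power) * np.exp(1j * np.angle(Y))  # keep noisy phase
    return np.fft.irfft(X_hat, n=len(y_frame))

# Toy usage: a sinusoid in white noise; the noise PSD is estimated
# from a separate noise-only segment, as the slides suggest
rng = np.random.default_rng(5)
N = 512
t = np.arange(N)
clean = np.sin(2 * np.pi * 0.05 * t)
noise = 0.5 * rng.standard_normal(N)
noise_psd = np.abs(np.fft.rfft(0.5 * rng.standard_normal(N))) ** 2
enhanced = spectral_subtract(clean + noise, noise_psd)
print(enhanced.shape)
```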

  40. Aurora2 • ETSI STQ group • TIDigits • Added noise at SNRs: −5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB • Set A: subway, babble, car, exhibition • Set B: restaurant, airport, street, station • Set C: one noise from set A and one noise from set B • Aurora 3 recorded in a car (no digital mixing!) • Aurora 4 for large vocabulary • Advanced Front-End (AFE) standard (2001) uses a variant of spectral subtraction

  41. Aurora 2 (Clean training) Using SPLICE algorithm

  42. Aurora 2 (multi-condition training) Using SPLICE algorithm

  43. Wiener Filtering • Find linear estimate of clean signal • MMSE (Minimum Mean Squared Error) • Wiener-Hopf equation • In Freq domain • If noise and signal are uncorrelated

  44. Wiener Filtering • Find linear estimate of clean signal • If noise and signal are uncorrelated • With • Compare with Spectral Subtraction
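For uncorrelated signal and noise, the frequency-domain Wiener gain reduces to Sx(f)/(Sx(f) + Sn(f)); a minimal sketch:

```python
import numpy as np

def wiener_gain(signal_psd, noise_psd):
    """Frequency-domain Wiener filter H(f) = Sx(f) / (Sx(f) + Sn(f)),
    valid when signal and noise are uncorrelated."""
    return signal_psd / (signal_psd + noise_psd)

# The gain approaches 1 where the signal dominates
# and 0 where the noise dominates
H_high_snr = wiener_gain(signal_psd=100.0, noise_psd=1.0)
H_low_snr = wiener_gain(signal_psd=1.0, noise_psd=100.0)
print(round(H_high_snr, 3), round(H_low_snr, 3))
```

Compare with spectral subtraction: both attenuate low-SNR frequency bins, but the Wiener gain follows from the MMSE criterion rather than from subtracting power spectra directly.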

  45. Spectral Subtraction

  46. Vector Taylor Series (VTS) • Acero, Moreno • The power spectrum, on the average • Taking logs • Cepstrum is DCT (matrix C) of log power spectrum

  47. Vector Taylor Series (VTS) • x, h, and n are Gaussian random vectors with means μx, μh, and μn and covariance matrices Σx, Σh, and Σn • Expand y in a first-order Taylor series
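A standard way to write the VTS model is sketched below; the exact equations on the slide did not survive extraction, so this form (from the usual Acero/Moreno formulation) is an assumption. C is the DCT matrix from the previous slide:

```latex
% Corrupted cepstrum as a nonlinear function of clean speech x,
% channel h, and noise n:
y = x + h + C \log\left( 1 + \exp\left( C^{-1}\,(n - x - h) \right) \right)

% First-order Taylor expansion around the means:
y \approx \mu_x + \mu_h + g(\mu_n - \mu_x - \mu_h)
        + (I - G)\,(x - \mu_x) + (I - G)\,(h - \mu_h) + G\,(n - \mu_n)

% with g(z) = C \log(1 + \exp(C^{-1} z)) and
% G = \partial y / \partial n evaluated at the means.
```

Because the expansion is linear in x, h, and n, the corrupted cepstrum y stays (approximately) Gaussian, which is what makes the model tractable.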

  48. Vector Taylor Series • Distribution of corrupted log-spectra • Noise with mean of 0 dB and std dev of 2 dB • Speech with mean of 25 dB • Monte Carlo simulation • Std dev: 25 dB, 10 dB, 5 dB

  49. Phase matters • Corrupted signal y[m] = x[m] + n[m] • Spectrum Y(f) = X(f) + N(f), but |Y(f)|² ≈ |X(f)|² + |N(f)|² is only an approximation: the cross-term depends on the relative phase of speech and noise

  50. Non-stationary noise • Speech/noise decomposition (Varga et al.) • [Figure: a speech HMM and a noise HMM jointly accounting for the observations]
