
Deep Neural Networks for Supervised Speech Separation



  1. Deep Neural Networks for Supervised Speech Separation DeLiang Wang Perception & Neurodynamics Lab Ohio State University

  2. Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts

  3. Auditory masking phenomenon • Definition: “The process by which the threshold of audibility for one sound is raised by the presence of another (masking) sound” (American Standards Association, 1960) • A basic phenomenon in auditory perception • Roughly speaking, a strong sound masks a weaker one within a critical band

  4. Ideal binary mask as a separation goal • Motivated by the auditory masking phenomenon and auditory scene analysis, we suggested the ideal binary mask as a main goal of CASA (Hu & Wang, 2001; 2004) • The idea is to retain the parts of a mixture where the target sound is stronger than the acoustic background, and discard the rest • Definition of the ideal binary mask (IBM): IBM(t, f) = 1 if SNR(t, f) > θ, and 0 otherwise • θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB • Optimal SNR: under certain conditions the IBM with θ = 0 dB is the optimal binary mask in terms of SNR gain (Li & Wang, 2009) • Maximal articulation index (AI) in a simplified version (Loizou & Kim, 2011) • Note that the IBM is defined from the premixed target and noise; it does not actually separate the mixture!
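The IBM definition can be sketched directly on premixed magnitude spectrograms (a minimal numpy illustration; the function and variable names are ours, not from the talk):

```python
import numpy as np

def ideal_binary_mask(target_spec, noise_spec, lc_db=0.0):
    """Compute the IBM from premixed target and noise magnitude spectrograms.

    A T-F unit is kept (mask = 1) when its local SNR exceeds the local
    criterion LC (in dB), and discarded (mask = 0) otherwise.
    """
    eps = 1e-12
    local_snr_db = 20.0 * np.log10((target_spec + eps) / (noise_spec + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Toy example: 2 time frames x 3 frequency channels
target = np.array([[1.0, 0.5, 0.1],
                   [0.8, 0.2, 0.9]])
noise = np.array([[0.5, 0.5, 1.0],
                  [0.1, 0.9, 0.3]])
mask = ideal_binary_mask(target, noise, lc_db=0.0)
# T-F units where the target dominates (SNR > 0 dB) are kept
```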

  5. IBM illustration

  6. Subject tests of ideal binary masking • IBM separation leads to large speech intelligibility improvements • Improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners (Brungart et al.’06; Li & Loizou’08; Cao et al.’11; Ahmadi et al.’13), and above 9 dB for hearing-impaired (HI) listeners (Anzalone et al.’06; Wang et al.’09) • Improvement for modulated noise is significantly larger than for stationary noise • With the IBM as the goal, the speech separation problem becomes a binary classification problem • This new formulation opens the problem to a variety of pattern classification methods

  7. Speech perception of noise with binary gains • Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e., the mixture contains noise only, with no target speech) • Demo: IBM-modulated speech-shaped noise

  8. DNN as subband classifier (Wang & Wang’13) • Why deep neural network (DNN)? • Automatically learns more abstract features as the number of layers increases • More abstract features tend to be more invariant to superficial variations • Y. Wang and Wang (2013) first introduced DNN to address the speech separation problem • DNN is used as a subband classifier, performing feature learning from raw acoustic features

  9. DNN as subband classifier (Wang & Wang’13)

  10. Extensive training with DNN • Training on 200 randomly chosen utterances from both male and female IEEE speakers, mixed with 100 environmental noises (Hu’04) at 0 dB (~17 hours long) • Six million fully dense training samples in each channel, with 64 channels in total • Evaluated on 20 unseen speakers mixed with 20 unseen noises at 0 dB

  11. DNN-based separation results • Comparisons with a representative speech enhancement algorithm (Hendriks et al. 2010) • Using clean speech as ground truth, on average about 3 dB SNR improvements • Using IBM separated speech as ground truth, on average about 5 dB SNR improvements

  12. Speech intelligibility evaluation • We subsequently tested speech intelligibility of hearing-impaired listeners (Healy et al.’13) • A very challenging problem: “The interfering effect of background noise is the single greatest problem reported by hearing aid wearers” (Dillon’12) • Two-stage DNN training to incorporate T-F context in classification

  13. An illustration A HINT sentence mixed with speech-shaped noise at -5 dB SNR

  14. Results and sound demos • Both HI and NH listeners showed intelligibility improvements • HI subjects with separation outperformed NH subjects without separation

  15. Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts

  16. On training targets (Y. Wang et al.’14) • What supervised training aims to learn is important for speech separation or enhancement • Different training targets lead to different mapping functions from noisy features to separated speech • Different targets may have different levels of generalization • While the IBM was the first target used in supervised separation, the quality of separated speech is a persistent issue • What are reasonable alternative targets? • Target binary mask (TBM) (Kjems et al.’09; Gonzalez & Brookes’14) • Ideal ratio mask (IRM) (Srinivasan et al.’06; Narayanan & Wang’13; Hummersone et al.’14) • STFT spectral magnitude (Xu et al.’14; Han et al.’14)

  17. Different training targets • TBM: similar to the IBM except that the interference is fixed to speech-shaped noise (SSN) • IRM: IRM(t, f) = (S²(t, f) / (S²(t, f) + N²(t, f)))^β, where S² and N² denote the speech and noise energy in each T-F unit • β is a tunable parameter, and a good choice is 0.5 • With β = 0.5, the IRM becomes a square-root Wiener filter, which is the optimal estimator of the power spectrum
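The IRM computation can be sketched as follows (an illustrative numpy snippet; names are ours):

```python
import numpy as np

def ideal_ratio_mask(target_pow, noise_pow, beta=0.5):
    """IRM(t, f) = (S^2 / (S^2 + N^2))^beta on T-F power spectra.

    With beta = 0.5 this is the square-root Wiener filter.
    """
    return (target_pow / (target_pow + noise_pow + 1e-12)) ** beta

target_pow = np.array([4.0, 1.0, 0.0])   # speech power per T-F unit
noise_pow = np.array([0.0, 1.0, 4.0])    # noise power per T-F unit
irm = ideal_ratio_mask(target_pow, noise_pow)
# Equal speech and noise power gives a mask value of sqrt(0.5) ≈ 0.707
```

Unlike the IBM, the IRM varies smoothly in [0, 1], which is one reason ratio masking tends to yield better speech quality.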

  18. Different training targets (cont.) • Gammatone frequency power spectrum (GF-POW) • FFT-MAG of clean speech • FFT-MASK
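Assuming FFT-MASK denotes the ratio of clean to noisy STFT magnitudes (the spectral magnitude mask of Y. Wang et al.’14), its contrast with FFT-MAG can be illustrated as (toy numpy values, not real spectra):

```python
import numpy as np

# Hypothetical clean and noisy STFT magnitudes for a few T-F units.
clean_mag = np.array([2.0, 0.5, 1.0])
noisy_mag = np.array([2.5, 1.5, 1.0])

# FFT-MAG target: the clean magnitude itself (direct spectral mapping).
fft_mag_target = clean_mag

# FFT-MASK target: ratio of clean to noisy magnitude, applied
# multiplicatively at test time; often clipped to limit dynamic range.
fft_mask_target = np.clip(clean_mag / (noisy_mag + 1e-12), 0.0, 10.0)

# Applying the estimated mask to the noisy magnitude recovers
# the clean magnitude (exactly so for the ideal mask).
recovered = fft_mask_target * noisy_mag
```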

  19. Illustration of various training targets Factory noise at -5 dB

  20. Evaluation methodology • Learning machine: DNN with 3 hidden layers, each with 1024 units • Target speech: TIMIT • Noises: SSN + four nonstationary noises from NOISEX • Training and testing on different segments of each noise • Trained at -5 and 0 dB, and tested at -5, 0, and 5 dB • Evaluation metrics • STOI: standard metric for predicted speech intelligibility • PESQ: standard metric for perceptual speech quality • SNR
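STOI and PESQ require their standard reference implementations, but the SNR metric is simple enough to sketch directly (illustrative numpy code; function names are hypothetical):

```python
import numpy as np

def output_snr_db(clean, estimate):
    """Output SNR in dB: clean-signal energy over residual-error energy."""
    err = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(err ** 2) + 1e-12))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                     # 1 s of "speech" at 16 kHz
estimate = clean + 0.1 * rng.standard_normal(16000)    # small residual error
snr = output_snr_db(clean, estimate)                   # roughly 20 dB here
```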

  21. Comparisons • Comparisons among different training targets • Trained and tested on each noise, but different segments • An additional comparison for the IRM target with multi-condition (MC) training of all noises (MC-IRM) • Comparisons with different approaches • Speech enhancement (Hendriks et al.’10) • Supervised NMF: ASNA-NMF (Virtanen et al.’13), trained and tested in the same way as supervised separation

  22. STOI comparison for factory noise [Chart: spectral mapping vs. NMF & speech enhancement (SPEH) vs. masking]

  23. PESQ comparison for factory noise [Chart: spectral mapping vs. NMF & speech enhancement (SPEH) vs. masking]

  24. Summary among different targets • Supervised separation yields substantial improvements over unprocessed mixtures • Among the two binary masks, IBM estimation performs better in PESQ than TBM estimation • Ratio masking tends to perform better than binary masking, mainly for speech quality • IRM, FFT-MASK, and GF-POW produce comparable PESQ results • FFT-MASK is better than FFT-MAG for estimation • Many-to-one mapping in FFT-MAG vs. one-to-one mapping in FFT-MASK, and the latter should be easier to learn • Estimation of spectral magnitudes or their compressed version tends to magnify estimation errors

  25. Generalization to new noise segments • While previous speech intelligibility results are impressive, a major limitation is that training and test noise samples were drawn from the same noise segments • Speech utterances were different • Noise samples were randomized • We have recently addressed this limitation through extensive training (Healy et al.’15) • IRM estimation rather than IBM estimation • Frame-level estimation rather than subband classification • Training on the first 8 minutes of two nonstationary noises (20-talker babble and cafeteria noise) and test on the last 2 minutes of the noises • Noise perturbation is used to enrich noise samples for training • See Chen et al.’s poster this afternoon
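The specific perturbation techniques are in Chen et al.’s poster; purely as a hypothetical illustration, enriching a fixed noise recording might look like random cropping with gain jitter and occasional time reversal:

```python
import numpy as np

def perturb_noise(noise, out_len, rng):
    """Draw an enriched noise sample: random crop, random gain, random reversal.

    A hypothetical, minimal stand-in for the noise-perturbation
    techniques referenced in the talk (Chen et al.).
    """
    start = rng.integers(0, len(noise) - out_len)
    gain = rng.uniform(0.5, 2.0)          # random level jitter
    segment = noise[start:start + out_len] * gain
    if rng.random() < 0.5:                # occasionally reverse time
        segment = segment[::-1]
    return segment

rng = np.random.default_rng(1)
noise = rng.standard_normal(480000)          # 30 s of noise at 16 kHz
sample = perturb_noise(noise, 16000, rng)    # one 1-s training sample
```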

  26. DNN based IRM estimation

  27. Results and demos • HI listeners showed intelligibility improvements with both noises at both SNRs • NH listeners showed intelligibility improvements for babble noise, but not for cafeteria noise

  28. Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts

  29. Time-domain signal reconstruction • Current supervised separation learns to estimate either an ideal mask or spectral magnitudes • An intermediate stage: after estimation, speech is resynthesized in a separate step along with noisy phase • Can we use DNN to directly reconstruct the clean time-domain speech signal? • i.e., using time-domain clean speech as the training target? • Direct reconstruction of the time-domain signal can potentially lead to improved perceptual quality • Intelligibility is largely coded by spectral envelopes (Plomp’02) • It is assumed that quality is coded in temporal fine structure

  30. Background analysis • Touched upon before by Tamura and Waibel (1988), who used a (tiny) MLP to predict the clean waveform from the noisy waveform • We will show that directly predicting the clean waveform this way does not work well • Incorporate domain knowledge into network training. We know that: • T-F masking is effective for suppressing background noise • Frequency-domain to time-domain conversion can be done by inverse transforms • We propose a new DNN architecture that unifies T-F masking and speech resynthesis (Y. Wang & Wang’15)

  31. DNN architecture [Diagram: T-F masking followed by speech resynthesis]

  32. DNN architecture (cont.) • Introducing a masking layer and an inverse transform layer • We perform masking in the DFT domain, so the resynthesis is conveniently implemented as an inverse DFT • Both forward and backward pass can be written as matrix operations, enabling efficient backprop training • Key differences from standard mask estimators • Speech resynthesis and masking are jointly trained • T-F mask is not predefined (e.g. IRM), but rather automatically learned in light of the loss function of interest, i.e. task-dependent masking • Phase information is explicitly used in learning, and one could use enhanced phase instead of noisy phase
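The masking-plus-inverse-DFT idea can be sketched in numpy (an illustrative sketch, not the authors’ implementation; the mask here is a fixed placeholder where the network would produce a learned, task-dependent one):

```python
import numpy as np

rng = np.random.default_rng(2)
frame = rng.standard_normal(320)    # one 20 ms frame at 16 kHz (assumed sizes)
spec = np.fft.rfft(frame)           # noisy DFT coefficients (complex, keeps phase)

# The "masking layer": one gain per frequency bin. In the network this is
# a learned output; here it is just a constant placeholder.
mask = np.full(len(spec), 0.8)

# Masking then inverse DFT. Both steps are linear in the real and
# imaginary parts, i.e. matrix operations, so backprop flows through them.
masked_frame = np.fft.irfft(mask * spec, n=len(frame))

# Sanity check: an all-ones mask reconstructs the frame exactly.
identity = np.fft.irfft(np.ones(len(spec)) * spec, n=len(frame))
```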

  33. DNN training • Input feature: a complementary feature set (Wang et al.’13) plus gammatone filterbank energies • Training target: windowed clean waveform in each frame • The standard mean squared error between the estimated and clean signal is used as the loss function • In testing, we use the trained network to directly predict the windowed clean waveform snippets in each frame, which are overlap-added to produce the final time-domain reconstruction of the entire utterance
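The overlap-add resynthesis described above can be sketched as follows (a minimal numpy illustration; frame length, hop, and window choice are assumptions, not the talk’s exact settings):

```python
import numpy as np

def overlap_add(frames, hop):
    """Reassemble per-frame waveform predictions into one utterance."""
    frame_len = frames.shape[1]
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f
    return out

# Toy check: Hann-windowed frames at 50% overlap approximately satisfy the
# constant-overlap-add property, so a constant signal is reconstructed.
frame_len, hop = 320, 160
window = np.hanning(frame_len)
frames = np.stack([window * 1.0 for _ in range(10)])  # 10 "predicted" frames
signal = overlap_add(frames, hop)
# Interior samples of `signal` are close to 1.0
```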

  34. Experimental setting • Use the TIMIT corpus • Randomly choose 2000 utterances for training • Test on the TIMIT core test set (192 utterances) • Speaker-independent setting • Four non-stationary noises • Factory, 100-talker babble, engine, and operation room (called “oproom”) noises from NOISEX-92 • Train on first half of the noise, and test on the second half • Each noise is 4 minutes long • Training on -5 and 0 dB, and test on -5, 0 and 5 dB

  35. Experimental setting • Comparisons among: • IBM-DNN: DNN based IBM estimator • IRM-DNN: DNN based IRM estimator • IFFT-DNN: the proposed system • DT-DNN: use a standard DNN to predict clean waveform from noisy waveform • ASNA-NMF • DT-DNN uses noisy waveform as input feature, while the other DNN systems use the same combined feature set mentioned before • All DNNs use 3 hidden layers, each having 1024 rectified linear units • We use a composite measure OVRL (Hu & Loizou’08), ranging from 1 to 5, to measure objective quality, and STOI

  36. Visualization • Masking layer activations obtained on a TIMIT utterance mixed with the factory noise at 5 dB • The learned activations look like a ratio mask

  37. Results on -5 dB test set • IFFT-DNN significantly outperforms IRM-DNN in OVRL (e.g. engine noise), and produces slightly worse or comparable STOI • DT-DNN does not work well; its STOI scores are even worse than those of the unprocessed mixtures • ASNA-NMF works reasonably well, but is significantly worse than the proposed system

  38. Results on 5 dB test set • Intelligibility is not a concern in relatively high SNR conditions • IFFT-DNN produces similar STOI results to IRM-DNN, while still achieving better OVRL results

  39. Demos With 5 dB cafeteria noise: Mixture, IRM-DNN, IFFT-DNN, Clean

  40. Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts

  41. Joint binaural and monaural classification for reverberant speech separation • Speech separation has traditionally used binaural cues • CASA, ICA, beamforming • We recently use binaural (ITD and ILD) and monaural (GFCC) features to train a DNN classifier to estimate the IBM (Jiang et al.’14) • DNN-based classification produces excellent results in low SNR, reverberation, and different spatial configurations • With reasonable coverage of training configurations, the approach generalizes well to untrained configurations • The inclusion of monaural features improves separation performance when target and interference are close, potentially overcoming a major hurdle in beamforming and other spatial filtering methods

  42. Special session this afternoon: DNN for supervised speech separation/enhancement • Gao et al. • Incorporating a DNN-based voice activity detector to provide frame-level speech presence probabilities, used to weight speech estimates • Substantial separation performance improvements at low SNRs • Weninger et al. • Recurrent neural networks (long short-term memory) for speech enhancement • Integration with an ASR backend leads to the state-of-the-art robust ASR results on the CHiME-2 corpus • Kim & Smaragdis • Improving already trained DNNs in unknown environments • Stacking an AutoEncoder on top of a DenoisingAutoEncoder to fine-tune the latter for adaptive separation • Chen et al. • Noise perturbation techniques to enlarge noise samples for DNN training, which leads to better separation performance

  43. DNN or supervised learning for related tasks • Dereverberation, and joint dereverberation and denoising (Han et al.’14; ‘15) • Multipitch tracking (Lee and Ellis’12, Huang & Lee’13, Han & Wang’14) • Voice activity detection (e.g. Zhang et al.’13) • SNR estimation (Papadopoulos et al.’14)

  44. Conclusion • Formulation of separation as classification or mask estimation enables the use of supervised learning • Extensive training with DNN is a promising direction • This approach has yielded the first demonstrations of speech intelligibility improvement in noise • Ratio masking should be preferred over binary masking for better speech quality • Mask estimation is preferred over spectral mapping • New DNN architecture to directly reconstruct time-domain clean signal • Joint training of a masking layer and an inverse DFT layer leads to better speech quality
