Explore computational auditory scene analysis, sparse decomposition, statistical blind source separation, and beamforming as approaches to address speech separation challenges in noisy car environments.
Approaches of Interest in Blind Source Separation of Speech. Julien Bourgeois, DAIMLERCHRYSLER AG, Research and Technology, RIC/AD
Background - Need for a speech-based human-machine interface in cars. - Road noise and passengers' speech create adverse conditions for Automatic Speech Recognition.
4 Approaches to the Cocktail Party Problem: 1 - Computational Auditory Scene Analysis (CASA) 2 - Sparse Decomposition Approach 3 - Statistical Blind Source Separation 4 - Beamforming. Conclusion & Future plans
Computational Auditory Scene Analysis (CASA) - Generalities Aim: obtain an algorithmic description of higher auditory functions. Strong biological inspiration. One or two sensors (microphones) are considered. The microphone signal is filtered as in the human ear. Variations on a segmentation-grouping scheme.
CASA - Segmentation [Figure: time-frequency plot; axes: Time, Frequency Index] Segmentation is based on temporal continuity.
CASA - Grouping [Figure: time-frequency plot; axes: Time, Frequency Index] Grouping rules are (1) harmonicity and (2) synchronous start or end. These rules agree with certain psychoacoustical phenomena.
CASA - Audio example mixture separated
Sparse Decomposition - Generalities Two sensor signals x1 and x2, each a mixture of N acoustic sources s_i, are given. Aim: find an invertible transform T such that the N sources are disjoint in the transformed domain. DUET: T = STFT (windowed Short-Time Fourier Transform) works. Indeed, statistically the product S1(w,t)S2(w,t) is small.
Sparse Decomposition - DUET Assumption: “At each point (w,t) of the spectrogram, only one source is active.” Which source S_i is active at (w,t)? Look at the phase between X1(w,t) and X2(w,t): Angle(X1(w,t)/X2(w,t))/w [group delay]. [Figure: spectrogram (axes: Time, Frequency Index) with points labelled Group delay 1 and Group delay 2] Then set S_i(w,t) = X1(w,t).
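The assignment rule on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: an anechoic two-microphone model in which source i reaches the second microphone with a relative delay d_i, so that Angle(X1/X2)/w = d_i wherever only source i is active. The function names and the list `source_delays` of candidate delays are hypothetical; the full DUET method estimates the delays from the data and also uses the attenuation between channels, which is not shown here.

```python
import cmath


def phase_delay(x1, x2, omega):
    """Relative delay at one time-frequency point: Angle(X1/X2) / w."""
    return cmath.phase(x1 / x2) / omega


def duet_assign(X1, X2, omegas, source_delays):
    """Assign each (w,t) point of X1 to the source with the closest delay.

    X1, X2: lists indexed [frequency][time] of complex STFT coefficients.
    omegas: angular frequency of each frequency index (nonzero).
    source_delays: one assumed (hypothetical) relative delay per source.
    Returns one masked copy of X1 per source.
    """
    n_src = len(source_delays)
    out = [[[0j] * len(row) for row in X1] for _ in range(n_src)]
    for k, omega in enumerate(omegas):
        for t, (a, b) in enumerate(zip(X1[k], X2[k])):
            if b == 0:
                continue  # no phase information at this point
            d = phase_delay(a, b, omega)
            # Pick the source whose delay is nearest to the measured one.
            i = min(range(n_src), key=lambda j: abs(d - source_delays[j]))
            out[i][k][t] = a  # "Then set S_i(w,t) = X1(w,t)"
    return out
```

The estimate is only unambiguous while |w d_i| stays below pi, which limits the usable microphone spacing at high frequencies.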
Sparse Decomposition - Audio Example Mix 1 Mix 2 Out 1 Out 2
Statistical Blind Source Separation Assumption: “The sources are decorrelated.” or “The sources are independent.” ICA = Independent Component Analysis. Generally needs at least as many sensors as sources. Permutation and scale ambiguities: if s1 and s2 are independent, so are s2 and b s1 (sources swapped, s1 rescaled by a factor b).
Statistical Blind Source Separation Mixture model: x(n) = A(0)s(n) + ... + A(K)s(n-K) = (A*s)(n). In the time-frequency (TF) domain: X(w,t) = A(w)S(w,t). Separation filters W: find W(w) so that the components of Y(w,t) = W(w)X(w,t) are independent or decorrelated (Y estimates the sources S). For a decorrelation criterion, the output Y is decorrelated at each t. One can find W by minimizing the off-diagonal terms of R_YY(w,t) = E[Y(w,t) Y^H(w,t)] jointly for all t.
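The decorrelation criterion can be written down as a cost function. The sketch below is a minimal illustration for one frequency bin, not the optimization itself: it estimates R_YY from a block of frames and sums the squared magnitudes of its off-diagonal terms, then adds this up over several blocks to express "jointly for all t". The function names and the use of the squared magnitude as the penalty are assumptions made for this example.

```python
def apply_filter(W, X):
    """Y(t) = W X(t) for each frame; W is n x n, X is a list of length-n frames."""
    n = len(W)
    return [[sum(W[i][j] * x[j] for j in range(n)) for i in range(n)] for x in X]


def covariance(Y):
    """Sample estimate of R_YY = E[Y Y^H] over the frames of one block."""
    n = len(Y[0])
    return [
        [sum(y[i] * y[j].conjugate() for y in Y) / len(Y) for j in range(n)]
        for i in range(n)
    ]


def off_diagonal_cost(R):
    """Sum of squared magnitudes of the off-diagonal terms of R."""
    n = len(R)
    return sum(abs(R[i][j]) ** 2 for i in range(n) for j in range(n) if i != j)


def joint_cost(W, blocks):
    """Off-diagonal cost of R_YY summed over several time blocks."""
    return sum(off_diagonal_cost(covariance(apply_filter(W, X))) for X in blocks)
```

A separation algorithm would minimize `joint_cost` over W at each frequency bin. The cost is unchanged when the rows of W are permuted or rescaled, which is where the permutation and scale ambiguities come from.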
Statistical Blind Source Separation Very few assumptions about the sources. But: in the frequency domain, the permutation and scale ambiguities occur independently at each frequency bin w. Can be CPU-expensive because of iterative optimization.
Statistical Blind Source Separation Audio example Mix 1 Mix 2 Out 1 Out 2
Beamforming - Array signal processing Spatial locations of the sources (direction of arrival, D.O.A.) are mapped to delays between sensors. Array signal processing addresses 3 estimation problems: 1) number of sources, 2) their spatial locations, 3) spatial filtering. Can require more sensors than sources, depending on the spatial resolution. [Figure: sources s1, s2 and sensors x1, ..., xi, ..., xN] x_i(t) = s1(t - d_{1,i}) + s2(t - d_{2,i})
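How a direction of arrival maps to delays depends on the array geometry, which the slide does not specify. As a hedged illustration only, the sketch below assumes a far-field source (plane wave) and microphones placed along a straight line, with the angle measured from broadside. The function name, the sign convention and the speed-of-sound constant are assumptions of this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 degrees C


def far_field_delays(mic_positions, doa_deg, c=SPEED_OF_SOUND):
    """Arrival delay (s) at each microphone relative to the array origin.

    mic_positions: positions (m) along the array axis.
    doa_deg: direction of arrival in degrees from broadside; a positive
             angle means the source lies toward positive positions.
    A negative delay means the wavefront reaches that microphone earlier
    than it reaches the origin.
    """
    theta = math.radians(doa_deg)
    return [-p * math.sin(theta) / c for p in mic_positions]
```

For example, `far_field_delays([0.0, 0.1, 0.2], 30.0)` gives the delays d_i that would appear in x_i(t) = s(t - d_i) for a single source under these assumptions.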
Beamforming - Source Location 1/ Energy-based: search for the delays d_i that maximize y^2, where y(t) = x1(t+d1) + ... + xN(t+dN) [output of a delay-sum beamformer]. 2/ Correlation-based: search for the delay d that maximizes E[x_i(t) x_j(t-d)], for some pairs (i,j). 3/ High resolution: X(w,t) = A(w)S(w,t). The eigendecomposition of R_XX = A R_SS A^H (R_SS is diagonal if the sources are decorrelated) provides information on A, i.e. on the source location.
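The correlation-based search (method 2) can be sketched for one sensor pair as below. It is a minimal illustration that assumes integer sample delays and replaces the expectation by a time average over the available samples; the function names and the `max_lag` search range are hypothetical.

```python
def cross_correlation(xi, xj, d):
    """Time-average estimate of E[xi(t) xj(t-d)] for an integer lag d."""
    total, count = 0.0, 0
    for t in range(len(xi)):
        if 0 <= t - d < len(xj):
            total += xi[t] * xj[t - d]
            count += 1
    return total / count if count else 0.0


def estimate_lag(xi, xj, max_lag):
    """Lag d in [-max_lag, max_lag] that maximizes E[xi(t) xj(t-d)]."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda d: cross_correlation(xi, xj, d))
```

If xi is a copy of xj delayed by three samples, `estimate_lag(xi, xj, 10)` is expected to return 3 for a broadband signal; periodic signals or reverberation can produce competing peaks.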
Beamforming - Spatial Filtering [Figure: sensor signals x1, ..., xi, ..., xN are delayed by d1, ..., di, ..., dN, filtered by F1, ..., Fi, ..., FN and summed; the direction of interest is marked] 1/ Data-independent: e.g. delay-sum beamforming. 2/ Statistically optimal: constrain the response in the direction of interest and minimize the output power.
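The data-independent case can be sketched as a delay-sum beamformer. This is an illustration only: it assumes integer sample delays, treats samples outside the record as zero, and divides by the number of sensors (a normalization not present in the slide's formula). The function names are hypothetical.

```python
def delay_and_sum(signals, delays):
    """y(t) = (1/N) * sum_i x_i(t + d_i) for integer steering delays d_i.

    signals: list of N equal-length sensor signals.
    delays: one integer delay (in samples) per sensor.
    Samples outside the recorded range are treated as zero.
    """
    n = len(signals[0])
    y = []
    for t in range(n):
        acc = 0.0
        for x, d in zip(signals, delays):
            if 0 <= t + d < n:
                acc += x[t + d]
        y.append(acc / len(signals))
    return y


def power(y):
    """Average output power, the quantity an energy-based search maximizes."""
    return sum(v * v for v in y) / len(y)
```

Steering with the delays that match a source's propagation delays aligns its contributions so they add coherently, while signals from other directions add incoherently and are attenuated. Scanning `power(delay_and_sum(signals, d))` over candidate delays is the energy-based source location search.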
Beamforming - Audio example Mix 1 Mix 2 Out 1 Out 2
Conclusion & Questions Different definitions of “source”. Perceptual, Topological, Statistical, Spatial: complementary approaches. No perfect solution to the cocktail party problem.
Future plans in Hoarse Combination of existing methods: DUET if the sources are disjoint, ICA or beamforming if they overlap. Investigation of specific open questions: Estimation of the number of sources at each (w,t) point. Sparse Decomposition: optimal transform T? Extension to more than 2 mics? Theoretical boundaries? Equivalences between these approaches (e.g. second-order BSS and beamforming)?
Short Bibliography CASA Guy J Brown, Martin Cooke. Computational Auditory Scene Analysis. Computer Speech and Language, vol. 8, no. 4, pp. 297-336, 1994. A. S. Bregman. “Auditory Scene Analysis”, MIT Press, Cambridge, MA, 1990. Guoning Hu and DeLiang Wang, Monaural speech separation, NIPS 2002
Short Bibliography Sparse Decomposition - DUET M. Zibulevsky, B. A. Pearlmutter, P. Bofill, and P. Kisilev, "Blind Source Separation by Sparse Decomposition", chapter in: S. J. Roberts and R. M. Everson (Eds.), Independent Component Analysis: Principles and Practice, Cambridge, 2001. O. Yilmaz and S. Rickard, "Blind Separation of Speech Mixtures via Time-Frequency Masking", submitted to the IEEE Transactions on Signal Processing, November 4, 2002. A. Jourjine, S. Rickard, and O. Yilmaz, "Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from 2 Mixtures", Proceedings of the 2000 IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP2000), vol. 5, pp. 2985-2988, Istanbul, Turkey, June 2000.
Short Bibliography Statistical Blind Source Separation - ICA Lucas Parra, Clay Spence, "Convolutive blind source separation of non-stationary sources", IEEE Trans. on Speech and Audio Processing pp. 320-327, May 2000 Te-Won Lee, Independent Component Analysis: Theory and Applications Kluwer Academic Publishers, September 1998
Short Bibliography Beamforming B. D. van Veen and K. M. Buckley, "Beamforming: A Versatile Approach to Spatial Filtering", IEEE ASSP Magazine, vol. 5, pp. 4-24, Apr. 1988. M. Brandstein and H. Silverman, "A practical methodology for speech source localization with microphone arrays", Computer Speech and Language, vol. 11, no. 2, pp. 91-126, 1997. D. Ward and M. Brandstein (Eds.), "Microphone Arrays: Techniques and Applications", Springer, Berlin, 2001, pp. 231-256.