Speech Recognition

Speech Recognition Feature Extraction

Speech recognition simplified block diagram
[Figure: blocks labelled Training, Speech Capture, Feature Extraction, Models, Pattern Matching Process, Results, Text]

Speech capture
• Use a good-quality noise-cancelling microphone
• Use a bandwidth of 4 kHz for telephone speech
• Use a bandwidth of 8 kHz for desktop speech
• Sample at 8 kHz or 16 kHz respectively (twice the bandwidth)
• Apply an anti-aliasing filter to the input
• Avoid background noise
• Speak clearly but naturally

Spectral Features
• Need to extract the key frequency components
• These are visible in a spectrogram (2-D, real-time examples)

Feature extraction
• Need to extract the frequency content (spectrogram)
• Matching on raw data is inefficient
• Much of the raw data is redundant and carries little information
• Analyse the signal and extract the key features
• The same word spoken by different people looks very different in the time domain
• In the frequency domain, patterns are more evident
• Generally use Mel Frequency Cepstral Coefficients (MFCCs)

The process
MFCCs are short-term spectral features. They are calculated as follows:
• Divide the signal into frames
• For each frame, obtain the amplitude spectrum via an FFT
• Convert to the mel spectrum using a mel filter bank
• Take the natural logarithm
• Take the discrete cosine transform (DCT) to obtain the mel cepstrum

Divide signal into frames
Apply a window function, typically a Hamming window
• Select about 25 ms of speech data and window it to cut it cleanly out of the data stream
• Shift the window by about 10 ms and repeat continuously
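The framing step can be sketched in a few lines of NumPy. This is a minimal illustration, not the Sphinx III implementation: the function name `frame_signal`, its parameters and the 440 Hz test tone are assumptions made for the example.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping, Hamming-windowed frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(signal) - frame_len) // shift
    # Row i of idx selects samples [i*shift, i*shift + frame_len)
    idx = np.arange(frame_len)[None, :] + shift * np.arange(num_frames)[:, None]
    # Multiply every frame by the same Hamming window
    return signal[idx] * np.hamming(frame_len)

# One second of a 440 Hz tone sampled at 8 kHz (illustrative input only)
sr = 8000
t = np.arange(sr) / sr
frames = frame_signal(np.sin(2 * np.pi * 440 * t), sr)
print(frames.shape)  # (98, 200): 98 frames of 200 samples each
```

Each row of `frames` is one 25 ms slice, and consecutive rows start 10 ms (80 samples) apart, so neighbouring frames overlap by 15 ms.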

Why Hamming? Why not rectangular?
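One way to explore this question numerically is to compare the spectral leakage of the two windows. The sketch below estimates the peak sidelobe level of each window from a zero-padded FFT; the helper `peak_sidelobe_db` and the padding factor are assumptions made for this illustration.

```python
import numpy as np

def peak_sidelobe_db(window, pad_factor=64):
    """Peak sidelobe level of a window, in dB relative to its main-lobe peak."""
    n_fft = len(window) * pad_factor          # zero-pad for a smooth spectrum
    mag = np.abs(np.fft.rfft(window, n_fft))
    mag_db = 20.0 * np.log10(mag / mag[0] + 1e-12)
    # Walk down the main lobe until the first local minimum (first null)
    k = 1
    while k < len(mag_db) - 1 and mag_db[k + 1] < mag_db[k]:
        k += 1
    # Everything beyond the first null is sidelobe energy
    return mag_db[k:].max()

n = 200  # a 25 ms frame at 8 kHz
rect_db = peak_sidelobe_db(np.ones(n))
hamm_db = peak_sidelobe_db(np.hamming(n))
print(f"rectangular: {rect_db:.1f} dB, Hamming: {hamm_db:.1f} dB")
```

The rectangular window's highest sidelobe sits only about 13 dB below the main lobe, while the Hamming window's sidelobes are far lower. The abrupt edges of a rectangular cut therefore smear energy across the spectrum much more than the tapered Hamming window does, at the cost of a wider main lobe.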

Now have a series of vectors being produced
If sampling at 8 kHz, the sample period = 125 µs
Vector size = 25 ms / 125 µs = 25000 / 125 = 200-element array

Feed the speech frame into an FFT to get the frequency components of that slice
• Calculate the power spectrum for each element of the vector
• s[k] = (Re X[k])^2 + (Im X[k])^2, where X[k] are the FFT coefficients
• Use a set of filters to split the spectrum into frequency bands
• Typically use mel-scale filters to match the response of the basilar membrane, and get the energy in each band
• Sphinx III uses 40 filters over an 8 kHz bandwidth
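The power-spectrum formula can be tried directly with NumPy's real FFT. This is a sketch under assumptions: the function name `power_spectrum`, the FFT size of 256 and the 1 kHz test tone are chosen for the example and are not taken from Sphinx III.

```python
import numpy as np

def power_spectrum(frame, n_fft=256):
    """Power at each FFT bin: s[k] = (Re X[k])^2 + (Im X[k])^2."""
    X = np.fft.rfft(frame, n=n_fft)   # frame is zero-padded up to n_fft samples
    return X.real ** 2 + X.imag ** 2

# A 200-sample (25 ms at 8 kHz) Hamming-windowed frame of a 1 kHz tone
sr = 8000
frame_len = 200
tone = np.sin(2 * np.pi * 1000 * np.arange(frame_len) / sr)
power = power_spectrum(tone * np.hamming(frame_len))

peak_bin = int(np.argmax(power))
print(len(power), peak_bin, peak_bin * sr / 256)  # 129 bins, peak at bin 32 = 1000 Hz
```

Because the input is real, only the n_fft/2 + 1 non-negative frequency bins are kept; bin k corresponds to k x sr / n_fft Hz.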

Frequency response of the ear is non-linear
• Mel(ody): m = 1127.01048 x log_e(1 + f/700)
• Inverse: f = 700(e^{m / 1127.01048} – 1)
• Bark = 13 x arctan(0.76f / 1000) + 3.5 x arctan((f / 7500)^2)
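These scale conversions are easy to check numerically. In the sketch below the function names are illustrative; f is in Hz, the inverse mel mapping divides m by 1127.01048, and the Bark formula uses 0.76 f / 1000 and f / 7500.

```python
import math

def hz_to_mel(f):
    """Mel value for a frequency f in Hz."""
    return 1127.01048 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: frequency in Hz for a mel value m."""
    return 700.0 * (math.exp(m / 1127.01048) - 1.0)

def hz_to_bark(f):
    """Bark value for a frequency f in Hz."""
    return 13.0 * math.atan(0.76 * f / 1000.0) + 3.5 * math.atan((f / 7500.0) ** 2)

for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```

By construction 1000 Hz maps to about 1000 mel, and equal steps in mel cover progressively wider bands in Hz as frequency rises.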

Calculate the mel spectrum by multiplying the power spectrum by each of the triangular mel weighting filters and integrating (summing) the result.
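A triangular mel filter bank can be sketched as a matrix with one row per filter, so that the mel spectrum is a single matrix-vector product. The function `mel_filterbank`, the unit-height triangles, the 512-point FFT and the 0 Hz to Nyquist range are assumptions for this example; a real front end such as Sphinx III may use different band edges and filter normalisation.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.01048 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.01048) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Return a (num_filters, n_fft//2 + 1) matrix of triangular weights."""
    f_high = sample_rate / 2.0 if f_high is None else f_high
    # num_filters + 2 edge frequencies, equally spaced on the mel scale
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high),
                                     num_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((num_filters, len(bin_hz)))
    for i in range(num_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (bin_hz - lo) / (mid - lo)     # ramps 0 -> 1 between lo and mid
        falling = (hi - bin_hz) / (hi - mid)    # ramps 1 -> 0 between mid and hi
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

# 40 filters over an 8 kHz bandwidth (16 kHz sampling), 512-point FFT assumed
bank = mel_filterbank(40, 512, 16000)
flat_power = np.ones(512 // 2 + 1)      # placeholder power spectrum
mel_spectrum = bank @ flat_power        # weight by each triangle and sum
print(bank.shape, mel_spectrum.shape)   # (40, 257) (40,)
```

Each filter overlaps its neighbours by half, and the filters get wider in Hz at higher frequencies because their edges are equally spaced in mel.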

Calculate the mel cepstrum
• A DCT is applied to the natural logarithm of the mel spectrum to obtain the mel cepstrum:
c[n] = \sum_{i=0}^{L-1} \ln(S[i]) \cos(\pi n (2i + 1) / (2L)), n = 0, 1, ..., C-1
• C is the number of cepstral coefficients required (n = 0 to 12 gives 13 for Sphinx III), L is the number of filter banks, and S[i] is the mel spectrum coefficient, one for each filter output
• C is usually less than L, as the DCT has the effect of compressing the spectrum so that the bulk of the information is in the first few coefficients
• Sphinx III uses 40 filters but keeps only the first 13 cepstral coefficients
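The final step can be sketched as an unnormalised DCT-II of the log mel spectrum, written out explicitly so that it depends only on NumPy. The function name `mel_cepstrum`, the small floor that guards against log(0) and the flat test spectrum are assumptions for this example; library DCT routines may apply a different scaling.

```python
import numpy as np

def mel_cepstrum(mel_spectrum, num_ceps=13, floor=1e-10):
    """c[n] = sum_i ln(S[i]) * cos(pi * n * (2i + 1) / (2L)), n = 0..num_ceps-1."""
    S = np.maximum(np.asarray(mel_spectrum, dtype=float), floor)  # avoid log(0)
    L = len(S)
    n = np.arange(num_ceps)[:, None]   # cepstral index, one per row
    i = np.arange(L)[None, :]          # filter index, one per column
    basis = np.cos(np.pi * n * (2 * i + 1) / (2 * L))
    return basis @ np.log(S)

# A flat mel spectrum with every S[i] = e has ln(S[i]) = 1, so c[0] = L
# and all higher coefficients are zero.
ceps = mel_cepstrum(np.full(40, np.e))
print(ceps.shape, round(float(ceps[0]), 6))  # (13,) 40.0
```

With 40 filter outputs and `num_ceps=13`, only the first 13 of the 40 possible coefficients are kept, which is the compression the slide describes.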

Default values for the SPHINX III front-end

Typical Feature Extraction Block Diagram
