
Speech & Audio Coding



  1. Speech & Audio Coding TSBK01 Image Coding and Data Compression Lecture 11, 2003 Jörgen Ahlberg

  2. Outline • Part I - Speech • Speech • History of speech synthesis & coding • Speech coding methods • Part II – Audio • Psychoacoustic models • MPEG-4 Audio

  3. Speech Production • The human's vocal apparatus consists of: • lungs • trachea (wind pipe) • larynx – contains 2 folds of skin called vocal cords, which blow apart and flap together as air is forced through • oral tract • nasal tract

  4. The Speech Signal (figure)

  5. The Speech Signal

  6. The Speech Signal • Elements of the speech signal: • spectral resonance (formants, moving) • periodic excitation (voicing, pitched) + pitch contour • noise excitation (fricatives, unvoiced, no pitch) • transients (stop-release bursts) • amplitude modulation (nasals, approximants) • timing

  7. The Speech Signal • Vowels – characterised by formants; generally voiced. • Tongue & lips – effect of rounding. Examples of vowels: a, e, i, o, u, ah, oh. • Vibration of vocal cords: male 50 – 250 Hz, female up to 500 Hz. • Vowels have on average a much longer duration than consonants. • Most of the acoustic energy of a speech signal is carried by vowels. • F1–F2 chart: formant positions (figure).

  8. History of Speech Coding: VODER – the architecture (figure)


  10. OVE formant synthesis (Gunnar Fant, KTH), 1953

  11. History of Speech Coding • 1926 – PCM first conceived by Paul M. Rainey, and independently by Alec Reeves (AT&T Paris) in 1937; deployed in the US PSTN in 1962. • 1939 – Channel vocoder, the first analysis-by-synthesis system, developed by Homer Dudley of AT&T labs – the VODER. • 1952 – Delta modulation proposed; differential PCM invented. • 1957 – μ-law encoding proposed (standardised for the telephone network in 1972, G.711). • 1974 – ADPCM developed. • 1984 – CELP vocoder proposed (the majority of speech coding standards today use a variation on CELP).

  12. Source-filter Model of Speech Production • A signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract. • The gain controls A_V and A_N determine the intensity of voiced and unvoiced excitation. • The higher formant frequencies are attenuated by -12 dB/octave (due to the nature of our speech organs). • This is an oversimplified model of speech production, but it is very often adequate for understanding the basic principles.
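
A minimal sketch of this model, assuming Python with NumPy/SciPy (not part of the lecture material): a pitch-pulse train or white noise, scaled by the gain controls A_V / A_N, is passed through an all-pole filter that stands in for the vocal tract. The filter coefficients are placeholders, not real formant data.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                  # sampling rate used in the lecture
n_samples = int(0.03 * fs)                 # one 30 ms stretch of signal

def excitation(voiced, pitch_period=80, A_V=1.0, A_N=0.3):
    """Voiced: impulse train every pitch_period samples; unvoiced: white noise."""
    if voiced:
        e = np.zeros(n_samples)
        e[::pitch_period] = A_V            # gain control A_V for voiced excitation
    else:
        e = A_N * np.random.randn(n_samples)   # gain control A_N for noise excitation
    return e

# Placeholder all-pole filter 1/A(z) standing in for the vocal tract;
# these coefficients are an arbitrary resonator, not measured formant data.
a = [1.0, -1.3, 0.8]
speech_like = lfilter([1.0], a, excitation(voiced=True))
```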

  13. Speech Coding Strategies 1: PCM • Invented 1926, deployed 1962. • The speech signal is sampled at 8 kHz. • Uniform quantization requires >10 bits/sample. • Non-uniform quantization (G.711, 1972): the sample is companded to y before quantization. • Quantizing y to 8 bits -> 64 kbit/s.
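
The companding curve y itself is not reproduced in the transcript; as an illustration (an assumption, not the lecture's formula), the sketch below uses the standard μ-law characteristic with μ = 255 followed by an 8-bit uniform quantizer.

```python
import numpy as np

MU = 255.0

def mu_law_compress(x):
    """Compand x in [-1, 1] to y in [-1, 1] with the mu-law curve (mu = 255)."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def quantize(y, bits=8):
    """Uniform quantization of the companded signal to 2**bits levels."""
    levels = 2 ** bits
    return np.round((y + 1.0) / 2.0 * (levels - 1)).astype(int)

x = np.linspace(-1, 1, 5)                    # a few test amplitudes
codes = quantize(mu_law_compress(x))         # 8 bits/sample * 8 kHz = 64 kbit/s
```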

  14. Speech Coding Strategies 2: Adaptive DPCM • Example: G.726 (1974). • Adaptive predictor based on the six previous differences. • Gain-adaptive quantizer with 15 levels ⇒ 32 kbit/s.
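
A toy version of the idea, assuming Python (the real G.726 predictor and adaptation logic are far more elaborate): predict each sample, quantize the prediction error with 15 levels, and let the step size adapt to the signal.

```python
import numpy as np

def adpcm_encode(x, step=0.05):
    """Toy first-order ADPCM: returns one 15-level code per input sample."""
    pred = 0.0
    codes = []
    for s in x:
        err = s - pred
        q = int(np.clip(round(err / step), -7, 7))   # 15-level quantizer
        codes.append(q)
        pred += q * step                             # track the decoder's reconstruction
        step *= 1.1 if abs(q) > 4 else 0.9           # simple gain adaptation
        step = float(np.clip(step, 1e-3, 1.0))
    return codes

codes = adpcm_encode(np.sin(2 * np.pi * 200 * np.arange(400) / 8000))
```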

  15. Speech Coding Strategies 3: Model-based Speech Coding • Advanced speech coders are based on models of how speech is produced: Excitation source → Vocal tract.

  16. An Excitation Source (diagram: a noise generator and a pitch-controlled pulse generator)

  17. Vocal Tract Filter 1: A Fixed Filter Bank (diagram: gains g1, g2, …, gn feeding a bank of band-pass (BP) filters)

  18. Vocal Tract Filter 2: A Controllable Filter

  19. Linear Predictive Coding (LPC) • The controllable filter is modelled as y_n = Σ_i a_i y_(n-i) + G e_n, where e_n is the input (excitation) signal and y_n is the output. • We need to estimate the vocal tract parameters (a_i and G) and the excitation parameters (pitch, v/uv). • Typically the source signal is divided into short segments and the parameters are estimated for each segment. • Example: the speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).
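
As a sketch (Python assumed, not the lecture's code), the synthesis equation for one 180-sample segment can be written directly as a difference-equation loop; the coefficients, gain and excitation below are placeholders.

```python
import numpy as np

def lpc_synthesize(a, G, e):
    """Implements y_n = sum_i a_i * y_(n-i) + G * e_n for one frame.
    a: predictor coefficients a_1..a_M, G: gain, e: excitation frame."""
    M = len(a)
    y = np.zeros(len(e))
    for n in range(len(e)):
        past = sum(a[i] * y[n - 1 - i] for i in range(min(M, n)))
        y[n] = past + G * e[n]
    return y

frame = np.random.randn(180)                        # 22.5 ms of unvoiced excitation at 8 kHz
y = lpc_synthesize(a=[1.2, -0.6], G=0.5, e=frame)   # placeholder parameters
```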

  20. Typical Scheme of an LPC Coder (diagram: a noise generator and a pulse generator, selected by the v/uv decision and controlled by the pitch and gain, drive the vocal tract filter given by the filter coefficients)

  21. Estimating the Parameters • v/uv estimation: based on energy and frequency spectrum. • Pitch-period estimation: look for periodicity, either via the a.c.f. or some other measure, for example one that gives a minimum value when p equals the pitch period. • Typical pitch periods: 20 – 160 samples.
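
The exact measure shown on the slide is missing from the transcript; the sketch below (Python assumed) uses the average magnitude difference function, one common choice that is minimized when the lag p equals the pitch period, searched over the slide's 20–160 sample range.

```python
import numpy as np

def estimate_pitch_period(frame, p_min=20, p_max=160):
    """Return the lag in [p_min, p_max] that minimizes the average
    magnitude difference |y_n - y_(n-p)| over the frame."""
    frame = np.asarray(frame, dtype=float)
    best_p, best_d = p_min, np.inf
    for p in range(p_min, min(p_max, len(frame) - 1) + 1):
        d = np.mean(np.abs(frame[p:] - frame[:-p]))   # AMDF(p)
        if d < best_d:
            best_p, best_d = p, d
    return best_p
```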

  22. Estimating the Parameters • Vocal tract filter estimation: find the filter coefficients a_i that minimize the squared prediction error Σ_n e_n², where e_n = y_n − Σ_i a_i y_(n-i). • Compare to the computation of optimal predictors (Lecture 7).

  23. Estimating the Parameters • Assuming a stationary signal, the minimization leads to the normal equations R a = p, where R and p contain acf values. • This is called the autocorrelation method.
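
A sketch of the autocorrelation method in Python (an illustration, not the lecture's code): build R and p from the frame's autocorrelation values and solve the normal equations; in practice the Toeplitz structure of R is exploited by the Levinson-Durbin recursion.

```python
import numpy as np

def lpc_autocorrelation(frame, M=10):
    """Solve R a = p for the predictor coefficients, with R and p built
    from the autocorrelation values r(0)..r(M) of the frame."""
    y = np.asarray(frame, dtype=float)
    N = len(y)
    r = np.array([np.dot(y[:N - k], y[k:]) for k in range(M + 1)])
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])  # Toeplitz
    p = r[1:M + 1]
    return np.linalg.solve(R, p)
```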

  24. Estimating the Parameters • Alternatively, in the case of a non-stationary signal, the same normal equations are used, but the entries are covariances c(i, j) = Σ_n y_(n-i) y_(n-j) computed over the analysis frame. • This is called the autocovariance method.
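
For comparison, a sketch of the covariance method under the same assumptions: the normal equations have the same form, but the entries are sums over the analysis frame only, so the matrix is no longer Toeplitz.

```python
import numpy as np

def lpc_covariance(frame, M=10):
    """Covariance method: c(i, j) = sum_n y_(n-i) * y_(n-j), n = M..N-1."""
    y = np.asarray(frame, dtype=float)
    N = len(y)
    def c(i, j):
        return sum(y[n - i] * y[n - j] for n in range(M, N))
    C = np.array([[c(i, j) for j in range(1, M + 1)] for i in range(1, M + 1)])
    b = np.array([c(i, 0) for i in range(1, M + 1)])
    return np.linalg.solve(C, b)
```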

  25. Example • Coding of parameters using LPC-10 (1984).

  26. The Vocal Tract Filter • Different representations: • LPC parameters • PARCOR (Partial Correlation Coefficients) • LSF (Line Spectrum Frequencies)

  27. Code Excited Linear Prediction Coding (CELP) • Encoding: • LPC analysis ⇒ V(z). • Define a perceptual weighting filter; this permits more noise at formant frequencies, where it will be masked by the speech. • Synthesise speech using each codebook entry in turn as the input to V(z). • Calculate the optimum gain to minimise the perceptually weighted error energy in the speech frame. • Select the codebook entry that gives the lowest error. • Transmit the LPC parameters and the codebook index. • Decoding: • Receive the LPC parameters and the codebook index. • Re-synthesise the speech using V(z) and the codebook entry. • Performance: • 16 kbit/s: MOS = 4.2, delay = 1.5 ms, 19 MIPS. • 8 kbit/s: MOS = 4.1, delay = 35 ms, 25 MIPS. • 2.4 kbit/s: MOS = 3.3, delay = 45 ms, 20 MIPS.
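
A stripped-down sketch of the analysis-by-synthesis search (Python assumed; the perceptual weighting filter is omitted for brevity, so a real CELP coder would weight the error before comparing): each codebook entry is filtered through V(z) = 1/A(z), the optimum gain is found in closed form, and the entry with the lowest error energy wins.

```python
import numpy as np
from scipy.signal import lfilter

def celp_search(target, codebook, a):
    """target: speech frame; codebook: array of candidate excitation vectors;
    a: predictor coefficients, so V(z) = 1 / (1 - sum_i a_i z^-i)."""
    denom = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    best_k, best_gain, best_err = None, 0.0, np.inf
    for k, entry in enumerate(codebook):
        synth = lfilter([1.0], denom, entry)            # synthesise with V(z)
        gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
        err = np.sum((target - gain * synth) ** 2)
        if err < best_err:
            best_k, best_gain, best_err = k, gain, err
    return best_k, best_gain                            # index and gain are transmitted
```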

  28. Examples • G.728 • V(z) is chosen as a large FIR filter (M ≈ 50). • The gain and filter parameters are estimated recursively from previously received samples. • The codebook contains 127 sequences. • GSM • The codebook contains regular pulse trains with variable frequency and amplitudes. • MELP • Mixed excitation linear prediction. • The codebook is combined with a noise generator.

  29. Other Variations • SELP – Self Excited Linear Prediction • MPLP – Multi-Pulse Excited Linear Prediction • MBE – Multi-Band Excitation Coding

  30. Quality Levels

  31. Subjective Assessment • In digital communications, speech quality is classified into four general categories: broadcast, network (or toll), communications, and synthetic. • Broadcast wideband speech – high-quality ”commentary” speech – is generally achieved at rates above 64 kbit/s. • MOS (Mean Opinion Score): the result of averaging the opinion scores of a set of between 20 and 60 untrained subjects. • They rate the quality 1 to 5 (1 – bad, 2 – poor, 3 – fair, 4 – good, 5 – excellent). • A MOS of 4 or higher defines good or toll (network) quality – the reconstructed signal is generally indistinguishable from the original. • A MOS between 3.5 and 4.0 defines communications quality – telephone communications. • A MOS between 2.5 and 3.5 implies synthetic quality.

  32. Subjective Assessment • DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs (e.g. meat/heat). • DAM (Diagnostic Acceptability Measure): trained listeners judge various factors, e.g. muffledness, buzziness, intelligibility. • Figure: quality versus data rate (8 kHz sampling rate).
