
Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks

Philippe Gournay, Kyle D. Anderson, VoiceAge Corporation



  1. Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks (Speech over Packet Networks; Variable Jitter Buffering; Decoder-Based Time-Scaling; Performance Analysis) • Philippe Gournay, Kyle D. Anderson • VoiceAge Corporation, 750, Chemin Lucerne, Montréal (Québec), Canada H3R 2H6 • Philippe.Gournay@USherbrooke.ca

  2. Voice over Packet Networks • Voice communication over packet networks (VoIP) is characterized by a variable transmission time (jitter) • VoIP receivers generally use a jitter buffer to control the effect of the jitter • The jitter buffer works by introducing an additional playout delay • The playout delay is chosen to minimize the number of late packets while keeping the total end-to-end delay within acceptable limits • Packets that arrive before their playout time are temporarily stored in a reception buffer

  3. Voice over Packet Networks (2) • Fixed transmission time (no jitter): a fixed playout delay is enough to produce a sustained flow of speech to the listener • [Figure: timelines of packets 0, 1, 2, …, n, n+1 at the sender, at the receiver, and at playout, with the transmission delay and the playout delay marked]

  4. Voice over Packet Networks (3) • Variable transmission delay (some jitter), fixed playout delay: some packets (n, n+1) arrive too late to be decoded • [Figure: timelines of packets 0, 1, 2, …, n, n+1 at the sender, at the receiver, and at playout, with the transmission delay, the playout delay, and the labels Dn-1, Dn, Dn+1 marked]
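The late-packet condition on the slide above can be illustrated with a minimal sketch. The function name, the delay values, and the notion of a single send-to-playout budget (nominal transmission delay plus playout delay) are assumptions made for illustration; they are not taken from the presentation.

```python
def late_packets(transit_delays_ms, budget_ms):
    """Return the indices of packets that miss their playout time.

    Assumption: packet n is scheduled for playout budget_ms after it was
    sent, so it is late exactly when its transit delay exceeds budget_ms.
    """
    return [n for n, delay in enumerate(transit_delays_ms) if delay > budget_ms]


# Hypothetical transit delays (ms) for five consecutive packets.
delays = [40, 45, 42, 75, 90]
print(late_packets(delays, budget_ms=60))  # the last two packets are late
```

With a fixed budget, the only way to rescue the last two packets is to choose a larger budget for the whole conversation, which is the limitation that variable jitter buffering addresses.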

  5. Jitter Buffering Strategies • Fixed jitter buffer: playout delay chosen at the beginning of the conversation • Variable jitter buffer: talk-spurt based, with the playout delay changed at the beginning of each silence period • For quickly varying networks, better results are obtained when the playout delay is also adapted during active speech

  6. Playout Delay Adaptation • Step 1: using past jitter values, estimate the “ideal” playout time $\hat{P}_{i+1}$ of frame number $i+1$ • Step 2: send frame number $i$ to the decoder, requesting it to generate an output frame of length $\hat{T}_i = \hat{P}_{i+1} - P_i$ • Step 3: the actual playout time of packet $i+1$ is $P_{i+1} = P_i + T_i$, where $T_i$ is the actual length of frame $i$; iterate from step 1 • The playout delay for packet $i$ is the difference between $P_i$ and the reference clock at the normal frame rate
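The three-step iteration above can be sketched as a short loop. This is only an illustration: the function names, the list of precomputed ideal playout times, and the toy decoder that clamps the requested length to between 10 and 40 ms are all assumptions, not the authors' implementation.

```python
def adapt_playout(ideal_times_ms, decode_frame, p0_ms=0.0):
    """Iterate the playout-time recursion P[i+1] = P[i] + T[i].

    ideal_times_ms[i] plays the role of the estimate P_hat[i+1].
    decode_frame(i, requested_ms) is a hypothetical decoder hook that
    returns the actual length T[i] of the frame it produced.
    """
    playout = [p0_ms]
    for i, p_hat_next in enumerate(ideal_times_ms):
        requested = p_hat_next - playout[-1]      # T_hat[i] = P_hat[i+1] - P[i]
        actual = decode_frame(i, requested)       # T[i], may differ from the request
        playout.append(playout[-1] + actual)      # P[i+1] = P[i] + T[i]
    return playout


def toy_decoder(i, requested_ms):
    # Assumed behaviour: the decoder honours requests only within 10..40 ms.
    return min(max(requested_ms, 10.0), 40.0)


print(adapt_playout([20, 40, 90, 100], toy_decoder))
```

In this toy run the third request (50 ms) is clamped to 40 ms, so the playout time falls 10 ms short of the estimate for one frame and catches up on the next one, because each request is computed from the actual playout time rather than from the previous estimate.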

  7. Time Scaling Inside the Decoder • Time scaling inside the decoder presents several advantages over “standalone” time scaling (SOLA, TDHS, …): • Uses the decoder’s internal parameters: pitch period and, in VMR-WB, voicing classification • Regulates the processor load (especially for shorter frames): the number of operations per second tends to increase as the frame length decreases; some complexity is saved during the synthesis operation • Improves quality: smoothing is performed by the synthesis filters

  8. General Principle • Time scaling is performed in the excitation domain • The adaptive codebook is updated before time scaling to keep the encoder and decoder synchronized • Frames are modified depending on their voicing classification (in VMR-WB, the voicing classification is part of the bitstream) • Not all frames are modified • Voiced frames can only be modified by an integer number of pitch periods • Concealed frames can also be time-scaled

  9. Inactive Frames • The pseudo-random number generator used to build the excitation signal of CNG frames is simply run for the requested number of samples • Output frame duration limited to between 0 and 40 ms (twice the standard duration)
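A minimal sketch of this idea follows. It assumes 16 samples per millisecond (consistent with the later slide stating that 640 samples equal 40 ms) and uses Python's `random` module as a stand-in for the codec's own generator; the function name and the gain and seed parameters are hypothetical.

```python
import random

SAMPLES_PER_MS = 16  # assumption: 640 samples = 40 ms


def cng_excitation(requested_ms, gain=1.0, seed=0):
    """Build a comfort-noise excitation of the requested duration.

    The duration is clamped to 0..40 ms, then the generator is simply run
    for that many samples. random.Random stands in for the codec's own
    pseudo-random generator, which this sketch does not reproduce.
    """
    clamped_ms = min(max(requested_ms, 0), 40)
    n_samples = int(round(clamped_ms * SAMPLES_PER_MS))
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


print(len(cng_excitation(30)))   # 480 samples for a 30 ms frame
print(len(cng_excitation(100)))  # request clamped to 40 ms, i.e. 640 samples
```

Because noise has no pitch structure, any length in the allowed range can be produced directly, which is why inactive frames are the most flexible ones to time-scale.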

  10. Unvoiced Frames • Plosive frames and frames that are too voiced are not modified • To lengthen frames: insert zeroes between the original excitation samples, then adjust the gain to preserve the average energy per sample • To shorten frames: remove samples from the excitation signal • Frame duration limited to between 10 and 40 ms
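The lengthening and shortening rules for unvoiced frames can be sketched as follows. The even spacing of inserted zeroes and removed samples, the function name, and the absence of the 10 to 40 ms check are assumptions of this sketch; the slide does not say how positions are chosen.

```python
import math


def rescale_unvoiced(exc, target_len):
    """Time-scale an unvoiced excitation frame to target_len samples.

    Lengthening spreads the original samples evenly over the longer frame,
    leaves zeroes in between, and scales by sqrt(target_len / n) so that
    the average energy per sample is preserved.
    Shortening keeps an evenly spaced subset of the original samples.
    """
    n = len(exc)
    if n == 0 or target_len == n:
        return list(exc)
    if target_len > n:
        gain = math.sqrt(target_len / n)
        out = [0.0] * target_len
        for k, x in enumerate(exc):
            out[k * target_len // n] = gain * x   # distinct slots since target_len > n
        return out
    return [exc[k * n // target_len] for k in range(target_len)]


frame = [1.0, -1.0, 1.0, -1.0]
longer = rescale_unvoiced(frame, 8)
print(sum(x * x for x in longer) / len(longer))  # close to 1.0, the original average energy
```

The gain follows from the energy constraint: the same nonzero samples are spread over more positions, so each one has to be scaled by the square root of the length ratio.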

  11. Voiced Frames • Onset frames and frames that are not voiced enough are not modified • To lengthen frames: use the long-term predictor to duplicate some pitch cycles • To shorten frames: remove selected pitch cycles • Frame duration limited to between 0 and 40 ms

  12. To Lengthen Voiced Frames • Voiced frames are lengthened by repeating selected pitch cycles • [Figure: past excitation and current excitation (frame i), divided into subframes 1 to 4, with the pitch period T0 marked, shown before and after lengthening]
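A simplified sketch of pitch-cycle repetition and removal is given below. It assumes a single pitch period `t0` for the whole frame, a unit-gain long-term predictor, and cycles added or removed at the end of the frame; the actual algorithm selects cycles within subframes, which this sketch does not attempt to reproduce. All names are hypothetical.

```python
def scale_voiced(exc, t0, target_len):
    """Change the frame length by a whole number of pitch cycles.

    The number of cycles is chosen so that the result is as close as
    possible to target_len. Lengthening applies a unit-gain long-term
    predictor, out[n] = out[n - t0]; shortening drops whole cycles.
    """
    n = len(exc)
    if not 0 < t0 <= n:
        raise ValueError("this sketch needs at least one full pitch cycle in the frame")
    cycles = round((target_len - n) / t0)         # signed number of pitch cycles
    if cycles >= 0:
        out = list(exc)
        for _ in range(cycles * t0):
            out.append(out[-t0])                  # copy the sample one pitch period back
        return out
    removed = min(-cycles, n // t0)               # cannot remove more cycles than exist
    return list(exc[: n - removed * t0])


# Hypothetical perfectly periodic frame: 320 samples, pitch period of 80 samples.
frame = [i % 80 for i in range(320)]
print(len(scale_voiced(frame, 80, 500)))  # 480: two cycles added, nearest to 500
```

Because the length changes only in steps of `t0`, the produced frame generally differs from the requested length, which is why the playout recursion uses the actual frame length rather than the requested one.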

  13. Experimental Results • Experiments conducted on clean speech using mode 2 of the VMR-WB codec (average data rate of 4.96 kbit/s with 60% active speech) • Subjective quality: excellent for “fast playback” (up to twice the normal speed); small degradation for “slow playback” (down to half the normal speed) • Very efficient at adding or removing a few ms (e.g. 20 ms) to the playout delay from time to time • Much better than losing a few frames…

  14. Distribution of Unmodified Frames • Required frame length (40 ms) is twice the standard frame length • Total number of frames: 22803 • Number of unmodified frames: 4085 (18%) • Distribution of unmodified frames: • Voiced frames, onsets: 814 (20% of unmodified) • Voiced frames, not voiced enough: 935 (23%) • Unvoiced frames, plosive: 1850 (45%) • Unvoiced frames, too voiced: 486 (12%)
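The percentages on this slide can be rechecked from the raw counts. The snippet below only redoes the arithmetic with the figures quoted above; the variable names are illustrative.

```python
total_frames = 22803
unmodified = {
    "voiced, onset": 814,
    "voiced, not voiced enough": 935,
    "unvoiced, plosive": 1850,
    "unvoiced, too voiced": 486,
}

n_unmodified = sum(unmodified.values())
share_of_total = round(100 * n_unmodified / total_frames)
shares = {name: round(100 * count / n_unmodified) for name, count in unmodified.items()}

print(n_unmodified, share_of_total)  # 4085 frames, 18% of all frames
print(shares)
```

The four categories add up to the reported 4085 unmodified frames, and the rounded shares match the 20%, 23%, 45%, and 12% given on the slide.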

  15. Frame Length Distribution for Modified Voiced Frames Desired frame length was 640 samples = 40 ms

  16. Time Required for a 50-ms Increase of the Playout Delay Experiment done for 8000 different active speech frames

  17. Maximum Complexity and Corresponding Frame Length *: Modified voiced frames not allowed to be less than 10 ms

  18. Optimize the Complexity • Lengthening does not increase complexity • Shortening CNG frames does not increase complexity • Shortening active speech frames increases complexity • Good compromise: • Increase the playout delay as soon as it is necessary • Decrease the playout delay during inactive periods • Playout delay adaptation requires no additional complexity
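The compromise described on this slide can be sketched as a frame-length decision rule. The function name and parameters are assumptions; the 20 ms nominal frame follows from the earlier statement that 40 ms is twice the standard duration, and the per-class limits from the earlier slides are reduced here to a single 0 to 40 ms range.

```python
def choose_frame_length(current_delay_ms, target_delay_ms, is_active,
                        nominal_ms=20, max_ms=40):
    """Pick the frame length to request from the decoder.

    Lengthening is requested as soon as the delay must grow, because it
    does not increase complexity. Shortening is requested only on inactive
    (CNG) frames, where it does not increase complexity either.
    """
    diff = target_delay_ms - current_delay_ms
    if diff > 0:
        return min(nominal_ms + diff, max_ms)     # grow the delay right away
    if diff < 0 and not is_active:
        return max(nominal_ms + diff, 0)          # shrink only during inactivity
    return nominal_ms                             # otherwise play at the normal rate


print(choose_frame_length(60, 110, is_active=True))   # 40: lengthen as much as allowed
print(choose_frame_length(60, 50, is_active=True))    # 20: wait for an inactive period
print(choose_frame_length(60, 50, is_active=False))   # 10: shorten the CNG frame
```

Under this rule the costly case, shortening active speech frames, never occurs, which is what makes the adaptation free in terms of complexity.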

  19. Audio Demonstration

  20. Summary • Adaptive jitter buffering requires a means of time-scaling speech • Time scaling can be done in the decoder’s “excitation domain” • This approach is very efficient in terms of both quality and reactivity • It adds almost no complexity, provided that some well-chosen limits are imposed on the amount of time scaling
