Perceptual In-Set Speaker ID using Neutral Speech and Lombard Speech. Ayako Ikeno and John H.L. Hansen. Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science University of Texas at Dallas Richardson, Texas 75083-0688, U.S.A.
Perceptual In-Set Speaker ID using Neutral Speech and Lombard Speech Ayako Ikeno and John H.L. Hansen Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science University of Texas at Dallas Richardson, Texas 75083-0688, U.S.A. IAFPA-2006 July 23-26, 2006
Outline • CRSS & Speech Processing Overview • Previous Studies on Stress & Lombard Effect • Perceptual Speaker ID with Lombard Speech • Speech Corpus - UTScope • Experimental Setup • Results • Summary & Impact
Overview of CRSS-Hansen Research: • Dialect & Accent (UAE, Egypt, Palestine, etc.; Cuba, Peru, Puerto Rico; Cambridge, Irish, Welsh, etc.) • In-Set / Out-of-Set Speaker Detection • Normalization: Speaker, Environment, Language • UTDrive & CU-Move: In-Vehicle Voice Navigation • Spoken Document Retrieval: http://SpeechFind.utdallas.edu • Speech Under Stress • Speech Enhancement
Why Speech Systems Break?
[Diagram labels: Speaker, Stress, Lombard Effect, Accent, Language, American English, Environment Noise, Channel Noise, Channel, Speech, Speech Recognition, Speaker Recognition, Voice Communications, Human (Auditory) Recognition]
• Speaker Based Problems: Stress & Emotion; Lombard Effect / Noise; Psychological Task Demands; Accent / Language; Speaker Differences (age, sex, vocal tract); Spontaneous Speech ("Um, I just wanna, I just want to say, I don't know what I want to say.")
• Context Based Effects: Homonyms (English +10,000; Japanese 120); Confusable: (Take, Stake, Straight; Cake, Kate); Ambiguous: "Jeet yet?"; "It's ours" vs. "It sours"; "Nice guys" vs. "Nice skies"
• Communication Based: Microphone; Voice Compression; Channel / Mobile Cellular
• Environmental Based: Acoustic Noise; Room Reverberation; Physical Task Demands
Speech Production: Phonetics & Acoustics [Figure labels: Speaker, Stress, Noise, Microphone; Speech Physiology; Acoustic Speech Waveform; Neutral, Stress]
DOES STRESS VARIABILITY IMPACT SPEAKER RECOGNITION? • Limited research on speaker recognition over stress, Lombard effect, etc. • The NATO RSG.10 report showed probe experimental results with the SUSAS corpus (NATO, 2000)
EFFECT OF STRESS ON SPEECH FEATURES (earlier studies by Hansen (1988), 200 speech features, 10,000 stat. tests) • Pitch • Glottal Spectral Slope • Formant Location
EFFECT OF STRESS ON SPEECH FEATURES • Phone Duration • RMS Intensity
STRESS DETECTION USING PITCH • Probability distribution • Detection (ROC) curves • Conditional Gaussian fit (Zhou, Hansen 1997) • Classification error rate • Neutral vs. Loud: 7.24% (Neutral), 8.28% (Loud) • Neutral vs. Lombard: 20.69% (Neutral), 19.31% (Lombard)
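The pitch-based detection idea can be illustrated with a minimal sketch: fit one Gaussian per speaking style to mean-pitch values, then label a new value by the higher likelihood. This is an assumption-laden illustration, not the procedure of Zhou and Hansen (1997); the function names, the equal-prior decision rule, and the pitch values are all made up for the example.

```python
import math
from statistics import mean, stdev

def fit_gaussian(values):
    """Return (mean, standard deviation) of a list of pitch values in Hz."""
    return mean(values), stdev(values)

def log_likelihood(x, mu, sigma):
    """Log of the Gaussian density N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def classify(x, models):
    """Pick the class whose Gaussian gives x the highest likelihood (equal priors assumed)."""
    return max(models, key=lambda label: log_likelihood(x, *models[label]))

# Made-up mean-pitch values (Hz), for illustration only; not data from the study.
neutral_pitch = [110.0, 118.0, 122.0, 115.0, 120.0]
lombard_pitch = [150.0, 162.0, 158.0, 171.0, 165.0]

models = {
    "neutral": fit_gaussian(neutral_pitch),
    "lombard": fit_gaussian(lombard_pitch),
}

print(classify(119.0, models))  # value close to the neutral cluster
print(classify(160.0, models))  # value close to the Lombard cluster
```

With real data the two pitch distributions overlap, which is where nonzero error rates such as those on the slide come from.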
PAST STRESS DETECTION STUDIES USING TRADITIONAL FEATURES
Individual Feature: Stress/Neutral Error Rates
• Pitch: 6-21%
• Glottal Spectral Slope: 18-36%
• Intensity: 18-36%
• Phone Duration: 28-46%
• Formant Location, 1st formant: 38-46%
• Formant Location, 2nd formant: 50-58%
Feature Fusion: Stress/Neutral Error Rates
• Duration + Intensity + mean Pitch: 0-17%
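TEAGER ENERGY OPERATOR • Discrete-time and continuous-time TEO, in the standard form due to Kaiser: $\Psi[x(n)] = x^2(n) - x(n+1)\,x(n-1)$ and $\Psi_c[x(t)] = \left(\frac{dx(t)}{dt}\right)^2 - x(t)\,\frac{d^2x(t)}{dt^2}$, where $\Psi$ is the Teager Energy Operator • TEO-CB-Auto-Env: Critical Band based TEO AUTOcorrelation ENVelope • Critical Frequency 17 Band Partition = based on Auditory Perception • Ref: Zhou, Hansen, Kaiser, IEEE Transactions on Speech & Audio Processing, vol. 9(2): 201-216, March 2001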
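As a minimal sketch of the discrete-time Teager Energy Operator in its standard form, psi[n] = x[n]^2 - x[n+1]*x[n-1], the code below applies it to a pure tone. For x[n] = A*cos(omega*n) the output is the constant A^2 * sin^2(omega), which makes a convenient sanity check. The function name and the test-tone parameters are illustrative assumptions; the critical-band filtering and autocorrelation-envelope steps of TEO-CB-Auto-Env are not shown.

```python
import math

def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n+1]*x[n-1] for the interior samples."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]

# Test tone: amplitude and digital frequency are arbitrary illustrative choices.
amplitude = 2.0
omega = 0.3  # radians per sample
signal = [amplitude * math.cos(omega * n) for n in range(50)]

psi = teager_energy(signal)
expected = amplitude ** 2 * math.sin(omega) ** 2

# For a pure tone the operator output is constant and equals A^2 * sin^2(omega).
print(psi[0], expected)
```

Because the output depends on both amplitude and frequency, the operator responds to changes in either, which is one reason it is used as a nonlinear feature for stressed speech.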
Assessment for NATO SUSC-0 Military Cockpit Recordings • Neutral HMM Model vs. Stress-trained HMM Model
Detection of Speech Under Stress: WRAIR • GOAL: (1) Identify, Model, and Classify Speech Under Stress in Military-Related Task Conditions, and (2) Improve Automatic Speech Coding under Stress • APPROACH: • Effective “Soldier of the Quarter Board” Paradigm • Monitor and Track “Biometrics” of Stress: heart rate, blood pressure, stress hormones, psychometrics • Engineering: Focus on NONLINEAR Air Turbulent Model (Teager Energy Operator); Identify Stress Dependent Performance across Speakers, Phonemes • Ref: Rahurkar, Hansen, Meyerhoff, Saviolakis, Koenig, "Frequency Distribution based Weighted Sub-Band Approach for Classification of Emotional/Stressful Content in Speech," Interspeech, pp. 721-724, Geneva, Switzerland, Sept. 2003 (another paper at Interspeech-2005)
Lombard Effect • First observed by Etienne Lombard in 1911 • Change in speech production in response to noise to increase communication performance • “Lombard Test”: standard test for hearing loss in U.S. (ASHA); measures dB-SPL change in speech production • Hansen (1988): evaluation of 200 features with +10,000 statistical tests on 11 different stressed speech conditions to quantify changes in speech production
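A level change such as the one the Lombard Test measures can be sketched as a ratio of RMS amplitudes expressed in decibels. The sketch below is illustrative only: the function names and the synthetic waveforms are assumptions, and the result equals a dB-SPL change only if both recordings were captured through the same calibrated gain chain.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_change_db(neutral, lombard):
    """Level of the Lombard recording relative to the neutral one, in dB."""
    return 20.0 * math.log10(rms(lombard) / rms(neutral))

# Synthetic stand-ins for two recordings: the second is the first at twice the amplitude.
neutral = [0.1 * math.sin(0.05 * n) for n in range(1000)]
lombard = [2.0 * s for s in neutral]

# Doubling the amplitude corresponds to 20*log10(2), about 6.02 dB.
print(round(level_change_db(neutral, lombard), 2))
```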
Speech Corpus: UTScope • Audio samples for the perceptual experiment were extracted from the UTScope corpus (Speech under COgnitive and Physical stress & Emotion). • Consists of 4 domains: • Lombard Effect: noise levels & types • Physical Stress: stair climbing/stepper • Cognitive Stress: driving (simulator & actual) • Emotion (Angry, Fear, Anxiety, Frustration) • IAFPA-06: focus on “Lombard Effect”
Lombard Effect • Goal: obtain Lombard speech at different noise levels • Quantify ground truth with biometric analysis • Lombard Effect Speech: 9 conditions (3 noise types x 3 levels) • Highway Noise (windows ½ open): 70, 80, 90 dB-SPL • Large Crowd Noise: 70, 80, 90 dB-SPL • Pink Noise: 65, 75, 85 dB-SPL • 1 sec. duration
UTScope Corpus • Large crowd noise: 70, 80, 90 dB-SPL • Highway driving, windows half open: 70, 80, 90 dB-SPL • Pink noise: 65, 75, 85 dB-SPL • Pure-tone hearing screening • Noise levels calibrated with Quest SLM • Open-air headphones for speech feedback
UTScope Corpus • 100 speakers • 20 TIMIT sentences • 5 digit strings • 1 minute spontaneous speech • 8-channel DAT recorder • P-mic • Close-talking mic • Far-field mic
UTScope Data Collection • The ASHA-certified sound booth and recording equipment
Spectrogram Examples [Panels: Male – Neutral; Male – Lombard] • Lombard Effect impacts Temporal and Spectral Structure (as expected) • Evaluation: Perceptual Experiments to assess Speaker Recognition
Perceptual In-Set Speaker ID • Listener Test • Speakers: corpus UTScope; native US English speakers; female speakers only • Speech Conditions: noise type highway driving; noise level 90 dB-SPL
Perceptual In-Set Speaker ID • Speech Materials • Read speech • TIMIT sentences: phonetically balanced • 3 sentences per audio sample (.wav, 16 kHz) • Ref: "Basketball can be an entertaining sport." "My problem is, the cat’s meow always hurts my ears." "The causeway ended abruptly at the shore." • Test: "Youngsters commonly love chocolate and candies as treats." "December and January are nice months to spend in Miami." "There were other farmhouses nearby."
Perceptual In-Set Speaker ID • Listener Test • Listeners: 12 as of May '06 (2 female, 10 male); 41 as of July '06 • Listener origin: India (4), China (1), Korea (1), Mexico (1), Pakistan (1), Thailand (1), Turkey (1), US (1), Vietnam (1) • Task: In-Set vs. Out-of-Set Speaker Identification • Reference/Training: 12 In-Set female speakers • Test: 8 In-Set speakers, 4 Out-of-Set speakers
In-Set Speaker ID User Interface • Reference audio: Neutral or Lombard • Test audio: Neutral or Lombard
In-Set Speaker ID Results (NL = neutral, LD = Lombard) • The effect of speech condition is significant (p=.0024). • Mismatched condition (NL-LD): accuracy at chance level (52%). • Lombard speech (LD-LD, 79%): higher accuracy than neutral speech (NL-NL, 67%). • The Lombard effect may emphasize speaker-specific speech characteristics and so improve accuracy in perceptual speaker ID.
Automated System Performance (SUSAS Corpus) • Closed-Set Speaker ID Accuracy (%) • The same trend holds for the automated system. • LOSS 5-74% • Angry 62% • Lombard 48% • Loud 74% • (See Hansen et al., The Impact of Speech Under `Stress' on Military Speech Technology, NATO Research & Tech. Org. RTO-TR-10, March 2000.)
In-Set Speaker ID Results • In-Set accuracy: affected by the speech condition significantly (p<.0001). • Out-of-Set accuracy: the effect of speech condition marginally significant (p=.0450). • Mismatched condition: • In-Set accuracy below chance level. • Out-of-Set accuracy & false alarm rate high. • Matched conditions: In-Set accuracy & false acceptance rate higher.
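The quantities on this slide can be made concrete with a small scoring sketch. The definitions used here are assumptions, since the slides do not spell them out: In-Set accuracy is the share of In-Set trials answered "in", Out-of-Set accuracy is the share of Out-of-Set trials answered "out", and the false acceptance rate is the share of Out-of-Set trials answered "in". The `Trial` record, the `score` function, and the responses are hypothetical; only the 8 In-Set / 4 Out-of-Set split follows the test design.

```python
from collections import namedtuple

# Each trial records the ground truth and the listener's answer, both "in" or "out".
Trial = namedtuple("Trial", ["truth", "response"])

def score(trials):
    """Compute In-Set accuracy, Out-of-Set accuracy, and false acceptance rate."""
    in_trials = [t for t in trials if t.truth == "in"]
    out_trials = [t for t in trials if t.truth == "out"]
    return {
        "in_set_accuracy": sum(t.response == "in" for t in in_trials) / len(in_trials),
        "out_of_set_accuracy": sum(t.response == "out" for t in out_trials) / len(out_trials),
        "false_acceptance_rate": sum(t.response == "in" for t in out_trials) / len(out_trials),
    }

# Made-up responses for one listener: 8 In-Set and 4 Out-of-Set test speakers.
trials = (
    [Trial("in", "in")] * 6 + [Trial("in", "out")] * 2
    + [Trial("out", "out")] * 3 + [Trial("out", "in")] * 1
)

print(score(trials))
```

Under these definitions Out-of-Set accuracy and false acceptance rate always sum to one, so reporting In-Set and Out-of-Set accuracy separately is what exposes a listener's bias toward answering "in" or "out".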
In-Set Speaker ID Results • Accuracy, Confidence Ratings, and Token Coverage • Confidence rating 3 (somewhat sure) and above: • Accuracy: 72% or higher. • Token coverage: 95% or more. • Confidence ratings can be used to filter listener responses for higher accuracy without significantly reducing token coverage.
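The filtering idea can be sketched as follows: keep only responses at or above a confidence threshold, then report accuracy on the kept responses and the fraction of tokens retained. The function name, the tuple layout, the rating scale beyond "3 = somewhat sure", and the response data are all illustrative assumptions.

```python
def filter_by_confidence(responses, threshold):
    """Return (accuracy, coverage) for responses with confidence >= threshold.

    responses: list of (is_correct, confidence_rating) tuples.
    accuracy is None when no response passes the threshold.
    """
    kept = [r for r in responses if r[1] >= threshold]
    coverage = len(kept) / len(responses)
    accuracy = sum(correct for correct, _ in kept) / len(kept) if kept else None
    return accuracy, coverage

# Made-up listener responses: (correct?, confidence rating).
responses = [
    (True, 5), (True, 4), (False, 3), (True, 3), (False, 1),
    (True, 4), (True, 5), (False, 2), (True, 3), (True, 4),
]

accuracy_all, coverage_all = filter_by_confidence(responses, 1)
accuracy_3, coverage_3 = filter_by_confidence(responses, 3)
print(accuracy_all, coverage_all)
print(accuracy_3, coverage_3)
```

Raising the threshold trades coverage for accuracy; the slide's point is that at rating 3 the coverage cost is small.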
Summary & Conclusions • Results: • Mismatched condition: • The ID accuracy is at a chance level • The Out-of-Set accuracy and false alarm rate are high. • Matched conditions: • The ID accuracy is higher with Lombard speech than with neutral speech
Summary & Conclusions • Production analysis: Characteristics of speech features change in different manners depending on the noise type and noise level (Varadarajan and Hansen, 2006). • Implications: The ID accuracy is higher with Lombard speech if the noise type and level match for the reference and test audio. • Future Work: The effect of noise type & noise level on perceptual speaker ID.
Monitoring Speaker State: Soldier Of Quarter (SOQ) CORPUS, BOARD SETUP • Candidate • Test Time: AB: T-7 day; C: T-20 min; D: D.o.Board; E: T+20 min; FG: T+7 day • 6 Questions • Data Recorded: Speech; Biometrics
Monitoring Speaker State: Monitoring Speaker State: BIOMETRIC MEASURES Change +34.2% +33.0% +22.5% HR = Heart Rate (beats per minute). sBP = Systolic Blood Pressure. dBP = Dystolic Blood Pressure. * Chemical Analysis of Saliva & Blood Samples was also done
Monitoring Speaker State: Classification Error Results for Closed Speaker Set [Chart labels: Baseline, Weighted CB; Stress, Neutral]