Automatic Burst Location

Alina Khasanova, Prosody-ASR Group, February 12, 2009

Objective: Develop a procedure for accurately and automatically locating the point of release of stops, in order to study the temporal and spectral properties of the plosives {p, t, k, b, d, g} in #CV and VC# contexts in spontaneous speech.

Database: Buckeye corpus
• hand-segmented for phone boundaries
• the release point is not labeled

Nature of the problem

Acoustically, a plosive is characterized by two stages:
(1) Acoustic closure: an interval of very low, relatively constant energy (the duration of full constriction)
(2) Release: a sudden increase in energy across the full frequency range, i.e., a transient, followed by a period of frication (or, for aspirated stops, frication + aspiration)

Hence, a natural way to find the point of release is to look for a large jump in energy after a continuous period of “silence”.
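The "jump after silence" idea can be sketched in a few lines. This is only an illustration, not the procedure used in this work: the function names, the silence level, the minimum run length, and the jump ratio are all hypothetical choices.

```python
def frame_energies(samples, frame_len, step):
    """Mean squared amplitude of each (possibly overlapping) frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def first_energy_jump(energies, silence_level, min_silent_frames, jump_ratio):
    """Index of the first frame whose energy is at least
    jump_ratio * silence_level and that follows at least min_silent_frames
    consecutive frames at or below silence_level. Returns None if no such
    frame exists. All three thresholds are illustrative assumptions."""
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy <= silence_level:
            silent_run += 1
            continue
        if silent_run >= min_silent_frames and energy >= jump_ratio * silence_level:
            return i
        silent_run = 0  # a non-silent frame that is not a jump breaks the run
    return None


# Synthetic token: 320 near-silent samples followed by 320 loud samples.
signal = [0.001] * 320 + [0.5 if n % 2 == 0 else -0.5 for n in range(320)]
energies = frame_energies(signal, frame_len=64, step=32)
jump = first_energy_jump(energies, silence_level=1e-4,
                         min_silent_frames=5, jump_ratio=100.0)
```

Here `jump` is the index of the first frame that overlaps the loud portion; multiplying it by the step gives the sample position.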

Previous work on the problem

• Engineers (Niyogi et al. 1999, Hosom & Cole 2000)
Task: Detecting plosives in continuous (non-segmented) speech
Reasons: Automatic phonetic alignment; measurement of acoustic features (e.g. VOT) for the purposes of ASR
Methods:
(1) Identify candidates based on measures such as total energy, relative change in energy, high-pass energy, spectral flatness, etc.
(2) Classify candidates into plosives and non-plosives (via support vector machines, neural networks, etc.)
Measure of success: Correct if within 20 msec of the hand-labeled closure-release boundary

• Linguists (Yao 2007)
Task: Accurately locate the closure-burst boundary
Reason: Study the duration of closure and VOT, and how a number of linguistic and extra-linguistic factors affect them
Method: Mel spectral templates and similarity scoring (Johnson 2006)
Measure of success: As accurate as possible
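The 20 msec success criterion used in the engineering work can be written as a simple hit rate. A minimal sketch, assuming boundary times are given in seconds and paired token by token; the function name and the example values are hypothetical.

```python
def within_tolerance_rate(detected_s, reference_s, tolerance_s=0.020):
    """Fraction of tokens whose detected boundary lies within tolerance_s
    seconds of the hand-labeled boundary (20 msec by default)."""
    if len(detected_s) != len(reference_s) or not reference_s:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for d, r in zip(detected_s, reference_s)
               if abs(d - r) <= tolerance_s)
    return hits / len(reference_s)


# Made-up boundary times, for illustration only.
rate = within_tolerance_rate([0.100, 0.150, 0.300], [0.105, 0.180, 0.310])
```

A hit rate of this kind says nothing about how close the hits are, which is why the linguistic work aims for a stricter "as accurate as possible" measure.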

Present Method (flowchart)

Inputs: Stop, Vowel
Preprocessing: amplitude normalization, pre-emphasis

Vowel:
• Spectrogram
• Calculate average energy (AvgV) and maximum energy (MaxV)

Stop:
• Spectrogram: 64-point DFT, 4 ms Hamming window, 2 ms step
• Calculate average energy above 3 kHz (HPE) over overlapping frames
• Calculate HPE deltas (Del) between successive 4 ms frames
• Decision tests (each with a Y/N outcome leading to Burst or Closure):
  HPE > 0.05*MaxV & Del > AvgV
  HPE > 0.05*MaxV & Del > 0.75*DelMax
• Last frame = boundary
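The stop-side computation can be sketched as follows. This is a rough illustration under several assumptions that are not stated on the slide: a 16 kHz sampling rate (so that 4 ms = 64 samples), a pre-emphasis coefficient of 0.97, deltas taken between frames two steps (4 ms) apart, and the first frame passing the test taken as the release. Amplitude normalization and the "Last frame = boundary" rule are not reconstructed, and in the demo MaxV is replaced by the maximum HPE of the synthetic token itself because no vowel is present.

```python
import cmath
import math

FS = 16000         # assumed sampling rate in Hz (4 ms = 64 samples)
WIN = 64           # 64-point DFT over a 4 ms Hamming window
STEP = 32          # 2 ms step
CUTOFF_HZ = 3000.0


def pre_emphasize(x, coeff=0.97):
    """First-order pre-emphasis; the coefficient is an assumed value."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def hamming(n_points):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def band_energy(frame, first_bin):
    """Mean DFT power over bins first_bin..N/2 (naive DFT, fine for N = 64)."""
    n = len(frame)
    powers = []
    for k in range(first_bin, n // 2 + 1):
        coef = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n))
        powers.append(abs(coef) ** 2)
    return sum(powers) / len(powers)


def high_pass_energy(x):
    """HPE per frame: average spectral energy strictly above 3 kHz."""
    window = hamming(WIN)
    first_bin = int(CUTOFF_HZ * WIN / FS) + 1
    return [band_energy([x[s + i] * window[i] for i in range(WIN)], first_bin)
            for s in range(0, len(x) - WIN + 1, STEP)]


def hpe_deltas(hpe, lag=2):
    """Del[i] = HPE[i] - HPE[i - lag]; lag=2 steps = 4 ms (assumed reading)."""
    return [0.0] * lag + [hpe[i] - hpe[i - lag] for i in range(lag, len(hpe))]


def first_burst_frame(hpe, deltas, max_v, delta_threshold):
    """First frame with HPE > 0.05*MaxV and Del > delta_threshold, else None."""
    for i, (h, d) in enumerate(zip(hpe, deltas)):
        if h > 0.05 * max_v and d > delta_threshold:
            return i
    return None


# Synthetic token: 40 ms of weak 200 Hz hum, then 20 ms of high-frequency noise.
closure = [0.01 * math.sin(2 * math.pi * 200 * n / FS) for n in range(640)]
burst = [1.0 if n % 2 == 0 else -1.0 for n in range(320)]
signal = pre_emphasize(closure + burst)
hpe = high_pass_energy(signal)
deltas = hpe_deltas(hpe)
frame = first_burst_frame(hpe, deltas, max_v=max(hpe),
                          delta_threshold=0.75 * max(deltas))
release_ms = None if frame is None else 1000.0 * frame * STEP / FS
```

Passing `delta_threshold=avg_v` instead of `0.75 * max(deltas)` gives the other decision test on the slide, with `avg_v` and `max_v` measured on the vowel.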

Results

• Control dataset: hand-labeled closure-release boundary in 607 tokens from the Buckeye corpus
  = 300 word-initial, 307 word-final
  5 tokens for each place of articulation; for /p, t, k, d, g/ from 10 subjects (5 m, 5 f), for /b/ from more than 10 subjects, with a variety of speech rates
• Error analysis:

How to improve the performance?

• Poor performance on ambiguous/non-canonical tokens (e.g. “noisy” closure)
• Would like to identify and exclude such tokens
• Exclusion metric:
  • Standard deviation of deltas:
    Voiceless: exclude if std < 3 (Init 14%, Final 42%)
    Voiced: exclude if std < 2 (Init 5%, Final 33%)
  • Other?
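The exclusion rule above amounts to a one-line test per token. A minimal sketch, assuming the deltas are on the same scale as the thresholds on the slide and that the sample standard deviation is meant (the slide does not say which estimator was used); the names are hypothetical.

```python
import statistics

# Thresholds from the slide; the scale of the deltas is assumed to match.
STD_THRESHOLDS = {"voiceless": 3.0, "voiced": 2.0}


def should_exclude(deltas, voicing):
    """True if the token's HPE deltas vary too little to show a clear burst."""
    return statistics.stdev(deltas) < STD_THRESHOLDS[voicing]


# Made-up delta sequences, for illustration only.
flat_token = [0.10, 0.20, 0.10, 0.15]      # no clear jump
clear_token = [0.0, 0.0, 0.0, 10.0, 0.0]   # one large jump
exclude_flat = should_exclude(flat_token, "voiceless")
exclude_clear = should_exclude(clear_token, "voiceless")
```

A token with a single sharp jump has a large spread in its deltas and is kept; a token with a "noisy" closure and no distinct jump falls below the threshold and is dropped.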

Results after exclusion

• Best results (rms):
  Initial voiceless: 0.0038 msec
  Initial voiced: 0.0133
  Final voiceless: 0.013
  Final voiced: 0.014
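For reference, an rms error of this kind can be computed from paired automatic and hand-labeled boundary times. A minimal sketch; the function name and the example values are hypothetical, and the result is in whatever time unit the inputs use.

```python
import math


def rms_error(auto_times, hand_times):
    """Root-mean-square difference between automatic and hand-labeled
    boundary times, paired token by token."""
    if len(auto_times) != len(hand_times) or not auto_times:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(a - h) ** 2 for a, h in zip(auto_times, hand_times)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up boundary times, for illustration only.
error = rms_error([0.100, 0.205], [0.100, 0.200])
```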

How to improve further?

References:
Hosom, J.-P. & Cole, R. (2000). Burst detection based on measurements of intensity discrimination. In Sixth Int’l Conf. on Spoken Language Processing.
Niyogi, P., Burges, C., & Ramesh, P. (1999). Distinctive feature detection using support vector machines. In Proc. of Int’l Conf. on Acoustics, Speech, & Signal Processing.
Yao, Y. (2007). Closure Duration and VOT of Word-initial Voiceless Plosives in English in Spontaneous Connected Speech. UC Berkeley Phonology Lab 2007 Annual Report.
