Delve into the complexities of ASR, from differentiating speech sounds to handling variations in language, environment, and speakers. Explore methods to reduce these challenges and improve speech recognition accuracy in various applications.
Automatic Speech Recognition: Introduction • Jan Odijk • Utrecht, Dec 9, 2010
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
ASR • Automatic Speech Recognition is the process by which a computer maps an acoustic signal containing speech to text. • Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.
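As a concrete illustration, the sketch below treats ASR as a black box that maps an audio file to text. It assumes the third-party Python package SpeechRecognition is installed and that ‘greeting.wav’ is a hypothetical recording; the Google web backend used here needs an internet connection.

# A minimal sketch of "speech in, text out", assuming the third-party
# SpeechRecognition package is installed (pip install SpeechRecognition)
# and that "greeting.wav" (a hypothetical file) contains recorded speech.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("greeting.wav") as source:
    audio = recognizer.record(source)        # read the whole file into memory

# Map the acoustic signal to text (here via Google's free web API backend)
text = recognizer.recognize_google(audio)
print(text)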
ASR-related • Automatic speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples. • Automatic speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples.
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
Why is ASR difficult? • All occurrences of a speech sound differ from each other • even when they are part of the same word type • and even when pronounced by the same person • (the ‘b’ in ‘boom’ is never pronounced twice in exactly the same way) • Each speaker has his or her own voice characteristics
Why is ASR difficult? • Other problems caused by: • Language: Dutch vs. English vs. … • Accent/Dialect: Flemish vs. NL Dutch, etc. • Gender: Male vs. female • Age: child vs. adult vs. senior • Health: cold, flu, sore throat, etc.
Why is ASR difficult? • Other problems caused by: • Environment: home, office, in-car, in station, etc. • Channel: fixed telephone, mobile phone, multimedia channel, etc. • Microphone(s): telephone mike, close-talk mike, far mike, array microphone, etc.; different mike qualities
Why is ASR difficult? • Confusables • ‘Zeven’ vs. ‘negen’ (Dutch ‘seven’ vs. ‘nine’) • Ambiguity • [sã] = cent, (je) sens, sans (French) • Variation • Yes, yeah, yep, ok, okido, fine, etc.
Why is ASR difficult? • Assimilation, deletions, etc. • Een => [n], [m], [ŋ] (een auto, een boek, een kast) • Natuurlijk => tuurlijk • Coarticulation • The pronunciation of a sound depends on its environment (the sounds preceding/following it) • Koel vs. kiel: [k] vs. [k’] • Filled pauses, stuttering, repetitions
Why is ASR difficult? • Other sounds • Background noise, music, other people talking, channel noise • Reverberation, echo • Speaker of language X pronouncing words from language Y • Esp. with names (persons, places, …)
How are these problems reduced? • Separate ASR system • for each language • for each accent/dialect (Dutch / Flemish) • for each environment • for each channel and microphone(s) • use a close-talk mike to reduce other sounds and the influence of the environment • for each speaker (speaker-adaptive/dependent ASR)
How are these problems reduced? • Restricted Vocabulary • Only a limited number of words can be ‘recognized’ by any specific system • Ranging from a dozen to 64k different word forms • Dozen: application in which digits, yes/no and simple commands are sufficient (banking applications, number dialing)
How are these problems reduced? • Restricted Vocabulary • In between: reverse directory application • employee name => phone number • 64k: ‘large vocabulary systems’ • dictation, • (topographic) name recognition
How are these problems reduced? • Small Vocabularies • Is that enough? No, generally not! • Use dialogue to change restricted vocabulary in each dialogue state (dynamic active vocabularies) • Yes/no answer is expected => activate yes/no vocabulary • Digit expected => activate digit vocabulary • Name expected => activate name vocabulary
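A minimal sketch of dynamic active vocabularies, assuming a simple dialogue manager that knows which state it is in; the state names and word lists below are purely illustrative.

# Hypothetical sketch: switch the active vocabulary with the dialogue state,
# so the recognizer only has to distinguish a handful of words at a time.
ACTIVE_VOCABULARIES = {
    "confirm": ["yes", "yeah", "yep", "no", "nope"],
    "digit":   ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"],
    "name":    ["jones", "smith", "de vries", "janssen"],
}

def activate_vocabulary(dialogue_state):
    """Return the word list the recognizer may hypothesize in this state."""
    return ACTIVE_VOCABULARIES[dialogue_state]

print(activate_vocabulary("confirm"))   # ['yes', 'yeah', 'yep', 'no', 'nope']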
How are these problems reduced? • 64k Vocabulary (“Large Vocabulary”) • Is that enough? • No, generally not • Languages with compounds • Languages with a lot of inflection • Agglutinative languages • => require special measures
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
How does ASR work? • It is not possible (yet?) to characterize the different sounds by hand-crafted rules • Instead: • A large set of recordings of each sound is made • Using statistical methods, a model for each sound is derived (acoustic model) • Incoming sound is compared, using statistics, with the acoustic model of each sound
Elements of a Recognizer • [Block diagram: Speech → Feature Extraction → Pattern Matching (using Acoustic Model, Language Model, Language Data) → Post Processing → text → Natural Language Understanding → Meaning → Action / Display]
Feature Extraction • Turning the speech signal into something more manageable • Sampling of the signal: transforming it into digital form • Processing per short piece of speech (10 ms) • Compression
Feature Extraction • Extract relevant parameters from the signal • Spectral information, energy, frequency,... • Eliminate undesirable elements (normalization) • Noise • Channel properties • Speaker properties (gender)
Feature Extraction: Vectors • The signal is chopped into small pieces (frames) • Spectral analysis of a speech frame produces a vector representing the signal properties, e.g. (10.3, 1.2, -0.9, …, 0.2) • => result = a stream of vectors
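The following sketch shows this idea in code: chop a signal into short overlapping frames and compute one crude spectral vector per frame with NumPy. The frame length, hop size, and band count are illustrative choices, not the parameters of any particular recognizer.

import numpy as np

def feature_vectors(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_features=13):
    """Chop the signal into overlapping frames and return one spectral
    vector per frame (log power in n_features coarse frequency bands)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        bands = np.array_split(power, n_features)            # crude filterbank
        vectors.append(np.log([band.sum() + 1e-10 for band in bands]))
    return np.array(vectors)                                  # (n_frames, n_features)

# One second of synthetic "speech": a 440 Hz tone plus noise
t = np.arange(16000) / 16000.0
signal = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(16000)
print(feature_vectors(signal).shape)     # roughly (98, 13): a stream of vectors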
Elements of a Recognizer • [recognizer block diagram repeated]
Acoustic Model (AM) • Split the utterance into basic units, e.g. phonemes • The acoustic model describes the typical spectral shape (or typical vectors) for each unit • For each incoming speech segment, the acoustic model tells us how well (or how badly) it matches each phoneme • Must cope with pronunciation variability (see earlier) • Utterances of the same word by the same speaker are never identical • Differences between speakers • Identical phonemes sound different in different words • => statistical techniques: models are created from a lot of examples
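Below is a toy sketch of an acoustic model, assuming one diagonal Gaussian per phoneme; real systems use HMMs with Gaussian mixtures or neural networks, but the idea of scoring how well a feature vector matches each unit is the same. All data in the example is synthetic.

import numpy as np

class GaussianPhonemeModel:
    """Toy acoustic model: one diagonal Gaussian per phoneme.
    (Real systems use HMMs with Gaussian mixtures or neural networks.)"""

    def __init__(self):
        self.means, self.variances = {}, {}

    def train(self, phoneme, example_vectors):
        X = np.asarray(example_vectors)
        self.means[phoneme] = X.mean(axis=0)
        self.variances[phoneme] = X.var(axis=0) + 1e-6

    def log_score(self, phoneme, vector):
        """How well does this feature vector match the phoneme? (log-likelihood)"""
        m, v = self.means[phoneme], self.variances[phoneme]
        return float(-0.5 * np.sum(np.log(2 * np.pi * v) + (vector - m) ** 2 / v))

model = GaussianPhonemeModel()
model.train("u", np.random.randn(50, 13) + 1.0)   # stand-in training examples
model.train("i", np.random.randn(50, 13) - 1.0)
frame = np.ones(13)
print(max(("u", "i"), key=lambda p: model.log_score(p, frame)))   # 'u'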
Acoustic Model (AM) • Representation of the speech signal • Waveform • Horizontal: time • Vertical: amplitude • Spectrogram • Horizontal: time • Vertical: frequency • Color: amplitude of each frequency
[Figure: the utterance ‘friendly computers’ stretched out over time and segmented into acoustic model units S1–S13]
Acoustic Model: Units • Phoneme: different words share the units that model the same sound • Word: a series of units specific to that word • [Diagram: ‘Start’ (S T A R T) and ‘Stop’ (S T O P) modeled as sequences of units S1–S10]
Acoustic Model: Units • Context-dependent phonemes: e.g. ‘Stop’ => S|,|T  T|S|O  O|T|P  P|O|, • Diphones: e.g. ‘Stop’ => ,S  ST  TO  OP  P, • Other sub-word units: consonant clusters, e.g. ‘Stop’ => ST  O  P
Acoustic Model: Units • Other possible units • Words • Multi words: example: “it is”, “going to” • Combinations of all of the above
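The unit inventories above can be derived mechanically from a phoneme string; the sketch below reproduces the ‘Stop’ example, using ‘,’ for silence as on the slide.

def context_dependent_units(phonemes):
    """Triphone-style units: phoneme | left context | right context."""
    padded = [","] + list(phonemes) + [","]          # "," marks silence
    return [f"{padded[i]}|{padded[i-1]}|{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

def diphone_units(phonemes):
    """Diphones: transitions between neighbouring phonemes (incl. silence)."""
    padded = [","] + list(phonemes) + [","]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]

print(context_dependent_units("STOP"))   # ['S|,|T', 'T|S|O', 'O|T|P', 'P|O|,']
print(diphone_units("STOP"))             # [',S', 'ST', 'TO', 'OP', 'P,']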
Elements of a Recognizer • [recognizer block diagram repeated]
Pattern matching • Acoustic Model: returns a score for each incoming feature vector indicating how well the vector corresponds to the model (= local score) • Calculate the score of a word, indicating how well the word matches the string of incoming feature vectors • Search algorithm: looks for the best-scoring word or word sequence
Example phonetic transcriptions used in matching: • increase => [, I n k R+ I s ,] • include => [, I n k l u: d ,]
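To make the matching step concrete, here is a minimal dynamic-programming sketch in the spirit of the search algorithm described above: per-frame local scores are combined into a single word score by finding the best left-to-right alignment. The score matrix is made up for illustration.

import numpy as np

def word_score(local_scores):
    """Best left-to-right alignment of frames (rows) to model states (columns):
    each frame either stays in the current state or advances to the next one."""
    n_frames, n_states = local_scores.shape
    best = np.full((n_frames, n_states), -np.inf)
    best[0, 0] = local_scores[0, 0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[t - 1, s]
            advance = best[t - 1, s - 1] if s > 0 else -np.inf
            best[t, s] = max(stay, advance) + local_scores[t, s]
    return best[-1, -1]          # score of reaching the last state at the last frame

# Made-up local scores: 6 frames, word model with 3 states
local = np.array([[0.0, -5, -9], [-1, -1, -8], [-4, 0, -6],
                  [-6, -1, -2], [-8, -3, 0], [-9, -5, -1]], dtype=float)
print(word_score(local))   # higher = the word matches the frames better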
Elements of a Recognizer • [recognizer block diagram repeated]
Language Model (LM) • Describes how words are connected to form a sentence • Limit possible word sequences • Reduce number of recognition errors by eliminating unlikely sequences • Increase speed of recognizer => real time implementations
Language Model (LM) • Two major types • Grammar-based: !start <sentence>; <sentence>: <yes> | <no>; <yes>: yes | yep | yes please ; <no>: no | no thanks | no thank you ; • Statistical • Probabilities of single words and of 2- and 3-word sequences (bigrams, trigrams) • Derived from frequencies in a large text corpus
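A statistical language model of the kind described above can be estimated from raw counts. The sketch below builds a bigram model from a toy corpus; a real system would use a much larger corpus and smoothing.

from collections import Counter

def bigram_model(corpus_sentences):
    """Estimate P(word | previous word) from raw counts in a text corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return lambda prev, word: bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# Toy corpus standing in for "a large text corpus"
p = bigram_model(["yes please", "no thank you", "no thanks", "yes"])
print(p("no", "thank"))    # 0.5: 'thank' follows 'no' in half of its occurrences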
Active Vocabulary • Lists words that can be recognized by the acoustic model • That are allowed to occur given the language model • Each word associated with a phonetic transcription • Enumerated, and/or • Generated by a Grapheme-to-Phoneme (G2P) module
Result • N-best list: • A list of word sequences, each with a score • Based on the combined AM and LM scores • Sorted in descending order of score • Contains at most N word sequences
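A hypothetical sketch of how an N-best list is assembled: each hypothesis carries an acoustic and a language model score, and the combined score determines the ranking. The hypotheses, scores, and weight are invented for illustration.

def n_best(hypotheses, n=3, lm_weight=1.0):
    """Rank word-sequence hypotheses by combined acoustic + language model score."""
    scored = [(h["words"], h["am_score"] + lm_weight * h["lm_score"])
              for h in hypotheses]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]

hypotheses = [                               # made-up scores (log domain)
    {"words": "seven three",   "am_score": -12.0, "lm_score": -2.1},
    {"words": "seven tree",    "am_score": -11.5, "lm_score": -6.8},
    {"words": "several three", "am_score": -15.0, "lm_score": -3.0},
]
for words, score in n_best(hypotheses):
    print(f"{score:7.1f}  {words}")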
Post Processing • Re-ordering of the N-best list using other criteria: e.g. credit card numbers, telephone numbers • If a single result is needed, select the top element • Applying NLP techniques that fall outside the scope of the statistical language model • E.g. “three dollars fifty cents” => “$ 3.50” • “doctor Jones” => “Dr. Jones” • Etc.
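A minimal sketch of rule-based post-processing, covering only the two examples above; the rewrite rules and number table are illustrative, not a general solution.

import re

NUMBER_WORDS = {"three": 3, "fifty": 50}    # just enough for the example

def post_process(text):
    """Rewrite spoken forms into display forms (simple illustrative rules)."""
    text = re.sub(r"\bdoctor\b", "Dr.", text, flags=re.IGNORECASE)
    match = re.search(r"\b(\w+) dollars (\w+) cents\b", text)
    if match and match.group(1) in NUMBER_WORDS and match.group(2) in NUMBER_WORDS:
        amount = f"$ {NUMBER_WORDS[match.group(1)]}.{NUMBER_WORDS[match.group(2)]:02d}"
        text = text[:match.start()] + amount + text[match.end():]
    return text

print(post_process("three dollars fifty cents"))   # $ 3.50
print(post_process("doctor Jones"))                # Dr. Jones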
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
How to get AM and LM • AM • Annotated speech database, and • Pronunciation dictionary • LM • Handwritten grammar, or • Large text corpus
Training of Acoustic Models • [Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]
Annotated Speech Database • Must contain speech covering • all units: phonemes, context-dependent phonemes • the target population (region, dialect, age, gender, …) • the relevant environment(s) (car, office, …) • the relevant channel(s) (fixed phone, mobile phone, desktop computer, …)
Annotated Speech Database • Must contain transcription of speech • At least orthographic • Must include markers for • Speech by others • Other non-speech sounds • Unfinished words, mispronunciations, stuttering, etc.
Pronunciation Dictionary • List of all words occurring in speech database • With one or more phonetic transcriptions • Or: Grapheme-To-Phoneme (G2P) module • Graphemes => phonemes • E.g. boek => [,b u k ,]
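A toy grapheme-to-phoneme sketch, covering just enough (hypothetical) Dutch spelling rules for the ‘boek’ example; a real G2P module uses a much larger rule set or a trained model.

# Toy grapheme-to-phoneme sketch: longest-match spelling rules, purely
# illustrative and covering just enough Dutch spelling for 'boek'.
G2P_RULES = [("oe", "u"), ("b", "b"), ("k", "k")]   # grapheme -> phoneme

def g2p(word):
    """Transcribe a word with silence markers, e.g. boek => [, b u k ,]."""
    phonemes, i = [","], 0
    while i < len(word):
        for grapheme, phoneme in G2P_RULES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            phonemes.append(word[i])     # fall back to the letter itself
            i += 1
    return "[" + " ".join(phonemes + [","]) + "]"

print(g2p("boek"))    # [, b u k ,]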
Training of Acoustic Models • For all utterances in the database: • Make a phonetic transcription of the utterance • Use the current models to segment the utterance file: assign a phoneme to each speech frame • Collect statistical information: count prototype-phoneme occurrences • Create new models (and repeat)
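A sketch of one pass of this training loop, under the simplifying assumption that segmentation is uniform rather than model-based; the ‘model’ per phoneme is just a mean feature vector standing in for a real statistical model.

import numpy as np

def train_acoustic_models(utterances, transcriptions):
    """One pass of the loop above: segment each utterance, collect the feature
    vectors per phoneme, and estimate a mean vector ('prototype') per phoneme.
    Segmentation here is uniform; a real system aligns with its current models."""
    collected = {}
    for frames, phonemes in zip(utterances, transcriptions):
        segments = np.array_split(frames, len(phonemes))     # stand-in alignment
        for phoneme, segment in zip(phonemes, segments):
            collected.setdefault(phoneme, []).append(segment)
    return {phoneme: np.concatenate(chunks).mean(axis=0)     # new model per phoneme
            for phoneme, chunks in collected.items()}

models = train_acoustic_models(
    utterances=[np.random.randn(30, 13)],      # stand-in feature vectors (frames)
    transcriptions=[["b", "u", "k"]])          # phonetic transcription of 'boek'
print(sorted(models))                          # ['b', 'k', 'u']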