
Acoustic Databases

Acoustic Databases. Jan Odijk ELSNET Summer School, Prague, 2001. Acknowledgements. Part of the slides have been borrowed from or are based on work by Bart D’Hoore Hugo van Hamme Robrecht Comeyne Dirk van Compernolle Bert van Coile. Overview. What is a speech database?


Presentation Transcript


  1. Acoustic Databases Jan Odijk ELSNET Summer School, Prague, 2001

  2. Acknowledgements • Part of the slides • have been borrowed from or • are based on work by • Bart D’Hoore • Hugo van Hamme • Robrecht Comeyne • Dirk van Compernolle • Bert van Coile

  3. Overview • What is a speech database? • How is it used? • What does it contain? • How is it created? • Industrial needs • Technologies and applications

  4. Overview • What is a speech database? • How is it used? • What does it contain? • How is it created? • Industrial needs • Technologies and applications

  5. Linguistic Resources (LRs) • Linguistic Resources are sets of language data in machine-readable form that can be used for developing, improving or evaluating language and speech technologies. • Some language and speech technologies • Text-To-Speech (TTS) • Automatic Speech Recognition (ASR) • Dictation • Speaker Verification/Recognition • Spoken Dialogue • Audio Mining • Machine Translation • Intelligent Content Management • …

  6. Linguistic Resources (LRs): Major Types • Electronic Text Corpora • Newspapers, magazines, etc. • Usenet texts, e-mail, correspondence • Etc. • Lexical Resources • Monolingual lexicons • Translation lexicons • Thesauri • … • Acoustic Resources • Annotated Speech Recordings • Annotated Recordings of other acoustic signals • Coughing, throat clearing, breathing, … • Door slamming, screeching tires (of a car), …

  7. Types of Linguistic Resources: Acoustic Resources • Acoustic Databases (ADBs) • Controlled recording of human speech or other acoustic signals • Enriched with annotations • Recorded digitally • Representative of targeted application environment and medium • Balanced for phonemes/phoneme combinations • Speaker parameters, recording quality, environment/medium documented

  8. Types of Linguistic Resources: Acoustic Resources • Annotated unstructured recordings • Broadcast material • Recorded conversations/monologues/speeches etc. • Dictated material • Enriched with annotations

  9. Types of Linguistic Resources: Acoustic Resources • In-service data • Recorded sessions of humans interacting with a running application • Usually by logging a customer system • Enriched with annotations • Used for tuning models, grammars, etc. to a specific application

  10. Types of Linguistic Resources: Acoustic Resources • Environments • “Quiet” • Studio • Quiet office • Normal office • Noisy • Public place (street, hotel lobby, station, etc.) • Car (running engine 0 km/h, city, highway) • Industrial environment

  11. Types of Linguistic Resources: Acoustic Resources • Media • HQ close-talk microphone • Desktop microphones • Telephone • analog or digital • fixed line or mobile • Wide band microphones • Array microphones • PC/PDA etc. low quality microphone

  12. Overview • What is a speech database? • How is it used? • What does it contain? • How is it created? • Industrial needs • Technologies and applications

  13. Acoustic Resources: Use • (for speech synthesis modules in TTS systems) • (as acoustic reference material for pronunciation lexicons) • Mainly for speech recognition • Training and test material for research into new recognition engines and engine features • Training and test material for development of acoustic models • Tuning of acoustic models for specific applications

  14. What is speech recognition? • ASR: Automatic speech recognition • Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. • Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech. • Speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples. • Speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples.

  15. Elements of a Recognizer

  16. Elements of a Recognizer [Diagram blocks: Speech Data, Feature Extraction, Pattern Matching, Post Processing, Display text, Acoustic Model, Language Model, Natural Language Understanding, Meaning, Action]

  17. Feature Extraction • Turning speech signal into something more manageable • Do analysis once every 10 ms • Data compression: 220 bytes => 50 bytes => 4 bytes • Sampling of a signal: transforming into a digital form • Extracting relevant parameters from the signal • Spectral information, energy, pitch, ... • Eliminate undesirable elements (normalization) • Noise • Channel properties • Speaker properties (gender)

  18. Feature Extraction: Vectors • Signal is chopped into small pieces (frames), typically 30 ms • Spectral analysis of a speech frame produces a vector representing the signal properties, e.g. (10.3, 1.2, -0.9, ..., 0.2) • => result = stream of vectors
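
The framing step on slides 17 and 18 can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function names, the 8 kHz sampling rate, the synthetic tone and the use of log energy as the only feature are all assumptions. A real front end would compute a full spectral vector per frame.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=30, step_ms=10):
    """Cut the signal into overlapping frames: frame_ms long, one every step_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append(samples[start:start + frame_len])
    return frames

def log_energy(frame):
    """One simple feature: log of the mean squared amplitude of the frame."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return math.log(mean_sq + 1e-10)  # small constant avoids log(0) on silence

# Assumed input: 100 ms of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]

frames = frame_signal(signal, sample_rate)
features = [[log_energy(f)] for f in frames]  # stream of (one-dimensional) vectors
```

With a 30 ms frame and a 10 ms step, consecutive frames overlap by 20 ms, which is how "analysis once every 10 ms" and "frames of typically 30 ms" fit together.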

  19. Elements of a Recognizer [Diagram blocks: Speech Data, Feature Extraction, Pattern Matching, Post Processing, Display text, Acoustic Model, Language Model, Natural Language Understanding, Meaning, Action]

  20. Acoustic Model • Split utterance into basic units, e.g. phonemes • The acoustic model describes the typical spectral shape (or typical vectors) for each unit • For each incoming speech segment, the acoustic model will tell us how well (or how badly) it matches each phoneme • Must cope with pronunciation variability • Utterances of the same word by the same speaker are never identical • Differences between speakers • Identical phonemes sound different in different words • => statistical techniques: creation via a lot of examples

  21. f-r--ie--n--d--l--y- c--o--m--p---u----t--e--r---s S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13

  22. Acoustic Model: Units • Phoneme: share units that model the same sound • Stop = S T O P • Start = S T A R T • Word: series of units specific to the word • Stop = S1 S2 S3 S4 • Start = S6 S7 S8 S9 S10

  23. Acoustic Model: Units • Context dependent phoneme • Stop = S|,|T T|S|O O|T|P P|O|, • Diphone • Stop = ,S ST TO OP P, • Other sub-word units: consonant clusters • Stop = ST O P

  24. Acoustic Model: Units • Phonemes • Phonemes in context: spectral properties depend on previous and following phoneme • Diphones • Sub-words: syllables, consonant clusters • Words • Multi words: example: “it is”, “going to” • Combinations of all of the above
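
The diphone and context-dependent units from slide 23 can be derived mechanically from a phoneme sequence. The sketch below assumes that "," stands for silence at the word boundaries and that a context-dependent unit is written as phoneme|left|right, which is how the "Stop" example on the slide reads. The function names are hypothetical.

```python
def diphones(phonemes, sil=","):
    """Pairs of neighbouring phonemes, padded with silence on both sides."""
    padded = [sil] + list(phonemes) + [sil]
    return [left + right for left, right in zip(padded, padded[1:])]

def context_dependent(phonemes, sil=","):
    """Each phoneme labelled with its left and right neighbour: phoneme|left|right."""
    padded = [sil] + list(phonemes) + [sil]
    return [
        f"{cur}|{left}|{right}"
        for left, cur, right in zip(padded, padded[1:], padded[2:])
    ]

stop_diphones = diphones("STOP")
stop_context = context_dependent("STOP")
```

A word of N phonemes yields N context-dependent units but N + 1 diphones, because diphones span the boundaries between phonemes.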

  25. Elements of a Recognizer [Diagram blocks: Speech Data, Feature Extraction, Pattern Matching, Post Processing, Display text, Acoustic Model, Language Model, Natural Language Understanding, Meaning, Action]

  26. Pattern matching • Acoustic Model: returns a score for each incoming feature vector indicating how well the feature corresponds to the model (= local score) • Calculate the score of a word, indicating how well the word matches the string of incoming features (Viterbi) • Search algorithm: looks for the best scoring word or word sequence
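
A minimal sketch of the word-scoring step, under simplifying assumptions that are not on the slide: each word is a left-to-right chain of states, every frame either stays in the current state or advances by one, transitions carry no cost, and the local scores are log scores invented for illustration.

```python
def viterbi_word_score(local_scores):
    """Best total score of aligning all frames to one word model.

    local_scores[t][s] is the local (log) score of frame t against state s.
    The path must start in the first state and end in the last state.
    """
    neg_inf = float("-inf")
    n_states = len(local_scores[0])
    best = [neg_inf] * n_states
    best[0] = local_scores[0][0]
    for frame in local_scores[1:]:
        new_best = [neg_inf] * n_states
        for s in range(n_states):
            prev = best[s]                      # stay in the same state
            if s > 0:
                prev = max(prev, best[s - 1])   # or advance from the previous state
            new_best[s] = prev + frame[s]
        best = new_best
    return best[-1]

# Hypothetical local scores for three frames against two 2-state word models.
word_scores = {
    "yes": viterbi_word_score([[-1, -5], [-2, -1], [-6, -1]]),
    "no": viterbi_word_score([[-4, -4], [-4, -4], [-4, -4]]),
}
best_word = max(word_scores, key=word_scores.get)
```

The final `max` over words is the simplest possible search: it scores every word in the vocabulary and keeps the best one.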

  27. Elements of a Recognizer [Diagram blocks: Speech Data, Feature Extraction, Pattern Matching, Post Processing, Display text, Acoustic Model, Language Model, Natural Language Understanding, Meaning, Action]

  28. Language Model • Describes how words are connected to form a sentence • Limit possible word sequences • Reduce number of recognition errors by eliminating unlikely sequences • Increase speed of recognizer => real time implementations

  29. Language Model • Two major types • Grammar based !start <sentence>; <sentence>: <yes> | <no>; <yes>: yes | yep | yes please ; <no>: no | no thanks | no thank you ; • Statistical • Probability of single words, 2/3-word sequences • Derived from frequencies in a large corpus
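
The statistical type can be illustrated with an unsmoothed bigram model estimated from frequencies. The four-sentence corpus below is invented (it reuses the words of the grammar example), and the function names are hypothetical. A real model would be trained on a large corpus and would smooth the estimates so that unseen word pairs do not get probability zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count word histories and word pairs, with sentence boundary markers."""
    history_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])
        pair_counts.update(zip(words, words[1:]))
    return history_counts, pair_counts

def bigram_prob(history_counts, pair_counts, prev, word):
    """Relative frequency estimate of P(word | prev); 0.0 if prev was never seen."""
    if history_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / history_counts[prev]

corpus = ["yes please", "no thanks", "no thank you", "yes"]
history_counts, pair_counts = train_bigram(corpus)

p_please_after_yes = bigram_prob(history_counts, pair_counts, "yes", "please")
p_yes_after_no = bigram_prob(history_counts, pair_counts, "no", "yes")
```

A sequence such as "no yes" gets probability zero here, which is the sense in which the language model eliminates unlikely sequences.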

  30. Active Vocabulary • Lists words that can be recognized by the acoustic model • That are allowed to occur given the language model • Each word associated with a phonetic transcription • Enumerated, and/or • Generated by a Grapheme-to-Phoneme (G2P) module

  31. Post Processing • Re-ordering of N-best list using other criteria: e.g. account numbers, telephone numbers • Spelling: name search from a list of known names • Applying NLP techniques that fall outside the scope of the statistical language model • E.g. “three dollars fifty cents” => “$ 3.50” • “doctor Jones” => “Dr. Jones” • Etc.
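
The two rewriting examples on this slide can be sketched with a small rule-based normalizer. Everything here is an assumption made for illustration: the function names, the number-word table (amounts below one hundred only) and the rule that "doctor" is abbreviated only before a capitalized name.

```python
import re

# Small number-word table; enough for amounts below one hundred.
NUMBERS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}

def words_to_int(words):
    """Sum number words, e.g. ["twenty", "five"] -> 25. Raises KeyError if unknown."""
    return sum(NUMBERS[w] for w in words)

def normalize_amount(text):
    """Rewrite '<n> dollar(s) [<m> cent(s)]' as '$ n.mm'; leave other text alone."""
    match = re.fullmatch(r"(.+) dollars?(?: (.+) cents?)?", text)
    if not match:
        return text
    try:
        dollars = words_to_int(match.group(1).split())
        cents = words_to_int(match.group(2).split()) if match.group(2) else 0
    except KeyError:
        return text  # contains a word that is not in the table
    return f"$ {dollars}.{cents:02d}"

def normalize_title(text):
    """Abbreviate 'doctor' when it is followed by a capitalized name."""
    return re.sub(r"\bdoctor (?=[A-Z])", "Dr. ", text)

amount = normalize_amount("three dollars fifty cents")
title = normalize_title("doctor Jones")
```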

  32. Training of Acoustic Models • Annotated Speech Database + Pronunciation Dictionary => Training Program => Acoustic Model

  33. Training of Acoustic Models • Database design • Coverage of units: word, phoneme, context dependent unit • Coverage of population (region, dialect, age, …) • Coverage of environments (car, telephone, office,..) • Database collection and validation • Checking recording quality • Annotation: describing what people said, extra-speech sounds • Dictionaries • Phonetic transcription of words • Multiple transcriptions needed • G2P: automatic transcription

  34. Feature vectors [Figure: stream of feature vectors, e.g. (2.1, -0.2, 1.9, ..., -0.3), (10.3, 1.2, -0.9, ..., 0.2), (8.1, -0.5, 1.3, ..., 0.2), ...]

  35. Example: discrete models • A collection of prototypes is constructed (100 to 250) • Each vector is replaced by its nearest prototype
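
Replacing each vector by its nearest prototype is vector quantization. The sketch below uses squared Euclidean distance, which is an assumption (the slide does not name a distance measure), and a hypothetical codebook of four three-dimensional prototypes instead of the 100 to 250 mentioned above. The input vectors are the leading components of the vectors shown on slide 34.

```python
def nearest_prototype(vector, prototypes):
    """Index of the prototype closest to the vector (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(prototypes)), key=lambda i: sq_dist(vector, prototypes[i]))

# Hypothetical codebook of prototypes.
prototypes = [
    [0.0, 0.0, 0.0],
    [10.0, 1.0, -1.0],
    [8.0, -0.5, 1.5],
    [2.0, 0.0, 2.0],
]

vectors = [
    [2.1, -0.2, 1.9],
    [10.3, 1.2, -0.9],
    [8.1, -0.5, 1.3],
]

# The stream of vectors becomes a stream of prototype indices.
labels = [nearest_prototype(v, prototypes) for v in vectors]
```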

  36. Feature vectors and prototypes [Figure: stream of feature vectors, e.g. (2.1, -0.2, 1.9, ..., -0.3), (10.3, 1.2, -0.9, ..., 0.2), (8.1, -0.5, 1.3, ..., 0.2), ..., above a row of prototype indices, e.g. 39, 7]

  37. Phoneme assignment ffrrEEEnnnnddllIIII,,,kkOOOmmpjjuuuuttt$$$rrzz Prototypes 2276998900023448889211127780128897791237787622

  38. Training of Acoustic Models • For all utterances in database: • Make phonetic transcription of the sentence • Use models to segment the utterance file: assign a phoneme to each speech frame • Collect statistical information: count prototype-phoneme occurrences • Create new models
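
The counting step can be sketched with the two aligned strings from slide 37, assuming that each character is one speech frame and that each digit is a prototype label. Turning the counts into relative frequencies gives, for every phoneme, an estimate of how likely each prototype is. The variable and function names are hypothetical, and a single utterance is far too little data for a real model.

```python
from collections import Counter, defaultdict

# Frame-by-frame alignment taken from slide 37 (one character per frame).
phonemes = "ffrrEEEnnnnddllIIII,,,kkOOOmmpjjuuuuttt$$$rrzz"
prototypes = "2276998900023448889211127780128897791237787622"

# Count how often each prototype occurs with each phoneme.
counts = defaultdict(Counter)
for phoneme, prototype in zip(phonemes, prototypes):
    counts[phoneme][prototype] += 1

def emission_probs(counts):
    """Relative frequency of each prototype given the phoneme."""
    probs = {}
    for phoneme, proto_counts in counts.items():
        total = sum(proto_counts.values())
        probs[phoneme] = {p: n / total for p, n in proto_counts.items()}
    return probs

probs = emission_probs(counts)
```

In the iterative scheme on the slide, the new models built from these statistics would then be used to re-segment the utterances, and the counting would be repeated.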

  39. Key Element in ASR • ASR is based on learning from observations • Huge amount of spoken data needed for making acoustic models • Huge amount of text data needed for making language models • => Lots of statistics, few rules

  40. Overview • What is a speech database? • How is it used? • What does it contain? • How is it created? • Industrial needs • Technologies and applications

  41. Contents of an ADB • Utterances of different utterance types • Utterance types suited to the intended application domain • Text balanced for phoneme and/or diphone distribution • All enriched with annotations
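
One common way to obtain text that covers many diphones with few prompts is greedy selection: repeatedly pick the candidate sentence that adds the most units not yet covered. The slides do not say which method was used, so this is only an illustration. It also uses letter pairs as a stand-in for diphones, since a real selection would work on phonetic transcriptions, and all names are hypothetical.

```python
def units(sentence):
    """Letter pairs of the sentence, used here as a stand-in for diphones."""
    letters = sentence.lower().replace(" ", "")
    return {letters[i:i + 2] for i in range(len(letters) - 1)}

def greedy_select(candidates, max_sentences):
    """Pick sentences one by one, each time the one adding most new units."""
    covered, chosen = set(), []
    pool = list(candidates)
    for _ in range(min(max_sentences, len(pool))):
        best = max(pool, key=lambda c: len(units(c) - covered))
        if not units(best) - covered:
            break  # nothing new can be added
        chosen.append(best)
        covered |= units(best)
        pool.remove(best)
    return chosen, covered

candidates = ["stop", "stop now", "start", "left"]
chosen, covered = greedy_select(candidates, max_sentences=2)
```

Balancing the distribution, as opposed to only covering each unit once, would additionally track how often each unit has been selected.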

  42. Contents of an ADB: Spontaneous vs. Read Utterances • A spontaneous utterance is a response to a question or a request • “In which city do you live?” • “Please spell a letter heading to your secretary” • “Is English your mother tongue?” • “Make a hotel reservation” • A read utterance is an utterance read from a presentation text • “London” • “Dear John” • “Yes” • “Please book me a room for 2 persons with bath. We will arrive ….”

  43. Contents of an ADB • Isolated Phonetically Rich Word • Apple Tree, Lobster • Isolated Digit • 5 • Isolated Alphabet • B • Isolated number (natural number) • 4256

  44. Contents of an ADB • Continuous Digits • 9 1 1 • Continuous Alphabet • Y M C A • Commands • Stop, left, print, call, next

  45. Contents of an ADB: Connected Digits • Telephone Numbers • 057/228888 • Credit Card Numbers • 3741 959289 310001 • Pin-codes • 8978 • Social Security Number • 560228 561 80 • Other identification numbers, e.g. sheet id • 012589225712

  46. Contents of an ADB: Time and Date Expressions • Time (“analog”, word style) • A quarter past two • Time (“digital”) • 14:15 • 2:15 PM • Date (“analog”, word style, absolute) • Friday, June 25th, 1999 • Christmas Eve, Easter • Date (“digital”, absolute) • 25/06/99 • Date (“analog”, word style, relative) • Tomorrow, next week, in one month
