Understanding Spoken Language using Statistical and Computational Methods Steven Greenberg International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 http://www.icsi.berkeley.edu/~steveng (contains electronic versions of papers and links to data)


  1. Understanding Spoken Language using Statistical and Computational Methods. Steven Greenberg, International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704. http://www.icsi.berkeley.edu/~steveng (contains electronic versions of papers and links to data). Patterns of Speech Sounds in Unscripted Communication - Production, Perception, Phonology. Akademie Sankelmark, October 8-11, 2000

  2. OR ….

  3. How I Learned to Stop Worrying and Use The Canonical Form

  4. Disclaimer I am a Phonetician - NOT! (many thanks for the invite)

  5. No Scientist is an Island … IMPORTANT COLLEAGUES
      • PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH (SWITCHBOARD): Candace Cardinal, Rachel Coulston, Dan Ellis, Eric Fosler, Joy Hollenback, John Ohala, Colleen Richey
      • STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION: Eric Fosler, Leah Hitchcock, Joy Hollenback
      • ARTICULATORY-ACOUSTIC BASIS OF CONSONANT RECOGNITION: Leah Hitchcock, Rosaria Silipo
      • AUTOMATIC PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH: Shawn Chang, Lokendra Shastri

  6. Germane Publications
      STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
      • Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.
      • Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
      • Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
      • Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
      • Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, pp. S24-27.
      PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
      • Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
      • Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
      • Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.
      AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION
      • Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proceedings of the International Conference on Spoken Language Processing, Beijing.
      • Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.
      http://www.icsi.berkeley.edu/~steveng

  7. Prologue

  8. Language - The Traditional Perspective. The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization. [Figure label: Phonetic orthography]

  9. Language - A Syllable-Centric Perspective. A more empirical perspective of spoken language focuses on the syllable as the interface between “sound” and “meaning”. Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic.

  21. Take Home Messages
      • SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES
        • Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)
      • THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL
        • Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time
        • Nuclei and codas are expressed canonically only 60% of the time
        • Nuclei tend to be realized as vowels different from the canonical form
        • Codas are often deleted entirely
        • Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position
        • Therefore, it is important to model spoken language at the syllabic level
      • THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY
        • It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material

  22. Take Home Messages
      • PHONETIC PROPERTIES OF SPONTANEOUS SPEECH REFLECT INFORMATION CONTENT
      Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.

  32. Road Map
      • PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH
        • Provides the basis for the statistical analyses of spontaneous material
      • A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF:
        • Phonetic segments
        • Words
        • Syllables
        • Articulatory-acoustic features
      • PERCEPTUAL EVIDENCE
        • The articulatory-acoustic basis of consonant recognition
        • Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition
      • COMPUTATIONAL METHODS
        • Automatic methods for phonetic transcription based on articulatory-acoustic features
        • Such methods are the most likely means of generating sufficient empirical data with which to rigorously test hypotheses germane to spoken language

  33. Phonetic Transcription of Spontaneous (American) English

  34. Phonetic Transcription of Spontaneous English
      • TELEPHONE DIALOGUES OF 5-10 MINUTES DURATION - SWITCHBOARD
      • AMOUNT OF MATERIAL MANUALLY TRANSCRIBED
        • 3 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)
        • 1 hour labeled and segmented at the phonetic-segment level
      • DIVERSITY OF MATERIAL TRANSCRIBED
        • Spans speech of both genders (ca. 50/50%), reflecting a wide range of American dialectal variation (6 regions + “army brat”), speaking rate and voice quality
      • TRANSCRIBED BY WHOM?
        • 7 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by three of the original eight individuals
        • Supervised by Steven Greenberg and John Ohala
      • TRANSCRIPTION SYSTEM
        • A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd
      • HOW LONG DOES TRANSCRIPTION TAKE? (Don’t Ask!)
        • 388 times real time for labeling and segmentation at the phonetic-segment level
        • 150 times real time for labeling phonetic segments and segmenting syllables
      • HOW WAS LABELING AND SEGMENTATION PERFORMED?
        • Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations
      • DATA AVAILABLE AT - http://www.icsi.berkeley.edu/real/stp
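The real-time factors quoted on this slide can be turned into a rough estimate of total labeling effort. The sketch below is only back-of-the-envelope arithmetic: it assumes the 388x factor applies uniformly to the 1-hour portion and the 150x factor to the 3-hour portion, and the function name `labeling_hours` is illustrative rather than anything from the talk.

```python
def labeling_hours(audio_hours: float, real_time_factor: float) -> float:
    """Person-hours needed to transcribe `audio_hours` of speech."""
    return audio_hours * real_time_factor

# Figures taken from the slide; the uniform-rate assumption is ours.
segment_level = labeling_hours(1, 388)    # labeled and segmented at the phonetic-segment level
syllable_level = labeling_hours(3, 150)   # phone labels plus syllable segmentation
total = segment_level + syllable_level

print(segment_level, syllable_level, total)
```

Under these assumptions the four hours of material correspond to roughly 840 person-hours of manual work, which helps explain the later emphasis on automatic transcription.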

  35. A Brief Tour of Pronunciation Variation in Spontaneous American English

  36. Cumulative Word Frequency in English (focus on the 100 most common words)
      • The 10 most common words account for 27% of the corpus
      • The 100 most common words account for 67% of the corpus
      • The 1000 most common words account for 92% of the corpus
      Thus, most informal dialogues are composed of a relatively small number of common words. However, it is the infrequent words that typically provide the precision and detail required for complex information transfer.
      Computed from the Switchboard corpus (American English telephone dialogues)
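The cumulative coverage figures above can be computed from any tokenized corpus by counting word types and summing the counts of the most frequent ones. A minimal sketch, assuming whitespace-tokenized text; the function name `coverage` and the toy sentence are invented for illustration and are not taken from Switchboard.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Fraction of all tokens accounted for by the `top_n` most frequent word types."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Toy data, invented for illustration only.
tokens = "i know and i think that i know the uh the thing".split()

print(coverage(tokens, 1))    # share covered by the single most common word
print(coverage(tokens, 3))    # share covered by the three most common words
```

Applied to a real corpus, calling `coverage` with 10, 100 and 1000 would yield the kind of figures reported on the slide.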

  37. How Many Pronunciations of “And”? [Table: columns N, Pronunciation]

  39. How Many Different Pronunciations? [Table: columns Rank, Word, N, #Pron, Most Common Pronunciation (MCP), %Total]
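The column headers on these slides suggest a per-word summary of pronunciation variation. The sketch below shows one way such a table could be derived from phonetically transcribed tokens. It assumes that N is the token count, #Pron the number of distinct pronunciations, MCP the most common pronunciation and %Total the share of tokens realized as the MCP; that reading of the headers, the function name and the toy Arpabet-style data are all assumptions, not material from the talk.

```python
from collections import Counter, defaultdict

def pronunciation_stats(pairs):
    """Summarize (word, pronunciation) tokens per word.

    Returns {word: (n_tokens, n_distinct_pronunciations, most_common_pron, mcp_share)}.
    """
    by_word = defaultdict(Counter)
    for word, pron in pairs:
        by_word[word][pron] += 1

    stats = {}
    for word, prons in by_word.items():
        n = sum(prons.values())
        mcp, mcp_count = prons.most_common(1)[0]
        stats[word] = (n, len(prons), mcp, mcp_count / n)
    return stats

# Toy transcriptions, invented for illustration only.
pairs = [
    ("and", "ae n"), ("and", "ae n"), ("and", "eh n"), ("and", "ae n d"),
    ("the", "dh ax"), ("the", "dh iy"), ("the", "dh ax"),
]

for word, row in pronunciation_stats(pairs).items():
    print(word, row)
```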

  44. English is (sort of) like Chinese ….
      • 95% of the words contain just ONE or TWO syllables
      • 81% of the word tokens are monosyllabic
      • Of the 100 most common words, 90 are one syllable in length
      • Only 22% of the words in the lexicon are one syllable long
      Hence, there is a decided preference for monosyllabic words in informal discourse
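The contrast on this slide is between word tokens (every occurrence in running speech) and word types (distinct entries in a lexicon). The sketch below computes both shares for a toy sample. It treats the distinct words of the sample as the lexicon, which is a simplification, and the syllable counts and function name are supplied by hand for illustration.

```python
from collections import Counter

def monosyllabic_shares(tokens, syllable_counts):
    """Return (share of tokens, share of distinct word types) that are monosyllabic."""
    counts = Counter(tokens)
    mono_tokens = sum(c for word, c in counts.items() if syllable_counts[word] == 1)
    mono_types = sum(1 for word in counts if syllable_counts[word] == 1)
    return mono_tokens / len(tokens), mono_types / len(counts)

# Toy data, invented for illustration only.
syllable_counts = {"the": 1, "and": 1, "i": 1, "computer": 3, "telephone": 3, "really": 2}
tokens = ["i", "the", "and", "the", "i", "i", "computer", "the", "telephone", "really"]

token_share, type_share = monosyllabic_shares(tokens, syllable_counts)
print(token_share, type_share)
```

Because short words are repeated far more often than long ones, the token share exceeds the type share, which is the pattern the slide reports (81% of tokens versus 22% of the lexicon).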

  45. Syllable and Word Frequencies are Similar. Words and syllables exhibit similar distributions over the 300 most common elements, accounting for 80% of the corpus. The similarity of their distributions is a consequence of most words consisting of just a single syllable.

  46. Word Frequency in Spontaneous English. Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10. Word frequency falls off roughly linearly with rank order on log-log axes (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.). Computed from the Switchboard corpus (American English telephone dialogues)
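A rank-frequency relation of this kind can be checked by fitting a straight line to log frequency against log rank: a slope near -1 corresponds to frequency being inversely proportional to rank. The sketch below does this with a plain least-squares fit on idealized data; the function name and the constant 10000 are arbitrary choices for illustration, and real corpus counts would only approximate this behavior.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank).

    `frequencies` must be sorted from most to least frequent; rank starts at 1.
    """
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Idealized frequencies following f(rank) = C / rank, with C chosen arbitrarily.
ideal = [10000 / rank for rank in range(1, 1001)]

slope = loglog_slope(ideal)      # close to -1 for an inverse-rank law
ratio = ideal[9] / ideal[99]     # 10th vs. 100th most common word
print(slope, ratio)
```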

  47. Information Affects Pronunciation. The faster the speaking rate, the more likely it is that the pronunciation deviates from the canonical form. However, the effect is much more pronounced for the 100 most common words than for more infrequent words. From Fosler, Greenberg and Morgan (1999); Greenberg and Fosler (2000)

  48. English Syllable Structure is (sort of) Like Japanese. Most syllables are simple in form (no consonant clusters). 87% of the pronunciations are simple syllabic forms. 84% of the canonical corpus is composed of simple syllabic forms. n = 103,054

  49. Complex Syllables are Important, Though. Thus, despite English’s reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex. There are many “complex” syllable forms (consonant clusters), but all occur relatively infrequently. Complex codas are realized in actual pronunciation less often than their canonical representation would suggest. Complex onsets tend to preserve their canonical pronunciation. n = 17,760
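The simple/complex distinction used on these two slides can be expressed over consonant-vowel (CV) skeletons: a syllable counts as complex if its skeleton contains a consonant cluster. A minimal sketch, assuming syllables have already been reduced to strings of "C" and "V"; the function names and the toy list of shapes are illustrative only.

```python
def is_complex(shape: str) -> bool:
    """True if a C/V syllable skeleton contains a consonant cluster (two adjacent Cs)."""
    return "CC" in shape

def complex_share(shapes):
    """Fraction of syllable tokens whose skeleton is complex."""
    return sum(is_complex(s) for s in shapes) / len(shapes)

# Toy syllable skeletons, invented for illustration only.
shapes = ["CV", "CVC", "V", "VC", "CCVC", "CVCC", "CV", "CVC"]

print(complex_share(shapes))
```

Running the same count separately over canonical and realized skeletons would show how often complex codas are simplified in actual pronunciation.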

  50. Syllable-Centric Pronunciation
      • Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues
      • Onsets are pronounced canonically far more often than nuclei or codas
      Example: “Cat” [k ae t], where [k] = onset, [ae] = nucleus, [t] = coda
      [Figure: percent canonically pronounced as a function of syllable position, for read sentences and for spontaneous speech; n = 120,814]
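The per-position comparison described on this slide amounts to aligning each realized syllable with its canonical form and counting matches separately for onset, nucleus and coda. The sketch below assumes that alignment has already been done and that each syllable is given as a pair of dictionaries; the data structure, function name and toy realizations of “cat” are assumptions made for illustration, not the procedure used in the talk.

```python
def canonical_rate(syllables):
    """Per-position share of syllables whose realized segment matches the canonical one.

    Each syllable is a (canonical, realized) pair of dicts keyed by
    "onset", "nucleus" and "coda"; an empty string means no segment.
    """
    rates = {}
    for pos in ("onset", "nucleus", "coda"):
        # Only count syllables whose canonical form has something at this position.
        relevant = [(c[pos], r[pos]) for c, r in syllables if c[pos]]
        matches = sum(1 for canon, real in relevant if canon == real)
        rates[pos] = matches / len(relevant) if relevant else None
    return rates

# Toy realizations of "cat", invented for illustration only.
cat = {"onset": "k", "nucleus": "ae", "coda": "t"}
syllables = [
    (cat, {"onset": "k", "nucleus": "ae", "coda": "t"}),   # fully canonical
    (cat, {"onset": "k", "nucleus": "ae", "coda": ""}),    # coda deleted
    (cat, {"onset": "k", "nucleus": "eh", "coda": "t"}),   # vowel substituted
    (cat, {"onset": "k", "nucleus": "ax", "coda": ""}),    # reduced vowel, coda deleted
]

rates = canonical_rate(syllables)
print(rates)
```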
