10 likes | 118 Views
College of Engineering Temple University. An Introduction to Mandarin Speech Recognition. Mandarin. English. John Steinberg & Dr. Joseph Picone Department of Electrical and Computer Engineering, Temple University, Philadelphia, Pennsylvania. Abstract
E N D
College of Engineering Temple University An Introduction to Mandarin Speech Recognition Mandarin English John Steinberg & Dr. Joseph PiconeDepartment of Electrical and Computer Engineering, Temple University, Philadelphia, Pennsylvania • Abstract • Speech recognition has recently gained a great deal of popularity in video gaming, robotics, speech prosthetics, and mobile technology. With China becoming an increasingly important player in the world economy, a new demand for Mandarin speech recognition has emerged. • This poster will discuss • Applications of speech recognition • Obstacles in English vs. Mandarin speech recognition • Methods used to successfully implement Mandarin speech recognition systems. • Benchmarks in the history of Mandarin speech recognition Speech Recognition Systems World Languages • Mandarin Chinese • Character Based • 8000+ characters compose 80k-200k common words • Act as morphemes • Monosyllabic • Have single associated tone • Tonal Language • 4 distinct tones, 1 neutral tone • Highly Contextual • Few syllables compared to English • English: ~10,000 • Mandarin: ~1300 • High # of homo-phones • 7 instances of “ma” • Coarticulation • Context can cause changes in tone • Bu4 + Dui4 = Bu2 Dui4 • Ni3 + Hao3 = Ni2 Hao3 Language families across the world [6] [1] • English speakers in the World: ~ 350 million [2] • Estimated # of current English learners in China: 200-350 million [3] • Estimated # of native Mandarin speakers: 1+ Billion Speech Recognition Applications Other Applications Auto/GPS • Translators • Prostheses • Language Educ. • Multimedia Search Mobile Technology National Intelligence Language families by speaker population [6] • A Timeline of Mandarin Speech Recognition • Conclusions • Mandarin speech recognition is a developing and increasingly important field • Mandarin speech recognition involves many challenges including large numbers of homophones, word & sentence flexibility, a lack of a segmented lexicon, and coarticulation effects. • Possible improvements to CER may be found in prosodic features and language models • Reference • [1] Lee, C-H. “Advances in Chinese Spoken Language Processing”, World Scientific Publishing Co., Singapore, 2007 pp 128 • [2] Lin-Shan Lee; Keh-Jiann Chen; et. al. , "Golden Mandarin(II)-an intelligent Mandarin dictation machine for Chinese character input with adaptation/learning functions," ISSIPNN '94, 1994 pp.155-159 vol.1, 13-16 Apr 1994 • [3] Ren-Yuan Lyu; Lee-Feng Chien; et. al., "Golden Mandarin (III)-a user-adaptive prosodic-segment-based Mandarin dictation machine for Chinese language with very large vocabulary," ICASSP-95, vol.1, pp.57-60 vol.1, 9-12 May 1995 [4] “The History of Automatic Speech Recognition Evaluations at NIST,” 2009 http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html [5] Lee, L.S.; Tseng, C.Y.; Gu, H.Y.; et. al., "Golden Mandarin (I)-A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary," Speech and Audio Processing, IEEE Transactions on , vol.1, no.2, pp.158-179, Apr 1993 [6] ] K. Kūriákī, A Grammar of Modern Indo-European, Asociación Cultural Dnghu, 2007 • Modeling Methods • Prosodic Features • Models rhythm, focus, and intonation of speech • Reduces search space (i.e. reduces number of homophones and constrains sentence flexibility) • Language Models • Reduces solution space by using probable combinations of phones, words, phrases, etc. • Challenges in Mandarin Speech Recognition • Highly developed language model is required due to highly contextual nature of Mandarin • Tone modeling • Coarticulation • Large # of homophones • Chinese text is unsegmented • No standard lexicon • Chinese sentence/word structure is very flexible • Ex: Beijing Da Xue -> Bei Da • Prosodic Features • Used for segmentation • Measures pause duration, segment / syllable duration, and F0 reset. • Language Models • N-Grams • Probabilities of the occurrence of N sequences of words • Neural-Networks • High dimensionality • Random Forests • Incorporates morphology into language model • A Timeline of Benchmarks [1] [4]