Voice Browsers GeneralMagic Demo Making the Web accessible to more of us, more of the time. SDBI November 2001, Shani Shalgi
What is a Voice Browser? • Expanding access to the Web • Will allow any telephone to be used to access appropriately designed Web-based services • Server-based • Voice portals
What is a Voice Browser? • Interaction via keypads, spoken commands, listening to prerecorded speech, synthetic speech and music. • An advantage to people with visual impairment • Web access while keeping hands & eyes free for other things (e.g. driving).
What is a Voice Browser? • Mobile Web • Naturalistic dialogs with Web-based services.
Motivation • Far more people today have access to a telephone than have access to a computer with an Internet connection. • Many of us already have, or soon will have, a mobile phone within reach wherever we go.
Motivation • Easy to use, even for people with no knowledge of computers or a fear of them. • Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller.
Motivation • Many companies offer services over the phone via menus traversed using the phone's keypad. Voice Browsers are the next generation of call centers, which will become Voice Web portals to the company's services and related websites, whether accessed via the telephone network or via the Internet.
Motivation • Disadvantages to existing methods: • WAP (Cellular phones, Palm Pilots) • Small screens • Access speed • Limited or fragmented availability • Awkward input • Price • Lack of user habit
Differences Between Graphical & Voice Browsing • Graphical browsing is more passive due to the persistence of the visual information. • Voice browsing is more active since the user has to issue commands. • Graphical Browsers are client-based, whereas Voice Browsers are server-based. The leading role is turned over to the USER.
Possible Applications • Accessing business information: • The corporate "front desk" which asks callers who or what they want • Automated telephone ordering services • Support desks • Order tracking • Airline arrival and departure information • Cinema and theater booking services • Home banking services
Possible Applications (2) • Accessing public information: • Community information such as weather, traffic conditions, school closures, directions and events • Local, national and international news • National and international stock market information • Business and e-commerce transactions
Possible Applications (3) • Accessing personal information: • Voice mail • Calendars, address and telephone lists • Personal horoscope • Personal newsletter • To-do lists, shopping lists, and calorie counters
Advancing Towards Voice • Until now, speech recognition and synthesis technologies had to be handcrafted into applications. • Voice Browsers aim to build the voice technologies directly into Web servers. • This demands either transforming Web content into formats better suited to the needs of voice browsing or authoring content directly for voice browsers.
The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding.
W3C Speech Interface Framework • Pronunciation Lexicon • Call Control • Voice Browser Interoperation • VoiceXML • Speech Synthesis • Speech Recognition • DTMF Grammars • Speech Grammars • Stochastic (N-Gram) Language Models • Semantic Interpretation
VoiceXML • VoiceXML is a dialog markup language designed for telephony applications, where users are restricted to voice and DTMF (touch tone) input. [Diagram labels: Browser, Internet, Web Server, text.html, text.vxml]
Speech Synthesis • The specification defines a markup language for prompting users via a combination of prerecorded speech, synthetic speech and music. You can select voice characteristics (name, gender and age) and the speed, volume, pitch, and emphasis. There is also provision for overriding the synthesis engine's default pronunciation.
Speech Recognition [Diagram labels: USER; Speech; Touch Tone; Speech Grammars; Stochastic Language Models; Semantic Interpretation; DTMF Grammars]
DTMF Grammars • Touch tone input is often used as an alternative to speech recognition. • Especially useful in noisy conditions or when the social context makes it awkward to speak. • The W3C DTMF grammar format allows authors to specify the expected sequence of digits, and to bind them to the appropriate results
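The idea of binding expected digit sequences to results can be sketched in a few lines of Python. This is only an illustration and not the W3C DTMF grammar format itself; the menu entries and the names `DTMF_GRAMMAR` and `interpret_dtmf` are hypothetical.

```python
from typing import Optional

# Hypothetical "grammar": each expected key sequence is bound to a result.
DTMF_GRAMMAR = {
    "1": "account_balance",
    "2": "recent_transactions",
    "0": "operator",
    "*9": "repeat_menu",
}

def interpret_dtmf(keys: str) -> Optional[str]:
    """Return the result bound to a key sequence, or None if no rule matches."""
    return DTMF_GRAMMAR.get(keys)
```

A caller pressing `1` would be routed to the balance dialog, while an unexpected sequence yields `None` so the application can re-prompt.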
Speech Grammars • In most cases, user prompts are very carefully designed to encourage the user to answer in a form that matches context-free grammar rules. • Speech Grammars allow authors to specify rules covering the sequences of words that users are expected to say in particular contexts. These contextual clues allow the recognition engine to focus on likely utterances, improving the chances of a correct match.
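A toy context-free grammar shows how rules restrict what the recognizer will accept. The sketch below is an assumption-laden illustration in Python, not the W3C grammar syntax; the rule names (`$order`, `$size`, `$item`) and the vocabulary are invented for the example.

```python
# Hypothetical grammar: symbols starting with "$" are rules, others are words.
GRAMMAR = {
    "$order": [["$size", "$item"], ["a", "$size", "$item"]],
    "$size": [["small"], ["medium"], ["large"]],
    "$item": [["pizza"], ["coke"]],
}

def parse(symbol, words, pos):
    """Return every position reachable after matching `symbol` at `pos`."""
    if not symbol.startswith("$"):
        matched = pos < len(words) and words[pos] == symbol
        return {pos + 1} if matched else set()
    ends = set()
    for alternative in GRAMMAR[symbol]:
        positions = {pos}
        for part in alternative:
            positions = {end for p in positions for end in parse(part, words, p)}
        ends |= positions
    return ends

def accepts(utterance):
    """True if the whole utterance is covered by the $order rule."""
    words = utterance.lower().split()
    return len(words) in parse("$order", words, 0)
```

Here `accepts("a large pizza")` succeeds while `accepts("a pizza large")` fails, which is the sense in which a grammar narrows recognition to likely utterances.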
Stochastic (N-Gram) Language Models • In some applications it is appropriate to use open-ended prompts ("How can I help?"). In these cases, context-free grammars are not useful. • The solution is to use a stochastic language model. Such models specify the probability that one word occurs following certain others. The probabilities are computed from a collection of utterances collected from many users.
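As a rough illustration of how such probabilities could be computed, the Python sketch below estimates bigram (N = 2) probabilities by counting word pairs. The three-utterance corpus and the function name `train_bigram_model` are made up for the example; a real model would be trained on far more data and would need smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

def train_bigram_model(utterances):
    """Estimate P(current word | previous word) from raw counts."""
    counts = defaultdict(Counter)
    for utterance in utterances:
        words = ["<s>"] + utterance.lower().split()  # <s> marks the start
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    model = {}
    for previous, following in counts.items():
        total = sum(following.values())
        model[previous] = {word: n / total for word, n in following.items()}
    return model

# Hypothetical utterances collected from callers.
corpus = ["i want a pizza", "i want a coke", "i need help"]
model = train_bigram_model(corpus)
```

With this corpus, `model["i"]["want"]` is 2/3 because "want" follows "i" in two of the three utterances.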
Semantic Interpretation • The recognition process matches an utterance to a speech grammar, building a parse tree as a byproduct. • There are two approaches to harvesting semantic results from the parse tree: 1. Annotating grammar rules with semantic interpretation tags (ECMAScript). 2. Representing the result in XML.
Semantic Interpretation - Example For example (1st approach), the user utterance: "I would like a medium coca cola and a large pizza with pepperoni and mushrooms." could be converted to the following semantic result: { drink: { beverage: "coke", drinksize: "medium" }, pizza: { pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] } }
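To make the mapping from words to a structured result concrete, here is a small Python sketch that produces the same structure for this utterance. It uses simple keyword spotting as a stand-in and is not the tag-annotated grammar approach itself; the vocabularies and the function name `interpret` are assumptions made for the example.

```python
# Hypothetical vocabularies for the ordering domain.
SIZES = {"small", "medium", "large"}
BEVERAGES = {"coca cola": "coke", "coke": "coke", "lemonade": "lemonade"}
TOPPINGS = {"pepperoni", "mushrooms", "olives"}

def interpret(utterance):
    """Build a nested semantic result by spotting keywords left to right."""
    words = utterance.lower().strip(" .!?").split()
    result = {}
    size = None  # most recent size word, applied to the next item
    i = 0
    while i < len(words):
        word = words[i]
        pair = " ".join(words[i:i + 2])
        if word in SIZES:
            size = word
        elif pair in BEVERAGES:
            result["drink"] = {"beverage": BEVERAGES[pair], "drinksize": size}
            size = None
            i += 1  # skip the second word of the two-word beverage name
        elif word in BEVERAGES:
            result["drink"] = {"beverage": BEVERAGES[word], "drinksize": size}
            size = None
        elif word == "pizza":
            result["pizza"] = {"pizzasize": size, "topping": []}
            size = None
        elif word in TOPPINGS and "pizza" in result:
            result["pizza"]["topping"].append(word)
        i += 1
    return result
```

Calling `interpret` on the utterance from the slide returns a dictionary with a `drink` entry (coke, medium) and a `pizza` entry (large, with pepperoni and mushrooms).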
Pronunciation Lexicon • Application developers sometimes need the ability to tune speech engines, whether for synthesis or recognition. • W3C is developing a markup language for an open portable specification of pronunciation information using a standard phonetic alphabet. • The most commonly needed pronunciations are for proper nouns such as surnames or business names.
Call Control • Fine-grained control of speech (signal processing) resources and telephony resources in a VoiceXML telephony platform. • Will enable application developers to use markup to perform call screening, whisper call waiting, call transfer, and more. • Can be used to transfer a user from one voice browser to another on a completely different machine.
Voice Browser Interoperation • Mechanisms to transfer application state, such as a session identifier, along with the user's audio connections. • The user could start with a visual interaction on a cell phone and follow a link to switch to a VoiceXML application. • The ability to transfer a session identifier makes it possible for the Voice Browser application to pick up user preferences and other data entered into the visual application.
Voice Browser Interoperation (2) • Finally, the user could transfer from a VoiceXML application to a customer service agent. • The agent needs the ability to use their console to view information about the customer, as collected during the preceding VoiceXML application. The ability to transfer a session identifier can be used to retrieve this information from the customer database.
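The role of the session identifier can be sketched as a lookup key into shared state. The Python below is a minimal illustration under the assumption of a single in-memory store; `SESSION_STORE`, `start_visual_session`, and `resume_in_voice_browser` are hypothetical names, and a real deployment would use a database reachable by both applications.

```python
import uuid

# Hypothetical shared store standing in for the customer database.
SESSION_STORE = {}

def start_visual_session(preferences):
    """Record data entered in the visual application; return a session id."""
    session_id = uuid.uuid4().hex
    SESSION_STORE[session_id] = dict(preferences)
    return session_id

def resume_in_voice_browser(session_id):
    """Retrieve the data using only the transferred session identifier."""
    return SESSION_STORE.get(session_id, {})
```

Only the identifier travels with the call; the voice application or the agent console uses it to retrieve everything else.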
Voice Style Sheets? • Some extensions are proposed to HTML 4.0 and CSS2 to support voice browsing • Prerecorded content is likely to include music and different speakers. These effects can be reproduced to some extent via the aural style sheets features in CSS2.
Voice Style Sheets! Authors want control over how the document is rendered. Aural style sheets (part of CSS2) provide a basis for controlling a range of features: • Volume • Rate • Pitch • Direction • Spelling out text letter by letter • Speech fonts (male/female, adult/child etc.) • Inserted text before and after element content • Sound effects and music
How Does It Work? • How do I connect? • Do I speak to the browser or does the browser speak to me? • What is seen on the screen? • How do I enter input?
Problems • How does the browser understand what I say? • How can I tell it what I want? • …what if it doesn’t understand?
Overview on Speech Technologies • Speech Synthesis • Text to Speech • Speech Recognition • Speech Grammars • Stochastic n-gram models • Semantic Interpretation
What is Speech Synthesis? • Generating machine voice by arranging phonemes (k, ch, sh, etc.) into words. • There are several algorithms for performing Speech Synthesis. The choice depends on the task they're used for.
How is Speech Synthesis Performed? • The easiest way is to just record the voice of a person speaking the desired phrases. • This is useful if only a restricted set of phrases and sentences is used, e.g. schedule information for incoming flights. The quality depends on the way the recording is done.
How is Speech Synthesis Performed? • Another option is to record a large database of words. • Requires large memory storage • Limited vocabulary • No prosodic information • More sophisticated but worse in quality are Text-To-Speech algorithms.
How is Speech Synthesis Performed? Text To Speech • Text-To-Speech algorithms split the speech into smaller pieces. The smaller the units, the fewer of them there are, but the quality also decreases. • An often-used unit is the phoneme, the smallest linguistic unit. Western European languages have about 35-50 phonemes each, depending on the language, so only 35-50 single recordings are needed. Example: february twenty fifth: f eh b r ax r iy t w eh n t iy f ih f th
Text To Speech • The problem is that combining phonemes into fluent speech requires smooth transitions between the elements, which separately recorded phonemes do not provide. The intelligibility is therefore lower, but the memory required is small. • A solution is using diphones. Instead of splitting at the transitions, the cut is done at the center of the phonemes, leaving the transitions themselves intact.
Text To Speech • With about 40 phonemes, this means approximately 1600 recordings are now needed (40*40). • The longer the units become, the more elements there are, but the quality increases along with the memory required.
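The move from phonemes to diphones, and the 40*40 estimate, can be illustrated with a short Python sketch. Padding the sequence with a silence marker (`sil`) at both ends is an assumption made here so that the first and last phonemes also get a transition unit; the function name `to_diphones` is hypothetical.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor so transitions stay intact."""
    padded = ["sil"] + list(phonemes) + ["sil"]  # assumed silence padding
    return [f"{first}-{second}" for first, second in zip(padded, padded[1:])]

# Every ordered pair of phonemes is a potential diphone.
PHONEME_INVENTORY_SIZE = 40
diphone_inventory_size = PHONEME_INVENTORY_SIZE ** 2

fifth = to_diphones(["f", "ih", "f", "th"])
```

For the word "fifth", this yields the units `sil-f`, `f-ih`, `ih-f`, `f-th`, and `th-sil`, each of which would be looked up among the roughly 1600 recordings.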
Text To Speech • Other units which are widely used are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings. • TTS is dictionary-driven. The larger the dictionary resident in the browser is, the better the quality. • For unknown words, the engine falls back on rules for regular pronunciation.
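Dictionary lookup with a rule-based fallback might look like the following Python sketch. The lexicon entries reuse the phoneme strings from the "february twenty fifth" example; the letter-to-sound table is a deliberately crude, hypothetical stand-in for real pronunciation rules.

```python
# Small pronunciation dictionary, using the phoneme notation from the slides.
LEXICON = {
    "february": ["f", "eh", "b", "r", "ax", "r", "iy"],
    "twenty": ["t", "w", "eh", "n", "t", "iy"],
    "fifth": ["f", "ih", "f", "th"],
}

# Hypothetical fallback rules: one sound per vowel letter, consonants as-is.
LETTER_SOUNDS = {"a": "ae", "e": "eh", "i": "ih", "o": "aa", "u": "ah"}

def pronounce(word):
    """Use the dictionary when possible, otherwise apply the letter rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_SOUNDS.get(letter, letter) for letter in word]
```

A known word such as "fifth" comes straight from the dictionary, while an unknown word such as "bat" is assembled letter by letter, which is why a larger dictionary tends to give better quality.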
Text To Speech • Vocabulary is unlimited!!! • But what about the prosodic information? • Pronunciation depends on the context in which a word occurs. Limited linguistic analysis is needed. • How can I help? • Help is on the way!
Text To Speech • Another example: • I have read the first chapter. • I will read some more after lunch. • For these cases, and in the cases of irregular words and name pronunciation, authors need a way to provide supplementary TTS information and to indicate when it applies.
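The "read" example shows why some linguistic analysis is needed. A very limited version can be sketched in Python: the rule below, which treats "read" after "have", "has", or "had" as the past participle, is an assumption chosen to cover the two sentences above and would not handle every context.

```python
# Hypothetical cue words that signal the past participle of "read".
PAST_PARTICIPLE_CUES = {"have", "has", "had"}

def pronounce_read(sentence):
    """Pick a pronunciation of 'read' from the word that precedes it."""
    words = sentence.lower().strip(" .!?").split()
    index = words.index("read")
    previous = words[index - 1] if index > 0 else ""
    return "r eh d" if previous in PAST_PARTICIPLE_CUES else "r iy d"
```

The first sentence yields `r eh d` and the second `r iy d`. Cases a rule like this cannot resolve are where author-supplied pronunciation information becomes necessary.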
Text To Speech • But specialized representations for phonemic and prosodic information can be off-putting for non-specialist users. • For this reason it is common to see simplified ways to write down pronunciation. For instance, the word "station" can be defined as: station: stay-shun
Text To Speech • This approach encourages users to add pronunciation information, leading to an increase in the quality of spoken documents, compared to more complex and harder to learn approaches. • This is where W3C comes in: Providing a specification to enable consistent control (generating, authoring, processing) of voice output by speech synthesizers for varying speech content, for use in voice browsing and in other contexts.
Overview on Speech Technologies • Speech Synthesis • Text to Speech • Speech Recognition • Speech Grammars • Stochastic n-gram models • Semantic Interpretation
Speech Recognition • Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. • Speech is first digitized and then matched against a dictionary of coded waveforms. The matches are converted into text.
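Matching digitized speech against a dictionary of coded waveforms can be pictured as a nearest-template search. The Python sketch below is a heavily simplified assumption: each word is represented by a made-up three-number feature vector, and the closest template by Euclidean distance wins. Real recognizers use much richer acoustic features and statistical models.

```python
import math

# Hypothetical templates: one toy feature vector per vocabulary word.
TEMPLATES = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
    "help": [0.4, 0.4, 0.9],
}

def recognize(features):
    """Return the word whose template is closest to the input features."""
    return min(TEMPLATES, key=lambda word: math.dist(features, TEMPLATES[word]))
```

An input close to the "yes" template, such as `[0.85, 0.15, 0.25]`, is converted to the text "yes".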
Speech Recognition Types of voice recognition applications: • Command systems recognize a few hundred words and eliminate the need to use the mouse or keyboard for repetitive commands. • Discrete voice recognition systems are used for dictation, but require a pause between each word. • Continuous voice recognition understands natural speech without pauses and is the most processing-intensive.