Spoken Dialogue Systems Julia Hirschberg CS 4706
Issues • Error avoidance • Error detection • From the system side: how likely is it the system made an error? • From the user side: what cues does the user provide to indicate an error? • Error handling: what can the system do when it thinks an error has occurred? • Evaluation: how do you know what needs fixing most?
Avoiding misunderstandings • By imitating human performance • Timing and grounding (Clark ’03)
Recognizing Problematic Dialogues • Hastie et al, “What’s the Trouble?” ACL 2002.
Recognizing Problematic Utterances (Hirschberg et al ’99--) • Collect corpus from interactive voice response system • Identify speaker ‘turns’ that are: incorrectly recognized (misrecognitions); where the speaker first becomes aware of an error (aware sites); or that correct misrecognitions (corrections) • Identify prosodic features of turns in each category and compare to other turns • Use Machine Learning techniques to train a classifier to make these distinctions automatically
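The final step above, training a classifier on prosodic features, can be sketched roughly as follows. This is only an illustration and not the learner used in the cited work: the feature names (`f0_max`, `duration`), the toy values, and the single-threshold "decision stump" rule are all assumptions made for the example.

```python
# Minimal sketch: learn a one-feature threshold rule ("decision stump")
# that separates misrecognized turns from correctly recognized ones.
# Feature names and all numbers below are invented for illustration.

def train_stump(turns, labels):
    """Return (feature, threshold, error_count) with the fewest training errors.

    The learned rule predicts "misrecognized" when turn[feature] > threshold.
    """
    best = None
    for feature in turns[0]:
        # Try every observed value of this feature as a candidate threshold.
        for candidate in sorted({t[feature] for t in turns}):
            errors = sum(
                (t[feature] > candidate) != label
                for t, label in zip(turns, labels)
            )
            if best is None or errors < best[2]:
                best = (feature, candidate, errors)
    return best


def predict(stump, turn):
    """Apply a learned stump to one turn; True means "misrecognized"."""
    feature, threshold, _ = stump
    return turn[feature] > threshold


# Hypothetical per-turn prosodic features (Hz, seconds).
toy_turns = [
    {"f0_max": 180.0, "duration": 1.1},
    {"f0_max": 190.0, "duration": 1.4},
    {"f0_max": 260.0, "duration": 2.9},
    {"f0_max": 275.0, "duration": 3.2},
]
toy_labels = [False, False, True, True]  # True = misrecognized (made up)

stump = train_stump(toy_turns, toy_labels)
```

A real system would use many more features and a held-out test set; the stump only shows the shape of the "features in, turn category out" mapping.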
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York.
Turn labels: misrecognition, correction, aware site
Results • Reduced error in predicting misrecognized turns to 8.64% • Error in predicting ‘awares’ (12%) • Error in predicting corrections (18-21%)
Evidence from Human Performance • Users provide explicit positive and negative feedback • Corpus-based vs. laboratory experiments – do these tell us different things? • Bell & Gustafson ’00 • What do we learn from this? • What functions does feedback serve? • Krahmer et al • ‘go on’ and ‘go back’ signals in grounding situations (implicit/explicit verification)
Pos: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info • Neg: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info • Hypotheses supported but… • Can these cues be identified automatically? • How might they affect the design of SDS?
Error Handling Strategies • Goldberg et al ’03: how should systems best inform the user that they don’t understand? • System rephrasing vs. repetitions vs. statement of not understanding • Apologies • What behaviors might these produce? • Hyperarticulation • User frustration • User repetition or rephrasing
What lessons do we learn? • What produces least frustration? • Best recognized input?
Evaluating Dialogue Systems • PARADISE framework (Walker et al ’00) • “Performance” of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and how it gets accomplished
[Figure: Maximize Task Success; Minimize Costs (Efficiency Measures, Qualitative Measures)]
Task Success • Task goals seen as Attribute-Value Matrix • ELVIS e-mail retrieval task (Walker et al ’97) • “Find the time and place of your meeting with Kim.”
Attribute              Value
Selection Criterion    Kim or Meeting
Time                   10:30 a.m.
Place                  2D516
• Task success defined by match between AVM values at end of dialogue with “true” values for AVM
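The match between the final AVM and the "true" AVM can be sketched as below. This is a simplification: the function name `avm_match` and the plain match ratio are assumptions for illustration (the framework is usually described with a chance-corrected agreement measure), and the observed outcome is invented.

```python
# Minimal sketch: score task success as the fraction of AVM attributes whose
# value at the end of the dialogue matches the reference ("true") value.
# The function name and the plain match ratio are assumptions for illustration.

def avm_match(observed, reference):
    """Return the proportion of reference attributes matched in `observed`."""
    if not reference:
        raise ValueError("reference AVM must not be empty")
    matches = sum(
        1 for attribute, value in reference.items()
        if observed.get(attribute) == value
    )
    return matches / len(reference)


# Reference AVM for the example task above.
reference_avm = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D516",
}

# Hypothetical dialogue outcome: correct time, wrong room.
observed_avm = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D518",
}

score = avm_match(observed_avm, reference_avm)  # two of three attributes match
```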
Metrics • Efficiency of the Interaction: User Turns, System Turns, Elapsed Time • Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests • User Satisfaction • Task Success: perceived completion, information extracted
Experimental Procedures • Subjects given specified tasks • Spoken dialogues recorded • Cost factors, states, dialog acts automatically logged; ASR accuracy, barge-in hand-labeled • Users specify task solution via web page • Users complete User Satisfaction surveys • Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors
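The regression step above can be sketched with ordinary least squares. This is a minimal pure-Python illustration under stated assumptions: the helper `fit_linear` is hypothetical, the data are invented (generated from known coefficients so the fit can be checked), and significance testing of the predictors is not shown.

```python
# Minimal sketch: ordinary least squares for y = b0 + b1*x1 + ... + bk*xk,
# solved through the normal equations (X^T X) b = X^T y.
# All data below are invented; they are not results from any system.

def fit_linear(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] minimizing squared error."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    n = len(design[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical per-dialogue measurements: (task completion, recognition score).
toy_predictors = [(0.0, 0.5), (1.0, 0.5), (0.0, 0.9), (1.0, 0.9), (1.0, 0.7)]
# Toy satisfaction scores generated exactly as 1 + 2*COMP + 3*MRS.
toy_satisfaction = [1 + 2 * c + 3 * m for c, m in toy_predictors]

# Should recover an intercept near 1 and coefficients near 2 and 3.
coefficients = fit_linear(toy_predictors, toy_satisfaction)
```

In practice a statistics package would be used so that each coefficient comes with a significance test; the sketch only shows how the coefficients themselves arise from the logged measures.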
User Satisfaction: Sum of Many Measures
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)
Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
• COMP: User perception of task completion (task success)
• MRS: Mean recognition score (cost)
• ET: Elapsed time (cost)
• Help: Help requests (cost)
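As a rough illustration of how such a performance function is applied, the sketch below evaluates the three functions for one dialogue. The coefficients are copied as listed above; the function name `predicted_satisfaction`, the assumption that inputs are already normalized (for example as z-scores), and the input values themselves are all hypothetical.

```python
# Minimal sketch: evaluate the performance functions listed above.
# Inputs are assumed to be already normalized (e.g. z-scores); the input
# values used here are invented for illustration.

PERFORMANCE_FUNCTIONS = {
    # Coefficients copied from the slide, signs included.
    "ELVIS": {"COMP": 0.21, "MRS": 0.47, "ET": -0.15},
    "TOOT": {"COMP": 0.35, "MRS": 0.45, "ET": -0.14},
    "ANNIE": {"COMP": 0.33, "MRS": 0.25, "Help": 0.33},
}


def predicted_satisfaction(system, measures):
    """Weighted sum of the measures that the given system's function uses."""
    weights = PERFORMANCE_FUNCTIONS[system]
    return sum(weight * measures[name] for name, weight in weights.items())


# Hypothetical normalized measures for a single dialogue.
toy_measures = {"COMP": 1.0, "MRS": 0.5, "ET": 2.0}

toot_score = predicted_satisfaction("TOOT", toy_measures)
elvis_score = predicted_satisfaction("ELVIS", toy_measures)
```

Comparing predicted scores before and after a proposed change to a system is one way such a model could support the development decisions discussed next.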
Performance Model • Perceived task completion and mean recognition score are consistently significant predictors of User Satisfaction • Performance model useful for system development • Making predictions about system modifications • Distinguishing ‘good’ dialogues from ‘bad’ dialogues • But can we also tell on-line when a dialogue is ‘going wrong’?
Next Week • Speech summarization and data mining