Trying to Understand Misunderstanding: How Robust Can Spoken Natural Language Dialogue Systems Be? Ronnie W. Smith East Carolina University
Sponsors • National Science Foundation • Duke University • East Carolina University • DARPA • BBN
Collaborators 1987-1994: Dr. Alan Biermann, Dr. Ruth Day, Dr. Robert Rodman, Richard Hipp, Barry Koster, Dania Egedi, Robin Gambill, Curry Guinn 1994-2000: Dr. Steve Gordon, Robert Hoggard, Shannon Pollard, Chris Shaffer, Greg Keim, Jeremy Mauget, Devender Singh, Jason Carrow, Brian Manning
Quick Outline • Initial motivation and results • Careful studies • Not so careful studies • What I’d really like to do!
Initial Research Focus Develop a model of dialog processing that: • Uses problem-solving to achieve target goal • Exploits contextual knowledge from the discourse and user • Allows mixed-initiative interaction (computer or user controlled) • Allows voice interaction • Operates in real time to permit experimentation
Experimental Evaluation Issues: • Is the computational model practical? • Would users' dialog behavior change as they gained experience and were given control of the dialog? Hypotheses---As users gained experience and were given the initiative, they • Would complete the task faster • Would speak fewer utterances • Would speak longer utterances • Would spend more time thinking before producing a response Side effect: analyzable data
Data Collection: The Circuit Fix-It Shop • Constructed at Duke University • Assisted users in repairing an electronic circuit • 125-word vocabulary, speaker-dependent connected speech • Initiative level varied across interactions but fixed within each one • 141 dialogs, 2840 user utterances • 114 or 84% of the dialogs were successfully completed • 50% utterance recognition rate • 81.5% correct interpretation rate
Sample Interaction – Part 1
C: This is the Circuit Fix-It Shop. How may I help you?
U: This RS111 circuit is not working.
C: I am familiar with that circuit. The LED is supposed to be displaying alternately flashing one and seven.
U: The LED is flashing seven.
C: Did you mean to say that the LED is flashing?
U: The LED is displaying a flashing seven.
C: Is the one on the LED displaying?
U: No.
C: Is anything else on the LED displaying other than the NOT** flashing seven?
U: The seven is flashing.
C: Is anything else on the LED displaying other than the flashing seven?
U: No.
C: What is the switch at when the LED is displaying only a flashing seven?
U: It is up.
C: What is the knob at when the LED is displaying only a flashing seven?
U: The knob is at ten.
Sample Interaction – Part 2
C: Is there a wire between connector one eight and connector four one?
U: Where is connector four one?
C: Below the white knob with the orange circle is a pink box, called transistors. Now, in the upper right of this pink box, called transistors is connector four one.
U: The wire is not there.
C: Did you mean to say that there IS** a wire between connectors one eight and four one?
U: There is no wire.
C: Add a wire between connector one eight and connector four one.
Initial Miscommunication Handling Techniques Error-correcting Parsing (Hipp) • Convert input to “nearest” grammatical utterance • “nearest” is determined by a cost matrix for insertions, deletions, and substitutions of words • Costs are not all the same (e.g., “a” vs. “not”) Tell the user what went wrong • Only tell user what computer’s interpretation was • Only when misrecognition caused contradictory interpretation (but required for only 48% of these)
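
The "nearest grammatical utterance" idea can be illustrated with a weighted word-level edit distance. This is only a minimal sketch, not the parser used in the Circuit Fix-It Shop: it assumes a small explicit list of grammatical candidates instead of a grammar, and the cost values, the `CHEAP_WORDS` set, and all function names are hypothetical.

```python
from typing import Sequence

# Hypothetical costs: function words such as "a" are cheap to insert or
# delete, while meaning-bearing words such as "not" are expensive.
CHEAP_WORDS = {"a", "the", "is"}


def word_cost(word: str) -> float:
    """Cost of inserting or deleting a single word (assumed values)."""
    return 0.2 if word in CHEAP_WORDS else 1.0


def substitution_cost(heard: str, target: str) -> float:
    """Cost of replacing one word with another; identical words are free."""
    if heard == target:
        return 0.0
    return max(word_cost(heard), word_cost(target))


def correction_cost(heard: Sequence[str], target: Sequence[str]) -> float:
    """Minimum total cost of turning the heard words into the target words."""
    rows, cols = len(heard) + 1, len(target) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):  # delete every heard word
        table[i][0] = table[i - 1][0] + word_cost(heard[i - 1])
    for j in range(1, cols):  # insert every target word
        table[0][j] = table[0][j - 1] + word_cost(target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j] + word_cost(heard[i - 1]),       # deletion
                table[i][j - 1] + word_cost(target[j - 1]),      # insertion
                table[i - 1][j - 1]
                + substitution_cost(heard[i - 1], target[j - 1]),  # substitution
            )
    return table[-1][-1]


def nearest_utterance(heard: str, candidates: Sequence[str]) -> str:
    """Pick the grammatical candidate that is cheapest to reach from the input."""
    words = heard.split()
    return min(candidates, key=lambda c: correction_cost(words, c.split()))
```

With these assumed costs, `nearest_utterance("there is wire", ["there is no wire", "there is a wire"])` prefers "there is a wire", because inserting "a" is cheaper than inserting "no". This is the sense in which unequal costs keep the parser from casually adding or dropping a negation.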
What to Do Next? Get a better speech recognizer! Well--- • Better is not the same as perfect! • Better => stretch its limits anyway • There will probably always be ungrammatical spoken inputs. • There will always be mismatched speaker/hearer background knowledge.
What to Do Next? Investigate strategies for the prevention, detection, and repair of miscommunication in natural language dialog • Detailed analysis of existing dialogs • Development and evaluation of strategies for handling miscommunication
Effects of Variable Initiative on Linguistic Behavior in Human-Computer Spoken Natural Language Dialog • Smith and Gordon (Computational Linguistics, March 1997) • Based on Circuit Fix-It Shop Data • Based on classifying utterances according to task phase • Introduction: establish task purpose • Assessment: establish current system behavior • Diagnosis: establish cause for errant behavior • Repair: establish completion of correction • Test: establish correctness of behavior
Result 1: Relative Number of Utterances Conclusion: Experienced users tend not to discuss details they can handle themselves.
Result 2: Frequency of User Subdialog Transitions Conclusion: Computer initiates most subdialogs except when experienced users are completing the task.
Result 3: Predictability of Subdialog Transitions Idealized Transition Model [Figure: transition diagram over the phases I, A, D, R, T, and F]
Result 3: Predictability of Subdialog Transitions Empirical Transition Model [Figure: transition diagram over the phases I, A, D, R, T, and F, with each transition labeled by its percentage in computer-controlled and user-controlled dialogs] • Percentage “normal” dialogs • Computer-controlled: 64% • User-controlled: 33%
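
An empirical transition model of this kind can be estimated by counting phase-to-phase transitions in dialogs whose utterances have already been classified by task phase. The sketch below is illustrative only. It assumes each dialog is given as a sequence of phase letters, that "F" marks the end of the dialog, and that a "normal" dialog is one that follows the phase order I, A, D, R, T, F exactly; the study's own definitions may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence

# Assumed ideal phase order: Introduction, Assessment, Diagnosis, Repair,
# Test, then a finish marker "F" (the meaning of "F" is an assumption).
IDEAL_ORDER = ["I", "A", "D", "R", "T", "F"]


def transition_percentages(
    dialogs: Sequence[Sequence[str]],
) -> Dict[str, Dict[str, float]]:
    """For each source phase, the percentage of transitions to each next phase."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for phases in dialogs:
        for src, dst in zip(phases, phases[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: 100.0 * n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


def percent_normal(dialogs: Sequence[Sequence[str]]) -> float:
    """Percentage of dialogs that follow the assumed ideal order exactly."""
    if not dialogs:
        return 0.0
    normal = sum(1 for phases in dialogs if list(phases) == IDEAL_ORDER)
    return 100.0 * normal / len(dialogs)
```

Running both functions separately on the computer-controlled and user-controlled dialogs would give one set of transition percentages and one "normal" percentage per initiative level, which is the comparison the slide reports.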
Study Conclusions Computer controlled dialogs--- • Have an orderly pattern of computer-initiated subdialogs • Have terse user responses • Are not amenable to user-correction during miscommunication User controlled dialogs--- • Are less orderly • Contain more user-initiated subdialogs • Indicate user willingness to exploit growing expertise
Analysis of Strategies for Selective Utterance Verification • Smith (ANLP, 1997; IJHCS, 1998) • Motivation---miscommunication due to speech recognition errors
Spoken: I want to fix this circuit
Recognized: power a six a circuit
Spoken: there is no wire on connector one zero four
Recognized: stays no wire I connector one zero four
Verification Subdialogs
Computer: This is the circuit fix-it shop. How may I help you?
Spoken: I want to fix a circuit.
Recognized: power a six a circuit.
Computer: Did you mean to say there is a power circuit?
WHEN TO USE THIS??
Goal: Selective Verification • Initiate a verification subdialog only when it is believed to be needed. • Criteria for need: sufficiently unsure you’ve fully understood AND the need to fully understand is sufficiently great. • Terminology • Under-verification---system generates an incorrect meaning that is not verified • Over-verification---a correct meaning is verified • Ideal: minimize under-verifications while keeping over-verifications to a minimum as well
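
The two error types can be counted directly once each utterance is labeled with whether the system's interpretation was correct and whether a verification subdialog was initiated. The following is a minimal sketch of that bookkeeping; the `Outcome` record and the function name are hypothetical and are not taken from the original study.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Outcome:
    """One utterance: was the interpretation correct, and was it verified?"""
    interpretation_correct: bool
    verified: bool


def verification_rates(outcomes: Sequence[Outcome]) -> Dict[str, float]:
    """Fraction of utterances that were under-verified and over-verified."""
    total = len(outcomes)
    if total == 0:
        return {"under": 0.0, "over": 0.0}
    # Under-verification: an incorrect meaning that was not verified.
    under = sum(
        1 for o in outcomes if not o.interpretation_correct and not o.verified
    )
    # Over-verification: a correct meaning that was verified anyway.
    over = sum(1 for o in outcomes if o.interpretation_correct and o.verified)
    return {"under": under / total, "over": over / total}
```

Comparing these two rates across candidate verification strategies is one way to see the trade-off: a strategy that verifies everything drives under-verification to zero at the price of a high over-verification rate.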
Measurements of Uncertainty • Parse Cost---sum of costs incurred by error-correcting parser in transforming input to a grammatical utterance • Expectation Cost---how expected was the response given the dialog context
Measuring Utterance Importance • Unexplored • Domain-dependent? • Fixed-threshold (depends on risk due to miscommunication)
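
Taken together, the uncertainty measures and the importance threshold suggest a simple decision rule: verify only when the system is unsure enough and the utterance matters enough. The sketch below is one hypothetical way to write that rule. The weighted combination of parse cost and expectation cost, the default values, and every parameter name are assumptions, not the strategy evaluated in the study.

```python
def should_verify(
    parse_cost: float,
    expectation_cost: float,
    uncertainty_threshold: float,
    importance: float = 1.0,
    importance_threshold: float = 0.5,
    parse_weight: float = 0.5,
) -> bool:
    """Decide whether to start a verification subdialog.

    Uncertainty is an assumed weighted average of the two costs. Importance
    is a placeholder score compared against a fixed threshold, since the
    slides leave the measurement of importance unexplored.
    """
    uncertainty = (
        parse_weight * parse_cost + (1.0 - parse_weight) * expectation_cost
    )
    unsure_enough = uncertainty > uncertainty_threshold
    important_enough = importance >= importance_threshold
    return unsure_enough and important_enough
```

Raising `uncertainty_threshold` reduces over-verifications but lets more misunderstandings through unverified, so its setting depends on how costly an under-verification is in the task at hand.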
Basic Results Generalization: This result holds under most threshold values.
Study Conclusions • Key parameters for selecting strategy and threshold value: expectation strength, word recognition rate, and penalty for under-verifications • Results depend on the level of system initiative • Results assume all verification subdialogs succeed • Limits to verification subdialogs => other strategies are needed.
General Problem • “Knowing” when you don’t know. OR • “Suspecting” when you don’t know. BUT • If you don’t know, how do you talk about it? THUS • System must provide some information on its understanding of context
Domain-Level Techniques • Detection of dead ends (suspicious error path checks out OK)
C: The LED is supposed to be displaying alternately flashing one and seven.
U: LED displaying seven (misrecognized and understood as “LED displaying flashing seven”)
C: Is anything else on the LED on?
U: No.
C: Is there a wire between connectors 18 and 41?
U: Yes
C: Are you sure that the LED is displaying only a flashing seven?
Domain-Level Techniques • Detection of impossibilities (description not possible given “known” conditions)
C: What is the LED displaying?
U: The LED is displaying a one on the left hand side.
C: Is anything else on the LED on?
U: No. (NOTE: misstatement---seven also displaying)
C: Is the one on the LED flashing?
U: Yes.
C: Are you sure that the one on the LED is flashing?
U: Yes.
C: Are you sure that the seven on the LED is off?
U: The LED is displaying alternately flashing one and seven.
Domain-Level Techniques In general, these are detections of logical inconsistencies, but are there general dialogue strategies for discussing them?
Grounding---establishing common context of understanding • Beyond context-free acknowledgments
U: The LED is displaying a flashing seven. (misunderstood)
C: Is the one on the LED displaying?
U: No.
C: Is anything else on the LED displaying other than the NOT** flashing seven?
U: The seven is flashing.
C: Is anything else on the LED displaying other than the flashing seven?
U: No.
• How much grounding is enough?
The Initiative Factor • Explicit semantic grounding leverages human intelligence • But only if we allow it!
C: Whether there is a wire between connectors 111 and 120 needs to be observed.
U: There is a wire between connectors 111 and 120 (misrecognized).
C: There is not supposed to be a wire between connectors 111 and 1.
U: (repeats and is understood)
The Initiative Factor • Too much computer control?
C: What is the voltage between connectors 121 and 120? (computer suspects problem in power circuit)
U: I believe there’s a problem with the LED circuit.
C: What is the voltage between connectors 121 and 120?
Suppose instead the response is---
U: Voltage is eight (misrecognized as control knob at six).
Then computer response is okay.
Final Thoughts/Summary • Design systems to leverage human intelligence • The ability to follow step-by-step instruction? • Varying levels of system initiative • VERY challenging when user expertise evolves. • Menus vs. keyboard shortcuts???? • Explicit semantic grounding • Verification subdialogs, etc. • What’s the right amount? • We still need carefully designed studies with real systems!!!