
The Use of Speech in Speech-to-Speech Translation




Presentation Transcript


  1. The Use of Speech in Speech-to-Speech Translation Andrew Rosenberg 8/31/06 Weekly Speech Lab Talk

  2. Candidacy Exam Organization
     • Use and Meaning of Intonation
     • Automatic Analysis of Intonation
     • Applications
       • Speech-to-Speech Translation
       • L2 Learning Systems

  3. The Use of Speech in Speech-to-Speech Translation
     [Diagram: a cascaded ASR → MT → TTS pipeline contrasted with an integrated ASR + MT → TTS architecture]
     • The Use of Prosodic Event Information
       • On the Use of Prosody in a Speech-to-Speech Translator, Strom et al. 1997
       • A Japanese-to-English Speech Translation System: ATR-MATRIX, Takezawa et al. 1998
     • Cascaded / Loosely Coupled Approaches
       • Janus-III: Speech-to-Speech Translation in Multiple Languages, Lavie et al. 1997
       • A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation, Zhang et al. 2004
     • Integrated / Tightly Coupled Approaches
       • Finite State Speech-to-Speech Translation, Vidal 1997
       • On the Integration of Speech Recognition and Statistical Machine Translation, Matusov 2005
       • Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation, Gao 2003

  4. The Use of Speech in Speech-to-Speech Translation
     • The Use of Prosodic Event Information
       • On the Use of Prosody in a Speech-to-Speech Translator, Strom et al. 1997
       • A Japanese-to-English Speech Translation System: ATR-MATRIX, Takezawa et al. 1998
     • Cascaded / Loosely Coupled Approaches
       • Janus-III: Speech-to-Speech Translation in Multiple Languages, Lavie et al. 1997
       • A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation, Zhang et al. 2004
     • Integrated / Tightly Coupled Approaches
       • Finite State Speech-to-Speech Translation, Vidal 1997
       • On the Integration of Speech Recognition and Statistical Machine Translation, Matusov 2005
       • Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation, Gao 2003

  5. On the Use of Prosody in a Speech-to-Speech Translator, Strom et al. 1997
     • INTARC: a German-English translator produced for the VERBMOBIL project
       • Spontaneous, limited-domain speech (appointment scheduling)
       • 80 minutes of prosodically labeled speech
     • Phrase Boundary (PB) Detector
       • Gaussian classifier based on F0, energy, and time features over a 4-syllable window (accuracy 80.76%)
     • Focus Detector
       • Rule-based approach: identifies the location of the steepest F0 decline (accuracy 78.5%)
     • Syntactic parsing search space is reduced by 65%
       • Baseline syntactic parsing combines three factors (see the sketch below):
         • Decoder factor: product of acoustic and bigram scores
         • Grammar factor: grammar-model probability of a parse using the hypothesized word
         • Prosody factor: 4-gram model of prosodic events (focus and PB)
     • Semantic parsing search space is reduced by 24.7%
       • The semantic grammar was augmented, labeling rules as "segment-connecting" (SC) and "segment-internal" (SI)
       • SC rules are applied when there is a PB between segments; SI rules are applied when there is not
       • Ideal phrase boundaries reduced the number of hypotheses by 65.4% (analysis trees by 41.9%)
       • Automatically hypothesized PBs required a backoff mechanism to handle errors and PBs that do not align with grammatical phrase boundaries
     • Prosodically driven translation is used when deep transfer (translation) fails
       • A focused word determines (probabilistically) a dialog act, which is translated based on the available information from the word chain
       • Correct: 50%, Incomplete: 45%, Incorrect: 5%
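The three-factor combination above amounts to a weighted product of probabilities, i.e., a weighted sum of log scores. A minimal sketch, with hypothetical names and weights (Strom et al. do not publish this interface):

```python
def parse_hypothesis_score(acoustic_logp, bigram_logp,
                           grammar_logp, prosody_logp,
                           weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the decoder, grammar, and prosody factors in log space.

    The decoder factor is the product of acoustic and bigram scores;
    the grammar factor is the parse probability; the prosody factor is
    a 4-gram model over prosodic events (focus, phrase boundary).
    Weights are illustrative, not values from the paper.
    """
    w_ac, w_bi, w_gr, w_pr = weights
    return (w_ac * acoustic_logp + w_bi * bigram_logp
            + w_gr * grammar_logp + w_pr * prosody_logp)
```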

  6. A Japanese-to-English Speech Translation System: ATR-MATRIX, Takezawa et al. 1998
     • Limited-domain translation system (hotel reservations)
     • Cascaded approach
       • ASR: sequential model, ~2k-word vocabulary
       • MT: syntactically driven, ~12k-word vocabulary
       • TTS: CHATR (at the time concatenative, now unit selection)
     • An early example of "interactive" speech-to-speech translation
       • When the system has low confidence in either the recognition or MT output, it prompts the user for corrections
     • Speech information is used in three ways in ATR-MATRIX
       • Voice selection: based on the source voice, either a male or female voice is used for synthesis
       • Hypothesized phrase boundaries: pause information, combined with POS n-gram information, divides the source utterance into "meaningful chunks" for translation (see the sketch below)
       • Phrase-final behavior: if a phrase-final rise is detected, it is passed to the MT module as a "lexical" item potentially indicating a question
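ATR-MATRIX combines pause durations with POS n-gram evidence to find chunk boundaries; the sketch below simplifies this to a bare pause threshold and shows how a detected phrase-final rise could be injected as a pseudo-lexical question marker. All names and the threshold value are assumptions, not the paper's interface.

```python
def chunk_for_translation(words, pause_after, rise_detected):
    """Split a recognized word sequence into translation chunks at long
    pauses; append a question marker when a phrase-final rise is found.

    words:         1-best recognized word sequence
    pause_after:   silence duration (seconds) following each word
    rise_detected: flag from the prosodic analyzer
    """
    PAUSE_THRESHOLD = 0.3  # illustrative value, not from the paper
    chunks, current = [], []
    for word, pause in zip(words, pause_after):
        current.append(word)
        if pause >= PAUSE_THRESHOLD:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    if rise_detected and chunks:
        chunks[-1].append("<RISE>")  # handed to MT as a "lexical" item
    return chunks
```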

  7. The Use of Speech in Speech-to-Speech Translation
     • The Use of Prosodic Event Information
       • On the Use of Prosody in a Speech-to-Speech Translator, Strom et al. 1997
       • A Japanese-to-English Speech Translation System: ATR-MATRIX, Takezawa et al. 1998
     • Cascaded / Loosely Coupled Approaches
       • Janus-III: Speech-to-Speech Translation in Multiple Languages, Lavie et al. 1997
       • A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation, Zhang et al. 2004
     • Integrated / Tightly Coupled Approaches
       • Finite State Speech-to-Speech Translation, Vidal 1997
       • On the Integration of Speech Recognition and Statistical Machine Translation, Matusov 2005
       • Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation, Gao 2003

  8. Janus-III: Speech-to-Speech Translation in Multiple Languages, Lavie et al. 1997
     [Diagram: cascaded ASR → MT → TTS pipeline with a late disambiguation step before generation]
     • Interlingua- and frame-slot-based Spanish-English translation
       • Limited-domain (conference registration), spontaneous speech
     • Cascaded approach
     • Two semantic parse techniques
       • GLR* interlingua parsing (transcript: 82.9%; ASR: 54%)
         • A manually constructed grammar parses the input into an interlingua
         • Robust: does not require "grammatically correct" input
         • Searches for the maximal subset of the input covered by the grammar
         • Generation is performed by an interlingua generator
       • Phoenix (transcript: 76.3%; ASR: 48.6%)
         • Identifies key concepts and their structure
         • The parsing grammar contains specific patterns that represent domain concepts
         • The patterns are compiled into a "recursive transition network"
         • Each concept has one or more fixed phrasings in the target language
       • Phoenix is used as a backoff when GLR* fails
         • Transcript: 83.3%; ASR: 63.6%
     • Late-stage disambiguation (see the sketch below)
       • Multiple translations are processed through the whole system
       • Translation hypothesis selection occurs just before generation, using scores from recognition, parsing, and discourse processing
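Late-stage disambiguation reduces to picking the hypothesis with the best combined score just before generation. A minimal sketch, assuming each hypothesis carries the three scores named above (the field names are invented):

```python
def select_for_generation(hypotheses):
    """Pick the translation hypothesis with the best combined score
    from recognition, parsing, and discourse processing (log-domain)."""
    return max(hypotheses,
               key=lambda h: (h["recognition_score"]
                              + h["parsing_score"]
                              + h["discourse_score"]))
```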

  9. A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation, Zhang et al. 2004
     • Process many hypotheses, then select one
     • In a cascaded architecture:
       • HMM-based ASR produces the N-best recognition hypotheses
       • IBM Model 4 MT processes all N
       • MT hypotheses are rescored with a weighted log-linear combination of ASR and MT features (see the sketch below)
       • The feature-weight model is built by optimizing a translation distance metric (mWER, mPER, BLEU, NIST)
     • Experimental results
       • Corpus: 162k/510/508 Japanese-English parallel sentences
       • Baseline: no optimization of MT features
       • Substantial improvement was obtained by optimizing feature weights against a distance metric
       • Additional improvement was achieved by including ASR features
       • Translating the N-best ASR hypotheses improved sentence translation accuracy on incorrectly recognized 1-best hypotheses by 7.5%
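The rescoring step is a weighted log-linear model over per-hypothesis feature scores. A sketch under assumed feature names (Zhang et al. tune the weights to optimize a metric such as mWER or BLEU rather than fixing them by hand):

```python
def rescore_nbest(nbest, weights):
    """Return the hypothesis maximizing a weighted sum of log-domain
    ASR and MT feature scores."""
    def score(hyp):
        return sum(weights[name] * value
                   for name, value in hyp["features"].items())
    return max(nbest, key=score)

# Illustrative usage; feature names and values are invented.
nbest = [
    {"text": "a double room please",
     "features": {"asr_acoustic": -120.0, "asr_lm": -35.0,
                  "mt_model4": -42.0, "target_lm": -21.0}},
    {"text": "a trouble room please",
     "features": {"asr_acoustic": -118.0, "asr_lm": -39.0,
                  "mt_model4": -55.0, "target_lm": -27.0}},
]
weights = {"asr_acoustic": 1.0, "asr_lm": 0.8,
           "mt_model4": 1.2, "target_lm": 0.9}
print(rescore_nbest(nbest, weights)["text"])  # "a double room please"
```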

  10. The Use of Speech in Speech-to-Speech Translation
     • The Use of Prosodic Event Information
       • On the Use of Prosody in a Speech-to-Speech Translator, Strom et al. 1997
       • A Japanese-to-English Speech Translation System: ATR-MATRIX, Takezawa et al. 1998
     • Cascaded / Loosely Coupled Approaches
       • Janus-III: Speech-to-Speech Translation in Multiple Languages, Lavie et al. 1997
       • A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation, Zhang et al. 2004
     • Integrated / Tightly Coupled Approaches
       • Finite State Speech-to-Speech Translation, Vidal 1997
       • On the Integration of Speech Recognition and Statistical Machine Translation, Matusov 2005
       • Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation, Gao 2003

  11. Finite-State Speech-to-Speech Translation, Vidal 1997
     • FSTs apply naturally to translation
       • FSTs for statistical MT can be learned from parallel corpora (via OSTIA)
     • Speech input is handled in two ways:
       • Baseline cascaded approach
       • Integrated approach: build an FST on text, then replace each edge with an acoustic model of its lexical item (see the sketch below)
     • A major drawback of this approach is its large training-data requirement, addressed by:
       • Aligning the source and target utterances, reducing their "asynchronicity"
       • Clustering lexical items, reducing the vocabulary size
     • Proof-of-concept experiment
       • Text: ~30 lexical items used in 16k paired sentences (Spanish-English)
         • Greater than 99% translation accuracy
       • Speech: 50k/400 (training/testing) paired utterances, spoken by 4 speakers
         • Best performance: 97.2% translation accuracy, 97.4% recognition accuracy
         • Requires inclusion of source and target 4-gram LMs in FST training
     • Travel-domain experiment
       • Text: ~600 lexical items in 169k/2k paired sentences
         • 0.7% translation WER with categorization; 13.3% WER without
       • Speech: 336 test utterances (~3k words) spoken by 4 speakers
         • The text transducer was used, with edges replaced by concatenations of "phonetic elements" modeled by a continuous HMM
         • 1.9% translation WER and 2.2% recognition WER
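A toy subsequential transducer illustrating the text-level FST idea (hand-built for illustration, not a system from the paper): output can be delayed across states, which is how local reordering such as Spanish "habitación doble" → English "double room" is handled. Real transducers are learned with OSTIA from parallel corpora.

```python
# (state, source word) -> (next state, delayed output words)
TRANSITIONS = {
    ("q0", "una"): ("q1", ["a"]),
    ("q1", "habitación"): ("q2", []),             # hold output back
    ("q2", "doble"): ("q3", ["double", "room"]),  # emit reordered pair
}

def transduce(source_words, start="q0"):
    """Run the toy transducer over a source sentence."""
    state, output = start, []
    for word in source_words:
        state, emitted = TRANSITIONS[(state, word)]
        output.extend(emitted)
    return output

print(transduce(["una", "habitación", "doble"]))  # ['a', 'double', 'room']
```

In the integrated approach described above, each such edge would be expanded into an acoustic model of its source word, so that recognition and translation are carried out in a single search.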

  12. On the Integration of Speech Recognition and Statistical Machine Translation, Matusov et al. 2005
     [Equation diagram relating the French audio to the best English sentence, annotated with: length of source, aligned target word, lexical context, translation model, target LM, and acoustic context; see the reconstruction below]
     • Word lattices weighted by HMM ASR scores are used as input to a weighted FST for translation
     • Noisy channel model
       • Uses an alignment model A
       • Instead of modeling the alignment, searches for the best alignment
     • Evaluation
       • Material: 4 parallel corpora of spontaneous speech in the travel domain
         • 3k-66k paired sentences in Italian-English, Spanish-English, and Spanish-Catalan
         • Vocabulary sizes of 1.7k-15k words
       • On all metrics (mWER, mPER, BLEU, NIST), the translation inputs rank as follows, best to worst:
         1. Correct text
         2. Word lattice with acoustic scores
         3. Fully integrated ASR and MT (FUB Italian-English only)
         4. Word lattice without acoustic scores
         5. Single-best ASR hypothesis (though lower mPER than the lattice without scores on FUB Italian-English)
       • Denser ASR lattices yield reduced translation WER (on FUB Italian-English)
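A plausible LaTeX reconstruction of the annotated decision rule (notation assumed, not copied from the paper): $x_1^T$ is the acoustic observation sequence (the French audio), $e_1^I$ the candidate English sentence of length $I$ scored by the target LM, $f_1^J$ the source word sequence, and $a_j$ the alignment of source word $f_j$ to target word $e_{a_j}$; maximizing over $a_1^J$ rather than summing reflects the "best alignment" search above.

\[
\hat{e}_1^{\hat{I}} \;=\; \operatorname*{arg\,max}_{I,\; e_1^I}
\Big\{\, p(e_1^I) \cdot \max_{f_1^J,\; a_1^J}
\; p(x_1^T \mid f_1^J) \prod_{j=1}^{J} p(f_j \mid e_{a_j}) \,\Big\}
\]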

  13. Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation, Gao 2003
     [Diagram: layered graphical models over time, with semantic labels L_i, words W_i, phonemes F_j, subphone states s_t, and acoustic observations o_t]
     • Applies direct modeling to ASR, with the goal of directly modeling interlingua text for MT
       • A direct model of target text from source acoustics could also be constructed with this approach
     • Composing models (e.g., noisy channel models) can lead to local or sub-optimal solutions
       • Direct modeling tries to avoid these by creating a single maximum entropy model, p(text | acoustics, ...) (see the sketch below)
       • Direct modeling can also include other non-independent observations (features)
     • Major considerations:
       • To simplify computational complexity, acoustic features are quantized
       • Since the feature vector can get very large, reliable feature selection is necessary
         • In preliminary experiments, 150M features were reduced to 500K via feature selection
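A minimal sketch of a maximum-entropy (log-linear) direct model p(text | acoustics): each candidate text is scored by a weighted sum of features extracted jointly from the text and the quantized acoustics, then normalized over the candidate set. The interface and names are hypothetical.

```python
import math

def direct_model(candidates, extract_features, weights):
    """Maximum-entropy direct model p(text | acoustics).

    candidates:       iterable of candidate texts
    extract_features: text -> iterable of (feature_name, value) pairs,
                      computed jointly from the text and the quantized
                      acoustic observations
    weights:          learned feature weights (assumed given here)
    """
    scores = {t: sum(weights.get(name, 0.0) * value
                     for name, value in extract_features(t))
              for t in candidates}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {t: math.exp(s - log_z) for t, s in scores.items()}
```

The feature selection step described above (150M features pruned to 500K) would shrink the weight table before such a model is trained.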

  14. The Use of Speech in Speech-to-Speech Translation
     • The Use of Prosodic Event Information
       • On the Use of Prosody in a Speech-to-Speech Translator, Strom et al. 1997
       • A Japanese-to-English Speech Translation System: ATR-MATRIX, Takezawa et al. 1998
     • Cascaded / Loosely Coupled Approaches
       • Janus-III: Speech-to-Speech Translation in Multiple Languages, Lavie et al. 1997
       • A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation, Zhang et al. 2004
     • Integrated / Tightly Coupled Approaches
       • Finite State Speech-to-Speech Translation, Vidal 1997
       • On the Integration of Speech Recognition and Statistical Machine Translation, Matusov 2005
       • Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation, Gao 2003

  15. Thank you.
