A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation
Pidong Wang and Hwee Tou Ng
{wangpd, nght}@comp.nus.edu.sg

Motivation
• Social media texts (SMS messages, Twitter messages, etc.) are written informally with abbreviations, specialized vocabulary, etc.
• Problematic for natural language processing applications

Limitations of Prior Work
• Most prior work focused on word substitution
• Other normalization operations are also needed, e.g., punctuation correction, missing word recovery, etc.

Objective
• A beam-search decoder for normalization of social media text
• Applied to machine translation (MT)
• Effectively integrate multiple normalization operations

Text Normalization Decoder
Given an input text, the decoder searches for its best normalization, i.e., the best hypothesis, by iteratively performing two subtasks:
• Producing new sentence-level hypotheses from hypotheses in the current stack, carried out by hypothesis producers
• Evaluating the new hypotheses to retain good ones, carried out by feature functions

Text Normalization for Chinese
• We design the following hypothesis producers:
• Dictionary: A dictionary of informal-formal word pairs
• Punctuation: A dynamic CRF model
• Pronunciation: Use Chinese Pinyin to model the pronunciation similarity of Chinese words
• Pronoun: A CRF model to recover the missing pronoun “我[I]”
• Interjection: Remove redundant interjections, e.g., “好的[ok]哦[oh]” → “好的”
• Resegmentation: Fix word segmentation problems

Text Normalization for English
• Dictionary, Punctuation, Interjection, plus:
• Pronunciation: Model English word pronunciation similarity
• Be: A CRF model to recover the missing word “be”
• Retokenization: Fix tokenization problems
• Prefix: Informal word w → formal word w’ if w is a prefix of w’, e.g., “goin” → “going”
• Quotation: Applicable if an informal word ends with a letter in (m, s, t) and if inserting a quotation mark before the letter produces a formal word, e.g., im → i’m, shes → she’s, isnt → isn’t
• Abbreviation: Informal to formal word by adding missing vowels
• Time: Change a number into a time expression if possible, e.g., from “1130am” to “11 : 30 am”

Punctuation Correction
• Use a two-layer dynamic conditional random fields (CRF) model
• Layer 1: Punctuation tags None, Comma, Period, Question-Mark, and Exclamatory-Mark
• Layer 2: Sentence boundary tags Declarative-Begin, Declarative-In, Question-Begin, Question-In, Exclamatory-Begin, and Exclamatory-In

Missing Word Recovery
• Problem: Some words are omitted, e.g., “be” in English.
• Solution: A CRF model to recover such missing words
• Example: To recover “be” in English, the model has tags None, BE, IS, ARE, and AM, and assigns a tag to each token denoting whether to insert some form of “be” after the token.

Example
• Input text: “yeah must sign up , im in lt25”
• Without normalization, translation is “对[yeah] 必须[must] 签署[sign up] ， im 在[in] lt25”.
• Normalized text: “yeah must sign up , i ’m in lt25 .”
• Translation of normalized text: “对必须签署，我在 lt25 。”
• Word normalization and punctuation correction improve translation.

Experimental Results
Corpora:
(1) Chinese-English: 1,000 Weibo messages, normalized and then translated
(2) English-Chinese: 2,000 messages randomly selected from the NUS English SMS corpus, normalized and then translated
Baselines:
(1) ORIGINAL: No text normalization
(2) LATTICE: Use lattice with the manually assembled dictionary
(3) PBMT: Use phrase-based MT to perform normalization
[Result panels: Chinese-English, English-Chinese]
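
The decoder loop and three of the English hypothesis producers (Quotation, Prefix, Time) can be illustrated with a small Python sketch. This is a simplified illustration and not the authors' implementation. The names `FORMAL_VOCAB`, `score`, `decode`, and the producer functions are hypothetical. The vocabulary is a toy stand-in for a real lexicon. The single feature function, which penalizes out-of-vocabulary tokens, replaces whatever feature functions the actual decoder uses. The sketch also writes “i'm” as one token with an ASCII apostrophe, whereas the poster's example tokenizes it as “i ’m”.

```python
import re
from typing import Callable, Iterator, List, Tuple

Hypothesis = Tuple[str, ...]

# Assumption: a toy set of "formal" tokens; a real system needs a large lexicon.
FORMAL_VOCAB = {"yeah", "must", "sign", "up", ",", ".", "i'm", "in",
                "going", "at", "11", ":", "30", "am"}

def quotation_producer(hyp: Hypothesis) -> Iterator[Hypothesis]:
    """Insert an apostrophe before a final m/s/t if that yields a formal word."""
    for i, w in enumerate(hyp):
        if len(w) > 1 and w[-1] in "mst":
            cand = w[:-1] + "'" + w[-1]
            if cand in FORMAL_VOCAB:
                yield hyp[:i] + (cand,) + hyp[i + 1:]

def prefix_producer(hyp: Hypothesis) -> Iterator[Hypothesis]:
    """Replace an informal word w with a formal word that has w as a prefix."""
    for i, w in enumerate(hyp):
        if w in FORMAL_VOCAB:
            continue
        for formal in sorted(FORMAL_VOCAB):
            if formal.startswith(w):
                yield hyp[:i] + (formal,) + hyp[i + 1:]

TIME_RE = re.compile(r"(\d{1,2})(\d{2})(am|pm)")

def time_producer(hyp: Hypothesis) -> Iterator[Hypothesis]:
    """Split a token such as '1130am' into '11 : 30 am'."""
    for i, w in enumerate(hyp):
        m = TIME_RE.fullmatch(w)
        if m:
            yield hyp[:i] + (m.group(1), ":", m.group(2), m.group(3)) + hyp[i + 1:]

PRODUCERS: List[Callable[[Hypothesis], Iterator[Hypothesis]]] = [
    quotation_producer, prefix_producer, time_producer,
]

def score(hyp: Hypothesis) -> int:
    """Hypothetical feature function: fewer out-of-vocabulary tokens is better."""
    return -sum(1 for w in hyp if w not in FORMAL_VOCAB)

def decode(text: str, producers=PRODUCERS, beam_size: int = 4,
           max_iters: int = 5) -> str:
    """Beam search: expand the current stack with producers, keep the top hypotheses."""
    start: Hypothesis = tuple(text.split())
    stack, seen, best = [start], {start}, start
    for _ in range(max_iters):
        new: List[Hypothesis] = []
        for hyp in stack:
            for produce in producers:
                for cand in produce(hyp):
                    if cand not in seen:
                        seen.add(cand)
                        new.append(cand)
        if not new:
            break  # no producer can change any hypothesis further
        # Prune: retain only the beam_size best new hypotheses.
        stack = sorted(new, key=score, reverse=True)[:beam_size]
        if score(stack[0]) > score(best):
            best = stack[0]
    return " ".join(best)

print(decode("yeah must sign up , im goin in lt25"))
# yeah must sign up , i'm going in lt25
print(decode("sign up at 1130am"))
# sign up at 11 : 30 am
```

Each producer changes one token per hypothesis, so a sentence needing two fixes is normalized over two iterations. The unknown token “lt25” is left untouched because no producer proposes a replacement for it.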


