Named Entity Recognition and Transliteration for 50 Languages



  1. Named Entity Recognition and Transliteration for 50 Languages Richard Sproat, Dan Roth, ChengXiang Zhai, Elabbas Benmamoun, Andrew Fister, Nadia Karlinsky, Alex Klementiev, Chongwon Park, Vasin Punyakanok, Tao Tao, Su-youn Yoon University of Illinois at Urbana-Champaign http://compling.ai.uiuc.edu/reflex The Second Midwest Computational Linguistics Colloquium (MCLC-2005) May 14-15 The Ohio State University

  2. General Goals • Develop multilingual named entity recognition technology: focus on persons, places, organizations • Produce seed rules and (small) corpora for several LCTLs (Less Commonly Taught Languages) • Develop methods for automatic named entity transliteration • Develop methods for tracking names in comparable corpora

  3. Languages • Languages for seed rules: Chinese, English, Spanish, Arabic, Hindi, Portuguese, Russian, Japanese, German, Marathi, French, Korean, Urdu, Italian, Turkish, Thai, Polish, Farsi, Hausa, Burmese, Sindhi, Yoruba, Serbo-Croatian, Pashto, Amharic, Indonesian, Tagalog, Hungarian, Greek, Czech, Swahili, Somali, Zulu, Bulgarian, Quechua, Berber, Lingala, Catalan, Mongolian, Danish, Hebrew, Kashmiri, Norwegian, Wolof, Bamanankan, Twi, Basque. • Languages for (small) corpora: Chinese, Arabic, Hindi, Marathi, Thai, Farsi, Amharic, Indonesian, Swahili, Quechua.

  4. Milestones • Resources for various languages: • NER seed rules for: Armenian, Persian, Swahili, Zulu, Hindi, Russian, Thai • Tagged corpora for: Chinese, Arabic, Korean • Small tagged corpora for: Armenian, Persian, Russian (10-20K words) • Named Entity recognition technology: • Ported NER technology from English to Chinese, Arabic, Russian and German • Name transliteration: Chinese-English, Arabic-English, Korean-English

  5. Linguistic/Orthographic Issues • Capitalization • Word boundaries • Phonetic vs. orthographic issues in transliteration

  6. Named Entity Recognition

  7. Multi-lingual Text Annotator Annotate any word in a sentence by selecting the word and an available category. It's also possible to create new categories. http://l2r.cs.uiuc.edu/~cogcomp/ner_applet.php

  8. Multi-lingual Text Annotator View text in other encodings. New language encodings are easily added via a simple text-file mapping. http://l2r.cs.uiuc.edu/~cogcomp/ner_applet.php

  9. Motivation for Seed Rules "The only supervision is in the form of 7 seed rules (namely, that New York, California and U.S. are locations; that any name containing Mr. is a person; that any name containing Incorporated is an organization; and that I.B.M. and Microsoft are organizations)." [Collins and Singer, 1999]

  10. Seed Rules: Thai • Something including and to the right of นาย ('Mr.') is likely to be a person; the same holds for นาง ('Mrs.'), นางสาว ('Miss') and its abbreviation น.ส., คุณ (polite title), and เด็กหญิง ('girl', used as a title for young girls) and its abbreviation ด.ญ. • Likewise for the police-rank and parliamentary titles พ.ต.อ., พล.ต.ต., พล.ต.ท., พล.ต.อ., and ส.ส. • ทักษิณ ชินวัตร (Thaksin Shinawatra) is a person, and ทักษิณ alone is likely a person; ชวน หลีกภัย (Chuan Leekpai) and บรรหาร ศิลปอาชา (Banharn Silpa-archa) are persons

  11. Seed Rules: Thai • Something including and in between บริษัท and จำกัด ('Company … Ltd.') is likely to be an organization; likewise in between บริษัท and จำกัด (มหาชน), and in between บจก. and (มหาชน) • Something including and to the right of บจก. is likely to be an organization, as is something including and to the right of ห้างหุ้นส่วนจำกัด ('limited partnership') or its abbreviation หจก. • สำนักนายกรัฐมนตรี (Office of the Prime Minister), วุฒิสภา (the Senate), แพทยสภา (the Medical Council), พรรคไทยรักไทย (Thai Rak Thai Party), พรรคประชาธิปัตย์ (Democrat Party), and พรรคชาติไทย (Chart Thai Party) are organizations • Something including and to the right of จังหวัด ('province') or its abbreviation จ. is likely to be a location; likewise for อำเภอ ('district') and ตำบล ('subdistrict') • กรุงเทพมหานคร (Bangkok), เชียงใหม่ (Chiang Mai), เชียงราย (Chiang Rai), and ขอนแก่น (Khon Kaen) are locations (a sketch of applying such trigger rules follows)
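To make the rule format concrete, here is a minimal Python sketch, not the project's code, of how a trigger-prefix rule such as "something including and to the right of นาย is likely to be a person" might be applied to word-segmented Thai text; the span-length cap and the example token list are invented for illustration.

```python
# Minimal sketch of a trigger-prefix seed rule. Assumes the Thai text has
# already been word-segmented; the max_span cap is an invented heuristic.

PERSON_TRIGGERS = ["นาย", "นาง", "นางสาว", "น.ส.", "คุณ"]   # honorific prefixes

def apply_person_rule(tokens, max_span=2):
    """Yield (start, end, label) spans: the trigger plus up to max_span
    following tokens, labeled as a likely PERSON."""
    for i, tok in enumerate(tokens):
        if tok in PERSON_TRIGGERS:
            end = min(i + 1 + max_span, len(tokens))
            yield (i, end, "PERSON")

# Pre-segmented tokens: "นาย ทักษิณ ชินวัตร กล่าว ว่า" ('Mr. Thaksin Shinawatra said that')
tokens = ["นาย", "ทักษิณ", "ชินวัตร", "กล่าว", "ว่า"]
for start, end, label in apply_person_rule(tokens):
    print(label, tokens[start:end])   # PERSON ['นาย', 'ทักษิณ', 'ชินวัตร']
```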

  12. Seed Rules: Armenian
CityName = CapWord [ քաղաք | մայրաքաղաք ] (քաղաք 'city', մայրաքաղաք 'capital')
StateName = CapWord նահանգ (նահանգ 'state')
CountryName1 = CapWord երկիր (երկիր 'country')
PersonName1 = TITLE? FirstName? LastName
LastName = [Ա-Ֆ].*յան (a capitalized word ending in the surname suffix -յան '-yan')
FirstName = FirstName1 | FirstName2
FirstName1 = [Ա-Ֆ]\.
FirstName2 = [Ա-Ֆ].*
PersonNameForeign = TITLE FirstName? CapWord? CapWord
PersonAny = PersonName1 | PersonNameForeign
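The Armenian patterns translate almost directly into regular expressions. A hedged sketch, assuming Python's Unicode-aware re module; the TITLE lexicon is omitted and the example name is hypothetical.

```python
import re

# [Ա-Ֆ] is the Armenian uppercase letter range; -յան ('-yan') marks surnames.
LAST_NAME = re.compile(r"[Ա-Ֆ]\w*յան$")       # LastName = [Ա-Ֆ].*յան
FIRST_INITIAL = re.compile(r"[Ա-Ֆ]\.$")       # FirstName1 = [Ա-Ֆ]\.

def is_person_name1(tokens):
    """PersonName1 = TITLE? FirstName? LastName (TITLE handling omitted)."""
    if not tokens or not LAST_NAME.match(tokens[-1]):
        return False
    # Preceding tokens must look like initials or capitalized first names.
    return all(FIRST_INITIAL.match(t) or t[:1].isupper() for t in tokens[:-1])

print(is_person_name1(["Ա.", "Խաչատրյան"]))   # True: initial + -յան surname
```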

  13. Armenian Lexicon
Lexicon GEODESC: արեւելյան ('eastern'), արեւմտյան ('western'), …
Lexicon PLACEDESC: պանդոկ ('inn'), պալատ ('palace'), …
Lexicon ORGDESC: միություն ('union'), ժողով ('assembly'), …
Lexicon COMPDESC: գործակալություն ('agency'), ընկերություն ('company'), …
Lexicon TITLE: տիկին ('Mrs.'), Տկն (its abbreviation), …

  14. Seed Rules: Persian
Lexicon TITLE: آقاي ('Mr.'), دکتر ('Dr.'), خانم ('Ms.'), جناب (honorific), بانو ('lady'), مهندس ('engineer')
Lexicon OrgDesc: استانداري ('governorate'), وزارت ('ministry'), دولت ('government'), رژيم ('regime'), شهرداري ('municipality'), انجمن ('association')
Lexicon POSITION: رئيس جمهور ('president'), رييس جمهوري (variant spelling), پرزيدنت (loanword 'president'), ديپلمات ('diplomat')
Descriptors for named entities:
Lexicon PerDesc: سابق ('former'), آينده ('future')
Lexicon CityDesc: شهر ('city'), شهرک ('town'), پايتخت ('capital')
Lexicon CountryDesc: کشور ('country')

  15. Seed Rules: Swahili
People Rules
• Something including and to the right of Bw. is likely to be a person.
• Something including and to the right of Bi. is likely to be a person.
• A capitalized word to the right of bwana, together with the word bwana, is likely to be a person.
• A capitalized word to the right of bibi, together with the word bibi, is likely to designate a person.
Place Rules
• A capitalized word to the right of a word ending in -jini is likely to be a place.
• A capitalized word starting with the letter U is likely to be a place.
• A word ending in ni is likely to be a place.
• A sequence of words including and following the capitalized word Uwanja is likely a place.

  16. Named Entity Recognition • Identify entities of specific types in text (e.g. people, locations, dates, organizations, etc.) After receiving his M.B.A. from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] in [LOC Washington].

  17. Named Entity Recognition • Not an easy problem since entities: • Are inherently ambiguous (e.g. JFK can be both a location and a person depending on the context) • Can appear in various forms (e.g. abbreviations) • Can be nested, etc. • Are too numerous and constantly evolving (cf. Baayen, H. 2000. Word Frequency Distributions. Kluwer, Dordrecht.)

  18. Named Entity Recognition Two tasks (sometimes done simultaneously): • Identify the named entity phrase boundaries (segmentation) • May need to respect constraints: • Phrases do not overlap • Phrase order • Phrase length • Classify the phrases (classification) (see the BIO sketch below)
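Both tasks are commonly folded into a single per-token tagging problem using BIO labels, which encode boundaries and classes at once and make overlapping phrases impossible by construction. A minimal sketch of decoding such tags back into typed phrases (our illustration, not necessarily the system's internal representation):

```python
# BIO decoding: B-X opens a phrase of type X, I-X continues it, O is outside.

def bio_decode(tokens, tags):
    """Recover typed phrases from a BIO tag sequence."""
    phrases, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last phrase
        if tag.startswith("B-") or tag == "O":
            if start is not None:                   # close the open phrase
                phrases.append((label, tokens[start:i]))
                start, label = None, None
            if tag.startswith("B-"):                # open a new phrase
                start, label = i, tag[2:]
    return phrases

tokens = "Richard F. America joined Harvard Business School".split()
tags = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]
print(bio_decode(tokens, tags))
# [('PER', ['Richard', 'F.', 'America']), ('ORG', ['Harvard', 'Business', 'School'])]
```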

  19. Identifying phrase properties with sequential constraints (Figure: two chain-structured models over hidden states s1…s6 and observations o1…o6.) • View as an inference-with-classifiers problem. Three models [Punyakanok & Roth NIPS'01] http://l2r.cs.uiuc.edu/~danr/Papers/iwclong.pdf: • HMMs: HMM with classifiers (the most common) • Conditional Models: projection-based Markov model • Constraint Satisfaction Models: constraint satisfaction with classifiers • Other models proposed: CRF, structured perceptron • A model comparison in the context of the SRL problem [Punyakanok et al IJCAI'05]
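To illustrate the inference-with-classifiers idea in its simplest form, the sketch below combines invented per-token classifier log-scores with one hard sequential constraint (I-X may only follow B-X or I-X) via a small Viterbi search; this is a generic sketch, not any one of the specific models cited above.

```python
# Viterbi over classifier scores with a hard transition constraint.

LABELS = ["O", "B-PER", "I-PER"]

def allowed(prev, cur):
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], cur)   # I-X must follow B-X or I-X
    return True

def viterbi(token_scores):
    """token_scores: one {label: log-score} dict per token (scores invented)."""
    paths = {lbl: (token_scores[0][lbl], [lbl]) for lbl in LABELS}
    for scores in token_scores[1:]:
        new = {}
        for cur in LABELS:
            cands = [(s + scores[cur], path + [cur])
                     for prev, (s, path) in paths.items() if allowed(prev, cur)]
            if cands:
                new[cur] = max(cands)
        paths = new
    return max(paths.values())[1]

scores = [{"O": -2.0, "B-PER": -0.1, "I-PER": -5.0},
          {"O": -1.5, "B-PER": -3.0, "I-PER": -0.2},
          {"O": -0.3, "B-PER": -2.0, "I-PER": -1.0}]
print(viterbi(scores))   # ['B-PER', 'I-PER', 'O']
```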

  20. Adaptation • Most approaches to NER are targeted toward a specific setting: language, subject, set of tags, etc. • Labeled data may be hard to acquire for each particular setting • Trained classifiers tend to be brittle when moved even just to a related subject • We consider the problem of exploiting the hypothesis learned in one setting to improve learning in another • Kinds of adaptation that can be considered: • Across corpora within a domain • Across domains • Across annotation methodologies • Across languages

  21. Adaptation Example • Train on: • Reuters + increasing amounts of NYT • No Reuters, just increasing amounts of NYT • Test on: NYT • Performance on NYT increases quickly as the classifier is trained on examples from NYT • Starting with an existing classifier trained on a related corpus (Reuters) is better than starting from scratch (Figure: learning curves for "Trained on Reuters + 13% NYT; tested on NYT" and "Trained on Reuters; tested on NYT".)

  22. Current Architecture - Training • Pre-process annotated corpus • Extract features • Train classifier Pipeline: Annotated Corpus → Sentence Splitter → Word Splitter → FEX → NER (SNoW-based) → network file Setting-specific resources (shown in italics on the slide, some optional): Honorifics, Features script, Gazetteers (see the pipeline sketch below)
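A hedged sketch of the pipeline's overall shape, with each stage as a swappable callable so setting-specific resources can be replaced per language; the component names follow the slide, but every function body here is an invented stand-in (the real sentence/word splitters, FEX, and SNoW are separate tools).

```python
# Toy stand-ins for the training pipeline stages.

def sentence_splitter(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def word_splitter(sentence):
    return sentence.split()

def fex(tokens, gazetteers):
    """Stand-in feature extractor (FEX): simple per-token features."""
    return [{"word": t, "cap": t[:1].isupper(), "in_gaz": t in gazetteers}
            for t in tokens]

def train(annotated_corpus, gazetteers):
    """Pre-process and extract features; a SNoW-style classifier would then
    be trained on `examples` and serialized to a network file."""
    examples = []
    for sent in sentence_splitter(annotated_corpus):
        examples.extend(fex(word_splitter(sent), gazetteers))
    return examples

print(train("Smith visited Springfield. He returned home.", {"Springfield"}))
```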

  23. Current Architecture - Tagging • Pre-process corpus • Extract features • Run NER Pipeline: Corpus → Sentence Splitter → Word Splitter → FEX → NER (SNoW-based, loading the network file) → Annotated Corpus Setting-specific resources: Honorifics, Features script, Gazetteers

  24. Extending Current Architecture to Multiple Settings • Choose setting • Pre-process, extract features and run NER Input corpora (Chinese newswire, German biological, English news) pass through a Document Classifier that selects the setting-specific knowledge engineering components (Sentence Splitter, Word Splitter, Honorifics, Features script, Gazetteers, network file) before FEX and the SNoW-based NER produce the Annotated Corpus

  25. Extending Current Architecture to Multiple Settings: Issues For each setting, we need: • Honorifics and gazetteers • Tuned sentence and word splitters • Types of features • A tagged training corpus • Work is being done to move tags across parallel corpora (if available)

  26. Extending Current Architecture to Multiple Settings: Issues If parallel corpora are available and one side is annotated, we may be able to use Stochastic Inversion Transduction Grammars (ITGs) to move tags across corpora [Wu, Computational Linguistics '97] • Generate bilingual parses of the (annotated and unannotated) parallel corpora • Use ITGs as a filter to decide whether sentence/phrase pairs are parallel enough • For those that are, simply move the label from the annotated phrase to the unannotated phrase in the same parse-tree node (a sketch of this projection step follows) • Use the now-tagged examples as a training corpus
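A minimal sketch of the projection step alone (ITG parsing and the parallel-enough filter are not shown): given phrase pairs that occupy the same parse-tree node, labels are simply copied from the annotated side. The alignment and labels below are hypothetical.

```python
# Copy NE labels across aligned phrase pairs from a shared parse-tree node.

def project_labels(aligned_phrases, source_labels):
    """aligned_phrases: (source_span, target_span) pairs from the same tree
    node; source_labels: {source_span: label} on the annotated side."""
    return {tgt: source_labels[src]
            for src, tgt in aligned_phrases if src in source_labels}

# Hypothetical English-Chinese phrase alignment:
aligned = [(("Zhou", "Mi"), ("周", "蜜")), (("Hong", "Kong"), ("香", "港"))]
labels = {("Zhou", "Mi"): "PER", ("Hong", "Kong"): "LOC"}
print(project_labels(aligned, labels))
# {('周', '蜜'): 'PER', ('香', '港'): 'LOC'}
```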

  27. Extending Current Architecture to Multiple Settings • Baseline experiments with Arabic, German, and Russian • E.g., for Russian, with no honorifics or gazetteers, features tuned for English, and an imperfect sentence splitter, we still get about 77% precision and 36% recall. NB: used a small hand-constructed corpus of approx. 15K words and 1,300 named entities (80/20 split)

  28. Summary • Seed rules and corpora for a subset of the 50 languages • Adapted the English NER system to other languages • Demonstrated adaptation of the NER system to other settings • Experimenting with ITGs as a basis for annotation transplantation

  29. Methods of Transliteration

  30. Comparable Corpora 三号种子龚睿那今晚以两个11:1轻取丹麦选手蒂・拉斯姆森，张宁在上午以11:2和11:9淘汰了荷兰的于・默伦迪克斯，周蜜在下午以11:4和11:1战胜了中国香港选手凌婉婷。 In the day's other matches, second seed Zhou Mi overwhelmed Ling Wan Ting of Hong Kong, China 11-4, 11-4, Zhang Ning defeat Judith Meulendijks of Netherlands 11-2, 11-9 and third seed Gong Ruina took 21 minutes to eliminate Tine Rasmussen of Denmark 11-1, 11-1, enabling China to claim five quarterfinal places in the women's singles.

  31. Transliteration in Comparable Corpora • Take the newspapers for a day in any set of languages: a lot of them will have names in common. • Given a name in one language, find its transliteration in a similar text in another language. • How can we make use of: • Linguistic factors such as similar pronunciations • Distributional factors • Right now we use partly supervised methods (e.g. we assume small training dictionaries): • We are aiming for largely unsupervised methods (in particular, no training dictionary)

  32. Some Comparable Corpora • We have (from the LDC) comparable text corpora for: • English (19M words) • Chinese (22M characters) • Arabic (8M words) • Many more such corpora can, in principle, be collected from the web

  33. How Chinese Transliteration Works • About 500 characters tend to be used for foreign words • Attempt to mimic the pronunciation • But lots of alternative ways of doing it

  34. Transliteration Problem • Many applications of transliteration have been in machine translation [Knight & Graehl, 1998; Al-Onaizan & Knight, 2002; Gao, 2004]: • What's the best translation of this Chinese name? • Our problem is slightly different: • Are these two names the same? • Want to be able to reject correspondences • Assign 0 probability to some cases unseen in the training data

  35. Approaches to Transliteration • Much work uses the source-channel approach: • Cast as a problem where you have a clean "source" – e.g. a Chinese name – and a "noisy channel" that "corrupts" the source into the observed form – e.g. an English name: P(E|C)P(C) • E.g. P(f_{i,E} f_{i+1,E} f_{i+2,E} … f_{i+n,E} | s_C), where Chinese characters represent syllables (s) and we match these to sequences of English phonemes (f)

  36. Resources • Small dictionary of 721 (mostly English) names and their Chinese transliterations • Large dictionary of about 1.6 million names from LDC

  37. General Approach • Train a tight transliteration model from a dictionary of known transliterations • Identify names in English news text for a given day using an existing named entity recognizer • Process the same day of Chinese text looking for sequences of characters used in foreign names • Do an all-pairs match using the transliteration model to find possible transliteration pairs (see the sketch below)
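A sketch of the final all-pairs step, with a stand-in scoring function in place of the trained transliteration model; the names, scores, and threshold are invented.

```python
# All-pairs matching between one day's English names and Chinese candidates.

def translit_score(english_name, chinese_name):
    """Stand-in for the trained model: a real implementation would map both
    names to phone strings and compute P(e'|c')."""
    return 0.9 if (english_name, chinese_name) == ("Edmonton", "埃德蒙顿") else 0.1

def all_pairs(english_names, chinese_names, threshold=0.5):
    return [(e, c, translit_score(e, c))
            for e in english_names for c in chinese_names
            if translit_score(e, c) >= threshold]

print(all_pairs(["Edmonton", "Megawati"], ["埃德蒙顿", "阿勒泰"]))
# [('Edmonton', '埃德蒙顿', 0.9)]
```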

  38. Model Estimation • Seek to estimate P(e|c), where e is a sequence of words in Roman script and c is a sequence of Chinese characters • We actually estimate P(e'|c'), where e' is the pronunciation of e and c' is the pronunciation of c • We decompose the estimate of P(e'|c') as: P(e'|c') = ∏_i P(e'_i|c'_i) • Chinese transliteration matches syllables to similar-sounding spans of foreign phones, so the c'_i are syllables and the e'_i are subsequences of the English phone string
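In log space the decomposition is a sum over aligned (English phone span, Chinese syllable) pairs. A toy sketch with an invented probability table; the phone symbols and pinyin are only illustrative.

```python
import math

# Invented P(e'_i | c'_i) entries: English phone span given Chinese syllable.
P = {("EH", "ai"): 0.40, ("D", "de"): 0.35,
     ("M AH N", "meng"): 0.15, ("T AH N", "dun"): 0.30}

def translit_log_prob(alignment, unseen=1e-9):
    """alignment: (english_phone_span, chinese_syllable) pairs. Unseen pairs
    get a tiny floor, mirroring 'assign ~0 probability to unseen cases'."""
    return sum(math.log(P.get(pair, unseen)) for pair in alignment)

# Edmonton (EH D M AH N T AH N) against 埃德蒙顿 (ai de meng dun):
alignment = [("EH", "ai"), ("D", "de"), ("M AH N", "meng"), ("T AH N", "dun")]
print(translit_log_prob(alignment))
```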

  39. Model Estimation • Align phone strings using a modified Sankoff/Kruskal algorithm • For each Chinese syllable s, allow an English phone string f to correspond just in case the initial of s corresponds to the initial of f some minimum number of times in training • Smooth probabilities using Good-Turing • Distribute the unseen probability mass over unseen cases non-uniformly, according to a weighting scheme

  40. Model Estimation • We estimate the probability for a given unseen case as: P(e|c) = P(n_0) · P(len(e)=m | len(c)=n) / count(len(e)=m) • Where: • P(n_0) is the probability of unseen cases according to the Good-Turing smoothing • P(len(e)=m|len(c)=n) is the probability of a Chinese syllable of length n corresponding to an English phone sequence of length m • count(len(e)=m) is the type count of phone sequences of length m (estimated from 194,000 pronunciations produced by the Festival TTS system on the XTag dictionary)

  41. Some Automatically Found Pairs (Table: transliteration pairs found in the same day of newswire text.)

  42. Further Pairs (Table: additional automatically found transliteration pairs.)

  43. Time Correlations • When some major event happens (e.g., the tsunami disaster), it is very likely covered by news articles in multiple languages • Each event/topic tends to have its own "associated vocabulary" (e.g., names such as Sri Lanka, India may occur in recent news articles) • We thus will likely see that the frequency of a name such as Sri Lanka will peak as compared with other time periods and the pattern is likely the same across languages • cf. [Kay and Roscheisen, CL, 1993; Kupiec, ACL, 1993; Rapp, ACL, 1995; Fung, WVLC, 1995]

  44. Construct Term Distributions over Time (Figure: documents bucketed by day along a time line, Day 1, Day 2, Day 3, …, Day n; a term's frequency is counted per day and normalized to obtain a distribution.)
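A minimal sketch of constructing such a distribution: count a term's frequency in each day's documents and normalize over the time line; the day keys and toy documents are invented.

```python
# Build a term's normalized frequency distribution over days.

def term_distribution(term, daily_docs, days):
    """daily_docs: {day: list of tokenized documents}."""
    freqs = [sum(doc.count(term) for doc in daily_docs.get(day, []))
             for day in days]
    total = sum(freqs)
    return [f / total if total else 0.0 for f in freqs]

days = ["day1", "day2", "day3"]
daily_docs = {"day1": [["tsunami", "aid"]],
              "day2": [["tsunami"], ["tsunami"]],
              "day3": [["election"]]}
print(term_distribution("tsunami", daily_docs, days))   # [0.333..., 0.666..., 0.0]
```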

  45. Measure Correlations of English and Chinese Word Pairs • Pearson correlation; scores in [-1, 1] (Figure: Megawati-English vs. Megawati-Chinese shows good correlation, corr = 0.885; Megawati-English vs. Arafat-Chinese shows bad correlation, corr = 0.0324.)
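The correlation score is computed directly from the two normalized time distributions. A self-contained sketch with toy vectors:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; range [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

english = [0.05, 0.40, 0.35, 0.10, 0.10]   # toy daily distributions
chinese = [0.04, 0.45, 0.30, 0.11, 0.10]
print(round(pearson(english, chinese), 3))   # close to 1: a good correlation
```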

  46. Chinese Transliteration Given an English term (e.g. Edmonton), collect candidate Chinese names from the Chinese documents (埃德蒙顿, 阿勒泰, 埃丁顿, 阿马纳, 阿亚德, 埃蒂纳罗, …), then rank the candidates (埃德蒙顿 0.96, 阿勒泰 0.91, 埃丁顿 0.88, 阿马纳 0.75, …) • Methods: • Phonetic approach • Frequency correlation • Combination

  47. Evaluation • Method 1 (Freq+PhoneticFilter): use the phonetic method to filter candidates, compute the correlation, and rank the candidates by correlation score • Method 2 (Freq+PhoneticScore): linearly combine the correlation scores with the phonetic scores (half/half) • MRR: Mean Reciprocal Rank • AllMRR: evaluation over all English names • CoreMRR: evaluation over just the names with a found Chinese correspondence (Example: English term Edmonton, with Chinese candidates 埃德蒙顿, 阿勒泰, 埃丁顿, 阿马纳, 阿亚德, 埃蒂纳罗, …)
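A sketch of the two evaluation pieces: Method 2's half/half score combination and MRR. How names with no found correspondence contribute (here, zero) is our reading of the AllMRR/CoreMRR distinction; the rankings and gold answers are invented.

```python
def combined_score(corr, phon):
    return 0.5 * corr + 0.5 * phon       # Method 2: linear, half/half

def mrr(rankings, gold):
    """rankings: {name: ranked candidate list}; gold: {name: correct answer}.
    A name whose answer is missing from its list contributes 0."""
    total = 0.0
    for name, cands in rankings.items():
        if gold[name] in cands:
            total += 1.0 / (cands.index(gold[name]) + 1)
    return total / len(rankings)

rankings = {"Edmonton": ["埃德蒙顿", "阿勒泰"],
            "Megawati": ["阿马纳", "梅加瓦蒂"]}
gold = {"Edmonton": "埃德蒙顿", "Megawati": "梅加瓦蒂"}
print(mrr(rankings, gold))   # (1/1 + 1/2) / 2 = 0.75
```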

  48. Summary and Future Work • So far: • Phonetic transliteration models • Time correlation between name distributions • Work in progress: • Linguistic models: • Develop graphical model approach to transliteration • Semantic aspects of transliteration in Chinese: female names ending in –ia transliterated with 娅 ya rather than 亚 • Resource-poor transliteration for any pair of languages • Document alignment • Coordinated mixture models for document/word-level alignment

  49. Graphical Models [Bilmes & Zweig 2002] (Figure: a dynamic Bayesian network for transliteration, with nodes for a character counter, an end-of-character indicator, character transitions, Chinese phone transitions, Chinese phones, and English phones.)

  50. Semantic Aspects of Transliteration • The phonological model doesn't capture semantic/orthographic features of transliteration: • Saint, San, Sao, … use 圣 sheng `holy' • Female names ending in –ia are transliterated with 娅 ya rather than 亚 ya • Such information boosts the evidence that two strings are transliterations of each other • Consider gender. For each character c: • compute the log-likelihood ratio abs(log(P(f|c)/P(m|c))) • build a decision list ranked by decreasing LLR (see the sketch below)
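A sketch of the gender decision list, using invented counts of how often each character appears in female versus male names:

```python
import math

# Invented counts of each character's occurrences in female (f) / male (m) names.
counts = {"娅": {"f": 95, "m": 5},    # strongly female-marking
          "亚": {"f": 40, "m": 60},   # weakly marked
          "圣": {"f": 50, "m": 50}}   # unmarked

def llr(c):
    """abs(log(P(f|c) / P(m|c))), estimated from raw counts."""
    f, m = counts[c]["f"], counts[c]["m"]
    return abs(math.log((f / (f + m)) / (m / (f + m))))

decision_list = sorted(counts, key=llr, reverse=True)
for c in decision_list:
    print(c, round(llr(c), 3))   # 娅 ranks first: most informative for gender
```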
