Multilingual Synchronization

Eun-kyung Kim, 2011-02-10

Presentation Transcript


  1. Multilingual Synchronization
     Eun-kyung Kim, 2011-02-10

  2. Introduction
     • Wikipedia
       • Supports over 270 languages
       • Allows cross-lingual navigation with inter-language links
       • Different languages hold different quantities of data
     • Goal
       • Synchronizing multilingual Wikipedia data to fill the gap between languages and to acquire integrated knowledge

  3. Methodology (base)
     • Hypothesis
       • If X is a key fact in L1, then X’ should be a key fact in L2
       • where X’ is the term corresponding to X in a different language
     • Assumption
       • Inter-language links accurately connect two pages about the same entity or concept in different languages
       • Key facts come from structured data such as:
         • Infobox
         • Category
         • Hyperlink text (rather than normal text)

  4. Methodology
     • Basic methodology
       • Infobox synchronization between English & Korean
       • Duplicate and conflict resolution
     • Comments from Committee (Ph.D. proposal)
       • Improve the multilingualism
       • Do not ignore multiple viewpoints
       • Hard to evaluate
     • Extended methodology
       • Sync target selection in 5 languages
       • Key facts synchronization including not only Infobox but also LinkText
       • Filling in missing information according to each language's background knowledge and characteristics
         • focusing on how to add new information from other language resources
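
The basic methodology above (Infobox synchronization with duplicate and conflict resolution) can be illustrated with a minimal sketch. The slides do not describe the actual algorithm, so everything here is an assumption: the function names `sync_infobox` and `find_conflicts` are hypothetical, attribute names and values are assumed to be already aligned and translated into one language, and the sample values are made up. Conflicting values are kept side by side rather than discarded, in the spirit of the committee comment about multiple viewpoints.

```python
from typing import Dict, Set


def sync_infobox(*boxes: Dict[str, str]) -> Dict[str, Set[str]]:
    """Merge infoboxes attribute by attribute (hypothetical helper).

    Duplicates collapse because values are stored in a set; attributes
    missing from one infobox are filled in from the others.
    """
    merged: Dict[str, Set[str]] = {}
    for box in boxes:
        for attr, value in box.items():
            merged.setdefault(attr, set()).add(value.strip())
    return merged


def find_conflicts(merged: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return the attributes that carry more than one distinct value."""
    return {attr: vals for attr, vals in merged.items() if len(vals) > 1}


# Made-up infobox contents, for illustration only.
en_box = {"name": "Example", "founded": "1990", "area": "10 km2"}
ko_box = {"name": "Example", "founded": "1991"}

merged = sync_infobox(en_box, ko_box)
conflicts = find_conflicts(merged)
print(merged)
print(conflicts)
```

Here `area` is filled in from the English infobox, the duplicate `name` collapses to one value, and `founded` is reported as a conflict with both values preserved.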

  6. Example of Infobox Synchronization

  7. Example of Infobox Synchronization
     • Drawback of Infobox
       • Sometimes meaningless for synchronization
     • Solution
       • Adding link information to the synchronization
     [Figure: Infobox from Arthritis]

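  8. Links on the Web
     • Links
       • navigate to a web page with more detailed information
       • point to previously published web pages with similar or related content
     • Understanding the influence of each link can substantially benefit many applications
       • e.g., multilingual sync
     [Figure: example Korean Wikipedia links: 베짱이 (katydid), 귀뚜라미 (cricket), 방아깨비 (long-headed grasshopper), 메뚜기목 (Orthoptera), 메뚜기 (grasshopper), 해충 (pest), 풀무치 (migratory locust), 벼메뚜기 (rice grasshopper), 여치 (bush cricket), 사우디아라비아 (Saudi Arabia), 농업 (agriculture), 예멘 (Yemen)]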

  9. Multilingual Synchronization Process
     [Figure: process diagram. Inputs: Wikipedia Data L1, L2, …, LN. Boxes: Preprocessing (Target Page Selection); Extracting Links; Translating links into target languages to sync; Modeling on influence links; Finding missing links according to the model; Computing similarity between existing and new; Unifying synchronized data]
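
Two of the steps named in the process diagram, translating links into the target language and finding the links that are missing there, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inter-language mapping is assumed to be a plain dictionary, Jaccard similarity stands in for the unspecified "similarity between existing and new", and the sample link sets are made up rather than taken from real Wikipedia pages.

```python
from typing import Dict, Set


def translate_links(source_links: Set[str], interlang: Dict[str, str]) -> Set[str]:
    """Map source-language link targets to the target language.

    Links with no inter-language counterpart are skipped.
    """
    return {interlang[link] for link in source_links if link in interlang}


def find_missing_links(source_links: Set[str], target_links: Set[str],
                       interlang: Dict[str, str]) -> Set[str]:
    """Links present on the source page but absent from the target page."""
    return translate_links(source_links, interlang) - target_links


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity; an assumed stand-in for the slide's similarity step."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up example data (not extracted from Wikipedia).
en_links = {"Orthoptera", "Pest", "Yemen", "Untranslated page"}
ko_links = {"메뚜기목", "해충"}
en_to_ko = {"Orthoptera": "메뚜기목", "Pest": "해충", "Yemen": "예멘"}

missing = find_missing_links(en_links, ko_links, en_to_ko)
similarity = jaccard(translate_links(en_links, en_to_ko), ko_links)
print(missing, similarity)
```

In this toy case the only candidate to add to the Korean page is 예멘, and the overlap between the translated and existing link sets is 2 out of 3.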

  11. Preprocessing: Selecting Target
     • Source languages (5)
       • English, Spanish, French, Chinese (a subset of the UN official languages) and Korean
     • Extracting target pages that form a complete graph (clique) of inter-language links
     • Assumption:
       • Pages found in all 5 languages are key pages and are the targets to sync
     • Enforcing consistency of a link path
       • If a path from X (L1) to X’ (L2) is found once, its inverse path (X’, X) is automatically added to the output
     [Figure: example clique: en:Badminton, fr:Badminton, es:Bádminton, ko:배드민턴, zh:羽毛球]
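
The target selection described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: pages are assumed to be `(language, title)` tuples, inter-language links are assumed to arrive as directed pairs, the names `symmetrize` and `select_targets` are hypothetical, and a page is assumed to link to at most one page per language. The second sample page is invented to show a page that fails the clique test.

```python
from collections import defaultdict
from itertools import combinations

LANGS = ("en", "es", "fr", "zh", "ko")


def symmetrize(links):
    """Enforce path consistency: whenever (X, X') is present, add (X', X)."""
    out = set()
    for src, dst in links:
        out.add((src, dst))
        out.add((dst, src))
    return out


def select_targets(links, langs=LANGS):
    """Return page groups that form a complete graph across all languages."""
    sym = symmetrize(links)
    by_lang = defaultdict(dict)  # page -> {language: linked page}
    for src, dst in sym:
        by_lang[src][dst[0]] = dst
    cliques = []
    for page in sorted(by_lang):
        if page[0] != langs[0]:
            continue  # enumerate candidate groups from the first language only
        group = [page] + [by_lang[page].get(lang) for lang in langs[1:]]
        if None in group:
            continue  # the page is missing in at least one language
        # Every pair in the group must be linked for the group to be a clique.
        if all((a, b) in sym for a, b in combinations(group, 2)):
            cliques.append(tuple(group))
    return cliques


badminton = [("en", "Badminton"), ("es", "Bádminton"), ("fr", "Badminton"),
             ("zh", "羽毛球"), ("ko", "배드민턴")]
# Each pair is given in one direction only; symmetrize() adds the inverse.
links = list(combinations(badminton, 2))
# Hypothetical page that exists in only two languages and is filtered out.
links.append((("en", "Hypothetical page"), ("ko", "가상 문서")))

targets = select_targets(links)
print(targets)
```

Only the Badminton group survives, because the hypothetical page has no counterpart in Spanish, French or Chinese.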

  12. Preprocessing: Selected Pages
     • Total: 42,077 pages
     • Example: page-length comparison
       • Badminton (배드민턴)
         • en (52,098) > fr (26,508) > ko (22,960) > zh (19,050) > es (17,594)
       • Suncheon,_Jeollanam-do (순천시_(전라남도))
         • ko (20,816) > en (8,910) > zh (1,688) > es (1,600) > fr (1,503)

  13. Modeling on influence links
     • Example of links in multiple language editions of Wikipedia
       • Each language edition of Wikipedia has its own viewpoints and concerns (fig)
       • Over time, users add some links and delete others
       • We need to know the permutation distance of links in each language edition of Wikipedia (ongoing)
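
The slide marks the permutation distance as ongoing work and does not say which distance is meant. As one possible reading, the sketch below uses the Kendall tau distance, a common permutation distance that counts the pairs of items appearing in opposite order in two rankings. The choice of measure, the function name, and the sample link orderings are all assumptions; links are assumed to be already translated into one language and ordered by some criterion such as position on the page.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_distance(order_a: Sequence[str], order_b: Sequence[str]) -> int:
    """Count pairs of shared items whose relative order differs.

    Items that appear in only one of the two orderings are ignored.
    """
    in_b = set(order_b)
    common = [item for item in order_a if item in in_b]
    pos_b = {item: i for i, item in enumerate(order_b)}
    # combinations() yields (x, y) with x before y in order_a;
    # the pair is discordant when x comes after y in order_b.
    return sum(1 for x, y in combinations(common, 2) if pos_b[x] > pos_b[y])


# Made-up link orderings for the same page in two language editions.
links_l1 = ["Orthoptera", "Pest", "Agriculture", "Yemen"]
links_l2 = ["Pest", "Orthoptera", "Agriculture", "Yemen"]

print(kendall_tau_distance(links_l1, links_l2))
print(kendall_tau_distance(links_l1, list(reversed(links_l1))))
```

A distance of zero means both editions order their shared links identically; the maximum for n shared links is n(n-1)/2, reached when one ordering is the reverse of the other.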

  14. Evaluation Plan
     • Compare how much useful information can be filled in from other language resources
       • Links of the Featured article in L1 vs. unified links from M-Sync in L2, …, LN
     • Compare how much relevant information can be filled in from other language resources
       • NGD (normalized Google distance)
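
For reference, the normalized Google distance mentioned above is defined from hit counts as NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) and f(y) are the numbers of pages containing each term, f(x, y) the number containing both, and N the total number of indexed pages. A minimal sketch follows; the function name, the handling of zero counts, and the sample counts are assumptions, and how the counts would be obtained for this evaluation is not stated in the slides.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance computed from hit counts.

    fx, fy: pages containing each term; fxy: pages containing both;
    n: total number of indexed pages. Returns infinity when any count
    is zero (an assumed convention for terms that never co-occur).
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical counts, for illustration only.
print(ngd(1000, 1000, 1000, 10**6))  # terms that always co-occur
print(ngd(1000, 1000, 10, 10**6))    # terms that rarely co-occur
```

Lower values indicate more closely related terms, so a candidate link whose title has a low NGD to the target page title would count as more relevant.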

  15. Task & Schedule
     • Target Conference:
       • Web Intelligence 2011 (3/4, 3/11)
     • Task
       • Modeling on influence link to synchronize
         • Link category analysis on each Lang
           • Using Wikipedia links information
           • Using Wikipedia template, category (CAT2ISA)
         • Link evolution analysis on each Lang
           • Using Wikipedia edit history
       • Making evaluation dataset