eWika: Towards the Digitalization of Philippine Languages

eWika: Towards the Digitalization of Philippine Languages. Charibeth K. Cheng (koc@dlsu.edu.ph) DLSU, College of Computer Studies Natural Language Processing Research Lab. Isalin. Translate. MT Research in RP. started in 1993 at UP-Los Baños Dr. Rachel Roxas and Allan Borra


Presentation Transcript


  1. eWika: Towards the Digitalization of Philippine Languages Charibeth K. Cheng (koc@dlsu.edu.ph) DLSU, College of Computer Studies Natural Language Processing Research Lab Isalin Translate

  2. MT Research in RP • started in 1993 at UP-Los Baños • Dr. Rachel Roxas and Allan Borra • grammar-based • started at DLSU in 2004 • hybrid approach

  3. ENG-FIL MT System Project • 3-year project • started 2005 • funded by DOST-PCASTRD • composition: • 6 faculty members of College of Computer Studies • 15 computer science majors • assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M

  4. Architectural Design of the Program • Diagram components: Source Text, User Interface, Target Text, Translator Engine, MT: Example-based, MT: Rule-based, Output Modeller • Language Resources: • Lexicon (electronic dictionary) • Morphological Analyzer & Generator • Part-of-Speech tagger • Grammar • Corpus (Tagged)

  5. Where do we get the translation rules? • Rule-Based approach • The boy ate apples. • Apply translation rules • Kumain ng mga mansanas ang batang lalaki.

  6. Example-Based • Learn the rules from examples • The boy ate apples. (A B C D) • Kumain ng mga mansanas ang batang lalaki. • Rule Learned: A B C D → C ng D A B
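The rule-learning step on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's actual learner: the function name `learn_rule`, the hand-written `alignment` table, and the treatment of "mga mansanas" and "batang lalaki" as single units are all assumptions made for the example. A real system would have to compute the alignment itself.

```python
# Hypothetical sketch: derive a reordering rule from one aligned sentence pair.
# The alignment is supplied by hand here; it is an assumption of this example.

def learn_rule(source_tokens, target_tokens, alignment):
    """Return (source labels, target pattern) for one example pair.

    alignment maps each target position to a source position, or to None
    when the target token has no source counterpart (kept as a literal).
    """
    labels = [chr(ord("A") + i) for i in range(len(source_tokens))]
    pattern = []
    for pos, token in enumerate(target_tokens):
        src = alignment[pos]
        pattern.append(labels[src] if src is not None else token)
    return labels, pattern


source = ["The", "boy", "ate", "apples"]
# Multi-word translations are treated as single units for simplicity.
target = ["Kumain", "ng", "mga mansanas", "ang", "batang lalaki"]
# Target position -> source position (None marks the inserted marker "ng").
alignment = {0: 2, 1: None, 2: 3, 3: 0, 4: 1}

labels, pattern = learn_rule(source, target, alignment)
print(" ".join(labels), "->", " ".join(pattern))  # A B C D -> C ng D A B
```

The literal "ng" survives in the pattern because it has no aligned source word, which matches the rule shown on the slide.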

  7. Using the rule • Rule: A B C D → C ng D A B • The mother cooked fish. (A B C D) • Nagluto ng isda ang nanay.

  8. Using the rule • Rule: A B C D → C ng D A B • The mother went home. (A B C D) • Umuwi ng bahay ang nanay.

  9. Limitation of a Rule • Rule: A B C D → C ng D A B • The boy ate the fish. • The sentence has five words, so it does not match the four-slot pattern A B C D.
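Applying the learned rule A B C D → C ng D A B, and the limitation on slide 9, can be sketched as below. The toy `LEXICON`, the names `RULE_SOURCE`, `RULE_TARGET`, and `apply_rule`, and the length-only matching are assumptions for illustration. The actual engine presumably matches on richer information than token count.

```python
# Hypothetical sketch of applying one reordering rule with a toy lexicon.
LEXICON = {
    "the": "ang", "mother": "nanay", "boy": "batang lalaki",
    "cooked": "nagluto", "went": "umuwi", "ate": "kumain",
    "fish": "isda", "home": "bahay",
}
RULE_SOURCE = ["A", "B", "C", "D"]
RULE_TARGET = ["C", "ng", "D", "A", "B"]


def apply_rule(sentence):
    """Translate a sentence with the rule, or return None if it does not match."""
    tokens = sentence.rstrip(".").lower().split()
    if len(tokens) != len(RULE_SOURCE):
        return None  # the pattern has four slots; other lengths are not covered
    # Bind each slot label to the translation of the word in that position.
    # A word missing from the toy lexicon raises KeyError.
    bound = {label: LEXICON[tok] for label, tok in zip(RULE_SOURCE, tokens)}
    # Labels are replaced by their bound words; literals such as "ng" pass through.
    words = [bound.get(item, item) for item in RULE_TARGET]
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


print(apply_rule("The mother cooked fish."))  # Nagluto ng isda ang nanay.
print(apply_rule("The mother went home."))    # Umuwi ng bahay ang nanay.
print(apply_rule("The boy ate the fish."))    # None
```

The third call returns None because the five-word input has no four-slot match, which is the limitation the slide points at: each learned rule covers only sentences of the same shape as its example.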

  10. Results of the MT Engine • Qualities of a Good Translation • Clarity – 3.3 • Accuracy – 3.2 • Naturalness – 2.8 • highest possible score: 5 • 100 respondents (5 linguists)

  11. Challenge! • Language resources • Quality of translation depends on them. • Built from almost non-existent digital forms • manual vs. automatic construction • Dictionary • Grammar • Sample Translations

  12. Lexicon • Diksyunaryo ng Wikang Filipino • automatic construction (AeFLEX): • accuracy rate - 57% • Currently contains about 30,000+ entries • Challenge: Lexical resources • translation documents • part-of-speech tagger

  13. Morphological Analyzer and Generator • Dictionary is incomplete • Create software that: • analyzes – determines the root word • generates – generates the inflected word • Given: eating -> eat -> kain -> kumakain • Challenge: Lexical resources • lexicon • part-of-speech tagger
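The chain eating -> eat -> kain -> kumakain can be sketched as an analyze, look up, generate pipeline. This is a minimal sketch under stated assumptions, not the project's analyzer: the function names and the one-entry `EN_TO_FIL_ROOT` table are hypothetical, the English side only strips "-ing", and the Filipino side handles only the -um- pattern (reduplicate the first syllable, then add "um"), which is one of several Tagalog verb patterns.

```python
# Hypothetical sketch: analyze an English form, look up the root, generate
# the Filipino inflected form. Covers only "-ing" and the -um- pattern.
VOWELS = "aeiou"
EN_TO_FIL_ROOT = {"eat": "kain"}  # toy bilingual root lexicon (assumed)


def analyze_english(word):
    """Approximate the root by stripping a final "-ing" (no spelling repairs)."""
    return word[:-3] if word.endswith("ing") else word


def generate_um_imperfective(root):
    """Build the -um- imperfective: reduplicate the first syllable, add "um"."""
    if root[0] in VOWELS:
        # Vowel-initial root: "um" is prefixed to the reduplicated form.
        return "um" + root[0] + root
    # Consonant-initial root: "um" goes between the onset and the first vowel.
    first_vowel = next(i for i, ch in enumerate(root) if ch in VOWELS)
    onset, vowel = root[:first_vowel], root[first_vowel]
    return onset + "um" + vowel + root


def translate_verb(word):
    root = analyze_english(word)      # eating -> eat
    fil_root = EN_TO_FIL_ROOT[root]   # eat -> kain
    return generate_um_imperfective(fil_root)  # kain -> kumakain


print(translate_verb("eating"))  # kumakain
```

Because the lexicon lookup sits in the middle of the chain, any gap in the dictionary stops the pipeline, which is why the slide lists the lexicon as a dependency.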

  14. Part-Of-Speech Tagger • automatic association of parts-of-speech to words in a document • Can? – kaya vs. lata • Baba? – chin or go down • Challenge : Lexical resource • corpora • lexicon • morphological analyzer • grammar
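The "can" ambiguity (kaya vs. lata) shows why a tag is needed before a translation can be chosen. The sketch below is only an illustration with a single hand-written context rule: the names `DETERMINERS`, `SENSES`, `tag_can`, and `translate_can` are hypothetical, and a real tagger would rely on the corpora, lexicon, morphological analyzer, and grammar listed on the slide rather than on one rule.

```python
# Hypothetical sketch: pick a translation of "can" from a crude context rule.
DETERMINERS = {"the", "a", "an", "this", "that"}
SENSES = {("can", "NOUN"): "lata", ("can", "MODAL"): "kaya"}


def tag_can(tokens, index):
    """Tag "can" as NOUN after a determiner, otherwise as MODAL."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    return "NOUN" if previous in DETERMINERS else "MODAL"


def translate_can(tokens, index):
    """Return the Filipino word for "can" at the given token position."""
    return SENSES[("can", tag_can(tokens, index))]


print(translate_can("I can swim".split(), 1))    # kaya
print(translate_can("Open the can".split(), 2))  # lata
```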

  15. Corpora • collection of translation-pair documents • used by the lexicon extractor and part-of-speech tagger, example-based MT • came from translation works of DLSU English majors, verified by linguists • consists of 207,000 words

  16. Lexicon Resource Dependency • Diagram nodes: Lexicon, Corpus, POS Tagger, Morph AG (Morphological Analyzer & Generator)
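The dependencies listed under "Challenge" on slides 12 to 14 can be written down as a graph and checked for cycles. The sketch below is an illustration only: the node names and the `find_cycle` helper are assumptions, and "translation documents" on slide 12 is read here as the corpus.

```python
# Hypothetical sketch: resource dependencies as stated on slides 12 to 14.
DEPENDS_ON = {
    "lexicon": ["corpus", "pos_tagger"],      # slide 12
    "morph_ag": ["lexicon", "pos_tagger"],    # slide 13
    "pos_tagger": ["corpus", "lexicon", "morph_ag", "grammar"],  # slide 14
    "corpus": [],
    "grammar": [],
}


def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the current path: a cycle is closed.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


print(find_cycle(DEPENDS_ON))  # ['lexicon', 'pos_tagger', 'lexicon']
```

The cycle means no resource can be built completely before the others; in practice each would have to be bootstrapped from a partial version of the rest.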

  17. Bringing it home … • 171 Philippine Languages (SIL) • No Philippine Corpora • Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc) • “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)

  18. eWika: Digitalization of Philippine Languages • Build the Philippine Corpus • Build software tools to study or use the corpus • Across Regions • Across Forms and Genres • Across Languages

  19. Across Regions • Web-based application: GLOBALIZATION • upload, download, tools • Contributors (Main players) • Verifiers • Server: DLSU-M commits to host the server for the next three years. • Terms of Use: Research purposes.

  20. Across Languages • 171 Philippine Languages (SIL List) • start with 8 major languages • Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano • Filipino Sign Language

  21. Across Forms and Genres • In various forms: • Text • Speech • Video: Filipino sign language • In various Genres: • Text – literary & creative, essays, news articles, religious, etc • Speech – scripted, conversations, etc • Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)

  22. The dream of building electronic, online Philippine language resources and tools • Many many many major hurdles to overcome • NEEDED : Language Resources, Tools, & Peopleware
