Learning Morphological Disambiguation Rules for Turkish

Deniz Yuret, Ferhan Türe. Koç University, İstanbul. Overview: Turkish morphology, the morphological disambiguation task, the Greedy Prepend Algorithm, training, evaluation.



  1. Learning Morphological Disambiguation Rules for Turkish Deniz Yuret Ferhan Türe Koç University, İstanbul

  2. Overview • Turkish morphology • The morphological disambiguation task • The Greedy Prepend Algorithm • Training • Evaluation

  3. Turkish Morphology • Turkish is an agglutinative language: many syntactic phenomena expressed by function words and word order in English are expressed by morphology in Turkish.
     Example: "I will be able to go." = (go) + (able to) + (will) + (I) = git + ebil + ecek + im = Gidebileceğim.

  4. Fun with Turkish Morphology
     Avrupalılaştıramadıklarımızdanmışsınız
     • Avrupa: Europe
     • lı: European
     • laş: become
     • tır: make
     • ama: not able to
     • dık: we were
     • larımız: those that
     • dan: from
     • mış: were
     • sınız: you

  5. So how long can words be? • uyu – sleep • uyut – make X sleep • uyuttur – have Y make X sleep • uyutturt – have Z have Y make X sleep • uyutturttur – have W have Z have Y make X sleep • uyutturtturt – have Q have W have Z … • …

  6. Morphological Analyzer for Turkish
     Analyses of the word masalı:
     • masal+Noun+A3sg+Pnon+Acc (= the story)
     • masal+Noun+A3sg+P3sg+Nom (= his story)
     • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables)
     References:
     • Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.
     • Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL'99.
     • Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.

  7. Features, IGs and Tags
     Example tag: masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
     Diagram labels: stem, features, inflectional group (IG), derivational boundary, tag
     • 126 unique features
     • 9129 unique IGs
     • ∞ unique tags
     • 11084 distinct tags observed in 1M word training corpus
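
The structure of a tag can be made concrete with a small sketch. It assumes, from the example tag on the slide, that `^DB` marks a derivational boundary between inflectional groups and that `+` separates the stem and the features; the function name `parse_tag` is hypothetical and not part of the presentation.

```python
def parse_tag(tag):
    """Split a tag into its stem and a list of inflectional groups (IGs).

    Assumption: "^DB" separates IGs and "+" separates stem and features,
    as in the example tag on the slide.
    """
    first, *rest = tag.split("^DB")
    stem, *first_feats = first.split("+")
    igs = [first_feats]
    for ig in rest:
        # Later IGs start with "+", so drop the empty leading piece.
        igs.append([f for f in ig.split("+") if f])
    return stem, igs


stem, igs = parse_tag("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
print(stem)  # masa
print(igs)   # [['Noun', 'A3sg', 'Pnon', 'Nom'], ['Adj', 'With']]
```

Counting distinct features, IGs and tags over a corpus would then amount to collecting these pieces into sets.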

  8. Why not just do POS tagging? [Figure: from Oflazer (1999)]

  9. Why not just do POS tagging? • Inflectional groups can independently act as heads or modifiers in syntactic dependencies. • Full morphological analysis is essential for further syntactic analysis.

  10. Morphological disambiguation
      • Ambiguity is rare in English: lives = live+s or life+s
      • It is more serious in Turkish:
        • 42.1% of the tokens are ambiguous
        • 1.8 parses per token on average
        • 3.8 parses for ambiguous tokens

  11. Morphological disambiguation
      Task: pick the correct parse given the context.
      • masal+Noun+A3sg+Pnon+Acc
      • masal+Noun+A3sg+P3sg+Nom
      • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
      Examples:
      • Uzun masalı anlat (Tell the long story)
      • Uzun masalı bitti (His long story ended)
      • Uzun masalı oda (Room with long table)

  12. Morphological disambiguation • Task: pick the correct parse given the context • masal+Noun+A3sg+Pnon+Acc • masal+Noun+A3sg+P3sg+Nom • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With • Key idea: build a separate classifier for each feature.

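  13. Decision Lists
      Rules for +Det, tried in order; the first rule that matches decides. W is the target word, L1 and R1 its left and right neighbors.
      1. If (W = çok) and (R1 = +DA) Then W has +Det
      2. If (L1 = pek) Then W has +Det
      3. If (W = +AzI) Then W does not have +Det
      4. If (W = çok) Then W does not have +Det
      5. If TRUE Then W has +Det
      Examples, with the rule that decides çok in each:
      • “pek çok alanda” (rule 1)
      • “pek çok insan” (rule 2)
      • “insan çok daha” (rule 4)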
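
Applying a decision list like the one on this slide can be sketched as follows. This is an illustration under assumptions, not the authors' code: the surface forms chosen for the suffix patterns `+DA` and `+AzI` are guesses at how the capital-letter placeholders expand, and the names `DET_RULES` and `classify` are hypothetical.

```python
def ends_with_any(word, suffixes):
    return any(word.endswith(s) for s in suffixes)


# Assumed surface forms of the suffix patterns on the slide.
DA = ("da", "de", "ta", "te")   # +DA
AZI = ("azı", "ezi")            # +AzI

# Each rule is (condition, has_det); conditions see the left neighbor L1,
# the target word W and the right neighbor R1.
DET_RULES = [
    (lambda L1, W, R1: W == "çok" and ends_with_any(R1, DA), True),
    (lambda L1, W, R1: L1 == "pek", True),
    (lambda L1, W, R1: ends_with_any(W, AZI), False),
    (lambda L1, W, R1: W == "çok", False),
    (lambda L1, W, R1: True, True),  # default rule
]


def classify(rules, L1, W, R1):
    """Return (rule number, class) for the first rule whose condition holds."""
    for number, (condition, label) in enumerate(rules, start=1):
        if condition(L1, W, R1):
            return number, label


print(classify(DET_RULES, "pek", "çok", "alanda"))  # (1, True)
print(classify(DET_RULES, "pek", "çok", "insan"))   # (2, True)
print(classify(DET_RULES, "insan", "çok", "daha"))  # (4, False)
```

Because the last rule always matches, `classify` returns a decision for every input.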

  14. Greedy Prepend Algorithm
      GPA(data)
        dlist = NIL
        default-class = Most-Common-Class(data)
        rule = [If TRUE Then default-class]
        while Gain(rule, dlist, data) > 0
          do dlist = prepend(rule, dlist)
             rule = Max-Gain-Rule(dlist, data)
        return dlist
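
The pseudocode above can be turned into a small runnable sketch. Two things here are assumptions rather than details taken from the slides: the gain of a rule is taken to be the net increase in correctly classified training instances when the rule is prepended, and candidate rules are restricted to a single attribute for brevity. The toy data is made up for illustration.

```python
from collections import Counter


def predict(dlist, attrs):
    """Class of the first rule whose conditions all hold, or None if no rule matches."""
    for conds, label in dlist:
        if conds <= attrs:
            return label
    return None


def gain(rule, dlist, data):
    """Assumed gain: net change in correct predictions if rule is prepended."""
    conds, label = rule
    total = 0
    for attrs, gold in data:
        if conds <= attrs:  # the new rule would fire first on this instance
            total += (label == gold) - (predict(dlist, attrs) == gold)
    return total


def max_gain_rule(dlist, data):
    """Best single-attribute rule (a simplification of the candidate search)."""
    classes = {gold for _, gold in data}
    candidates = {(frozenset([a]), c) for attrs, _ in data for a in attrs for c in classes}
    ordered = sorted(candidates, key=lambda r: (sorted(r[0]), r[1]))  # stable tie-breaking
    return max(ordered, key=lambda r: gain(r, dlist, data))


def gpa(data):
    dlist = []
    default = Counter(gold for _, gold in data).most_common(1)[0][0]
    rule = (frozenset(), default)  # [If TRUE Then default-class]
    while gain(rule, dlist, data) > 0:
        dlist.insert(0, rule)      # prepend
        rule = max_gain_rule(dlist, data)
    return dlist


# Toy training set for one feature: (attributes, has_feature).
data = [
    (frozenset({"W=çok", "L1=pek"}), True),
    (frozenset({"W=çok", "L1=pek"}), True),
    (frozenset({"W=çok", "L1=insan"}), False),
    (frozenset({"W=çok", "L1=ama"}), False),
    (frozenset({"W=çok", "L1=ve"}), False),
    (frozenset({"W=çok", "L1=o"}), False),
    (frozenset({"W=bu"}), True),
    (frozenset({"W=bir"}), True),
    (frozenset({"W=her"}), True),
]
for conds, label in gpa(data):
    print(sorted(conds), label)
```

On this toy data the learner starts from the default rule, prepends a general exception, and then prepends a more specific exception to that exception, so the most specific rules end up at the front of the list.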

  15. Training Data • 1M words of news material • Semi automatically disambiguated • Created 126 separate training sets, one for each feature • Each training set only contains instances which have the corresponding feature in at least one of their parses
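
The construction of the per-feature training sets described above can be sketched as follows. The corpus format, a list of `(context, candidate_parses, correct_parse)` triples, and the function names are assumptions made for illustration; the slides do not specify the data layout.

```python
from collections import defaultdict


def features_of(parse):
    """Set of features in a parse, ignoring the stem and IG boundaries."""
    feats = set()
    for ig in parse.split("^DB"):
        # The first piece is the stem (or empty for later IGs), so skip it.
        feats.update(f for f in ig.split("+")[1:] if f)
    return feats


def build_training_sets(corpus):
    """One training set per feature, holding only instances where the
    feature appears in at least one candidate parse."""
    sets = defaultdict(list)
    for context, candidates, correct in corpus:
        gold = features_of(correct)
        seen = set().union(*(features_of(p) for p in candidates))
        for feat in seen:
            sets[feat].append((context, feat in gold))
    return sets


corpus = [(
    "Uzun masalı anlat",
    ["masal+Noun+A3sg+Pnon+Acc",
     "masal+Noun+A3sg+P3sg+Nom",
     "masa+Noun+A3sg+Pnon+Nom^DB+Adj+With"],
    "masal+Noun+A3sg+Pnon+Acc",
)]
training_sets = build_training_sets(corpus)
print(training_sets["Acc"])   # [('Uzun masalı anlat', True)]
print(training_sets["P3sg"])  # [('Uzun masalı anlat', False)]
```

Features that appear in no candidate parse of an instance, such as a case marker none of the parses carries, contribute nothing to that feature's training set.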

  16. Input attributes
      For each word in a five-word window:
      • The exact word string (e.g. W=Ali'nin)
      • The lowercase version (e.g. W=ali'nin)
      • All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.)
      • Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)
      On average, 40 attributes per instance.
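
A rough sketch of this attribute extraction is given below. Several choices are assumptions: the position labels `L2` and `R2` for the outer window slots, the `==` and `=~` markers for exact and lowercase strings, and the use of plain surface suffixes instead of the abstract forms such as `+nIn` shown on the slide. Python's `str.lower` is also not Turkish-aware for dotted and dotless I.

```python
def char_class(ch):
    if ch.isupper():
        return "UPPER"
    if ch.islower():
        return "LOWER"
    if ch.isdigit():
        return "DIGIT"
    if ch == "'":
        return "APOS"
    return "OTHER"


def word_attributes(word, pos="W"):
    """Attributes of one word at window position pos."""
    lower = word.lower()
    attrs = {f"{pos}=={word}", f"{pos}=~{lower}"}  # exact and lowercase strings
    for i in range(1, len(lower)):                 # every proper suffix
        attrs.add(f"{pos}=+{lower[i:]}")
    if word:                                       # character types by position
        attrs.add(f"{pos}={char_class(word[0])}-FIRST")
        attrs.add(f"{pos}={char_class(word[-1])}-LAST")
        for ch in word[1:-1]:
            attrs.add(f"{pos}={char_class(ch)}-MID")
    return attrs


def window_attributes(tokens, i):
    """Attributes of the five-word window centered on tokens[i]."""
    attrs = set()
    for offset, pos in [(-2, "L2"), (-1, "L1"), (0, "W"), (1, "R1"), (2, "R2")]:
        j = i + offset
        if 0 <= j < len(tokens):
            attrs |= word_attributes(tokens[j], pos)
    return attrs


print(sorted(word_attributes("Ali'nin")))
```

Window positions that fall outside the sentence simply contribute no attributes in this sketch.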

  17. Sample decision lists
      Each row gives a class (0 or 1) followed by the rule's conditions; the first row of each list has no condition.
      +Acc (672 rules)
        0
        1  W=+InI
        1  W=+yI
        1  W=UPPER0
        1  W=+IzI
        1  L1=~bu
        1  W=~onu
        1  R1=+mAK
        1  W=~beni
        0  W=~günü
        1  W=+InlArI
        1  W=~onları
        0  W=+olAyI
        0  W=~sorunu
        …
      +Prop (3476 rules)
        1
        0  W=STFIRST
        0  W==Türk
        1  W=STFIRST R1=UCFIRST
        0  L1==.
        0  W=+AnAl
        1  R1==,
        0  W=+yAD
        1  W=UPPER0
        0  W=+lAD
        0  W=+AK
        1  R1=UPPER
        0  W==Milli
        1  W=STFIRST R1=UPPER0
        …

  18. Models for individual features

  19. Combining models
      Candidate parses:
      • masal+Noun+A3sg+P3sg+Nom
      • masal+Noun+A3sg+Pnon+Acc
      Decision list results and confidence (only the distinguishing features are needed):
      • P3sg = yes (89.53%)
      • Nom = no (93.92%)
      • Pnon = no (95.03%)
      • Acc = yes (89.24%)
      Scores:
      • score(P3sg+Nom) = 0.8953 × (1 – 0.9392) ≈ 0.0544
      • score(Pnon+Acc) = (1 – 0.9503) × 0.8924 ≈ 0.0444
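
The score computation on this slide can be reproduced with a few lines. The sketch assumes, following the two formulas shown, that a parse is scored by multiplying, over its distinguishing features, the probability that each feature is present: the confidence when the decision list said yes, and one minus the confidence when it said no. The names `PREDICTIONS`, `prob_present` and `score` are hypothetical.

```python
import math

# Decision list outputs from the slide: feature -> (answer, confidence).
PREDICTIONS = {
    "P3sg": (True, 0.8953),
    "Nom": (False, 0.9392),
    "Pnon": (False, 0.9503),
    "Acc": (True, 0.8924),
}


def prob_present(feature):
    """Estimated probability that the feature is present in the correct parse."""
    answer, confidence = PREDICTIONS[feature]
    return confidence if answer else 1.0 - confidence


def score(features):
    return math.prod(prob_present(f) for f in features)


# Distinguishing features of each candidate parse.
candidates = {
    "masal+Noun+A3sg+P3sg+Nom": ["P3sg", "Nom"],
    "masal+Noun+A3sg+Pnon+Acc": ["Pnon", "Acc"],
}
scores = {parse: score(feats) for parse, feats in candidates.items()}
best = max(scores, key=scores.get)
for parse, s in scores.items():
    print(f"{parse}: {s:.4f}")
print("best:", best)
```

With the numbers on the slide, the P3sg+Nom parse scores slightly higher than the Pnon+Acc parse.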

  20. Evaluation • Test corpus: 1000 words, hand tagged • Accuracy: 95.87% (conf. int: 94.57-97.08) • Better than the training data!?

  21. Other Experiments • Retraining on own output: 96.03% • Training on unambiguous data: 82.57% • Forget disambiguation, let’s do tagging with a single decision list: 91.23%, 10000 rules

  22. Contributions • Learning morphological disambiguation rules using the GPA decision list learner. • Reducing data sparseness and increasing noise tolerance by using separate models for individual output features. • ECOC, WSD, etc.
