The Greedy Prepend Algorithm for Decision List Induction


Presentation Transcript


  1. The Greedy Prepend Algorithm for Decision List Induction Deniz Yuret Michael de la Maza

  2. Overview • Decision Lists • Greedy Prepend Algorithm • Opus search and UCI problems • Version space search and secondary structure prediction • Limited look-ahead search and Turkish morphology disambiguation

  3. Introduction to Decision Lists • Prototypical machine learning problem: decide democrat or republican for 435 representatives based on 16 votes.
Class Name: 2 (democrat, republican)
1. handicapped-infants: 2 (y,n)
2. water-project-cost-sharing: 2 (y,n)
3. adoption-of-the-budget-resolution: 2 (y,n)
4. physician-fee-freeze: 2 (y,n)
5. el-salvador-aid: 2 (y,n)
6. religious-groups-in-schools: 2 (y,n)
…
16. export-administration-act-south-africa: 2 (y,n)

  4. Introduction to Decision Lists • Prototypical machine learning problem: decide democrat or republican for 435 representatives based on 16 votes.
1. If adoption-of-the-budget-resolution = y and anti-satellite-test-ban = n and water-project-cost-sharing = y then democrat
2. If physician-fee-freeze = y then republican
3. If TRUE then democrat
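To make the first-match semantics concrete, here is a small Python sketch that applies the three rules above to one voting record (the record itself is made up for illustration): rules are tried top to bottom, the first rule whose condition holds fixes the class, and the final if-TRUE rule acts as the default.

# The decision list from slide 4, applied with first-match-wins semantics.
rules = [
    (lambda v: v.get("adoption-of-the-budget-resolution") == "y"
               and v.get("anti-satellite-test-ban") == "n"
               and v.get("water-project-cost-sharing") == "y", "democrat"),
    (lambda v: v.get("physician-fee-freeze") == "y", "republican"),
    (lambda v: True, "democrat"),   # default rule: If TRUE then democrat
]

def classify(votes):
    for condition, label in rules:
        if condition(votes):
            return label

# Hypothetical record: a representative who voted "y" on physician-fee-freeze.
print(classify({"physician-fee-freeze": "y"}))   # -> republican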

  5. Alternative Representations • Decision trees:

  6. Alternative Representations • CNF: • DNF:

  7. Alternative Representations • For 0 < k < n and n > 2, k-CNF(n) ∪ k-DNF(n) is a subset of k-DL(n) • For 0 < k < n and n > 2, k-DT(n) is a subset of k-CNF(n) ∩ k-DNF(n) • k-DT(n) is a subset of k-DL(n) (Rivest 1987)

  8. Overview • Decision Lists • Greedy Prepend Algorithm • Opus search and UCI problems • Version space search and secondary structure prediction • Limited look-ahead search and Turkish morphology disambiguation

  9. Decision List Induction • Start with an empty decision list or a default rule. • Keep adding the best rule covering the currently unclassified or misclassified cases. Design decisions: • Where to add new rules (front or back) • Criterion for the best rule • Search algorithm for finding the best rule

  10. The Greedy Prepend Algorithm
GPA(data)
  dlist = NIL
  default-class = most-common-class(data)
  rule = [if true then default-class]
  while gain(rule, dlist, data) > 0 do
    dlist = prepend(rule, dlist)
    rule = max-gain-rule(dlist, data)
  return dlist

  11. The Greedy Prepend Algorithm • Starts with a default rule that picks the most common class • Prepends subsequent rules to the front of the decision list • The best rule is the one with maximum gain (increase in number of correctly classified instances) • Several search algorithms implemented
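A minimal Python rendering of slides 10 and 11 (an illustrative sketch, not the authors' implementation): instances are attribute dicts, candidate rules test a single attribute=value condition found by exhaustive search, and the default rule is handled as a fallback rather than as an explicit first prepend. The real system uses the richer rule spaces and search strategies described on the following slides.

from collections import Counter

def predict(dlist, default, x):
    # Rules are tried front to back; the first matching rule decides the class.
    for conds, cls in dlist:
        if all(x.get(a) == v for a, v in conds):
            return cls
    return default

def n_correct(dlist, default, data):
    return sum(predict(dlist, default, x) == y for x, y in data)

def max_gain_rule(dlist, default, data):
    # Exhaustively try every single-condition rule for every class and return
    # the one that most increases the number of correctly classified instances.
    base = n_correct(dlist, default, data)
    conditions = {(a, v) for x, _ in data for a, v in x.items()}
    classes = {y for _, y in data}
    best_rule, best_gain = None, 0
    for cond in conditions:
        for cls in classes:
            gain = n_correct([([cond], cls)] + dlist, default, data) - base
            if gain > best_gain:
                best_rule, best_gain = ([cond], cls), gain
    return best_rule, best_gain

def gpa(data):
    # data: list of (attribute_dict, class_label) pairs. The default rule
    # ("if true then most-common-class") is kept implicit as the fallback in
    # predict() instead of being prepended explicitly.
    default = Counter(y for _, y in data).most_common(1)[0][0]
    dlist = []
    rule, gain = max_gain_rule(dlist, default, data)
    while rule is not None and gain > 0:
        dlist = [rule] + dlist              # prepend, never append
        rule, gain = max_gain_rule(dlist, default, data)
    return dlist, default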

  12. Rule Search • The default rule predicts all instances to belong to the most common category. [Figure: the training set partitioned into correct (+) and false (-) assignments with respect to the base rule]

  13. Rule Search • At each step add the maximum gain rule. [Figure: partitions of the training set (+/-) with respect to the next rule and with respect to the decision list]

  14. Overview • Decision Lists • Greedy Prepend Algorithm • Opus search and UCI problems • Version space search and secondary structure prediction • Limited look-ahead search and Turkish morphology disambiguation

  15. Opus Search: Simple tree

  16. Opus Search: Fixed order tree

  17. Opus Search: Optimal pruning

  18. GPA-Opus on UCI Problems

  19. Overview • Decision Lists • Greedy Prepend Algorithm • Opus search and UCI problems • Version space search and secondary structure prediction • Limited look-ahead search and Turkish morphology disambiguation

  20. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ??????????????????????????????????????

  21. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?????????????????????????????????????

  22. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?????????????????????????????????????

  23. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --????????????????????????????????????

  24. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --????????????????????????????????????

  25. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ---???????????????????????????????????

  26. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----????????????????????????????

  27. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----H???????????????????????????

  28. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----H???????????????????????????

  29. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HH??????????????????????????

  30. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE------?

  31. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------
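A hedged sketch of the left-to-right prediction animated above, assuming each residue's class (H = helix, E = strand, - = loop) is decided from a 9-residue window centered on it, as suggested by slide 36. The padding character and the placeholder predict_window function are illustrative assumptions standing in for the learned decision list.

# Sketch of window-based sequence-to-structure prediction. predict_window is
# a stand-in for the learned decision list of slides 32-35; here it simply
# returns the default class "-" (loop), i.e. GPA rule 1.

def predict_window(window):
    # Placeholder: the real decision list tests residues at the 9 positions.
    return "-"

def predict_structure(sequence, half_window=4):
    padded = "." * half_window + sequence + "." * half_window   # "." marks no residue
    structure = []
    for i in range(len(sequence)):
        window = padded[i:i + 2 * half_window + 1]   # 9 residues centered on position i
        structure.append(predict_window(window))      # one of "H", "E", "-"
    return "".join(structure)

print(predict_structure("MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD"))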

  32. GPA Rules • The first three rules of the sequence-to-structure decision list • These three rules alone achieve 58.86% accuracy (out of the full list's 66.36%)

  33. GPA Rule 1 • Everything => Loop

  34. GPA Rule 2

  35. GPA Rule 3

  36. GPA-Opus not feasible for secondary structure prediction • 9 positions • 20 possible amino acids per position • Size of rule space: • With only pos=val type attributes: 21^9 • If we include disjunctions: 2^180
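Reading these counts (my assumption about the intended arithmetic): 21 corresponds to the 20 amino acids plus an unconstrained "don't care" value, so single-value rules number 21^9 ≈ 7.9 × 10^11; allowing a disjunction lets each position accept any subset of the 20 amino acids, giving (2^20)^9 = 2^180 ≈ 1.5 × 10^54 possible rules, far too many for exhaustive Opus-style search.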

  37. GPA Version Space Search • Searching for a candidate rule: • Pick a random instance • If the instance is currently misclassified and the candidate rule corrects it: generalize the candidate rule to include the instance • If the instance is currently classified correctly and the candidate rule changes its classification: specialize the candidate rule to exclude the instance
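A hedged Python sketch of this search, assuming instances are (window, label) pairs and a rule maps window positions to the set of residues it accepts (unconstrained positions accept anything, cf. slide 36). The rule representation, the random position choice, and the fixed step budget are illustrative assumptions, not the authors' exact procedure.

import random

def rule_matches(constraints, window):
    return all(window[p] in allowed for p, allowed in constraints.items())

def generalize(constraints, window):
    # Relax the rule just enough so that it covers `window`.
    for p, allowed in constraints.items():
        if window[p] not in allowed:
            allowed.add(window[p])

def specialize(constraints, window, alphabet):
    # Tighten the rule so it no longer covers `window`: pick a position and
    # disallow the residue the instance has there.
    p = random.randrange(len(window))
    allowed = constraints.get(p, set(alphabet))
    allowed.discard(window[p])
    constraints[p] = allowed

def candidate_rule(data, current_prediction, rule_class, alphabet, steps=1000):
    constraints = {}                  # start with the maximally general rule
    for _ in range(steps):
        window, label = random.choice(data)
        misclassified = current_prediction(window) != label
        if misclassified and label == rule_class and not rule_matches(constraints, window):
            generalize(constraints, window)             # rule would correct this instance
        elif not misclassified and label != rule_class and rule_matches(constraints, window):
            specialize(constraints, window, alphabet)   # rule would break this instance
    return constraints, rule_class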

  38. GPA Secondary Structure Prediction Results (accuracy, %) • PhD 72.3 • NNSSP 71.7 • GPA 69.2 • DSC 69.1 • Predator 69.0

  39. Overview • Decision Lists • Greedy Prepend Algorithm • Opus search and UCI problems • Version space search and secondary structure prediction • Limited look-ahead search and Turkish morphology disambiguation

  40. Morphological Analyzer for Turkish • masalı: • masal+Noun+A3sg+Pnon+Acc (= the story) • masal+Noun+A3sg+P3sg+Nom (= his story) • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables) • Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing. • Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL'99. • Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.

  41. Features, IGs and Tags • Example tag: masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (a stem followed by features grouped into inflectional groups (IGs), separated by derivational boundaries ^DB) • 126 unique features • 9129 unique IGs • ∞ possible unique tags; 11084 distinct tags observed in a 1M-word training corpus

  42. Morphological Disambiguation • Task: pick the correct parse given the context • masal+Noun+A3sg+Pnon+Acc • masal+Noun+A3sg+P3sg+Nom • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With • Uzun masalı anlat (Tell the long story) • Uzun masalı bitti (His long story ended) • Uzun masalı oda (Room with long table)

  43. Morphological Disambiguation • Task: pick the correct parse given the context • masal+Noun+A3sg+Pnon+Acc • masal+Noun+A3sg+P3sg+Nom • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With • Key idea: build a separate classifier for each feature.

  44. GPA on Morphological Disambiguation • A decision list for the +Det feature (W = current word, L1/R1 = left/right neighbor):
1. If (W = çok) and (R1 = +DA) Then W has +Det
2. If (L1 = pek) Then W has +Det
3. If (W = +AzI) Then W does not have +Det
4. If (W = çok) Then W does not have +Det
5. If TRUE Then W has +Det
• Examples: "pek çok alanda" (R1), "pek çok insan" (R2), "insan çok daha" (R4)
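One plausible way (an assumption on my part, not spelled out on the slides) to use such per-feature decision lists for disambiguation: run the classifier for each morphological feature on the word in context, then prefer the candidate parse that agrees with the most feature decisions. A sketch:

# Hedged sketch of combining per-feature decision lists (slide 43's key idea).
# The classifiers, the context encoding, and the voting scheme are assumptions.

def has_feature(context, dlist):
    # dlist is a list of (condition, answer) pairs like the +Det rules above;
    # first match wins, and a final (lambda c: True, ...) entry plays the role
    # of the "If TRUE Then ..." rule.
    for condition, answer in dlist:
        if condition(context):
            return answer
    return False

def score_parse(parse_features, context, classifiers):
    # Count the morphological features on which the parse agrees with the
    # per-feature classifiers.
    return sum((feature in parse_features) == has_feature(context, dlist)
               for feature, dlist in classifiers.items())

def disambiguate(candidate_parses, context, classifiers):
    # Each candidate parse is a set of features, e.g. {"+Noun", "+A3sg", "+Acc"}.
    return max(candidate_parses, key=lambda p: score_parse(p, context, classifiers))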

  45. GPA-Opus not feasible • Attributes for a five-word window: • The exact word string (e.g. W=Ali'nin) • The lowercase version (e.g. W=ali'nin) • All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.) • Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST) • On average 40 features per instance.
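A minimal sketch of this attribute extraction for a single word. The exact feature strings are assumptions; in particular, the slide's suffix features abstract harmony vowels (W=+In, W=+nIn), while this sketch emits literal lowercase suffixes.

def word_features(word):
    feats = {f"W={word}", f"W={word.lower()}"}      # exact and lowercase forms
    lw = word.lower()
    for i in range(1, len(lw)):                      # all proper suffixes
        feats.add(f"W=+{lw[i:]}")
    for j, ch in enumerate(word):                    # character-type features
        pos = "FIRST" if j == 0 else "LAST" if j == len(word) - 1 else "MID"
        if ch == "'":
            feats.add(f"W=APOS-{pos}")
        elif ch.isupper():
            feats.add(f"W=UPPER-{pos}")
        elif ch.islower():
            feats.add(f"W=LOWER-{pos}")
    return feats

print(sorted(word_features("Ali'nin")))
# includes W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST as on the slide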

  46. GPA limited look-ahead search • New rules are restricted to adding one new feature to existing rules in the decision list
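A sketch of what this restriction could look like (the data structures are assumptions): candidates are generated only by adding one feature to a rule already in the decision list, or to the empty default rule, and each candidate would then be scored by gain as on slides 10-13 before the best one is prepended.

# Sketch of limited look-ahead candidate generation (slide 46). A rule is a
# (frozenset_of_features, predicted_class) pair; instances are (feature_set,
# label) pairs.

def candidate_rules(dlist, instances, classes):
    seen_features = {f for feats, _ in instances for f in feats}
    candidates = set()
    # Extend every existing rule, plus the empty (default) rule, by one feature.
    for conds, _ in dlist + [(frozenset(), None)]:
        for feature in seen_features - conds:
            for cls in classes:
                candidates.add((frozenset(conds | {feature}), cls))
    return candidates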

  47. GPA Turkish Morphological Disambiguation Results • Test corpus: 1000 hand-tagged words • Accuracy: 95.87% (confidence interval: 94.57-97.08) • Better than the training data!?

  48. Contributions and Future Work • Established GPA as a competitive alternative to SVMs, C4.5, etc. • Need a theory of why the best-gain rule does well. • Need to study robustness to irrelevant or redundant attributes. • Need to speed up the application of the resulting decision lists (convert to an FSM?)
