
Learning Sets of Rules




  1. Learning Sets of Rules

  2. Content • Introduction • Sequential Covering Algorithms • Learning Rule Sets: Summary • Learning First-Order Rules • Learning Sets of First-Order Rules: FOIL • Summary

  3. Introduction • If-Then Rules • Are very expressive • Easy to understand • Rules with variables: Horn clauses • A set of Horn clauses builds up a PROLOG program • Learning of Horn clauses: Inductive Logic Programming (ILP) • Example first-order rule set for the target concept Ancestor: IF Parent(x,y) THEN Ancestor(x,y); IF Parent(x,z) ∧ Ancestor(z,y) THEN Ancestor(x,y)
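The two Ancestor rules above can be read as a tiny logic program. The following Python sketch is purely illustrative (the tuple-based representation and the function name ancestors are assumptions, not part of the slides); it applies the two rules to a set of Parent facts until no new Ancestor facts can be derived.

    # Ancestor(x,y) holds if Parent(x,y), or if Parent(x,z) and Ancestor(z,y).
    def ancestors(parent_facts):
        """Compute all Ancestor(x,y) pairs from a set of Parent(x,y) tuples."""
        ancestor = set(parent_facts)              # rule 1: IF Parent(x,y) THEN Ancestor(x,y)
        changed = True
        while changed:                            # apply rule 2 until a fixpoint is reached
            changed = False
            for (x, z) in parent_facts:
                for (z2, y) in set(ancestor):
                    if z == z2 and (x, y) not in ancestor:
                        ancestor.add((x, y))      # IF Parent(x,z) AND Ancestor(z,y) THEN Ancestor(x,y)
                        changed = True
        return ancestor

    # Example: ancestors({("Tom", "Bob"), ("Bob", "Ann")}) also contains ("Tom", "Ann").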

  4. Introduction 2 • GOAL: Learning a target function as a set of IF-THEN rules • BEFORE: Learning with decision trees • Learn the decision tree • Translate the tree into a set of IF-THEN rules (one rule for each leaf) • OTHER POSSIBILITY: Learning with genetic algorithms • Each rule set is coded as a bit vector • Several genetic operators are used on the hypothesis space • TODAY AND HERE: • First: learning rules in propositional form • Second: learning rules in first-order form (Horn clauses, which include variables) • Sequential search for rules, one after the other

  5. Content • Introduction • Sequential Covering Algorithms • General to Specific Beam Search • Variations • Learning Rule Sets: Summary • Learning First-Order Rules • Learning Sets of First-Order Rules: FOIL • Summary

  6. Sequential Covering Algorithms • Goal of such an algorithm: learning a disjunctive set of rules that classifies the training data as well as possible • Principle: learn rule sets based on the strategy of learning one rule, removing the examples it covers, then iterating this process • Requirements for the Learn-One-Rule method: • As input it accepts a set of positive and negative training examples • As output it delivers a single rule that covers many of the positive examples and few, if any, of the negative examples • Required: the output rule must have high accuracy, but not necessarily high coverage

  7. Sequential Covering Algorithms 2 • Procedure: • Learning the set of rules invokes the Learn-One-Rule method on all of the available training examples • Remove every positive example covered by the rule • Optionally sort the final set of rules, so that more accurate rules are considered first • Greedy search: it is not guaranteed to find the smallest or best set of rules that covers the training examples.

  8. Sequential Covering Algorithms 3

    SequentialCovering( target_attribute, attributes, examples, threshold )
      learned_rules ← { }
      rule ← LearnOneRule( target_attribute, attributes, examples )
      while Performance( rule, examples ) > threshold do
        learned_rules ← learned_rules + rule
        examples ← examples - { examples correctly classified by rule }
        rule ← LearnOneRule( target_attribute, attributes, examples )
      learned_rules ← sort learned_rules according to Performance over examples
      return learned_rules
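The following Python sketch mirrors the SequentialCovering pseudocode above. It is a minimal illustration under stated assumptions: learn_one_rule and performance are passed in as placeholder callables, examples are dicts of attribute values, and a rule object is assumed to expose covers(example) and predict().

    def sequential_covering(target_attribute, attributes, examples, threshold,
                            learn_one_rule, performance):
        all_examples = list(examples)     # keep the full set for the final sort
        learned_rules = []
        rule = learn_one_rule(target_attribute, attributes, examples)
        while performance(rule, examples) > threshold:
            learned_rules.append(rule)
            # Remove the examples this rule classifies correctly, then learn the next rule.
            examples = [e for e in examples
                        if not (rule.covers(e) and rule.predict() == e[target_attribute])]
            rule = learn_one_rule(target_attribute, attributes, examples)
        # Sort so that more accurate rules are considered first at classification time.
        learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
        return learned_rules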

  9. General to Specific Beam Search • Specialising search • Organises the hypothesis space search in the same general fashion as ID3, but follows only the most promising branch of the tree at each step • Begin with the most general rule (empty precondition) • Follow the most promising branch: • Greedily add the attribute test that most improves the measured performance of the rule over the training examples • Greedy depth-first search with no backtracking • Danger of a sub-optimal choice • Reducing the risk: beam search (CN2 algorithm). The algorithm maintains a list of the k best candidates; in each search step, descendants (specialisations) are generated for each of these k candidates, and the resulting set is then reduced to its k most promising members

  10. General to Specific Beam Search 2 • Learning with a decision tree

  11. General to Specific Beam Search 3

  12. General to Specific Beam Search 4 • The CN2 Algorithm

    LearnOneRule( target_attribute, attributes, examples, k )
      Initialise best_hypothesis to the most general hypothesis Ø
      Initialise candidate_hypotheses to the set { best_hypothesis }
      while candidate_hypotheses is not empty do
        1. Generate the next more-specific candidate_hypotheses
        2. Update best_hypothesis
        3. Update candidate_hypotheses
      return a rule of the form "IF best_hypothesis THEN prediction",
        where prediction is the most frequent value of target_attribute
        among those examples that match best_hypothesis

    Performance( h, examples, target_attribute )
      h_examples ← the subset of examples that match h
      return -Entropy( h_examples ), where Entropy is computed with respect to target_attribute

  13. General to Specific Beam Search 5 • Generate the next more-specific candidate_hypotheses:
    all_constraints ← set of all constraints (a = v), where a ∈ attributes and v is a value of a occurring in the current set of examples
    new_candidate_hypotheses ← for each h in candidate_hypotheses and for each c in all_constraints, create a specialisation of h by adding the constraint c
    remove from new_candidate_hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific
  • Update best_hypothesis:
    for all h in new_candidate_hypotheses do
      if Performance( h, examples, target_attribute ) > Performance( best_hypothesis, examples, target_attribute )
        (and the difference is statistically significant when tested on the examples)
      then best_hypothesis ← h

  14. General to Specific Beam Search 6 • Update the candidate_hypotheses:
    candidate_hypotheses ← the k best members of new_candidate_hypotheses, according to the Performance function
  • The Performance function guides the search in Learn-One-Rule. It returns the negative entropy of the examples matching the hypothesis: -Entropy(s) = Σ_{i=1..c} p_i · log2 p_i, where
  • s: the current set of training examples
  • c: the number of possible values of the target attribute
  • p_i: the proportion of the examples in s that are classified with the i-th value
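A compact Python sketch of this LearnOneRule beam search (slides 12-14) follows. It is illustrative only: a hypothesis is assumed to be a dict of {attribute: required_value} constraints, examples are dicts of attribute values, and the statistical significance test is omitted.

    import math
    from collections import Counter

    def matches(hypothesis, example):
        return all(example.get(a) == v for a, v in hypothesis.items())

    def performance(hypothesis, examples, target_attribute):
        covered = [e for e in examples if matches(hypothesis, e)]
        if not covered:
            return float("-inf")
        counts = Counter(e[target_attribute] for e in covered)
        n = len(covered)
        # -Entropy(covered) = sum_i p_i * log2(p_i)
        return sum((c / n) * math.log2(c / n) for c in counts.values())

    def learn_one_rule(target_attribute, attributes, examples, k):
        best = {}                                   # most general hypothesis (empty precondition)
        candidates = [best]
        while candidates:
            # 1. Generate the next more-specific candidate hypotheses.
            constraints = {(a, e[a]) for e in examples
                           for a in attributes if a != target_attribute}
            new_candidates = []
            for h in candidates:
                for (a, v) in constraints:
                    if a not in h:                  # do not re-constrain an attribute
                        new_candidates.append({**h, a: v})
            # 2. Update best_hypothesis (significance test omitted in this sketch).
            for h in new_candidates:
                if performance(h, examples, target_attribute) > \
                   performance(best, examples, target_attribute):
                    best = h
            # 3. Keep only the k most promising candidates.
            new_candidates.sort(key=lambda h: performance(h, examples, target_attribute),
                                reverse=True)
            candidates = new_candidates[:k]
        prediction = Counter(e[target_attribute] for e in examples
                             if matches(best, e)).most_common(1)[0][0]
        return best, prediction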

  15. Example for the CN2 Algorithm

    LearnOneRule( EnjoySport, {Sky, AirTemp, Humidity, Wind, Water, Forecast, EnjoySport}, examples, 2 )
    best_hypothesis = Ø
    candidate_hypotheses = {Ø}
    all_constraints = {Sky=Sunny, Sky=Rainy, AirTemp=Warm, AirTemp=Cold, Humidity=Normal, Humidity=High, Wind=Strong, Water=Warm, Water=Cool, Forecast=Same, Forecast=Change}
    Performance = nc / n, where n = number of examples covered by the rule and nc = number of covered examples whose classification is correct

  16. Example for the CN2 Algorithm (2) • Pass 1 • The Remove step eliminates nothing • candidate_hypotheses = {Sky=Sunny, AirTemp=Warm} • best_hypothesis is Sky=Sunny

  17. Example for the CN2 Algorithm (3) • Pass 2 • Remove duplicate, inconsistent and not maximally specific hypotheses • candidate_hypotheses = {Sky=Sunny AND AirTemp=Warm, Sky=Sunny AND Humidity=High} • best_hypothesis remains Sky=Sunny

  18. Content • Introduction • Sequential Covering Algorithms • Learning Rule Sets: Summary • Learning First-Order Rules • Learning Sets of First-Order Rules: FOIL • Summary

  19. Learning Rule Sets: Summary • Key dimension in the design of a rule learning algorithm • Here sequential covering: learn one rule, remove the positive examples covered, iterate on the remaining examples • ID3: simultaneous covering • Which one should be preferred? • Key difference: the choice made at the most primitive step in the search. ID3 chooses among attributes by comparing the partitions of the data they generate; CN2 chooses among attribute-value pairs by comparing the subsets of data they cover • Number of choices: to learn n rules, each containing k attribute-value tests in their preconditions, CN2 performs n*k primitive search steps, whereas ID3 makes fewer independent search steps • If data is plentiful, it may support the larger number of independent decisions • If data is scarce, sharing decisions regarding the preconditions of different rules may be more effective

  20. Learning Rule Sets: Summary 2 • CN2: general-to-specific search (cf. Find-S: specific-to-general) • Advantage: there is a single maximally general hypothesis from which to begin the search, whereas there are many maximally specific ones • GOLEM: chooses several positive examples at random to initialise and guide the search; the best hypothesis obtained through multiple random choices is the one selected • CN2: generate then test; Find-S, AQ and Cigol are example-driven • Advantage of generate-then-test: each choice in the search is based on the hypothesis' performance over many examples • How are the rules post-pruned?

  21. Content • Introduction • Sequential Covering Algorithms • Learning Rule Sets: Summary • Learning First-Order Rules • First-Order Horn Clauses • Terminology • Learning Sets of First-Order Rules: FOIL • Summary

  22. Learning First-Order Rules • Previously: algorithms for learning sets of propositional rules • Now: extension of these algorithms to learning first-order rules • In particular: inductive logic programming (ILP): • Learning of first-order rules or theories • Uses PROLOG: a general-purpose, Turing-equivalent programming language in which programs are expressed as collections of Horn clauses.

  23. First-Order Horn Clauses • Advantage of the representation: learning Daughter(x,y) • Result of C4.5 or CN2: attribute-value tests tied to the specific constants in the training instances • Result of ILP: a relational rule such as IF Father(y,x) ∧ Female(x) THEN Daughter(x,y) • Problem: propositional representations offer no general way to describe the essential relations among the values of the attributes • In first-order Horn clauses, variables that do not occur in the postcondition may occur in the precondition • Possible: use the same predicates in the rule postconditions and preconditions (recursive rules)

  24. Terminology • Every well-formed expression is composed of constants, variables, predicates and functions • A term is any constant, any variable, or any function applied to any term • A literal is any predicate (or its negation) applied to any set of terms • A ground literal is a literal that does not contain any variables • A negative literal is a literal containing a negated predicate • A positive literal is a literal with no negation sign • A clause is any disjunction of literals whose variables are assumed to be universally quantified • A Horn clause is an expression of the form H ← (L1 ∧ ... ∧ Ln), where H, L1, ..., Ln are positive literals. H is called the head, postcondition or consequent of the Horn clause. The conjunction of literals L1 ∧ ... ∧ Ln is called the body, preconditions or antecedents of the Horn clause

  25. Terminology 2 • For any literals A and B, the expression (A ← B) is equivalent to (A ∨ ¬B), and the expression ¬(A ∧ B) is equivalent to (¬A ∨ ¬B) => a Horn clause H ← (L1 ∧ ... ∧ Ln) can equivalently be written as the disjunction H ∨ ¬L1 ∨ ... ∨ ¬Ln • A substitution is any function that replaces variables by terms. For example, the substitution {x/3, y/z} replaces the variable x by the term 3 and replaces the variable y by the term z. Given a substitution θ and a literal L, Lθ denotes the result of applying the substitution θ to L. • A unifying substitution for two literals L1 and L2 is any substitution θ such that L1θ = L2θ
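The following Python sketch makes the substitution and unification definitions above concrete for simple, function-free literals. The representation is an assumption for illustration only: a literal is a (predicate, args) tuple, variables are lowercase strings, and constants are capitalised.

    def is_variable(term):
        return isinstance(term, str) and term[:1].islower()

    def resolve(term, theta):
        # Follow variable bindings in theta until a constant or an unbound variable is reached.
        while is_variable(term) and term in theta:
            term = theta[term]
        return term

    def apply_substitution(literal, theta):
        """Return L·theta: the literal with every bound variable replaced by its term."""
        pred, args = literal
        return (pred, tuple(resolve(a, theta) for a in args))

    def unify(lit1, lit2):
        """Return a unifying substitution theta with lit1·theta == lit2·theta, or None."""
        (p1, args1), (p2, args2) = lit1, lit2
        if p1 != p2 or len(args1) != len(args2):
            return None
        theta = {}
        for a, b in zip(args1, args2):
            a, b = resolve(a, theta), resolve(b, theta)
            if a == b:
                continue
            if is_variable(a):
                theta[a] = b
            elif is_variable(b):
                theta[b] = a
            else:
                return None            # two different constants cannot be unified
        return theta

    # Example: unify(("Parent", ("x", "Bob")), ("Parent", ("Tom", "y")))
    # returns {"x": "Tom", "y": "Bob"}.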

  26. Content • Introduction • Sequential Covering Algorithms • Learning Rule Sets: Summary • Learning First-Order Rules • Learning Sets of First-Order Rules: FOIL • Generating Candidate Specialisations in FOIL • Guiding the Search in FOIL • Learning Recursive Rule Sets • Summary of FOIL • Summary

  27. Learning Sets of First-Order Rules: FOIL

    FOIL( target_predicate, predicates, examples )
      pos ← those examples for which target_predicate is true
      neg ← those examples for which target_predicate is false
      learned_rules ← { }
      while pos is not empty do
        /* learn a new rule */
        new_rule ← the rule that predicts target_predicate with no preconditions
        new_rule_neg ← neg
        while new_rule_neg is not empty do
          /* add a new literal to specialise new_rule */
          candidate_literals ← generate candidate new literals for new_rule, based on predicates
          best_literal ← argmax over L in candidate_literals of FoilGain( L, new_rule )
          add best_literal to the preconditions of new_rule
          new_rule_neg ← subset of new_rule_neg that satisfies new_rule's preconditions
        learned_rules ← learned_rules + new_rule
        pos ← pos - { members of pos covered by new_rule }
      return learned_rules

  28. Learning Sets of First-Order Rules: FOIL 2 • Outer loop: "specific-to-general" • Generalises the set of rules: • Adds a new rule to the existing hypothesis • Each new rule "generalises" the current hypothesis, increasing the number of instances classified as positive • Start with the most specific hypothesis, the empty set of rules, which classifies no instance as positive

  29. Learning Sets of First-Order Rules: FOIL 3 • The FOIL algorithm (repeated from slide 27)

  30. Learning Sets of First-Order Rules: FOIL 4 • Inner loop: "general-to-specific" • Specialisation of the current rule • Generation of new literals • Current rule: P(x1, x2, ..., xk) ← L1 ∧ ... ∧ Ln • New literals are considered in the following forms: • Q(v1, ..., vr), where Q is a predicate name occurring in Predicates (Q ∈ predicates) and the vi are either new variables or variables already present in the rule; at least one of the vi in the created literal must already exist as a variable in the rule • Equal(xj, xk), where xj and xk are variables already present in the rule • The negation of either of the above forms of literals
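A small Python sketch of this candidate-literal generation step follows. The data structures are illustrative assumptions (a literal is a (predicate, args, negated) tuple, predicate arities are passed as a dict), not FOIL's actual implementation.

    from itertools import product

    def candidate_literals(rule_variables, predicates):
        """Generate Q(v1,...,vr), Equal(xi,xj) and their negations, where at least
        one argument of each Q(...) literal is a variable already in the rule."""
        new_vars = ["v_new1", "v_new2"]               # a small stock of fresh variables
        candidates = []
        for pred, arity in predicates.items():        # e.g. {"Father": 2, "Female": 1}
            for args in product(rule_variables + new_vars, repeat=arity):
                if any(a in rule_variables for a in args):
                    candidates.append((pred, args, False))
                    candidates.append((pred, args, True))     # the negated form
        for xi, xj in product(rule_variables, repeat=2):      # Equal(xi, xj) literals
            if xi != xj:
                candidates.append(("Equal", (xi, xj), False))
                candidates.append(("Equal", (xi, xj), True))
        return candidates

    # Example: candidate_literals(["x", "y"], {"Father": 2, "Female": 1}) includes
    # ("Father", ("y", "v_new1"), False) and ("Female", ("x",), False).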

  31. Learning Sets of First-Order Rules: FOIL 5 • Example: predict GrandDaughter(x,y); other predicates: Father and Female • FOIL begins with the most general rule: GrandDaughter(x,y) ← • Specialisation: the following literals are generated as candidates: Equal(x,y), Female(x), Female(y), Father(x,y), Father(y,x), Father(x,z), Father(z,x), Father(y,z), Father(z,y), plus the negation of each of these literals • Suppose FOIL greedily selects Father(y,z): GrandDaughter(x,y) ← Father(y,z) • Generating the candidate literals to further specialise the rule: all literals generated in the previous step, plus Female(z), Equal(z,x), Equal(z,y), Father(z,w), Father(w,z) and their negations • Suppose FOIL greedily selects Father(z,x) at this point and Female(x) on the next iteration: GrandDaughter(x,y) ← Father(y,z) ∧ Father(z,x) ∧ Female(x)

  32. Guiding the Search in FOIL • To select the most promising literal, FOIL considers the performance of the rule over the training data • It considers all possible bindings of each variable in the current rule for each candidate literal • As new literals are added to the rule, the set of bindings changes • If a literal that introduces a new variable is added, the bindings for the rule grow in length • If the new variable can be bound to several different constants, the number of bindings satisfying the extended rule can be greater than the number associated with the original rule • The evaluation function of FOIL (FoilGain), which estimates the utility of adding a new literal to the rule, is based on the numbers of positive and negative bindings before and after adding the new literal

  33. Guiding the Search in FOIL 2 • FoilGain: consider some rule R and a candidate literal L that might be added to the body of R, yielding the extended rule R' • The gain is evaluated over all bindings; a high value means a high information gain • FoilGain(L, R) = t · ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ) • p0: the number of positive bindings of R • p1: the number of positive bindings of R' • n0: the number of negative bindings of R • n1: the number of negative bindings of R' • t: the number of positive bindings of rule R that are still covered after adding literal L to R
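As a quick check of the formula above, here is a direct Python translation (illustrative only; the function name and the guard for empty coverage are assumptions).

    import math

    def foil_gain(p0, n0, p1, n1, t):
        """Information gained by adding literal L to rule R.
        p0/n0: positive/negative bindings of R; p1/n1: of the extended rule R';
        t: positive bindings of R still covered after adding L."""
        if p0 == 0 or p1 == 0:
            return float("-inf")    # no positive bindings before or after adding L
        return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

    # Example: with p0=10, n0=10 and, after adding L, p1=8, n1=2, t=8:
    # foil_gain(10, 10, 8, 2, 8) = 8 * (log2(0.8) - log2(0.5)) ≈ 5.42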

  34. Summary of FOIL • FOIL extends CN2-style sequential covering to learn first-order rules similar to Horn clauses • During the learning process, FOIL performs a general-to-specific search, at each step adding a single new literal to the rule preconditions • At each step, FoilGain is used to select one of the candidate literals • If new literals are allowed to refer to the target predicate, then FOIL can, in principle, learn sets of recursive rules • To handle noisy data, the search is continued until some trade-off is reached between rule accuracy, coverage and complexity • FOIL uses a minimum description length approach to limit the growth of rules: new literals are added only if their description length is shorter than the description length of the training data they explain.

  35. Summary • Sequential covering learns a disjunctive set of rules by first learning a single accurate rule, then removing the positive examples covered by this rule and iterating the process over the remaining training examples • There is a variety of methods, which differ in their search strategy • CN2: general-to-specific beam search, generating and testing progressively more specific rules until a sufficiently accurate rule is found • Sets of first-order rules provide a highly expressive representation • FOIL: a sequential covering algorithm that learns sets of first-order, Horn-clause-like rules through a general-to-specific search guided by FoilGain
