Decision List LING 572 Fei Xia 1/12/06
Outline • Basic concepts and properties • Case study
Definitions • A decision list (DL) is an ordered list of conjunctive rules. • Rules can overlap, so the order is important. • A k-DL: the length of every rule is at most k. • A decision list determines an example's class by using the first matching rule.
An example A simple DL: • If X1=v11 && X2=v21 then c1 • If X2=v21 && X3=v34 then c2 To classify the example (v11, v21, v34), scan the rules in order: the first rule matches, so the class is c1. Every rule tests at most two attributes, so this DL is a 2-DL.
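To make the first-match semantics concrete, here is a minimal Python sketch (not from the lecture; the dictionary-based rule representation and attribute names are assumptions made for illustration):

def classify_dl(rules, example, default=None):
    """rules: list of (conditions, cls); conditions: dict attr -> required value."""
    for conditions, cls in rules:
        if all(example.get(attr) == val for attr, val in conditions.items()):
            return cls          # first matching rule wins; order matters
    return default              # no rule fired

# The 2-DL from the slide above
dl = [
    ({"X1": "v11", "X2": "v21"}, "c1"),
    ({"X2": "v21", "X3": "v34"}, "c2"),
]
print(classify_dl(dl, {"X1": "v11", "X2": "v21", "X3": "v34"}))  # -> c1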
Rivest's paper • It assumes that all attributes (including the goal attribute) are binary. • It shows that DLs are easily learnable from examples.
Assignment and formula • Input attributes: x1, …, xn • An assignment gives each input attribute a value (1 or 0): e.g., 10001 • A boolean formula (function) maps each assignment to a value (1 or 0).
Two formulae are equivalent if they give the same value for every input. • Total number of distinct formulae over n attributes: 2^(2^n), since each of the 2^n assignments can be mapped to 0 or 1. • Classification problem: learn a formula given a partial truth table.
CNF and DNF • Literal: a variable or its negation (e.g., x1 or ¬x1) • Term: conjunction ("and") of literals • Clause: disjunction ("or") of literals • CNF (conjunctive normal form): a conjunction of clauses • DNF (disjunctive normal form): a disjunction of terms • k-CNF: every clause has at most k literals; k-DNF: every term has at most k literals
A slightly different definition of DT • A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1. • k-DT: the depth of the DT is at most k. • A DT defines a boolean formula: take the disjunction, over all paths ending in a leaf labeled 1, of the conjunction of tests along each path. • A small worked example appears after the k-DT slide below.
Decision list • A decision list is a list of pairs (f1, v1), …, (fr, vr), where the fi are terms, the vi are values in {0, 1}, and fr = true. • A decision list defines a boolean function: given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.
Relations among different representations • CNF, DNF, DT, DL • k-CNF, k-DNF, k-DT, k-DL • For any k < n, k-DL is a proper superset of the other three. • Compared to a DT, a DL has a simpler structure, but the decisions allowed at each node are more complex.
k-CNF and k-DNF are proper subsets of k-DL • k-DNF is a subset of k-DL: each term t of a k-DNF becomes a decision rule (t, 1), followed by a final default rule (true, 0). • k-CNF is a subset of k-DL: every k-CNF is the complement of a k-DNF (k-CNF and k-DNF are duals of each other), and the complement of a k-DL is also a k-DL. • Neither k-CNF nor k-DNF is a subset of the other. • Ex: x1 ∨ x2 is a 1-DNF but not a 1-CNF; x1 ∧ x2 is a 1-CNF but not a 1-DNF.
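To make the k-DNF-to-k-DL conversion concrete, here is a small constructed example (not from the original slides): the 2-DNF (x1 ∧ x2) ∨ (x3 ∧ x4) becomes the 2-DL (x1 ∧ x2, 1), (x3 ∧ x4, 1), (true, 0). Each term fires as a rule outputting 1; an assignment that satisfies no term falls through to the default rule and gets 0, exactly as under the DNF.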
k-DT is a proper subset of k-DL • k-DT is a subset of k-DNF: each leaf labeled "1" maps to a term of the k-DNF (the conjunction of tests along its path). • k-DT is a subset of k-CNF: each leaf labeled "0" maps to a clause of the k-CNF (the negation of that path's conjunction). • So k-DT is a subset of both k-DNF and k-CNF, and hence of k-DL.
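A small constructed example of these conversions (not from the original slides): take the depth-2 DT that tests x1 at the root, tests x2 when x1 = 1 (leaf 1 if x2 = 1, leaf 0 if x2 = 0), and goes straight to leaf 1 when x1 = 0. • The paths to leaves labeled 1 give the 2-DNF (x1 ∧ x2) ∨ ¬x1. • The single path to the leaf labeled 0, negated, gives the 2-CNF (¬x1 ∨ x2). • Both express the same function, and reading the 2-DNF term by term gives the 2-DL (x1 ∧ x2, 1), (¬x1, 1), (true, 0).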
k-DT, k-CNF, k-DNF and k-DL [Venn diagram: k-DT lies inside the intersection of k-CNF and k-DNF, and k-CNF, k-DNF, and k-DT are all contained in k-DL]
Learnability • Positive examples vs. negative examples of the concept being learned. • In some domains, positive examples are easier to collect. • A sample is a set of examples. • A boolean function is consistent with a sample if it does not contradict any example in the sample.
Two properties of a learning algorithm • A learning algorithm is economical if it requires few examples to identify the correct concept. • A learning algorithm is efficient if it requires little computational effort to identify the correct concept. We prefer algorithms that are both economical and efficient.
Hypothesis space • Hypothesis space F: the set of concepts being considered. • Ideally, the concept being learned is in the hypothesis space of the learning algorithm. • The goal of a learning algorithm is to select the right concept from F given the training data.
Discrepancy between two functions f and g with respect to the distribution Pn over assignments: Δ(f, g) = Pr[ f(x) ≠ g(x) ] for x drawn according to Pn. • Ideally, we want the discrepancy between the learned function and the target function to be as small as possible; an accuracy parameter ε bounds it. • To deal with "bad luck" in drawing examples according to Pn, we also define a confidence parameter δ: the accuracy bound only has to hold with probability at least 1 - δ.
"Polynomially learnable" • A set of boolean functions F is polynomially learnable if there exist an algorithm A and a polynomial m(n, 1/ε, 1/δ) such that, for any target f in F, any distribution Pn, and any ε, δ > 0: when given a sample of f of size m(n, 1/ε, 1/δ) drawn according to Pn, A will with probability at least 1 - δ output a g in F s.t. Δ(f, g) ≤ ε. Furthermore, A's running time is polynomially bounded in n and m. • k-DL is polynomially learnable.
How to build a decision list • Decision tree → decision list: learn a DT and convert it into a DL. • Greedy, iterative algorithm that builds the DL directly.
The algorithm in (Rivest, 1987) 1. If the example set S is empty, halt. 2. Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same type v. 3. Add (t, v) to the decision list and remove those examples from S. 4. Repeat steps 1-3.
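A minimal Python sketch of this loop under the paper's binary-attribute assumption (the brute-force term enumeration and the data layout are illustrative choices, not the paper's pseudocode):

from itertools import combinations, product

def learn_k_dl(examples, n, k):
    """examples: list of (assignment, label) with assignment a tuple of 0/1 values.
    Returns a decision list [(term, label), ...] consistent with the sample,
    where a term is a tuple of (attribute index, required value) pairs."""
    S = list(examples)
    dl = []
    # all terms of length <= k: choose attributes, then a 0/1 value for each
    terms = [tuple(zip(idxs, vals))
             for r in range(k + 1)
             for idxs in combinations(range(n), r)
             for vals in product([0, 1], repeat=r)]
    while S:
        for t in terms:
            covered = [(x, y) for x, y in S if all(x[i] == v for i, v in t)]
            labels = {y for _, y in covered}
            if covered and len(labels) == 1:   # all covered examples share a label
                dl.append((t, labels.pop()))
                S = [e for e in S if e not in covered]
                break
        else:
            return None   # no term works: no k-DL is consistent with the sample
    return dl

Note that the empty term plays the role of Rivest's final (true, v) rule: it is only accepted once all remaining examples carry the same label, so it ends up as the default rule at the end of the list.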
The general greedy algorithm • RuleList=[], E=training_data • Repeat until E is empty or gain is small • f = Find_best_feature(E) • Let E’ be the examples covered by f • Let c be the most common class in E’ • Add (f, c) to RuleList • E=E – E’
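A Python sketch of this greedy loop (the entropy-based Find_best_feature and the min_gain stopping threshold are assumptions made for illustration; other quality measures work as well):

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def learn_greedy_dl(examples, features, min_gain=1e-3):
    """examples: list of (feature_set, label); features: candidate feature ids."""
    rule_list, E = [], list(examples)
    while E:
        best_f, best_score, best_cov = None, None, None
        for f in features:                             # Find_best_feature(E)
            covered = [(fs, y) for fs, y in E if f in fs]
            if not covered:
                continue
            score = entropy([y for _, y in covered])   # purity of covered examples
            if best_score is None or score < best_score:
                best_f, best_score, best_cov = f, score, covered
        if best_f is None or entropy([y for _, y in E]) - best_score < min_gain:
            break                                      # gain is small: stop
        c = Counter(y for _, y in best_cov).most_common(1)[0][0]
        rule_list.append((best_f, c))                  # (feature, most common class in E')
        E = [e for e in E if e not in best_cov]        # E = E - E'
    return rule_list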
Problems of the greedy algorithm • The interpretation of a rule depends on the preceding rules. • Each iteration reduces the number of remaining training examples. • Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL. • Several papers propose alternative algorithms (some are listed near the end of these slides).
Summary of (Rivest, 1987) • Gives a formal definition of DL. • Shows the relations among k-DT, k-CNF, k-DNF and k-DL. • Proves that k-DL is polynomially learnable. • Gives a simple greedy algorithm to build a k-DL.
Outline • Basic concepts and properties • Case study
In practice • Input attributes and the goal are not necessarily binary. • Ex: the previous word • A term → a feature (it is not necessarily a conjunction of literals) • Ex: the word appears in a k-word window • Only some feature types are considered, instead of all possible features: • Ex: previous word and next word • Greedy algorithm: rules are chosen by a quality measure • Ex: the feature with minimum entropy
Case study: accent restoration • Task: restore accents in Spanish and French text (a special case of word sense disambiguation, WSD). • Ex: ambiguous de-accented forms: • cesse → cesse, cessé • cote → côté, côte, cote, coté • Algorithm: build a DL for each ambiguous de-accented form: e.g., one for cesse, another one for cote • Attributes: words within a window
The algorithm • Training: • Find the list of de-accented forms that are ambiguous. • For each ambiguous form, build a decision list. • Testing: check each word in a sentence: • if its de-accented form is ambiguous, restore the accented form according to that form's DL.
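A rough Python sketch of this train/test flow (the helper names build_dl and apply_dl, and their signatures, are assumptions standing in for the DL learner and classifier described in the following steps):

import unicodedata
from collections import defaultdict

def strip_accents(word):
    """Drop combining accent marks (NFD-normalize, remove Mn characters)."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def train(sentences, build_dl):
    """Find the ambiguous de-accented forms and build one DL per form."""
    forms = defaultdict(set)
    for sent in sentences:
        for w in sent:
            forms[strip_accents(w)].add(w)
    ambiguous = [f for f, variants in forms.items() if len(variants) > 1]
    return {f: build_dl(f, sentences) for f in ambiguous}

def restore(sentence, dls, apply_dl):
    """Restore the accented form of each word whose de-accented form is ambiguous."""
    return [apply_dl(dls[strip_accents(w)], sentence, i)
            if strip_accents(w) in dls else w
            for i, w in enumerate(sentence)]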
Step 2: Collecting training context Context: the previous three and next three words. Strip the accents from the data. Why? Because the test input is de-accented, so the training contexts must match what the classifier will actually see at test time.
Step 3: Measure collocational distributions Feature types are pre-defined.
Step 4: Rank decision rules by log-likelihood • Each candidate rule is scored by the absolute log-likelihood ratio of the classes given its feature, e.g., abs( log( P(class1 | feature) / P(class2 | feature) ) ). • There are many alternative ranking measures.
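A small Python sketch of this ranking for the two-class case (the add-alpha smoothing is an assumption made to avoid zero counts; the lecture's exact smoothing may differ):

import math

def loglike_score(count_c1, count_c2, alpha=0.1):
    """abs(log(P(c1|feature) / P(c2|feature))), with add-alpha smoothing."""
    p1 = (count_c1 + alpha) / (count_c1 + count_c2 + 2 * alpha)
    p2 = (count_c2 + alpha) / (count_c1 + count_c2 + 2 * alpha)
    return abs(math.log(p1 / p2))

def rank_rules(feature_counts):
    """feature_counts: dict feature -> (count with class1, count with class2).
    Returns (feature, predicted class, score), best first."""
    rules = [(f, "class1" if c1 >= c2 else "class2", loglike_score(c1, c2))
             for f, (c1, c2) in feature_counts.items()]
    return sorted(rules, key=lambda r: r[2], reverse=True)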
Step 5: Pruning DLs • Pruning by cross-validation. • Removing redundant rules: e.g., if the more general "WEEKDAY" rule precedes the "domingo" rule, the "domingo" rule is subsumed (domingo is a WEEKDAY) and can be removed.
Building DL • For a de-accented form w, find all possible accented forms • Collect training contexts: • collect k words on each side of w • strip the accents from the data • Measure collocational distributions: • use pre-defined attribute combination: • Ex: “-1 w”, “+1w, +2w” • Rank decision rules by log-likelihood • Optional pruning and interpolation
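A Python sketch of the context-collection step (the feature-template names mirror the "-1 w" / "+1 w, +2 w" examples above; the exact template inventory is an assumption):

def context_features(tokens, i, k=3):
    """Collocational features for the token at position i, from a +/-k word window."""
    prev1 = tokens[i - 1] if i - 1 >= 0 else "<s>"
    next1 = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    next2 = tokens[i + 2] if i + 2 < len(tokens) else "</s>"
    feats = [("-1 w", prev1),                    # previous word
             ("+1 w", next1),                    # next word
             ("+1 w, +2 w", (next1, next2))]     # next two words
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            feats.append(("word in +/-%d window" % k, tokens[j]))
    return feats

# Example: features for the de-accented form "cote" at position 1
print(context_features("la cote de la bolsa subio".split(), 1))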
Experiments Prior (baseline): choose the most common form.
Global probabilities vs. Residual probabilities • Two ways to calculate the log-likelihood: • Global probabilities: using the full data set • Residual probabilities: using the residual training data • More relevant, but less data and more expensive to compute. • Interpolation: use both • In practice, global probability works better.
Combining vs. not combining evidence • In a DL, each decision is based on a single piece of evidence (the first rule that fires). • Advantages: run-time efficiency and easy modeling. • It works well, at least for this task, but why? • Combining all available evidence rarely produces a different result. • "The gross exaggeration of probability from combining all of these non-independent log-likelihoods is avoided."
Summary of case study • It allows a wider context (compared to n-gram methods). • It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods). • A "kitchen-sink approach of the best kind".
Probabilistic DL • DL: a rule is (f, v) • Probabilistic DL: a rule is (f, v1/p1, v2/p2, …, vn/pn); the first rule that fires returns a distribution over classes rather than a single class.
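A minimal Python sketch of probabilistic-DL classification (the predicate-based rule format is an assumption for illustration):

def classify_prob_dl(rules, example, default_dist=None):
    """rules: list of (predicate, dist); predicate: example -> bool;
    dist: dict class -> probability. The first rule that fires returns its dist."""
    for predicate, dist in rules:
        if predicate(example):
            return dist
    return default_dist

# The example rule from the summary slide below: if A & B then (c1, 0.8) (c2, 0.2)
rules = [(lambda ex: ex.get("A") and ex.get("B"), {"c1": 0.8, "c2": 0.2})]
print(classify_prob_dl(rules, {"A": True, "B": True}))   # -> {'c1': 0.8, 'c2': 0.2}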
Entropy of a feature [diagram: a feature q splits the training set T into S (examples where q fired) and T - S (q not fired); S1, S2, …, Sn are the class-based subsets of S used to compute its entropy]
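One hedged reading of that diagram, in standard notation (the exact scoring used in the lecture may differ): if S1, …, Sn are the class-based subsets of S, the entropy of S is H(S) = - Σ_i (|Si| / |S|) · log2(|Si| / |S|), and the greedy learner prefers features with low H(S), i.e., features whose covered examples are nearly all of one class, as in the sketch after the general greedy algorithm above.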
Algorithms for building DL • AQ algorithm (Michalski, 1969) • CN2 algorithm (Clark and Niblett, 1989) • Segal and Etzioni (1994) • Goodman (2002) • …
Summary of decision list • Rules are easily understood by humans (but remember that rule order matters). • DLs tend to be relatively small, and are fast and easy to apply in practice. • DL is related to DT, CNF, DNF, and TBL. • Learning: a greedy algorithm, plus various improved algorithms. • Extension: probabilistic DL • Ex: if A & B then (c1, 0.8) (c2, 0.2)