A PAC-Bayes Risk Bound for General Loss Functions NIPS 2006 Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand Université Laval, Québec, Canada
Summary
• We provide a (tight) PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions, such as the exponential loss and the logistic loss.
• Experiments with AdaBoost indicate that the upper bound (computed on the training set) behaves very similarly to the true loss (estimated on the testing set).
Convex Combinations of Classifiers
• Consider any set H of {−1, +1}-valued classifiers and any posterior Q on H.
• For any input example x, the [−1, +1]-valued output f_Q(x) of a convex combination of classifiers is given by
f_Q(x) = E_{h∼Q} h(x) = Σ_{h∈H} Q(h) h(x).
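As a minimal numerical sketch of this definition (the three threshold classifiers and the posterior Q below are made-up toy values, not taken from the paper):

```python
# Toy convex combination f_Q of {-1,+1}-valued classifiers.
classifiers = [
    lambda x: 1 if x > 0.2 else -1,
    lambda x: 1 if x > 0.5 else -1,
    lambda x: 1 if x > 0.8 else -1,
]
Q = [0.5, 0.3, 0.2]  # posterior weights over the classifiers, summing to 1

def f_Q(x):
    """[-1,+1]-valued output of the convex combination: sum over h of Q(h) * h(x)."""
    return sum(q * h(x) for q, h in zip(Q, classifiers))

print(f_Q(0.6))  # 0.5*1 + 0.3*1 + 0.2*(-1) = 0.6
```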
The Margin and W_Q(x,y)
• W_Q(x,y) is the fraction, under measure Q, of classifiers that err on example (x,y):
W_Q(x,y) = E_{h∼Q} I(h(x) ≠ y).
• It is related to the margin y f_Q(x) by
y f_Q(x) = 1 − 2 W_Q(x,y).
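A one-line check of this relation (it only uses the fact that y and every h(x) take values in {−1, +1}):

```latex
y f_Q(x) \;=\; \mathop{\mathbf{E}}_{h\sim Q} y\, h(x)
        \;=\; \mathop{\mathbf{E}}_{h\sim Q}\bigl[(+1)\, I(h(x)=y) + (-1)\, I(h(x)\neq y)\bigr]
        \;=\; \bigl(1 - W_Q(x,y)\bigr) - W_Q(x,y)
        \;=\; 1 - 2\,W_Q(x,y).
```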
General Loss Functions ζ_Q(x,y)
• Hence, we consider any loss function ζ_Q(x,y) that can be written as a Taylor series around W_Q = ½:
ζ_Q(x,y) = Σ_{k=0}^∞ g(k) (1 − 2 W_Q(x,y))^k,   with   c = Σ_{k=1}^∞ |g(k)|,
• and our task is to provide tight bounds for the expected loss ζ_Q = E_{(x,y)∼D} ζ_Q(x,y) that depend on the empirical loss measured on a training set of m examples,
ζ̂_Q = (1/m) Σ_{i=1}^m ζ_Q(x_i, y_i).
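A small sketch of how such a loss can be manipulated numerically from its coefficients g(k); the truncation level K_MAX and the example coefficients (those of exp(−γ · margin)) are illustrative assumptions:

```python
from math import exp, factorial

K_MAX = 60  # truncation level for the Taylor series (illustrative)

def zeta(W, g):
    """Loss value reconstructed from its Taylor coefficients g(k), as a series in (1 - 2W)."""
    margin = 1.0 - 2.0 * W          # equals y * f_Q(x)
    return sum(g(k) * margin**k for k in range(K_MAX))

def c_value(g):
    """c = sum over k >= 1 of |g(k)|; it controls how tight the bound can be."""
    return sum(abs(g(k)) for k in range(1, K_MAX))

# Example: g(k) = (-gamma)^k / k!, i.e. zeta = exp(-gamma * margin).
gamma = 2.0
g_exp = lambda k: (-gamma) ** k / factorial(k)

print(zeta(0.25, g_exp), exp(-gamma * 0.5))  # both approximately exp(-1)
print(c_value(g_exp), exp(gamma) - 1.0)      # c = e^gamma - 1
```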
Bounds for the Majority Vote
• A bound on ζ_Q also provides a bound on the risk R(B_Q) of the Q-weighted majority vote B_Q since, for any loss that is non-negative and non-decreasing in W_Q(x,y), the zero-one loss of the majority vote is upper bounded pointwise by ζ_Q(x,y)/ζ(½); hence R(B_Q) ≤ ζ_Q / ζ(½).
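A short justification of this step, assuming only that ζ_Q(x,y) is non-negative and non-decreasing in W_Q(x,y), and writing ζ(½) for its value at W_Q(x,y) = ½:

```latex
I\bigl(W_Q(x,y)\ge \tfrac12\bigr) \;\le\; \frac{\zeta_Q(x,y)}{\zeta(1/2)}
\quad\text{pointwise, so}\quad
R(B_Q) \;=\; \Pr_{(x,y)\sim D}\!\bigl(W_Q(x,y)\ge\tfrac12\bigr)
\;\le\; \frac{1}{\zeta(1/2)}\mathop{\mathbf{E}}_{(x,y)\sim D}\zeta_Q(x,y)
\;=\; \frac{\zeta_Q}{\zeta(1/2)}.
```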
Proof
• For every k ≥ 1,
(y f_Q(x))^k = (E_{h∼Q} y h(x))^k = E_{h_1,…,h_k∼Q^k} y^k h_{1…k}(x),
where h_{1…k} denotes the product of k classifiers, h_{1…k}(x) = h_1(x) h_2(x) ⋯ h_k(x). Hence
ζ_Q = E_{(x,y)∼D} ζ_Q(x,y) = g(0) + Σ_{k=1}^∞ g(k) E_{h_1,…,h_k∼Q^k} E_{(x,y)∼D} y^k h_{1…k}(x).
Proof (cont.)
• Let us define the "error rate" R(h_{1…k}) as
R(h_{1…k}) = E_{(x,y)∼D} I(y^k h_{1…k}(x) = −1),   so that   E_{(x,y)∼D} y^k h_{1…k}(x) = 1 − 2 R(h_{1…k}),
• to relate ζ_Q to the error rate of a new Gibbs classifier G_Q̄:
ζ_Q = g(0) + Σ_{k=1}^∞ g(k) E_{h_1,…,h_k∼Q^k} (1 − 2 R(h_{1…k})).
Proof (cont.)
• where Q̄ is a distribution over products of classifiers that works as follows:
• A number k is chosen according to |g(k)|/c
• k classifiers in H are chosen according to Q^k
• So R(G_Q̄) denotes the risk of this Gibbs classifier:
R(G_Q̄) = Σ_{k=1}^∞ (|g(k)|/c) E_{h_1,…,h_k∼Q^k} R(h_{1…k}).
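This construction can be read as a sampling procedure. Below is a rough Monte-Carlo sketch of it; the classifiers, posterior, coefficients, and data are all toy placeholders, not the paper's experiments:

```python
import random
from math import factorial, prod

# Toy ingredients: two threshold classifiers, a posterior Q, and coefficients g(k).
classifiers = [lambda x: 1 if x > 0.3 else -1,
               lambda x: 1 if x > 0.7 else -1]
Q = [0.6, 0.4]
gamma, K_MAX = 1.0, 30
g = [(-gamma) ** k / factorial(k) for k in range(K_MAX)]
c = sum(abs(g[k]) for k in range(1, K_MAX))

def sample_from_Q_bar():
    """Draw a product classifier: k with probability |g(k)|/c, then k classifiers i.i.d. from Q."""
    k = random.choices(range(1, K_MAX),
                       weights=[abs(g[j]) / c for j in range(1, K_MAX)])[0]
    hs = random.choices(classifiers, weights=Q, k=k)
    return k, hs

def gibbs_risk_estimate(data, n_draws=5000):
    """Monte-Carlo estimate of R(G_Q_bar): how often h_1(x)...h_k(x) differs from y^k."""
    errors = 0
    for _ in range(n_draws):
        k, hs = sample_from_Q_bar()
        x, y = random.choice(data)
        errors += prod(h(x) for h in hs) != y ** k
    return errors / n_draws

data = [(0.1, -1), (0.5, 1), (0.9, 1)]  # toy labelled sample
print(gibbs_risk_estimate(data))
```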
Proof (cont.)
• The standard PAC-Bayes theorem implies that for any prior P̄ on H* = ∪_{k∈ℕ⁺} H^k we have, with probability ≥ 1 − δ over the choice of the m training examples, simultaneously for all Q̄:
kl(R_S(G_Q̄) ‖ R(G_Q̄)) ≤ (1/m) [ KL(Q̄‖P̄) + ln((m+1)/δ) ],
where R_S(G_Q̄) is the empirical Gibbs risk and kl(q‖p) is the KL divergence between Bernoulli distributions of parameters q and p.
• Our theorem follows for any P̄ having the same structure as Q̄ (i.e., k is first chosen according to |g(k)|/c, then k classifiers are chosen according to P^k) since, in that case, we have
KL(Q̄‖P̄) = (1/c) (Σ_{k=1}^∞ k |g(k)|) · KL(Q‖P).
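Assuming the standard kl form of the PAC-Bayes theorem stated above, the resulting bound on R(G_Q̄) can be evaluated numerically by inverting the binary kl divergence; the input numbers below are placeholders:

```python
from math import log

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def pac_bayes_upper_bound(emp_risk, kl_term, m, delta):
    """Largest p such that kl(emp_risk || p) <= (kl_term + ln((m+1)/delta)) / m, by bisection."""
    rhs = (kl_term + log((m + 1) / delta)) / m
    lo, hi = emp_risk, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Placeholder numbers: empirical Gibbs risk 0.2, KL(Q_bar || P_bar) = 5, m = 1000, delta = 0.05.
print(pac_bayes_upper_bound(0.2, 5.0, 1000, delta=0.05))
```

The bound on ζ_Q then follows by plugging this upper bound on R(G_Q̄) into the affine relation between ζ_Q and R(G_Q̄) established on the previous slides.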
Remark
• Since ζ_Q is an affine function of R(G_Q̄) whose slope is proportional to c, any looseness in the bound for R(G_Q̄) will be amplified by c in the bound for ζ_Q.
• Hence, the bound on ζ_Q can be tight only for small c.
• This is the case for ζ_Q(x,y) = |f_Q(x) − y|^r since we have c = 1 for r = 1 and c = 3 for r = 2.
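A worked check of these two values, using the fact that |f_Q(x) − y| = 1 − y f_Q(x) = 2 W_Q(x,y) whenever y ∈ {−1, +1}:

```latex
r=1:\quad 2W_Q \;=\; 1 - (1 - 2W_Q) \;\Rightarrow\; |g(1)| = 1,\; c = 1;\\
r=2:\quad (2W_Q)^2 \;=\; 1 - 2(1 - 2W_Q) + (1 - 2W_Q)^2 \;\Rightarrow\; c = 2 + 1 = 3.
```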
Bound Behavior During AdaBoost
• Here H is the set of decision stumps. The output h(x) of a decision stump with threshold t on attribute x is given by h(x) = sgn(x − t).
• If P(h) = 1/|H| for all h ∈ H, then KL(Q‖P) = ln|H| − H(Q), where H(Q) is the entropy of the posterior Q.
• H(Q) generally increases at each boosting round.
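A small sketch of the two quantities used on this slide (stump output and KL(Q‖P) under a uniform prior); the thresholds, weights, and pool size are illustrative values:

```python
from math import log

def stump(x, t):
    """Decision stump on one attribute: h(x) = sgn(x - t), with sgn(0) taken as +1."""
    return 1 if x - t >= 0 else -1

def kl_to_uniform(Q, H_size):
    """KL(Q || P) with P uniform over H: equals ln|H| - H(Q), H(Q) being the entropy of Q."""
    entropy = -sum(q * log(q) for q in Q if q > 0)
    return log(H_size) - entropy

# Illustrative posterior supported on 4 stumps out of a pool of |H| = 100.
Q = [0.4, 0.3, 0.2, 0.1]
print(stump(0.7, t=0.5))            # +1
print(kl_to_uniform(Q, H_size=100)) # ln(100) - H(Q)
```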
Results for the Exponential Loss
• For this loss function, c increases exponentially rapidly with γ, and so does the risk bound.
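A sketch of where the exponential growth comes from, assuming the exponential loss is parameterized as ζ_Q(x,y) = exp(−γ y f_Q(x)) (the exact parameterization is an assumption here):

```latex
\zeta_Q(x,y) \;=\; e^{-\gamma\, y f_Q(x)}
\;=\; \sum_{k=0}^{\infty}\frac{(-\gamma)^k}{k!}\bigl(1 - 2W_Q(x,y)\bigr)^k
\quad\Rightarrow\quad
c \;=\; \sum_{k=1}^{\infty}\frac{\gamma^k}{k!} \;=\; e^{\gamma} - 1.
```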
Results for the Sigmoid Loss
• This loss function is built from tanh, and the Taylor series for tanh(x) converges only for |x| < π/2. We are thus limited to γ < π/2.
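A sketch under the assumption that the sigmoid loss has the form ζ_Q(x,y) = ½(1 − tanh(γ y f_Q(x))); the exact form is an assumption, but the convergence constraint depends only on the tanh series:

```latex
\tanh z \;=\; z - \frac{z^3}{3} + \frac{2z^5}{15} - \frac{17z^7}{315} + \cdots
\qquad\text{(radius of convergence } \pi/2\text{)}.
```

Since |y f_Q(x)| ≤ 1, the argument γ y f_Q(x) stays inside the radius of convergence for every example only when γ < π/2; under the assumed form, the absolute coefficients sum to c = ½ tan(γ), which diverges as γ → π/2.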
Conclusion
• We have obtained PAC-Bayesian risk bounds for any loss function ζ_Q having a convergent Taylor expansion around W_Q = ½.
• The bound is tight only for small c.
• In the AdaBoost experiments, the loss bound is basically parallel to the true loss.