
The probably approximately correct (PAC) learning model




  1. The probably approximately correct (PAC) learning model

  2. PAC learnability
  • A formal, mathematical model of learnability
  • It originates from a paper by Valiant ("A theory of the learnable", 1984)
  • It is very theoretical
  • Only very few of its results are usable in practice
  • It formalizes the concept learning task as follows:
  • X: the space from which the objects (examples) come. E.g. if the objects are described by two continuous features, then X = R^2
  • c is a concept: c ⊆ X, but c can also be interpreted as an X → {0, 1} mapping (each point of X either belongs to c or not)
  • C is the concept class: a set of concepts. In the following we always assume that the concept c we have to learn is a member of C

  3. PAC learnability
  • L denotes the learning algorithm: its goal is to learn a given c. As its output, it returns a hypothesis h from the concept class C. As a help, it has access to a function Ex(c,D), which gives training examples in the form of <x, c(x)> pairs. The examples are random, independent, and follow a fixed (but arbitrary) probability distribution D.
  • Example (see the sketch after this slide):
  • X = R^2 (the 2D plane)
  • C = all rectangles of the plane that are parallel to the axes
  • c: one fixed rectangle
  • h: another given rectangle
  • D: a probability distribution defined over the plane
  • Ex(c,D): it gives positive and negative examples of c
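
A minimal Python sketch of how the oracle Ex(c, D) can be modelled for this rectangle example; the helper names (make_example_oracle, rectangle_concept) and the uniform distribution on [0, 10]^2 are assumptions made purely for illustration.

```python
import random

def make_example_oracle(concept, dist_sampler):
    """Model of Ex(c, D): draw x ~ D and label it with c(x)."""
    def ex():
        x = dist_sampler()           # independent random point from D
        return x, concept(x)         # labeled pair <x, c(x)>
    return ex

def rectangle_concept(x1, y1, x2, y2):
    """c: an axis-parallel rectangle, viewed as an X -> {0, 1} mapping."""
    return lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2

# D: here, the uniform distribution on [0, 10]^2 (any fixed distribution would do)
uniform_D = lambda: (random.uniform(0, 10), random.uniform(0, 10))

Ex = make_example_oracle(rectangle_concept(2, 2, 6, 5), uniform_D)
print(Ex())   # e.g. ((3.1, 4.7), True)
```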

  4. Defining the error
  • After learning, how can we measure the final error of h?
  • We could define it as error(h) = c∆h = (c\h) ∪ (h\c)
  • (the symmetric difference of c and h – the size of the blue area in the figure)
  • It is not good - why?
  • D is the probability of picking a given point of the plane randomly
  • We would like our method to work for any possible distribution D
  • If D is 0 in a given area, we won't get samples from there ⇒ we cannot learn in this area
  • So we cannot guarantee error(h) = c∆h to become 0 for an arbitrary D
  • However, D is the same during testing ⇒ we won't get samples from that area during testing either ⇒ no problem if we couldn't learn there!

  5. Defining the error
  • Solution: let's define the error as c∆h, but weight it by D!
  • That is, instead of the area of c∆h, we calculate the area under D on c∆h
  • "Sparser" areas: they are harder to learn, but they also count less in the error, as the weight (the value of D) is smaller there
  • As D is the probability distribution of picking the points of X, the shaded area under D is the probability of randomly picking a point from c∆h. Because of this, the error of h is defined as:
  • error(h) = P_D(x ∈ c∆h), the probability that a point x drawn according to D falls into c∆h (a Monte Carlo sketch follows this slide)
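
As a rough illustration, error(h) can be estimated by sampling from D and counting disagreements between c and h. This sketch assumes concept, hypothesis and dist_sampler are callables like those in the oracle sketch above.

```python
def estimate_error(concept, hypothesis, dist_sampler, n_draws=100_000):
    """Monte Carlo estimate of error(h) = P_D(x in c∆h):
    the probability mass that D puts on the symmetric difference of c and h."""
    disagreements = 0
    for _ in range(n_draws):
        x = dist_sampler()
        if concept(x) != hypothesis(x):   # x lies in c∆h
            disagreements += 1
    return disagreements / n_draws
```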

  6. PAC learnability - definition
  • Let C be a concept class over X. C is PAC learnable if there exists an algorithm L that satisfies the following: using the samples given by Ex(c,D), for any concept c ∈ C, any distribution D(x) and any constants 0 < ε < 1/2, 0 < δ < 1/2, with probability P ≥ 1-δ it finds a hypothesis h ∈ C for which error(h) ≤ ε.
  • Briefly: with a large probability, L finds a hypothesis with a small error (this is why L is "probably approximately correct").
  • 1st remark: the definition does not help find L, it just lists the conditions L has to satisfy
  • 2nd remark: notice that ε and δ can be arbitrarily close to 0
  • 3rd remark: it should work for any arbitrarily chosen c ∈ C and distribution D(x)

  7. Why "probably approximately"?
  • Why do we need the two thresholds, ε and δ?
  • We have to allow an error of up to ε because the set of examples is always finite, so we cannot guarantee the perfect learning of a concept. The simplest example is when X is infinite, as in this case a finite number of examples will never be able to cover every point of an infinite space.
  • Allowing a large error with probability δ is required because we cannot guarantee the error to become small in every run. For example, it may happen that every randomly picked training sample is the same point – and we cannot learn from just one sample. This is why we can guarantee finding a good hypothesis only with a certain probability.

  8. Example
  • X = R^2 (the training examples are points of the plane)
  • C = the set of rectangles which are parallel to the axes
  • c: one given rectangle
  • h: another fixed rectangle
  • error(h): the area of h∆c weighted with D(x)
  • Ex(c,D): it gives positive and negative examples
  • Let L be the algorithm that always returns as h the smallest rectangle that contains all the positive examples (see the sketch after this slide)!
  • Notice that:
  • h ⊆ c
  • As the number of examples increases, the error can only decrease
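
A minimal sketch of this tightest-fit learner L, assuming the samples come as ((x, y), label) pairs like those returned by the oracle sketch above.

```python
def tightest_rectangle_learner(samples):
    """L: return the smallest axis-parallel rectangle containing all positive examples."""
    positives = [p for p, label in samples if label]
    if not positives:
        return lambda p: False            # no positives seen: the empty hypothesis
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    # Every edge of h is determined by a positive example of c, so h is contained in c.
    return lambda p: x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi

# Usage with the earlier oracle sketch: h = tightest_rectangle_learner([Ex() for _ in range(100)])
```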

  9. Example
  • We prove that by increasing the number of examples the error becomes smaller and smaller with increasing probability; that is, algorithm L learns C in the PAC sense
  • Let 0 < ε, δ ≤ 1/2 be fixed
  • Consider the hypothesis h' obtained as shown in the figure:
  • ε/4 is the probability that a randomly drawn sample falls in T
  • Then 1 - ε/4 is the probability that a randomly drawn sample does NOT fall in T
  • The probability that m randomly drawn samples do NOT fall in T: (1 - ε/4)^m
  • The weighted area of the four bands is < ε (there are overlaps)
  • So if we have already drawn samples from T on each side, then error(h) < ε for the actual hypothesis h

  10. Example
  • So error(h) ≥ ε is possible only if we have not yet drawn a sample from T on at least one of the sides
  • error(h) ≥ ε ⇒ no sample from T1, T2, T3 or T4
  • P(error(h) ≥ ε) ≤ P(no sample from T1, T2, T3 or T4) ≤ 4·(1 - ε/4)^m
  • So we got this: P(error(h) ≥ ε) ≤ 4·(1 - ε/4)^m
  • We want to push this probability under a threshold δ: 4·(1 - ε/4)^m ≤ δ
  • What does the left-hand side look like as a function of m (ε is fixed)?
  • It is an exponential function with a base between 0 and 1
  • It goes to 0 as m goes to infinity
  • So it can be pushed under an arbitrarily small δ by selecting a large enough m

  11. Example – expressing m
  • We already proved that L is a PAC learner, but in addition we can also specify m as a function of ε and δ:
  • We make use of the formula 1 - x ≤ e^(-x) in deriving a threshold: 4·(1 - ε/4)^m ≤ 4·e^(-εm/4) ≤ δ ⇒ m ≥ (4/ε)·ln(4/δ) (see the sketch after this slide)
  • That is, for a given ε and δ, if m is larger than the threshold above, then P(error(h) ≥ ε) ≤ δ
  • Notice that m grows relatively slowly as a function of 1/ε and 1/δ, which might be important for practical usability
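
The threshold derived above can be evaluated directly; a small sketch of the bound m ≥ (4/ε)·ln(4/δ) (the function name is ours).

```python
import math

def rectangle_sample_bound(epsilon, delta):
    """Smallest m guaranteed by m >= (4/eps) * ln(4/delta),
    obtained from 4*(1 - eps/4)**m <= 4*exp(-eps*m/4) <= delta."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

print(rectangle_sample_bound(epsilon=0.1, delta=0.05))   # -> 176 samples suffice
```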

  12. Efficient PAC learning
  • For practical purposes it would be good if m increased slowly as ε and δ decrease
  • "slowly" = polynomially
  • Besides the number of training samples, what else may influence the speed of training?
  • The processing cost of the training samples usually depends on the number of features, n
  • During its operation, the algorithm modifies its actual hypothesis h after receiving each training sample. Because of this, the cost of running L largely depends on the complexity of the hypothesis, and so on the complexity of the concept c (as we represent c with h)
  • In the following we define the complexity of the concept c, or more precisely, the complexity of the representation of c

  13. Concept vs. Representation
  • What is the difference between a concept and its representation?
  • Concept: an abstract (e.g. mathematical) notion
  • Representation: its concrete representation in a computer program
  • 1st example: the concept is a logical formula
  • It can be represented by a disjunctive normal form
  • But also by using only NAND operations
  • 2nd example: the concept is a polygon on the plane
  • It can be represented by the series of its vertices
  • But also by the equations of its edges
  • So by "representation" we mean a concrete algorithmic solution
  • In the following we assume that:
  • L operates over a representation class H (it selects h from H)
  • H is large enough in the sense that it is able to represent any c ∈ C

  14. Efficient PAC learning
  • We define the size of a representation σ ∈ H by some complexity measure, for example the number of bits required for its storage.
  • We define the size of a concept c ∈ C as the size of the smallest representation which is able to represent it: size(c) = min {size(σ) : σ → c}, where σ → c means that σ is a possible representation of c
  • Efficient PAC learning: Let C denote a concept class, and H a representation class that is able to represent each element of C. C is efficiently learnable over H if C is PAC-learnable by some algorithm L that operates over H, and L runs in polynomial time in n, size(c), 1/ε and 1/δ for any c ∈ C.
  • Remark: besides efficient training, for efficient applicability it is also required that the evaluation of h(x) runs in polynomial time in n and size(h) for any point x – but usually this is much easier to satisfy in practice

  15. Example
  • 3-term disjunctive normal form: T1 ∨ T2 ∨ T3, where each Ti is a conjunction of literals (variables with or without negation)
  • The representation class of 3-term disjunctive normal forms is not efficiently PAC-learnable.
  • That is, if we want to learn 3-term disjunctive normal forms and we represent them by themselves, then their efficient PAC learning is not possible
  • However, the concept class of 3-term disjunctive normal forms is efficiently PAC-learnable!
  • That is, the concept class itself is PAC-learnable, but for this we must use a larger representation class which is easier to manipulate. With a cleverly chosen representation we can keep the running time of the learning algorithm within the polynomial limit.

  16. Occam learning
  • Earlier we mentioned Occam's razor as a general machine learning heuristic: a simpler explanation usually generalizes better. Now we define it more formally within the framework of PAC learning. By the simplicity of a hypothesis we will mean its size. As we will see, seeking a small hypothesis actually means that the algorithm has to compress the class labels of the training samples
  • Definition: Let α ≥ 0 and 0 ≤ β < 1 be constants. An algorithm L (that works over hypothesis space H) is an (α, β)-Occam algorithm over the concept class C if, given m samples from c ∈ C, it is able to find a hypothesis h ∈ H for which:
  • h is consistent with the samples
  • size(h) ≤ (n·size(c))^α · m^β (that is, size(h) increases slowly as a function of size(c) and m)
  • L is an efficient (α, β)-Occam algorithm if its running time is a polynomial function of n, m and size(c)

  17. Occam learning
  • Why do we say that an Occam algorithm compresses the samples?
  • For a given task, n and size(c) are fixed, and in practice m >> n
  • So we can say that in practice the above formula reduces to size(h) ≤ const·m^β, where β < 1
  • Hypothesis h is consistent with the m examples, which means that it can exactly return the label of each of the m examples. We have 0-1 labels, which corresponds to m bits of information
  • We defined size(h) as the number of bits required for storing h, so size(h) ≤ const·m^β means that the learner stored m bits of information in h using only on the order of m^β bits (remember that β < 1), so it stored the labels in a compressed form (a small numeric illustration follows this slide)
  • Theorem: an efficient Occam learner is at the same time an efficient PAC learner
  • Which means that compression (according to a certain definition) guarantees learning (according to a certain definition)!
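
A one-line numeric illustration of the compression argument; β = 0.7 is chosen arbitrarily, the point is only that the size budget m^β grows much more slowly than the m label bits it has to reproduce.

```python
# Hypothesis-size budget m**beta versus the m label bits it must reproduce (beta = 0.7)
for m in (100, 10_000, 1_000_000):
    print(f"m = {m:>9}   budget ~ {round(m ** 0.7):>6} bits")
```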

  18. Sample complexity
  • By sample complexity we mean the number of training samples required for PAC learning. In practice it would be very useful if we could tell in advance the number of training samples required to satisfy a given pair of ε and δ thresholds.
  • Sample complexity in the case of a finite hypothesis space:
  • Theorem: if the hypothesis space H is finite, then any concept c ∈ H is PAC-learnable by a consistent learner, and the number of required training samples is m ≥ (1/ε)·(ln|H| + ln(1/δ)), where |H| is the number of hypotheses in H (a numeric sketch follows this slide)
  • Proof: consistent learner ⇒ it can only select a consistent hypothesis from H. Thus, the basis of the proof is that the error of all consistent hypotheses can be pushed under the threshold ε
  • This is the simplest possible proof, but the bound it gives will not be tight (that is, the m we obtain won't be the smallest possible bound)
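
The finite-hypothesis-space bound from the theorem can be evaluated directly; a small sketch (the function name is ours).

```python
import math

def finite_H_sample_bound(H_size, epsilon, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)): samples sufficient for a
    consistent learner over a finite hypothesis space with |H| = H_size."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

print(finite_H_sample_bound(H_size=2**16, epsilon=0.1, delta=0.05))  # -> 141
```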

  19. Proof
  • Let's assume that there are k hypotheses in H for which error > ε.
  • For such a hypothesis h', the probability of consistency with the m samples is P(h' is consistent with the m samples) ≤ (1 - ε)^m
  • For error(h) > ε, the hypothesis h selected for output must be one of the above k hypotheses:
  • P(error(h) > ε) ≤ P(any of the k hypotheses is consistent with the m samples) ≤ k·(1 - ε)^m ≤ |H|·(1 - ε)^m ≤ |H|·e^(-εm)
  • where we used the inequalities k ≤ |H| and (1 - x) ≤ e^(-x).
  • For a given ε, |H|·e^(-εm) is an exponentially decreasing function of m
  • So P(error(h) > ε) can be pushed below any arbitrarily small δ: |H|·e^(-εm) ≤ δ ⇒ m ≥ (1/ε)·(ln|H| + ln(1/δ))

  20. Example
  • Is the threshold obtained from the proof practically usable?
  • Let X be a finite space of n binary features. If we want to use a learner that contains all the possible logical formulas over X, then the size of the hypothesis space is |H| = 2^(2^n) (the number of all possible logical formulas over n variables)
  • Let's substitute this into the formula: m ≥ (1/ε)·(2^n·ln 2 + ln(1/δ)) (see the sketch after this slide)
  • This is exponential in n, so in practice it usually gives an unusable (too large) number for the minimal value of m
  • There are more complex variants of the above proof which achieve a smaller bound for m, but these are also not really usable in practice
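
Plugging the numbers in makes the blow-up concrete; this reuses the finite-|H| bound from the previous sketch with |H| = 2^(2^n).

```python
import math

def all_boolean_functions_bound(n, epsilon, delta):
    """Finite-H bound with |H| = 2**(2**n): m >= (1/eps) * (2**n * ln 2 + ln(1/delta))."""
    return math.ceil(((2 ** n) * math.log(2) + math.log(1.0 / delta)) / epsilon)

print(all_boolean_functions_bound(n=20, epsilon=0.1, delta=0.05))  # ~7.3 million samples
```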

  21. The case of infinite hypothesis spaces
  • For the case of finite H we obtained the threshold m ≥ (1/ε)·(ln|H| + ln(1/δ))
  • Unfortunately, this formula is not usable when |H| is infinite
  • In this case we will use the Vapnik-Chervonenkis dimension, but before that we have to introduce the concept of "shattering":
  • Let X be an object space and H a hypothesis space over X (both may be infinite!). Let S be an arbitrary finite subset of X. We say that H shatters S if for every possible {+,-} labeling of S there exists an h ∈ H which partitions the points of S according to that labeling. (A brute-force check of this definition is sketched after this slide.)
  • 1st remark: we try to say something about an infinite set by saying something about all of its finite subsets
  • 2nd remark: this measures the representation ability of H, because if H cannot shatter some finite set S, then by extending S we can define a concept over X which is not learnable by H
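
A brute-force illustration of shattering for a finite collection of hypotheses: we enumerate the labelings they realize on S and check whether all 2^|S| of them occur. The threshold classifiers in the demo are an assumption made only for illustration.

```python
def shatters(hypotheses, S):
    """Return True if the given hypotheses realize every +/- labeling of the finite set S."""
    realized = {tuple(h(x) for x in S) for h in hypotheses}
    return len(realized) == 2 ** len(S)

# Demo: threshold classifiers h_t(x) = (x >= t) on the real line
thresholds = [(lambda x, t=t: x >= t) for t in (0.5, 1.5, 2.5)]
print(shatters(thresholds, [1.0]))        # True: a single point gets both labels
print(shatters(thresholds, [1.0, 2.0]))   # False: the labeling (+, -) cannot be realized
```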

  22. The Vapnik-Chervonenkis dimension
  • The Vapnik-Chervonenkis dimension of H is d if there exists an S with |S| = d which H can shatter, but H cannot shatter any S with |S| = d+1 (if H can shatter arbitrarily large finite sets S, then VCD = ∞)
  • Theorem: let C be a concept class and H a representation class for which VCD(H) = d. Let L be a learning algorithm that learns c ∈ C from a set S of training samples with |S| = m, and outputs a hypothesis h ∈ H which is consistent with S. The learning of C over H is PAC learning if m ≥ (c0/ε)·(d·ln(1/ε) + ln(1/δ)) (where c0 is a proper constant; see the sketch after this slide)
  • Remark: in contrast to the finite case, the bound obtained here is tight (that is, m samples are not only sufficient, but in certain cases they are necessary as well)
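
The VC bound from the theorem, evaluated numerically; c0 is the unspecified constant from the theorem, set to 1 here purely for illustration.

```python
import math

def vc_sample_bound(d, epsilon, delta, c0=1.0):
    """m >= (c0/eps) * (d * ln(1/eps) + ln(1/delta)) for a consistent learner
    over a hypothesis space with VC dimension d (c0 is a placeholder constant)."""
    return math.ceil((c0 / epsilon) * (d * math.log(1.0 / epsilon) + math.log(1.0 / delta)))

print(vc_sample_bound(d=4, epsilon=0.1, delta=0.05))  # axis-parallel rectangles, VCD = 4 -> 123
```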

  23. The Vapnik-Chervonenkis dimension
  • Let's compare the bounds obtained for the finite and infinite cases:
  • Finite case: m ≥ (1/ε)·(ln|H| + ln(1/δ))
  • Infinite case: m ≥ (c0/ε)·(d·ln(1/ε) + ln(1/δ))
  • The two formulas look quite similar, but in the infinite case the role of |H| is taken over by the Vapnik-Chervonenkis dimension
  • Both formulas increase relatively slowly as a function of 1/ε and 1/δ, so in this sense they are not bad bounds…

  24. Examples of VC-dimension
  • Finite intervals over the line: VCD = 2 (a brute-force check is sketched after this slide)
  • VCD ≥ 2, as these two points can be shattered (= separated for every label configuration)
  • VCD < 3, as no 3 points can be shattered
  • Separating the two classes by lines on the plane: VCD = 3 (in d-dimensional space: VCD = d+1)
  • VCD ≥ 3, as these 3 points can be shattered (all labeling configurations should be tried!)
  • VCD < 4, as no 4 points can be shattered (all point arrangements should be tried!)
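
A brute-force demonstration of the interval case, assuming it is enough to try intervals whose endpoints lie on (or just outside) the sample points; the helper name is ours.

```python
def can_shatter_with_intervals(S):
    """Check whether closed intervals [a, b] on the line realize every labeling of S.
    It suffices to try endpoints taken from S itself, plus one point outside S."""
    endpoints = sorted(S) + [min(S) - 1.0]          # the extra point yields the all-negative labeling
    candidates = [(a, b) for a in endpoints for b in endpoints]
    realized = {tuple(a <= x <= b for x in S) for a, b in candidates}
    return len(realized) == 2 ** len(S)

print(can_shatter_with_intervals([1.0, 2.0]))        # True  -> VCD >= 2
print(can_shatter_with_intervals([1.0, 2.0, 3.0]))   # False -> the labeling (+, -, +) fails, so VCD < 3
```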

  25. Examples of VC-dimension
  • Axis-aligned rectangles on the plane: VCD = 4
  • VCD ≥ 4, as these 4 points can be shattered
  • VCD < 5, as no 5 points can be shattered
  • Convex polygons on the plane: VCD = 2d+1 (d is the number of vertices) (the book proves only one of the directions)
  • Conjunctions of literals over {0,1}^n: VCD = n (see Mitchell's book; only one direction is proved)
