
batch online learning

Adam Kalai and Sham Kakade, Toyota Technological Institute (TTI). Batch learning vs. transductive online learning; i.i.d. [Littlestone89]; agnostic model [Kearns, Schapire, Sellie94] with examples (x_1,y_1),…,(x_n,y_n) ∈ X × {–,+}.

Presentation Transcript


  1. batch online learning. Annotations on the title slide: transductive; i.i.d.; [Littlestone89]. Adam Kalai and Sham Kakade, Toyota Technological Institute (TTI).

  2. Batch learning vs. online learning. Agnostic model [Kearns, Schapire, Sellie94]. Batch learning: an arbitrary distribution over X × {–,+} generates (x_1,y_1),…,(x_n,y_n) ∈ X × {–,+}; Alg. H takes (x_1,y_1),…,(x_n,y_n) and outputs h ∈ F, where F is a family of functions (e.g. halfspaces). ERM = “best on data”. Def.: H learns F if, for every such distribution, E[err(h)] ≤ min_{f∈F} err(f) + n^{-c} and H runs in time poly(n). Online learning (build, step 1): point x_1, hypothesis h_1. [Figure: +/– labeled points in X with a separating hypothesis h.]

  3. Batch learning vs. online learning (build, step 2). Batch learning: (x_1,y_1),…,(x_n,y_n) ∈ X × {–,+} drawn from an arbitrary distribution over X × {–,+}; ERM = “best on data”; family of functions F (e.g. halfspaces). Online learning: points x_1, x_2; hypothesis h_2. [Figure: +/– labeled points in X with a separating hypothesis h.]

  4. Batch learning vs. online learning (build, step 3). Batch learning: (x_1,y_1),…,(x_n,y_n) ∈ X × {–,+} drawn from an arbitrary distribution over X × {–,+}; ERM = “best on data”; family of functions F (e.g. halfspaces). Online learning: points x_1, x_2, x_3; hypothesis h_3. Goal: err(alg) ≤ min_{f∈F} err(f) + [term missing]. [Figure: +/– labeled points in X with a separating hypothesis h.]

  5. Batch learning vs. transductive online learning [Ben-David, Kushilevitz, Mansour95]. Batch learning: (x_1,y_1),…,(x_n,y_n) ∈ X × {–,+} drawn from an arbitrary distribution over X × {–,+}; ERM = “best on data”; family of functions F (e.g. halfspaces). Analogous definition for the transductive online setting: Alg. H receives the unlabeled set {x_1,x_2,…,x_n} and the past examples (x_1,y_1),…,(x_{i-1},y_{i-1}), and outputs h_i ∈ F. H learns F if, ∀(x_1,y_1),…,(x_n,y_n): E[err(H)] ≤ min_{f∈F} err(f) + n^{-c} and H runs in time poly(n). “Proper” learning: output h^{(i)} ∈ F. The two settings are marked “equivalent”. Goal: err(alg) ≤ min_{f∈F} err(f) + [term missing]. [Figure: +/– labeled points in X on the batch side; unlabeled points x_1, x_2, x_3 shown as dots with hypotheses h_1, h_2, h_3 on the online side.]

  6. Our results. HERM = Hallucination + ERM.
     Theorem 1. In the online transductive setting, HERM requires one ERM computation per sample.
     Theorem 2. These are equivalent for proper learning:
     • F is agnostically learnable
     • ERM agnostically learns F (ERM can be done efficiently and VC(F) is finite)
     • F is online transductively learnable
     • HERM online transductively learns F

  7. Online ERM algorithm (sucks). Choose h_i ∈ F with minimal errors on (x_1,y_1),…,(x_{i-1},y_{i-1}): h_i = argmin_{f∈F} |{ j<i : f(x_j) ≠ y_j }|. Example: X = { (0,0) }, F = {–,+}. Inputs: x_1 = (0,0), y_1 = –; x_2 = (0,0), y_2 = +; x_3 = (0,0), y_3 = –; x_4 = (0,0), y_4 = +; … ERM outputs: h_1(x) = +, h_2(x) = –, h_3(x) = +, h_4(x) = –, … so every prediction is wrong.
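The failure in this example can be reproduced in a few lines. The sketch below is only an illustration: the function name `online_erm`, the string labels, and the choice to break ties toward `+` (so that h_1 = +, as on the slide) are assumptions, not part of the original deck.

```python
def online_erm(labels, tie="+"):
    """Online ERM over the two constant hypotheses '+' and '-' on a single point."""
    counts = {"+": 0, "-": 0}  # how often each label has been seen so far
    predictions = []
    for y in labels:
        # The constant with the fewest past errors is the majority label so far.
        if counts["+"] == counts["-"]:
            h = tie  # assumption: ties go to '+'
        else:
            h = max(counts, key=counts.get)
        predictions.append(h)
        counts[y] += 1  # the true label is revealed after predicting
    return predictions

labels = ["-", "+"] * 4  # the alternating sequence from the slide
preds = online_erm(labels)
erm_errors = sum(p != y for p, y in zip(preds, labels))
best_fixed = min(sum(y != f for y in labels) for f in "+-")
print(erm_errors, best_fixed)
```

On this sequence ERM is wrong on every round, while either constant hypothesis is wrong on only half of them.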

  8. Online ERM algorithm. Choose h_i ∈ F with minimal errors on (x_1,y_1),…,(x_{i-1},y_{i-1}): h_i = argmin_{f∈F} |{ j<i : f(x_j) ≠ y_j }|. Online “stability” lemma [KVempala01]: err(ERM) ≤ min_{f∈F} err(f) + P_{i∈{1,…,n}}[h_i ≠ h_{i+1}]. Proof by induction on n = #examples: easy!
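The stability lemma can be sanity-checked numerically in its count form: ERM mistakes ≤ mistakes of the best fixed f + number of rounds where h_i ≠ h_{i+1}. The sketch below assumes a small hypothetical class (integer thresholds), brute-force ERM, and ties broken by list order; none of these choices come from the slides.

```python
import random

def erm(hyps, data):
    """Index of the first hypothesis with the fewest errors on data."""
    return min(range(len(hyps)),
               key=lambda k: sum(hyps[k](x) != y for x, y in data))

def stability_terms(hyps, data):
    """Return (ERM mistakes, best fixed mistakes, number of switches)."""
    n = len(data)
    # chosen[i] is the ERM index after the first i examples, i.e. h_{i+1}.
    chosen = [erm(hyps, data[:i]) for i in range(n + 1)]
    erm_mistakes = sum(hyps[chosen[i]](x) != y for i, (x, y) in enumerate(data))
    best_mistakes = min(sum(f(x) != y for x, y in data) for f in hyps)
    switches = sum(chosen[i] != chosen[i + 1] for i in range(n))
    return erm_mistakes, best_mistakes, switches

# Hypothetical class: threshold functions on the integers 0..5.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(7)]
rng = random.Random(0)
sample = [(rng.randrange(6), rng.choice([-1, 1])) for _ in range(40)]
erm_m, best_m, sw = stability_terms(thresholds, sample)
print(erm_m, "<=", best_m, "+", sw)
```

Dividing all three counts by n gives the error-rate form stated on the slide.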

  9. Online HERM algorithm. Inputs: the unlabeled set {x_1,x_2,…,x_n} and an int R. For each x in the set, hallucinate r_x^+ copies of (x,+) & r_x^– copies of (x,–), with each count random from {1,2,…,R}. Choose h_i ∈ F that minimizes errors on hallucinated data + (x_1,y_1),…,(x_{i-1},y_{i-1}). Stability: ∀i, P_{r_{x_i}^+, r_{x_i}^–}[h_i ≠ h_{i+1}] ≤ R^{-1}. Illustration: r_{x_i}^+ copies (x_i,+),(x_i,+),…,(x_i,+). Name shown on slide: James Hannan.
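A minimal sketch of the hallucination step, under stated assumptions: the two copy counts per point are drawn independently and uniformly from {1,…,R}, hallucination is done once per distinct x, the class is a finite list searched by brute force (standing in for an ERM oracle), and ties go to the first hypothesis in the list. The names `herm`, `plus`, and `minus` are hypothetical.

```python
import random

def herm(hyps, xs, labels, R, rng):
    """Count online mistakes when ERM is run on hallucinated data plus past examples."""
    # Hallucinate once, up front, for every distinct unlabeled point.
    fake = []
    for x in sorted(set(xs)):
        fake += [(x, +1)] * rng.randint(1, R)  # copies of (x, +)
        fake += [(x, -1)] * rng.randint(1, R)  # copies of (x, -)
    seen, mistakes = [], 0
    for x, y in zip(xs, labels):
        train = fake + seen
        # Brute-force ERM on hallucinated + past data (first minimizer wins ties).
        h = min(hyps, key=lambda f: sum(f(u) != v for u, v in train))
        mistakes += h(x) != y
        seen.append((x, y))
    return mistakes

def plus(x):
    return +1

def minus(x):
    return -1

n = 200
xs = [0] * n            # a single repeated point, as in the ERM counterexample
labels = [-1, +1] * (n // 2)
results = [herm([plus, minus], xs, labels, R=10, rng=random.Random(s))
           for s in range(50)]
print(min(results), max(results))
```

On the alternating sequence, the hallucinated counts fix the prediction in most runs, giving n/2 mistakes; only when the two counts coincide does the sketch fall back to the every-round failure of plain ERM.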

  10. Online HERM algorithm. Inputs: the unlabeled set {x_1,x_2,…,x_n} and an int R. For each x in the set, hallucinate r_x^+ copies of (x,+) & r_x^– copies of (x,–), with each count random from {1,2,…,R}. Choose h_i ∈ F that minimizes errors on hallucinated data + (x_1,y_1),…,(x_{i-1},y_{i-1}). Stability: ∀i, P_{r_{x_i}^+, r_{x_i}^–}[h_i ≠ h_{i+1}] ≤ R^{-1}. Theorem 1. For R = n^{1/4}: [bound missing]. It requires one ERM computation per example. Labels on the slide: Online “stability” lemma; Hallucination cost.

  11. Being more adaptive (shifting bounds). (x_1,y_1),…,(x_i,y_i),…,(x_{i+W},y_{i+W}),…,(x_n,y_n), with a window marked from (x_i,y_i) to (x_{i+W},y_{i+W}).

  12. Related work
     • Inequivalence of batch and online learning in noiseless setting [Blum90, Balcan06]
       • ERM black box is noiseless
       • For computational reasons!
     • Inefficient alg. for online trans. learning [Ben-David, Kushilevitz, Mansour95]:
       • List all ≤ (n+1)^{VC(F)} labelings (Sauer’s lemma)
       • Run weighted majority [Littlestone, Warmuth92]
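The inefficient algorithm in the last bullet can be sketched as follows: enumerate the distinct labelings the class induces on the unlabeled points, then treat each labeling as an expert. This is an illustration under assumptions, not the cited construction: the class is a hypothetical finite set of thresholds (VC dimension 1), and the deterministic weighted majority variant with penalty `beta = 0.5` is used on noiseless labels.

```python
def labelings(hyps, xs):
    """Distinct label vectors that the class induces on the unlabeled points xs."""
    return sorted({tuple(f(x) for x in xs) for f in hyps})

def weighted_majority(experts, labels, beta=0.5):
    """Deterministic weighted majority; each expert is a fixed label vector."""
    w = [1.0] * len(experts)
    mistakes = 0
    for i, y in enumerate(labels):
        vote = sum(wk * e[i] for wk, e in zip(w, experts))
        pred = 1 if vote >= 0 else -1
        mistakes += pred != y
        # Multiply the weight of every expert that was wrong on round i by beta.
        w = [wk * beta if e[i] != y else wk for wk, e in zip(w, experts)]
    return mistakes

xs = [7, 2, 9, 0, 4, 5, 1, 8, 3, 6]  # n = 10 unlabeled points
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(-5, 20)]
experts = labelings(thresholds, xs)
target = [1 if x >= 4 else -1 for x in xs]  # noiseless labels from one threshold
wm_mistakes = weighted_majority(experts, target)
print(len(experts), wm_mistakes)
```

For thresholds the enumeration yields n + 1 labelings, matching the (n+1)^{VC(F)} bound with VC(F) = 1. The number of experts grows with that bound, which is what makes the approach inefficient for richer classes.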

  13. Conclusions
     • Alg. for removing the i.i.d. assumption, efficiently, using unlabeled data
     • Interesting way to use unlabeled data online, reminiscent of bootstrap/bagging
     • Adaptive version: can do well on every window
     • Find “right” algorithm/analysis
