Loss-based Learning with Weak Supervision M. Pawan Kumar
Computer Vision Data
[Plot: dataset size (log scale) vs. annotation information]
• Segmentation: ~ 2000
• Bounding Box: ~ 1 M
• Image-Level ("Chair", "Car"): > 14 M
• Noisy Label: > 6 B
Computer Vision Data
• Detailed annotation is expensive
• Sometimes annotation is impossible
• Desired annotation keeps changing
Learn with missing information (latent variables)
Outline • Two Types of Problems • Part I – Annotation Mismatch • Part II – Output Mismatch
Annotation Mismatch
Action Classification: Input x, Annotation y, Latent h; y = "jumping"
Desired output during test time is y
Mismatch between desired and available annotations
Exact value of latent variable is not "important"
Output Mismatch
Action Classification: Input x, Annotation y, Latent h; y = "jumping"
Action Detection: Input x, Annotation y, Latent h; y = "jumping"
Desired output during test time is (y,h)
Mismatch between output and available annotations
Exact value of latent variable is important
Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Extensions Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009
Weakly Supervised Data
Input x, Output y ∈ {-1,+1}, Hidden h; y = +1
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector:
Ψ(x,+1,h) = [Φ(x,h); 0]
Ψ(x,-1,h) = [0; Φ(x,h)]
Score f : Ψ(x,y,h) → (-∞, +∞)
Optimize score over all possible y and h
Latent SVM
Scoring function: wTΨ(x,y,h), with parameters w
Prediction: y(w),h(w) = argmaxy,h wTΨ(x,y,h)
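As a concrete illustration of the joint feature map and the prediction rule, here is a minimal Python sketch. It assumes a binary output, a user-supplied feature function phi(x, h), and a finite latent space that can be enumerated; none of these names come from the tutorial's code.

```python
import numpy as np

def joint_feature(x, y, h, phi):
    """Psi(x, y, h) for binary y in {-1, +1}: the per-(x, h) feature phi(x, h)
    is stacked into the +1 slot or the -1 slot of a vector twice its length."""
    f = phi(x, h)
    psi = np.zeros(2 * len(f))
    if y == +1:
        psi[:len(f)] = f       # Psi(x, +1, h) = [phi(x, h); 0]
    else:
        psi[len(f):] = f       # Psi(x, -1, h) = [0; phi(x, h)]
    return psi

def predict(w, x, phi, latent_space):
    """Latent SVM prediction: maximise w^T Psi(x, y, h) jointly over y and h."""
    candidates = ((y, h) for y in (+1, -1) for h in latent_space)
    return max(candidates, key=lambda yh: w @ joint_feature(x, yh[0], yh[1], phi))
```

In structured problems the argmax over (y,h) is computed by a task-specific inference routine rather than by exhaustive enumeration; the enumeration above is only for illustration.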
Learning Latent SVM
Training data {(xi,yi), i = 1,2,…,n}
Empirical risk minimization: minw Σi Δ(yi, yi(w))
No restriction on the loss function
Annotation mismatch
Learning Latent SVM
Empirical risk minimization: minw Σi Δ(yi, yi(w))
Non-convex; parameters cannot be regularized
Find a regularization-sensitive upper bound
Learning Latent SVM (yi, yi(w)) • wT(xi,yi(w),hi(w)) + • -wT(xi,yi(w),hi(w))
Learning Latent SVM (yi, yi(w)) • wT(xi,yi(w),hi(w)) + • -maxhiwT(xi,yi,hi) y(w),h(w) = argmaxy,hwTΨ(x,y,h)
Learning Latent SVM
minw ||w||2 + C Σi ξi
s.t. maxy,h [wTΨ(xi,y,h) + Δ(yi,y)] - maxhi wTΨ(xi,yi,hi) ≤ ξi
Parameters can be regularized
Is this also convex? Convex - Convex: a difference of convex (DC) program
Recap
Scoring function: wTΨ(x,y,h)
Prediction: y(w),h(w) = argmaxy,h wTΨ(x,y,h)
Learning: minw ||w||2 + C Σi ξi
s.t. wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, for all y, h
Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Extensions
Learning Latent SVM
minw ||w||2 + C Σi ξi
s.t. maxy,h [wTΨ(xi,y,h) + Δ(yi,y)] - maxhi wTΨ(xi,yi,hi) ≤ ξi
Difference of convex (DC) program
Concave-Convex Procedure
maxy,h [wTΨ(xi,y,h) + Δ(yi,y)]  (convex)  +  ( - maxhi wTΨ(xi,yi,hi) )  (concave)
• Linearly upper-bound the concave part
• Optimize the convex upper bound
• Repeat until convergence
Linear upper bound?
Linear Upper Bound
Concave part: - maxhi wTΨ(xi,yi,hi)
Current estimate = wt
hi* = argmaxhi wtTΨ(xi,yi,hi)
- wTΨ(xi,yi,hi*) ≥ - maxhi wTΨ(xi,yi,hi)
CCCP for Latent SVM
Start with an initial estimate w0
Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Repeat until convergence
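A minimal Python sketch of this alternation, assuming the latent space is small enough to enumerate and that a structural-SVM solver is available. The names cccp_latent_svm, solve_ssvm, joint_feature and loss are placeholders for illustration, not the tutorial's actual code.

```python
import numpy as np

def cccp_latent_svm(data, joint_feature, loss, latent_space, solve_ssvm,
                    C=1.0, eps=1e-4, max_iter=50):
    """CCCP for latent SVM (sketch).

    data:          list of (x_i, y_i) pairs
    joint_feature: Psi(x, y, h) -> np.ndarray
    loss:          Delta(y_true, y) -> float
    latent_space:  list of candidate latent values H
    solve_ssvm:    placeholder for a structural-SVM solver returning an
                   eps-optimal (w, objective) for
                       min ||w||^2 + C * sum_i xi_i
                       s.t. w^T Psi(x_i, y_i, h_i*) - w^T Psi(x_i, y, h)
                            >= Delta(y_i, y) - xi_i   for all y, h
    """
    dim = len(joint_feature(data[0][0], data[0][1], latent_space[0]))
    w = np.zeros(dim)                       # initial estimate w_0
    prev_obj = np.inf
    for _ in range(max_iter):
        # Step 1: impute the latent variables with the current parameters
        imputed = [max(latent_space,
                       key=lambda h: w @ joint_feature(x, y, h))
                   for (x, y) in data]
        # Step 2: solve the convex upper bound (a structural SVM) to eps accuracy
        w, obj = solve_ssvm(data, imputed, joint_feature, loss, C, tol=eps)
        # Repeat until the objective stops decreasing
        if prev_obj - obj < eps:
            break
        prev_obj = obj
    return w
```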
Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Extensions
Action Classification
Train: input xi, output yi; Test: input x, output y = "Using Computer"
Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking
PASCAL VOC 2011, 80/20 Train/Test Split, 5 Folds
Setup
• 0-1 loss function
• Poselet-based feature vector
• 4 seeds for random initialization
• Code + Data
• Train/Test scripts with hyperparameter settings
http://www.centrale-ponts.fr/tutorials/cvpr2013/
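To make the first two bullets concrete, here is a hedged sketch of the 0-1 loss and of one plausible way the random seeds could enter the pipeline: the random imputation replacing the first latent-variable update in the CCCP sketch above. The helper names are hypothetical and not taken from the released scripts.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 1 if the prediction differs from the annotation, else 0."""
    return float(y_true != y_pred)

def random_latent_init(data, latent_space, seed):
    """One random initialization: impute each training sample's latent variable
    uniformly at random. Repeating this for seeds 0..3 would correspond to the
    '4 seeds for random initialization' on the slide (an assumption)."""
    rng = np.random.RandomState(seed)
    return [latent_space[rng.randint(len(latent_space))] for _ in data]
```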
Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Annealing the Tolerance • Annealing the Regularization • Self-Paced Learning • Choice of Loss Function • Extensions
Start with an initial estimate w0
Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Repeat until convergence
Overfitting in initial iterations
Start with an initial estimate w0
Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε'-optimal solution of
min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Anneal the tolerance: ε' ← ε'/K, until ε' = ε
Repeat until convergence
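A sketch of how the annealed tolerance might wrap the CCCP loop from before; the factor K and the stopping rule ε' = ε come from the slide, while the function names and the inner solver are the same placeholders assumed earlier.

```python
import numpy as np

def cccp_annealed_tolerance(data, joint_feature, loss, latent_space, solve_ssvm,
                            C=1.0, eps=1e-4, eps_start=1e-1, K=2.0, max_iter=50):
    """CCCP in which the inner structural SVM is solved only to a loose
    tolerance eps' at first; eps' is divided by K after every iteration
    until it reaches eps, so the early (unreliable) latent imputations
    are not fitted too closely."""
    dim = len(joint_feature(data[0][0], data[0][1], latent_space[0]))
    w = np.zeros(dim)                 # initial estimate w_0
    eps_prime = eps_start
    for _ in range(max_iter):
        # Impute latent variables with the current parameters
        imputed = [max(latent_space,
                       key=lambda h: w @ joint_feature(x, y, h))
                   for (x, y) in data]
        # Solve the convex upper bound only to eps'-optimality
        w, _ = solve_ssvm(data, imputed, joint_feature, loss, C, tol=eps_prime)
        # Anneal the tolerance: eps' <- eps'/K, stopping at eps
        eps_prime = max(eps, eps_prime / K)
    return w
```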