
Some Useful Machine Learning Tools


Presentation Transcript


  1. Some Useful Machine Learning Tools M. Pawan Kumar, École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

  2. Outline • Part I: Supervised Learning • Part II: Weakly Supervised Learning

  3. Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine

  4. Image Classification Is this an urban or rural area? Input: x Output: y ∈ {-1,+1}

  5. Image Classification Is this scan healthy or unhealthy? Input: x Output: y ∈ {-1,+1}

  6. Image Classification Which city is this? Input: x Output: y ∈ {1,2,…,C}

  7. Image Classification What type of tumor does this scan contain? Input: x Output: y ∈ {1,2,…,C}

  8. Object Detection Where is the object in the image? Input: x Output: y ∈ {Pixels}

  9. Object Detection Where is the rupture in the scan? Input: x Output: y ∈ {Pixels}

  10. Segmentation What is the semantic class of each pixel (e.g. sky, tree, car, road, grass)? Input: x Output: y ∈ {1,2,…,C}^|Pixels|

  11. Segmentation What is the muscle group of each pixel? Input: x Output: y ∈ {1,2,…,C}^|Pixels|

  12. A Simplified View of the Pipeline Input x → Extract Features Φ(x) (e.g. http://deeplearning.net) → Compute Scores f(Φ(x),y) → Prediction y(f): max_y f(Φ(x),y). The score function f is learned.
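A minimal sketch of this pipeline in Python, assuming a linear multiclass setup with one weight block per class; extract_features and the parameter values are illustrative placeholders, not something specified on the slides:

```python
import numpy as np

def extract_features(x):
    """Placeholder feature extractor Phi(x); real systems would use learned features."""
    return np.asarray(x, dtype=float)

def score(theta, phi, y, num_classes):
    """Score f(Phi(x), y), here a linear model with one weight block per class."""
    blocks = theta.reshape(num_classes, -1)
    return float(blocks[y] @ phi)

def predict(theta, x, num_classes):
    """Prediction y(f) = argmax_y f(Phi(x), y)."""
    phi = extract_features(x)
    scores = [score(theta, phi, y, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))

# Example: 3-dimensional features, 4 classes, random (untrained) weights.
theta = np.random.default_rng(0).normal(size=4 * 3)
print(predict(theta, [0.2, -1.0, 0.7], num_classes=4))
```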

  13. Learning Objective Data distribution P(x,y) (the distribution is unknown). Measure of prediction quality: f* = argmin_f E_{P(x,y)}[Error(y(f), y)] (expectation over the data distribution; y(f) is the prediction, y the ground truth).

  14. Learning Objective Training data {(x_i,y_i), i = 1,2,…,n}. Measure of prediction quality: f* = argmin_f E_{P(x,y)}[Error(y(f), y)] (expectation over the data distribution; y(f) is the prediction, y the ground truth).

  15. Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} (finite samples). Measure of prediction quality: f* = argmin_f Σ_i Error(y_i(f), y_i) (expectation over the empirical distribution; y_i(f) is the prediction, y_i the ground truth).

  16. Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} (finite samples). f* = argmin_f Σ_i Error(y_i(f), y_i) + λ R(f), where R(f) is the regularizer and λ its relative weight (a hyperparameter).

  17. Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine

  18. Logistic Regression Input: x Output: y ∈ {-1,+1} Features: Φ(x) f(Φ(x),y) = y θ^T Φ(x) Prediction: sign(θ^T Φ(x)) P(y|x) = σ(f(Φ(x),y)), where σ(z) = 1/(1 + exp(-z)) is the logistic function. Is the distribution normalized?
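To answer the slide's question numerically, here is a small check (with arbitrary illustrative values of θ and Φ(x)) that the two probabilities sum to one, since σ(z) + σ(-z) = 1 for every z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 0.3])   # illustrative weights
phi = np.array([1.0, 2.0, -0.5])     # illustrative features Phi(x)
z = theta @ phi

p_pos = sigmoid(+z)   # P(y = +1 | x)
p_neg = sigmoid(-z)   # P(y = -1 | x)
print(p_pos + p_neg)  # 1.0 (up to floating point), so the distribution is normalized
```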

  19. Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i -log P(y_i|x_i) + λ R(θ), where the first term is the negative log-likelihood and R(θ) is the regularizer.

  20. Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i -log P(y_i|x_i) + λ ||θ||^2 Convex optimization problem. Proof left as an exercise. Hint: prove that the Hessian H is PSD, i.e. a^T H a ≥ 0 for all a.

  21. Gradient Descent Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i -log P(y_i|x_i) + λ ||θ||^2 Start with an initial estimate θ_0. Update θ_{t+1} ← θ_t - μ (dL/dθ)|_{θ_t}. Repeat until the decrease in the objective is below a threshold.

  22. Gradient Descent (illustration: behaviour of the iterates with a small μ vs. a large μ)

  23. Gradient Descent (further illustration: small μ vs. large μ)

  24. Gradient Descent Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i -log P(y_i|x_i) + λ ||θ||^2 Start with an initial estimate θ_0. Update θ_{t+1} ← θ_t - μ (dL/dθ)|_{θ_t}, where μ is a small constant or chosen by line search. Repeat until the decrease in the objective is below a threshold.
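A minimal gradient-descent sketch for this regularized logistic regression objective, with a fixed small step size μ and synthetic illustrative data; the stopping rule follows the slide (stop when the decrease in the objective falls below a threshold):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(theta, Phi, y, lam):
    """L(theta) = sum_i -log P(y_i | x_i) + lambda * ||theta||^2."""
    z = y * (Phi @ theta)
    return -np.sum(np.log(sigmoid(z))) + lam * theta @ theta

def gradient(theta, Phi, y, lam):
    z = y * (Phi @ theta)
    return -Phi.T @ (y * (1.0 - sigmoid(z))) + 2.0 * lam * theta

# Synthetic data: rows of Phi are feature vectors Phi(x_i), labels are in {-1, +1}.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))
y = np.sign(Phi @ rng.normal(size=5))

theta, lam, mu = np.zeros(5), 0.1, 0.01
prev = objective(theta, Phi, y, lam)
for t in range(1000):
    theta = theta - mu * gradient(theta, Phi, y, lam)   # theta_{t+1} = theta_t - mu * dL/dtheta
    cur = objective(theta, Phi, y, lam)
    if prev - cur < 1e-6:                               # decrease below threshold: stop
        break
    prev = cur
print(cur)
```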

  25. Newton’s Method Minimize g(z). Solution at iteration t: z_t. Define g_t(Δz) = g(z_t + Δz). Second-order Taylor series: g_t(Δz) ≈ g(z_t) + g'(z_t) Δz + ½ g''(z_t) (Δz)^2. Setting the derivative w.r.t. Δz to zero implies g'(z_t) + g''(z_t) Δz = 0. Solving for Δz provides the update Δz = -g'(z_t)/g''(z_t), i.e. an effective learning rate of 1/g''(z_t).

  26. Newton’s Method Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i -log P(y_i|x_i) + λ ||θ||^2 Start with an initial estimate θ_0. Update θ_{t+1} ← θ_t - μ (dL/dθ)|_{θ_t} with μ^{-1} = (d^2L/dθ^2)|_{θ_t}. Repeat until the decrease in the objective is below a threshold.
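A corresponding sketch of Newton's method for the same objective: the step size is replaced by the inverse Hessian (same synthetic setup as in the gradient-descent sketch above; data and λ are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, Phi, y, lam):
    """One Newton update theta - H^{-1} grad for the regularized negative log-likelihood."""
    z = y * (Phi @ theta)
    s = sigmoid(z)
    grad = -Phi.T @ (y * (1.0 - s)) + 2.0 * lam * theta
    weights = s * (1.0 - s)                                      # per-sample curvature
    hess = (Phi * weights[:, None]).T @ Phi + 2.0 * lam * np.eye(theta.size)
    return theta - np.linalg.solve(hess, grad)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))
y = np.sign(Phi @ rng.normal(size=5))

theta = np.zeros(5)
for t in range(10):          # Newton's method typically converges in a handful of steps
    theta = newton_step(theta, Phi, y, 0.1)
print(theta)
```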

  27. Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Train C one-vs-all binary logistic regression classifiers. Prediction: the class whose classifier assigns the highest probability to +1. Simple extension, easy to code, but it loses the probabilistic interpretation.

  28. Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine

  29. Multiclass Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) Ψ(x,1) = [Φ(x); 0; 0; …; 0] Ψ(x,2) = [0; Φ(x); 0; …; 0] … Ψ(x,C) = [0; 0; 0; …; Φ(x)]

  30. Multiclass Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = θ^T Ψ(x,y) Prediction: max_y θ^T Ψ(x,y) P(y|x) = exp(f(Ψ(x,y))) / Z(x), where Z(x) = Σ_y exp(f(Ψ(x,y))) is the partition function.
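A minimal sketch of the joint feature vector Ψ(x,y) and the resulting distribution P(y|x); θ and Φ(x) below are arbitrary illustrative values:

```python
import numpy as np

def joint_features(phi, y, num_classes):
    """Psi(x, y): Phi(x) placed in the block corresponding to class y, zeros elsewhere."""
    psi = np.zeros(num_classes * phi.size)
    psi[y * phi.size:(y + 1) * phi.size] = phi
    return psi

def class_distribution(theta, phi, num_classes):
    """P(y|x) = exp(theta^T Psi(x, y)) / Z(x)."""
    scores = np.array([theta @ joint_features(phi, y, num_classes)
                       for y in range(num_classes)])
    scores -= scores.max()                  # for numerical stability
    Z = np.exp(scores).sum()                # partition function Z(x)
    return np.exp(scores) / Z

phi = np.array([1.0, -0.5, 2.0])
theta = np.random.default_rng(0).normal(size=3 * 4)   # C = 4 classes
p = class_distribution(theta, phi, num_classes=4)
print(p, p.sum())        # probabilities sum to 1; the prediction is the argmax
```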

  31. Multiclass Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i -log P(y_i|x_i) + λ ||θ||^2 Convex optimization problem. Gradient descent, Newton's method, and many other solvers apply.

  32. Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine

  33. Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y), e.g. per-variable features [Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)] or unary plus pairwise features [Ψ(x,y_i) for all i; Ψ(x,y_i,y_j) for all i, j]

  34. Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y), e.g. [Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)], [Ψ(x,y_i) for all i; Ψ(x,y_i,y_j) for all i, j], or [Ψ(x,y_i) for all i; Ψ(x,y_c) for all cliques c, where c is a subset of variables]
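A minimal sketch of one such construction for a structured output y ∈ {1,…,C}^m: per-variable (unary) blocks that copy Φ(x) into the slot of the chosen label, plus pairwise indicator blocks for each pair of variables. The exact layout is an illustrative choice, not prescribed by the slides:

```python
import numpy as np

def structured_joint_features(phi, y, num_classes):
    """Psi(x, y) for y in {0,...,C-1}^m: unary blocks plus pairwise indicator blocks."""
    m, d = len(y), phi.size
    unary = np.zeros((m, num_classes, d))                    # Psi(x, y_i) blocks
    for i, label in enumerate(y):
        unary[i, label] = phi
    pairwise = np.zeros((m, m, num_classes, num_classes))    # Psi(x, y_i, y_j) blocks
    for i, li in enumerate(y):
        for j, lj in enumerate(y):
            if i < j:
                pairwise[i, j, li, lj] = 1.0
    return np.concatenate([unary.ravel(), pairwise.ravel()])

psi = structured_joint_features(np.array([1.0, -0.5]), y=[0, 2, 1], num_classes=3)
print(psi.shape)   # (3*3*2 + 3*3*3*3,) = (99,)
```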

  35. Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = θ^T Ψ(x,y) Prediction: max_y θ^T Ψ(x,y) P(y|x) = exp(f(Ψ(x,y))) / Z(x), where Z(x) = Σ_y exp(f(Ψ(x,y))) is the partition function.

  36. Regularized Maximum Likelihood Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i -log P(y_i|x_i) + λ ||θ||^2 The partition function is expensive to compute; use approximate inference (see Nikos Komodakis’ tutorial).

  37. Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine (multiclass) • Structured output support vector machine

  38. Multiclass SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) Ψ(x,1) = [Φ(x); 0; 0; …; 0] Ψ(x,2) = [0; Φ(x); 0; …; 0] … Ψ(x,C) = [0; 0; 0; …; Φ(x)]

  39. Multiclass SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = w^T Ψ(x,y) Prediction score: max_y w^T Ψ(x,y) Predicted output: y(w) = argmax_y w^T Ψ(x,y)

  40. Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} Loss function for the i-th sample: Δ(y_i, y_i(w)). Minimizing the regularized sum of losses over the training data directly is highly non-convex in w, and the regularization plays no role (overfitting may occur).

  41. Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} Δ(y_i, y_i(w)) ≤ w^T Ψ(x_i, y_i(w)) + Δ(y_i, y_i(w)) - w^T Ψ(x_i, y_i) ≤ max_y { w^T Ψ(x_i, y) + Δ(y_i, y) } - w^T Ψ(x_i, y_i) The first inequality holds because y_i(w) maximizes w^T Ψ(x_i, ·). This upper bound on the loss is convex in w and sensitive to the regularization of w.

  42. Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y) + Δ(y_i, y) - w^T Ψ(x_i, y_i) ≤ ξ_i for all y. Quadratic program with a polynomial number of constraints. Specialized software packages are freely available: http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html
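A minimal sketch of the slack for one sample: at the optimum, the smallest feasible ξ_i equals max_y [w^T Ψ(x_i, y) + Δ(y_i, y)] - w^T Ψ(x_i, y_i), i.e. the structured hinge loss (here with a 0-1 loss Δ and illustrative weights; joint_features is as in the multiclass sketch above):

```python
import numpy as np

def joint_features(phi, y, num_classes):
    psi = np.zeros(num_classes * phi.size)
    psi[y * phi.size:(y + 1) * phi.size] = phi
    return psi

def hinge_slack(w, phi, y_true, num_classes):
    """Smallest xi_i satisfying all constraints for this sample, using a 0-1 loss Delta."""
    score_true = w @ joint_features(phi, y_true, num_classes)
    slack = 0.0
    for y in range(num_classes):
        delta = 0.0 if y == y_true else 1.0
        violation = w @ joint_features(phi, y, num_classes) + delta - score_true
        slack = max(slack, violation)
    return slack

w = np.random.default_rng(0).normal(size=3 * 4)    # C = 4 classes, 3 features
print(hinge_slack(w, np.array([1.0, -0.5, 2.0]), y_true=2, num_classes=4))
```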

  43. Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine (multiclass) • Structured output support vector machine

  44. Structured Output SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = w^T Ψ(x,y) Prediction: max_y w^T Ψ(x,y)

  45. Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y) + Δ(y_i, y) - w^T Ψ(x_i, y_i) ≤ ξ_i for all y. Quadratic program with an exponential number of constraints; many polynomial-time algorithms exist.

  46. Cutting Plane Algorithm Define working sets W_i = {}. REPEAT: (1) Update w by solving min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y) + Δ(y_i, y) - w^T Ψ(x_i, y_i) ≤ ξ_i for all y ∈ W_i. (2) Compute the most violated constraint for every sample: ŷ_i = argmax_y w^T Ψ(x_i, y) + Δ(y_i, y). (3) Update the working sets W_i by adding ŷ_i.
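A structural sketch of this loop in Python. solve_qp (a QP solver restricted to the constraints in the current working sets) and most_violated (loss-augmented inference returning ŷ_i = argmax_y w^T Ψ(x_i, y) + Δ(y_i, y) together with its constraint violation) are hypothetical callables supplied by the user; they are not specified on the slides:

```python
def cutting_plane(samples, solve_qp, most_violated, C, eps):
    """Working-set (cutting-plane) training loop for the structured SVM QP."""
    working_sets = [set() for _ in samples]            # W_i = {}
    w, xi = solve_qp(samples, working_sets, C)         # trivially feasible at first
    while True:
        changed = False
        for i, (x_i, y_i) in enumerate(samples):
            # Most violated constraint for sample i and its violation
            # w^T Psi(x_i, y_hat) + Delta(y_i, y_hat) - w^T Psi(x_i, y_i).
            y_hat, violation = most_violated(w, x_i, y_i)
            if violation > xi[i] + eps:                # per-sample termination test
                working_sets[i].add(y_hat)
                changed = True
        if not changed:                                # all constraints satisfied up to eps
            return w
        w, xi = solve_qp(samples, working_sets, C)     # re-solve the restricted QP
```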

  47. Cutting Plane Algorithm Termination criterion: the violation of ŷ_i is < ξ_i + ε for all i. Number of iterations = max{O(n/ε), O(C/ε^2)}. At each iteration the convex dual of the problem increases, and the convex dual can be upper bounded. Ioannis Tsochantaridis et al., JMLR 2005. http://svmlight.joachims.org/svm_struct.html

  48. Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y) + Δ(y_i, y) - w^T Ψ(x_i, y_i) ≤ ξ_i for all y ∈ {1,2,…,C}^m. Number of constraints = n C^m.

  49. Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y) + Δ(y_i, y) - w^T Ψ(x_i, y_i) ≤ ξ_i for all y ∈ Y.

  50. Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, z_i) + Δ(y_i, z_i) - w^T Ψ(x_i, y_i) ≤ ξ_i for all z_i ∈ Y.
