
Parallel coordinate descent methods
Peter Richtárik
Simons Institute for the Theory of Computing, Berkeley
Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013


Presentation Transcript


  1. Parallel coordinate descent methods • Peter Richtárik • Simons Institute for the Theory of Computing, Berkeley • Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013

  2. Randomized Coordinate Descent in 2D

  3. 2D Optimization • Goal: find the minimizer of the function • [Figure: contours of the function]

  4.–11. Randomized Coordinate Descent in 2D • [Figure sequence: starting from an initial point, each step moves along a single coordinate direction (marked N/E/W/S in the picture); after steps 1 through 7 the minimizer is reached: SOLVED!]
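A minimal runnable sketch of what the animation depicts; the quadratic below is an assumption, not the function whose contours appear on the slides:

```python
import numpy as np

# Illustrative quadratic F(x) = 0.5 * x^T Q x (an assumption for the demo).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])

rng = np.random.default_rng(0)
x = np.array([2.0, -1.5])                # starting point
for k in range(1, 8):
    i = rng.integers(2)                  # pick a random coordinate (an E/W or N/S move)
    x[i] -= (Q[i] @ x) / Q[i, i]         # exact minimization of F along coordinate i
    print(k, x, 0.5 * x @ Q @ x)         # the function value decreases toward 0
```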

  12. Convergence of Randomized Coordinate Descent • Focus on n (big data = big n) • Three regimes: strongly convex F; smooth or 'simple' nonsmooth F; 'difficult' nonsmooth F
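The rates themselves are on the slide image; as a hedged reminder, the standard complexities behind these three regimes for serial randomized coordinate descent are (constants omitted):

```latex
\[
  \text{strongly convex } F:\; O\!\bigl(n \log \tfrac{1}{\epsilon}\bigr), \qquad
  \text{smooth or `simple' nonsmooth } F:\; O\!\bigl(\tfrac{n}{\epsilon}\bigr), \qquad
  \text{`difficult' nonsmooth } F:\; O\!\bigl(\tfrac{n}{\epsilon^2}\bigr)
\]
```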

  13. Parallelization Dream • Serial vs. parallel: what we WANT vs. what do we actually get? • The answer depends on the extent to which we can add up individual updates, which in turn depends on the properties of F and on the way the coordinates are chosen at each iteration
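A hedged reading of the "dream", assuming τ denotes the number of coordinates updated per iteration: the hope is to cut the serial complexity by a factor of τ:

```latex
\[
  \text{serial: } O\!\left(\frac{n}{\epsilon}\right)
  \quad\longrightarrow\quad
  \text{parallel (WANT): } O\!\left(\frac{n}{\tau\,\epsilon}\right),
  \qquad \tau = \text{number of coordinates updated per iteration}
\]
```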

  14. How (not) to Parallelize Coordinate Descent

  15. “Naive” parallelization • Do the same thing as before, but for MORE or ALL coordinates & ADD UP the updates (a toy sketch of this scheme and of its failure follows slide 20 below)

  16.–20. Failure of naive parallelization • [Figure sequence: both coordinate updates are computed from the same point and added up; the combined step overshoots, and after a couple of iterations the method has made no progress. OOPS!]
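A toy sketch of the failure mode of slides 15–20; the function below is an assumption chosen to make the failure obvious, not necessarily the one drawn on the slides:

```python
import numpy as np

# F(x) = (x1 + x2)^2.  Exactly minimizing F over one coordinate with the
# other held fixed gives x1 <- -x2 (and symmetrically x2 <- -x1).
def coord_update(x, i):
    return -x[1 - i]

x = np.array([1.0, 0.5])
for k in range(4):
    # "naive" parallelization: both updates from the same point, applied together
    x = np.array([coord_update(x, 0), coord_update(x, 1)])
    print(k, x, (x.sum()) ** 2)
# The iterates just swap and change sign; F stays at its starting value
# forever.  The combined step overshoots -- OOPS!
```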

  21. Idea: averaging updates may help • [Figure: averaging the two coordinate updates instead of adding them lands exactly on the minimizer: SOLVED!]

  22.–23. Averaging can be too conservative • [Figure sequence: on other problems the averaged step covers only a fraction of the distance the individual updates would have covered ('and so on...'), whereas we wanted the full combined step: BAD!]
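Continuing the same toy example: averaging the two updates instead of adding them solves it in one step, while on a separable function the averaged step would be needlessly short:

```python
import numpy as np

def coord_step(x, i):
    h = np.zeros(2)
    h[i] = -x[1 - i] - x[i]      # step that exactly minimizes (x1 + x2)^2 over x_i
    return h

x = np.array([1.0, 0.5])
x = x + 0.5 * (coord_step(x, 0) + coord_step(x, 1))   # averaging
print(x, (x.sum()) ** 2)         # lands on a minimizer: SOLVED!

# On a separable function such as G(x) = x1^2 + x2^2, however, the averaged
# step covers only half the distance of the individual updates at every
# iteration, so convergence is slowed down for no reason: too conservative.
```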

  24. What to do? • Notation: h^(i) = update to coordinate i, e_i = i-th unit coordinate vector • Two candidate rules, averaging and summation (sketched below) • Goal: figure out when one can safely use the summation
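A sketch of the two candidate rules in assumed notation (S = sampled set, τ = |S|, h^(i) = update to coordinate i, e_i = i-th unit vector):

```latex
\[
  \text{averaging: } x_{k+1} = x_k + \frac{1}{\tau} \sum_{i \in S} h^{(i)} e_i,
  \qquad
  \text{summation: } x_{k+1} = x_k + \sum_{i \in S} h^{(i)} e_i
\]
```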

  25. Optimization Problems

  26. Problem: minimize the sum of a loss and a regularizer (sketched below) • Loss: convex (smooth or nonsmooth) • Regularizer: convex (smooth or nonsmooth), separable, and allowed to take infinite values
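In symbols (the notation F = f + Ω is assumed; only the words "loss", "regularizer" and "separable" survive in the transcript):

```latex
\[
  \min_{x \in \mathbb{R}^n} \; F(x) = f(x) + \Omega(x),
  \qquad
  \Omega(x) = \sum_{i=1}^{n} \Omega_i\bigl(x^{(i)}\bigr)
  \quad \text{(separable; the } \Omega_i \text{ may take the value } +\infty\text{)}
\]
```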

  27. Regularizer: examples • No regularizer • Weighted L1 norm (e.g., LASSO) • Weighted L2 norm • Box constraints (e.g., SVM dual)

  28. Loss: examples • Quadratic loss • Logistic loss • Square hinge loss • L-infinity regression • L1 regression • Exponential loss • (references on the slide: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13)

  29. Three models for f for which the key parameter (introduced later) is small: (1) smooth partially separable f [RT'11b]; (2) nonsmooth max-type f [FR'13]; (3) f with 'bounded Hessian' [BKBG'11, RT'13a]

  30. General Theory

  31. Randomized Parallel Coordinate Descent Method • Iteration: x_{k+1} = x_k + Σ_{i∈S_k} h^(i) e_i, where x_{k+1} is the new iterate, x_k the current iterate, e_i the i-th unit coordinate vector, S_k the random set of coordinates (sampling), and h^(i) the update to the i-th coordinate
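A minimal runnable sketch of one such method for a smooth quadratic loss with τ-nice sampling; the step rule h^(i) = -∇_i f(x)/(β L_i) and the choice β = τ are assumptions consistent with the theory on the following slides, not code from the talk:

```python
import numpy as np

def pcdm(A, b, tau, beta, iters, seed=0):
    """Parallel coordinate descent sketch for f(x) = 0.5 * ||A x - b||^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    L = (A ** 2).sum(axis=0)                         # coordinate-wise Lipschitz constants
    x = np.zeros(n)
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
        g = A[:, S].T @ (A @ x - b)                  # partial gradient on S
        x[S] -= g / (beta * L[S])                    # parallel coordinate updates
    return x

A = np.random.default_rng(1).standard_normal((50, 20))
b = A @ np.ones(20)
x = pcdm(A, b, tau=5, beta=5.0, iters=2000)          # beta = tau is a safe choice for dense data
print(np.linalg.norm(A @ x - b))                     # residual should be near zero
```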

  32. ESO: Expected Separable Overapproximation • Definition [RT'11b], with a shorthand notation for it • The overapproximation is minimized in h; since it is separable in h, it can be minimized in parallel, and updates need to be computed only for the sampled coordinates
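A hedged sketch of the ESO inequality from [RT'11b] (notation assumed; shorthand (f, Ŝ) ~ ESO(β, w), and h_[Ŝ] denotes h with the coordinates outside Ŝ zeroed out):

```latex
\[
  \mathbf{E}\Bigl[ f\bigl(x + h_{[\hat S]}\bigr) \Bigr]
  \;\le\;
  f(x) + \frac{\mathbf{E}\,|\hat S|}{n}
  \Bigl( \langle \nabla f(x), h \rangle + \tfrac{\beta}{2} \, \|h\|_w^2 \Bigr),
  \qquad
  \|h\|_w^2 = \sum_{i=1}^{n} w_i \bigl(h^{(i)}\bigr)^2
\]
```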

  33. Convergence rate: convex f • Theorem [RT'11b]: a bound on the number of iterations k, stated in terms of the stepsize parameter, the number of coordinates n, the average number of updated coordinates per iteration, and the error tolerance ε, implies that the target accuracy is reached

  34. Convergence rate: strongly convex f • Theorem [RT'11b]: the iteration bound now also involves the strong convexity constant of the regularizer and the strong convexity constant of the loss f, and again implies that the target accuracy is reached (a rough sketch of both rates follows)
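The precise constants are on the slide images; as a rough, assumed paraphrase of how the [RT'11b] bounds scale (τ = average number of updated coordinates per iteration, β = ESO/stepsize parameter, μ_f and μ_Ω the two strong convexity constants):

```latex
\[
  \text{convex } f:\;
  k \;\gtrsim\; \frac{n\beta}{\tau}\cdot\frac{1}{\epsilon},
  \qquad
  \text{strongly convex } f:\;
  k \;\gtrsim\; \frac{n}{\tau}\cdot\frac{\beta + \mu_\Omega}{\mu_f + \mu_\Omega}\cdot\log\frac{1}{\epsilon}
\]
```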

  35. Partial Separability and Doubly Uniform Samplings

  36. Serial uniform sampling • Probability law: a single coordinate, chosen uniformly at random

  37. τ-nice sampling • Good for shared-memory systems • Probability law: a subset of exactly τ coordinates, chosen uniformly from all such subsets

  38. Doubly uniform sampling • Can model unreliable processors / machines • Probability law: depends on a set only through its cardinality (see the sketch below)
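A runnable sketch of the three sampling laws of slides 36–38, with the probability laws stated in the comments (the exact formulas on the slides are not recoverable from the transcript):

```python
import numpy as np

rng = np.random.default_rng(0)

def serial_uniform(n):
    # P(S = {i}) = 1/n: a single coordinate, uniformly at random
    return {int(rng.integers(n))}

def tau_nice(n, tau):
    # P(S) = 1 / C(n, tau) for every subset with |S| = tau, 0 otherwise
    return set(rng.choice(n, size=tau, replace=False).tolist())

def doubly_uniform(n, q):
    # q[j] = P(|S| = j); given its cardinality, the set is uniform.
    # Spreading q over several sizes models processors that each manage
    # to finish a random number of coordinate updates (unreliability).
    size = rng.choice(len(q), p=q)
    return set(rng.choice(n, size=size, replace=False).tolist())

print(serial_uniform(10))
print(tau_nice(10, 4))
print(doubly_uniform(10, q=[0.1, 0.2, 0.3, 0.3, 0.1]))
```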

  39. ESO for partially separable functions and doubly uniform samplings • Model 1: smooth partially separable f [RT'11b] • Theorem [RT'11b] (a hedged sketch of its τ-nice special case follows)
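A hedged sketch of the τ-nice special case of this result (assumed form; ω = degree of partial separability of f, w_i = L_i = coordinate Lipschitz constants):

```latex
\[
  (f, \hat S) \sim \mathrm{ESO}(\beta, w)
  \quad\text{with}\quad
  \beta \;=\; 1 + \frac{(\omega - 1)(\tau - 1)}{n - 1},
  \qquad w_i = L_i .
\]
```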

  40. Theoretical speedup • Parameters: the number of coordinates (n), the degree of partial separability (ω), and the number of coordinate updates per iteration (τ) • LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems • WEAK OR NO SPEEDUP: non-separable (dense) problems (much of Big Data is here!)
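A small numeric illustration of this dichotomy, assuming the β formula sketched above and measuring speedup as τ/β (the factor by which the parallel iteration bound improves on the serial one):

```python
def predicted_speedup(n, omega, tau):
    # beta for tau-nice sampling and a partially separable f of degree omega
    beta = 1 + (omega - 1) * (tau - 1) / (n - 1)
    return tau / beta

n, tau = 1000, 100
print(predicted_speedup(n, omega=10, tau=tau))   # sparse (omega << n): roughly 53x speedup
print(predicted_speedup(n, omega=n,  tau=tau))   # dense (omega = n): speedup = 1
```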

  41. Theory n = 1000 (# coordinates)

  42. Practice n = 1000 (# coordinates)

  43. Experiment with a 1 billion-by-2 billion LASSO problem

  44. Optimization with Big Data = Extreme* Mountain Climbing (* in a billion dimensional space on a foggy day)

  45. Coordinate Updates

  46. Iterations

  47. Wall Time

  48. Distributed-Memory Coordinate Descent
