
Parallel coordinate descent methods
Peter Richtárik
Simons Institute for the Theory of Computing, Berkeley
Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013


Presentation Transcript


  1. Parallel coordinate descent methods • Peter Richtárik • Simons Institute for the Theory of Computing, Berkeley • Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013

  2. Randomized Coordinate Descent in 2D

  3. 2D Optimization • Goal: find the minimizer of the function • [Figure: contours of the function]

  4.–11. Randomized Coordinate Descent in 2D • [Figure sequence: starting from an initial point, each step moves along a single coordinate direction (marked N/E/W/S in the picture); after steps 1 through 7 the minimizer is reached: SOLVED!]
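A minimal runnable sketch of what the animation depicts; the quadratic below is an assumption, not the function whose contours appear on the slides:

```python
import numpy as np

# Illustrative quadratic F(x) = 0.5 * x^T Q x (an assumption for the demo).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])

rng = np.random.default_rng(0)
x = np.array([2.0, -1.5])                # starting point
for k in range(1, 8):
    i = rng.integers(2)                  # pick a random coordinate (an E/W or N/S move)
    x[i] -= (Q[i] @ x) / Q[i, i]         # exact minimization of F along coordinate i
    print(k, x, 0.5 * x @ Q @ x)         # the function value decreases toward 0
```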

  12. Convergence of Randomized Coordinate Descent • Focus on n (big data = big n) • Three regimes: strongly convex F; smooth or 'simple' nonsmooth F; 'difficult' nonsmooth F
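The rates themselves are on the slide image; as a hedged reminder, the standard complexities behind these three regimes for serial randomized coordinate descent are (constants omitted):

```latex
\[
  \text{strongly convex } F:\; O\!\bigl(n \log \tfrac{1}{\epsilon}\bigr), \qquad
  \text{smooth or `simple' nonsmooth } F:\; O\!\bigl(\tfrac{n}{\epsilon}\bigr), \qquad
  \text{`difficult' nonsmooth } F:\; O\!\bigl(\tfrac{n}{\epsilon^2}\bigr)
\]
```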

  13. Parallelization Dream • Serial vs. parallel: what we WANT vs. what do we actually get? • The answer depends on the extent to which we can add up individual updates, which in turn depends on the properties of F and on the way the coordinates are chosen at each iteration
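A hedged reading of the "dream", assuming τ denotes the number of coordinates updated per iteration: the hope is to cut the serial complexity by a factor of τ:

```latex
\[
  \text{serial: } O\!\left(\frac{n}{\epsilon}\right)
  \quad\longrightarrow\quad
  \text{parallel (WANT): } O\!\left(\frac{n}{\tau\,\epsilon}\right),
  \qquad \tau = \text{number of coordinates updated per iteration}
\]
```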

  14. How (not) to Parallelize Coordinate Descent

  15. “Naive” parallelization • Do the same thing as before, but for MORE or ALL coordinates & ADD UP the updates (a toy sketch of this scheme and of its failure follows slide 20 below)

  16.–20. Failure of naive parallelization • [Figure sequence: both coordinate updates are computed from the same point and added up; the combined step overshoots, and after a couple of iterations the method has made no progress. OOPS!]
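A toy sketch of the failure mode of slides 15–20; the function below is an assumption chosen to make the failure obvious, not necessarily the one drawn on the slides:

```python
import numpy as np

# F(x) = (x1 + x2)^2.  Exactly minimizing F over one coordinate with the
# other held fixed gives x1 <- -x2 (and symmetrically x2 <- -x1).
def coord_update(x, i):
    return -x[1 - i]

x = np.array([1.0, 0.5])
for k in range(4):
    # "naive" parallelization: both updates from the same point, applied together
    x = np.array([coord_update(x, 0), coord_update(x, 1)])
    print(k, x, (x.sum()) ** 2)
# The iterates just swap and change sign; F stays at its starting value
# forever.  The combined step overshoots -- OOPS!
```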

  21. Idea: averaging updates may help • [Figure: averaging the two coordinate updates instead of adding them lands exactly on the minimizer: SOLVED!]

  22.–23. Averaging can be too conservative • [Figure sequence: on other problems the averaged step covers only a fraction of the distance the individual updates would have covered ('and so on...'), whereas we wanted the full combined step: BAD!]
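Continuing the same toy example: averaging the two updates instead of adding them solves it in one step, while on a separable function the averaged step would be needlessly short:

```python
import numpy as np

def coord_step(x, i):
    h = np.zeros(2)
    h[i] = -x[1 - i] - x[i]      # step that exactly minimizes (x1 + x2)^2 over x_i
    return h

x = np.array([1.0, 0.5])
x = x + 0.5 * (coord_step(x, 0) + coord_step(x, 1))   # averaging
print(x, (x.sum()) ** 2)         # lands on a minimizer: SOLVED!

# On a separable function such as G(x) = x1^2 + x2^2, however, the averaged
# step covers only half the distance of the individual updates at every
# iteration, so convergence is slowed down for no reason: too conservative.
```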

  24. What to do? • Notation: h^(i) = update to coordinate i, e_i = i-th unit coordinate vector • Two candidate rules, averaging and summation (sketched below) • Goal: figure out when one can safely use the summation
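A sketch of the two candidate rules in assumed notation (S = sampled set, τ = |S|, h^(i) = update to coordinate i, e_i = i-th unit vector):

```latex
\[
  \text{averaging: } x_{k+1} = x_k + \frac{1}{\tau} \sum_{i \in S} h^{(i)} e_i,
  \qquad
  \text{summation: } x_{k+1} = x_k + \sum_{i \in S} h^{(i)} e_i
\]
```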

  25. Optimization Problems

  26. Problem: minimize the sum of a loss and a regularizer (sketched below) • Loss: convex (smooth or nonsmooth) • Regularizer: convex (smooth or nonsmooth), separable, and allowed to take infinite values
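In symbols (the notation F = f + Ω is assumed; only the words "loss", "regularizer" and "separable" survive in the transcript):

```latex
\[
  \min_{x \in \mathbb{R}^n} \; F(x) = f(x) + \Omega(x),
  \qquad
  \Omega(x) = \sum_{i=1}^{n} \Omega_i\bigl(x^{(i)}\bigr)
  \quad \text{(separable; the } \Omega_i \text{ may take the value } +\infty\text{)}
\]
```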

  27. Regularizer: examples • No regularizer • Weighted L1 norm (e.g., LASSO) • Weighted L2 norm • Box constraints (e.g., SVM dual)

  28. Loss: examples • Quadratic loss • Logistic loss • Square hinge loss • L-infinity regression • L1 regression • Exponential loss • (references on the slide: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13)

  29. Three models for f for which the key parameter (introduced later) is small: (1) smooth partially separable f [RT'11b]; (2) nonsmooth max-type f [FR'13]; (3) f with 'bounded Hessian' [BKBG'11, RT'13a]

  30. General Theory

  31. Randomized Parallel Coordinate Descent Method • Iteration: x_{k+1} = x_k + Σ_{i∈S_k} h^(i) e_i, where x_{k+1} is the new iterate, x_k the current iterate, e_i the i-th unit coordinate vector, S_k the random set of coordinates (sampling), and h^(i) the update to the i-th coordinate
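A minimal runnable sketch of one such method for a smooth quadratic loss with τ-nice sampling; the step rule h^(i) = -∇_i f(x)/(β L_i) and the choice β = τ are assumptions consistent with the theory on the following slides, not code from the talk:

```python
import numpy as np

def pcdm(A, b, tau, beta, iters, seed=0):
    """Parallel coordinate descent sketch for f(x) = 0.5 * ||A x - b||^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    L = (A ** 2).sum(axis=0)                         # coordinate-wise Lipschitz constants
    x = np.zeros(n)
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
        g = A[:, S].T @ (A @ x - b)                  # partial gradient on S
        x[S] -= g / (beta * L[S])                    # parallel coordinate updates
    return x

A = np.random.default_rng(1).standard_normal((50, 20))
b = A @ np.ones(20)
x = pcdm(A, b, tau=5, beta=5.0, iters=2000)          # beta = tau is a safe choice for dense data
print(np.linalg.norm(A @ x - b))                     # residual should be near zero
```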

  32. ESO: Expected Separable Overapproximation • Definition [RT'11b], with a shorthand notation for it • The overapproximation is minimized in h; since it is separable in h, it can be minimized in parallel, and updates need to be computed only for the sampled coordinates
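A hedged sketch of the ESO inequality from [RT'11b] (notation assumed; shorthand (f, Ŝ) ~ ESO(β, w), and h_[Ŝ] denotes h with the coordinates outside Ŝ zeroed out):

```latex
\[
  \mathbf{E}\Bigl[ f\bigl(x + h_{[\hat S]}\bigr) \Bigr]
  \;\le\;
  f(x) + \frac{\mathbf{E}\,|\hat S|}{n}
  \Bigl( \langle \nabla f(x), h \rangle + \tfrac{\beta}{2} \, \|h\|_w^2 \Bigr),
  \qquad
  \|h\|_w^2 = \sum_{i=1}^{n} w_i \bigl(h^{(i)}\bigr)^2
\]
```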

  33. Convergence rate: convex f • Theorem [RT'11b]: a bound on the number of iterations k, stated in terms of the stepsize parameter, the number of coordinates n, the average number of updated coordinates per iteration, and the error tolerance ε, implies that the target accuracy is reached

  34. Convergence rate: strongly convex f • Theorem [RT'11b]: the iteration bound now also involves the strong convexity constant of the regularizer and the strong convexity constant of the loss f, and again implies that the target accuracy is reached (a rough sketch of both rates follows)
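The precise constants are on the slide images; as a rough, assumed paraphrase of how the [RT'11b] bounds scale (τ = average number of updated coordinates per iteration, β = ESO/stepsize parameter, μ_f and μ_Ω the two strong convexity constants):

```latex
\[
  \text{convex } f:\;
  k \;\gtrsim\; \frac{n\beta}{\tau}\cdot\frac{1}{\epsilon},
  \qquad
  \text{strongly convex } f:\;
  k \;\gtrsim\; \frac{n}{\tau}\cdot\frac{\beta + \mu_\Omega}{\mu_f + \mu_\Omega}\cdot\log\frac{1}{\epsilon}
\]
```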

  35. Partial Separability and Doubly Uniform Samplings

  36. Serial uniform sampling • Probability law: a single coordinate, chosen uniformly at random

  37. τ-nice sampling • Good for shared-memory systems • Probability law: a subset of exactly τ coordinates, chosen uniformly from all such subsets

  38. Doubly uniform sampling • Can model unreliable processors / machines • Probability law: depends on a set only through its cardinality (see the sketch below)
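A runnable sketch of the three sampling laws of slides 36–38, with the probability laws stated in the comments (the exact formulas on the slides are not recoverable from the transcript):

```python
import numpy as np

rng = np.random.default_rng(0)

def serial_uniform(n):
    # P(S = {i}) = 1/n: a single coordinate, uniformly at random
    return {int(rng.integers(n))}

def tau_nice(n, tau):
    # P(S) = 1 / C(n, tau) for every subset with |S| = tau, 0 otherwise
    return set(rng.choice(n, size=tau, replace=False).tolist())

def doubly_uniform(n, q):
    # q[j] = P(|S| = j); given its cardinality, the set is uniform.
    # Spreading q over several sizes models processors that each manage
    # to finish a random number of coordinate updates (unreliability).
    size = rng.choice(len(q), p=q)
    return set(rng.choice(n, size=size, replace=False).tolist())

print(serial_uniform(10))
print(tau_nice(10, 4))
print(doubly_uniform(10, q=[0.1, 0.2, 0.3, 0.3, 0.1]))
```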

  39. ESO for partially separable functions and doubly uniform samplings • Model 1: smooth partially separable f [RT'11b] • Theorem [RT'11b] (a hedged sketch of its τ-nice special case follows)
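A hedged sketch of the τ-nice special case of this result (assumed form; ω = degree of partial separability of f, w_i = L_i = coordinate Lipschitz constants):

```latex
\[
  (f, \hat S) \sim \mathrm{ESO}(\beta, w)
  \quad\text{with}\quad
  \beta \;=\; 1 + \frac{(\omega - 1)(\tau - 1)}{n - 1},
  \qquad w_i = L_i .
\]
```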

  40. Theoretical speedup • Parameters: the number of coordinates (n), the degree of partial separability (ω), and the number of coordinate updates per iteration (τ) • LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems • WEAK OR NO SPEEDUP: non-separable (dense) problems (much of Big Data is here!)
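A small numeric illustration of this dichotomy, assuming the β formula sketched above and measuring speedup as τ/β (the factor by which the parallel iteration bound improves on the serial one):

```python
def predicted_speedup(n, omega, tau):
    # beta for tau-nice sampling and a partially separable f of degree omega
    beta = 1 + (omega - 1) * (tau - 1) / (n - 1)
    return tau / beta

n, tau = 1000, 100
print(predicted_speedup(n, omega=10, tau=tau))   # sparse (omega << n): roughly 53x speedup
print(predicted_speedup(n, omega=n,  tau=tau))   # dense (omega = n): speedup = 1
```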

  41. Theory n = 1000 (# coordinates)

  42. Practice n = 1000 (# coordinates)

  43. Experiment with a 1 billion-by-2 billion LASSO problem

  44. Optimization with Big Data = Extreme* Mountain Climbing (* in a billion dimensional space on a foggy day)

  45. Coordinate Updates

  46. Iterations

  47. Wall Time

  48. Distributed-Memory Coordinate Descent
