
Matrix Factorization via SGD



Presentation Transcript


  1. Matrix Factorization via SGD

  2. Background

  3. Recovering latent factors in a matrix: approximate the ratings matrix V (n users × m movies, with V[i,j] = user i’s rating of movie j) by a rank-r product V ≈ W H, where W is n × r and H is r × m.
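In symbols, following the slide’s diagram (the per-cell prediction is spelled out here for completeness):

```latex
\[
V \approx W H, \qquad
V \in \mathbb{R}^{n \times m}, \quad
W \in \mathbb{R}^{n \times r}, \quad
H \in \mathbb{R}^{r \times m}, \qquad
\hat{V}_{ij} = \sum_{k=1}^{r} W_{ik} H_{kj}
\]
```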

  4. MF VIA SGD

  5. Matrix factorization as SGD: sample one observed cell (i,j) of V, compute the local gradient of the per-cell loss with respect to W[i,:] and H[:,j], scale it up by N (the number of observed cells) to approximate the full gradient, and take a step against it with the current step size.
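A minimal sketch of one such SGD epoch in Python/NumPy, assuming squared error per cell with L2 regularization; the function name and the lrate/reg defaults are illustrative, and the slide’s scaling by N is simply folded into the step size here:

```python
import numpy as np

def sgd_mf_epoch(ratings, W, H, lrate=0.01, reg=0.05):
    """One SGD pass over the observed cells of V (squared loss + L2).

    ratings -- list of (i, j, v) triples: the observed cells of V
    W       -- n x r user-factor matrix (updated in place)
    H       -- r x m item-factor matrix (updated in place)
    """
    order = np.random.permutation(len(ratings))   # visit cells in random order
    for idx in order:
        i, j, v = ratings[idx]
        err = v - W[i, :] @ H[:, j]               # residual for this cell
        # local gradient: only row W[i,:] and column H[:,j] move
        grad_w = -2 * err * H[:, j] + 2 * reg * W[i, :]
        grad_h = -2 * err * W[i, :] + 2 * reg * H[:, j]
        W[i, :] -= lrate * grad_w
        H[:, j] -= lrate * grad_h

# usage sketch on toy data
n, m, r = 100, 80, 10
W = np.random.rand(n, r) * 0.1                    # random init, not zero
H = np.random.rand(r, m) * 0.1
ratings = [(0, 3, 4.0), (5, 7, 2.0)]              # a few observed cells
sgd_mf_epoch(ratings, W, H)
```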

  6. Key claim:

  7. What loss functions are possible?
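One common answer, given here as an illustrative example rather than as the slide’s own list, is L2-regularized squared error over the set Z of observed cells:

```latex
\[
L(W,H) = \sum_{(i,j)\in Z} \bigl(V_{ij} - W_{i*} H_{*j}\bigr)^{2}
\;+\; \lambda \bigl(\lVert W \rVert_F^{2} + \lVert H \rVert_F^{2}\bigr)
\]
```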

  8. ALS = alternating least squares
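Since the slide only names ALS, here is a minimal dense-matrix sketch of the idea for contrast with SGD: each half-step fixes one factor and solves a ridge-regression problem for the other. The reg value and iteration count are illustrative; the observed-entries-only variant instead solves a small least-squares problem per user row and per item column.

```python
import numpy as np

def als(V, r, n_iters=20, reg=0.1):
    """Alternating least squares for V ~ W H (dense V, for illustration)."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, r))
    H = rng.random((r, m))
    I = np.eye(r)
    for _ in range(n_iters):
        # fix H, solve for W:  min_W ||V - W H||^2 + reg ||W||^2
        W = V @ H.T @ np.linalg.inv(H @ H.T + reg * I)
        # fix W, solve for H:  min_H ||V - W H||^2 + reg ||H||^2
        H = np.linalg.inv(W.T @ W + reg * I) @ W.T @ V
    return W, H
```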

  9. Distributed MF Via SGD

  10. KDD 2011 talk pilfered from  …..

  11. NAACL 2010

  12. Parallel Perceptrons
  • Simplest idea:
  • Split data into S “shards”
  • Train a perceptron on each shard independently; the weight vectors are w(1), w(2), …
  • Produce some weighted average of the w(i)’s as the final result
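A minimal sketch of this one-shot averaging, assuming a plain binary perceptron with ±1 labels; train_perceptron, parallel_perceptron, and the uniform mixing weights are illustrative choices, not the slide’s own code:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Standard perceptron on one shard; labels y are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake -> additive update
                w += yi * xi
    return w

def parallel_perceptron(shards, mix=None):
    """Train one perceptron per shard, then return a weighted average of the w(i)'s."""
    ws = [train_perceptron(X, y) for X, y in shards]
    mix = mix if mix is not None else [1.0 / len(ws)] * len(ws)  # uniform by default
    return sum(m * w for m, w in zip(mix, ws))
```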

  13. Parallelizing perceptrons (diagram): the instances/labels are split into example subsets 1–3, a local vk is computed on each subset (vk-1, vk-2, vk-3), and the results are combined by some sort of weighted averaging into a single vk.

  14. Parallel Perceptrons – take 2
  • Idea: do the simplest possible thing iteratively.
  • Split the data into shards
  • Let w = 0
  • For n = 1, …
  • Train a perceptron on each shard with one pass, starting with w
  • Average the weight vectors (somehow) and let w be that average
  • Extra communication cost: redistributing the weight vectors – done less frequently than if fully synchronized, more frequently than if fully parallelized (All-Reduce)
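A minimal sketch of this iterative averaging, again with a plain binary perceptron and ±1 labels; one_pass, iterative_mixing, and the uniform average are illustrative names and choices:

```python
import numpy as np

def one_pass(X, y, w):
    """One perceptron pass over a shard, starting from the shared weights w."""
    w = w.copy()
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:      # mistake -> additive update
            w += yi * xi
    return w

def iterative_mixing(shards, dim, n_rounds=10):
    """Repeat: one pass per shard from the current average, then re-average."""
    w = np.zeros(dim)                               # let w = 0
    for _ in range(n_rounds):                       # for n = 1, ...
        local = [one_pass(X, y, w) for X, y in shards]
        w = sum(local) / len(local)                 # average the weight vectors
    return w
```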

  15. Parallelizing perceptrons – take 2 (diagram): the instances/labels are again split into example subsets 1–3, each subset computes its local weights starting from the previous round’s averaged w (w-1, w-2, w-3), and the results are combined by some sort of weighted averaging into the new w.

  16. Similar to McDonald et al. with perceptron learning

  17. Slow convergence…..

  18. More detail…
  • Initialize W, H randomly (not at zero!)
  • Choose a random ordering (random sort) of the points in a stratum in each “sub-epoch”
  • Pick the strata sequence by permuting the rows and columns of M, and using M’[k,i] as the column index of row i in sub-epoch k
  • Use the “bold driver” heuristic to set the step size:
  • increase the step size when the loss decreases (in an epoch)
  • decrease the step size when the loss increases
  • Implemented in Hadoop and R/Snowfall
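A minimal sketch of the bold-driver rule; the grow/shrink factors (1.05 and 0.5) and the commented driver loop (train_epoch, compute_loss) are illustrative, not values or functions from the talk:

```python
def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adapt the SGD step size once per epoch:
    grow it when the epoch's loss went down, cut it when the loss went up."""
    return step * grow if curr_loss < prev_loss else step * shrink

# usage sketch inside a training loop (train_epoch / compute_loss are hypothetical):
# step, prev = 0.01, float("inf")
# for epoch in range(100):
#     train_epoch(step)
#     curr = compute_loss()
#     step, prev = bold_driver(step, prev, curr), curr
```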

  19. Wall Clock Time (8 nodes, 64 cores, R/snow; in-memory implementation)

  20. Number of Epochs

  21. Varying rank (100 epochs for all)

  22. Wall Clock Time (Hadoop; one map-reduce job per epoch)

  23. Hadoop scalability: Hadoop process setup time starts to dominate

  24. Hadoop scalability
