
Matrix Factorization via SGD



Presentation Transcript


  1. Matrix Factorization via SGD

  2. Background

  3. Recovering latent factors in a matrix: approximate the ratings matrix V (n users × m movies, with V[i,j] = user i’s rating of movie j) by a rank-r product V ≈ W H, where W is n × r and H is r × m.
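In symbols, following the slide’s diagram (the per-cell prediction is spelled out here for completeness):

```latex
\[
V \approx W H, \qquad
V \in \mathbb{R}^{n \times m}, \quad
W \in \mathbb{R}^{n \times r}, \quad
H \in \mathbb{R}^{r \times m}, \qquad
\hat{V}_{ij} = \sum_{k=1}^{r} W_{ik} H_{kj}
\]
```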

  4. MF VIA SGD

  5. Matrix factorization as SGD: sample one observed cell (i,j) of V, compute the local gradient of the per-cell loss with respect to W[i,:] and H[:,j], scale it up by N (the number of observed cells) to approximate the full gradient, and take a step against it with the current step size.
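A minimal sketch of one such SGD epoch in Python/NumPy, assuming squared error per cell with L2 regularization; the function name and the lrate/reg defaults are illustrative, and the slide’s scaling by N is simply folded into the step size here:

```python
import numpy as np

def sgd_mf_epoch(ratings, W, H, lrate=0.01, reg=0.05):
    """One SGD pass over the observed cells of V (squared loss + L2).

    ratings -- list of (i, j, v) triples: the observed cells of V
    W       -- n x r user-factor matrix (updated in place)
    H       -- r x m item-factor matrix (updated in place)
    """
    order = np.random.permutation(len(ratings))   # visit cells in random order
    for idx in order:
        i, j, v = ratings[idx]
        err = v - W[i, :] @ H[:, j]               # residual for this cell
        # local gradient: only row W[i,:] and column H[:,j] move
        grad_w = -2 * err * H[:, j] + 2 * reg * W[i, :]
        grad_h = -2 * err * W[i, :] + 2 * reg * H[:, j]
        W[i, :] -= lrate * grad_w
        H[:, j] -= lrate * grad_h

# usage sketch on toy data
n, m, r = 100, 80, 10
W = np.random.rand(n, r) * 0.1                    # random init, not zero
H = np.random.rand(r, m) * 0.1
ratings = [(0, 3, 4.0), (5, 7, 2.0)]              # a few observed cells
sgd_mf_epoch(ratings, W, H)
```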

  6. Key claim:

  7. What loss functions are possible?
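One common answer, given here as an illustrative example rather than as the slide’s own list, is L2-regularized squared error over the set Z of observed cells:

```latex
\[
L(W,H) = \sum_{(i,j)\in Z} \bigl(V_{ij} - W_{i*} H_{*j}\bigr)^{2}
\;+\; \lambda \bigl(\lVert W \rVert_F^{2} + \lVert H \rVert_F^{2}\bigr)
\]
```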

  8. ALS = alternating least squares
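Since the slide only names ALS, here is a minimal dense-matrix sketch of the idea for contrast with SGD: each half-step fixes one factor and solves a ridge-regression problem for the other. The reg value and iteration count are illustrative; the observed-entries-only variant instead solves a small least-squares problem per user row and per item column.

```python
import numpy as np

def als(V, r, n_iters=20, reg=0.1):
    """Alternating least squares for V ~ W H (dense V, for illustration)."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, r))
    H = rng.random((r, m))
    I = np.eye(r)
    for _ in range(n_iters):
        # fix H, solve for W:  min_W ||V - W H||^2 + reg ||W||^2
        W = V @ H.T @ np.linalg.inv(H @ H.T + reg * I)
        # fix W, solve for H:  min_H ||V - W H||^2 + reg ||H||^2
        H = np.linalg.inv(W.T @ W + reg * I) @ W.T @ V
    return W, H
```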

  9. Distributed MF Via SGD

  10. KDD 2011 talk pilfered from  …..

  11. NAACL 2010

  12. Parallel Perceptrons
  • Simplest idea:
  • Split data into S “shards”
  • Train a perceptron on each shard independently; the weight vectors are w(1), w(2), …
  • Produce some weighted average of the w(i)’s as the final result
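A minimal sketch of this one-shot averaging, assuming a plain binary perceptron with ±1 labels; train_perceptron, parallel_perceptron, and the uniform mixing weights are illustrative choices, not the slide’s own code:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Standard perceptron on one shard; labels y are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake -> additive update
                w += yi * xi
    return w

def parallel_perceptron(shards, mix=None):
    """Train one perceptron per shard, then return a weighted average of the w(i)'s."""
    ws = [train_perceptron(X, y) for X, y in shards]
    mix = mix if mix is not None else [1.0 / len(ws)] * len(ws)  # uniform by default
    return sum(m * w for m, w in zip(mix, ws))
```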

  13. Parallelizing perceptrons (diagram): the instances/labels are split into example subsets 1–3, a local vk is computed on each subset (vk-1, vk-2, vk-3), and the results are combined by some sort of weighted averaging into a single vk.

  14. Parallel Perceptrons – take 2
  • Idea: do the simplest possible thing iteratively.
  • Split the data into shards
  • Let w = 0
  • For n = 1, …
  • Train a perceptron on each shard with one pass, starting with w
  • Average the weight vectors (somehow) and let w be that average
  • Extra communication cost: redistributing the weight vectors – done less frequently than if fully synchronized, more frequently than if fully parallelized (All-Reduce)
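A minimal sketch of this iterative averaging, again with a plain binary perceptron and ±1 labels; one_pass, iterative_mixing, and the uniform average are illustrative names and choices:

```python
import numpy as np

def one_pass(X, y, w):
    """One perceptron pass over a shard, starting from the shared weights w."""
    w = w.copy()
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:      # mistake -> additive update
            w += yi * xi
    return w

def iterative_mixing(shards, dim, n_rounds=10):
    """Repeat: one pass per shard from the current average, then re-average."""
    w = np.zeros(dim)                               # let w = 0
    for _ in range(n_rounds):                       # for n = 1, ...
        local = [one_pass(X, y, w) for X, y in shards]
        w = sum(local) / len(local)                 # average the weight vectors
    return w
```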

  15. Parallelizing perceptrons – take 2 (diagram): the instances/labels are again split into example subsets 1–3, each subset computes its local weights starting from the previous round’s averaged w (w-1, w-2, w-3), and the results are combined by some sort of weighted averaging into the new w.

  16. Similar to McDonald et al. with perceptron learning

  17. Slow convergence…..

  18. More detail…
  • Initialize W, H randomly (not at zero!)
  • Choose a random ordering (random sort) of the points in a stratum in each “sub-epoch”
  • Pick the strata sequence by permuting the rows and columns of M, and using M’[k,i] as the column index of row i in sub-epoch k
  • Use the “bold driver” heuristic to set the step size:
  • increase the step size when the loss decreases (in an epoch)
  • decrease the step size when the loss increases
  • Implemented in Hadoop and R/Snowfall
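A minimal sketch of the bold-driver rule; the grow/shrink factors (1.05 and 0.5) and the commented driver loop (train_epoch, compute_loss) are illustrative, not values or functions from the talk:

```python
def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adapt the SGD step size once per epoch:
    grow it when the epoch's loss went down, cut it when the loss went up."""
    return step * grow if curr_loss < prev_loss else step * shrink

# usage sketch inside a training loop (train_epoch / compute_loss are hypothetical):
# step, prev = 0.01, float("inf")
# for epoch in range(100):
#     train_epoch(step)
#     curr = compute_loss()
#     step, prev = bold_driver(step, prev, curr), curr
```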

  19. Wall Clock Time (8 nodes, 64 cores, R/snow; in-memory implementation)

  20. Number of Epochs

  21. Varying rank (100 epochs for all)

  22. Wall Clock Time (Hadoop; one map-reduce job per epoch)

  23. Hadoop scalability: Hadoop process setup time starts to dominate

  24. Hadoop scalability
