
Scaling up LDA - 2


Presentation Transcript


  1. Scaling up LDA - 2 William Cohen

  2. Speedup for Parallel LDA: Using AllReduce for Synchronization

  3. What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors … then run “Approximate Distributed LDA”. This is a common subtask in parallel versions of LDA, SGD, ….

  4. Introduction
  • Common pattern:
  • do some learning in parallel
  • aggregate local changes from each processor to shared parameters
  • distribute the new shared parameters back to each processor
  • and repeat….
  • AllReduce is implemented in MPI and, recently, in VW code (John Langford) in a Hadoop-compatible scheme
  [Figure: MAP phase followed by ALLREDUCE phase]
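The aggregate-then-redistribute pattern above can be sketched in a few lines. This is a minimal single-process simulation, not MPI or VW's spanning-tree implementation; the names `local_update` and `all_reduce` are illustrative.

```python
# Simulated AllReduce: every worker contributes a local value, and every
# worker receives the same global aggregate back.

def local_update(shard):
    """Each worker computes a local change from its data shard."""
    return sum(shard)  # stand-in for a gradient / count update

def all_reduce(local_values):
    """Aggregate every worker's value and hand the total back to all."""
    total = sum(local_values)           # reduce step
    return [total] * len(local_values)  # broadcast step

shards = [[1, 2], [3, 4], [5]]          # data split across p = 3 workers
locals_ = [local_update(s) for s in shards]
synced = all_reduce(locals_)            # all workers now share one global value
print(synced)  # [15, 15, 15]
```

In real AllReduce the reduce and broadcast run over a tree of compute nodes, so no single node sees all p values at once.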

  5. Gory details of VW Hadoop-AllReduce
  • Spanning-tree server:
  • a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
  • Worker nodes (“fake” mappers):
  • input for each worker is locally cached
  • workers all connect to the spanning-tree server
  • workers all execute the same code, which might contain AllReduce calls
  • workers synchronize whenever they reach an all-reduce

  6. Hadoop AllReduce: don’t wait for duplicate jobs

  7. Second-order method - like Newton’s method

  8. 2^24 features, ~100 non-zeros/example, 2.3B examples. An example is a user/page/ad (and conjunctions of these), positive if there was a click-through on the ad.

  9. 50M examples; an explicitly constructed kernel gives 11.7M features, 3,300 non-zeros/example. Old method: SVM, 3 days. (Reporting time to get to a fixed test error.)

  10. More LDA Speedups. First: Recap LDA Details

  11. More detail

  12. [Figure: sampling z — a unit-height line segment partitioned into regions for z=1, z=2, z=3, …; a random point on the segment selects the topic]

  13. Speedup 1: Sparsity

  14. [Figure: the same unit-height segment for z=1, z=2, z=3, …, revisited for the sparsity speedup]

  15. Running total of P(z=k|…), i.e., the cumulative probability P(z<=k)

  16. Discussion….
  • Where do you spend your time?
  • sampling the z’s
  • each sampling step involves a loop over all K topics
  • This seems wasteful:
  • even with many topics, words are often assigned to only a few different topics
  • low-frequency words appear < K times … and there are lots and lots of them!
  • even frequent words are not in every topic

  17. Discussion…. What’s the solution? Idea: come up with approximations Zi to the normalizer Z at each stage; then you might be able to stop early….. Want Zi >= Z.

  18. Tricks
  • How do you compute and maintain the bound?
  • see the paper
  • What order do you go in?
  • want to pick large P(k)’s first
  • … so we want large P(k|d) and P(k|w)
  • … so we maintain the k’s in sorted order
  • these only change a little after each flip, so a bubble-sort pass will fix up the almost-sorted array
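The bubble-sort fix-up is cheap because a Gibbs flip changes only one topic's count, leaving the array almost sorted. A sketch, with an illustrative `fix_position` helper:

```python
def fix_position(items, i, key=lambda x: x):
    """Restore descending order after items[i] changed slightly.
    One bubble pass: O(K) worst case, O(1) when the change is small."""
    # bubble left while bigger than the left neighbor
    while i > 0 and key(items[i]) > key(items[i - 1]):
        items[i], items[i - 1] = items[i - 1], items[i]
        i -= 1
    # bubble right while smaller than the right neighbor
    while i < len(items) - 1 and key(items[i]) < key(items[i + 1]):
        items[i], items[i + 1] = items[i + 1], items[i]
        i += 1
    return i

counts = [9, 7, 5, 3]   # topic masses, kept in descending order
counts[2] += 3          # one flip bumps a single topic's count
fix_position(counts, 2)
print(counts)  # [9, 8, 7, 3]
```

A full re-sort after every flip would cost O(K log K); the single pass exploits the almost-sorted invariant.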

  19. Results

  20. Results

  21. Results

  22. Speedup 2: Another Approach to Using Sparsity

  23. KDD 09 (SparseLDA: Yao, Mimno & McCallum, “Efficient Methods for Topic Model Inference on Streaming Document Collections”)

  24. z=s+r+q
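The bucket split behind z = s + r + q can be written out; this is the standard SparseLDA decomposition, consistent with the tic-mark formula α1β/(βV + n.|1) used on slides 25–26. Here n_{·|t} is the total token count in topic t, n_{t|d} the count of topic t in document d, and n_{w|t} the count of word w in topic t:

```latex
P(z=t \mid \cdot) \;\propto\; \frac{(\alpha_t + n_{t|d})(\beta + n_{w|t})}{\beta V + n_{\cdot|t}}
 \;=\; \underbrace{\frac{\alpha_t \beta}{\beta V + n_{\cdot|t}}}_{s\text{-bucket}}
 \;+\; \underbrace{\frac{n_{t|d}\,\beta}{\beta V + n_{\cdot|t}}}_{r\text{-bucket}}
 \;+\; \underbrace{\frac{n_{w|t}\,(\alpha_t + n_{t|d})}{\beta V + n_{\cdot|t}}}_{q\text{-bucket}}
```

Summing each bucket over t gives the scalars s, r, q, so the total mass is Z = s + r + q; r is sparse in the document's topics and q is sparse in the word's topics.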

  25. If U < s:
  • look up U on the line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
  If s < U < s+r:
  • look up U on the line segment for r
  • only need to check t such that n_{t|d} > 0
  z = s + r + q

  26. If U < s:
  • look up U on the line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
  If s < U < s+r:
  • look up U on the line segment for r
  If s+r < U:
  • look up U on the line segment for q
  • only need to check t such that n_{w|t} > 0
  z = s + r + q
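The three-way lookup above can be sketched directly. This is an illustrative Python version, not the paper's implementation: the per-topic masses are passed in as (topic, mass) lists, whereas a real sampler maintains s, r, q and the sparse lists incrementally.

```python
import random

def sample_bucketed(s_masses, r_masses, q_masses, rng=random):
    """Pick a topic under the z = s + r + q split.
    s_masses: (topic, mass) over ALL topics (dense, smoothing-only);
    r_masses: only topics with n_{t|d} > 0;
    q_masses: only topics with n_{w|t} > 0."""
    s = sum(m for _, m in s_masses)
    r = sum(m for _, m in r_masses)
    q = sum(m for _, m in q_masses)
    u = rng.random() * (s + r + q)
    if u < s:
        segment = s_masses          # rare: walk all K tic-marks
    elif u < s + r:
        u -= s
        segment = r_masses          # only topics present in the document
    else:
        u -= s + r
        segment = q_masses          # only topics containing word w
    for t, m in segment:            # walk the chosen line segment
        u -= m
        if u <= 0:
            return t
    return segment[-1][0]           # guard against float round-off

print(sample_bucketed([(0, 0.1), (1, 0.1)], [(1, 1.0)], [(2, 0.2)]))
```

Since most of the mass usually sits in the sparse r and q buckets, the expensive dense walk over all K topics is rarely taken.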

  27. Summary of the three buckets (z = s + r + q):
  • s: only need to check occasionally (< 10% of the time)
  • r: only need to check t such that n_{t|d} > 0
  • q: only need to check t such that n_{w|t} > 0

  28. Trick: count up n_{t|d} for d when you start working on d, and update it incrementally.
  • Only need to store (and maintain) the total words per topic and the α’s, β, V
  • Only need to store n_{t|d} for the current d
  • Need to store n_{w|t} for each word/topic pair …???
  z = s + r + q

  29. 1. Precompute, for each t, …
  2. Quickly find the t’s such that n_{w|t} is large for w
  Most (> 90%) of the time and space is here…
  Need to store n_{w|t} for each word/topic pair …???
  z = s + r + q

  30. 1. Precompute, for each t, …
  2. Quickly find the t’s such that n_{w|t} is large for w:
  • map w to an int array
  • no larger than the frequency of w
  • no larger than #topics
  • encode (t, n) as a bit vector
  • n in the high-order bits
  • t in the low-order bits
  • keep the ints sorted in descending order
  Most (> 90%) of the time and space is here…
  Need to store n_{w|t} for each word/topic pair …???
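The (t, n) bit-packing can be sketched as follows. The choice of 10 topic bits is illustrative, not from the deck; the point is that with the count in the high bits, plain integer sort orders the entries by descending n_{w|t}.

```python
TOPIC_BITS = 10                   # supports up to 2**10 = 1024 topics
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(topic, count):
    """Pack (topic, count) into one int: count high, topic low."""
    return (count << TOPIC_BITS) | topic

def decode(packed):
    return packed & TOPIC_MASK, packed >> TOPIC_BITS

# Per-word array: sorting the packed ints in descending order puts the
# large n_{w|t} entries (the ones the q-bucket walk needs) first.
entries = sorted((encode(t, n) for t, n in [(3, 5), (7, 12), (1, 2)]),
                 reverse=True)
print([decode(e) for e in entries])  # [(7, 12), (3, 5), (1, 2)]
```

One int per (word, topic) pair, with no per-entry object overhead, is what keeps the dominant n_{w|t} store compact.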

  31. Speedup 3: Online LDA

  32. Pilfered from… NIPS 2010: “Online Learning for Latent Dirichlet Allocation”, Matthew Hoffman, Francis Bach & David Blei

  33. Aside: Variational Inference for LDA
