1 / 10

Implementing Scalable Tree-based Algorithms using MRNet

Implementing Scalable Tree-based Algorithms using MRNet. Ting Chen Mark Cowlishaw. Introduction. Background on MRNet Our Applications Reverse index Online queries Description of Experiments Progress Next steps. Background: MRNet [Roth, Arnold, Miller03]. Tree-Based Overlay Network

urban
Download Presentation

Implementing Scalable Tree-based Algorithms using MRNet

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Implementing Scalable Tree-based Algorithms using MRNet Ting ChenMark Cowlishaw

  2. Introduction • Background on MRNet • Our Applications • Reverse index • Online queries • Description of Experiments • Progress • Next steps

  3. Background: MRNet [Roth, Arnold, Miller03] • Tree-Based Overlay Network • Nodes of a distributed application are arranged in a tree-structure • Leaves producing data that is aggregated and filtered by higher levels of the tree • Separation of programs and their running tree topology • A TBON program can run on tree-network of different topologies

  4. Reverse indexes for Keyword Queries • A keyword query is a list of words: < w1,w2, ...,wn > . • A typical reverse index is a list of words • Each word points to a list of document IDs containing the word. • A document ID list can either be sorted by document names or by the number of times the word appears in the document. Wisconsin Badgers

  5. On-line queries: N-best frequency (N=2) DocC: 7 DocB: 3 DocD: 0 DocF: 6 DocA: 5 DocC: 7 DocF: 6 DocD: 0 DocB: 3 DocA: 5 DocE: 2

  6. Objective / Experiments • How does tree topology affect application performance • Macro-benchmarks: throughput/time-to-completion for index building and response time for on-line queries • Micro-benchmarks: the amount of total IO, I/O performed for each node, data transferred • Scale-Up and Speed-Up curves with the increase of cluster nodes • Is Multiple-level ( > 2) helpful?

  7. Progress - data • Experiment Data • All Wikipedia documents • 8GB of data • 4 Million documents • Probably enough for a tree with 32 leaves • Data in Wiki-text format

  8. Progress – Design / Implementation • Message Formats • Document/keyword delivery • Acknowledgment • keys / statistics • Messages for Microbenchmarks • Shared Classes • DocumentStatistics (back end) [in test] • StatisticsEntry (all) [in test] • StatisticsList (all) • Familiarity with MRNet Toolset • Debugging MRNet Programs

  9. Next Steps • Deploy at medium scale (~32 Nodes) • Experiments to determine fan-out • Maximize throughput (bytes/second) • Minimize time-to-completion • Microbenchmarks • Collected and filtered using MRNet • Idle time • Messages per second, total message traffic

  10. Futures • Replace N-best with distance from median • Add support for new document types • XML • HTML • Generalize to other MapReduce[Dean04] Applications • More realistic relevance ranking • Reverse hyperlink count • Collection term vectors

More Related