
Challenges and Successes in MRNet


Presentation Transcript


  1. Challenges and Successes in MRNet Matthew LeGendre & Madhavi Krishnan

  2. MRNet Refresher [Diagram: a front-end (FE) at the root, a tree of communication processes (CP) each running a packet filter, and back-ends (BE) at the leaves]

  3. MRNet Goals • Communications Network for Tools • Scalable - 212,992 nodes at LLNL’s BG/L • Multi-platform - Linux, BlueGene, Cray XT, AIX, Solaris, Windows • Reliable – Automatic fault recovery • Flexible – Programmable filters, customizable topology • Open Source

  4. Challenges in MRNet • System Constraints • IO Node/Compute nodes on BlueGene • Shared library availability • Light-weight kernels • Scalability • Need “Whole System” scalability • Building a general tool • Paradyn is like an OEM • Some users need lightweight MRNet BE

  5. MRNet on BlueGene • User launches FE • LaunchMON launches BEs via the control node • MRNet launches CP processes [Diagram: FE on front-end nodes, CPs, BEs on IO nodes, each serving 256 compute nodes]

  6. MRNet on Cray XT • User launches FE • ALPS launches BEs • ALPS launches CP processes • MRNet initializes the network [Diagram: FE on front-end nodes, CPs, BEs on compute nodes]

  7. BlueGene • BE runs on IO nodes • 256 cores per tool backend • Cray XT • BE runs on compute node • 12 cores per tool backend • MRNet on Cray XT has more BEs for the same size job

  8. Scalable Topology Propagation • Topology may change during execution • Need to broadcast topology information • Needed for: startup in “Back-end Connect” mode, the reliability system • Can lead to topology update storms [Diagram: topology-update packets (T) flooding up the tree of CPs toward the FE]

  9. Scalable Topology Propagation • Use timeout filters to propagate updates • Collect all updates from a time slice before propagating • Individual node delays are small • Reduces network traffic if there are many updates [Diagram: the same tree, with updates batched at each CP]
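The time-slice idea above can be sketched outside MRNet. A minimal, self-contained C++ simulation; the `Update` struct, `batchBySlice` name, and millisecond timestamps are all illustrative assumptions, not MRNet API:

```cpp
#include <string>
#include <vector>

// Simulation of a time-slice (timeout) filter: topology updates arriving
// within the same slice are coalesced into one batch, so a burst of updates
// produces one upstream message instead of an update storm.
struct Update { int timestampMs; std::string change; };

std::vector<std::vector<Update>> batchBySlice(const std::vector<Update>& updates,
                                              int sliceMs) {
    std::vector<std::vector<Update>> batches;
    int sliceEnd = -1;
    for (const Update& u : updates) {
        if (batches.empty() || u.timestampMs >= sliceEnd) {
            batches.push_back({});               // open a new time slice
            sliceEnd = u.timestampMs + sliceMs;  // slice closes after the timeout
        }
        batches.back().push_back(u);             // coalesce into the open slice
    }
    return batches;
}
```

The per-node delay is bounded by `sliceMs`, which matches the slide's point that individual node delays stay small while aggregate traffic drops.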

  10. MRNet Backend • Traditional MRNet BE can’t run on BlueGene compute node • Multi-threading not supported • No multi-processing for dedicated tool processes • Don’t always want other MRNet BE side effects • C++ and threading introduce library overheads • Hard to embed in application processes

  11. Lightweight MRNet • Lightweight back-end as part of the application • C library • Single-threaded • No filtering at the back-end • Traditional MRNet back-end as part of the tool • C++ library • Multi-threaded • Dedicated thread receives data • Can run filters at the back-end

  12. Some MRNet Success Stories • Stack Trace Analysis Tool (STAT) • Cray Application Termination Processing (ATP) • TAU over MRNet (ToM) • Open|SpeedShop, Component Based Tool Framework (CBTF) • Krell Institute • On-line detection of large scale application structure • UPC Barcelona Tech • Paradyn Performance Tool • Group File Operations, FINAL • Totalview using TBON-FS • University of Wisconsin, Madison • …

  13. Stack Trace Analysis Tool (STAT) • Stack trace sampling and analysis for large-scale applications • Reduce the number of tasks to debug • Discover equivalent process behavior • Useful and powerful debugging tool • Extreme scaling • BG/L - 212,992 tasks • Jaguar - 147,456 tasks • Easy to develop • Built over MRNet, LaunchMON, StackwalkerAPI and SymtabAPI
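The "reduce the number of tasks to debug" step rests on grouping tasks whose stack traces match. A minimal sketch of that equivalence-classing, with traces flattened to strings; all names here are illustrative assumptions, not STAT code:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Tasks with identical stack traces form one equivalence class, so a
// 100k-task job may reduce to a handful of distinct behaviors to inspect.
// Each input is (MPI rank, flattened trace); ranks sharing a trace group up.
std::map<std::string, std::vector<int>>
groupByTrace(const std::vector<std::pair<int, std::string>>& taskTraces) {
    std::map<std::string, std::vector<int>> classes;
    for (const auto& [rank, trace] : taskTraces)
        classes[trace].push_back(rank);  // one bucket per distinct trace
    return classes;
}
```

In the real tool this grouping happens incrementally inside the tree-merge filters rather than at one node, but the equivalence relation is the same.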

  14. STAT MRNet Backend • Collect stack traces from the application • Encode as a call prefix tree • Send on an MRNet stream: stream->send(callGraph) [Diagram: STAT front-end (FE), tree merge filter at each CP, STAT tool daemons (BE) attached to application processes]
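The call prefix tree encoding can be sketched as follows; a hypothetical `Node`/`insertTrace` pair, not STAT's actual data structure, showing how shared frames are stored once with a visit count:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Call prefix tree: traces that share a prefix (e.g. _start, main) share
// those nodes, and each node counts how many sampled tasks reached it.
// This is what makes the encoding compact enough to ship up the tree.
struct Node {
    int count = 0;
    std::map<std::string, std::unique_ptr<Node>> children;
};

void insertTrace(Node& root, const std::vector<std::string>& frames) {
    Node* cur = &root;
    for (const std::string& f : frames) {
        auto& child = cur->children[f];
        if (!child) child = std::make_unique<Node>();  // first task on this path
        cur = child.get();
        cur->count++;                                  // tasks reaching this frame
    }
}
```

Merging two such trees (the job of the CP filters) is a recursive walk that adds counts and unions children, which is why the reduction parallelizes cleanly.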

  15. STAT MRNet Filter • Merge filter pseudocode:

    void merge_Stacktrace_Filter {
      /* Receive and process packets */
      for each input packet {
        inPkt = unpack packet;
        /* Implement filter for merge */
        mergedGraph = merge(inPkt);
      }
      /* Send output packet */
      new Packet pkt(mergedGraph);
      push_out(pkt);
    }

[Diagram: STAT front-end (FE), tree merge filter at each CP, STAT tool daemons (BE) attached to application processes]

  16. STAT MRNet Frontend • Store the final merged graph • stream->recv(mergedGraph) • External visualization tools [Diagram: STAT front-end (FE), tree merge filter at each CP, STAT tool daemons (BE) attached to application processes]

  17. Cray ATP Tool • Abnormal Termination Processing • One, many or all processes may crash • Reduce the number of core files • STAT-like analysis to find equivalent process behavior • Request core dumps on a subset of processes • Released with Cray Debugging Tool 1.0 • Multiple MRNet streams • Crash stream: notifies of a crash and requests ATP analysis • Stacktrace stream: collects stack traces • Control stream: requests core dumps

  18. ATP MRNet Crash Stream • Application: triggers signal handler • Backends: request ATP analysis • Filters: TFILTER_SUM, SFILTER_DONTWAIT [Diagram: ATP front-end (FE), filters at CPs, ATP tool daemons (BE) attached to application processes]

  19. ATP MRNet Stacktrace Stream • Frontend: sends a message to all backends to collect stack traces • Backends: collect stack traces from the application • Filters: TFILTER_Merge_Stacktrace, SFILTER_WAITFORALL [Diagram: ATP front-end (FE), filters at CPs, ATP tool daemons (BE) attached to application processes]

  21. ATP MRNet Control Stream • Frontend: requests core dumps, sends control messages (shutdown, disable ATP, acknowledgements) • Backends: trigger core dumps from specific processes [Diagram: ATP front-end (FE), filters at CPs, ATP tool daemons (BE) attached to application processes; core dumps written to disk]

  22. TAU over MRNet (ToM) • Online performance monitoring • Long running applications at scale • Performance data collected and interpreted at runtime • Runtime feedback into measurement subsystem • Optimize measurement • MRNet support was added in about a week • Two types of MRNet streams • Data Stream – Collection and aggregation of data • Control stream – Monitoring, control and feedback

  23. TAU MRNet Data Stream • Backends: collect performance data • Built-in filters: sum, average, max, min • User-built filters: mean, variance, histogram, clustering • Frontend: stores and analyzes aggregated data, changes filter parameters to tune aggregation [Diagram: ToM front-end (FE), filters at CPs, ToM daemons (BE) attached to application processes]
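A user-built histogram filter like the one listed above could bin locally and merge by adding bin counts, which is what lets the front-end retune bin parameters at runtime. A hedged sketch with invented names (`histogram`, `mergeHistograms`), not TAU or MRNet code:

```cpp
#include <cstddef>
#include <vector>

// Each node bins its own samples into equal-width bins over [lo, hi);
// out-of-range samples are dropped. Equal-parameter histograms then merge
// by per-bin addition, so the reduction works at any tree fan-in.
std::vector<long> histogram(const std::vector<double>& samples,
                            double lo, double hi, std::size_t bins) {
    std::vector<long> counts(bins, 0);
    double width = (hi - lo) / static_cast<double>(bins);
    for (double s : samples) {
        if (s < lo || s >= hi) continue;  // ignore samples outside the range
        counts[static_cast<std::size_t>((s - lo) / width)]++;
    }
    return counts;
}

std::vector<long> mergeHistograms(std::vector<long> a, const std::vector<long>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];  // per-bin add
    return a;
}
```

Changing `lo`, `hi`, or `bins` from the front-end corresponds to the slide's "change filter parameters to tune aggregation".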

  24. TAU MRNet Control Stream • Frontend broadcasts control messages to backends • Startup/finalize messages • Selection of events • Sample interval • Measurement options • Instrumentation options [Diagram: ToM front-end (FE), filters at CPs, ToM daemons (BE) attached to application processes]

  25. Benefits of MRNet • Lightweight transport fabric • Powerful and flexible data aggregation • Extremely scalable • Portable • Fault tolerant • Easy to use and integrate with other tool components

  26. Questions

  27. MRNet Filter Examples • Aggregating similar data for scalable presentation • Symbol table and call graphs - checksum • Stack traces – call graph prefix trees • Aggregating different data for scalable analysis • Parallel concatenation • Statistical reduction • Sum, average, max, min, mean, std deviation • Histogram • FE dynamically programs filter parameters for binning • Parallel processing to reduce workload at frontend • Hierarchical clustering • Parallel Smith-Waterman algorithm
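The statistical reductions above work in a tree because partial summaries merge associatively, so the front-end's workload is independent of back-end count. A minimal sketch; the `Summary` struct and `merge` are assumptions for illustration, not MRNet types:

```cpp
#include <algorithm>

// One running summary per subtree: count and sum reconstruct the mean at
// the front-end, while min/max survive merging directly. Because merge is
// associative and commutative, any tree shape yields the same result.
struct Summary { long count; double sum, min, max; };

Summary merge(const Summary& a, const Summary& b) {
    return { a.count + b.count,
             a.sum + b.sum,
             std::min(a.min, b.min),
             std::max(a.max, b.max) };
}
```

Variance and standard deviation fit the same pattern if each summary also carries a sum of squares.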

  28. MRNet Filter Capabilities • Built-in and user-defined filters • Transformation filters • Concatenation, sum, average, minimum, maximum • Synchronization Filters • Wait-for-all, wait-for-any, time-out • Runtime configurable filter parameters • Simultaneous bi-directional output packets • Heterogeneous stream based filters • Local topology information at filter • Fault tolerant filter state

  29. MRNet Filter Types [Diagram: a PacketFilter stage composed of packet batching/unbatching around a synchronization filter and a transformation filter]

  30. MRNet Transformation Filter • Built-in filters: concatenation, minimum, maximum, sum, average • User-built filters

    void reduceFilter(packets_in, packets_out, packets_out_reverse,
                      filter_state, config_params)
    {
      /* Receive and process input packets */
      for (i = 0; i < packets_in.size(); i++) {
        cur_pkt = packets_in[i];
        cur_pkt->unpack("format string", &data);
        reduceData(&data, &FEdata, &BEdata);
      }
      /* Send output packets */
      PacketPtr FE_pkt = new Packet(FEdata, …);
      packets_out.push_back(FE_pkt);
      PacketPtr BE_pkt = new Packet(BEdata, …);
      packets_out_reverse.push_back(BE_pkt);
      return;
    }

  31. MRNet Synchronization Filter • Wait for all • Wait for any • Time out • User-defined

    void batchFilter(packets_in, packets_out, filter_state, config_params)
    {
      /* Get saved packets from filter state */
      batch_size = getBatchSize(config_params);
      packets = getPrevPackets(filter_state);
      packets.push_back(packets_in);
      /* Batch up packets */
      if (packets.size() >= batch_size) {
        packets_out.push_back(packets);
        packets.clear();
      }
      updateFilterState(filter_state, packets);
      return;
    }

  32. [Example merged call prefix tree: _start → __libc_start_main → main, branching into PMPI_WaitAll / do_SendorStall and PMPI_Barrier paths that descend through MPID_RecvComplete, MPID_ELAN_Barrier, and the elan_tport/elan_gsync synchronization routines]