
Performance analysis and tuning of parallel/distributed applications

Presentation Transcript


  1. Performance analysis and tuning of parallel/distributed applications 21-08-2007 Anna Morajko Anna.Morajko@uab.es

  2. Introduction Goals of our group • Develop techniques and tools for application performance improvement • Automatic analysis • Automatic and dynamic tuning • For parallel/distributed applications • For dedicated and non-dedicated clusters as well as for grids. Team: 4 doctors, 5 graduate students, 2 external doctors

  3. Introduction Main research projects • Monitoring, analysis, and tuning tools: • Kappa-Pi – post-mortem analysis • DMA – automatic analysis at run-time • MATE – automatic tuning at run-time • Performance models • For applications based on frameworks • For scientific libraries

  4. Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work

  5. Performance analysis and tuning: post-mortem manual analysis. [Workflow diagram: application development (source → application), execution, monitoring tools collecting events and trace/profile performance data, visualization and metrics, manual performance analysis by the user identifying problems/solutions, modifications, tuning.]

  6. Performance analysis and tuning: post-mortem automatic analysis. [Workflow diagram: application development (source → application), execution, monitoring tools collecting events and trace/profile performance data and metrics, automatic performance analysis by Kappa-Pi, suggestions for the user, modifications, tuning.]

  7. Performance analysis and tuning: run-time analysis. [Workflow diagram: application development (source → application), execution, monitoring tools instrumenting the application and collecting performance data and metrics, run-time performance analysis by DMA (Dynamic Monitor and Analyzer) reporting problems and their root causes to the user, modifications, tuning.]

  8. Performance analysis and tuning: dynamic tuning. [Workflow diagram: application development (source → application), execution under MATE (Monitoring, Analysis and Tuning Environment): monitoring tools instrument the application and collect events and performance data, performance analysis identifies problems/solutions, and tuning applies the modifications automatically at run-time.]

  9. Performance analysis and tuning: what can be tuned in an application? [Layer diagram, top to bottom: application code, API, framework code, API, libraries code, OS API, operating system kernel, hardware.]

  10. Performance analysis and tuning: knowledge representation • Measure points – where the instrumentation must be inserted to provide measurements • Performance model – determines the minimal execution time of the entire application; formulas and conditions for optimal behavior (measurements → optimal values) • Tuning points/actions/synchronization – what and when can be changed in the application: point – element that may be changed; action – what operation to invoke on a point; synchronization – when a tuning action can be invoked to ensure application correctness

  11. Performance analysis and tuning. [Diagram: the layer stack (application code, API, framework code, API, libraries code, OS API, operating system kernel, hardware) during execution, with MATE attached: monitoring uses the measure points, performance analysis uses the performance model, and tuning uses the tuning points, actions and synchronization.]

  12. Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work

  13. Post-mortem performance analysis: the Kappa-Pi tool • Knowledge-Based Automatic Parallel Program Analyser for Performance Improvement • Trace analyzer that automatically detects bottlenecks, relates them to the source code and provides actionable suggestions to the user • Version 1: rule-based inference engine; hard-coded performance problems; focused on the analysis of inefficiency intervals; basic source code analysis

  14. Post-mortem performance analysis: Kappa-Pi 2 • More recent development; works with PVM and MPI • Declarative catalog of communication problems (XML) • Problems declared as structured event patterns • Detection based on decision trees • Statistical classification of problem instances • Suggestions guided by static source code analysis

  15. Post-mortem performance analysis: Kappa-Pi 2 example. [Decision-tree diagram over events e1–e4: root node (late sender & blocked sender), intermediate nodes (late sender), leaf node (blocked sender); R denotes a receive call, S a send call.] Resulting suggestion: • [Late sender] situation found in tasks [0, 1] and [1, 2]; the sender primitive call can be initiated earlier (no data dependences) than the current call on line [35] in file [ats_ls.c]
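To make the suggestion concrete, here is a minimal sketch of the kind of code pattern behind a late-sender report (hypothetical code, not the original ats_ls.c): the receiver blocks in MPI_Recv while the sender is still busy with computation that does not touch the message, so the send could be issued earlier.

```cpp
// Illustrative late-sender pattern (hypothetical code, not taken from ats_ls.c).
#include <mpi.h>
#include <vector>

static void heavy_local_work(std::vector<double>& v) {
    for (double& x : v) x = x * 1.0001 + 0.5;   // computation independent of 'msg'
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<double> work(1 << 20, 1.0);
    double msg = 42.0;

    if (rank == 1) {                      // sender
        heavy_local_work(work);           // receiver waits during all of this
        MPI_Send(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);   // <-- late sender
        // Fix suggested by the analysis: 'msg' does not depend on heavy_local_work,
        // so the send (or an MPI_Isend) can be issued before the computation.
    } else if (rank == 0) {               // receiver blocks early
        MPI_Recv(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```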

  16. Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work

  17. Run-time performance analysis DMA: Dynamic Monitor and Analyzer • Primary objective • Develop a tool that is able to analyze the performance of parallel applications, detect bottlenecks and explain their causes • Our approach • Dynamic on-the-fly analysis • Automatic extraction of application structure and patterns • Root-cause analysis based on causality paths • Declarative problem catalog for explanations (performance properties) • Tool primarily targeted at MPI-based parallel programs and communication problems

  18. Automatic performance analysis DMA: Application structure and patterns • Task Activity Graph (TAG) • Constructed dynamically and incrementally at run-time for each task • TAG nodes reflect pre-selected activities and events (e.g. MPI function calls, relevant task events and selected loops) • Edges represent the sequential flow of execution (happens-before relation) • Aggregated timers/counters provide a holistic view of task behavior and allow for high-level bottleneck identification • Other metrics can be attached to nodes/edges
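To make the TAG structure concrete, here is a minimal sketch assuming a hypothetical C++ representation (TaskActivityGraph, TagNode, TagEdge and their fields are invented for illustration; they are not the actual DMA data structures): each node aggregates timers and counters for one instrumented activity, and edges record the observed happens-before transitions.

```cpp
// Hypothetical sketch of a Task Activity Graph (TAG); not the actual DMA data structures.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct TagEdge {
    uint64_t to_node;         // target activity node
    uint64_t traversals = 0;  // how many times this control-flow transition was taken
};

struct TagNode {
    std::string activity;     // e.g. "MPI_Recv@solver.c:120" or "loop@main.c:88"
    uint64_t    executions = 0;   // aggregated counter
    double      total_time = 0;   // aggregated timer (seconds spent inside the activity)
    std::vector<TagEdge> out;     // happens-before successors observed so far
};

class TaskActivityGraph {
public:
    // Called from the instrumentation when control moves between activities:
    // updates the counters/timers and creates nodes/edges the first time they are seen.
    void record(uint64_t from, uint64_t to, double elapsed_in_from) {
        TagNode& n = nodes_[from];            // created on first use
        n.executions++;
        n.total_time += elapsed_in_from;
        for (TagEdge& e : n.out)
            if (e.to_node == to) { e.traversals++; return; }
        n.out.push_back(TagEdge{to, 1});
    }
private:
    std::map<uint64_t, TagNode> nodes_;       // grown incrementally at run-time
};
```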

  19. Automatic performance analysis DMA: Global model (GTAG) • Individual TAGs connected by message edges (P2P, collective) provide a global application model • This can be done at run-time: the sender's call-path id is sent with every MPI message (by dynamically extending MPI data types) • The TAG allows for the detection of communication bottlenecks (e.g. wait times such as late sender, late receiver) • Other classes of bottlenecks (CPU, I/O) are visible as edge latencies
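As an illustration of the piggybacking idea (the slide says DMA attaches the id by dynamically extending MPI data types; the sketch below instead uses the standard PMPI interposition mechanism and MPI_Pack purely to show the concept, and callpath_id() is a hypothetical helper): every intercepted point-to-point send appends the sender's call-path id so the receiver side can link the matching TAG nodes.

```cpp
// Illustrative piggybacking of a sender call-path id on point-to-point messages.
// Uses PMPI interposition; the real DMA approach (dynamic extension of MPI data
// types) differs in how the id is attached to the message.
#include <mpi.h>
#include <cstring>
#include <vector>

// Hypothetical helper: would return an id identifying the current call path.
static int callpath_id() { return 0; }

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype dt,
                        int dest, int tag, MPI_Comm comm) {
    int payload_size = 0, position = 0;
    MPI_Pack_size(count, dt, comm, &payload_size);
    std::vector<char> packed(payload_size + sizeof(int));

    MPI_Pack(buf, count, dt, packed.data(), (int)packed.size(), &position, comm);
    int id = callpath_id();
    std::memcpy(packed.data() + position, &id, sizeof(int));   // append the id
    position += (int)sizeof(int);

    // Forward to the real MPI implementation; a matching receiver-side wrapper
    // would unpack the payload, strip the trailing id and link sender/receiver
    // TAG nodes across tasks.
    return PMPI_Send(packed.data(), position, MPI_PACKED, dest, tag, comm);
}
```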

  20. Automatic performance analysis DMA: Causality paths • How to explain the causes of latencies? For example: why is the sender late? • The cause of the waiting time between two nodes lies in the differences between their execution paths • We can track the activities before the problem occurs on both nodes • But the TAG is too generic to identify these activities, as there might be different execution paths • So we need to detect the executed paths that lead to problems • We call them causality paths • We consider two sources of causality: • Sequential flow • Message flow

  21. Automatic performance analysis DMA: Root-cause analysis • Identify bottlenecks with TAG • Profile edges for non-communication problems • Search for causes of communication latencies (nodes) by following message edges backwards (sender-receiver pairs) • Detect causality paths in sender to explain latencies in receiver • There might be multiple explanations • Work in progress

  22. Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work

  23. Performance models: framework-based applications • Frameworks help users to develop parallel applications by providing APIs that hide low-level details (M/W, pipeline, divide and conquer) • We can extract the knowledge given by frameworks to define performance models for automatic and dynamic performance tuning (MATE) • Examples: M/W framework (UAB), eSkel (the Edinburgh Skeleton Library), llc (La Laguna Compiler). [Diagram: the layer stack (application code, API, framework code, API, libraries code, OS API, operating system kernel, hardware) with the measure points, performance model and tuning points/actions/synchronization attached at the framework level.]

  24. Performance models: framework-based application example (M/W application) • Measure points: entry/exit of CompFunc(), message size, entry/exit of each iteration • Performance model: targets the problems to be tuned – load imbalance and an improper number of workers • Tuning point/action/synchronization: change the variable Nopt (the number of workers)
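The slide lists only the ingredients of the model; as a hedged illustration of what such a performance model can look like, consider a generic, simplified master/worker model (not necessarily the exact expression used in MATE): each iteration distributes a total compute load W over n workers, and serving each worker costs the master a latency λ plus the message volume V over bandwidth B.

```latex
% Simplified master/worker model (illustrative; not necessarily MATE's exact formula).
% T(n): predicted iteration time with n workers.
T(n) \;\approx\; \underbrace{\frac{W}{n}}_{\text{computation per worker}}
      \;+\; \underbrace{n\left(\lambda + \frac{V}{B}\right)}_{\text{master-side communication}},
\qquad
\frac{dT}{dn} = 0 \;\Rightarrow\;
N_{\mathrm{opt}} \;=\; \sqrt{\frac{W}{\lambda + V/B}}
```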

  25. Performance models: scientific libraries modeling • Libraries help users to solve domain-specific problems • We use the knowledge learned from scientific programming libraries to define performance models for automatic and dynamic performance tuning (MATE) • Example: the mathematical library PETSc (Portable, Extensible Toolkit for Scientific Computation). [Diagram: the layer stack (application code, API, framework code, API, libraries code, OS API, operating system kernel, hardware) with the measure points, performance model and tuning points/actions/synchronization attached at the library level.]

  26. Performance models: scientific libraries modeling with PETSc • Study of different combinations of matrix types, preconditioners and solvers. [PETSc architecture diagram, by level of abstraction: application code and other higher-level solvers; PDE solvers (TS time stepping, SNES, SLES); mathematical algorithms (KSP, PC, Draw); data structures (matrices, vectors, index sets); BLAS, LAPACK, MPI.]

  27. Performance models: scientific libraries modeling with PETSc • PETSc supports different storage methods (dense, sparse, compressed, etc.) – no single data structure is appropriate for all problems • Linear system calculations involve a matrix preconditioner (PC) and a Krylov subspace method (KSP) • Different mathematical algorithms exist for the preconditioner (Jacobi, LU, etc.) and the KSP (GMRES, CG, etc.) • The storage method and algorithm selection can significantly affect performance • We build a knowledge base that maps problems (different classes of matrices, matrix sizes and other variables) to the best configurations • An input matrix is compared to the knowledge base using the K-Nearest-Neighbours algorithm to select the most similar problem and its configuration • Work still in progress
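As a sketch of the configuration space being modeled, the minimal PETSc program below (written against a recent PETSc API; the concrete choices MATAIJ, Jacobi and GMRES are arbitrary placeholders) shows the three parameters such a knowledge base would select – storage format, preconditioner and Krylov method – and where they are set.

```cpp
// Minimal PETSc example: the storage format (MATAIJ), preconditioner (Jacobi)
// and Krylov method (GMRES) are the kind of parameters a knowledge base could select.
#include <petscksp.h>

int main(int argc, char** argv) {
    PetscInitialize(&argc, &argv, nullptr, nullptr);

    const PetscInt n = 100;
    Mat A; Vec x, b;

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetType(A, MATAIJ);                 // sparse storage; MATDENSE, MATSBAIJ, ... are alternatives
    MatSetUp(A);
    PetscInt lo, hi;
    MatGetOwnershipRange(A, &lo, &hi);
    for (PetscInt i = lo; i < hi; ++i)     // trivial diagonal system, just to have something to solve
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    KSP ksp; PC pc;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPGMRES);             // Krylov method: KSPCG, KSPBCGS, ... are alternatives
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCJACOBI);               // preconditioner: PCLU, PCILU, ... are alternatives
    KSPSetFromOptions(ksp);                // still overridable at run-time via -ksp_type / -pc_type
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}
```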

  28. Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work

  29. MATE. [Diagram: the layer stack (application code, API, framework code, API, libraries code, OS API, operating system kernel, hardware) during execution, with performance models at every level – models for specific applications, for framework-based applications, for scientific libraries and for any application – and MATE connecting monitoring (measure points), performance analysis (performance model) and tuning (tuning points, actions, synchronization).]

  30. MATE: tunlets specification • Tunlets: a mechanism for specifying a performance model • Originally, each tunlet is written as a C++ library. [Diagram: the MATE Analyzer loads tunlets – each bundling a performance model, measure points and tuning points – coming from frameworks/scientific libraries, the application, or the user.]
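Since the slides do not show the tunlet API itself, the following is a purely hypothetical C++ sketch of how "a performance model packaged as a tunlet" could be organized (Tunlet, Event, onEvent() and requestTuning() are invented names, not the real MATE interface): the tunlet declares its measure points, evaluates its model on incoming measurements, and requests a tuning action when the computed optimum differs from the current configuration.

```cpp
// Hypothetical tunlet sketch; the interface names below are illustrative,
// not the real MATE API.
#include <cmath>
#include <string>
#include <vector>

struct Event { std::string point; double value; };   // a measurement from a measure point

class Tunlet {
public:
    // Measure points this tunlet needs instrumented (e.g. entry/exit of CompFunc()).
    virtual std::vector<std::string> measurePoints() const = 0;
    // Performance model: consume measurements, decide whether to tune.
    virtual void onEvent(const Event& e) = 0;
    virtual ~Tunlet() = default;
protected:
    // Ask MATE to apply a tuning action (e.g. change the Nopt variable).
    void requestTuning(const std::string& point, double newValue) { /* ... */ }
};

class NumberOfWorkersTunlet : public Tunlet {
    double totalWork = 0;            // accumulated compute time W per iteration
    double perTaskOverhead = 1e-3;   // assumed per-worker communication cost c
    int currentWorkers = 4;
public:
    std::vector<std::string> measurePoints() const override {
        return {"CompFunc.entry", "CompFunc.exit", "Iteration.exit"};
    }
    void onEvent(const Event& e) override {
        if (e.point == "CompFunc.exit") totalWork += e.value;       // accumulate compute time
        if (e.point == "Iteration.exit") {
            int nopt = (int)std::lround(std::sqrt(totalWork / perTaskOverhead));
            if (nopt > 0 && nopt != currentWorkers)
                requestTuning("Nopt", nopt);                        // tuning action
            totalWork = 0;                                          // reset for next iteration
        }
    }
};
```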

  31. MATE: tunlets specification – abstractions • Simplify tunlet development: • Tunlet Specification Language • Automatic Tunlet Generator

  32. MATE scalability • Scalability problems with centralized analysis • The volume of events gets too high. [Diagram: machines 1…N each running pvmd, an AC and application tasks instrumented with DMLib; all events flow to a single Analyzer on a dedicated analyzer machine, which sends tuning actions back.]

  33. MATE scalability • Distributed-hierarchical collecting-preprocessing approach • Distribution of event collection • Preprocessing of certain calculations: cumulative and comparative operations • Cooperating processes: • Global Analyzer – performance model evaluation • Collector-Preprocessor (CP) – event collection and preprocessing. [Diagram: tasks with DMLib on machines 1…n send events to CPs on machines q and r, which forward preprocessed data to the Global Analyzer (holding the tunlets) on machine p; tuning actions flow back to the ACs.]

  34. MATE G-MATE: MATE on the Grid • New requirements: • Transparent process tracking and control • The AC should follow the application process to any cluster • Inter-cluster communication with the analyzer • Lower inter-cluster event collection overhead • Inter-cluster communications generally have higher latency and lower bandwidth • Remote trace events should be aggregated into trace event abstractions, saving bandwidth (Collector-Preprocessor)

  35. MATE G-MATE: Transparent process tracking • Alternative 1: system service – a machine or cluster can have MATE enabled as a daemon that detects the startup of new processes • Alternative 2: application wrapping – the AC can be packaged together with the application binary. [Diagram: at job submission, the AC on a remote machine either detects a new task or creates it, instruments it with Dyninst, loads DMLib into it, takes control of it and subscribes to the Analyzer.]

  36. G-MATE: Analyzer approaches • Central analyzer – collects performance data at a central site • Hierarchical analyzer – local analyzers process local data, generate abstract metrics and send them to global analyzers • Distributed analyzer – each analyzer evaluates its local performance data and abstracted global data, cooperating with the rest of the analyzers

  37. Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work

  38. Future work • DMA – new methodology under development • Development of performance models: • for applications based on different frameworks • Combine both strategies: load balancing and number of workers • Replication/elimination of pipeline stages • for PETSc-based applications – under construction • for black-box applications • for grid applications – under construction • Development of tuning techniques

  39. Future work • MATE’s analysis improvement • Performance model of the number of CPs • Hierarchy at the CPs level as the number of machines increases • Root cause analysis? • Instrumentation evaluation • Co-execution of tunlets • Finishing MATE for MPI applications

  40. Thank you very much 21-08-2007 Anna Morajko Anna.Morajko@uab.es
