
Clustering Technology


Presentation Transcript


  1. Clustering Technology

  2. Clustering Schematic

  3. Cluster Components • Cluster hardware (processor, main memory, hard disk, …) • Cluster network (Fast Ethernet, Gigabit Ethernet, Myrinet, …) • Cluster Software (operating system, programming environment, …)

  4. Cluster Operating System Characteristics
• Manageability: An absolute necessity is remote and intuitive system administration; this is often associated with a Single System Image (SSI), which can be realized on different levels, ranging from a high-level set of special scripts down to real state-sharing at the OS level.
• Stability: The most important characteristics are robustness against crashing processes, failure recovery by dynamic reconfiguration, and usability under heavy load.
• Performance: The performance-critical parts of the OS, such as memory management, the process and thread scheduler, file I/O and the communication protocols, should work as efficiently as possible.
• Extensibility: The OS should allow the easy integration of cluster-specific extensions, which will most likely be related to inter-node cooperation. A good example of this is the MOSIX system, which is based on Linux.
• Scalability: The scalability of a cluster is mainly influenced by the properties of the contained nodes and is dominated by the performance characteristics of the interconnect.
• Support: Many intelligent and technically superior approaches in computing have failed for lack of support in its various aspects: what matters is which tools, hardware drivers and middleware environments are available.
• Heterogeneity: Clusters provide a dynamic and evolving environment in that they can be extended or updated with standard hardware as the user needs to or can afford. Therefore, a cluster environment does not necessarily consist of homogeneous hardware.

  5. Cluster Solution
• The cluster nodes are typical PCs (this choice lowers the cost-to-performance ratio).
• Fast Ethernet was used as the cluster interconnect (17 nodes connected by a Fast Ethernet infrastructure).
• Linux was chosen as the cluster OS because it covers our needs as far as possible (we can recompile and tune the kernel to meet our requirements).
• VMware software was used to virtualize our computing resources.
• The Message Passing Interface (MPI) was selected as the parallel programming environment.
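
As a quick check that the MPI environment on such a cluster works end to end, a minimal C program along the following lines can be compiled with mpicc and started with mpirun (this is a generic sketch, not code from the presentation):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start the MPI runtime       */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes   */
    MPI_Get_processor_name(name, &len);      /* node this process runs on   */

    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```

Launching it with, for example, mpirun -np 17 ./mpi_hello should print one line per process across the 17-node cluster.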

  6. Configuring the Cluster
• Configuring the cluster nodes (network configuration, package installation, …).
• Optimizing and securing the Linux OS to extract the maximum utilization from the cluster resources.
• Cluster administration (Samba service, ssh, rlogin, rcp, administration scripts, …).

  7. Algorithm Identification

  8. Integer Factorization

  9. Sieving

  10. Trial Division
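
The slide body is an image and is not reproduced in the transcript; as a baseline for comparison with the sieving methods, a minimal trial-division routine in C might look like this (purely illustrative):

```c
#include <stdio.h>

/* Print the factorization of n by trial division.  Illustrative only:
 * the running time grows with the square root of n, which is why
 * sieving-based methods (QS, MPQS, SIQS) are needed for large inputs. */
static void trial_division(unsigned long long n)
{
    for (unsigned long long d = 2; d * d <= n; d++) {
        while (n % d == 0) {        /* divide out each prime factor */
            printf("%llu ", d);
            n /= d;
        }
    }
    if (n > 1)                      /* remaining cofactor is prime */
        printf("%llu", n);
    printf("\n");
}

int main(void)
{
    trial_division(8051ULL);        /* prints 83 97 */
    return 0;
}
```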

  11. QS Algorithm
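
The algorithm itself appears on the slide as an image; for orientation, the textbook relation that QS (and its MPQS/SIQS refinements) works toward is the congruence of squares (standard material, not transcribed from the slide):

```latex
% Congruence of squares:
x^{2} \equiv y^{2} \pmod{n}, \qquad x \not\equiv \pm y \pmod{n}
\;\Longrightarrow\;
\gcd(x - y,\, n) \ \text{is a nontrivial factor of } n .

% QS collects such pairs by sieving the values
Q(x) = \bigl(x + \lceil \sqrt{n}\,\rceil\bigr)^{2} - n
% for smoothness over a factor base and combining the smooth relations
% with linear algebra over GF(2).
```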

  12. MPQS Algorithm
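
Again the slide body is graphical; the defining idea of MPQS in its textbook form (assuming the presentation follows the standard construction) is to sieve many short intervals, one per polynomial:

```latex
Q_{a,b}(x) = (a x + b)^{2} - n ,
\qquad b^{2} \equiv n \pmod{a}, \quad 0 < b < a ,
% so that a divides every Q_{a,b}(x) and the sieved values stay small;
% switching to a fresh pair (a, b) is cheap compared with extending a
% single QS sieving interval.
```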

  13. SIQS Algorithm
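
For SIQS, the textbook construction of the ‘a’ coefficient, on which the later data-decomposition slides rely, is roughly the following (standard material, not transcribed from the slide):

```latex
a = q_{1} q_{2} \cdots q_{k}
\quad (q_{i}\ \text{distinct factor-base primes}),
\qquad b^{2} \equiv n \pmod{a} ,
% where the Chinese Remainder Theorem yields 2^{k-1} distinct admissible
% values of b, i.e. 2^{k-1} polynomials Q_{a,b}(x) = (ax + b)^2 - n per
% choice of a, switchable by cheap incremental updates.
```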

  14. Algorithm Complexity Improvement: QS → MPQS → SIQS → NFS
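
For reference, the standard heuristic complexity estimates behind this progression, in the usual L-notation, are given below; MPQS and SIQS share the QS exponent and mainly improve the constant factors in practice, while NFS lowers the exponent from 1/2 to 1/3 (textbook results, not taken from the slides):

```latex
L_{n}[u, c] = \exp\!\bigl( (c + o(1)) \,(\ln n)^{u} (\ln\ln n)^{1-u} \bigr)

\text{QS, MPQS, SIQS: } L_{n}\!\left[\tfrac{1}{2},\, 1\right]
\qquad
\text{NFS: } L_{n}\!\left[\tfrac{1}{3},\, \bigl(\tfrac{64}{9}\bigr)^{1/3} \approx 1.923\right]
```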

  15. Optimizing Serial Implementation

  16. Optimizing Serial Implementation
• Algorithm level optimizations (the most important step in optimizing serial code is to reduce the complexity of the algorithm as far as possible).
• Code level optimizations (in this phase we use techniques such as loop unrolling, function inlining and gcc builtins, described on the following slides).
• Compiler level optimizations (gcc compiler optimizations).

  17. Algorithm Optimization
• In computation-intensive software programs, we will often find that 99% of the CPU time is spent in the innermost loop.
• Identifying the most critical part of your software (with a profiler) is therefore necessary if you want to improve the speed of computation.
• Study the algorithm used in the critical part of your code and see if it can be improved.

  18. Innermost Loop (Conventional Sieving)
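
The code on this slide is an image; a minimal sketch of what a conventional sieving inner loop typically looks like is given below (the function name, array layout and types are illustrative assumptions, not the presentation's code):

```c
/* Conventional sieving inner loop (illustrative sketch).
 * For each factor-base prime p with root r of Q(x) mod p, add an
 * approximate log(p) to every sieve location in arithmetic progression.
 * sieve[] is later scanned for entries exceeding a threshold, and only
 * those candidates are trial-divided over the factor base. */
void sieve_prime(unsigned char *sieve, long m,
                 long p, long r, unsigned char logp)
{
    for (long i = r; i < m; i += p)   /* every p-th array element      */
        sieve[i] += logp;             /* scattered, cache-unfriendly writes */
}
```

The scattered writes in this loop are exactly what the memory-access-time and optimized-sieving slides that follow are concerned with.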

  19. Pentium4 Memory Access Times

  20. An Optimized Sieving Approach (1)

  21. An Optimized Sieving Approach (2)

  22. Code level optimization techniques
• Loop unrolling (unrolling amortizes the branch overhead, since it eliminates branches and some of the code that manages induction variables; it also allows you to schedule (or pipeline) the loop more aggressively to hide latencies).
• Function inlining (we can instruct the compiler to insert the body of a function into the code of its callers at the point where the call would otherwise be made; inlining removes the function-call overhead and exposes more opportunities for optimization).
• gcc inline assembly (assembly routines written as inline functions; they are handy, fast and very useful for the most critical code).
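
As a small illustration of the inlining point (a generic sketch, not the presentation's code; the helper name is hypothetical), gcc expands a static inline helper directly into the hot loop, so the per-iteration call overhead disappears:

```c
/* Tiny helper called once per sieve location; without inlining the
 * call/return overhead would dominate its one-add body. */
static inline unsigned char log_add(unsigned char acc, unsigned char logp)
{
    return (unsigned char)(acc + logp);
}

void sieve_prime_inlined(unsigned char *sieve, long m,
                         long p, long r, unsigned char logp)
{
    for (long i = r; i < m; i += p)
        sieve[i] = log_add(sieve[i], logp);   /* expanded in place by gcc */
}
```

If gcc declines to inline a helper on its own, it can additionally be forced with __attribute__((always_inline)).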

  23. • Builtin gcc functions (__builtin_prefetch) (this function is used to minimize cache-miss latency by moving data into a cache before it is accessed).
• Using the “unsigned int” type only (use 32-bit integers instead of smaller 16-bit or 8-bit integers to reduce the machine cycles needed).
• Division-free arithmetic (replace divisions by multiplications with precomputed reciprocals).
• Releasing allocated memory blocks when they are no longer needed.
• …
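
Two of these points sketched in C (illustrative only; the prefetch distance and the reciprocal-based remainder are assumptions, not the presentation's code):

```c
/* __builtin_prefetch: hint the CPU to pull a future sieve location into
 * cache while the current one is updated.  Prefetching is only a hint,
 * so an address beyond the array end does not fault on x86. */
void sieve_prime_prefetch(unsigned char *sieve, long m,
                          long p, long r, unsigned char logp)
{
    for (long i = r; i < m; i += p) {
        __builtin_prefetch(&sieve[i + 8 * p], 1, 0);  /* write, low locality */
        sieve[i] += logp;
    }
}

/* Division-free remainder: replace x % p by a multiplication with the
 * precomputed reciprocal recip_p = 1.0 / p, plus an off-by-one fix-up. */
unsigned int mod_by_reciprocal(unsigned int x, unsigned int p, double recip_p)
{
    unsigned int q = (unsigned int)((double)x * recip_p);  /* approx x / p */
    long r = (long)x - (long)q * (long)p;
    if (r < 0)         r += p;   /* quotient was one too high */
    if (r >= (long)p)  r -= p;   /* quotient was one too low  */
    return (unsigned int)r;
}
```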

  24. Loop unrolling (code level opt.)
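
The unrolled loop on the original slide is shown as an image; a generic four-way unrolling of the sieving loop would look roughly like this (the unrolling factor is an assumption):

```c
/* Four-way unrolled sieving loop (illustrative).  The four updates per
 * iteration share one loop-counter update and one branch, and give the
 * compiler more freedom to schedule the loads and stores. */
void sieve_prime_unrolled(unsigned char *sieve, long m,
                          long p, long r, unsigned char logp)
{
    long i = r;
    for (; i + 3 * p < m; i += 4 * p) {
        sieve[i]         += logp;
        sieve[i + p]     += logp;
        sieve[i + 2 * p] += logp;
        sieve[i + 3 * p] += logp;
    }
    for (; i < m; i += p)            /* handle the remaining iterations */
        sieve[i] += logp;
}
```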

  25. gcc compiler optimizations

  26. Parallel Algorithm Design

  27. Parallel Algorithm Design Methodology
• Partitioning (domain decomposition or functional decomposition)
• Communication
• Agglomeration
• Mapping

  28. Methodical Design (1)
• Partitioning: The computation that is to be performed and the data operated on by this computation are decomposed into small tasks. Practical issues such as the number of processors in the target computer are ignored, and attention is focused on recognizing opportunities for parallel execution.
• Communication: The communication required to coordinate task execution is determined, and appropriate communication structures and algorithms are defined.
• Agglomeration: The task and communication structures defined in the first two stages of a design are evaluated with respect to performance requirements and implementation costs. If necessary, tasks are combined into larger tasks to improve performance or to reduce development costs.
• Mapping: Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs. Mapping can be specified statically or determined at runtime by load-balancing algorithms.

  29. Methodical Design (2)

  30. Load Balancing Mechanism
• A master/slave mechanism was used for load balancing.
• The master node sends the initial data and assigns the jobs to the slave nodes.
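
A skeletal master/slave loop in MPI, to illustrate the mechanism (the tags, message contents and termination convention here are assumptions, not the presentation's actual protocol):

```c
#include <mpi.h>

#define TAG_WORK 1
#define TAG_DONE 2

/* Master: hand out job indices on demand.  Each incoming message is a
 * slave's previous result (or its initial "ready" value); results are
 * not stored in this sketch. */
void master(int nslaves, int njobs)
{
    MPI_Status st;
    int result, next = 0, active = nslaves;

    while (active > 0) {
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (next < njobs) {
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD);
            next++;
        } else {
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD);
            active--;
        }
    }
}

/* Slave: report the last result, receive the next job, stop on TAG_DONE. */
void slave(void)
{
    MPI_Status st;
    int job, result = 0;                   /* initial "ready" message */

    for (;;) {
        MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_DONE)
            break;
        result = job * job;                /* placeholder for real sieving */
    }
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) master(size - 1, 100);  /* 100 hypothetical jobs */
    else           slave();
    MPI_Finalize();
    return 0;
}
```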

  31. Data Decomposition Algorithm
• The SPMD model is used (an SPMD program that creates exactly one task per processor).
• In SIQS we can sieve with multiple polynomials.
• To generate these polynomials, we must first compute the factors of the ‘a’ coefficients.
• Sieving with separate ‘a’ values can be done independently on different processors.
• We therefore need to build the ‘a’ values on the individual tasks without any coordination.
• Duplicated ‘a’ values lead to weak concurrency (the same work is done twice).
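
One coordination-free way to meet this requirement, sketched below, is to derive each task's polynomial indices deterministically from its rank so that no two nodes can ever build the same ‘a’ value (this particular scheme is an illustration, not necessarily the presentation's algorithm; generate_a and sieve_with_a are hypothetical helpers):

```c
/* SPMD decomposition of the polynomial space: task `rank` out of `size`
 * handles indices rank, rank + size, rank + 2*size, ...  Every index maps
 * to a distinct 'a' coefficient, so the tasks need no coordination and
 * never duplicate a polynomial. */
void sieve_my_polynomials(int rank, int size, long n_polynomials)
{
    for (long k = rank; k < n_polynomials; k += size) {
        /* generate_a(k);   build the k-th 'a' from factor-base primes */
        /* sieve_with_a();  sieve all b values belonging to this 'a'   */
    }
}
```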

  32. Data Decomposition Algorithm (1) (Initialization Data)

  33. Data Decomposition Algorithm (2) (Determining the number and size of the ‘a’ values’ factors)

  34. Data Decomposition Algorithm (2) (Determining the number and size of the ‘a’ values’ factors)

  35. Data Decomposition Algorithm (3) (Computing the factors of the ‘a’ values)

  36. Data Decomposition Algorithm (3) (Computing the factors of the ‘a’ values)

  37. Master Node Algorithm

  38. Slave Node Algorithm

  39. Double Large Prime Variation Effects

  40. Master Node Algorithm (1) (Improved version)

  41. Master Node Algorithm (2) (Improved version)

  42. Slave Node Algorithm (Improved version)

  43. Cluster Benchmarks

  44. Performance Evaluation
• Speedup: Amdahl’s law gives the ideal speedup S_p of a program with serial fraction f running on p nodes: S_p = 1 / (f + (1 − f) / p).
• Efficiency: the efficiency E_p of a p-node computation with speedup S_p is given by: E_p = S_p / p.
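
A small worked example with hypothetical timings (not measurements from the benchmark slides): with a sequential time of 400 s and a 16-node time of 40 s,

```latex
S_{16} = \frac{T_{1}}{T_{16}} = \frac{400\,\mathrm{s}}{40\,\mathrm{s}} = 10 ,
\qquad
E_{16} = \frac{S_{16}}{16} = 0.625 .
```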

  45. Total Execution Time

  46. Sieving Execution Time

  47. Total Speedup

  48. Sieving Speedup

  49. Total Efficiency

  50. Sieving Efficiency
