
(Mis) Understanding the NUMA Memory System Performance of Multithreaded Workloads


Presentation Transcript


  1. (Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads Zoltán Majó, Thomas R. Gross Department of Computer Science, ETH Zurich, Switzerland

  2. NUMA-multicore memory system [Figure: two processors, each with 8 cores, a last-level cache, a memory controller (MC), an interconnect link (IC), and locally attached DRAM; a thread T sees the following access latencies] LOCAL_CACHE: 38 cycles, REMOTE_CACHE: 186 cycles, LOCAL_DRAM: 190 cycles, REMOTE_DRAM: 310 cycles. All data based on experimental evaluation of Intel Nehalem (Hackenberg [MICRO ’09], Molka [PACT ’09])

  3. Experimental setup • Three benchmark programs from PARSEC • streamcluster, ferret, and dedup • Increased input sizes → more pressure on the memory system • Intel Westmere • 4 processors, 32 cores • 3 execution scenarios • w/o NUMA: Sequential • w/o NUMA: Parallel (8 cores / 1 processor) • w/ NUMA: Parallel (32 cores / 4 processors)

  4. Execution scenarios [Figure: the four-processor machine with cores 0-31; in the NUMA scenario, threads T run on all 32 cores, each processor with its own last-level cache, MC, IC, and DRAM]

  5. Parallel performance

  6. CPU cycle breakdown dedup: good scaling (26X) streamcluster: poor scaling (11X)

  7. Outline • Introduction • Performance analysis • Data locality • Prefetcher effectiveness • Source-level optimizations • Performance evaluation • Language support for NUMA systems • Conclusions

  8. Data locality • Page placement policy • Commonly used policy: first-touch (default in Linux) • Measurement: data locality of the benchmarks • Data locality = Remote memory references / Total memory references [%] • Read transfers measured at the processor’s uncore
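A tiny worked example of the metric on this slide, with made-up counter values (the variable names and numbers are illustrative, not measured data): the remote fraction of read transfers seen at one processor's uncore, as a percentage; a higher value means worse locality.

#include <stdio.h>

int main(void)
{
    /* Hypothetical raw counts of read transfers measured at the uncore. */
    unsigned long remote_refs = 350000000UL;
    unsigned long total_refs  = 1000000000UL;

    double remote_pct = 100.0 * (double)remote_refs / (double)total_refs;
    printf("remote / total memory references = %.1f%%\n", remote_pct);  /* 35.0% */
    return 0;
}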

  9. NUMA-multicore memory system [Figure: the four-processor system (cores 0-31), each processor with its own last-level cache, MC, IC, and DRAM; memory accesses fall into the LOCAL_CACHE, REMOTE_CACHE, LOCAL_DRAM, and REMOTE_DRAM categories]

  10. Data locality

  11. Profiling memory accesses • Data address profiling • Based on hardware-performance monitoring • Consider only heap accesses [Figure: Page 0 resides in Processor 0’s DRAM, Page 1 in Processor 1’s DRAM; profile: Page 0: 1000 accesses by Processor 0, Page 1: 3000 accesses by Processor 1]

  12. Profiling memory accesses • Data address profiling • Based on hardware-performance monitoring • Consider only heap accesses [Figure: the profile now also contains Page 2: 4000 accesses by Processor 0 and 5000 accesses by Processor 1, i.e., a page touched from both processors]

  13. Profiling memory accesses • Data address profiling • Based on hardware-performance monitoring • Consider only heap accesses [Figure: same profile as on the previous slide; Page 2 is shared between the two processors]
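A minimal sketch of the aggregation step behind these slides (the sample format, the heap bounds, and the toy input are assumptions, not the authors' profiler): it turns sampled memory accesses (data address plus the processor that issued the access) into a per-page, per-processor count, considering only heap addresses.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12            /* 4 KiB pages */
#define NPROC      2             /* two processors, as in the figure */

struct sample {
    uintptr_t addr;              /* sampled data address */
    int       processor;         /* processor that issued the access */
};

/* Count accesses per (page, processor); the table here is a toy fixed array. */
static void profile(const struct sample *s, int nsamples,
                    uintptr_t heap_start, uintptr_t heap_end)
{
    enum { MAXPAGES = 1024 };
    static long count[MAXPAGES][NPROC];
    uintptr_t base = heap_start >> PAGE_SHIFT;

    for (int i = 0; i < nsamples; i++) {
        if (s[i].addr < heap_start || s[i].addr >= heap_end)
            continue;                            /* consider only heap accesses */
        uintptr_t page = (s[i].addr >> PAGE_SHIFT) - base;
        if (page < MAXPAGES)
            count[page][s[i].processor]++;
    }

    for (uintptr_t p = 0; p < MAXPAGES; p++)
        for (int c = 0; c < NPROC; c++)
            if (count[p][c])
                printf("Page %lu: %ld accesses by Processor %d\n",
                       (unsigned long)p, count[p][c], c);
}

int main(void)
{
    /* Toy input: a few samples that fall on three heap pages. */
    struct sample s[] = { { 0x1000, 0 }, { 0x2000, 1 }, { 0x3000, 0 }, { 0x3000, 1 } };
    profile(s, 4, 0x1000, 0x4000);
    return 0;
}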

  14. Inter-processor data sharing Cause of data sharing • streamcluster: data points to be clustered • ferret and dedup: in-memory databases

  15. Prefetcher performance • Experiment • Run each benchmark with the hardware prefetcher on/off • Compare performance • Causes of prefetcher inefficiency • ferret and dedup: hash-based memory access patterns • streamcluster: random shuffling

  16. streamcluster: random shuffling while (input = read_data_points()) { clusters = process(input); } Randomly shuffle the data points to increase the probability that each point is compared to each cluster.

  17. streamcluster: prefetcher effectiveness [Figure: original data layout (before shuffling): the points array references the coordinate blocks A-H in memory order, and threads T0 and T1 traverse them sequentially]

  18. streamcluster: prefetcher effectiveness [Figure: data layout after the pointer-based shuffle: the points array references the coordinate blocks A-H in random order while the blocks themselves stay in place, so threads T0 and T1 access memory non-sequentially]

  19. streamcluster: prefetcher effectiveness [Figure: the same post-shuffle layout shown for thread T0: consecutive points reference scattered coordinate blocks, which defeats the hardware prefetcher]

  20. Outline • Introduction • Performance analysis • Source-level optimizations • Prefetching • Data locality • Performance evaluation • Language support for NUMA systems • Conclusions

  21. streamcluster: Optimizing prefetching Copy-based shuffle (see the sketch below) Performance improvement over pointer-based shuffle • Westmere: 12% • Nehalem: 60% [Figure: after the copy-based shuffle the coordinate blocks themselves are reordered (G B C H F E A D), so threads T0 and T1 again scan memory sequentially]
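A minimal sketch of the difference discussed on these slides (illustrative code, not streamcluster's source; names, sizes, and the Fisher-Yates permutation are assumptions): a pointer-based shuffle permutes only an array of pointers and leaves the coordinate blocks where they are, so later passes walk memory in random order, while a copy-based shuffle reorders the blocks themselves into a new contiguous array, keeping later scans sequential and prefetcher-friendly.

#include <stdlib.h>
#include <string.h>

#define DIM 64                         /* coordinates per point (illustrative) */

typedef struct { float coord[DIM]; } point_t;

/* Fisher-Yates permutation of indices 0..n-1. */
static void permute(size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) idx[i] = i;
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
}

/* Pointer-based shuffle: only the pointers move; the data stays scattered. */
static void shuffle_pointers(point_t **pts, size_t n)
{
    size_t *idx = malloc(n * sizeof *idx);
    point_t **old = malloc(n * sizeof *old);
    memcpy(old, pts, n * sizeof *old);
    permute(idx, n);
    for (size_t i = 0; i < n; i++) pts[i] = old[idx[i]];
    free(old); free(idx);
}

/* Copy-based shuffle: the coordinate blocks are copied into a new contiguous
   array in shuffled order, so a later linear scan is sequential in memory. */
static point_t *shuffle_copies(const point_t *pts, size_t n)
{
    size_t *idx = malloc(n * sizeof *idx);
    point_t *out = malloc(n * sizeof *out);
    permute(idx, n);
    for (size_t i = 0; i < n; i++) out[i] = pts[idx[i]];
    free(idx);
    return out;
}

int main(void)
{
    enum { N = 8 };
    point_t *data = calloc(N, sizeof *data);
    point_t *ptrs[N];
    for (size_t i = 0; i < N; i++) ptrs[i] = &data[i];

    shuffle_pointers(ptrs, N);                  /* pointer-based variant */
    point_t *copied = shuffle_copies(data, N);  /* copy-based variant */

    free(copied); free(data);
    return 0;
}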

  22. Data locality optimizations Control the mapping of data and computations (a sketch of these APIs follows below): • Data placement • Supported by numa_alloc(), move_pages() • First-touch: also OK if the data is accessed from a single processor • Interleaved page placement: reduces interconnect contention [Lachaize et al. USENIX ATC ’12, Dashti et al. ASPLOS ’13] • Computation scheduling • Threads: affinity scheduling, supported by sched_setaffinity() • Loop parallelism: rely on OpenMP static loop scheduling • Pipeline parallelism: locality-aware task dispatch
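A minimal sketch (not the paper's code) of the building blocks named on this slide: explicit data placement with libnuma and thread affinity with sched_setaffinity(); the sizes and node numbers are illustrative. Build with something like gcc place.c -lnuma, assuming libnuma is installed.

#define _GNU_SOURCE
#include <numa.h>          /* numa_alloc_onnode(), numa_alloc_interleaved() */
#include <sched.h>         /* CPU_SET, sched_setaffinity() */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    size_t bytes = 64UL * 1024 * 1024;

    /* Placement option 1: put the whole region on node 0
       (useful when the data is accessed from a single processor). */
    double *local_data = numa_alloc_onnode(bytes, 0);

    /* Placement option 2: interleave pages across all nodes
       (trades locality for reduced interconnect/controller contention). */
    double *interleaved_data = numa_alloc_interleaved(bytes);

    /* Computation scheduling: pin the calling thread to CPU 0 so it runs
       on the processor whose DRAM holds local_data. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(0 /* this thread */, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* ... compute on local_data / interleaved_data ... */

    numa_free(local_data, bytes);
    numa_free(interleaved_data, bytes);
    return 0;
}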

  23. streamcluster [Figure: after the optimization, the points and coordinate blocks accessed by thread T0 are placed at Processor 0, where T0 executes, and those accessed by T1 are placed at Processor 1, where T1 executes]

  24. ferret [Figure: six-stage pipeline: Stage 1: Input, Stage 2: Segment, Stage 3: Extract, Stage 4: Index (queries the image database), Stage 5: Rank, Stage 6: Output; the threads of Stage 4 execute at both Processor 0 and Processor 1 while sharing a single image database]

  25. ferret [Figure: the optimized pipeline splits the Index stage into Index’ and Index’’: the database part placed at Processor 0 is queried by Index’ threads executing at Processor 0, and the part placed at Processor 1 by Index’’ threads executing at Processor 1; the other stages are unchanged]
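A minimal sketch of locality-aware task dispatch, as named on slide 22 (assumptions, not ferret's actual code; struct task, enqueue_on_node(), and the toy main are placeholders): each task is handed to a queue on the NUMA node that currently holds its input data, queried via get_mempolicy().

#define _GNU_SOURCE
#include <numa.h>          /* numa_available(), numa_alloc_onnode() */
#include <numaif.h>        /* get_mempolicy(), MPOL_F_NODE, MPOL_F_ADDR */
#include <stdio.h>

struct task { void *data; };   /* input the pipeline stage will process */

/* Stand-in for a real per-node work queue: here we only report the choice. */
static void enqueue_on_node(int node, struct task *t)
{
    printf("task %p dispatched to node %d\n", t->data, node);
}

/* Locality-aware dispatch: queue the task on the node that holds its data. */
static void dispatch(struct task *t)
{
    int node = 0;

    /* Ask the kernel which node holds the first page of the task's data;
       fall back to node 0 if the query fails. */
    if (get_mempolicy(&node, NULL, 0, t->data, MPOL_F_NODE | MPOL_F_ADDR) != 0)
        node = 0;

    enqueue_on_node(node, t);
}

int main(void)
{
    if (numa_available() < 0) return 1;
    struct task t = { .data = numa_alloc_onnode(4096, 0) };
    *(char *)t.data = 1;       /* touch the page so it is actually allocated */
    dispatch(&t);
    numa_free(t.data, 4096);
    return 0;
}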

  26. Performance evaluation • Two parameters with major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario: schedule / placement Scenario 1: default / FT Schedule: default Placement: first-touch (FT)

  27. default / FT [Figure: under the default schedule, threads T run on all 32 cores of the four processors]

  28. default / FT [Figure: same configuration as on the previous slide]

  29. default / FT [Figure: with first-touch placement, all data pages D end up in the DRAM of Processors 0 and 1, so the threads on Processors 2 and 3 access them remotely]

  30. Performance evaluation • Two parameters with major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario: schedule / placement Scenario 1: default / FT (Schedule: default, Placement: first-touch (FT)) → [change placement] → Scenario 2: default / INTL (Schedule: default, Placement: interleaved (INTL))

  31. default / FT → default / INTL [Figure: the thread schedule is unchanged; the data pages D, previously all in the DRAM of Processors 0 and 1, are now interleaved across the DRAMs of all four processors]

  32. Performance evaluation • Two parameters with major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario: schedule / placement Scenario 1: default / FT (Schedule: default, Placement: first-touch (FT)) → [change placement] → Scenario 2: default / INTL (Schedule: default, Placement: interleaved (INTL)) → [change schedule] → Scenario 3: NUMA / INTL (Schedule: NUMA-aware, Placement: interleaved (INTL))

  33. default / INTL → NUMA / INTL [Figure: placement stays interleaved across all four DRAMs; the thread schedule is changed to a NUMA-aware one]

  34. Performance evaluation • Two parameters with major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario: schedule / placement Scenario 1: default / FT (Schedule: default, Placement: first-touch (FT)) → [change placement] → Scenario 2: default / INTL (Schedule: default, Placement: interleaved (INTL)) → [change schedule] → Scenario 3: NUMA / INTL (Schedule: NUMA-aware, Placement: interleaved (INTL)) → [change placement] → Scenario 4: NUMA / NUMA (Schedule: NUMA-aware, Placement: NUMA-aware (NA))

  35. NUMA / NUMA [Figure: NUMA-aware schedule with NUMA-aware placement: each processor’s threads work on data pages D placed in that processor’s local DRAM]

  36. Performance evaluation: ferret [Chart: uncore transfers (× 10^9) and improvement over default / FT] default / FT [Figure: default schedule, first-touch placement: all data pages D reside in the DRAM of Processors 0 and 1]

  37. Performance evaluation: ferret [Chart: uncore transfers (× 10^9) and improvement over default / FT] default / INTL [Figure: default schedule, interleaved placement: data pages D are spread across the DRAMs of all four processors]

  38. Performance evaluation: ferret [Chart: uncore transfers (× 10^9) and improvement over default / FT] NUMA / INTL [Figure: NUMA-aware schedule, interleaved placement: data pages D remain spread across all four DRAMs]

  39. Performance evaluation: ferret [Chart: uncore transfers (× 10^9) and improvement over default / FT] NUMA / NUMA [Figure: NUMA-aware schedule and placement: each processor’s threads access data pages D in their local DRAM]

  40. Performance evaluation (cont’d) [Charts: streamcluster and dedup]

  41. Data locality optimizations: summary • Improving data locality is better than only avoiding interconnect contention • Interleaved placement is easy to control • Data locality: lack of tools for implementing the optimizations • Other options • Data replication • Automatic data migration

  42. Performance evaluation: ferret [Chart: uncore transfers (× 10^9) and improvement over default / FT]

  43. Outline • Introduction • Performance analysis • Source-level optimizations • Performance evaluation • Language support for data locality optimizations • Conclusions

  44. Data locality of loop-parallel programs Remote memory references / total memory references [%]

  45. Profile-based page placement • Proposed by Marathe et al. [PPoPP ’06] • Combination of • Data address profiling • Frequency-based page placement (a placement sketch follows below) • Hardware: 2-processor, 8-core Intel Nehalem
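A minimal sketch of frequency-based page placement, not Marathe et al.'s implementation: given a per-page access profile (accesses per node, here a hypothetical struct page_profile with a made-up toy example in main), each page of the current process is migrated to the node that accessed it most often using the Linux move_pages() call.

#define _GNU_SOURCE
#include <numa.h>          /* numa_available() */
#include <numaif.h>        /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>

#define NNODES 4           /* the 4-processor Westmere from the talk */

struct page_profile {
    void *page;                 /* page-aligned address */
    long accesses[NNODES];      /* access count per node, from the profiler */
};

/* Return the node that accessed this page most often. */
static int hottest_node(const struct page_profile *p)
{
    int best = 0;
    for (int n = 1; n < NNODES; n++)
        if (p->accesses[n] > p->accesses[best])
            best = n;
    return best;
}

/* Migrate every profiled page of the current process to its hottest node. */
static void place_pages(struct page_profile *prof, int npages)
{
    void *pages[npages];
    int nodes[npages], status[npages];

    for (int i = 0; i < npages; i++) {
        pages[i] = prof[i].page;
        nodes[i] = hottest_node(&prof[i]);
    }

    /* pid 0 = this process; MPOL_MF_MOVE moves pages owned only by it. */
    if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
}

int main(void)
{
    if (numa_available() < 0) return 1;

    /* Toy profile for a single page; the counts are made up. */
    static char buf[4096] __attribute__((aligned(4096)));
    buf[0] = 1;                                  /* fault the page in */
    struct page_profile prof = { buf, { 10, 900, 20, 5 } };

    place_pages(&prof, 1);
    return 0;
}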

  46. Profile-based page placement

  47. Profile-based page placement

  48. Inter-processor data sharing Shared heap / total heap [%]

  49. Inter-processor data sharing Shared heap / total heap [%]

  50. Inter-processor data sharing Shared heap / total heap [%] Performance improvement [%]
