(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads
Zoltán Majó, Thomas R. Gross
Department of Computer Science, ETH Zurich, Switzerland
NUMA-multicore memory system
[Figure: two processors, each with its own cores, last-level cache, memory controller (MC), interconnect link (IC), and locally attached DRAM; a thread T on Processor 0 can hit in the local or remote last-level cache, or access local or remote DRAM.]
• LOCAL_CACHE: 38 cycles
• REMOTE_CACHE: 186 cycles
• LOCAL_DRAM: 190 cycles
• REMOTE_DRAM: 310 cycles
All data based on experimental evaluation of Intel Nehalem (Hackenberg [MICRO '09], Molka [PACT '09])
Experimental setup
• Three benchmark programs from PARSEC: streamcluster, ferret, and dedup
• Increased input sizes to put more pressure on the memory system
• Intel Westmere: 4 processors, 32 cores
• 3 execution scenarios
  • w/o NUMA: Sequential
  • w/o NUMA: Parallel (8 cores / 1 processor)
  • w/ NUMA: Parallel (32 cores / 4 processors)
Execution scenarios
[Figure: four Westmere processors (cores 0-31), each with its own last-level cache, memory controller (MC), interconnect link (IC), and locally attached DRAM; in the parallel scenario one thread T runs on every core.]
CPU cycle breakdown
[Chart: CPU cycle breakdown per benchmark.] dedup: good scaling (26X); streamcluster: poor scaling (11X)
Outline • Introduction • Performance analysis • Data locality • Prefetcher effectiveness • Source-level optimizations • Performance evaluation • Language support for NUMA systems • Conclusions
Data locality
• Page placement policy
  • Commonly used policy: first-touch (default in Linux)
• Measurement: data locality of the benchmarks
  • Data locality [%] = remote memory references / total memory references
  • Read transfers measured at the processor's uncore
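As a minimal worked example with hypothetical counter values: if the uncore reports 2 × 10^9 remote read transfers out of 8 × 10^9 total read transfers, the metric is 2 / 8 × 100 = 25%.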
NUMA-multicore memory system
[Figure: the same memory-system diagram extended to four processors (cores 0-31), each with its own last-level cache, MC, IC, and local DRAM; accesses fall into LOCAL_CACHE, REMOTE_CACHE, LOCAL_DRAM, and REMOTE_DRAM.]
Profiling memory accesses
• Data address profiling
• Based on hardware performance monitoring
• Consider only heap accesses
Example profile (two processors, each with its own DRAM):
• Page 0: 1000 accesses by Processor 0
• Page 1: 3000 accesses by Processor 1
• Page 2: 4000 accesses by Processor 0, 5000 accesses by Processor 1 (shared between processors)
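A minimal sketch of how such a per-page profile could be aggregated from sampled heap accesses; the sample interface, page size, and table size are illustrative assumptions, not the tool used by the authors:

    #include <stdint.h>

    #define PAGE_SHIFT     12          /* assume 4 KB pages */
    #define NUM_PROCESSORS 4
    #define MAX_PAGES      (1 << 20)   /* naive fixed-size table for the sketch */

    /* accesses[page][processor]: sampled heap accesses per page and processor */
    static uint64_t accesses[MAX_PAGES][NUM_PROCESSORS];

    /* Called for every sampled heap access: the sampled data address and the
       processor whose core issued the access. */
    void record_sample(uint64_t addr, int processor) {
        uint64_t page = (addr >> PAGE_SHIFT) % MAX_PAGES;
        accesses[page][processor]++;
    }

    /* A page is shared between processors if more than one processor touched it. */
    int is_shared(uint64_t page) {
        int users = 0;
        for (int p = 0; p < NUM_PROCESSORS; p++)
            if (accesses[page][p] > 0)
                users++;
        return users > 1;
    }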
Inter-processor data sharing
Causes of data sharing:
• streamcluster: data points to be clustered
• ferret and dedup: in-memory databases
Prefetcher performance
• Experiment
  • Run each benchmark with the prefetcher on/off
  • Compare performance
• Causes of prefetcher inefficiency
  • ferret and dedup: hash-based memory access patterns
  • streamcluster: random shuffling
streamcluster: random shuffling

    while (input = read_data_points()) {
        clusters = process(input);
    }

Randomly shuffle the data points to increase the probability that each point is compared to each cluster.
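A minimal sketch of what a pointer-based shuffle can look like (the type and function names are illustrative, not streamcluster's actual code): only the array of point descriptors is permuted, while the coordinate blocks they point to stay in place.

    #include <stdlib.h>

    typedef struct {
        float *coord;    /* pointer into one large, contiguous coordinate array */
        float  weight;
    } Point;

    /* Fisher-Yates shuffle of the point descriptors only: the coordinate
       blocks stay where they were allocated, so after shuffling consecutive
       iterations touch scattered memory and the hardware prefetcher sees no
       regular stride. */
    void shuffle_pointers(Point *points, long n) {
        for (long i = n - 1; i > 0; i--) {
            long j = rand() % (i + 1);
            Point tmp = points[i];
            points[i] = points[j];
            points[j] = tmp;
        }
    }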
streamcluster: prefetcher effectiveness
Original data layout (before shuffling): [Figure: the points array A-H references consecutive blocks of the coordinates array, so threads T0 and T1 stream through the coordinates sequentially.]
streamcluster: prefetcher effectiveness
Data layout (after pointer-based shuffle): [Figure: the points array A-H is permuted, but the coordinate blocks stay in place; consecutive points now reference scattered coordinate blocks, so the threads no longer access the coordinates array sequentially.]
Outline • Introduction • Performance analysis • Source-level optimizations • Prefetching • Data locality • Performance evaluation • Language support for NUMA systems • Conclusions
streamcluster: Optimizing prefetching
Copy-based shuffle: [Figure: the coordinate blocks themselves are copied into the shuffled order (G B C H F E A D), so T0 and T1 again access the coordinates array sequentially.]
Performance improvement over pointer-based shuffle:
• Westmere: 12%
• Nehalem: 60%
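A minimal sketch of the copy-based alternative, reusing the illustrative Point type from the sketch above: the coordinate data itself is copied into the shuffled order, so each thread once again streams through contiguous memory.

    #include <stdlib.h>
    #include <string.h>

    /* Copy each point's coordinate block into a fresh array in the (already
       shuffled) order of the descriptors, then repoint the descriptors.
       The worker threads again stream through contiguous memory. */
    float *shuffle_copy(Point *points, long n, int dim) {
        float *shuffled = malloc((size_t)n * dim * sizeof(float));
        for (long i = 0; i < n; i++) {
            memcpy(&shuffled[i * (long)dim], points[i].coord,
                   (size_t)dim * sizeof(float));
            points[i].coord = &shuffled[i * (long)dim];
        }
        return shuffled;  /* caller frees this (and the old coordinate array) */
    }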
Data locality optimizations
Control the mapping of data and computations:
• Data placement
  • Supported by numa_alloc(), move_pages()
  • First-touch: also OK if the data is accessed by a single processor
  • Interleaved page placement: reduces interconnect contention [Lachaize et al. USENIX ATC '12, Dashti et al. ASPLOS '13]
• Computation scheduling
  • Threads: affinity scheduling, supported by sched_setaffinity()
  • Loop parallelism: rely on OpenMP static loop scheduling
  • Pipeline parallelism: locality-aware task dispatch
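A minimal sketch of the two knobs listed above, using standard libnuma and Linux calls; the node number, core number, and buffer size are illustrative assumptions:

    #define _GNU_SOURCE
    #include <numa.h>     /* libnuma: link with -lnuma */
    #include <sched.h>
    #include <stddef.h>

    int main(void) {
        if (numa_available() < 0)
            return 1;

        /* Data placement: allocate the buffer directly on NUMA node 1. */
        size_t size = 64UL << 20;
        char *buf = numa_alloc_onnode(size, 1);

        /* Computation scheduling: pin this thread to core 8, assumed here to
           be a core of the processor attached to node 1. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(8, &set);
        sched_setaffinity(0, sizeof(set), &set);

        /* ... work on buf from the pinned thread: accesses stay local ... */

        numa_free(buf, size);
        return 0;
    }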
streamcluster
[Figure: after the copy-based shuffle, each half of the shuffled points/coordinates is placed at the processor whose thread processes it: T0 executes at Processor 0 on data placed at Processor 0, T1 executes at Processor 1 on data placed at Processor 1.]
ferret
[Figure: the ferret pipeline with Stage 1: Input, Stage 2: Segment, Stage 3: Extract, Stage 4: Index, Stage 5: Rank, Stage 6: Output. The many Index-stage threads execute at both Processor 0 and Processor 1, but all of them query a single image database.]
ferret
[Figure: NUMA-aware version of the pipeline. The Index stage is duplicated into Index' (threads executing at Processor 0) and Index'' (threads executing at Processor 1), and the image database is split so that each part is placed at the processor whose Index threads access it.]
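A minimal sketch of what locality-aware task dispatch could look like for the duplicated Index stage; the queue structure and node lookup are assumptions for illustration, not ferret's actual implementation:

    #include <numaif.h>   /* get_mempolicy(); link with -lnuma */
    #include <stddef.h>

    #define NUM_NODES 2

    typedef struct task { struct task *next; void *db_entry; } task_t;
    typedef struct { task_t *head, *tail; /* plus a lock in real code */ } queue_t;

    static queue_t index_queue[NUM_NODES];  /* one Index-stage queue per processor */

    /* Ask the kernel which NUMA node backs the page that holds addr. */
    static int node_of(void *addr) {
        int node = 0;
        get_mempolicy(&node, NULL, 0, addr, MPOL_F_NODE | MPOL_F_ADDR);
        return (node >= 0 && node < NUM_NODES) ? node : 0;
    }

    /* Dispatch: enqueue each task at the processor whose DRAM holds the
       database entry it will search, so the Index threads pinned there
       access the database locally. */
    void dispatch(task_t *t) {
        queue_t *q = &index_queue[node_of(t->db_entry)];
        t->next = NULL;
        if (q->tail) q->tail->next = t; else q->head = t;
        q->tail = t;
    }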
Performance evaluation
• Two parameters with a major effect on NUMA performance
  • Data placement
  • Schedule of computations
• Execution scenario notation: schedule / placement
• Scenario 1: default / FT (schedule: default, placement: first-touch (FT))
default / FT
[Figure: 32 threads run on the cores of all four processors, but with first-touch placement all data pages (D) end up in the DRAM of Processors 0 and 1; threads on Processors 2 and 3 access all of their data remotely.]
Performance evaluation
Change placement:
• Scenario 1: default / FT (schedule: default, placement: first-touch (FT))
• Scenario 2: default / INTL (schedule: default, placement: interleaved (INTL))
default / FT → default / INTL
[Figure: with interleaved placement, the data pages (D) are spread round-robin across the DRAM of all four processors instead of residing only at Processors 0 and 1.]
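Interleaved placement (the INTL scenarios) needs no source changes when applied to a whole process (for example by running under numactl --interleave=all); per allocation, a minimal libnuma sketch looks as follows:

    #include <numa.h>      /* link with -lnuma */
    #include <stddef.h>

    /* Spread the pages of one large allocation round-robin across all NUMA
       nodes: no access is guaranteed to be local, but the memory-controller
       and interconnect load is balanced. */
    void *alloc_interleaved(size_t size) {
        if (numa_available() < 0)
            return NULL;
        return numa_alloc_interleaved(size);  /* release with numa_free(p, size) */
    }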
Performance evaluation
Change schedule:
• Scenario 2: default / INTL (schedule: default, placement: interleaved (INTL))
• Scenario 3: NUMA / INTL (schedule: NUMA-aware, placement: interleaved (INTL))
default / INTL → NUMA / INTL
[Figure: the thread schedule becomes NUMA-aware while the data pages (D) remain interleaved across the DRAM of all four processors.]
Performance evaluation
Change placement:
• Scenario 3: NUMA / INTL (schedule: NUMA-aware, placement: interleaved (INTL))
• Scenario 4: NUMA / NUMA (schedule: NUMA-aware, placement: NUMA-aware (NA))
NUMA / NUMA
[Figure: with both a NUMA-aware schedule and NUMA-aware placement, each processor's threads find their data pages (D) in their local DRAM.]
Performance evaluation: ferret
[Charts: uncore transfers [x 10^9] and improvement over default / FT, shown step by step for the four scenarios (default / FT, default / INTL, NUMA / INTL, NUMA / NUMA), each accompanied by the corresponding thread and data placement diagram.]
Performance evaluation (cont'd)
[Charts: corresponding results for streamcluster and dedup.]
Data locality optimizations: summary
• Improving data locality performs better than only avoiding interconnect contention
• Interleaved placement is easy to control
• Data locality: lack of tools for implementing the optimizations
• Other options
  • Data replication
  • Automatic data migration
Outline • Introduction • Performance analysis • Source-level optimizations • Performance evaluation • Language support for data locality optimizations • Conclusions
Data locality of loop-parallel programs
[Chart: remote memory references / total memory references [%].]
Profile-based page placement
• Proposed by Marathe et al. [PPoPP '06]
• Combination of
  • Data address profiling
  • Frequency-based page placement
• Hardware: 2-processor, 8-core Intel Nehalem
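A minimal sketch of the frequency-based placement step, assuming a per-page access profile like the one sketched earlier; move_pages() is the Linux interface such a scheme can use to migrate each page to the node that accessed it most often (the profile layout is an illustrative assumption):

    #include <numaif.h>    /* move_pages(); link with -lnuma */
    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_NODES 2    /* 2-processor Nehalem, as on the slide */

    /* For every profiled page, pick the node that accessed it most often and
       ask the kernel to migrate the page there. pages holds one address per
       page, accesses the per-node access counts from the profile. */
    void place_by_profile(void **pages, uint64_t (*accesses)[NUM_NODES],
                          long npages) {
        int *target = malloc(npages * sizeof(int));
        int *status = malloc(npages * sizeof(int));
        for (long i = 0; i < npages; i++) {
            int best = 0;
            for (int n = 1; n < NUM_NODES; n++)
                if (accesses[i][n] > accesses[i][best])
                    best = n;
            target[i] = best;
        }
        move_pages(0 /* this process */, npages, pages, target, status, 0);
        free(target);
        free(status);
    }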
Inter-processor data sharing
[Charts: shared heap / total heap [%] and performance improvement [%].]