Memory Management in NUMA Multicore Systems:
Trapped between Cache Contention and Interconnect Overhead

Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich


NUMA multicores

(Figure: a two-processor NUMA multicore. Each processor has four cores, a shared cache, a memory controller (MC), and locally attached DRAM; the two processors are connected by an interconnect (IC).)

Two problems:
• NUMA: interconnect overhead (a process placed on one processor while its memory resides in the other processor's DRAM must access that memory across the interconnect)
• multicore: cache contention (processes running on the same processor compete for the shared cache)


Outline
• NUMA: experimental evaluation
• Scheduling
  – N-MASS
  – N-MASS evaluation


Multi-clone experiments
• Memory behavior of unrelated programs
• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006)
  – local clones run on the processor whose DRAM holds their memory
  – remote clones run on the other processor and access their memory across the interconnect

(Figure: five placements of the four clones relative to their memory, yielding local memory bandwidth of 100%, 80%, 57%, 32%, and 0%.)


Performance of schedules
• Which is the best schedule?
• Baseline: single-program execution mode

(Figure: slowdown relative to the baseline for local clones, remote clones, and their average, plotted against local memory bandwidth from 0% to 100%; slowdowns range from 1.0 up to about 2.4.)


N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:
  – Step 1: maximum-local mapping
  – Step 2: cache-aware refinement


Step 1: Maximum-local mapping
• Each process is mapped to the processor whose DRAM holds its memory: A and B (memory MA, MB) on Processor 0, C and D (memory MC, MD) on Processor 1.
• Default OS scheduling, in contrast, may place a process away from its memory (in the example, A, B, and C run on Processor 0 and D on Processor 1, so C accesses MC across the interconnect).


Step 2: Cache-aware refinement
• In an SMP: cache contention is the only concern, so the refinement spreads the processes across the two caches (in the example, C is moved next to D).

(Figure: resulting performance degradation of A, B, C, and D in the SMP case.)

• In a NUMA system: moving a process away from its memory adds its NUMA penalty, so a process is moved only if the penalty fits within its NUMA allowance (in the example, B is moved instead of C).

(Figure: resulting performance degradation of A, B, C, and D in the NUMA case, annotated with NUMA penalty and NUMA allowance.)


Performance factors

Two factors cause performance degradation:
1. slowdown due to remote memory access (NUMA penalty)
   – local processes: cache misses per kilo-instruction (MPKI)
   – remote processes: MPKI x NUMA penalty
2. cache pressure

(Figure: NUMA penalty of the SPEC programs, ranging from 1.0 to about 1.5.)
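Taken together, the two steps and the two performance factors suggest a simple placement heuristic. The sketch below is an illustration under my own simplifying assumptions, not the N-MASS implementation: the struct proc layout, the nmass_map function, and the per-cache pressure model (sum of MPKI for local processes and MPKI x NUMA penalty for remote ones, as on the Performance factors slide) are assumptions, as is the greedy refinement loop for a two-node system.

    /* Hypothetical sketch of an N-MASS-style mapping for two NUMA nodes.
     * Not the authors' code: the cost model and the greedy loop are
     * simplifying assumptions made for illustration. */
    #include <stdio.h>

    struct proc {
        const char *name;
        int home;        /* node that holds the process's memory          */
        int node;        /* node the process is currently mapped to       */
        double mpki;     /* cache misses per kilo-instruction             */
        double penalty;  /* NUMA penalty: slowdown when running remotely  */
    };

    /* Contribution of one process to the pressure on the cache it runs on:
     * MPKI if it runs locally, MPKI * NUMA penalty if it runs remotely.   */
    static double pressure(const struct proc *p)
    {
        return (p->node == p->home) ? p->mpki : p->mpki * p->penalty;
    }

    static double node_pressure(const struct proc *ps, int n, int node)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            if (ps[i].node == node)
                sum += pressure(&ps[i]);
        return sum;
    }

    void nmass_map(struct proc *ps, int n)
    {
        /* Step 1: maximum-local mapping. */
        for (int i = 0; i < n; i++)
            ps[i].node = ps[i].home;

        /* Step 2: cache-aware refinement.  Repeatedly move the home-running
         * process with the smallest NUMA penalty from the busier cache to
         * the other one, as long as that lowers the larger cache pressure. */
        for (;;) {
            double p0 = node_pressure(ps, n, 0);
            double p1 = node_pressure(ps, n, 1);
            int from = (p0 >= p1) ? 0 : 1, to = 1 - from;
            double worst = (p0 >= p1) ? p0 : p1;

            int best = -1;
            for (int i = 0; i < n; i++)
                if (ps[i].node == from && ps[i].node == ps[i].home &&
                    (best < 0 || ps[i].penalty < ps[best].penalty))
                    best = i;
            if (best < 0)
                break;

            ps[best].node = to;                    /* tentative move        */
            double q0 = node_pressure(ps, n, 0);
            double q1 = node_pressure(ps, n, 1);
            double new_worst = (q0 >= q1) ? q0 : q1;
            if (new_worst >= worst) {              /* no improvement: undo  */
                ps[best].node = from;
                break;
            }
        }
    }

    int main(void)
    {
        /* The MPKI and penalty values are made up for illustration. */
        struct proc ps[] = {
            { "A", 0, 0, 20.0, 1.40 },
            { "B", 0, 0, 10.0, 1.30 },
            { "C", 0, 0,  2.0, 1.05 },
            { "D", 1, 1,  1.0, 1.02 },
        };
        nmass_map(ps, 4);
        for (int i = 0; i < 4; i++)
            printf("%s -> node %d\n", ps[i].name, ps[i].node);
        return 0;
    }

With the made-up inputs above, the high-MPKI processes end up split across the two caches while the process with the largest NUMA penalty stays with its memory, which is the trade-off the refinement step is after.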
Implementation
• User-mode extension to the Linux scheduler
• Performance metrics
  – hardware performance counter feedback
  – NUMA penalty
    • perfect information from program traces
    • estimate based on MPKI
• All memory for a process allocated on one processor


Workloads
• SPEC CPU2006 subset
• 11 multi-program workloads (WL1–WL11)
  – 4-program workloads (WL1–WL9)
  – 8-program workloads (WL10, WL11)

(Figure: NUMA penalty vs. MPKI (log scale) for the SPEC programs, separating CPU-bound from memory-bound programs and marking which programs are used in the workloads.)


Memory allocation setup
• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps


Memory allocation maps
• Each digit of the map names the processor whose DRAM holds the memory of the corresponding process (A, B, C, D).
• Allocation map 0000: the memory of all four processes (MA, MB, MC, MD) is on Processor 0's DRAM (unbalanced).
• Allocation map 0011: MA and MB are on Processor 0's DRAM, MC and MD on Processor 1's DRAM (balanced).


Evaluation
• Baseline: Linux average
  – the Linux scheduler is non-deterministic
  – the baseline is the average performance degradation over all possible cases
• N-MASS with perfect NUMA penalty information


WL9 and WL1: Linux average vs. N-MASS

(Figures: average slowdown relative to single-program mode, between 1.0 and 1.6, for allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, and 1001, comparing the Linux average with N-MASS for workloads WL9 and WL1.)


N-MASS performance
• N-MASS reduces performance degradation by up to 22%
• Which factor is more important: interconnect overhead or cache contention?
• Compare:
  – maximum-local
  – N-MASS (maximum-local + cache refinement step)


Data-locality vs. cache balancing (WL9 and WL1)

(Figures: performance improvement relative to the Linux average, between -10% and 25%, for the same allocation maps, comparing maximum-local with N-MASS for workloads WL9 and WL1.)


Data locality vs. cache balancing
• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
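The next slide answers this by estimating the penalty from MPKI with a linear regression. As a concrete illustration of such a fit, the sketch below performs an ordinary least-squares fit of penalty against MPKI; the calibration points, the function name fit_line, and the predicted example are made up for illustration and are not the talk's data or code.

    /* Minimal sketch: estimate a program's NUMA penalty from its MPKI by
     * fitting a least-squares line penalty ~ a + b * MPKI to calibration
     * data.  The sample points are invented; in the talk the model is
     * fitted to measurements of the SPEC programs. */
    #include <stdio.h>

    /* Ordinary least-squares fit of y = a + b*x. */
    static void fit_line(const double *x, const double *y, int n,
                         double *a, double *b)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];
            sy  += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *a = (sy - *b * sx) / n;
    }

    int main(void)
    {
        /* Calibration points: (MPKI, measured NUMA penalty).  Made up. */
        double mpki[]    = {  1.0,  5.0, 10.0, 20.0, 30.0, 45.0 };
        double penalty[] = { 1.01, 1.05, 1.12, 1.22, 1.33, 1.45 };
        double a, b;

        fit_line(mpki, penalty, 6, &a, &b);

        /* Predict the penalty of a program whose MPKI can be measured
         * online with hardware performance counters. */
        double observed_mpki = 25.0;
        printf("penalty(MPKI=%.1f) ~= %.3f\n",
               observed_mpki, a + b * observed_mpki);
        return 0;
    }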
Estimating NUMA penalty
• NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto the MPKI data

(Figure: NUMA penalty, between 1.0 and about 1.5, plotted against MPKI from 0 to 50.)


Estimate-based N-MASS: performance

(Figure: performance improvement relative to the Linux average, between -2% and 8%, for workloads WL1 through WL11, comparing maximum-local, N-MASS, and estimate-based N-MASS.)


Conclusions
• N-MASS: a NUMA-multicore-aware scheduler
• Data locality optimizations are more beneficial than cache contention avoidance
• Better performance metrics are needed for scheduling


Thank you! Questions?
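As a supplementary illustration of the hardware performance counter feedback mentioned on the Implementation slide (and of the kind of metric the conclusions call for), a minimal Linux sketch that measures a process's MPKI with perf_event_open could look as follows. The choice of PERF_COUNT_HW_CACHE_MISSES as a stand-in for last-level cache misses and the dummy workload are assumptions.

    /* Sketch: measure MPKI (cache misses per kilo-instruction) of the
     * current process with Linux perf events. */
    #include <linux/perf_event.h>
    #include <asm/unistd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* perf_event_open has no glibc wrapper; call it via syscall(). */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    }

    int main(void)
    {
        int fd_miss = open_counter(PERF_COUNT_HW_CACHE_MISSES);
        int fd_ins  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        if (fd_miss < 0 || fd_ins < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd_miss, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_ins,  PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(fd_ins,  PERF_EVENT_IOC_ENABLE, 0);

        /* Dummy workload standing in for the program under measurement. */
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; i++)
            x += (double)i;

        ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(fd_ins,  PERF_EVENT_IOC_DISABLE, 0);

        long long misses = 0, instructions = 0;
        if (read(fd_miss, &misses, sizeof(misses)) != sizeof(misses) ||
            read(fd_ins, &instructions, sizeof(instructions)) != sizeof(instructions)) {
            perror("read");
            return 1;
        }

        if (instructions > 0)
            printf("MPKI = %.3f\n",
                   1000.0 * (double)misses / (double)instructions);

        close(fd_miss);
        close(fd_ins);
        return 0;
    }

In a scheduler extension along the lines of N-MASS, counters like these would presumably be sampled periodically per process rather than around a fixed loop, and fed into the penalty estimate shown earlier.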