Memory Management in NUMA Multicore Systems:
Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross
Department of Computer Science
ETH Zurich
NUMA multicores
[Diagram: two processors with four cores each (0-3 and 4-7); each processor has a shared cache and a memory controller (MC) attached to its local DRAM, and the two processors are connected by an interconnect (IC)]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
[Diagram: processes A and B run on the cores while their memories MA and MB sit in DRAM; a process whose memory is attached to the other processor reaches it across the interconnect]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
• multicore: cache contention
[Diagram: processes A and B with their memories MA and MB, illustrating both remote memory accesses over the interconnect and contention for a processor's shared cache]
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
Multi-clone experiments
• Memory behavior of unrelated programs
• Intel Xeon E5520 (2 processors, 4 cores each)
• 4 clones of soplex (SPEC CPU2006)
– local clone: runs on the processor that holds its memory
– remote clone: runs on the other processor
[Diagram: the clones (C) spread over the cores of the two processors, each with its memory (M) in one of the two DRAMs]
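The local/remote clone setup can be reproduced from user space with libnuma. The sketch below is not the authors' experiment harness: it launches soplex clones with the CPU node and the memory node chosen independently, so a clone whose memory node differs from its CPU node becomes a remote clone. The node numbers and the command line are placeholders.

```c
/* Sketch: launch a clone pinned to cpu_node, with memory placed on mem_node.
 * Link with -lnuma. Node numbers and command line are illustrative only. */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static pid_t launch_clone(int cpu_node, int mem_node, char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {                       /* child: set placement, then exec */
        numa_run_on_node(cpu_node);       /* restrict execution to cpu_node  */
        numa_set_preferred(mem_node);     /* allocate memory on mem_node     */
        execvp(argv[0], argv);
        perror("execvp");
        _exit(1);
    }
    return pid;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    char *cmd[] = { "./soplex", "ref.mps", NULL };   /* placeholder command */
    launch_clone(0, 0, cmd);   /* local clone: runs on node 0, memory on node 0  */
    launch_clone(0, 1, cmd);   /* remote clone: runs on node 0, memory on node 1 */
    while (wait(NULL) > 0) ;   /* wait for both clones */
    return 0;
}
```

Note that numa_set_preferred only expresses a preference; a strict placement would use numa_set_membind instead.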
[Diagram: five schedules (1-5) of the four clones across the two processors, ranging from all clones local (local bandwidth 100%) through mixed placements (local bandwidth 80%, 57%, and 32%) to all clones remote (local bandwidth 0%)]
Performance of schedules
• Which is the best schedule?
• Baseline: single-program execution mode
[Diagram: baseline configuration, a single clone (C) running alone with its memory (M)]
Execution time
[Chart: slowdown relative to baseline (1.0 to 2.4) vs. local memory bandwidth (0%, 25%, 50%, 75%, 100%) for local clones, remote clones, and their average]
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
N-MASS
(NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 1: Maximum-local mapping
[Diagram: processes A, B, C, and D on cores 0-7; MA and MB reside in Processor 0's DRAM, MC and MD in Processor 1's DRAM, so maximum-local mapping places A and B on Processor 0 and C and D on Processor 1]
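As a sketch of what Step 1 amounts to (not the N-MASS sources): if we know, for each process, the node that holds its memory, maximum-local mapping pins the process to the cores of that node. The proc_t structure, the helper names, and the contiguous core numbering are assumptions for illustration.

```c
/* Sketch of Step 1: map every process onto the node that holds its memory.
 * proc_t and the pinning helper are illustrative, not the N-MASS code. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define CORES_PER_NODE 4                 /* assumed: 2 nodes x 4 cores, numbered contiguously */

typedef struct {
    int pid;
    int memory_node;                     /* node holding (all of) the process's memory */
} proc_t;

/* Pin a process to all cores of one node. */
static void pin_to_node(int pid, int node)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = node * CORES_PER_NODE; c < (node + 1) * CORES_PER_NODE; c++)
        CPU_SET(c, &set);
    if (sched_setaffinity(pid, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

void maximum_local_mapping(proc_t *procs, int n)
{
    for (int i = 0; i < n; i++)          /* each process runs where its memory is */
        pin_to_node(procs[i].pid, procs[i].memory_node);
}
```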
Default OS scheduling
[Diagram: the default Linux placement runs A, B, and C on Processor 0 and D on Processor 1, with MA and MB in Processor 0's DRAM and MC and MD in Processor 1's DRAM; C therefore accesses its memory across the interconnect]
N-MASS
(NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 2: Cache-aware refinement
In an SMP:
[Diagram sequence: the refinement step migrates a process to the other processor to balance the pressure on the two caches; a chart compares the performance degradation of A, B, C, and D before and after the migration and marks the NUMA penalty]
Step 2: Cache-aware refinement
In a NUMA system:
[Diagram sequence: the same migration now also turns the moved process's memory accesses into remote accesses; the chart shows the performance degradation of A, B, C, and D annotated with the NUMA penalty and the NUMA allowance]
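One way to read the refinement step, using the cache-pressure metric defined on the next slide: starting from the maximum-local mapping, a process is moved to the other processor only if the move reduces the imbalance between the two caches and the process's own NUMA penalty stays within the allowance. The sketch below is our illustrative reading, not the published algorithm; the task fields and the migration heuristic are assumptions.

```c
/* Illustrative sketch (our reading, not the published N-MASS algorithm):
 * migrate a process away from its local node only while the cache-pressure
 * imbalance is large and the process's own NUMA penalty fits the allowance. */
typedef struct {
    int    node;            /* node the process currently runs on         */
    double cache_pressure;  /* MPKI-based contribution (see next slide)   */
    double numa_penalty;    /* estimated slowdown if it runs remotely     */
} task_t;

/* Total cache pressure on one node. */
static double node_pressure(const task_t *t, int n, int node)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        if (t[i].node == node)
            sum += t[i].cache_pressure;
    return sum;
}

void cache_aware_refinement(task_t *t, int n, double numa_allowance)
{
    for (;;) {
        int from = node_pressure(t, n, 0) >= node_pressure(t, n, 1) ? 0 : 1;
        int to = 1 - from, best = -1;
        for (int i = 0; i < n; i++)      /* cheapest affordable task to migrate */
            if (t[i].node == from && t[i].numa_penalty <= numa_allowance &&
                (best < 0 || t[i].numa_penalty < t[best].numa_penalty))
                best = i;
        if (best < 0)
            break;                       /* nothing fits within the allowance */
        double pf = node_pressure(t, n, from), pt = node_pressure(t, n, to);
        if (pf - t[best].cache_pressure <= pt + t[best].cache_pressure)
            break;                       /* migration would not improve the balance */
        t[best].node = to;
    }
}
```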
Performance factors
Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure
– local processes: misses per kilo-instruction (MPKI)
– remote processes: MPKI × NUMA penalty
[Chart: NUMA penalty (1.0 to 1.5) across the SPEC programs]
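These two factors combine into a per-process cache-pressure contribution that can be summed per processor: a local process contributes its MPKI, a remote process its MPKI multiplied by its NUMA penalty. A minimal sketch of that computation (the struct and field names are ours):

```c
/* Cache-pressure contribution of one process, as described on the slide:
 * local processes count their MPKI, remote processes MPKI x NUMA penalty.
 * Struct and field names are illustrative. */
typedef struct {
    double mpki;          /* cache misses per 1000 instructions          */
    double numa_penalty;  /* slowdown factor for remote memory accesses  */
    int    is_remote;     /* 1 if the process runs away from its memory  */
} proc_metrics_t;

static double cache_pressure(const proc_metrics_t *p)
{
    return p->is_remote ? p->mpki * p->numa_penalty : p->mpki;
}

/* Cache pressure on one processor = sum over the processes mapped to it. */
double processor_pressure(const proc_metrics_t *procs, const int *on_cpu,
                          int n, int cpu)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        if (on_cpu[i] == cpu)
            sum += cache_pressure(&procs[i]);
    return sum;
}
```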
Implementation
• User-mode extension to the Linux scheduler
• Performance metrics
– hardware performance counter feedback
– NUMA penalty
• perfect information from program traces
• estimate based on MPKI
• All memory for a process allocated on one processor
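On Linux, the hardware performance counter feedback can be gathered with perf_event_open. The sketch below is not the N-MASS implementation: it samples cache misses and retired instructions for one process over a short interval and derives its MPKI.

```c
/* Sketch: derive MPKI for a process from hardware counters via perf_event_open.
 * Error handling is minimal; this is not the N-MASS implementation. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>

static int open_counter(pid_t pid, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;             /* count user-level events only */
    return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}

/* Count both events over a short interval and compute misses per kilo-instruction. */
double read_mpki(pid_t pid)
{
    int misses = open_counter(pid, PERF_COUNT_HW_CACHE_MISSES);
    int instrs = open_counter(pid, PERF_COUNT_HW_INSTRUCTIONS);
    if (misses < 0 || instrs < 0)
        return -1.0;

    sleep(1);                            /* measurement interval */

    uint64_t m = 0, i = 0;
    read(misses, &m, sizeof(m));
    read(instrs, &i, sizeof(i));
    close(misses);
    close(instrs);
    return i ? (double)m / ((double)i / 1000.0) : 0.0;
}
```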
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
Workloads
• SPEC CPU2006 subset
• 11 multi-program workloads (WL1-WL11)
– 4-program workloads (WL1-WL9)
– 8-program workloads (WL10, WL11)
[Chart: NUMA penalty (0.9 to 1.5) vs. MPKI (log scale, 1E-05 to 100) for the SPEC programs, distinguishing CPU-bound from memory-bound programs and used from unused programs]
Memory allocation setup
• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps
Memory allocation maps
• A, B, C, D: processes; MA, MB, MC, MD: their memories
• Each digit of a map names the processor whose DRAM holds the corresponding process's memory
• Allocation map 0000: MA, MB, MC, MD all in Processor 0's DRAM (unbalanced)
• Allocation map 0011: MA, MB in Processor 0's DRAM; MC, MD in Processor 1's DRAM (balanced)
[Diagrams: the two processors with the memories placed according to each allocation map]
Evaluation
• Baseline: Linux average
– the Linux scheduler is non-deterministic
– average performance degradation over all possible cases
• N-MASS with perfect NUMA penalty information
WL9: Linux average
[Chart: average slowdown relative to single-program mode (1.0 to 1.6) under Linux for the allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001]
WL9: N-MASS
[Chart: average slowdown relative to single-program mode (1.0 to 1.6) for Linux average and N-MASS across the same allocation maps]
WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode (1.0 to 1.6) for Linux average and N-MASS across the allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001]
N-MASS performance
• N-MASS reduces performance degradation by up to 22%
• Which factor is more important: interconnect overhead or cache contention?
• Compare:
– maximum-local
– N-MASS (maximum-local + cache refinement step)
Data-locality vs. cache balancing (WL9)
[Chart: performance improvement relative to Linux average (-10% to 25%) for maximum-local and for N-MASS (maximum-local + cache refinement step) across the allocation maps]
Data-locality vs. cache balancing (WL1)
[Chart: the same comparison for WL1: performance improvement relative to Linux average (-10% to 25%) for maximum-local and N-MASS across the allocation maps]
Data locality vs. cache balancing
• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
Estimating NUMA penalty
• NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto MPKI data
[Chart: NUMA penalty (1.0 to 1.5) plotted against MPKI (0 to 50)]
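A plain least-squares line, penalty ≈ a + b · MPKI, is enough to illustrate the estimate; the data points in the sketch below are placeholders, not measured values.

```c
/* Sketch: estimate NUMA penalty from MPKI with an ordinary least-squares line,
 * penalty ~= a + b * MPKI. The data points below are placeholders. */
#include <stdio.h>

typedef struct { double a, b; } fit_t;   /* intercept and slope */

static fit_t fit_line(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double a = (sy - b * sx) / n;
    return (fit_t){ a, b };
}

int main(void)
{
    /* (MPKI, measured NUMA penalty) pairs -- illustrative values only */
    double mpki[]    = { 0.5, 5.0, 12.0, 25.0, 40.0 };
    double penalty[] = { 1.01, 1.05, 1.15, 1.30, 1.45 };
    fit_t f = fit_line(mpki, penalty, 5);
    printf("estimated penalty(MPKI) = %.3f + %.4f * MPKI\n", f.a, f.b);
    return 0;
}
```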
Estimate-based N-MASS: performance
[Chart: performance improvement relative to Linux average (-2% to 8%) for maximum-local, N-MASS, and estimate-based N-MASS across workloads WL1-WL11]
Conclusions
• N-MASS: NUMAmulticore-aware scheduler
• Data locality optimizations more beneficial
than cache contention avoidance
• Better performance metrics needed for
scheduling
39
Thank you! Questions?
Download:
• Memory System Performance in a NUMA Multicore Multiprocessor