On Realistic Divisible Load Scheduling for
Highly Heterogeneous Distributed Systems
Aleksandar Ilić and Leonel Sousa
INESC-ID / IST, TU Lisbon, Portugal
Instituto de Engenharia de Sistemas e Computadores - Investigação e Desenvolvimento em Lisboa
PDP 2012, Special Session on GPU and Hybrid Computing, Germany, 2012
Motivation
COMMODITY COMPUTERS = HETEROGENEOUS SYSTEMS
– Multi-core CPUs, GPUs, AMD Fusion, accelerators/co-processors, FPGAs …
– Interconnection: bidirectional, full-duplex, asymmetric, different concurrency levels
CLUSTERS OF COMMODITY COMPUTERS = HIGHLY HETEROGENEOUS SYSTEMS
– (POWER OF COMMODITY COMPUTERS)^N => (HETEROGENEITY PROBLEMS)^N
– Communication: even more complex if the one-port model is not assumed
Heterogeneity offers tremendous computing power
Heterogeneity makes life difficult, even for a single application
– Traditionally: different computation/communication speeds (far from constant)
– Additionally: different programming models, languages, tools, implementations …
=> REALISTIC PERFORMANCE MODELING, SCHEDULING AND LOAD BALANCING
Outline
DISCRETELY DIVISIBLE LOAD (DDL) PROCESSING
HETEROGENEOUS DISTRIBUTED SYSTEMS WITH HETEROGENEOUS CPU+GPU DESKTOPS AS COMPUTING NODES
PERFORMANCE MODELING AND 3-STEP HIERARCHICAL DDL SCHEDULING ALGORITHM
CASE STUDY: 2D FFT BATCH EXECUTION
CONCLUSIONS AND FUTURE WORK
Discretely Divisible Load Processing
• DISCRETELY DIVISIBLE LOAD (DDL) APPLICATIONS
– Computations divisible into pieces of arbitrary integer sizes
– Fractions independently processed in parallel, with no precedence constraints
• APPLICABLE TO A WIDE RANGE OF SCIENTIFIC PROBLEMS
– Linear algebra, digital signal and image processing, database applications …
• STATE-OF-THE-ART DDL APPROACHES IN HETEROGENEOUS DISTRIBUTED COMPUTING
– Very recent, but only for traditional distributed systems (no hierarchy, CPU-only)
– Assume symmetric bandwidth and a one-port model for communication links
– Limited memory: only the size of the input load is considered, or the excess load is simply redistributed among the nodes with available memory
– Unrealistic: computation/communication time is assumed to be a linear/affine function of the number of chunks
Heterogeneous Desktop Systems
HETEROGENEOUS STAR NETWORK (MASTER-WORKER)
• MULTI-CORE CPU (Master)
– Global execution controller; accesses the whole global memory
– All cores employed for execution
• INTERCONNECTION BUSES
– Bidirectional, full-duplex, asymmetric communication
– Different concurrency levels
– Potential execution bottleneck
• DEVICES (Distant workers)
– Different architectures and programming models
– Computation performed using local memories
[Figure: a multi-core CPU (cores #1 … #N, cache, DRAM) connected through the interconnection buses to devices #1 … #N (e.g., GPU #1), each with its own DRAM.]
Commodity Cluster Systems
HETEROGENEOUS STAR NETWORK
• MASTER NODE
– Heterogeneous desktop system
– Positioned at the center; employed for execution as a whole
• INTERCONNECTION NETWORK
– Limited asymmetric one-port communication
– Potential execution bottleneck
• WORKER NODES
– Heterogeneous desktop systems
– All available devices employed in each node
Proposed Algorithm Outline
3-STEP HIERARCHICAL DDL SCHEDULING:
• STEP 1 – SYSTEM-LEVEL LOAD BALANCING
– How many load units to send to each node?
• STEP 2 – NODE-LEVEL LOAD BALANCING
– How many chunks to send to each device from a given node's load?
• STEP 3 – DEVICE-LEVEL SCHEDULING
– How to sub-partition the device load to:
• Reduce delays when distributing and retrieving the load
• Overlap computation and communication
• Efficiently use the bidirectional asymmetric bandwidth of the buses
• Respect the amount of supported concurrency
• Fit into the limited device memory
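The three questions above correspond to three nested decisions. The following Python sketch shows that structure only; the naive proportional and fixed-size rules are illustrative stand-ins for the model-based steps detailed on the next slides, and all names are assumptions rather than the authors' API.

```python
# Structural sketch of the 3-step hierarchy only; rules and names are illustrative.

def split_proportionally(load, weights):
    """Integer split of `load` proportional to `weights` (naive stand-in)."""
    total = sum(weights)
    shares = [load * w // total for w in weights]
    shares[0] += load - sum(shares)              # hand the rounding remainder over
    return shares

def split_into_installments(beta, memory_limit):
    """Stand-in for device-level scheduling: memory-bounded installments."""
    installments = []
    while beta > 0:
        installments.append(min(beta, memory_limit))
        beta -= installments[-1]
    return installments

def schedule(total_load, node_speeds, device_speeds, device_memory):
    # STEP 1 - system level: how many load units to send to each node.
    alphas = split_proportionally(total_load, node_speeds)
    plan = []
    for alpha, speeds, limits in zip(alphas, device_speeds, device_memory):
        # STEP 2 - node level: how many chunks each device of the node receives.
        betas = split_proportionally(alpha, speeds)
        # STEP 3 - device level: sub-partition so each installment fits in memory.
        plan.append([split_into_installments(b, m) for b, m in zip(betas, limits)])
    return plan

# Example: 1024 chunks over two nodes, each with one GPU and one CPU core.
print(schedule(1024, [3, 1], [[4, 1], [4, 1]], [[128, 10**9], [128, 10**9]]))
```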
Performance Modeling
FUNCTIONAL PERFORMANCE MODELS
– Built from real application execution
• No assumptions made to ease modeling!
• NODE-LEVEL PERFORMANCE MODELS
– Computation performance models: one for each master core and each distant worker
– Full-duplex communication bandwidth models: bidirectional and asymmetric, for each link
– Total performance of each device, including both computation and communication
• SYSTEM-LEVEL PERFORMANCE MODELS
– Computation performance models: one for each node (heterogeneous desktop system)
– Communication bandwidth models: bidirectional and asymmetric, for each network link
– Total performance of each node
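Because no assumptions are made about their shape, these functional models are essentially sets of measured (load size, speed) points. A minimal sketch of such a model with piecewise-linear interpolation; the class and its methods are illustrative, not the authors' implementation.

```python
# Functional performance model built from measured executions: speed is an
# arbitrary function of the load size, kept as benchmark points with
# piecewise-linear interpolation between them (illustrative sketch).
import bisect

class FunctionalModel:
    def __init__(self):
        self.points = []                         # sorted (chunks, speed) pairs

    def add_measurement(self, chunks, seconds):
        bisect.insort(self.points, (chunks, chunks / seconds))

    def speed(self, chunks):
        """Interpolated speed at `chunks` (flat extrapolation at the ends)."""
        xs = [p[0] for p in self.points]
        i = bisect.bisect_left(xs, chunks)
        if i == 0:
            return self.points[0][1]
        if i == len(xs):
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (y1 - y0) * (chunks - x0) / (x1 - x0)

    def time(self, chunks):
        return chunks / self.speed(chunks)

# One such model is kept per device (computation) and per link and direction
# (upload/download bandwidth), precisely because none of them is constant.
gpu = FunctionalModel()
gpu.add_measurement(64, 0.8)
gpu.add_measurement(256, 2.4)
print(gpu.time(128))
```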
Determination of Load Fractions (1)
• STEP 1 – SYSTEM-LEVEL LOAD BALANCING
– The optimal distribution between nodes lies on a straight line passing through the origin of the coordinate system and intersecting the total performance curves of the nodes*
• STEP 2 – NODE-LEVEL LOAD BALANCING
– Each node's load fraction is sub-partitioned between the node's devices; the optimal distribution lies on a straight line passing through the origin of the coordinate system and intersecting the communication-aware total performance curves of the devices*
* Lastovetsky, A., and Reddy, R., "Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of their Functional Performance Models", HeteroPar 2009, LNCS, vol. 6043, Springer, pp. 91-101, 2010.
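The line-through-the-origin condition used in both steps is equivalent to all processing elements finishing at the same time. The sketch below solves that condition by bisection on the common finishing time, assuming only that the execution time x / s(x) grows with the load x; the algorithm in the cited reference instead builds the performance models incrementally from partial estimates, so this is just the geometric idea.

```python
# Sketch of equal-finish-time (line through the origin) partitioning.
# `speed_funcs` play the role of total performance models s_i(x); monotone
# execution time is an illustrative simplification.

def balance(N, speed_funcs, iters=60):
    def load_within(time_budget, s):
        # Largest integer load this element finishes within `time_budget`.
        lo, hi = 0, N
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if mid / s(mid) <= time_budget:
                lo = mid
            else:
                hi = mid - 1
        return lo

    lo_t = 0.0
    hi_t = max(N / s(N) for s in speed_funcs)    # slowest element doing all of N
    for _ in range(iters):
        mid_t = (lo_t + hi_t) / 2
        if sum(load_within(mid_t, s) for s in speed_funcs) >= N:
            hi_t = mid_t
        else:
            lo_t = mid_t
    loads = [load_within(hi_t, s) for s in speed_funcs]
    loads[0] -= sum(loads) - N                   # absorb integer rounding surplus
    return loads

# Two elements: one whose speed saturates with load, one constant-speed core.
print(balance(1024, [lambda x: 200 + 0.1 * min(x, 400), lambda x: 60]))
```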
Determination of Load Fractions (2) - Device-Level Scheduling with Limited Memory
• STEP 3 – DEVICE-LEVEL SCHEDULING
– Per-device distributions are allowed to exceed the device memory limits, b_j
• Device-level multi-installment processing with multi-distributions: the device load is split into sub-distributions of sub-load fractions
– Application memory requirements are modeled with three functions of the load size:
• Input memory requirements
• Output memory requirements
• Execution memory requirements
– Different implementations of the same problem may have different memory requirements!
⇒ In each sub-distribution the whole amount of device memory may be consumed, i.e., the input, output, and execution requirements of its sub-load fractions must together fit within b_j
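The exact form of the constraint is not spelled out on the slide; as a labeled assumption, the sketch below checks that the input and output buffers of every sub-fraction in a sub-distribution, plus the largest single execution workspace, fit within the device memory b_j.

```python
# Hedged sketch of a limited-memory test for one sub-distribution.  The exact
# constraint used by the authors is not reproduced here; this assumes all input
# and output buffers plus the largest execution workspace must fit within b_j.

def fits_in_memory(sub_distribution, mem_in, mem_out, mem_exec, b_j):
    total_in  = sum(mem_in(g)  for g in sub_distribution)
    total_out = sum(mem_out(g) for g in sub_distribution)
    peak_exec = max(mem_exec(g) for g in sub_distribution)
    return total_in + total_out + peak_exec <= b_j

# Purely illustrative affine requirement models (in MB); e.g. one 512x512
# double-complex slice of the later case study occupies 4 MB.
mem_in   = lambda g: 4.0 * g
mem_out  = lambda g: 4.0 * g
mem_exec = lambda g: 8.0 * g + 16
print(fits_in_memory([22, 41, 18, 8, 4, 1], mem_in, mem_out, mem_exec, 1024))
```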
Determination of Load Fractions (6) - Computation/Communication Overlapping
• STEP 3 – DEVICE-LEVEL SCHEDULING
– For each sub-distribution, the sub-fraction sizes are carefully chosen to allow the best possible overlapping of computation and communication between subsequent sub-fractions
[Figure: execution timelines without and with overlapping; the overlapped schedule finishes earlier, and the difference is the gain.]
Determination of Load Fractions (9) - Computation/Communication Overlapping
• STEP 3 – DEVICE-LEVEL SCHEDULING
– For each sub-distribution, the sub-fraction sizes are carefully chosen to allow the best possible overlapping of computation and communication between subsequent sub-fractions
– The decisions are made according to the amount of overlapping concurrency supported by the device:
[Figure: timelines of the input transfers t_i(γ_k^m), computations t_w(γ_k^m) and output transfers t_o(γ_k^m) of successive sub-fractions γ_k^1, γ_k^2, γ_k^3, …
(a) Overlap of a single communication with computation at a time
(b) Complete concurrency between communication and computation]
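To make the two pictured schedules concrete, the sketch below computes the resulting execution time of a sequence of sub-fractions from their per-fraction input, compute and output times. The prefetch ordering in case (a) (the next upload is issued before the previous download) mirrors the diagram; both functions are simplified models assumed here, not the authors' formulas.

```python
# Simplified makespan models for the two concurrency cases pictured above.
# t_in[k], t_w[k], t_out[k]: input-transfer, computation and output-transfer
# times of sub-fraction k (assumed, illustrative).

def makespan_no_overlap(t_in, t_w, t_out):
    return sum(t_in) + sum(t_w) + sum(t_out)

def makespan_single_overlap(t_in, t_w, t_out):
    """(a) only one transfer at a time may overlap with computation:
    a single communication engine, uploads prefetched ahead of downloads."""
    n = len(t_w)
    comm = 0.0
    w_done = [0.0] * n
    for k in range(n):
        comm += t_in[k]                               # upload fraction k
        w_done[k] = max(comm, w_done[k - 1] if k else 0.0) + t_w[k]
        if k:                                         # download fraction k-1
            comm = max(comm, w_done[k - 1]) + t_out[k - 1]
    return max(comm, w_done[-1]) + t_out[-1]          # final download

def makespan_full_overlap(t_in, t_w, t_out):
    """(b) uploads, computation and downloads fully concurrent (3-stage pipeline)."""
    up = wd = down = 0.0
    for ti, tw, to in zip(t_in, t_w, t_out):
        up += ti
        wd = max(up, wd) + tw
        down = max(wd, down) + to
    return down

t_in, t_w, t_out = [0.4] * 4, [0.5] * 4, [0.4] * 4
print(makespan_no_overlap(t_in, t_w, t_out),       # 5.2
      makespan_single_overlap(t_in, t_w, t_out),   # 3.4
      makespan_full_overlap(t_in, t_w, t_out))     # 2.8
```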
Determination of Load Fractions (10) - Device-Level Scheduling Algorithm
• STEP 3 – DEVICE-LEVEL SCHEDULING ALGORITHM*/**
– According to the performance models for device computation and for the bidirectional asymmetric full-duplex communication links:
• Step 3-I. Determination of the initial optimal distribution with three load fractions.
• Step 3-II. Generate additional three-fraction distributions.
• Step 3-III. Insert additional load fractions into the existing sub-distributions (iterative).
• Step 3-IV. Generate new sub-distributions by restarting.
• Step 3-V. Expand all sub-distributions.
• Step 3-VI. Select the distribution with maximum relative performance.
* Ilić, A., and Sousa, L., "Scheduling Divisible Loads on Heterogeneous Desktop Systems with Limited Memory", HeteroPar 2011.
** Ilić, A., and Sousa, L., "Algorithm for Divisible Load Scheduling on Heterogeneous Systems with Realistic Performance Models", Tech. rep., INESC-ID, May 2011.
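As a toy illustration of the final selection step (3-VI), the sketch below evaluates a handful of candidate sub-partitionings of one device load under assumed affine time models and keeps the one with the best relative performance (processed load per unit of overlapped execution time). The candidate generation here is deliberately simplistic and does not reproduce the procedure of Steps 3-I to 3-V.

```python
# Toy illustration of Step 3-VI (select the distribution with maximum relative
# performance).  Even-split candidates and affine per-fraction time models are
# assumptions only.

def pipelined_makespan(fractions, t_in, t_w, t_out):
    """Execution time with fully overlapped upload/compute/download."""
    up = wd = down = 0.0
    for g in fractions:
        up += t_in(g)
        wd = max(up, wd) + t_w(g)
        down = max(wd, down) + t_out(g)
    return down

def even_split(beta, parts):
    base, rem = divmod(beta, parts)
    return [base + (1 if i < rem else 0) for i in range(parts)]

# Assumed affine per-fraction times (seconds) for one device:
t_in  = lambda g: 0.010 * g
t_w   = lambda g: 0.050 * g + 0.02      # per-installment launch overhead
t_out = lambda g: 0.010 * g

beta = 166                               # device load to sub-partition
candidates = [even_split(beta, p) for p in (1, 2, 3, 4, 6, 8)]
best = max(candidates,
           key=lambda c: sum(c) / pipelined_makespan(c, t_in, t_w, t_out))
print(best, pipelined_makespan(best, t_in, t_w, t_out))
```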
Case Study: 2D FFT Batch
DOUBLE-PRECISION FLOATING-POINT COMPLEX 2D FFT BATCH EXECUTION
– Size: 1024 × 512 × 512 (a batch of 1024 transforms of size 512 × 512); divisible in the first dimension
– The optimal vendor-provided FFT implementations are used: NVIDIA CUFFT 3.2 for the GPU and Intel MKL 10.3 for the CPU
DISTRIBUTED HETEROGENEOUS SYSTEM WITH 4 CPU+GPU NODES
Experimental setup (per node): CPU – Intel Core 2 Quad, 2.83 GHz per core, 4096 MB global memory; GPU – 1× NVIDIA GeForce GTX 285, 1.476 GHz per core, 1024 MB global memory
CPU cores employed: Node 1 – 3, Node 2 – 2, Node 3 – 1, Node 4 – none (GPU only)
ITERATIVE PROCEDURE FOR ONLINE PERFORMANCE MODELING
– Performance of all heterogeneous devices is estimated during the execution
– No prior knowledge of the application's performance on any of the devices is available
– Initially, the load is distributed among nodes/devices using the factoring-by-two strategy*
– Limited memory: factoring-by-two partitioning of the largest loads into new sub-distributions until the memory limitations are satisfied
– In each following iteration, the load is distributed using the presented approach
* Ilić, A., and Sousa, L., "Algorithm for Divisible Load Scheduling on Heterogeneous Systems with Realistic Performance Models", Tech. rep., INESC-ID, May 2011.
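A minimal sketch of the factoring-by-two partitioning used for the first iteration, consistent with the GPU sub-distribution Γ = {32, 16, 8, 4, 2, 1, 1} shown on the next slide (the helper name is illustrative):

```python
# Factoring-by-two: repeatedly hand out half of the remaining load, so the
# installments shrink geometrically (illustrative sketch).

def factoring_by_two(load):
    fractions, rest = [], load
    while rest > 1:
        fractions.append(rest // 2)
        rest -= fractions[-1]
    fractions.append(rest)
    return fractions

print(factoring_by_two(64))   # -> [32, 16, 8, 4, 2, 1, 1]
```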
Case Study: 2D FFT Batch
ITERATION 1: FACTORING-BY-TWO STRATEGY
[Figure: relative performance vs. number of chunks for the devices of each node. The load is split equally among the nodes (α1 = α2 = α3 = α4 = 256); within Node 1, each device receives 64 chunks (β1 = β2 = β3 = β4 = 64), and the GPU load is processed with the factoring-by-two sub-distribution Γ = {32, 16, 8, 4, 2, 1, 1}.]
Case Study: 2D FFT Batch
ITERATION 1: PROPOSED ALGORITHM
[Figure: the proposed algorithm redistributes the load between the nodes as α1 = 233, α2 = 198, α3 = 148, α4 = 445; within Node 1 the devices receive β1 = 166 (GPU), β2 = 25, β3 = 21 and β4 = 21 chunks, and the GPU load is processed with two sub-distributions Γ = { {22, 41, 18, 8, 4, 1}, {8, 14, 18, 20, 9, 3} }.]
Case Study: 2D FFT Batch
ITERATION 2: PERFORMANCE MODELS
[Figure: multifunctional performance models (relative performance vs. number of chunks) built for the devices of each node after iteration 1 (α1 = 233, α2 = 198, α3 = 148, α4 = 445), including separate GPU curves per sub-distribution (Γ1, Γ2) and the GPU curve without computation/communication overlap.]
Case Study: 2D FFT Batch
ITERATION 2: MODELING EFFICIENCY
[Figure: number of points used for modeling on the four nodes: 28 (22 actual, 6 accuracy), 37 (23 actual, 14 accuracy), 34 (9 actual, 12 GPU no-overlap, 15 accuracy) and 70 (15 actual, 55 accuracy).]
Case Study: 2D FFT Batch
ITERATION 2: PERFORMANCE
[Figure: the same models annotated with the achieved GPU performance relative to a plain CUFFT execution: 4.3x, 4.3x and 4.6x higher on three of the nodes; on the remaining node the problem does not even fit in the GPU memory, so a plain CUFFT execution is not possible at all.]
Case Study: 2D FFT Batch
ITERATION 3
[Figure: the load is rebalanced to α1 = 249, α2 = 258, α3 = 241, α4 = 276; the GPUs reach 4.3x, 4.1x and 4.2x the performance of plain CUFFT (the fourth GPU again cannot run plain CUFFT because the problem does not fit in its memory). Points used for modeling: 46 (31 actual, 15 accuracy), 57 (33 actual, 24 accuracy), 74 (13 actual, 14 GPU no-overlap, 51 accuracy) and 119 (23 actual, 96 accuracy).]
Case Study: 2D FFT Batch
ITERATION 4
[Figure: the load is rebalanced to α1 = 275, α2 = 250, α3 = 263, α4 = 236; the GPUs reach 4.1x, 4.2x, 4.4x and 4.1x the performance of plain CUFFT. Points used for modeling: 91 (34 actual, 57 accuracy), 81 (36 actual, 45 accuracy), 89 (17 actual, 17 GPU no-overlap, 59 accuracy) and 155 (28 actual, 127 accuracy).]
Conclusions
LOAD BALANCING
– Obtained in 4 iterations and 6.1 seconds
TRADITIONAL APPROACHES FOR PERFORMANCE MODELING
– Approximate the performance using a number of points equal to the number of iterations
– In this case, 40 points in total
PRESENTED DDL SCHEDULING APPROACH
– Models the performance using 416 points, in this case ~10x more than with traditional modeling
– The load balancing solution is 3x faster than with the current state-of-the-art approaches
– In this case, the obtained GPU performance is at least 4.1x better than the "optimal" CUFFT execution
Questions?
Thank you