On Realistic Divisible Load Scheduling for Highly Heterogeneous Distributed Systems

Aleksandar Ilić and Leonel Sousa
INESC-ID / IST, TU Lisbon, Portugal
Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento em Lisboa
PDP 2012, Special Session on GPU and Hybrid Computing, Germany, 2012

Motivation

• COMMODITY COMPUTERS = HETEROGENEOUS SYSTEMS
  – Multi-core CPUs, GPUs, AMD Fusion, accelerators/co-processors, FPGAs …
  – Interconnections: bidirectional, full-duplex, asymmetric, with different concurrency levels
• CLUSTERS OF COMMODITY COMPUTERS = HIGHLY HETEROGENEOUS SYSTEMS
  – (POWER OF COMMODITY COMPUTERS)^N => (HETEROGENEITY PROBLEMS)^N
  – Communication: even more complex if the one-port model is not assumed
• Heterogeneity offers tremendous computing power
• Heterogeneity makes life difficult even for a single application
  – Traditionally: different computation/communication speeds (not constants at all)
  – Additionally: different programming models, languages, tools, implementations …
=> REALISTIC PERFORMANCE MODELING, SCHEDULING AND LOAD BALANCING

Outline

• DISCRETELY DIVISIBLE LOAD (DDL) PROCESSING
• HETEROGENEOUS DISTRIBUTED SYSTEMS WITH HETEROGENEOUS CPU+GPU DESKTOPS AS COMPUTING NODES
• PERFORMANCE MODELING AND THE 3-STEP HIERARCHICAL DDL SCHEDULING ALGORITHM
• CASE STUDY: 2D FFT BATCH EXECUTION
• CONCLUSIONS AND FUTURE WORK

Discretely Divisible Load Processing

• DISCRETELY DIVISIBLE LOAD (DDL) APPLICATIONS
  – Computations divisible into pieces of arbitrary integer sizes
  – Fractions processed independently and in parallel, with no precedence constraints (an illustrative sketch of such an integer split follows this slide)
• APPLICABLE TO A WIDE RANGE OF SCIENTIFIC PROBLEMS
  – Linear algebra, digital signal and image processing, database applications …
• STATE-OF-THE-ART DDL APPROACHES IN HETEROGENEOUS DISTRIBUTED COMPUTING
  – Very recent, but only for traditional distributed systems (no hierarchy, CPU-only)
  – Assume symmetric bandwidth and a one-port model for communication links
  – Limited memory: only the size of the input load is considered, or the exceeding load is simply redistributed among the nodes with available memory
  – Unrealistic: computation/communication time is assumed to be a linear/affine function of the number of chunks
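A minimal Python sketch (an added illustration, not part of the original slides) of what "discretely divisible" means in practice: an integer load is split into integer chunks in proportion to assumed relative device speeds. The name split_ddl and the speed values are assumptions used only for illustration.

```python
# Minimal sketch: split a discretely divisible load of N independent units among
# workers in proportion to assumed relative speeds (illustrative values only).

def split_ddl(total_units: int, speeds: list[float]) -> list[int]:
    """Return integer chunk counts proportional to each worker's speed."""
    s = sum(speeds)
    chunks = [int(total_units * v / s) for v in speeds]
    # Hand the leftover units, one at a time, to the fastest workers
    remainder = total_units - sum(chunks)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:remainder]:
        chunks[i] += 1
    return chunks

print(split_ddl(1024, [4.0, 1.0, 1.0, 0.5]))  # -> [631, 158, 157, 78]
```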
Heterogeneous Desktop Systems

HETEROGENEOUS STAR NETWORK (MASTER–WORKER)
[Figure: block diagram of a desktop node — a multi-core CPU (cores #1..#N, shared cache and DRAM) connected through interconnection buses to devices #1..#N (GPUs and other accelerators), each with its own local DRAM]

• MULTI-CORE CPU (Master)
  – Global execution controller; accesses the whole global memory
  – All cores are employed for execution
• INTERCONNECTION BUSES
  – Bidirectional, full-duplex, asymmetric communication
  – Different concurrency levels
  – Potential execution bottleneck
• DEVICES (Distant workers)
  – Different architectures and programming models
  – Computation performed using local memories

Commodity Cluster Systems

HETEROGENEOUS STAR NETWORK
• MASTER NODE
  – A heterogeneous desktop system
  – Positioned at the center; employed for execution as a whole
• INTERCONNECTION NETWORK
  – Limited, asymmetric, one-port communication
  – Potential execution bottleneck
• WORKER NODES
  – Heterogeneous desktop systems
  – All available devices are employed in each node

Proposed Algorithm Outline

3-STEP HIERARCHICAL DDL SCHEDULING:
• STEP 1 – SYSTEM-LEVEL LOAD BALANCING
  – How many load units to send to each node?
• STEP 2 – NODE-LEVEL LOAD BALANCING
  – How many chunks to send to each device from a given node's load?
• STEP 3 – DEVICE-LEVEL SCHEDULING
  – How to sub-partition the device load so as to:
    • Reduce delays when distributing and retrieving the load
    • Overlap computation and communication
    • Efficiently use the bidirectional, asymmetric bandwidth of the buses
    • Respect the amount of supported concurrency
    • Fit into the limited device memory
(A schematic sketch of the three levels follows this slide.)
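A schematic Python sketch (added illustration; the structure and the name hierarchical_schedule are assumptions, and split_ddl is reused from the sketch above). It only shows how the three decision levels nest; the actual per-level criteria are described on the following slides.

```python
# Schematic sketch of the 3-step hierarchy: total load -> nodes -> devices ->
# memory-bounded installments. Reuses split_ddl from the previous sketch.

def hierarchical_schedule(total_units, node_speeds, device_speeds_per_node,
                          device_mem_chunks_per_node):
    plan = []
    node_loads = split_ddl(total_units, node_speeds)                 # Step 1: system level
    for n, node_load in enumerate(node_loads):
        dev_loads = split_ddl(node_load, device_speeds_per_node[n])  # Step 2: node level
        devices = []
        for d, dev_load in enumerate(dev_loads):                     # Step 3: device level
            limit = device_mem_chunks_per_node[n][d]                 # chunks fitting in memory
            installments = []
            while dev_load > 0:        # naive multi-installment split by the memory limit
                take = min(dev_load, limit)
                installments.append(take)
                dev_load -= take
            devices.append(installments)
        plan.append(devices)
    return plan
```

For example, with four equally fast nodes, hierarchical_schedule(1024, [1, 1, 1, 1], ...) would first hand 256 units to each node before the node- and device-level splits are applied.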
Performance Modeling

FUNCTIONAL PERFORMANCE MODELS
– Built from real application execution
  • No assumptions are made to ease the modeling!

• NODE-LEVEL PERFORMANCE MODELS
  – Computation performance models for each master core and each distant worker
  – Full-duplex communication bandwidth: bidirectional and asymmetric, for each link
  – Total performance of each device, including both computation and communication
• SYSTEM-LEVEL PERFORMANCE MODELS
  – Computation performance models for each node (heterogeneous desktop system)
  – Communication bandwidth: bidirectional and asymmetric, for each network link
  – Total performance of each node

Determination of Load Fractions (1)

• STEP 1 – SYSTEM-LEVEL LOAD BALANCING
  – The optimal distribution between nodes lies on a straight line passing through the origin of the coordinate system and intersecting the total performance curves of the nodes*
• STEP 2 – NODE-LEVEL LOAD BALANCING
  – Each node's load fraction is sub-partitioned between the node's devices; the optimal distribution lies on a straight line passing through the origin and intersecting the communication-aware total performance curves of the devices*
(A sketch of this equal-execution-time partitioning follows this slide.)

* Lastovetsky, A., and Reddy, R., "Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of their Functional Performance Models", HeteroPar 2009, LNCS, vol. 6043, Springer, pp. 91–101, 2010.
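A minimal Python sketch (added illustration; an assumption rather than the exact procedure of Lastovetsky and Reddy) of partitioning under functional performance models: N units are distributed so that every device's execution time x_i / s_i(x_i) is approximately equal, i.e. the chosen points (x_i, s_i(x_i)) lie on one straight line through the origin of the performance plane. Here speed_fns[i](x) stands for the measured speed (units/s) of device i on x units, and the bisection over the common time T assumes x / s_i(x) grows with x.

```python
def units_within(fn, T, cap):
    """Largest x <= cap whose execution time x / fn(x) does not exceed T."""
    x = 0
    while x < cap and fn(x + 1) > 0 and (x + 1) / fn(x + 1) <= T:
        x += 1
    return x

def balance(N, speed_fns, iters=60):
    lo, hi = 0.0, 1e9                          # bracket on the common execution time T
    for _ in range(iters):                     # bisection on T
        T = (lo + hi) / 2
        if sum(units_within(fn, T, N) for fn in speed_fns) < N:
            lo = T
        else:
            hi = T
    dist = [units_within(fn, hi, N) for fn in speed_fns]
    while sum(dist) > N:                       # crude trim so the fractions sum to N
        dist[max(range(len(dist)), key=lambda j: dist[j])] -= 1
    return dist

# Constant speeds reduce to a proportional split: a 4x faster GPU gets ~4x more units.
print(balance(1024, [lambda x: 400.0, lambda x: 100.0, lambda x: 100.0]))  # [682, 171, 171]
```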
Determination of Load Fractions (2) – Device-Level Scheduling

• STEP 3 – DEVICE-LEVEL SCHEDULING
  – Per-device distributions are allowed to exceed the device memory limits b_j
  – Device-level multi-installment processing with multi-distributions: sub-distributions with sub-load fractions

Determination of Load Fractions (3) – Limited Memory

• STEP 3 – DEVICE-LEVEL SCHEDULING
  – Application memory requirements are modeled with three functions of the load size:
    • Input memory requirements
    • Output memory requirements
    • Execution memory requirements
  – Different implementations of the same problem might have different memory requirements!
  => In each sub-distribution the whole amount of device memory may be consumed, such that the combined input, output and execution requirements of its sub-fractions fit within the limit b_j (see the sketch after this slide)
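A minimal Python sketch (added illustration with assumed interfaces and purely illustrative memory numbers) of fitting a sub-fraction into the device memory limit b_j using the three application-specific requirement functions named on the slide.

```python
# Sizing a sub-fraction so that its input, output and execution memory requirements
# fit into a device memory limit b_j (bytes). Assumes the footprint grows with the
# chunk count, so a binary search over integer chunk counts suffices.

def max_chunks_in_memory(b_j, mem_in, mem_out, mem_exec, upper):
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mem_in(mid) + mem_out(mid) + mem_exec(mid) <= b_j:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Illustrative only: one 512x512 complex-double FFT chunk needs ~4 MiB of input and
# ~4 MiB of output; the execution requirement below is a made-up 8 MiB per chunk.
MiB = 1 << 20
print(max_chunks_in_memory(1024 * MiB,
                           lambda g: 4 * MiB * g,
                           lambda g: 4 * MiB * g,
                           lambda g: 8 * MiB * g,
                           upper=1024))   # -> 64
```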
Determination of Load Fractions (4) – Computation/Communication Overlapping

• STEP 3 – DEVICE-LEVEL SCHEDULING
  – For each sub-distribution, the sub-fraction sizes are carefully chosen to overlap computation and communication between subsequent sub-fractions as well as possible
  – Overlapping shortens the execution time relative to a non-overlapped schedule; the GAIN is the difference between the no-overlap and the overlapped execution times (see the sketch below)
  – The decisions are made according to the amount of overlapping concurrency supported by the device:
  [Figure: pipelined schedules of sub-fractions γ_k with input transfer t_i(γ_k), computation t_w(γ_k) and output transfer t_o(γ_k): (a) overlap of a single communication with computation at a time; (b) complete concurrency between communication and computation]
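A minimal Python sketch (added illustration; my own simplification of the slide's timing diagrams, not the authors' model) comparing a fully serialized schedule with a fully concurrent one (mode (b)): each sub-fraction g has input-transfer, computation and output-transfer times t_in(g), t_w(g), t_out(g), and the three activities act as pipeline stages.

```python
def time_no_overlap(fracs, t_in, t_w, t_out):
    # Input, computation and output of every sub-fraction strictly serialized.
    return sum(t_in(g) + t_w(g) + t_out(g) for g in fracs)

def time_overlapped(fracs, t_in, t_w, t_out):
    # Mode (b): input, computation and output may all run concurrently,
    # each on a different sub-fraction (three pipeline stages).
    in_done = comp_done = out_done = 0.0
    for g in fracs:
        in_done += t_in(g)                            # next input right after the previous
        comp_done = max(in_done, comp_done) + t_w(g)  # computing waits for its own input
        out_done = max(comp_done, out_done) + t_out(g)
    return out_done

# Illustrative gain for four sub-fractions with 1 s in, 3 s compute, 1 s out each:
fr = [1, 2, 3, 4]
print(time_no_overlap(fr, lambda g: 1.0, lambda g: 3.0, lambda g: 1.0))   # 20.0
print(time_overlapped(fr, lambda g: 1.0, lambda g: 3.0, lambda g: 1.0))   # 14.0
```

Mode (a), where only one communication may overlap with computation at a time, would additionally serialize the input and output transfers on a single channel and therefore falls between these two extremes.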
Determination of Load Fractions (5) – Device-Level Scheduling Algorithm

• STEP 3 – DEVICE-LEVEL SCHEDULING ALGORITHM*/**
  – Driven by the performance models for device computation and for the bidirectional, asymmetric, full-duplex communication links
  [Figure: pipelined schedule of sub-fractions γ_k with input transfers t_i(γ_k), computations t_w(γ_k) and output transfers t_o(γ_k)]
  • Step 3-I.   Determine the initial optimal distribution with three load fractions.
  • Step 3-II.  Generate additional three-fraction distributions.
  • Step 3-III. Insert additional load fractions into existing sub-distributions (iterative).
  • Step 3-IV.  Generate new sub-distributions by restarting.
  • Step 3-V.   Expand all sub-distributions.
  • Step 3-VI.  Select the distribution with maximum relative performance.
(A schematic skeleton of the selection step follows this slide.)

* Ilić, A., and Sousa, L., "Scheduling Divisible Loads on Heterogeneous Desktop Systems with Limited Memory", HeteroPar 2011.
** Ilić, A., and Sousa, L., "Algorithm For Divisible Load Scheduling on Heterogeneous Systems with Realistic Performance Models", Tech. rep., INESC-ID, May 2011.
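A highly schematic Python sketch (added illustration; the candidate-generation rules of Steps 3-I to 3-V are deliberately left out because they are defined in the cited HeteroPar 2011 paper and INESC-ID technical report). It only shows the shape of the final selection in Step 3-VI: each candidate multi-installment distribution is scored with the overlapped-execution model from the earlier sketch, and the one with the maximum relative performance is kept.

```python
# dist is a list of sub-distributions, each a list of sub-fraction sizes.
# time_overlapped is reused from the previous sketch; sub-distributions are
# assumed here to be processed one after another.

def relative_performance(dist, t_in, t_w, t_out):
    total = sum(sum(sub) for sub in dist)
    span = sum(time_overlapped(sub, t_in, t_w, t_out) for sub in dist)
    return total / span if span > 0 else 0.0

def select_distribution(candidates, t_in, t_w, t_out):
    """Step 3-VI: keep the candidate with the maximum relative performance."""
    return max(candidates, key=lambda d: relative_performance(d, t_in, t_w, t_out))
```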
Case Study: 2D FFT Batch

DOUBLE-PRECISION COMPLEX 2D FFT BATCH EXECUTION
  – Size: a batch of 1024 FFTs, each 512 × 512; divisible in the first dimension
  – The optimal vendor-provided FFT implementations are used: NVIDIA CUFFT 3.2 for the GPU and Intel MKL 10.3 for the CPU

EXPERIMENTAL SETUP: DISTRIBUTED HETEROGENEOUS SYSTEM WITH 4 CPU+GPU NODES

                         CPU (Intel Core 2 Quad)   GPU (NVIDIA GeForce GTX 285)
  Speed per core (GHz)   2.83                      1.476
  Global memory (MB)     4096                      1024

  #GPUs employed: 1 per node
  #CPU cores employed: Node 1: 3; Node 2: 2; Node 3: 1; Node 4: none (GPU only)

ITERATIVE PROCEDURE FOR ONLINE PERFORMANCE MODELING
  – PERFORMANCE ESTIMATION of all heterogeneous devices DURING THE EXECUTION
  – No prior knowledge of the application's performance is available for any of the devices
  – INITIALLY, the load is distributed among nodes/devices using a FACTORING-BY-TWO STRATEGY* (an illustrative sketch follows this slide)
  – Limited memory: factoring-by-two partitioning of the largest loads into new sub-distributions until the memory limitations are satisfied
  – IN EACH FOLLOWING ITERATION, the load is distributed using the PRESENTED APPROACH

* Ilić, A., and Sousa, L., "Algorithm For Divisible Load Scheduling on Heterogeneous Systems with Realistic Performance Models", Tech. rep., INESC-ID, May 2011.
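A minimal Python sketch (added illustration, inferred from the example on the next slide rather than from a published definition) of a factoring-by-two initial sub-distribution: the remaining device load is repeatedly halved, so a load of 64 chunks yields the installments {32, 16, 8, 4, 2, 1, 1} shown for iteration 1.

```python
def factoring_by_two(beta: int) -> list[int]:
    """Split a device load of beta chunks by repeatedly sending half of what remains."""
    parts, remaining = [], beta
    while remaining > 1:
        half = remaining // 2
        parts.append(half)
        remaining -= half
    if remaining:
        parts.append(remaining)   # the final single chunk
    return parts

print(factoring_by_two(64))   # -> [32, 16, 8, 4, 2, 1, 1]
```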
Case Study: 2D FFT Batch – Iteration 1: Factoring-by-Two Strategy

[Figure: relative performance vs. number of chunks, one panel per node. Every node receives α_i = 256 chunks, split equally among its devices (β = 64 per GPU/CPU core); the GPU installments follow Γ = { {32, 16, 8, 4, 2, 1, 1} }]

Case Study: 2D FFT Batch – Iteration 1: Proposed Algorithm

[Figure: relative performance vs. number of chunks, one panel per node. The new distribution is α1 = 233, α2 = 198, α3 = 148, α4 = 445; within node 1, β1 = 166 (GPU), β2 = 25, β3 = 21, β4 = 21 (CPU cores), with the GPU multi-installment distribution Γ = { {22, 41, 18, 8, 4, 1}, {8, 14, 18, 20, 9, 3} }]

Case Study: 2D FFT Batch – Iteration 2: Performance Models

[Figure: multi-functional performance models (GPU, CPU cores, and GPU without overlap) built online for each node, for the distribution α1 = 233, α2 = 198, α3 = 148, α4 = 445]

Case Study: 2D FFT Batch – Iteration 2: Modeling Efficiency

[Figure: per-node performance models annotated with the number of points used for modeling — Node 1: 28 points (22 actual, 6 accuracy); Node 2: 37 points (23 actual, 14 accuracy); Node 3: 34 points (9 actual, 12 GPU-no-overlap, 15 accuracy); Node 4: 70 points (15 actual, 55 accuracy)]
(An illustrative sketch of such an online performance model follows this slide.)
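A minimal Python sketch (added illustration; an assumed realization, not the authors' code) of an online functional performance model for one device: every scheduling iteration contributes the (chunk count, measured time) pairs it actually executed, and the scheduler interpolates piecewise-linearly when it needs the speed at an unmeasured chunk count.

```python
import bisect

class FunctionalPerformanceModel:
    def __init__(self):
        self.points = []                       # sorted (chunks, speed) pairs

    def add_measurement(self, chunks: int, seconds: float) -> None:
        speed = chunks / seconds               # relative performance in chunks/s
        bisect.insort(self.points, (chunks, speed))

    def speed(self, chunks: int) -> float:
        xs = [p[0] for p in self.points]
        i = bisect.bisect_left(xs, chunks)
        if i == 0:
            return self.points[0][1]
        if i == len(xs):
            return self.points[-1][1]
        (x0, s0), (x1, s1) = self.points[i - 1], self.points[i]
        return s0 + (s1 - s0) * (chunks - x0) / (x1 - x0)   # linear interpolation

# Each iteration adds the chunk sizes it actually executed (illustrative timings only):
m = FunctionalPerformanceModel()
m.add_measurement(64, 0.50)
m.add_measurement(166, 1.10)
print(round(m.speed(100), 1))   # -> 136.1 chunks/s, interpolated between the two points
```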
Case Study: 2D FFT Batch – Iteration 2: Performance

[Figure: per-node performance with the distribution α1 = 233, α2 = 198, α3 = 148, α4 = 445. GPU speedup over the stand-alone CUFFT execution — Node 1: 4.3x (28 modeling points: 22 actual, 6 accuracy); Node 2: 4.3x (37 points: 23 actual, 14 accuracy); Node 3: 4.6x (34 points: 9 actual, 12 GPU-no-overlap, 15 accuracy); Node 4: ∞, since the problem does not fit in the GPU memory and stand-alone CUFFT cannot execute it (70 points: 15 actual, 55 accuracy)]

Case Study: 2D FFT Batch – Iteration 3

[Figure: per-node performance with the distribution α1 = 249, α2 = 258, α3 = 241, α4 = 276. GPU speedup over CUFFT — Node 1: 4.3x (46 points: 31 actual, 15 accuracy); Node 2: 4.1x (57 points: 33 actual, 24 accuracy); Node 3: 4.2x (74 points: 13 actual, 14 GPU-no-overlap, 51 accuracy); Node 4: ∞, load exceeds the GPU memory (119 points: 23 actual, 96 accuracy)]

Case Study: 2D FFT Batch – Iteration 4

[Figure: per-node performance with the distribution α1 = 275, α2 = 250, α3 = 263, α4 = 236. GPU speedup over CUFFT — Node 1: 4.1x (91 points: 34 actual, 57 accuracy); Node 2: 4.2x (81 points: 36 actual, 45 accuracy); Node 3: 4.4x (89 points: 17 actual, 17 GPU-no-overlap, 59 accuracy); Node 4: 4.1x (155 points: 28 actual, 127 accuracy)]

Conclusions

• LOAD BALANCING – OBTAINED IN 4 ITERATIONS AND 6.1 SECONDS
• TRADITIONAL APPROACHES FOR PERFORMANCE MODELING
  – Approximate the performance using a number of points equal to the number of iterations
  – In this case, 40 POINTS in total
• PRESENTED DDL SCHEDULING APPROACH
  – Models the performance using 416 POINTS in this case, ~10x more than with traditional modeling
  – The load-balancing solution is obtained 3x faster than with current state-of-the-art approaches
  – IN THIS CASE, THE OBTAINED GPU PERFORMANCE IS AT LEAST 4.1X BETTER THAN THE "OPTIMAL" STAND-ALONE CUFFT EXECUTION

Questions?

Thank you