Exponential Challenges, Exponential Rewards: The Future of Moore's Law
Based on a lecture by Shekhar Borkar, Intel Fellow, Circuit Research, Intel Labs

ISSCC 2003: Gordon Moore said…
“No exponential is forever… But we can delay forever”

Goal: 1 TIPS by 2010
[Figure: MIPS (log scale, 0.01 to 1,000,000) vs. year (1970 to 2010); points: 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, Pentium® 4 Architecture]
How do you get there?

Transistor Scaling

Technology Scaling
[Figure: transistor cross-sections labelled gate, source, drain, body, Tox, Xj, Leff, D]
• Dimensions scale down by 30%: doubles transistor density
• Oxide thickness scales down: faster transistor, higher performance
• Vdd & Vt scaling: lower active power
Technology has scaled well; will it in the future?

Gate Oxide is Near Limit
[Figure: transistor cross-sections (gate, source, drain, body, Tox) and a 130 nm transistor; labels: 70 nm, CoSi2, Si3N4]
Will high K happen? Would you count on it?

3D-Gate Transistor

Transistor Integration Capacity
[Figure: transistors (millions, log scale, 0.001 to 1000, 1 billion marked) vs. technology (µm): 10, 5, 2, 1, 0.5, 0.25, 0.13, 0.065]
On track for 1 billion transistor integration capacity

35 Years of Microprocessor Trend
Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Exponential Challenge #1

Is Transistor a Good Switch?
[Figure: ideal switch vs. transistor. On: I = ∞ (ideal), I = 1 mA/µm (transistor). Off: I = 0 (ideal), I ≠ 0 (transistor): sub-threshold leakage]

Sub-threshold Leakage
[Figure: Ioff (nA/µm, log scale, 1 to 10000) vs. temperature (30 to 130 °C), one curve per generation from 0.25 µm to 45 nm]
Assume: 0.25 µm, Ioff = 1 nA/µm; 5X increase each generation at 30 °C
Sub-threshold leakage increases exponentially

Leakage Power
[Figure: leakage power (% of total, 0% to 50%) vs. technology (µm): 1.5, 0.7, 0.35, 0.18, 0.09, 0.045; annotation: must stop at 50%]
Source: A. Grove, IEDM 2002
Leakage power limits Vt scaling

The Power Crisis
[Figure: power (W, 0 to 1200) of a 15 mm die, split into leakage and active, for 0.25 µm, 0.18 µm, 0.13 µm, 90 nm, 65 nm, 45 nm]

How Power Should Have Scaled
Source: A. Danowitz et al., CPU DB: Recording Microprocessor History, ACM Queue (Processors), vol. 10, issue 4, pp. 1-18, 2012

Exponential Challenge #4

Impact on Path Delays
[Figure: probability vs. path delay; spread due to variations in Vdd, Vt, and temperature]
• Path delay variability due to technological variations
• Impacts individual circuit performance and power
• Optimize each circuit for performance and power
• How many silicon atoms (111 pm each) fit along a 20 nm transistor channel?
• Is the 3D transistor a solution?

Shift in Design Paradigm
• From deterministic design to probabilistic and statistical design
  – A path delay estimate is probabilistic (not deterministic)
• Multi-variable design optimization for
  – Parameter variations
  – Active and leakage power
  – Performance

Exponential Challenge #6

Exponential Costs
[Figures: litho tool cost ($K) and fab cost ($M) vs. year; $ per transistor and $ per MIPS vs. year (1960 to 2010), all on log scales]
Sources: www.icknowledge.com; G. Moore, ISSCC 03

Some Implications
• Tox scaling will slow down (may stop?)
• Vdd scaling will slow down (may stop?)
• Vt scaling will slow down (may stop?)
• Approaching constant Vdd scaling
• Energy/logic op will not scale (slow down?)
[Figure: Vdd (volts, log scale) vs. technology (µm, 10 to 0.045), levelling off at ~1 volt]
[Figure: energy/logic operation (normalized, 1.E+00 to 1.E-08) vs. technology (µm, 10 to 0.045)]

The Terascale Dilemma
• Many-billion-transistor integration capacity will be available
  – But could be unusable due to power
• Logic transistor growth will slow down
• Transistor performance will be limited
Solutions
• Low power design techniques
• Improve design efficiency

Exponential Challenge #5

Platform Requirements
• Shrinking volume
• Quieter
• Yet, high performance
[Figure: system volume (cubic inches, 0 to 3000) for PC tower, mini tower, µ-tower, slim line, small PC]
[Figure: thermal budget (°C/W), heat-sink volume (in³) and air flow rate (CFM) vs. power (W, 0 to 250); Pentium® III and Pentium® 4 marked]
• Thermal budget decreasing
• Higher heat sink volume
• Higher air flow rate

Active Power Reduction
Multiple Vdd
[Figure: blocks labelled Slow, Fast, Slow; legend: High Supply Voltage, Low Supply Voltage]
Throughput oriented design

                 1 logic block at Vdd    2 logic blocks at Vdd/2
  Freq           1                       0.5
  Vdd            1                       0.5
  Throughput     1                       1
  Power          1                       0.25
  Area           1                       2
  Pwr Den        1                       0.125

Design & µArch Efficiency
[Figure: growth (X, 0 to 4) from previous µArch in die area, performance and power, on the same process technology, for S-Scalar, Dynamic, Deep Pipeline]
[Figure: reduction in MIPS/Watt (0% to 40%), on the same process technology, for S-Scalar, Dynamic, Deep Pipeline; annotation: energy efficiency drops ~20%]
Employ efficient design & µArchitectures

Improve µArch Efficiency
Thermals & power delivery designed for full HW utilization
[Figure: execution timelines. Single thread: ST, wait for memory. Multi-threading: MT1, MT2 and MT3 fill the memory waits]
Source: Computer Architecture: A Quantitative Approach (Hennessy; Patterson, 2011)
Multi-threading improves performance without impacting thermals & power delivery

Increase on-die Memory
[Figure: power density (W/cm², log scale, 1 to 100) of logic vs. memory, 0.25 µm to 0.1 µm; Pentium® 4 marked]
[Figure: cache as % of full chip area (0% to 100%), 0.7 µm to 0.10 µm; Pentium, Pentium Pro, Pentium II, Pentium III, Pentium III & 4]
Large on-die memory provides:
1. Increased data bandwidth & reduced latency
2. Hence, higher performance for much lower power

Chip Multi-Processing
Source: Keynote presentation (L. Benini, RSP 2010)
[Figure: four cores (C1 to C4) around a shared cache; relative performance (1 to 3.5) vs. die area and power (1 to 4) for CMP and ST]
• Multi-core, each core multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Spreading hot spots
• Lower junction temperature

Example (Itanium Tukwila)
• 130 Watts
• 30 MBytes cache

What Will the Cores Look Like?
• Clocks run with the same frequency but unknown phases
• Intelligent workload redistribution
• Multiple functionalities
• Improvement of energy efficiency
• Several interconnection possibilities
  – Mesh
  – Ring

Tera-Scale
RMS: Recognition, Mining and Synthesis

The Exponential Reward
[Figure: MIPS (log scale, 0.01 to 1,000,000) vs. year (1970 to 2010); points: 8086, 286, 386, 486; techniques: Super Scalar; Speculative, OOO; Multi-Threaded; Multi-Threaded, Multi-Core; Special Purpose HW; eras: Pipelined Architecture, Instruction Level Parallelism, Thread & Processor Level Parallelism]

Summary: Delaying Forever
• Terascale transistor integration capacity will be available
  – Power and energy are the barriers
• Variations will be even more prominent
  – Shift from deterministic to probabilistic design
• Improve design efficiency
• Exploit integration capacity to deliver performance in power/cost envelope

Exercises
1. Discuss one problem associated with the integration of devices.
2. Comment on the statement: “The reduction in transistor size shifts the paradigm for evaluating energy consumption and execution time from deterministic to probabilistic.”
3. Why is static energy consumption so problematic for future technologies?
4. Why is reducing the supply voltage one of the main levers for reducing energy consumption?
5. How can a system with multiple supply voltages contribute to reducing energy consumption? What is the effect on execution time?
6. Draw an illustration showing how a multithreaded program can make better use of a system's resources, reducing the memory communication bottleneck.
7. Why does on-chip memory take up more than 50% of an integrated circuit in current processors?
8. Given the limits of scaling, what can be done to keep machine performance increasing?
9. What are the trends in computation (cores), communication infrastructure and storage for upcoming processors?
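
Worked example for the questions on supply voltage and multiple supplies

The numbers on the Active Power Reduction slide (one block at Vdd versus two blocks at Vdd/2) can be reproduced with the usual first-order dynamic power model, P ∝ C·Vdd²·f. This is a minimal sketch rather than the lecture's own method. It assumes that switched capacitance is proportional to area, that frequency scales linearly with Vdd (consistent with the slide's Freq = 0.5 at Vdd = 0.5), and that leakage is ignored. The names `Design` and `replicate_at_lower_vdd` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Design:
    """Normalized figures for one throughput-oriented design point."""
    freq: float
    vdd: float
    throughput: float
    power: float
    area: float
    power_density: float


def replicate_at_lower_vdd(blocks: int, vdd: float) -> Design:
    """Run `blocks` copies of a logic block at supply `vdd` (baseline = 1.0).

    Assumptions (not stated in the lecture): frequency scales linearly
    with Vdd, capacitance scales with area, dynamic power = C * Vdd^2 * f,
    and leakage power is ignored.
    """
    freq = vdd                       # assumed linear frequency/Vdd relation
    area = float(blocks)             # each copy adds one unit of area
    power = area * vdd ** 2 * freq   # C * V^2 * f, with C proportional to area
    return Design(
        freq=freq,
        vdd=vdd,
        throughput=blocks * freq,    # the copies work in parallel
        power=power,
        area=area,
        power_density=power / area,
    )


baseline = replicate_at_lower_vdd(blocks=1, vdd=1.0)
halved = replicate_at_lower_vdd(blocks=2, vdd=0.5)
print(baseline)
print(halved)
```

Under these assumptions, two blocks at Vdd = 0.5 give throughput 1, power 0.25, area 2 and power density 0.125, the same values as on the slide. Each block runs at half the frequency, so a single operation takes twice as long even though total throughput is unchanged.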