Exponential Challenges, Exponential Rewards: The Future of Moore’s Law
Based on a lecture by Shekhar Borkar
Intel Fellow
Circuit Research, Intel Labs
ISSCC 2003, Gordon Moore said:
“No exponential is forever… but we can delay ‘forever’.”
Goal: 1 TIPS by 2010
[Figure: MIPS (log scale, 0.01 to 1,000,000) versus year (1970 to 2010), with points for the 8086, 286, 386, 486, and the Pentium®, Pentium® Pro, and Pentium® 4 architectures.]
How do you get there?
Transistor Scaling
Will high-K happen? Would you count on it?
Technology Scaling
[Figure: MOSFET cross-sections before and after scaling, labeled gate, source, drain, body, junction depth Xj, oxide thickness Tox, and effective channel length Leff.]
Dimensions scale down by 30%: doubles transistor density
Oxide thickness scales down: faster transistor, higher performance
Vdd & Vt scaling: lower active power
Technology has scaled well; will it in the future?
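The density claim follows from simple geometry: if every linear dimension shrinks by 30%, each transistor occupies about 0.7 × 0.7 ≈ 0.49 of its former area, so roughly twice as many fit in the same die area. A minimal Python sketch of that arithmetic (the function name and parameters are illustrative, not from the lecture):

```python
def density_gain(linear_shrink: float = 0.30, generations: int = 1) -> float:
    """Transistor density multiplier after shrinking every linear
    dimension by `linear_shrink` for `generations` process generations."""
    scale = 1.0 - linear_shrink                     # 0.7x per generation by default
    area_per_transistor = scale ** (2 * generations)
    return 1.0 / area_per_transistor


if __name__ == "__main__":
    for g in range(1, 4):
        print(f"after {g} generation(s): {density_gain(generations=g):.2f}x density")
```

One generation gives a factor of about 2.04, which is the "doubling" the slide refers to.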
Gate Oxide is Near Limit
[Figure: cross-section and micrograph of a 130 nm transistor, labeled gate, source, drain, body, oxide thickness Tox, CoSi2, and Si3N4, with a 70 nm dimension marked.]
Will high-K happen? Would you count on it?
3D-Gate Transistor
Transistor Integration Capacity
[Figure: transistors (millions, log scale from 0.001 to 1000, with 1 billion marked) versus technology (µm, from 10 down to 0.065).]
On track for 1 billion transistor integration capacity
35 Years of Microprocessor Trend
C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
Exponential Challenge #1
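Is the Transistor a Good Switch?
[Figure: ideal switch versus transistor. On: I = ∞ for the ideal switch, I = 1 mA/µm for the transistor. Off: I = 0 for the ideal switch, I ≠ 0 for the transistor (sub-threshold leakage). Other currents marked I = 0 in the ideal case are marked I ≠ 0 for the transistor.]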
Sub-threshold Leakage
[Figure: Ioff (nA/µm, log scale from 1 to 10,000) versus temperature (30 to 130 °C), with curves from the 0.25 µm generation to the 45 nm generation.]
Assume: 0.25 µm, Ioff = 1 nA/µm; 5X increase each generation at 30 °C
Sub-threshold leakage increases exponentially
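Under the slide's stated assumption (1 nA/µm at the 0.25 µm generation, growing 5X per generation at 30 °C), the off-current compounds as sketched below. The generation list is taken from the node labels used elsewhere in the lecture, and the temperature dependence shown in the figure is deliberately not modeled.

```python
GENERATIONS = ["0.25um", "0.18um", "0.13um", "90nm", "65nm", "45nm"]


def ioff_by_generation(ioff_start_na_per_um=1.0, growth_per_generation=5.0):
    """Return {generation: Ioff in nA/um} at 30 C, assuming a fixed
    multiplicative growth per generation (the slide's assumption)."""
    return {
        name: ioff_start_na_per_um * growth_per_generation ** i
        for i, name in enumerate(GENERATIONS)
    }


if __name__ == "__main__":
    for name, ioff in ioff_by_generation().items():
        print(f"{name:>7}: {ioff:8.0f} nA/um")
```

Five generations of 5X growth take the 45 nm value to 3125 nA/µm, more than three orders of magnitude above the starting point.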
Leakage Power
[Figure: leakage power (% of total, 0% to 50%) versus technology (µm, from 1.5 down to 0.045), annotated "Must stop at 50%". Source: A. Grove, IEDM 2002.]
Leakage power limits Vt scaling
The Power Crisis
[Figure: power (W, 0 to 1200) for a 15 mm die, split into leakage and active power, across the 0.25 µm, 0.18 µm, 0.13 µm, 90 nm, 65 nm, and 45 nm generations.]
How Power Should Have Scaled
A. Danowitz et al., CPU DB: Recording Microprocessor History, ACM Queue, Processors, vol. 10, issue 4, pp. 1-18, 2012
Exponential Challenge #4
Impact on Path Delays
[Figure: probability distribution of path delay, spread by variations in Vdd, Vt, and temperature.]
Path delay variability due to technological variations
Impacts individual circuit performance and power
How many silicon atoms (111 pm) fit along a transistor channel (20 nm)? Is the 3D transistor a solution?
Optimize each circuit for performance and power
Shift in Design Paradigm
• From deterministic design to probabilistic and statistical design
  – A path delay estimate is probabilistic (not deterministic)
• Multi-variable design optimization for
  – Parameter variations
  – Active and leakage power
  – Performance
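One way to picture the probabilistic view is a small Monte Carlo experiment: treat a path as a chain of gates whose delays vary randomly, and look at the distribution of the total rather than a single number. The sketch below assumes independent Gaussian gate delays; the gate count, nominal delay, and 10% sigma are illustrative values, not figures from the lecture.

```python
import random
import statistics


def simulate_path_delays(n_gates=20, nominal_ps=10.0, sigma_frac=0.10,
                         n_samples=10_000, seed=0):
    """Sample total path delay (ps) for a chain of gates with
    independent Gaussian delay variation (an assumed model)."""
    rng = random.Random(seed)
    sigma = nominal_ps * sigma_frac
    return [
        sum(rng.gauss(nominal_ps, sigma) for _ in range(n_gates))
        for _ in range(n_samples)
    ]


def timing_yield(delays, clock_period_ps):
    """Fraction of sampled paths that meet the clock period."""
    return sum(d <= clock_period_ps for d in delays) / len(delays)


if __name__ == "__main__":
    delays = simulate_path_delays()
    print(f"mean  = {statistics.mean(delays):.1f} ps")
    print(f"stdev = {statistics.stdev(delays):.2f} ps")
    for period in (200.0, 205.0, 210.0):
        print(f"yield at {period:.0f} ps clock: {timing_yield(delays, period):.1%}")
```

A deterministic estimate would report only the nominal 200 ps. The statistical view shows that roughly half of the sampled paths miss a 200 ps clock, so the designer has to choose a timing margin against a target yield.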
Exponential Challenge #6
Exponential Costs
[Figure: log-scale cost trends versus year. Litho tool cost ($K) and fab cost ($M), source www.icknowledge.com; $ per transistor and $ per MIPS, source G. Moore, ISSCC 03.]
Some Implications
• Tox scaling will slow down, and may stop?
• Vdd scaling will slow down, and may stop?
• Vt scaling will slow down, and may stop?
• Approaching constant Vdd scaling
• Energy/logic op will not scale
[Figure: Vdd (volts, log scale from 0.1 to 100) versus technology (µm, from 10 down to 0.045), leveling off near ~1 volt.]
[Figure: energy per logic operation (normalized, log scale from 1.E+00 down to 1.E-08) versus technology (µm, from 10 down to 0.045), annotated "Slow down?".]
The Terascale Dilemma
• Many-billion transistor integration capacity will be available
  – But could be unusable due to power
• Logic transistor growth will slow down
• Transistor performance will be limited
Solutions
• Low power design techniques
• Improve design efficiency
Exponential Challenge #5
Platform Requirements
• Shrinking volume
• Quieter
• Yet, high performance
[Figure: system volume (cubic inches, 0 to 3000) for PC tower, mini tower, µ-tower, slim line, and small PC.]
[Figure: thermal budget (°C/W), heat-sink volume (in³), and air flow rate (CFM) versus power (W, 0 to 250), with Pentium® III and Pentium® 4 marked.]
Thermal budget decreasing; higher heat sink volume; higher air flow rate
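The thermal budget in °C/W can be read as the largest junction-to-ambient thermal resistance the cooling solution may have: the allowed temperature rise divided by the power dissipated. The sketch below assumes a 100 °C junction limit and a 40 °C ambient inside the chassis; both temperatures are hypothetical values chosen for illustration, not numbers from the lecture.

```python
def thermal_budget_c_per_w(power_w, t_junction_max_c=100.0, t_ambient_c=40.0):
    """Maximum allowed junction-to-ambient thermal resistance (C/W)
    for a chip dissipating `power_w` watts (temperatures are assumed)."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return (t_junction_max_c - t_ambient_c) / power_w


if __name__ == "__main__":
    for power in (30, 60, 120, 240):
        print(f"{power:>3} W -> {thermal_budget_c_per_w(power):.2f} C/W")
```

Because the allowed temperature rise is fixed, doubling the power halves the budget, which is why higher-power parts need larger heat sinks and more air flow.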
Active Power Reduction
Multiple Vdd
[Figure: slow paths are powered from a low supply voltage and the fast path from a high supply voltage.]
Throughput-oriented design
[Figure: a single logic block at Vdd is replaced by two parallel logic blocks at Vdd/2.]
Single block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Pwr Den = 1
Two blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125
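The throughput-oriented numbers above follow from the usual dynamic-power relation P ∝ C·Vdd²·f, together with the assumption, implicit in the slide, that halving Vdd halves the achievable frequency. A sketch in normalized units (the function name is hypothetical):

```python
def parallel_design(n_blocks=2, vdd=0.5, freq=0.5):
    """Normalized metrics for n identical logic blocks, relative to a
    single block at Vdd = 1, Freq = 1. Uses P ~ C * Vdd^2 * f per block."""
    power_per_block = vdd ** 2 * freq      # switched capacitance C = 1 per block
    power = n_blocks * power_per_block
    area = float(n_blocks)
    return {
        "throughput": n_blocks * freq,
        "power": power,
        "area": area,
        "power_density": power / area,
    }


if __name__ == "__main__":
    print("baseline:", parallel_design(n_blocks=1, vdd=1.0, freq=1.0))
    print("parallel:", parallel_design())
```

Two half-speed blocks deliver the same throughput at a quarter of the power, paid for with twice the area.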
Design & µArch Efficiency
[Figure: growth (X, 0 to 4) in die area, performance, and power from the previous µArch on the same process technology, for superscalar, dynamic, and deep-pipeline designs.]
[Figure: reduction in MIPS/Watt (0% to 40%) on the same process technology for superscalar, dynamic, and deep-pipeline designs; energy efficiency drops ~20%.]
Employ efficient design & µArchitectures
Improve µArch Efficiency
Thermals & power delivery designed for full HW utilization
[Figure: a single thread (ST) alternates between execution and waiting for memory; with multi-threading, threads MT1, MT2, and MT3 run during each other's memory waits. Source: Computer Architecture: A Quantitative Approach (Hennessy; Patterson, 2011).]
Multi-threading improves performance without impacting thermals & power delivery
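A simple way to see why extra threads help is an idealized utilization model: each thread computes for some cycles, then stalls on memory, and other threads run during the stall. The model below is a textbook-style simplification (identical threads, zero switching cost), and the cycle counts are illustrative values that do not come from the lecture.

```python
def core_utilization(n_threads, compute_cycles=20, memory_wait_cycles=60):
    """Fraction of cycles the core does useful work when n_threads
    identical threads interleave compute bursts with memory stalls."""
    if n_threads < 1:
        raise ValueError("need at least one thread")
    per_thread = compute_cycles / (compute_cycles + memory_wait_cycles)
    return min(1.0, n_threads * per_thread)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} thread(s): {core_utilization(n):.0%} utilization")
```

With these numbers a single thread keeps the core busy a quarter of the time, and utilization saturates at four threads. Since the thermal and power-delivery design already assumes full hardware utilization, the added throughput fits the existing envelope.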
Increase on-die Memory
[Figure: cache as % of full chip area (0% to 100%) for Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium® 4, across technologies from 0.7 µm to 0.1 µm.]
[Figure: power density (W/cm², log scale from 1 to 100) of logic versus memory, from 0.25 µm to 0.1 µm.]
Large on-die memory provides:
1. Increased data bandwidth & reduced latency
2. Hence, higher performance for much lower power
Chip Multi-Processing
Keynote presentation (L. Benini, RSP 2010).
[Figure: four cores (C1 to C4) sharing a cache.]
[Figure: relative performance (1 to 3.5) versus die area and power (1 to 4) for chip multi-processing (CMP) and single-thread (ST) designs.]
• Multi-core, each core multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Spreading hot spots
• Lower junction temperature
Example (Itanium Tukwila)
130 Watts, 30 MBytes cache
What Will the Cores Look Like?
Clocks run at the same frequency but with unknown phases
• Intelligent workload redistribution
• Multiple functionalities
• Improvement of energy efficiency
• Several interconnection possibilities
  • Mesh
  • Ring
Tera-Scale
RMS - Recognition, Mining and Synthesis
The Exponential Reward
[Figure: MIPS (log scale, 0.01 to 1,000,000) versus year (1970 to 2010), from the 8086, 286, 386, and 486 through three eras: pipelined architecture; instruction-level parallelism (superscalar; speculative, OOO); and thread- and processor-level parallelism (multi-threaded; multi-threaded, multi-core; special-purpose HW).]
Summary: Delaying Forever
• Terascale transistor integration capacity will be available; power and energy are the barriers
• Variations will be even more prominent: shift from deterministic to probabilistic design
• Improve design efficiency
• Exploit integration capacity to deliver performance in power/cost envelope
Exercises
1. Discuss a problem associated with the integration of devices.
2. Comment on the statement: “Shrinking transistor size shifts the paradigm for evaluating energy consumption and execution time from deterministic to probabilistic.”
3. Why is static energy consumption so problematic for future technologies?
4. Why is voltage reduction one of the main factors to address in order to reduce energy consumption?
5. How can a system with multiple supply voltages contribute to reducing energy consumption? What is the effect on execution time?
6. Draw an illustration showing how a multithreaded program can make better use of a system's resources, reducing the memory communication bottleneck.
7. Why does the share of memory inside an integrated circuit exceed 50% in current processors?
8. Given the limits of scaling, what can be done to keep machine performance increasing?
9. What are the trends in computation (cores), communication infrastructure, and storage for upcoming processors?