Computer Architecture II
4. Examples of Some Current Processors
2004/2005
4.1. IA-32 Architecture
Paulo Marques
Departamento de Eng. Informática
Universidade de Coimbra
[email protected]
» The x86 isn’t all that complex –
It just doesn’t make a lot of sense «
Mike Johnson, Leader of the 80x86 design at AMD
Microprocessor Report (1994)
A brief history...
1978: The Intel 8086 is announced (16-bit architecture)
1980: The 8087 floating-point coprocessor is added
1982: The 80286 increases the address space to 24 bits and adds instructions
1985: The 80386 extends the architecture to 32 bits and adds new addressing modes
1989-1995: The 80486, Pentium and Pentium Pro add a few instructions (mostly designed for higher performance)
1997: 57 new “MMX” instructions are added; Pentium II
1999: The Pentium III adds another 70 instructions (SSE)
2001: Another 144 instructions (SSE2)
2003: AMD extends the architecture to a 64-bit address space, widens all registers to 64 bits and makes other changes (AMD64)
2004: Intel capitulates and embraces AMD64 (calling it EM64T) and adds more media extensions
The problem of “legacy” and “backward compatibility”
Overview
Complexity:
Instructions can be from 1 to 15 bytes long
One operand always acts as both source and destination
One operand can come from memory
Complex addressing modes

What “saved” the architecture over the years:
The most frequent instructions are not hard to implement
Compilers do not generate the slow instructions and do not use the slow parts of the architecture
The processor was converted internally to a RISC design, keeping only a front end that decodes the complex instructions into simple, RISC-like µOPs
... and market volume
Registers (FP not shown)
Instructions
Two-operand instructions (e.g. ADD AX, BX)
Different source/destination types:
Register/Register
Register/Immediate
Register/Memory
Memory/Register
Memory/Immediate

Multiple addressing modes:
Absolute (e.g. MOV AX, [1000])
Register indirect (e.g. MOV AX, [SI])
Based with 8/16/32-bit displacement (e.g. MOV AX, [SI+100])
Indexed (e.g. MOV AX, [SI+BX])
Based indexed with displacement (e.g. MOV AX, [SI+BX+100])
Base + scaled index (address = BaseReg + 2^Scale * IndexReg)
Base + scaled index with displacement (as above, plus a displacement)
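All of the addressing modes listed above are special cases of one formula: base + 2^scale * index + displacement. The Python sketch below only illustrates that arithmetic. The function name, the register contents and the 32-bit wrap-around are assumptions made for the example, not part of the slides.

```python
def effective_address(base=0, index=0, scale=0, disp=0, bits=32):
    """Return base + 2**scale * index + disp, wrapped to the address width."""
    if scale not in (0, 1, 2, 3):
        raise ValueError("scale must be 0..3 (factor 1, 2, 4 or 8)")
    return (base + (index << scale) + disp) & ((1 << bits) - 1)


# Hypothetical register contents, chosen only for illustration.
regs = {"ESI": 0x1000, "EBX": 0x10}

absolute = effective_address(disp=1000)                     # [1000]
indirect = effective_address(base=regs["ESI"])              # [ESI]
based = effective_address(base=regs["ESI"], disp=100)       # [ESI+100]
scaled = effective_address(base=regs["ESI"], index=regs["EBX"],
                           scale=2, disp=8)                 # [ESI+EBX*4+8]

print(absolute, hex(indirect), hex(based), hex(scaled))
```

Leaving a field at zero gives the simpler modes, which is why a single address-generation unit can serve all of them.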
Multiple addressing modes
Instructions (just a few...)
In many cases the registers are not general purpose!
Instruction Encoding
Extensions to the IA-32 architecture
MMX, SSE and SSE2 instructions
They consist of:
MMX: operations on integer vectors (64-bit vectors holding 8-, 16- or 32-bit numbers)
SSE: operations on single-precision floating-point vectors (vectors of 4 IEEE 754 floats)
SSE2: operations on double-precision floating-point vectors (vectors of 2 IEEE 754 doubles), plus an extension of the integer vectors (128-bit vectors holding 8-, 16-, 32- or 64-bit numbers)
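To make "operations on integer vectors" concrete, the sketch below emulates in plain Python what a packed byte addition does to a 64-bit value: eight 8-bit lanes are added independently, each lane wraps around on overflow, and no carry crosses into the next lane. The helper names are hypothetical. This models the behaviour of a wrap-around packed add such as MMX PADDB; it is not how the hardware implements it.

```python
def pack_bytes(values):
    """Pack 8 unsigned bytes into one 64-bit integer (element 0 in the low byte)."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit integer back into its 8 byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_bytes(a, b):
    """Add eight 8-bit lanes independently, wrapping each lane modulo 256."""
    lanes = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


a = pack_bytes([1, 2, 3, 4, 5, 6, 7, 250])
b = pack_bytes([10, 10, 10, 10, 10, 10, 10, 10])
result = unpack_bytes(packed_add_bytes(a, b))
print(result)
```

The last lane shows the wrap-around: 250 + 10 does not fit in 8 bits, and the overflow does not spill into a neighbouring lane.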
4.2. Intel Pentium 4
IA-32 Instructions and µOPs
All modern implementations of the IA-32 architecture convert the original instructions into a sequence of micro-instructions:
At Intel these are called µOPs
µOPs are quite similar to RISC instructions: fixed size, uniform format, etc.
An IA-32 instruction is at least 1 µOP. A complex instruction can correspond to hundreds of them (!) (e.g. REP MOVSB)
[Figure: the instruction MOV AX, [1000] expanded into a sequence of µOPs (µOP 1 to µOP 4)]
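The toy decoder below illustrates why the µOP count varies so much from one IA-32 instruction to another. The µOP names and the per-instruction breakdown are invented for the example; real µOP formats are not described in these slides. For REP MOVSB the sketch simply counts the µOPs executed over all iterations, whereas a real processor would run a microcode loop.

```python
def decode(instruction, count=1):
    """Return a list of made-up µOP names for a few IA-32 instructions."""
    if instruction == "ADD AX, BX":        # register/register: a single µOP
        return ["add AX, AX, BX"]
    if instruction == "ADD [1000], AX":    # memory destination: load, compute, store
        return ["load tmp, [1000]", "add tmp, tmp, AX", "store [1000], tmp"]
    if instruction == "REP MOVSB":         # string move: work grows with the count
        uops = []
        for _ in range(count):
            uops += ["load tmp, [SI]", "store [DI], tmp", "update SI, DI, CX"]
        return uops
    raise ValueError("instruction not covered by this sketch")


print(len(decode("ADD AX, BX")))
print(len(decode("ADD [1000], AX")))
print(len(decode("REP MOVSB", count=100)))
```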
Some features of the Pentium 4 (2000)
Speculative-execution pipeline with several functional units (NetBurst architecture):
20-stage pipeline
7 functional units
Up to 126 µOPs in flight in the pipeline (of which 48 loads and 24 stores)
Completes up to 3 µOPs per clock cycle
ALUs run at twice the clock rate

Uses a Trace Cache
Two Branch Target Buffers:
Front-end: 4K entries
Trace cache: 512 entries

Uses Register Renaming (8 registers → 128) in addition to a Re-order Buffer:
Register Renaming eliminates name dependences
The Re-order Buffer guarantees the commit order of the instructions
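A minimal sketch of register renaming, assuming 8 architectural and 128 physical registers as on the slide. Every instruction that writes a register receives a fresh physical register, so two instructions that merely reuse the same register name no longer conflict. The tuple format and the function name are assumptions made for the example. Returning physical registers to the free list at commit time is deliberately left out.

```python
def rename(instructions, num_arch=8, num_phys=128):
    """Rename (dest, src1, src2) triples of architectural register numbers.

    Each write gets a fresh physical register, so WAR and WAW (name)
    dependences disappear; RAW (true) dependences are kept through the map.
    """
    mapping = {r: r for r in range(num_arch)}   # arch register -> current physical register
    free = list(range(num_arch, num_phys))      # physical registers not yet in use
    renamed = []
    for dest, src1, src2 in instructions:
        p1, p2 = mapping[src1], mapping[src2]   # read sources before remapping dest
        if not free:
            raise RuntimeError("out of physical registers (a real core would stall)")
        pd = free.pop(0)
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [
    (0, 1, 2),  # r0 = r1 op r2
    (3, 0, 1),  # r3 = r0 op r1  (true dependence on r0)
    (0, 4, 5),  # r0 = r4 op r5  (reuses the name r0: WAW with 1st, WAR with 2nd)
]
renamed_program = rename(program)
for line in renamed_program:
    print(line)
```

After renaming, the third instruction writes a different physical register from the first, so it can execute out of order without overwriting a value the second instruction still needs.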
Pentium 4 Overview
Pipeline Layout
Trace Cache
A trace cache is a sophisticated version of an (L1) Instruction Cache
When the trace cache is accessed with the address of a given IA-32 instruction, one of 3 things happens:
The translation of the instruction is in the cache. Up to 3 µOPs are produced. Those 3 may represent between 1 and 3 IA-32 instructions, so the IA-32 PC advances by 1 to 3 instructions.
The translation of the instruction is in the cache, but the instruction needs more than 4 µOPs. For these “complex instructions”, control passes to a program in a microcode ROM until the complete sequence has been produced.
The translation is not in the cache. In this case the IA-32 decoder is used to translate the instruction, and the result is placed in the cache.
Note that the next time the instruction is executed, it will typically already be decoded in the cache
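The three outcomes of a trace cache access can be sketched as a simple lookup. This is a deliberately simplified model: it caches the translation of one instruction per address, while a real trace cache stores whole traces. The µOP counts returned by the stand-in decoder are invented for the example; only the "more than 4 µOPs" threshold comes from the text.

```python
MICROCODE_THRESHOLD = 4  # more than 4 µOPs -> microcode ROM, as stated in the text


def decode_ia32(address):
    """Stand-in for the IA-32 decoder; returns made-up µOP lists per address."""
    hypothetical_uop_counts = {0x100: 1, 0x104: 2, 0x108: 12}
    return ["uop"] * hypothetical_uop_counts[address]


def lookup(trace_cache, address):
    """Classify one access as 'miss', 'microcode' or 'hit'."""
    if address not in trace_cache:
        trace_cache[address] = decode_ia32(address)  # decode and fill on a miss
        return "miss"
    if len(trace_cache[address]) > MICROCODE_THRESHOLD:
        return "microcode"                           # sequence comes from the microcode ROM
    return "hit"                                     # µOPs are delivered from the cache


cache = {}
outcomes = [lookup(cache, a) for a in (0x100, 0x100, 0x108, 0x108)]
print(outcomes)
```

The second access to each address no longer needs the IA-32 decoder, which is the point made in the note above.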
Trace Cache (2)
The trace cache stores sequences of executed instructions that continue past branches
Detailed View of the Pentium 4 (2000)
Pentium 4 Die
4.3. AMD Opteron (& Athlon64)
Top processors on SPEC2000 (July/04)
[Figure: bar chart of CPU integer performance (CINT2000, axis from 0 to 1800). Processors shown: Intel Pentium4 HT 3.4GHz ExtremeEdition (Mar/04); AMD Opteron150 2.4GHz (May/04); Intel Xeon 3.2GHz (Feb/04); Fujitsu SPARC64V 1.9GHz (Jun/04); Itanium2 1.5GHz (Dec/03); IBM POWER4+ 1.9GHz (May/04); Alpha 21264C 1.2GHz (Nov/02); PowerMac G5 2.0GHz (Dec/03)***]
Top processors on SPEC2000 (July/04)
[Figure: bar chart of CPU floating point performance (CFP2000, axis from 0 to 2500). Processors shown: HP / Itanium2 1.5GHz (Feb/04); Fujitsu SPARC64V 1.9GHz (Jun/04); IBM POWER4+ 1.7GHz (May/04); AMD Opteron248 2.2GHz (May/04); Pentium4 HT 3.4GHz ExtremeEdition (Mar/04); Alpha21364 1.2GHz (May/03); AMD AthlonFX-51 2.2GHz (Sep/03); Xeon 3.2GHz (Apr/04)]
Processor Market
The PC market has led Intel and AMD to really boost the integer performance of their processors:
To the point that they have largely surpassed the performance available in classical RISC chips

Floating-point performance is increasing, although RISC/Vector/VLIW processors still have an edge:
No consumer need in the PC market
Scientific workstations need FP performance

In the server market what matters is not so much peak performance as throughput and reliability:
Xeon systems
Itanium
POWER4+
64-bit World
64-bit machines have been available for a long time in the scientific and business markets:
e.g. SPARCv9, Alpha, POWER4+, ...

What does 64-bit bring?
Increased address space (32-bit: 4 GByte max; 64-bit: 16,384 PByte!)
Increased dynamic range for variables (32-bit int: 0-4294967295; 64-bit int: 0-18446744073709551615)

64-bit does not automatically bring increased performance!
It may have the opposite effect: pointers and 64-bit integers are twice as large, so the memory traffic for them doubles when going from 32-bit to 64-bit!
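The figures on this slide follow directly from powers of two, as the short check below shows. The helper names and the 1000-pointer array are illustrative assumptions. GByte and PByte are taken as 2^30 and 2^50 bytes, which is what the slide's numbers imply.

```python
def address_space_bytes(bits):
    """Number of distinct byte addresses with the given pointer width."""
    return 2 ** bits


def unsigned_range(bits):
    """Smallest and largest unsigned integer of the given width."""
    return 0, 2 ** bits - 1


GBYTE = 2 ** 30
PBYTE = 2 ** 50

print(address_space_bytes(32) // GBYTE)  # GByte addressable with 32 bits
print(address_space_bytes(64) // PBYTE)  # PByte addressable with 64 bits
print(unsigned_range(32)[1])
print(unsigned_range(64)[1])

# An array of 1000 pointers: 4 bytes each in 32-bit mode, 8 bytes each in 64-bit mode.
pointer_bytes_32 = 1000 * 4
pointer_bytes_64 = 1000 * 8
print(pointer_bytes_32, pointer_bytes_64)
```

The last line is the reason for the performance caveat: the same pointer-heavy data structure occupies twice the cache space and memory bandwidth in 64-bit mode.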
Main contenders in the 64-bit server market

SPARCv9 (Sun and Fujitsu)
Future uncertain. Mostly used in the high-end market; keeps going partly because of its installed base.

Intel Itanium2
Future uncertain. AMD's processors are much better, and Intel EM64T is a copy of AMD64. Poor performance for its price when compared with the competition.

AMD64 Opteron (and Athlon64)
Has taken the lead in the market by proposing an architecture that runs both 32-bit and 64-bit applications with good performance. Superior memory bandwidth. Problem: IT’S NOT INTEL!

Intel’s Extended Memory 64 Technology (EM64T) processors
Intel licensed the AMD technology and has launched an architecture that is exactly (or almost) the same. It is currently available in high-end Xeon machines.
Note: IBM POWER4+ still dominates the high-end multi-way server market
AMD64 – Dual Mode
[Figure: a “legacy” 32-bit application (4GB memory limit) and a 64-bit application running side by side on a 64-bit operating system (e.g. Linux64 or Windows2003-64)]
AMD has proposed an architecture which allows the execution of both 32-bit and 64-bit applications (x86-64):
No need to recompile old applications
32-bit applications execute with the same performance
64-bit applications take advantage of a larger address space, more registers, etc.

Operating System Support:
Linux (SuSE, Redhat, ...)
Windows Server 2003 (beta)
Solaris (2nd half of 2004)
FreeBSD & NetBSD
“Java 1.5”
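The Instruction Set Architecture
[Figure: AMD64 programmer-visible state, split into what was already “In x86” and what was “Added by AMD64”.
Registers: the general-purpose registers (GPR) EAX to EDI, with sub-registers such as AH/AL, are widened from 32 to 64 bits (e.g. RAX, bits 63-0) and joined by R8 to R15; the 128-bit registers XMM0 to XMM7 are joined by XMM8 to XMM15; the 80-bit x87 registers and EIP are also shown. Annotation: “(INTEL’s look alike!)”.
Instructions: IA-32 instructions + new prefixes; new 64-bit mode instructions]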
Why More Registers?
[Figure: number of registers each function in the program needs]
Question: If processors do Register Renaming, why do we need more programmer-visible registers?
AMD Opteron Architecture
[Figure: AMD Opteron™ processor architecture. Blocks: AMD64 core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, HyperTransport™ technology. Labels: “Directly to memory”, “6.4GB/sec”, “To other processors/devices”]
The memory controller is included in the CPU
HyperTransport:
Standard point-to-point link for high-speed circuits (international consortium)
3x 6.4GB/sec inter-processor connections
Up to 19.2GB/s peak aggregate bandwidth
(AMD Athlon64 only has one HyperTransport link)
Difference to traditional systems
[Figure: two block diagrams. Traditional system: CPU, North Bridge, DDR memory, PCI-X bridges (PCI-X), South Bridge (PCI; IDE, FDC, USB, etc.). Opteron system: Opteron CPU with its own DDR, links to other CPUs or devices (each with their own DDR), PCI-X bridges (PCI-X), I/O Hub (PCI; IDE, FDC, USB, etc.)]
AMD64 Core (Opteron – Hammer)
Superscalar Out-of-Order Multi-Issue Processor
10 Execution Units:
3 Integer ALUs
3 FP ALUs
3 Address calculation Units
1 Load/Store Unit

12 stage pipeline:
17 stages for FP

The IA-32 instructions are translated into MacroOps (MOPs):
single-part MOPs: arithmetic operations or memory accesses
two-part MOPs: an arithmetic operation and a memory access

Dynamic Branch Prediction:
Local history table + Global history table (16K entries)
Branch Target Buffer: 2K branches

Integrated DDR Memory Controller
Opteron’s Core
Moving Instructions from Memory to Cache
When code is first moved into the Athlon's L1 instruction cache, the processor's predecode logic examines the newly cached lump of code in order to detect individual instruction boundaries, and it marks those boundaries with a small amount of "metadata" so that the front end has less work to perform. The predecode logic also marks static branches.
This predecoding process moves some of the front-end work to an earlier portion of the pipeline, speeding the actual fetch and decode phases later. The drawback is that the extra metadata eats up valuable L1 I-cache space.
[Figure: Memory → Instruction Cache → Processor]
Processor Frontend
[Figure: processor front end]
16 bytes are read at a time (about 5 IA-32 instructions)
FastPath Decoder (instructions that translate into 2 MOPs max):
max 3 IA-32 instructions per clock
max 3 MOPs per clock
Micro ROM (everything else):
max 1 IA-32 instruction per clock
max 3 MOPs per clock
Issue slots (3 instructions)
Opteron’s Pipeline
Opteron’s Die
Reading material
Computer Architecture: A Quantitative Approach:
Section 3.10
Appendix D

Articles:
Jon "Hannibal" Stokes, “The Pentium 4 and the G4e: an Architectural Comparison: Part I”, in Ars Technica, July 2001
http://arstechnica.com/articles/paedia/cpu/p4andg4e.ars/1
Jon "Hannibal" Stokes, “The Pentium 4 and the G4e: an Architectural Comparison: Part II”, in Ars Technica, July 2001
http://arstechnica.com/articles/paedia/cpu/p4andg4e2.ars
Jon "Hannibal" Stokes, “Inside AMD's Hammer: the 64-bit architecture behind the Opteron and Athlon 64”, in Ars Technica, January 2005
http://arstechnica.com/articles/paedia/cpu/amd-hammer-1.ars
Viktor Kartunov, “Facts & Assumptions about the Architecture of AMD Opteron and Athlon 64”, in Digit-Life
http://www.digit-life.com/articles2/amd-hammer-family/index.html