Arquitectura de Computadores II
6. Multi-Processamento (2004/2005)
6.1. Introduction

Paulo Marques
Departamento de Eng. Informática, Universidade de Coimbra
[email protected]

Important Note
This part of the course is largely based on an international course taught at DEI in September 2003 on "Cluster Computing and Parallel Programming". The original slides can be found at:
http://eden.dei.uc.pt/~pmarques/courses/best2003/pmarques_best.pdf
Besides those materials, the main sources are Chapter 6 of [CAQA] and Chapter 9 of "Computer Organization and Design".

Motivation
I have a program that takes 7 days to execute, which is far too long for practical use. How do I make it run in 1 day?
- Work smarter! (i.e. find better algorithms)
- Work faster! (i.e. buy a faster processor/memory/machine)
- Work harder! (i.e. add more processors!!!)

Motivation
We are interested in the last approach: add more processors! (We don't care about being too smart or spending too much $$$ on bigger, faster machines.) Why?
- It may not be feasible to find better algorithms
- Faster, bigger machines are normally very expensive
- There are lots of computers available in any institution (especially at night)
- There are computer centers from which you can buy parallel machine time
- Adding more processors enables you not only to run things faster, but to run bigger problems

Motivation
"Adding more processors enables you not only to run things faster, but to run bigger problems"?!
"9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months."
This is, informally, the Gustafson-Barsis law. What the Gustafson-Barsis law tells us is that when the size of the problem grows, there is normally more parallelism available.

Arquitectura de Computadores II
6. Multi-Processamento (2004/2005)
6.2. Machine Architectures

Paulo Marques
Departamento de Eng. Informática, Universidade de Coimbra
[email protected]

von Neumann Architecture
- Based on the fetch-decode-execute cycle
- The computer executes a single sequence of instructions that act on data
- Both program and data are stored in memory
(Figure: a single flow of instructions acting on data in memory.)

Flynn's Taxonomy
Classifies computers according to the number of execution flows and the number of data flows:
- SISD: Single-Instruction, Single-Data (one execution flow, one data flow)
- SIMD: Single-Instruction, Multiple-Data (one execution flow, multiple data flows)
- MISD: Multiple-Instruction, Single-Data (multiple execution flows, one data flow)
- MIMD: Multiple-Instruction, Multiple-Data (multiple execution flows, multiple data flows)

Single Instruction, Single Data (SISD)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Most PCs, single-CPU workstations, ...

Single Instruction, Multiple Data (SIMD)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing
- Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions
(Figure: a single ADD V3, V1, V2 instruction adds each of the eight elements of vector register V1 to the corresponding element of V2, writing the eight results to V3.)

The Connection Machine 2 (SIMD)
The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) one-bit processors working in parallel.
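To make the SIMD idea concrete, here is a minimal C sketch, added to these notes rather than taken from the original slides. It performs, one element at a time, the same element-wise addition that the ADD V3, V1, V2 vector instruction in the figure above performs across all eight lanes in a single instruction; the element values are the ones shown in the figure, and a vectorizing compiler may well map such a loop onto SIMD instructions like MMX/SSE.

```c
#include <stdio.h>

#define LANES 8  /* number of elements handled by one vector instruction */

int main(void) {
    /* values taken from the SIMD figure above */
    int v1[LANES] = { 1,  3, 4,  5, 21,  3,  3,  5};
    int v2[LANES] = {32, 43, 2, 46, 87, 65, 43, 32};
    int v3[LANES];

    /* Scalar version: LANES separate additions.
       A SIMD unit performs all of them with one ADD V3, V1, V2. */
    for (int i = 0; i < LANES; i++)
        v3[i] = v1[i] + v2[i];

    for (int i = 0; i < LANES; i++)
        printf("%d ", v3[i]);
    printf("\n");
    return 0;
}
```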
Multiple Instruction, Single Data (MISD)
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
  - the Data Flow Architecture

Multiple Instruction, Multiple Data (MIMD)
- Currently the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, computer clusters, multi-processor SMP machines (including some types of PCs)

IBM BlueGene/L DD2
Department of Energy, Lawrence Livermore National Laboratory (California, USA). Currently the fastest machine on earth (70 TFLOPS).
Some facts:
- 32768x 700 MHz PowerPC 440 CPUs (dual processors)
- 512 MB RAM per node, for a total of 16 TB of RAM
- 3D torus network; 300 MB/s per node

IBM BlueGene/L DD2 (packaging hierarchy, from the original figure)
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node Board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR

What about Memory?
- The interface between CPUs and memory in parallel machines is of crucial importance
- The bus between memory and CPU is often the bottleneck, known as the von Neumann bottleneck
- It limits how fast a machine can operate, i.e. the achievable ratio between computation and communication

Communication in Parallel Machines
Programs act on data, so a quite important question is: how do processors access each other's data?
(Figure: in the message passing model, each CPU has its own memory and CPUs communicate over a network; in the shared memory model, all CPUs access a single memory.)

Shared Memory
- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space
- Multiple processors can operate independently but share the same memory resources
- Changes in a memory location made by one processor are visible to all other processors
- Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA
(Figure: a single 4-processor machine in which all CPUs share one memory, UMA: Uniform Memory Access, versus a 3-processor machine in which each CPU has its own memory joined by a fast memory interconnect, NUMA: Non-Uniform Memory Access.)

Uniform Memory Access (UMA)
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors, with equal access and access times to memory
- Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- Very hard to scale

Non-Uniform Memory Access (NUMA)
- Often made by physically linking two or more SMPs
- One SMP can directly access memory of another SMP
- Not all processors have equal access time to all memories
- Sometimes called DSM – Distributed Shared Memory

Advantages:
- User-friendly programming perspective on memory
- Data sharing between tasks is both fast and uniform due to the proximity of memory and CPUs
- More scalable than SMPs

Disadvantages:
- Lack of scalability between memory and CPUs
- The programmer is responsible for the synchronization constructs that ensure "correct" access to global memory
- Expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors

UMA and NUMA
- The Power Mac G5 features 2 PowerPC 970/G5 processors that share a common central memory (up to 8 GB)
- SGI Origin 3900: 16 R14000A processors per brick, each brick with 32 GB of RAM; 12.8 GB/s aggregate memory bandwidth (scales up to 512 processors and 1 TB of memory)

Distributed Memory (DM)
- Processors have their own local memory. Memory addresses in one processor do not map to another processor (there is no global address space)
- Because each processor has its own local memory, cache coherency does not apply
- Requires a communication network to connect inter-processor memory
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility
- Very scalable
- Cost effective: uses off-the-shelf processors and networking
- Slower than UMA and NUMA machines

Distributed Memory
(Figure: several computers, each with its own CPU and memory, connected by a network interconnect. TITAN@DEI is such a machine: a PC cluster interconnected by Fast Ethernet.)

Hybrid Architectures
- Today, most systems are hybrids featuring both shared and distributed memory
- Each node has several processors that share a central memory
- A fast switch interconnects the several nodes
- In some cases the interconnect allows memory to be mapped among nodes; in most cases it provides a message passing interface
(Figure: four nodes, each with four CPUs sharing a memory, connected by a fast network interconnect.)

ASCI White at the Lawrence Livermore National Laboratory
- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors/node)
- Each node has 16 GB of memory
- A total of 512 nodes, interconnected by a 2 GB/s node-to-node network
- The 512 nodes feature a total of 8192 processors and a total of 8192 GB of memory
- It currently operates at 13.8 TFLOPS

Summary
- CC-UMA
  - Examples: SMPs, Sun Vexx, SGI Challenge, IBM Power3
  - Programming: MPI, Threads, OpenMP, shmem
  - Scalability: <10 processors
  - Drawbacks: limited memory bandwidth; hard to scale
  - Software availability: great
- CC-NUMA
  - Examples: SGI Origin, HP Exemplar, IBM Power4
  - Programming: MPI, Threads, OpenMP, shmem
  - Scalability: <1000 processors
  - Drawbacks: new architecture; point-to-point communication
  - Software availability: great
- Distributed / Hybrid
  - Examples: Cray T3E, IBM SP2
  - Programming: MPI
  - Scalability: ~1000 processors
  - Drawbacks: costly system administration; programming is hard to develop and maintain
  - Software availability: limited

Summary (2)
(Figure: plot of the top 500 supercomputer sites over a decade.)

Arquitectura de Computadores II
6. Multi-Processamento (2004/2005)
6.3. Programming Models and Challenges

Paulo Marques
Departamento de Eng. Informática, Universidade de Coimbra
[email protected]

Warning
We will now introduce the main ways in which you can program a parallel machine. Don't worry if you don't immediately visualize all the primitives that the APIs provide. We will cover that later. For now, you just have to understand the main ideas behind each paradigm.
In summary: DON'T PANIC!

The Main Programming Models
- A programming model abstracts the programmer from the hardware implementation
- The programmer sees the whole machine as a big virtual computer which runs several tasks at the same time
- The main models in current use are:
  - Shared Memory
  - Message Passing
  - Data Parallel / Parallel Programming Languages
- Note that this classification is not all-inclusive. There are hybrid approaches, and some of the models overlap (e.g. data parallel with shared memory/message passing)

Shared Memory Model
(Figure: four processes or threads, A to D, all reading and writing variables such as double matrix_A[N], matrix_B[N] and result[N] that live in globally accessible (shared) memory.)

Shared Memory Model
- Independently of the hardware, each program sees a global address space
- Several tasks execute at the same time and read and write from/to the same virtual memory
- Locks and semaphores may be used to control access to the shared memory
- An advantage of this model is that there is no notion of data "ownership". Thus, there is no need to explicitly specify the communication of data between tasks, and program development can often be simplified.
- An important disadvantage is that it becomes more difficult to understand and manage data locality. Performance can be seriously affected.

Shared Memory Modes
There are two major shared memory models:
- All tasks have access to all the address space (typical in UMA machines running several threads)
- Each task has its own address space. Most of the address space is private; a certain zone is visible across all tasks (typical in DSM machines running different processes)
(Figure: on the left, tasks share a single address space; on the right, tasks have mostly private address spaces with a shared region mapped into all of them.)

Shared Memory Model – Closely Coupled Implementations
- On shared memory platforms, the compiler translates user program variables into global memory addresses
- Typically a thread model is used for developing the applications: POSIX Threads, OpenMP
- There are also some parallel programming languages that offer a global memory model, although data and tasks are distributed
- For DSM machines, no standard exists, although there are some proprietary implementations

Shared Memory – Thread Model
- A single process can have multiple threads of execution
- Each thread can be scheduled on a different processor, taking advantage of the hardware
- All threads share the same address space
- From a programming perspective, thread implementations commonly comprise:
  - A library of subroutines that are called from within parallel code
  - A set of compiler directives embedded in either serial or parallel source code
- Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP

POSIX Threads
- Library based; requires parallel coding
- Specified by the IEEE POSIX 1003.1c standard (1995); also known as PThreads
- C language
- Most hardware vendors now offer PThreads
- Very explicit parallelism; requires significant programmer attention to detail
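As a concrete illustration of the PThreads model just described, here is a minimal sketch, added to these notes rather than taken from the original slides. Two threads run in the same address space and update a shared counter; a mutex plays the role of the lock mentioned in the shared memory model slides. The thread count and iteration count are arbitrary.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared state: visible to all threads of the process. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   /* explicit synchronization */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 2000000 with the mutex held */
    return 0;
}
```

Built with a PThreads-capable toolchain (e.g. gcc -pthread), the program prints 2000000. Removing the mutex calls makes the final value unpredictable, which is exactly the kind of detail behind the slide's "significant programmer attention" remark.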
OpenMP
- Compiler-directive based; can use serial code
- Jointly defined and endorsed by a group of major computer hardware and software vendors
- The OpenMP Fortran API was released on October 28, 1997; the C/C++ API was released in late 1998
- Portable / multi-platform, including Unix and Windows NT platforms
- Available in C/C++ and Fortran implementations
- Can be very easy and simple to use; provides for "incremental parallelism"
- No free compilers available

Message Passing Model
The programmer must send and receive messages explicitly.
- A set of tasks use their own local memory during computation
- Tasks exchange data through communications, by sending and receiving messages
- Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines
- Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

Message Passing Implementations
- Message passing is generally implemented as libraries which the programmer calls
- A variety of message passing libraries have been available since the 1980s
- These implementations differed substantially from each other, making it difficult for programmers to develop portable applications
- In 1992 the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations

MPI – The Message Passing Interface
- Part 1 of the Message Passing Interface (MPI), the core, was released in 1994. Part 2 (MPI-2), the extensions, was released in 1996.
- MPI is now the de facto industry standard for message passing
- Freely available on the web: http://www.mpi-forum.org/docs/docs.html
- Nevertheless, most systems do not implement the full specification, especially MPI-2
- For shared memory architectures, MPI implementations usually don't use a network for task communications. Typically a set of devices is provided, some for network communication and some for shared memory; in most cases they can coexist.

Data Parallel Model
Typically, a set of tasks performs the same operations on different parts of a big array.
- Most of the parallel work focuses on performing operations on a data set
- The data set is organized into a common structure, such as an array or cube
- A set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure
- Tasks perform the same operation on their partition of work, for example, "add 4 to every array element"
- On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.

Data Parallel Programming
- Typically accomplished by writing a program with data parallel constructs. In most cases, parallel compilers are used: calls to a data parallel subroutine library, or compiler directives
- High Performance Fortran (HPF): extensions to Fortran 90 to support data parallel programming
- Compiler directives allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms
- Distributed memory implementations have the compiler convert the program into calls to a message passing library, so that the data is distributed to all the processes. All message passing is done invisibly to the programmer.
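To make these models more concrete before the summary that follows, here are two minimal C sketches. They are illustrations added to these notes rather than code from the original slides; the array size, variable names and message value are arbitrary. The first uses an OpenMP directive to express the data parallel example mentioned above ("add 4 to every array element"): an OpenMP-capable compiler splits the loop iterations among threads, while a serial compiler simply ignores the pragma and the program still runs correctly, which is the "incremental parallelism" idea.

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static double a[N];  /* zero-initialized shared array */

    /* Data parallel: every iteration applies the same operation
       ("add 4") to a different element. The directive asks the
       compiler to split the iterations among the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] += 4.0;

    printf("a[0] = %f\n", a[0]);
    return 0;
}
```

The second sketch shows the message passing model with MPI: task 0 sends an integer and task 1 posts the matching receive, the send()/receive() pairing described above. It would typically be launched with something like mpirun -np 2.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which task am I? */

    if (rank == 0) {
        value = 42;
        /* every send must have a matching receive */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```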
Summary
Middleware for parallel programming:
- Shared memory: all the tasks (threads or processes) see a global address space. They read and write directly from memory and synchronize explicitly.
- Message passing: the tasks have private memory. To exchange information, they send and receive data through a network. There is always a send() and a receive() primitive.
- Data parallel: the tasks work on different parts of a big array. Typically accomplished by using a parallel compiler which allows the data distribution to be specified.

Final Considerations...
Beware of Amdahl's Law!

Load Balancing
- Load balancing is always a factor to consider when developing a parallel application
- Too large a granularity leads to poor load balancing; too small a granularity leads to too much communication
- The computation/communication ratio is of crucial importance!
(Figure: a timeline of three tasks; with poor load balancing, some tasks spend much of the time waiting while others are still working.)

Amdahl's Law
The speedup depends on the amount of code that cannot be parallelized:

$$\mathrm{speedup}(n, s) = \frac{T}{(1-s)\,T/n + s\,T} = \frac{1}{(1-s)/n + s}$$

where n is the number of processors, s is the percentage of the code that cannot be made parallel, and T is the time it takes to run the code serially.

Amdahl's Law – The Bad News!
(Figure: speedup vs. number of processors, from 1 to 30, for non-parallel fractions of 0%, 5%, 10% and 20%. Only the 0% curve follows the linear speedup line; the other curves flatten out quickly.)

Efficiency Using 30 Processors

$$\mathrm{efficiency}(n, s) = \frac{\mathrm{speedup}(n, s)}{n}$$

(Figure: efficiency, in percent, obtained with 30 processors for non-parallel fractions of 0%, 5%, 10% and 20%; efficiency drops sharply as the non-parallel fraction grows.)
For example, with s = 5% and n = 30, the speedup is 1 / (0.95/30 + 0.05), roughly 12.2, i.e. an efficiency of about 41%.

What Is That "s" Anyway?
Three slides ago, s was defined as "the percentage of code that cannot be made parallel". Actually, it is worse than that: s is the percentage of time that cannot be spent executing in parallel. It can be:
- Time spent communicating
- Time spent waiting for/sending jobs
- Time spent waiting for the completion of other processes
- Time spent calling the middleware for parallel programming
Remember: if s is even as small as 0.05, the maximum speedup is only 20.

Maximum Speedup

$$\mathrm{speedup}(n, s) = \frac{1}{(1-s)/n + s}$$

If you have infinitely many processors, the (1-s)/n term goes to 0, so the maximum possible speedup is 1/s:

  non-parallel fraction (s)   maximum speedup
  0%                          unbounded (linear speedup)
  5%                          20
  10%                         10
  20%                         5
  25%                         4

On the Positive Side...
- You can run bigger problems
- You can run several simultaneous jobs (you have more parallelism available)
- Gustafson-Barsis with no equations: "9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months"

Arquitectura de Computadores II
6. Multi-Processamento (2004/2005)
6.4. Hardware

Paulo Marques
Departamento de Eng. Informática, Universidade de Coimbra
[email protected]

The Cache Coherence Problem (UMA)
(Figure only.)

Maintaining Coherence: Snooping
(Figure only.)

Snooping: Block Reads and Writes
- Multiple copies of a block are not a problem as long as only reads occur
- However, when a write occurs, a processor must have exclusive access to the block it wants to write
- On a read, processors must also always obtain the most recent value of the block in question
- In snooping protocols, when a write occurs the hardware must locate all the caches that hold a copy of the block. There are then two possible approaches:
  - Invalidate all caches that contain that block (write-invalidate)
  - Update all caches that contain that block (write-update)

Snooping Protocol (Example)
(Figure only.)

The Cache Coherence Problem (NUMA)
- The snooping approach does not scale to machines with tens or hundreds of processors (NUMA)
- In that case another kind of protocol is used: directory-based protocols
- A directory is a centralized place that keeps information about who holds each block
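As a small, hedged illustration of what this coherence machinery buys the programmer (added to these notes, not taken from the original slides): in the sketch below, one thread writes a data value and then sets a flag, and another thread waits for the flag and reads the data. On a CC-UMA machine with a write-invalidate snooping protocol, the producer's writes invalidate any stale copies in the other processor's cache, so its subsequent reads fetch the up-to-date values. The C11 atomics are needed at the language level so that the compiler and processor do not reorder the two writes or the two reads; the coherence protocol alone does not provide that ordering.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* 'data' and 'flag' may end up cached by different processors. */
static int data = 0;
static atomic_int flag;          /* static storage: starts at 0 */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                   /* write the block holding 'data'        */
    atomic_store(&flag, 1);      /* other caches' copies become stale     */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load(&flag) == 0)  /* re-reads miss in cache and fetch  */
        ;                            /* the up-to-date value              */
    printf("data = %d\n", data);     /* guaranteed to print 42            */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```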
Arquitectura de Computadores II
6. Multi-Processamento (2004/2005)
6.5. Recent Developments and Examples

Paulo Marques
Departamento de Eng. Informática, Universidade de Coimbra
[email protected]

Trends
- It has become extremely hard to keep scaling processors in terms of individual performance and clock rate
- The future is MULTI-PROCESSING!!!
- Intel, like other manufacturers, has introduced Simultaneous Multi-Threading (SMT), called HyperThreading in its terminology
  - A potentially reasonable performance gain (at most ~30%) at the cost of a small amount of extra transistors (~5%)
  - Warning: it can also lead to worse performance!
  - It prepares programmers for concurrent programming!!! (the widespread opinion is that HyperThreading served mainly for that)
- Dual-core processors (two processors on the same die and/or package) will become commonplace over the next 2-3 years
- Multi-processor servers (SMP – Symmetric MultiProcessing) are already commonplace
- Clusters are already commonplace

Advertisements...
(Figure only.)

How does HyperThreading work? (1)
(Figure: a "normal" superscalar processor compared with a dual-processor (SMP) system.)

How does HyperThreading work? (2)
(Figure: a time-sliced multithreaded CPU, a "super-threaded" CPU, compared with a hyper-threaded CPU.)

Motivations for Simultaneous Multi-Threading (SMT)
- There are normally more functional units available than are actually being used
- Today's computers are constantly executing more than one program/thread
- Basic-block sizes and the available instruction-level parallelism (ILP) are limited
- There is independent work available to do; it just isn't in the same thread!
- One aspect where this approach is very useful is in hiding unavoidable latencies, such as memory accesses or branch mispredictions
  - E.g. a thread that has to read data from memory may stall for a long time while the data does not arrive. With SMT, another thread can execute during that time.

Implementation (Basic Idea)
- Replicate the processor front-end and everything that is visible at the ISA (Instruction Set Architecture) level, e.g. registers, program counters, etc.
- In this way, one physical processor becomes two (logical) processors
- Some resources are partitioned (e.g. instruction issue queues) and others are shared (e.g. reorder buffers)

To finish... an example of a cluster: the Google cluster
(Figure: machines of the Google cluster.)
- It has to serve 1000 queries/second, with each query taking no more than 0.5 s!
- 8 billion pages indexed (8,058,044,651 as of 1 May 2005)
- Technique used to maintain the index: inverted tables (see TC/BD)
- All pages are revisited monthly
- "Cheap" PCs with Intel processors and 256 MB of RAM
- About 6,000 processors and 12,000 disks (1 PB of disk space, 2 disks per machine)
- Linux Red Hat
- 2 sites in California and 2 in Virginia
- Network connection: each site has an OC48 (2.5 Gbps) link to the Internet; between each pair of sites there is an OC12 (622 Mbps) backup link

Racks and Racks
At Google, the approach to redundancy is to use a massive set of complete machines!
- 40 PCs per rack
- 40 racks
(Figure only.)

Super-fast machines??
(Figure only.)

Reading Material
- Computer Architecture: A Quantitative Approach, 3rd Ed. Sections 6.1, 6.3, 6.5 (briefly), 6.9, 6.15.
- Alternatively (or complementarily), the material is quite well explained in Chapter 9 of Computer Organization and Design, 3rd Ed., D. Patterson & J. Hennessy, Morgan Kaufmann, ISBN 1-55860-604-1, August 2004. In particular, the description of the Google cluster was taken from there.
  The only material not covered there is Section 9.6. This chapter of the book will be placed online on the course web site, available to authenticated users only.
- Jon Stokes, "Introduction to Multithreading, Superthreading and Hyperthreading", Ars Technica, October 2003.
  http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars