OpenMP
António Abreu
Instituto Politécnico de Setúbal
March 1, 2013
OpenMP – what?
It is an Application Program Interface (API) that allows parallel programs
to be developed explicitly and simply, in C, C++, and Fortran, for
multi-platform, shared-memory multiprocessor computers (including Solaris,
AIX, HP-UX, GNU/Linux, Mac OS X, and Windows platforms). It is supported
by the major computer hardware and software vendors (including AMD, IBM,
Intel, Cray, HP, Fujitsu, Nvidia, NEC, Microsoft, Texas Instruments, Oracle
Corporation, and others).
cores and memory
Multicore computers have a memory hierarchy in which some levels are
shared among the cores while others are private to each core. The next
figure makes this distinction clear.
TLB stands for Translation Lookaside Buffer, which is a cache of
virtual-to-physical address translations.
When writing parallel programs one must know which memory is shared
and which memory is not.
Fork – join
OpenMP is based on multithreading, i.e., a form of parallelization whereby
a master thread forks a specified number of slave threads, with the
runtime environment allocating threads to different processors.
How many cores does my machine have?
In Linux, the file /proc/cpuinfo contains a lot of information about the
hardware of the machine, with one processor entry per logical core. Typing
less /proc/cpuinfo allows one to see it all.
To see info about memory, see the contents of the file /proc/meminfo.
The first number one wants to see is the one corresponding to MemTotal.
In order to use OpenMP, one needs a compiler that supports it. In Linux,
GCC 4.2 or higher supports OpenMP. To see the version of your (Linux)
compiler, type the command gcc -v.
parallel directive
#pragma omp parallel [clause ...] newline
{
    structured_block
}
where clause can be
if (scalar_expression)
private (list)
shared (list)
default (shared | none)
firstprivate (list)
reduction (operator: list)
copyin (list)
num_threads (integer-expression)
Hello world
#include <stdio.h>
#include <omp.h>
int main(void)
{
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf("Hello (%d)\n", ID);
        printf("world (%d)\n", ID);
        printf("! (%d)\n", ID);
    }
    return 0;
}
Compile with gcc -fopenmp hello.c -o hello
Hello (0)
world (0)
! (0)
Hello (1)
world (1)
! (1)
Hello (2)
world (2)
! (2)
Hello (3)
world (3)
! (3)
The code between the curly brackets (after the pragma directive) is
executed by every thread of a team whose size is fixed when the region
starts.
At the first curly bracket there is a fork, i.e., the master thread creates
a team of parallel threads, and at the second curly bracket there is a
join, i.e., the master thread continues execution after all the slave threads
end. The second curly bracket constitutes an implicit barrier, past which
only the master thread continues.
The number of threads typically defaults to the number of cores in the
microprocessor; it can be set from the shell with the command
export OMP_NUM_THREADS=4
omp_get_thread_num() is a function that returns the Id of the calling
thread. The master thread has Id 0 and is part of the thread team.
Here we observe an ordered output, but this is not guaranteed; in fact
there is a race condition because the four threads share the standard
output, so lines from different threads may appear interleaved.
Note that OpenMP is not necessarily implemented identically by all
vendors. Also, it does not check for data dependencies, data conflicts,
race conditions, or deadlocks. In particular, it does not guarantee
that input or output to the same file is synchronized when executed in
parallel. It is up to the programmer to synchronize input and output.
Synchronization Constructs – barriers
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
Hello World from thread 1
Hello World from thread 3
Hello World from thread 0
Hello World from thread 2
There are 4 threads
Barriers are a synchronization primitive. This means that all threads in the
team wait for the last one to reach the barrier. At that moment, all
threads in the team resume execution in parallel. If there is a thread that
does not reach the barrier, all threads in the team wait, and the process
hangs without any work being produced.
Quiz
If we comment out the barrier pragma in the code above, the output may be
Hello World from thread 0
There are 4 threads
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
Explain why.
Quiz
If we add the code
printf("Bye from thread %d\n", th_id);
after the if, what would be the output?
Workshare directives – for
#pragma omp for [clause ...] newline
    for_loop
where clause can be,
schedule (type [,chunk])
ordered
private (list)
firstprivate (list)
lastprivate (list)
shared (list) (accepted only on the combined parallel for directive)
reduction (operator: list)
collapse (n)
nowait
parallel for example
#include <omp.h>
#define CHUNKSIZE 100
#define N 1000
int main(void)
{
    int i, chunk = CHUNKSIZE;
    float a[N], b[N], c[N];
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i=0; i < N; i++)
            c[i] = a[i] + b[i];
    } /* end of parallel region */
    return 0;
}
The for pragma divides the N iterations of the for loop among the threads
of the team that already exists; it does not create new threads.
The clause schedule tells the OpenMP runtime how to assign iterations to
those threads. In this case, the scheduling policy is dynamic, which means
that chunks of iterations are handed out to the threads on a
first-come-first-served basis.
In this case each chunk has chunk (i.e., 100) iterations of the total of
1000 in the loop: a thread takes 100 iterations, executes them, and then
asks for the next chunk until none are left.
The nowait clause removes the implied barrier at the end of the for
directive. Put differently, without this clause every thread in the team
would wait at the end of the loop until all iterations are done; with it,
a thread that finds no more chunks proceeds immediately to the code that
follows.
Quiz
In the following program, which for loop is executed in parallel: the first,
or both? Before answering, note that the directives parallel and for are
combined into a single one. This is valid.
#include <stdio.h>
int main(int argc, char *argv[])
{
    const int N = 100;
    int i, a[N];
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;
    for (i = 0; i < N; i++)
        printf("%d ", a[i]);
    return 0;
}
Workshare directives – sections
#pragma omp sections [clause ...] newline
{
    #pragma omp section newline
        structured_block
    #pragma omp section newline
        structured_block
}
where clause can be,
private (list)
firstprivate (list)
lastprivate (list)
reduction (operator: list)
nowait
section directive example
#include <stdio.h>
#include <omp.h>
int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            printf("hello from thread %d\n", omp_get_thread_num());
        }
        #pragma omp section
        {
            printf("hello from thread %d\n", omp_get_thread_num());
        }
        #pragma omp section
        {
            printf("hello from thread %d\n", omp_get_thread_num());
        }
    }
    printf("Bye from thread %d\n", omp_get_thread_num());
}
A few executions
First execution
hello from thread 0
hello from thread 0
hello from thread 0
Bye from thread 0
Second execution
hello from thread 0
hello from thread 1
hello from thread 3
Bye from thread 0
Third execution
hello from thread 2
hello from thread 1
hello from thread 0
Bye from thread 0
Another example
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int i=0;
    #pragma omp parallel sections if (i==1)
    {
        #pragma omp section
        {
            printf("hello from thread %d\n", omp_get_thread_num());
        }
        #pragma omp section
        {
            printf("hello from thread %d\n", omp_get_thread_num());
        }
        #pragma omp section
        {
            printf("hello from thread %d\n", omp_get_thread_num());
        }
    }
    printf("Bye from thread %d\n", omp_get_thread_num());
}
Only possible result
hello from thread 0
hello from thread 0
hello from thread 0
Bye from thread 0
Since the condition is false, the team of threads is not created; only the
master thread remains. The assigned work (three blocks of code) is then
executed serially. The if clause thus lets the program choose, at runtime,
whether to parallelize the work or to serialize it.
Also, there is an implicit barrier at the end of the sections construct
(not after each individual section), and only the master thread continues
past the end of the parallel region. This explains why Bye from ... (in
the last two examples) is always the last message to print.
Clause reduction
reduction (operator: list)
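At the creation of a team of threads, each thread gets a private copy of
every variable in list, initialized to the identity value of operator (0
for +, 1 for *). At the end of the region, operator is applied to all the
private copies and to the original value of the variable, a process known
as reduction, and the final result is written back to the variables in
list, now seen as global shared variables.
In the OpenMP versions contemporary with these notes (3.x), variables in
list must be scalar in C/C++, not arrays or structures; OpenMP 4.5 and
later also accept array sections.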
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int t=0;
    omp_set_num_threads(4);
    #pragma omp parallel reduction(+:t)
    {
        t = omp_get_thread_num() + 1;
        printf("local %d\n", t);
    }
    printf("reduction %d\n", t);
}
Result
local 1
local 2
local 3
local 4
reduction 10
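The function of omp_set_num_threads() is self-explanatory. It should be
called from the serial part of the program, before the parallel region it
is meant to affect; a call made inside a parallel region does not change
the size of the team that is already running.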
Synchronization Constructs – atomic
Used to identify a memory location that must not be updated
simultaneously by more than one thread in the team. In other words, it
provides atomic access to the memory location.
#pragma omp atomic
    statement_expression
The directive applies only to the single statement that follows it, which
must be a simple update such as x++, x += expr, or x = x + expr.
Synchronization Constructs – single
Used when there is a block of code that must be executed by a single
thread in the team (not necessarily the master). Note that this by no
means implies that the code is made atomic. It may happen that other
threads (outside this team) access the same memory location, thus
creating a race condition.
#pragma omp single [clause[[,] clause] ...]
    statement_block
Threads in the team that do not execute the block wait at the end of
it, unless a nowait clause is specified.
Synchronization Constructs – master
Used to identify a block of code that must be executed only by the master
thread. The other threads skip the block, and there is no implied barrier
at its end.
#pragma omp master
    statement_block
Synchronization Constructs – critical
Specifies a block of code that must be executed by only one thread at a
time. In other words, while one thread is executing the code in a critical
region, any other thread that reaches a critical region with the same name
waits until the first one leaves.
#pragma omp critical [(name)]
    statement_block
Different critical regions with the same name are treated as the same
region. All unnamed critical regions are treated as the same region.
Example
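#include <stdio.h>
#include <omp.h>
int main(void)
{
    int x;
    x = 0;
    #pragma omp parallel shared(x)
    {
        #pragma omp critical
        x = x + 1;   /* one thread at a time updates x */
    } /* end of parallel region */
    printf("x = %d\n", x);   /* x equals the number of threads */
    return 0;
}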
Synchronization Constructs – flush
This directive identifies a point at which a consistent view of memory must
exist, i.e., thread-visible variables are written back to memory in response
to this directive.
#pragma omp flush [ (list) ]
Remember the first figure of these course notes. This directive forces the
values that a thread holds in registers or in the data cache of its core to
be made visible to the other cores, for instance through the shared unified
cache memory. This does not necessarily mean a write to the main memory;
on cache-coherent machines that decision is left to the hardware.
OpenMP functions about threads
#include <stdio.h>
#include <omp.h>
int main(void)
{
    printf("omp_get_max_threads=%d\n", omp_get_max_threads());
    omp_set_num_threads(2);
    printf("omp_get_num_procs=%d\n", omp_get_num_procs());
    #pragma omp parallel
    printf("omp_get_thread_num=%d\n", omp_get_thread_num()); /* in parallel */
    printf("omp_get_thread_num=%d\n", omp_get_thread_num()); /* serial again */
    return 0;
}
omp_get_max_threads=4
omp_get_num_procs=2
omp_get_thread_num=0
omp_get_thread_num=1
omp_get_thread_num=0
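omp_get_num_procs() returns the number of processors available to the
program. omp_get_max_threads() returns an upper bound on the size of the
team that a parallel region without a num_threads clause would get; here it
is called before omp_set_num_threads(2), so it still reports the initial
setting. The pragma applies only to the statement that follows it, which is
why the last line is printed once, by thread 0.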
Synchronization – locks
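#include <stdio.h>
#include <omp.h>
/* do_lots_of_work() and do_more_lots_of_work() stand for user code */
int do_lots_of_work(int id);
int do_more_lots_of_work(int id);
void report(void)
{
    int tmp, id;
    omp_lock_t lck;
    omp_init_lock(&lck);
    #pragma omp parallel private (tmp,id)
    {
        id = omp_get_thread_num();
        tmp = do_lots_of_work(id);      // tmp is private: no lock needed
        omp_set_lock(&lck);             // wait until the lock is free
        printf("%d %d\n", id, tmp);     // one thread at a time prints
        omp_unset_lock(&lck);           // let the next thread in
        tmp = do_more_lots_of_work(id); // runs in parallel again
    }
    omp_destroy_lock(&lck);
}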