Programming Languages Laboratory
Divergence Analysis with Affine Constraints
Diogo Sampaio, Sylvain Collange and Fernando Pereira
The Federal University of Minas Gerais - Brazil

The objective of this work is to speed up code that runs on GPUs. We achieve this goal via two contributions: a divergence analysis with affine constraints, which enables a divergence-aware register allocator.

Motivation
• General-purpose programming on Graphics Processing Units is a reality today.
  – Lots of academic research.
  – Many industrial applications.
• Yet, programming efficient GPGPU applications is hard.
  – Complex interplay with the hardware.
  – Threads execute in lock step, but divergences may happen.

What are Divergences?
• Below we have a simple kernel, and its Control Flow Graph:

    __global__ void ex(float* v) {
        if (v[tid] < 0.0) {
            v[tid] /= 2;
        } else {
            v[tid] = 0.0;
        }
    }

• Why do we have divergences in this kernel?

Why are Divergences a Problem?
• When threads in the same warp take different paths at a branch, the two paths are executed one after the other, and the threads that are off the active path sit idle.

Uniform and Divergent Variables
• If a variable always has the same value for all the threads in execution, then we call it uniform.
• If different threads in execution may see the same variable name with different values, this variable is called divergent.
• Which variables are divergent?
  – The thread identifier is always divergent.
  – Variables that depend on divergent variables are also divergent.
    • Data dependences.
    • Control dependences.

Data Dependences
• If a variable v is defined by an instruction that uses a variable u, then v is data-dependent on u.
• In the figure, %r1 depends on v and on %tid. The value of %r1 may be different for different threads.

Control Dependences
• If the value assigned to a variable v is controlled by a variable u, then v is control-dependent on u.
• In the figure, %f2 is control-dependent on %p1. Depending on how each thread branches at the end of B0, %f2 may be %f1/2 or 0.0 at the join block.

Affine Variables
• Some divergent variables are special: they are affine expressions of the thread identifier, e.g., v = C×Tid + N.
• Example: the kernel below computes the average of each column of a matrix. The loop always executes the same number of iterations for all the threads.

Affine Variables
• Variable i is divergent; yet, it is very regular: each thread sees it as "Tid + N × c", where N is a constant and c is the current loop iteration.
• We say that i is an affine variable. In this case, i = Tid + 10 × c.
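The kernel on this slide appeared only as an image and did not survive extraction. Purely as a hedged reconstruction, a column-averaging kernel consistent with the description (i = Tid + 10 × c, with the matrix stored row by row and 10 columns) might look like the sketch below; the names avgColumn, WIDTH and the parameter n are assumptions, not the authors' code.

    // Hedged sketch of the slide's column-average kernel (not the original code).
    // Each thread averages one column of an n x 10 matrix stored row by row.
    #define WIDTH 10  // assumed number of columns, matching i = Tid + 10*c

    __global__ void avgColumn(float* m, float* avg, int n) {
        int tid = threadIdx.x;            // one thread per column
        float sum = 0.0f;
        for (int c = 0; c < n; c++) {     // same trip count for every thread
            int i = tid + WIDTH * c;      // affine: 1*Tid + (10*c)
            sum += m[i];
        }
        avg[tid] = sum / n;
    }

The coefficient of Tid in i is always the constant 1, and the term WIDTH * c is uniform (every thread sees the same c), which is exactly why i can be classified as affine rather than fully divergent; and the trip count n is the same for every thread, matching the note above.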
The Divergence Analysis with Affine Constraints
• This analysis classifies variables as uniform, affine or divergent.
• Our divergence analysis is a dataflow analysis.
  – We associate an abstract state with each variable.
  – This abstract state is a pair (a, b), which means a × Tid + b.
  – Each element in the pair can be:
    • a constant, which we denote by 'C';
    • a non-initialized value, which we denote by '?';
    • an unknown value, which we denote by 'D'.

Uniform Variables
• A uniform variable v is bound to the state (0, X), which means 0 × Tid + X.
  – If X is a known constant, then v is a constant.

Divergent Variables
• A divergent variable v is bound to the state (D, D), which means that we do not know anything about the runtime values this variable can assume.
• No worries: we shall explain how we find these abstract states!

Affine Variables
• An affine variable v is bound to the state (c, X), which means c × Tid + X. The factor c is always a known constant; X can be either a known constant or D.
• OK: it is about time to explain how we find these abstract states.

Solving Divergence Analysis
• Initially, every variable is bound to the abstract state (?, ?), unless:
  – it is initialized with a constant, e.g., the assignment v = 10 gives [v] = (0, 10);
  – it is initialized with a constant expression of Tid, e.g., v = 10 * Tid + 3 gives [v] = (10, 3);
  – it is a function parameter, in which case its abstract state is (0, D).
• Once we have initialized every variable, we iterate a few propagation rules until we reach a fixed point.

The Propagation Rules
• There are many different propagation tables (we call them dataflow equations).
  – We have one table for each different program instruction.
  – Let's consider, for instance, that the program contains an instruction v = v1 + v2. In the table below, [v1] indexes the columns and [v2] indexes the rows.

    +        | (0, b1)     | (0, D)  | (a1, b1)       | (a1, D)    | (D, D)
    (0, b2)  | (0, b1+b2)  | (0, D)  | (a1, b1+b2)    | (a1, D)    | (D, D)
    (0, D)   | (0, D)      | (0, D)  | (a1, D)        | (a1, D)    | (D, D)
    (a2, b2) | (a2, b1+b2) | (a2, D) | (a1+a2, b1+b2) | (a1+a2, D) | (D, D)
    (a2, D)  | (a2, D)     | (a2, D) | (a1+a2, D)     | (a1+a2, D) | (D, D)
    (D, D)   | (D, D)      | (D, D)  | (D, D)         | (D, D)     | (D, D)
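To make the table concrete, here is a minimal sketch of how one such dataflow equation could be coded, as host-side code in the compiler. This is an illustration only, not Ocelot's implementation; the names Elem, State, addElem and absAdd are invented for this example, and the initial '?' state is left out.

    // One component of an abstract pair: a known constant or D (unknown).
    // The '?' (non-initialized) state is omitted from this sketch.
    enum Kind { Const, Div };
    struct Elem  { Kind kind; int c; };   // c is meaningful only when kind == Const
    struct State { Elem a, b; };          // represents a*Tid + b

    // Component-wise rule: constants add up; D absorbs everything.
    Elem addElem(Elem x, Elem y) {
        if (x.kind == Div || y.kind == Div) return {Div, 0};
        return {Const, x.c + y.c};
    }

    // Transfer function for v = v1 + v2, mirroring the table:
    // (a1, b1) + (a2, b2) = (a1 + a2, b1 + b2), and a fully divergent
    // operand (D, D) makes the whole result (D, D).
    State absAdd(State s1, State s2) {
        if (s1.a.kind == Div || s2.a.kind == Div)
            return {{Div, 0}, {Div, 0}};
        return {addElem(s1.a, s2.a), addElem(s1.b, s2.b)};
    }

For instance, absAdd applied to the states (0, 3) and (2, D) yields (2, D), matching the cell at row (0, b2), column (a1, D) of the table.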
Applying the Rules
• We work on the program dependence graph.
• Variables to be processed are placed in a worklist.

Applying the Rules
• While there is any variable v in the worklist, we try to process the instructions that use v.

Applying the Rules
• If all the dependences of a variable v have been processed, then we can remove v from the worklist.
• If we process an instruction that defines variable w, then we add w to the worklist.
• We have removed Tid from the worklist, and added i0 to it.

Reaching a Fixed Point
• We keep performing this abstract interpretation until the worklist is empty.
  – This happens once we reach a fixed point.

How to Use the Divergence Analysis
• There are many compiler optimizations that need the information provided by the divergence analysis.
• We are using the results of our divergence analysis with affine constraints to guide a register allocator.
  – We call it the Divergence Aware Register Allocator.

What is Register Allocation?
• Register allocation is the problem of finding locations for the variables in a program.
• Variables can stay in registers or in memory.
  – Variables sent to memory are called spills.
• In Graphics Processing Units we have roughly three types of memory:
  – Local: off-chip and private to each thread.
  – Global: off-chip and visible to every thread.
  – Shared: on-chip and visible to every thread (in the same thread block – let's abstract this detail away).

The Key Insight: where to place spills
• A traditional allocator moves every spilled variable to the local memory. However, we can do much better:
  – Uniform spilled variables can be placed in the shared memory.
  – Affine spilled variables can also be placed in the shared memory.
    • But this is a bit trickier, and I shall explain it later.

Example
• [Figure: variables classified by the analysis – uniform: 0×Tid + D; affine: c×Tid + D; divergent: D×Tid + D.]

Redundancy
• Uniform variables always have the same value for all the threads. Would it not be better to keep only one image of each spilled uniform variable?
• Moreover, we can also share affine variables, as we will explain soon.
• This is what we get with divergence-aware allocation.

The benefits of our allocator
• A traditional allocator spills everything to the local memory.
• The divergence-aware allocator makes heavier use of the shared memory. This has many advantages:
  – Shared memory is faster.
  – Less memory is used to spill variables.

How to Spill Affine Values?
• An affine value is like C×Tid + X, where C is a constant known at compilation time. Let's assume an expression like N = 2*tid + t0.
  – Store: st.local N 0xFFFFFC32 changes to st.shared t0 0xFFFFFC32.
  – Load: ld.local N 0xFFFFFC32 changes to ld.shared t0 0xFFFFFC32, followed by N = 2*tid + t0.

Implementation
• We have implemented the affine analysis and the divergence-aware register allocator in Ocelot, an open-source PTX optimizer.
  – More than 10,000 lines of code!
  – This compiler is used in industry.
• We have successfully tested our divergence analysis on all the 177 different CUDA kernels that we took from the Rodinia and NVIDIA SDK 3.1 benchmark suites.

Performance
• [Bar chart: percentage speedup in execution time over a naive linear-scan allocator, for the Divergent and the Affine allocators; y-axis from -15% to 65%.]
• Setup: GTX 570, NVIDIA CUDA driver and toolkit 3.2, 32-bit Linux, 8 registers per thread.

Conclusions
• New directions for divergence-aware optimizations.
  – So far, optimizations have been focusing on branch fusion and synchronization of divergent threads.
• An open-source implementation is already being used by the Ocelot community.
• To know more:
  – http://code.google.com/p/gpuocelot/
  – http://simdopt.wordpress.com

What if the affine expression is formed by constants only?
• If the affine expression is like C1×Tid + C0, where C0 and C1 are constants, then we need neither loads nor stores (this is rematerialization). For instance, assume N = 2*tid + 3.
  – Store: st.local N 0xFFFFFC32 is completely removed.
  – Load: ld.local N 0xFFFFFC32 changes to N = 2*tid + 3.
• We have all the information to reconstruct N!

Classification of Spilled Variables
(Cell contents reconstructed from the examples on the previous slides.)

    Classification | Definitions (stores)         | Uses (loads)
    Constant       | removed                      | rematerialize: v = c
    Uniform        | st.shared v                  | ld.shared v
    Cnst. Affine   | removed                      | rematerialize: v = c1*Tid + c0
    Affine         | st.shared t0 (the base only) | ld.shared t0; v = c*Tid + t0
    Divergent      | st.local v                   | ld.local v
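As a closing illustration, the sketch below hand-writes at the CUDA level what the allocator's rewriting of an affine spill amounts to. It is an assumption-laden analogy: the names spillSketch and base_shared are invented, and the real allocator rewrites PTX, not CUDA source.

    // Hedged CUDA analogy of spilling the affine value N = 2*tid + t0:
    // only the uniform base t0 gets a (shared) memory slot; N itself is
    // rebuilt on reload. Not the allocator's actual output.
    __global__ void spillSketch(float* out, int t0) {
        int tid = threadIdx.x;
        int N = 2 * tid + t0;                  // affine: 2*Tid + t0

        __shared__ int base_shared;            // plays the role of the shared spill slot
        if (tid == 0) base_shared = t0;        // analog of: st.shared t0
        __syncthreads();

        // ... suppose register pressure evicts N here ...

        int reloaded = 2 * tid + base_shared;  // analog of: ld.shared t0; N = 2*tid + t0
        out[tid] = (float)(reloaded == N);     // sanity check: always 1.0
    }

    int main() {
        float* d;
        cudaMalloc(&d, 32 * sizeof(float));
        spillSketch<<<1, 32>>>(d, 7);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }

A traditional allocator would instead give every thread its own off-chip local slot for N; here a single shared slot per block replaces 32 per-thread copies, which is the space and latency win the benefits slide describes.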