Control Flow¶
Background¶
Traditional Compilers¶
In most IRs, functions consist of basic blocks (often shortened to just blocks), which are a series of instructions that always execute linearly from the beginning to the end, and the edges that connect them. Each basic block ends with a jump or branch that transfers execution either unconditionally to one basic block or conditionally to two (or sometimes more) blocks. Together, the basic blocks and edges comprise the Control Flow Graph or CFG. For example, this snippet of code:
while (foo) {
...
if (baz) {
...
} else {
...
}
if (bar) {
...
}
}
...
could be translated to a CFG that looks like:
Note that the CFG is a little different from the original code, since it’s
been optimized somewhat; for example, we’ve folded the second if
into the
then and else branches of the first. This is a good thing for traditional,
scalar, hardware, but as we’ll see, these types of optimizations are usually
unnecessary and sometimes harmful for GPUs. However, this is the standard
model that most literature on compiler theory assumes.
SIMT and Convergence¶
A unique feature of most GPUs is that they are designed to run many different invocations or threads of the same program in lock step, in a model called Single Instruction Multiple Threads (SIMT). The set of threads running in lock step is called the warp, wave, or subgroup depending on whom you ask. When control flow diverges, meaning that when two different threads (fragments, vertices, etc.) in the wave branch in different directions, then the GPU will take both sides of the branch. For example, let’s take this source:
if (foo) {
// A
} else {
// B
}
// C
If foo
is divergent upon reaching the if statement, that is if not all
threads in the wave have the same value for foo
, then:
First, threads where it is false are disabled and
A
is executed.Then the disabled threads are flipped and control falls through to
B
.Finally, control reconverges at
C
and everything is re-enabled.
Although most GPUs do something like this internally, the details of how it
works varies greatly from vendor to vendor and generation to generation.
Some GPUs, such as Midgard-generation Mali GPU’s, give each thread a
separate program counter and don’t keep track of divergance at all! Intel
GPUs have both “jump” instructions that manage the execution mask as well as
higher-level instructions such as IF
, ELSE
, ENDIF
, and WHILE
that map directly to their C namesakes, and on some generations, the former is
slower or non-existent. On AMD GCN the disabling and re-enabling of threads is
done manually using code generated by the backend. Other architectures are
similarly varied in their implementation.
Cross-thread Operations¶
From the programmer’s point of view, and NIR’s point of view, we can mostly ignore this and pretend we’re writing code for a single thread. However there’s one place where these details matter: cross-thread operations. There are a few operations that GPUs can do where separate threads in a wave can share information. One important example is the operation of taking a derivative in screen-space in fragment shaders, where the value of the input is exchanged between fragments in a 2x2 quad. Derivatives can be taken either explicitly, or implicitly in order to calculate the appropriate level of detail when sampling from a texture. Source languages such as GLSL require that these derivatives be taken in uniform control flow, meaning that no thread can be disabled, so that the inputs to the derivative are well-defined, and derivatives don’t even make sense if some fragments in each quad aren’t enabled. Therefore, optimizations in any intermediate IR must respect this, which means that the IR must be aware of when control flow diverges. For example, in the following snippet:
vec2 color = texture(tex, coordinate * 2.0)
if (foo) {
/* the only time we ever use 'color' */
output = color;
}
we can only push the texture operation into the if
if foo
is
uniform, meaning it takes the same value for each fragment, since the
texture takes an implicit derivative.
Later extensions add many more powerful cross-thread operations, such as
ballot()
which takes a boolean and returns a bitset of active threads where
the condition is true. In particular ballot(true)
allows direct access to
the set of active threads.
There are several tricky issues raised by the inclusion of these cross-thread operations. For example, the following program:
while (true) {
// (A)
if (bar) continue;
// (B)
if (foo) break;
}
and this one:
while (true) {
do {
// (A)
} while (bar);
// (B)
if (foo) break;
}
have almost the same control-flow graph, and work the same on a single-thread
level, yet programmers generally expect them to have different convergence
behavior: in the first loop, divergent branches on bar
reconverge at the
bottom of the loop, so that (B)
is not in uniform control flow if bar
is divergent, while in the second loop control flow reconverges after the inner
loop so that (B)
is in uniform control flow as long as foo
is uniform.
Even though each individual thread executes the same thing, they do so at
different times, so that the set of communicating threads is different.
The NIR Control Flow Model¶
In order to support many different backends, as well as maintain the structured control information currently given by source languages such as GLSL and required to handle cross-thread operations correctly, NIR uses a control flow model that explicitly contains structured control flow elements such as loops and if statements. This approach gives a clear model for when control flow converges and diverges that should be supported by every GPU.
Nevertheless, there’s still the need to support the vast existing literature that takes basic blocks as fundamental. So NIR includes basic blocks as a primitive as well. Control flow in NIR consists of a control flow tree whose elements are if statements and infinite loops, and whose leaves are basic blocks. In addition, the successors of each block, which are the blocks that a given block branches to, and the predecessors of each block, which are the blocks that branch to a given block, are always kept up-to-date. Finally, there’s the start block, which is the first block in the function, and end block, which is a fake basic block that return statements point to. Together, the start block, end block, and graph created by the successors and predecessors form the control flow graph that complements the control flow tree. For example, this:
void main(void) {
if (foo)
return;
while (bar) {
if (baz)
continue;
...
}
}
would become, in NIR:
where the if statements and loops are represented by boxes and the basic blocks are represented by ovals. One thing that may be initially surprising is that if statements always have at least one empty basic block in the “else” portion – that is, if-then statements are turned into if-then-else statements. This helps optimizations that push operations into if statements, since there could be a statement that only needs to be calculated when the condition is false, and adding the empty block creates a place where those statements can be moved. On the basic block level, creating the empty block removes a critical edge, which is an edge from a block with more than one successor to another with more than one predecessor. Consider this if-then statement:
if (foo) {
bar = 1;
}
...
and its basic block representation:
The red edge is a critical edge, since it’s one of two incoming edges and one of two outgoing edges. Before running optimizations like Partial Redundancy Elimination (PRE) and Global Code Motion (GCM) whose aim is to move code into less frequently executed paths, most compilers will split the critical edge by inserting an empty basic block:
In basic-block-focused compilers, critical edges are sometimes introduced by other optimizations and then have to be manually removed. But because NIR keeps control flow structured, those sorts of optimizations are either done very differently or not done at all, and therefore it makes sense to always keep critical edges split. It’s for the same reason that NIR doesn’t have a “predicated break” or “predicated continue” instruction, which is supported by most GPUs: they add critical edges to the CFG and prevent the compiler from being able to make code execute only when the break or continue executes. They also increase complexity in code aiming to manipulate the control flow tree, since unconditional continue and break must also be supported. In both cases, it’s easy enough for the backend to perform the optimizations to remove the extra blocks if necessary.
We’ve now explained why most of the extra empty basic blocks were inserted in the example NIR control flow, but there’s still one left. There’s an empty block in between the first if statement and the loop, so that the then and else branches branch to the empty block and then to the first block of the loop instead of jumping directly to the loop. Clearly, it isn’t there to remove a critical edge. So why insert it? Well, imagine that there was a statement in the loop that we determined to be loop-independent, so that we could move it outside the loop, but it was used inside the loop so we couldn’t move it after the loop. The empty block before the loop then comes in handy as a place to move it. Just as splitting critical edges helps optimizations such as PRE, inserting so-called padding blocks before and after the loop can help optimizations that do Loop-Invariant Code Motion (LICM), including GCM.
Putting it Together¶
We can put all the rules we’ve created into a guide for constructing the control flow tree. To do this, we’ll need a few different data types:
-
enum
nir_cf_node_type
¶ Values:
-
enumerator
nir_cf_node_block
¶
-
enumerator
nir_cf_node_if
¶
-
enumerator
nir_cf_node_loop
¶
-
enumerator
nir_cf_node_function
¶
-
enumerator
-
struct
nir_cf_node
¶ Control flow nodes (or CF nodes for short) are the base class for everything in the control-flow tree. They can be an entire function, a loop, an if statement, or a basic block.
Public Members
-
struct exec_node
node
¶ Entry in the CF list.
-
nir_cf_node_type
type
¶ Which type of CF node this is.
-
struct nir_cf_node *
parent
¶ CF node which contains the CF list we’re in.
-
struct exec_node
A control flow list or CF list is a list of control flow nodes, which
roughly corresponds to a series of statements in GLSL. It’s used to
represent the body of a function and a loop as well as the then and else
branches of an if statement. In the core datastructures to follow, CF lists
have type struct exec_list
but they have additional rules explained below.
-
struct
nir_if
¶ If statement.
Public Members
-
nir_cf_node
cf_node
¶ Base CF node.
-
nir_src
condition
¶ Branch condition
-
nir_selection_control
control
¶ A hint from the user about whether we should flatten this if.
-
struct exec_list
then_list
¶ CF list to be executed if the condition is true.
-
struct exec_list
else_list
¶ CF list to be executed if the condition is false.
-
nir_cf_node
-
struct
nir_loop
¶ An infinite loop (the only way to exit is through a
break
instruction.)Public Members
-
nir_cf_node
cf_node
¶ Base CF node.
-
struct exec_list
body
¶ CF list representing the body.
-
nir_loop_info *
info
¶ Information about the loop generated by analysis.
-
nir_loop_control
control
¶ A hint from the user about whether the loop should be unrolled.
-
bool
partially_unrolled
¶ Whether the loop has already been partially unrolled based on a guessed trip count and should not be partially unrolled again. Set by loop unrolling.
-
bool
divergent
¶ Whether the loop has any divergent (not reached by all threads that enter the loop) breaks or continues. Set by divergence analysis.
-
nir_cf_node
-
struct
nir_block
¶ A basic block. These are the leaves of the control flow tree.
Public Members
-
nir_cf_node
cf_node
¶ Base CF node.
-
struct exec_list
instr_list
¶ List of
nir_instr
.
-
unsigned
index
¶ generic block index; generated by nir_index_blocks
-
struct nir_block *
successors
[2]¶ Each block can only have up to 2 successors, so we put them in a simple array - no need for anything more complicated.
-
struct nir_block *
imm_dom
¶ this node’s immediate dominator in the dominance tree - set to NULL for the start block.
-
unsigned
num_dom_children
¶ Size of dom_children.
-
uint32_t
dom_pre_index
¶ These two indices have the property that dom_{pre,post}_index for each child of this block in the dominance tree will always be between dom_pre_index and dom_post_index for this block, which makes testing if a given block is dominated by another block an O(1) operation.
-
uint32_t
start_ip
¶ Value just before the first
nir_instr->index
in the block, but after end_ip that of any predecessor block.
-
uint32_t
end_ip
¶ Value just after the last
nir_instr->index
in the block, but before thestart_ip
of any successor block.
-
BITSET_WORD *
live_in
¶ SSA def live in for this block; used for liveness analysis. Indexed by
ssa_def->index
.
-
BITSET_WORD *
live_out
¶ SSA def of live out for this block.
-
nir_cf_node
CF lists follow two rules, which together will cover both the if-then-else and loop padding situations: a control flow list must end and begin with a basic block and must contain one (and exactly one) block between each non-block control flow node (i.e. loop or if statement). That is, control flow lists must look like:
block
loop/if
block
loop/if
...
loop/if
block
and they have to consist of at least one (possibly empty) basic block.
Finally, there are a class of instructions called “jump instructions”, which is how breaks, continues, and returns are represented in NIR:
-
enum
nir_jump_type
¶ Values:
-
enumerator
nir_jump_return
¶ Return from a function
This instruction is a classic function return. It jumps to nir_function_impl::end_block. No return value is provided in this instruction. Instead, the function is expected to write any return data to a deref passed in from the caller.
-
enumerator
nir_jump_halt
¶ Immediately exit the current shader
This instruction is roughly the equivalent of C’s
exit()
in that it immediately terminates the current shader invocation. From a CFG perspective, it looks like a jump to nir_function_impl::end_block but it actually jumps to the end block of the shader entrypoint. A halt instruction in the shader entrypoint itself is semantically identical to a return.For shaders with built-in I/O, any outputs written prior to a halt instruction remain written and any outputs not written prior to the halt have undefined values. It does NOT cause an implicit discard of written results. If one wants discard results in a fragment shader, for instance, a discard or demote intrinsic is required.
-
enumerator
nir_jump_break
¶ Break out of the inner-most loop
This has the same semantics as C’s
break
statement.
-
enumerator
nir_jump_continue
¶ Jump back to the top of the inner-most loop
This has the same semantics as C’s
continue
statement assuming that a NIR loop is implemented aswhile (1) { body }
.
-
enumerator
nir_jump_goto
¶ Jumps for unstructured CFG.
As within an unstructured CFG we can’t rely on block ordering we need to place explicit jumps at the end of every block.
-
enumerator
nir_jump_goto_if
¶
-
enumerator
-
struct
nir_jump_instr
¶ Jump instructions for structured and unstructured control flow. There must be at most one jump instruction in a basic block, and it must be at the end of the block.
Note that “multilevel breaks” and “multilevel continues”, i.e. jumping to a loop outside of the innermost one, are currently not supported, although they may be in the future.
If you aren’t sure, you should go and convince yourself that the example NIR control flow given earlier satisfies all these rules, in addition to being free of critical edges.
Modifying Control Flow¶
We’ve seen that there are two complimentary ways of describing control flow in NIR, the control flow tree and the control flow graph, which contain redundant information. To ease the burden of keeping both forms up-to-date, core NIR provides a number of helpers for rewriting the control flow graph. They allow you to manipulate the program as if it consists of a series of statements, like in GLSL, while “under the hood” they guarantee that the control flow tree is in the correct form and the successors and predecessors of the basic blocks involved are updated.
These functions all rely on the notion of a cursor:
-
enum
nir_cursor_option
¶ Defines what a cursor is anchored to and whether it is before/after it.
Values:
-
enumerator
nir_cursor_before_block
¶
-
enumerator
nir_cursor_after_block
¶
-
enumerator
nir_cursor_before_instr
¶
-
enumerator
nir_cursor_after_instr
¶
-
enumerator
-
struct
nir_cursor
¶ A tiny struct representing a point to insert/extract instructions or control flow nodes. Helps reduce the combinatorial explosion of possible points to insert/extract.
Public Members
-
nir_cursor_option
option
¶ Whether this cursor is before/after a block or instruction.
-
nir_instr *
instr
¶ The instruction this is before/after.
-
nir_cursor_option
The first, and simplest, function, inserts a newly-created control flow node (if or loop) at the cursor:
-
void
nir_cf_node_insert
(nir_cursor cursor, nir_cf_node *node)¶ puts a control flow node where the cursor is
There is also a function to extract a part of a CF list:
-
void
nir_cf_extract
(nir_cf_list *extracted, nir_cursor begin, nir_cursor end)¶ Extract a piece of control flow from a function.
begin
andend
must be inside blocks in the same CF list, andbegin
must be beforeend
. If a NIR CF list corresponds to a list of statements in GLSL, then the portion betweenbegin
andend
then corresponds to a sub-list within that list, which is extracted intoextracted
which is a free-floating piece of IR that can later be deleted, cloned, or re-inserted.This function splits up the basic blocks at both
begin
andend
, and it is left unspecified how they are split up. This means that any pointers to those blocks are invalid after the function is called.
which returns a nir_cf_list
structure:
-
struct
nir_cf_list
¶ An opaque wrapper for a portion of a CF list that has been extracted from a function.
The returned CF list can be reinserted/cloned/deleted:
-
void
nir_cf_reinsert
(nir_cf_list *cf_list, nir_cursor cursor)¶ Re-insert a nir_cf_list which has been extracted by nir_cf_extract() at the cursor. Any pointer to the block that
cursor
is in is similarly invalidated.
-
void
nir_cf_delete
(nir_cf_list *cf_list)¶ Delete a nir_cf_list which has been extracted by nir_cf_extract().
-
void
nir_cf_list_clone
(nir_cf_list *dst, nir_cf_list *src, nir_cf_node *parent, struct hash_table *remap_table)¶ Clone a nir_cf_list which has been extracted by nir_cf_extract().
- Parameters
dst – The cloned nir_cf_list.
src – The nir_cf_list to clone.
parent – The nir_cf_node
dst
will be inserted under.remap_table – A table of SSA values used to rewrite uses of values when cloning. If a value is in this table, uses of it will be rewritten. Otherwise, values outside
src
will be kept as-is.
Finally, there are a few wrapper functions around these worth mentioning:
-
static inline void
nir_cf_list_extract
(nir_cf_list *extracted, struct exec_list *cf_list)¶ Extract an entire CF list.
-
static inline void
nir_cf_node_remove
(nir_cf_node *node)¶ Removes a control flow node, doing any cleanup necessary.
-
static inline void
nir_cf_list_clone_and_reinsert
(nir_cf_list *src_list, nir_cf_node *parent, nir_cursor cursor, struct hash_table *remap_table)¶
However, there are a few caveats to be aware of when using these functions:
Phi nodes are considered attached to the piece of control flow that their sources come from. There are three places where phi nodes can occur, which are the three places where a block can have multiple predecessors:
After an if statement, if neither branch ends in a jump.
After a loop, if there are multiple breaks.
At the beginning of a loop.
For 1., the phi node is considered to be part of the if, and for 2. and 3. the phi node is considered to be part of the loop. This allows us to keep phis intact, but it means that phi nodes cannot be separated from the control flow they come from. For example, extracting an if without extracting all the phi nodes after it is not allowed, and neither is extracting only some of the phi nodes at the beginning of a block. It also means that extracting from the beginning of a basic block actually means extracting from the first non-phi instruction, since there’s no situation where extracting phi nodes without extracting what comes before them makes any sense.
Phi node sources are guaranteed to remain valid, meaning that they still correspond one-to-one with the predecessors of the basic block they’re part of. In addition, the original sources will be preserved unless they correspond to a break or continue that was deleted. However, no attempt is made to ensure that SSA form is maintained. In particular, it is not guaranteed that definitions of SSA values will dominate all their uses after all is said and done. Either the caller must ensure that this is the case, or it must insert extra phi nodes to restore SSA.
It is invalid to move a piece of IR with a break/continue outside of the loop it references. Doing this will result in invalid successors/predecessors and phi node sources.
It is invalid to move a piece of IR from one function implementation to another.
Extracting a control flow list will leave lots of dangling references to and from other pieces of the IR. It also leaves things in a not 100% consistent state. This means that some things (e.g. inserting instructions) might not work reliably on the extracted control flow. It also means that extracting control flow without re-inserting it or deleting it is a Bad Thing (tm) and must be avoided.