Control Flow

Background

Traditional Compilers

In most IRs, functions consist of basic blocks (often shortened to just blocks), which are a series of instructions that always execute linearly from the beginning to the end, and the edges that connect them. Each basic block ends with a jump or branch that transfers execution either unconditionally to one basic block or conditionally to two (or sometimes more) blocks. Together, the basic blocks and edges comprise the Control Flow Graph or CFG. For example, this snippet of code:

while (foo) {
   ...
   if (baz) {
      ...
   } else {
      ...
   }
   if (bar) {
      ...
   }
}
...

could be translated to a CFG that looks like:

digraph { post [label="..."]; /* force 'post' to be at the bottom*/ {rank="sink" post} header [label="foo?"]; block1 [label="...\nbaz?"]; then [label="...\nbar?"]; else [label="...\nbar?"]; block2 [label="...\nfoo?"]; header -> block1; header -> post; block1 -> then; block1 -> else; then -> header; else -> header; then -> block2; else -> block2; block2 -> block1; block2 -> post; }

Note that the CFG is a little different from the original code, since it’s been optimized somewhat; for example, we’ve folded the second if into the then and else branches of the first. This is a good thing for traditional, scalar, hardware, but as we’ll see, these types of optimizations are usually unnecessary and sometimes harmful for GPUs. However, this is the standard model that most literature on compiler theory assumes.

SIMT and Convergence

A unique feature of most GPUs is that they are designed to run many different invocations or threads of the same program in lock step, in a model called Single Instruction Multiple Threads (SIMT). The set of threads running in lock step is called the warp, wave, or subgroup depending on whom you ask. When control flow diverges, meaning that when two different threads (fragments, vertices, etc.) in the wave branch in different directions, then the GPU will take both sides of the branch. For example, let’s take this source:

if (foo) {
   // A
} else {
   // B
}
// C

If foo is divergent upon reaching the if statement, that is if not all threads in the wave have the same value for foo, then:

  • First, threads where it is false are disabled and A is executed.

  • Then the disabled threads are flipped and control falls through to B.

  • Finally, control reconverges at C and everything is re-enabled.

Although most GPUs do something like this internally, the details of how it works varies greatly from vendor to vendor and generation to generation. Some GPUs, such as Midgard-generation Mali GPU’s, give each thread a separate program counter and don’t keep track of divergance at all! Intel GPUs have both “jump” instructions that manage the execution mask as well as higher-level instructions such as IF, ELSE, ENDIF, and WHILE that map directly to their C namesakes, and on some generations, the former is slower or non-existent. On AMD GCN the disabling and re-enabling of threads is done manually using code generated by the backend. Other architectures are similarly varied in their implementation.

Cross-thread Operations

From the programmer’s point of view, and NIR’s point of view, we can mostly ignore this and pretend we’re writing code for a single thread. However there’s one place where these details matter: cross-thread operations. There are a few operations that GPUs can do where separate threads in a wave can share information. One important example is the operation of taking a derivative in screen-space in fragment shaders, where the value of the input is exchanged between fragments in a 2x2 quad. Derivatives can be taken either explicitly, or implicitly in order to calculate the appropriate level of detail when sampling from a texture. Source languages such as GLSL require that these derivatives be taken in uniform control flow, meaning that no thread can be disabled, so that the inputs to the derivative are well-defined, and derivatives don’t even make sense if some fragments in each quad aren’t enabled. Therefore, optimizations in any intermediate IR must respect this, which means that the IR must be aware of when control flow diverges. For example, in the following snippet:

vec2 color = texture(tex, coordinate * 2.0)
if (foo) {
   /* the only time we ever use 'color' */
   output = color;
}

we can only push the texture operation into the if if foo is uniform, meaning it takes the same value for each fragment, since the texture takes an implicit derivative.

Later extensions add many more powerful cross-thread operations, such as ballot() which takes a boolean and returns a bitset of active threads where the condition is true. In particular ballot(true) allows direct access to the set of active threads.

There are several tricky issues raised by the inclusion of these cross-thread operations. For example, the following program:

while (true) {
   // (A)
   if (bar) continue;
   // (B)
   if (foo) break;
}

and this one:

while (true) {
   do {
      // (A)
   } while (bar);
   // (B)
   if (foo) break;
}

have almost the same control-flow graph, and work the same on a single-thread level, yet programmers generally expect them to have different convergence behavior: in the first loop, divergent branches on bar reconverge at the bottom of the loop, so that (B) is not in uniform control flow if bar is divergent, while in the second loop control flow reconverges after the inner loop so that (B) is in uniform control flow as long as foo is uniform. Even though each individual thread executes the same thing, they do so at different times, so that the set of communicating threads is different.

The NIR Control Flow Model

In order to support many different backends, as well as maintain the structured control information currently given by source languages such as GLSL and required to handle cross-thread operations correctly, NIR uses a control flow model that explicitly contains structured control flow elements such as loops and if statements. This approach gives a clear model for when control flow converges and diverges that should be supported by every GPU.

Nevertheless, there’s still the need to support the vast existing literature that takes basic blocks as fundamental. So NIR includes basic blocks as a primitive as well. Control flow in NIR consists of a control flow tree whose elements are if statements and infinite loops, and whose leaves are basic blocks. In addition, the successors of each block, which are the blocks that a given block branches to, and the predecessors of each block, which are the blocks that branch to a given block, are always kept up-to-date. Finally, there’s the start block, which is the first block in the function, and end block, which is a fake basic block that return statements point to. Together, the start block, end block, and graph created by the successors and predecessors form the control flow graph that complements the control flow tree. For example, this:

void main(void) {
   if (foo)
      return;

   while (bar) {
      if (baz)
         continue;

      ...
   }
}

would become, in NIR:

digraph { clusterrank="local"; subgraph cluster_main { style="solid"; label="main"; start [label="(start)"]; start -> then1_block; start -> else1_block; subgraph cluster_if1 { style="filled"; fillcolor="lightgrey"; label="if (foo)"; subgraph cluster_then1 { label="then"; fillcolor="white"; then1_block [label="return"]; } subgraph cluster_else1 { label="else"; fillcolor="white"; else1_block [label="(empty)"]; } } pre_loop [label="(empty)"]; else1_block -> pre_loop; pre_loop -> loop_header; subgraph cluster_loop { style="filled"; fillcolor="lightgrey"; label="loop"; loop_header [label="(empty)"]; then3_block -> loop_header [constraint=false]; loop_header -> then2_block; loop_header -> else2_block; subgraph cluster_if2 { fillcolor="white"; label="if (!bar)"; subgraph cluster_then2 { fillcolor="lightgrey"; label="then"; then2_block [label="break"]; } subgraph cluster_else2 { fillcolor="lightgrey"; label="else"; else2_block [label="(empty)"]; } } loop_middle_block [label="(empty)"]; else2_block -> loop_middle_block; loop_middle_block -> then3_block; loop_middle_block -> else3_block; subgraph cluster_if3 { fillcolor="white"; label="if (baz)"; subgraph cluster_then3 { fillcolor="lightgrey"; label="then"; then3_block [label="continue"]; } subgraph cluster_else3 { fillcolor="lightgrey"; label="else"; else3_block [label="(empty)"]; } } loop_end_block [label="...", rank="max"]; else3_block -> loop_end_block; loop_end_block -> loop_header [constraint=false]; } post_loop [label="(empty)"]; then2_block -> post_loop; loop_end_block -> post_loop [style="invis"]; post_loop -> end_block; then1_block -> end_block; end_block [label="(end)"]; } }

where the if statements and loops are represented by boxes and the basic blocks are represented by ovals. One thing that may be initially surprising is that if statements always have at least one empty basic block in the “else” portion – that is, if-then statements are turned into if-then-else statements. This helps optimizations that push operations into if statements, since there could be a statement that only needs to be calculated when the condition is false, and adding the empty block creates a place where those statements can be moved. On the basic block level, creating the empty block removes a critical edge, which is an edge from a block with more than one successor to another with more than one predecessor. Consider this if-then statement:

if (foo) {
   bar = 1;
}
...

and its basic block representation:

digraph { pre [label="foo?"]; then [label="bar = 1;"]; post [label="..."]; pre -> then; pre -> post [color="red"]; then -> post; }

The red edge is a critical edge, since it’s one of two incoming edges and one of two outgoing edges. Before running optimizations like Partial Redundancy Elimination (PRE) and Global Code Motion (GCM) whose aim is to move code into less frequently executed paths, most compilers will split the critical edge by inserting an empty basic block:

digraph { pre [label="foo?"]; then [label="bar = 1;"]; else [label="(empty)"]; post [label="..."]; pre -> then; pre -> else; then -> post; else -> post; }

In basic-block-focused compilers, critical edges are sometimes introduced by other optimizations and then have to be manually removed. But because NIR keeps control flow structured, those sorts of optimizations are either done very differently or not done at all, and therefore it makes sense to always keep critical edges split. It’s for the same reason that NIR doesn’t have a “predicated break” or “predicated continue” instruction, which is supported by most GPUs: they add critical edges to the CFG and prevent the compiler from being able to make code execute only when the break or continue executes. They also increase complexity in code aiming to manipulate the control flow tree, since unconditional continue and break must also be supported. In both cases, it’s easy enough for the backend to perform the optimizations to remove the extra blocks if necessary.

We’ve now explained why most of the extra empty basic blocks were inserted in the example NIR control flow, but there’s still one left. There’s an empty block in between the first if statement and the loop, so that the then and else branches branch to the empty block and then to the first block of the loop instead of jumping directly to the loop. Clearly, it isn’t there to remove a critical edge. So why insert it? Well, imagine that there was a statement in the loop that we determined to be loop-independent, so that we could move it outside the loop, but it was used inside the loop so we couldn’t move it after the loop. The empty block before the loop then comes in handy as a place to move it. Just as splitting critical edges helps optimizations such as PRE, inserting so-called padding blocks before and after the loop can help optimizations that do Loop-Invariant Code Motion (LICM), including GCM.

Putting it Together

We can put all the rules we’ve created into a guide for constructing the control flow tree. To do this, we’ll need a few different data types:

enum nir_cf_node_type

Values:

enumerator nir_cf_node_block
enumerator nir_cf_node_if
enumerator nir_cf_node_loop
enumerator nir_cf_node_function
struct nir_cf_node

Control flow nodes (or CF nodes for short) are the base class for everything in the control-flow tree. They can be an entire function, a loop, an if statement, or a basic block.

Public Members

struct exec_node node

Entry in the CF list.

nir_cf_node_type type

Which type of CF node this is.

struct nir_cf_node *parent

CF node which contains the CF list we’re in.

A control flow list or CF list is a list of control flow nodes, which roughly corresponds to a series of statements in GLSL. It’s used to represent the body of a function and a loop as well as the then and else branches of an if statement. In the core datastructures to follow, CF lists have type struct exec_list but they have additional rules explained below.

struct nir_if

If statement.

Public Members

nir_cf_node cf_node

Base CF node.

nir_src condition

Branch condition

nir_selection_control control

A hint from the user about whether we should flatten this if.

struct exec_list then_list

CF list to be executed if the condition is true.

struct exec_list else_list

CF list to be executed if the condition is false.

struct nir_loop

An infinite loop (the only way to exit is through a break instruction.)

Public Members

nir_cf_node cf_node

Base CF node.

struct exec_list body

CF list representing the body.

nir_loop_info *info

Information about the loop generated by analysis.

nir_loop_control control

A hint from the user about whether the loop should be unrolled.

bool partially_unrolled

Whether the loop has already been partially unrolled based on a guessed trip count and should not be partially unrolled again. Set by loop unrolling.

bool divergent

Whether the loop has any divergent (not reached by all threads that enter the loop) breaks or continues. Set by divergence analysis.

struct nir_block

A basic block. These are the leaves of the control flow tree.

Public Members

nir_cf_node cf_node

Base CF node.

struct exec_list instr_list

List of nir_instr.

unsigned index

generic block index; generated by nir_index_blocks

struct nir_block *successors[2]

Each block can only have up to 2 successors, so we put them in a simple array - no need for anything more complicated.

struct set *predecessors

Set of nir_block predecessors in the CFG

struct nir_block *imm_dom

this node’s immediate dominator in the dominance tree - set to NULL for the start block.

unsigned num_dom_children

Size of dom_children.

struct nir_block **dom_children

This node’s children in the dominance tree

struct set *dom_frontier

Set of nir_block’s on the dominance frontier of this block

uint32_t dom_pre_index

These two indices have the property that dom_{pre,post}_index for each child of this block in the dominance tree will always be between dom_pre_index and dom_post_index for this block, which makes testing if a given block is dominated by another block an O(1) operation.

uint32_t start_ip

Value just before the first nir_instr->index in the block, but after end_ip that of any predecessor block.

uint32_t end_ip

Value just after the last nir_instr->index in the block, but before the start_ip of any successor block.

BITSET_WORD *live_in

SSA def live in for this block; used for liveness analysis. Indexed by ssa_def->index.

BITSET_WORD *live_out

SSA def of live out for this block.

CF lists follow two rules, which together will cover both the if-then-else and loop padding situations: a control flow list must end and begin with a basic block and must contain one (and exactly one) block between each non-block control flow node (i.e. loop or if statement). That is, control flow lists must look like:

block
loop/if
block
loop/if
...
loop/if
block

and they have to consist of at least one (possibly empty) basic block.

Finally, there are a class of instructions called “jump instructions”, which is how breaks, continues, and returns are represented in NIR:

enum nir_jump_type

Values:

enumerator nir_jump_return

Return from a function

This instruction is a classic function return. It jumps to nir_function_impl::end_block. No return value is provided in this instruction. Instead, the function is expected to write any return data to a deref passed in from the caller.

enumerator nir_jump_halt

Immediately exit the current shader

This instruction is roughly the equivalent of C’s exit() in that it immediately terminates the current shader invocation. From a CFG perspective, it looks like a jump to nir_function_impl::end_block but it actually jumps to the end block of the shader entrypoint. A halt instruction in the shader entrypoint itself is semantically identical to a return.

For shaders with built-in I/O, any outputs written prior to a halt instruction remain written and any outputs not written prior to the halt have undefined values. It does NOT cause an implicit discard of written results. If one wants discard results in a fragment shader, for instance, a discard or demote intrinsic is required.

enumerator nir_jump_break

Break out of the inner-most loop

This has the same semantics as C’s break statement.

enumerator nir_jump_continue

Jump back to the top of the inner-most loop

This has the same semantics as C’s continue statement assuming that a NIR loop is implemented as while (1) { body }.

enumerator nir_jump_goto

Jumps for unstructured CFG.

As within an unstructured CFG we can’t rely on block ordering we need to place explicit jumps at the end of every block.

enumerator nir_jump_goto_if
struct nir_jump_instr

Jump instructions for structured and unstructured control flow. There must be at most one jump instruction in a basic block, and it must be at the end of the block.

Public Members

nir_instr instr

Base instruction.

nir_jump_type type

Which type of jump this is.

nir_src condition

For ::nir_jump_goto_if, the branch condition.

struct nir_block *target

For ::nir_jump_goto, the block to jump to. For ::nir_jump_goto_if, the block to jump to if true.

struct nir_block *else_target

For ::nir_jump_goto_if, the block to jump to if false.

Note that “multilevel breaks” and “multilevel continues”, i.e. jumping to a loop outside of the innermost one, are currently not supported, although they may be in the future.

If you aren’t sure, you should go and convince yourself that the example NIR control flow given earlier satisfies all these rules, in addition to being free of critical edges.

Modifying Control Flow

We’ve seen that there are two complimentary ways of describing control flow in NIR, the control flow tree and the control flow graph, which contain redundant information. To ease the burden of keeping both forms up-to-date, core NIR provides a number of helpers for rewriting the control flow graph. They allow you to manipulate the program as if it consists of a series of statements, like in GLSL, while “under the hood” they guarantee that the control flow tree is in the correct form and the successors and predecessors of the basic blocks involved are updated.

These functions all rely on the notion of a cursor:

enum nir_cursor_option

Defines what a cursor is anchored to and whether it is before/after it.

Values:

enumerator nir_cursor_before_block
enumerator nir_cursor_after_block
enumerator nir_cursor_before_instr
enumerator nir_cursor_after_instr
struct nir_cursor

A tiny struct representing a point to insert/extract instructions or control flow nodes. Helps reduce the combinatorial explosion of possible points to insert/extract.

Public Members

nir_cursor_option option

Whether this cursor is before/after a block or instruction.

nir_block *block

The block this is before/after

nir_instr *instr

The instruction this is before/after.

The first, and simplest, function, inserts a newly-created control flow node (if or loop) at the cursor:

void nir_cf_node_insert(nir_cursor cursor, nir_cf_node *node)

puts a control flow node where the cursor is

There is also a function to extract a part of a CF list:

void nir_cf_extract(nir_cf_list *extracted, nir_cursor begin, nir_cursor end)

Extract a piece of control flow from a function.

begin and end must be inside blocks in the same CF list, and begin must be before end. If a NIR CF list corresponds to a list of statements in GLSL, then the portion between begin and end then corresponds to a sub-list within that list, which is extracted into extracted which is a free-floating piece of IR that can later be deleted, cloned, or re-inserted.

This function splits up the basic blocks at both begin and end, and it is left unspecified how they are split up. This means that any pointers to those blocks are invalid after the function is called.

which returns a nir_cf_list structure:

struct nir_cf_list

An opaque wrapper for a portion of a CF list that has been extracted from a function.

The returned CF list can be reinserted/cloned/deleted:

void nir_cf_reinsert(nir_cf_list *cf_list, nir_cursor cursor)

Re-insert a nir_cf_list which has been extracted by nir_cf_extract() at the cursor. Any pointer to the block that cursor is in is similarly invalidated.

void nir_cf_delete(nir_cf_list *cf_list)

Delete a nir_cf_list which has been extracted by nir_cf_extract().

void nir_cf_list_clone(nir_cf_list *dst, nir_cf_list *src, nir_cf_node *parent, struct hash_table *remap_table)

Clone a nir_cf_list which has been extracted by nir_cf_extract().

Parameters
  • dst – The cloned nir_cf_list.

  • src – The nir_cf_list to clone.

  • parent – The nir_cf_node dst will be inserted under.

  • remap_table – A table of SSA values used to rewrite uses of values when cloning. If a value is in this table, uses of it will be rewritten. Otherwise, values outside src will be kept as-is.

Finally, there are a few wrapper functions around these worth mentioning:

static inline void nir_cf_list_extract(nir_cf_list *extracted, struct exec_list *cf_list)

Extract an entire CF list.

static inline void nir_cf_node_remove(nir_cf_node *node)

Removes a control flow node, doing any cleanup necessary.

static inline void nir_cf_list_clone_and_reinsert(nir_cf_list *src_list, nir_cf_node *parent, nir_cursor cursor, struct hash_table *remap_table)

However, there are a few caveats to be aware of when using these functions:

  • Phi nodes are considered attached to the piece of control flow that their sources come from. There are three places where phi nodes can occur, which are the three places where a block can have multiple predecessors:

    1. After an if statement, if neither branch ends in a jump.

    2. After a loop, if there are multiple breaks.

    3. At the beginning of a loop.

    For 1., the phi node is considered to be part of the if, and for 2. and 3. the phi node is considered to be part of the loop. This allows us to keep phis intact, but it means that phi nodes cannot be separated from the control flow they come from. For example, extracting an if without extracting all the phi nodes after it is not allowed, and neither is extracting only some of the phi nodes at the beginning of a block. It also means that extracting from the beginning of a basic block actually means extracting from the first non-phi instruction, since there’s no situation where extracting phi nodes without extracting what comes before them makes any sense.

  • Phi node sources are guaranteed to remain valid, meaning that they still correspond one-to-one with the predecessors of the basic block they’re part of. In addition, the original sources will be preserved unless they correspond to a break or continue that was deleted. However, no attempt is made to ensure that SSA form is maintained. In particular, it is not guaranteed that definitions of SSA values will dominate all their uses after all is said and done. Either the caller must ensure that this is the case, or it must insert extra phi nodes to restore SSA.

  • It is invalid to move a piece of IR with a break/continue outside of the loop it references. Doing this will result in invalid successors/predecessors and phi node sources.

  • It is invalid to move a piece of IR from one function implementation to another.

  • Extracting a control flow list will leave lots of dangling references to and from other pieces of the IR. It also leaves things in a not 100% consistent state. This means that some things (e.g. inserting instructions) might not work reliably on the extracted control flow. It also means that extracting control flow without re-inserting it or deleting it is a Bad Thing (tm) and must be avoided.