Vulkan Subgroup Tutorial
Subgroups are an important new feature in Vulkan 1.1 because they enable highly efficient sharing and manipulation of data between multiple tasks running in parallel on a GPU. In this tutorial, we will cover how to use the new subgroup functionality.
Modern heterogeneous hardware like GPUs gains performance by using parallel hardware and exposing a parallel programming model to target it. When a user wants to run N parallel tasks for their algorithm, a GPU divides this N-sized workload between its compute units. Each compute unit of the GPU is then capable of running one or more of these parallel tasks concurrently. In Vulkan, we refer to the work that runs on a single compute unit of a GPU as the local workgroup, and to an individual parallel task as an invocation.
Vulkan 1.0 already exposes a method to share data between the invocations in a local workgroup via shared memory, which is exposed only in compute shaders. Shared memory allows invocations within the local workgroup to share data via memory that is faster to access than reading and writing buffer memory, providing a mechanism to share data in a performance-sensitive context.
Vulkan 1.1 goes further and introduces a mechanism to share data between the invocations that run in parallel on a single compute unit. These concurrently running invocations are called the subgroup. A subgroup allows data to be shared between a much smaller set of invocations than the local workgroup can, but at significantly higher performance.
While shared memory is only available in compute shaders, sharing data via subgroup operations is allowed in all shader stages that a device reports support for, as we'll explain below.
How to Query for Subgroup Support
In Vulkan 1.1, a new structure has been added for querying the subgroup support of a physical device:
struct VkPhysicalDeviceSubgroupProperties {
VkStructureType sType;
void* pNext;
uint32_t subgroupSize;
VkShaderStageFlags supportedStages;
VkSubgroupFeatureFlags supportedOperations;
VkBool32 quadOperationsInAllStages;
};
To get the subgroup properties of a physical device:
VkPhysicalDevice physicalDevice = ...; // A previously retrieved physical device
VkPhysicalDeviceSubgroupProperties subgroupProperties;
subgroupProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;
subgroupProperties.pNext = NULL;
VkPhysicalDeviceProperties2 physicalDeviceProperties;
physicalDeviceProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
physicalDeviceProperties.pNext = &subgroupProperties;
vkGetPhysicalDeviceProperties2(physicalDevice, &physicalDeviceProperties);
The fields of VkPhysicalDeviceSubgroupProperties are:

- subgroupSize - how many invocations are in a single subgroup of this device.
- supportedStages - which shader stages support subgroup operations.
- supportedOperations - which subgroup operations are supported.
- quadOperationsInAllStages - if supportedOperations contains VK_SUBGROUP_FEATURE_QUAD_BIT, whether the quad operations work in all supportedStages or only in the fragment and compute stages.
There are some minimal guarantees that all Vulkan 1.1 physical devices must support:
- subgroupSize must be at least 1.
- supportedStages must include the VK_SHADER_STAGE_COMPUTE_BIT stage - all compute shaders can use subgroup functionality on all Vulkan 1.1 devices.
- supportedOperations must include VK_SUBGROUP_FEATURE_BASIC_BIT operations; all other categories are optional.
Subgroup operations are pretty useless with a subgroupSize of 1, but these guarantees do mean that a compute shader using basic subgroup functionality is consumable by all Vulkan 1.1 drivers - which is good news for our users who don't want to ship more than one shader to take advantage of the new functionality!
How Subgroups are Formed on Hardware
Subgroups have some characteristics that are true on all hardware that supports the functionality.
Invocations within a subgroup can be active or inactive. An invocation is active if it is, for want of a better word, active within the subgroup - e.g. it is doing actual calculations or memory accesses. An invocation is inactive if the opposite is true - for one reason or another, the invocation is not doing anything useful.
So what cases could cause an invocation to be inactive?
Small Workgroup Size
Let us assume we've got a device that has a subgroupSize that is greater than 1. Now, let's have a look at the following compute shader:
#version 450
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
void main() {
// Do work!
}
In the above case, we are guaranteed to have inactive invocations in every subgroup. Why? Because a subgroup cannot span multiple local workgroups: if you specify a workgroup size that is less than the subgroup size, the remaining invocations in each subgroup have nothing to map to and are inactive. Now, remembering that a subgroup is a way to expose a set of concurrently running invocations, we've basically committed ourselves to underutilizing the hardware in the above example!
If the subgroupSize of the device was 2, we're executing at 50% capacity; with 4, at 25% capacity; on NVIDIA (which has a subgroupSize of 32) we're executing at 3.1% capacity; and on AMD (which has a subgroupSize of 64) we're executing at 1.6% capacity!
Even if you don't use any other part of the new subgroup functionality, you should aim to make your local workgroup at least the size of the subgroup in most situations; one way to achieve this is sketched below.
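A minimal sketch of that advice: using a specialization constant for the workgroup width (constant_id 0 here, chosen arbitrarily) lets the application set it to the subgroupSize it queried from VkPhysicalDeviceSubgroupProperties at pipeline creation - the same trick the max pooling example later in this tutorial uses:
#version 450
// The application sets this specialization constant to the queried
// subgroupSize when creating the compute pipeline, so the workgroup is
// always at least one full subgroup wide.
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
void main() {
    // Do work - now with enough invocations to fill each subgroup!
}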
Dynamic Branching
Let's take the following snippet of a branch:
float x = ...; // Some previously set variable
if (x < 0.0f) {
x = x * -1.0f + 42.0f;
} else {
x = x - 13.0f;
}
In the above example, if some invocations in the subgroup have an x with a value less than 0, and some have a value of x that is not less than 0, some invocations will enter the if branch, and others the else branch. Within the if branch, all invocations that had a value of x that is not less than 0 will be inactive. Likewise, in the else branch, all invocations that had a value of x that is less than 0 will be inactive.
Active Invocations are Happy Invocations
From the information provided above, I hope it's obvious that keeping invocations active, and thus doing actual work, is the key to keeping your GPU happy.
GL_KHR_shader_subgroup
Alongside the Vulkan 1.1 core subgroup functionality, a new GLSL extension, GL_KHR_shader_subgroup, has been released, and is usable in glslang too.
GL_KHR_shader_subgroup exposes new subgroup built-in variables and new subgroup built-in functions. These functions are separated into categories that match the supportedOperations enumeration of Vulkan 1.1.
Each category has a corresponding extension to enable. Enabling any category other than GL_KHR_shader_subgroup_basic will implicitly enable GL_KHR_shader_subgroup_basic also.
We'll use the generic typename T to denote the bool, bvec2, bvec3, bvec4, int, ivec2, ivec3, ivec4, uint, uvec2, uvec3, uvec4, float, vec2, vec3, vec4, double, dvec2, dvec3, and dvec4 types.
#extension GL_KHR_shader_subgroup_basic
The first category is GL_KHR_shader_subgroup_basic. The basic category introduces the built-in subgroup variables, and a few built-in functions too.
In the compute shader stage only:
- gl_NumSubgroups is the number of subgroups within the local workgroup.
- gl_SubgroupID is the ID of the subgroup within the local workgroup, an integer in the range [0..gl_NumSubgroups).
In all supported stages:
- gl_SubgroupSize is the size of the subgroup, which matches the subgroupSize field of VkPhysicalDeviceSubgroupProperties mentioned previously.
- gl_SubgroupInvocationID is the ID of the invocation within the subgroup, an integer in the range [0..gl_SubgroupSize).
- gl_SubgroupEqMask, gl_SubgroupGeMask, gl_SubgroupGtMask, gl_SubgroupLeMask, and gl_SubgroupLtMask are variables that can be used in conjunction with a subgroupBallot result (see the sketch just after this list).
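The mask variables make the most sense alongside subgroupBallot from the ballot category we'll meet below. As a taster, here is a minimal sketch (the voting condition is a hypothetical stand-in):
#version 450
#extension GL_KHR_shader_subgroup_ballot: enable
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
void main() {
    bool someCondition = (gl_GlobalInvocationID.x & 1u) == 0u; // hypothetical
    uvec4 ballot = subgroupBallot(someCondition);
    // Did this invocation vote true? (== and != on vectors yield a scalar bool)
    bool votedTrue = (ballot & gl_SubgroupEqMask) != uvec4(0);
    // How many invocations with a lower ID than ours voted true?
    uint votesBelowMe = subgroupBallotBitCount(ballot & gl_SubgroupLtMask);
}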
The basic category also introduces the various barrier functions to control execution and memory accesses across the subgroup:
- void subgroupBarrier() performs a full memory and execution barrier - basically when an invocation returns from subgroupBarrier() we are guaranteed that every invocation executed the barrier before any return, and all memory writes by those invocations are visible to all invocations in the subgroup.
- void subgroupMemoryBarrier() performs a memory barrier on all memory types (buffers, images, and shared memory). A memory barrier enforces that the ordering of memory operations by a single invocation as seen by other invocations is the same. For example, if I wrote 42 to the 0'th element of a buffer, called subgroupMemoryBarrier(), then wrote 13 to the 0'th element of the same buffer, no other invocation would read a 42 after previously reading 13.
- void subgroupMemoryBarrierBuffer() performs a memory barrier on just the buffer variables accessed.
- void subgroupMemoryBarrierShared() performs a memory barrier on just the shared variables accessed.
- void subgroupMemoryBarrierImage() performs a memory barrier on just the image variables accessed.
- bool subgroupElect() - exactly one invocation within the subgroup will return true, the others will return false. The invocation that returns true is always the active invocation with the lowest gl_SubgroupInvocationID (see the sketch after this list).
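To see subgroupElect() and subgroupBarrier() working together, here is a minimal sketch (the results array and the value it stores are hypothetical) where one invocation per subgroup publishes a value that the rest of its subgroup then reads:
#version 450
#extension GL_KHR_shader_subgroup_basic: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
// One shared slot per subgroup; 128 slots suffice since subgroupSize >= 1
shared uint results[128];
void main() {
    // Exactly one invocation per subgroup computes and publishes a value
    if (subgroupElect()) {
        results[gl_SubgroupID] = 42; // stand-in for some real computation
    }
    // Full execution and memory barrier: after this, the write above is
    // visible to every invocation in this subgroup (other subgroups in the
    // workgroup would still need workgroup-level synchronization)
    subgroupBarrier();
    uint value = results[gl_SubgroupID];
}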
The basic category is the building block for the other categories. In isolation, most of what we could do with the basic category could be done with the local workgroup synchronization primitives that GLSL already has, but when we later combine it with the other categories its power will become clear.
#extension GL_KHR_shader_subgroup_vote
The vote category introduces built-in functions that allow invocations to vote on whether boolean conditions were met across the subgroup:
- bool subgroupAll(bool value) - returns true if all active invocations have value == true.
- bool subgroupAny(bool value) - returns true if any active invocation has value == true.
- bool subgroupAllEqual(T value) - returns true if all active invocations have a value that is equal.
These built-ins are a superset of the ones previously provided in ARB_shader_group_vote.
The vote category is seriously useful in code that has branching. Let's say we have the following shader:
#version 450
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_foo {
int foo[];
};
layout(std430, set=0, binding=1) buffer layout_bar {
int bar[];
};
void main() {
if (foo[gl_GlobalInvocationID.x] < bar[gl_GlobalInvocationID.x]) {
// x
} else {
// y
}
}
And let's assume that you know that 80% of the time the invocations will all either enter the if statement and execute x, or all enter the else statement and execute y. We can give the compiler this information and let it optimize the code accordingly:
#version 450
#extension GL_KHR_shader_subgroup_vote: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_foo {
int foo[];
};
layout(std430, set=0, binding=1) buffer layout_bar {
int bar[];
};
void main() {
bool condition = foo[gl_GlobalInvocationID.x] < bar[gl_GlobalInvocationID.x];
if (subgroupAll(condition)) {
// all invocations in the subgroup are performing x
} else if (!subgroupAny(condition)) {
// all invocations in the subgroup are performing y
} else {
// Invocations that get here are doing a mix of x & y so have a fallback
}
}
#extension GL_KHR_shader_subgroup_ballot
The ballot category introduces built-in functions that allow invocations to do limited sharing of data across the invocations of a subgroup:
- T subgroupBroadcast(T value, uint id) broadcasts the value whose gl_SubgroupInvocationID == id to all other invocations (id must be a compile-time constant).
- T subgroupBroadcastFirst(T value) broadcasts the value whose gl_SubgroupInvocationID is the lowest active to all other invocations.
- uvec4 subgroupBallot(bool value) - each invocation contributes a single bit to the resulting uvec4 corresponding to value.
- bool subgroupInverseBallot(uvec4 value) returns true if this invocation's bit in value is true.
- bool subgroupBallotBitExtract(uvec4 value, uint index) returns true if the bit corresponding to index is set in value.
- uint subgroupBallotBitCount(uvec4 value) returns the number of bits set in value, only counting the bottom gl_SubgroupSize bits.
- uint subgroupBallotInclusiveBitCount(uvec4 value) returns the inclusive scan of the number of bits set in value, only counting the bottom gl_SubgroupSize bits (we'll cover what an inclusive scan is later).
- uint subgroupBallotExclusiveBitCount(uvec4 value) returns the exclusive scan of the number of bits set in value, only counting the bottom gl_SubgroupSize bits (we'll cover what an exclusive scan is later).
- uint subgroupBallotFindLSB(uvec4 value) returns the lowest bit set in value, only counting the bottom gl_SubgroupSize bits.
- uint subgroupBallotFindMSB(uvec4 value) returns the highest bit set in value, only counting the bottom gl_SubgroupSize bits.
These built-ins are a superset of the ones previously provided in ARB_shader_ballot.
The ballot category has two groups of functionality - the ability to send a value from one invocation to the others within a subgroup, and a more powerful form of voting in the form of the ballot built-in.
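A classic pattern that falls out of the ballot bit counts is stream compaction: each invocation votes on whether it has an element worth keeping, and the exclusive bit count hands every keeper a densely packed output slot. A minimal sketch, assuming hypothetical in_data/out_data buffers and filter condition:
#version 450
#extension GL_KHR_shader_subgroup_ballot: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_in_data {
int in_data[];
};
layout(std430, set=0, binding=1) buffer layout_out_data {
int out_data[];
};
void main() {
    int element = in_data[gl_GlobalInvocationID.x];
    bool keep = element != 0; // hypothetical filter condition
    uvec4 vote = subgroupBallot(keep);
    // The keepers below us in the subgroup determine our packed output slot
    uint slot = subgroupBallotExclusiveBitCount(vote);
    if (keep) {
        // For brevity we compact within each subgroup's own output region; a
        // real shader would atomically grab a global offset instead (see the
        // triangle culling example later)
        uint base = gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_SubgroupID * gl_SubgroupSize;
        out_data[base + slot] = element;
    }
}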
Let's take our previous example, and let's say you want to only perform x if at least a quarter of the invocations want to perform x. In other words, most of the time you are fine with performing y, but if enough say otherwise you'll reluctantly run x. Let's make use of our ballot built-ins for this:
#version 450
#extension GL_KHR_shader_subgroup_ballot: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_foo {
int foo[];
};
layout(std430, set=0, binding=1) buffer layout_bar {
int bar[];
};
void main() {
uvec4 ballot = subgroupBallot(foo[gl_GlobalInvocationID.x] < bar[gl_GlobalInvocationID.x]);
if (gl_SubgroupSize <= (subgroupBallotBitCount(ballot) * 4)) {
// all invocations in the subgroup are performing x
} else {
// all invocations in the subgroup are performing y
}
}
Another example: let's say you want to load a value such that the value is the same across the entire subgroup:
#version 450
#extension GL_KHR_shader_subgroup_ballot: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_foo {
vec4 foo[];
};
void main() {
vec4 value;
if (subgroupElect()) {
uint index; // some complicated index
value = foo[index];
// could even do some complicated math on value
}
// Tell everyone else in the subgroup the value
value = subgroupBroadcastFirst(value);
// Every invocation in the subgroup now has the same value
}
#extension GL_KHR_shader_subgroup_arithmetic
The arithmetic category introduces built-in functions that allow invocations to perform some simple operations across the invocations of a subgroup:
- T subgroupAdd(T value) returns the summation of all active invocations' value across the subgroup.
- T subgroupMul(T value) returns the multiplication of all active invocations' value across the subgroup.
- T subgroupMin(T value) returns the minimum value of all active invocations across the subgroup.
- T subgroupMax(T value) returns the maximum value of all active invocations across the subgroup.
- T subgroupAnd(T value) returns the binary and of all active invocations' value across the subgroup.
- T subgroupOr(T value) returns the binary or of all active invocations' value across the subgroup.
- T subgroupXor(T value) returns the binary xor of all active invocations' value across the subgroup.
These operations perform what is termed a reduction operation - the value provided by each active invocation in the subgroup is combined into a single result. Two other sets of operations that are supported are inclusive and exclusive scans. To understand scan operations, let's look at a deliberately non-optimal approach to implementing an add-scan:
#version 450
#extension GL_KHR_shader_subgroup_ballot: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_foo {
float foo[];
};
void main() {
uint foo_index = subgroupBroadcastFirst(gl_GlobalInvocationID.x);
float value = 0;
for (uint i = 0; i < gl_SubgroupSize; i++) {
#ifdef INCLUSIVE_SCAN
if (i == (gl_SubgroupInvocationID + 1)) {
break;
}
#else//EXCLUSIVE_SCAN
if (i == gl_SubgroupInvocationID) {
break;
}
#endif
value += foo[foo_index + i];
}
}
In the example above, we've implemented a scan operation using a loop. In effect, each invocation performs the requested operation with the values of the active invocations in the subgroup whose gl_SubgroupInvocationID is less than its own. The difference between the inclusive and exclusive variants is that the inclusive variants include the invocation's own value in its result, whereas the exclusive variants do not.
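To make the difference concrete, here is a worked example with four active invocations contributing values to an add-scan:
// value passed in:       3  1  4  1
// subgroupInclusiveAdd:  3  4  8  9   (each result includes the invocation's own value)
// subgroupExclusiveAdd:  0  3  4  8   (each result excludes it; the first invocation gets 0)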
- T subgroupInclusiveAdd(T value) returns the inclusive scan summation of all active invocations' value across the subgroup.
- T subgroupInclusiveMul(T value) returns the inclusive scan multiplication of all active invocations' value across the subgroup.
- T subgroupInclusiveMin(T value) returns the inclusive scan minimum value of all active invocations across the subgroup.
- T subgroupInclusiveMax(T value) returns the inclusive scan maximum value of all active invocations across the subgroup.
- T subgroupInclusiveAnd(T value) returns the inclusive scan binary and of all active invocations' value across the subgroup.
- T subgroupInclusiveOr(T value) returns the inclusive scan binary or of all active invocations' value across the subgroup.
- T subgroupInclusiveXor(T value) returns the inclusive scan binary xor of all active invocations' value across the subgroup.
- T subgroupExclusiveAdd(T value) returns the exclusive scan summation of all active invocations' value across the subgroup.
- T subgroupExclusiveMul(T value) returns the exclusive scan multiplication of all active invocations' value across the subgroup.
- T subgroupExclusiveMin(T value) returns the exclusive scan minimum value of all active invocations across the subgroup.
- T subgroupExclusiveMax(T value) returns the exclusive scan maximum value of all active invocations across the subgroup.
- T subgroupExclusiveAnd(T value) returns the exclusive scan binary and of all active invocations' value across the subgroup.
- T subgroupExclusiveOr(T value) returns the exclusive scan binary or of all active invocations' value across the subgroup.
- T subgroupExclusiveXor(T value) returns the exclusive scan binary xor of all active invocations' value across the subgroup.
For each of the reduction operations we showed above, there are equivalent inclusive and exclusive scan variants too.
So where are these operations useful?
There are places where you want to scan an entire data set and perform an operation on the entire set. For example, you might want to know which of millions of data points is the largest. To do this, you typically need to use atomic operations to compare and update a single value multiple times. The problem is that atomics are expensive - resulting in a significant performance drop compared to normal memory accesses. The sanest way to reduce the cost of atomic operations is simply to do fewer of them, and this is where our subgroup arithmetic operations come in. Let's assume we want to get the maximum value across our entire data set:
#version 450
#extension GL_KHR_shader_subgroup_arithmetic: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_foo {
uint foo[];
};
layout(std430, set=0, binding=1) buffer layout_bar {
uint bar;
};
void main() {
uint value = subgroupMax(foo[gl_GlobalInvocationID.x]);
// A single invocation in the subgroup will do the atomic operation
if (subgroupElect()) {
atomicMax(bar, value);
}
}
Before we had subgroup operations, we'd have performed one atomic operation per data point we wanted to consider. Now, we are performing 1/gl_SubgroupSize of the atomic operations - a whopping 32x drop in atomic operations on NVIDIA, and 64x on AMD!
A really cool example of where scan operations are useful is when you are using a compute shader to cull triangles (see Graham Wihlidal's 2016 GDC talk Optimizing the Graphics Pipeline with Compute). Reducing triangles means you are running fewer fragment shaders, which can be a huge win. The basis of the triangle reduction code is that each invocation is going to consider some set of triangles and decide whether it's worth including the triangles or not. For our example we'll simplify the shader somewhat to keep it short:
#version 450
#extension GL_KHR_shader_subgroup_ballot: enable
#extension GL_KHR_shader_subgroup_arithmetic: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
struct PerVertexData {
float x;
float y;
float z;
float u;
float v;
};
layout(std430, set=0, binding=0) buffer layout_in_triangles {
PerVertexData in_triangles[];
};
layout(std430, set=0, binding=1) buffer layout_out_triangles {
uint out_triangles_size; // Needs to be set to 0 before shader invocation
PerVertexData out_triangles[];
};
void main() {
// Get the 3 vertices for our triangle
PerVertexData x = in_triangles[gl_GlobalInvocationID.x * 3 + 0];
PerVertexData y = in_triangles[gl_GlobalInvocationID.x * 3 + 1];
PerVertexData z = in_triangles[gl_GlobalInvocationID.x * 3 + 2];
// Check whether the triangle should be culled or not
bool care_about_triangle = true; // placeholder - a real shader would compute this from x, y & z
uint vertices_to_keep = care_about_triangle ? 3 : 0;
uint local_index = subgroupExclusiveAdd(vertices_to_keep);
// Find out which active invocation has the highest ID
uint highestActiveID = subgroupBallotFindMSB(subgroupBallot(true));
uint global_index = 0;
// If we're the highest active ID
if (highestActiveID == gl_SubgroupInvocationID) {
// We need to carve out a slice of out_triangles for our triangle
uint local_size = local_index + vertices_to_keep;
global_index = atomicAdd(out_triangles_size, local_size);
}
global_index = subgroupMax(global_index);
if (care_about_triangle) {
out_triangles[global_index + local_index + 0] = x;
out_triangles[global_index + local_index + 1] = y;
out_triangles[global_index + local_index + 2] = z;
}
}
#extension GL_KHR_shader_subgroup_shuffle
The shuffle category introduces built-in functions that allow invocations to perform more extensive data sharing across the invocations of a subgroup.
- T subgroupShuffle(T value, uint index) returns the value whose gl_SubgroupInvocationID is equal to index.
- T subgroupShuffleXor(T value, uint mask) returns the value whose gl_SubgroupInvocationID is equal to the current invocation's gl_SubgroupInvocationID xor'ed with mask.
Shuffle performs the same action as subgroupBroadcast, with the difference that the index whose value we want to get can be specified dynamically.
The main benefit of the shuffle built-ins is the ability to do cross-invocation sharing in all shader stages. Compute shaders already give us a mechanism to do this with shared memory, but the shuffle built-ins give us a way to do something similar in other shader stages too:
#version 450
#extension GL_KHR_shader_subgroup_shuffle: enable
layout(location = 0) flat in uint index;
layout(location = 1) in vec4 x;
layout(location = 2) in float blendFactor;
layout(location = 0) out vec4 data;
void main() {
vec4 blendWith = subgroupShuffle(x, index);
data = mix(x, blendWith, blendFactor);
}
In general, if you have a compile-time constant value for the index, you should use subgroupBroadcast as that may use more optimal hardware paths on some hardware.
Shuffle xor is a specialization of subgroup shuffle such that you know every invocation is going to trade its value with exactly one other invocation. Let's take the example where you've got your own fancy reduction algorithm that you want to apply to all members of the subgroup:
#version 450
#extension GL_KHR_shader_subgroup_shuffle: enable
layout(location = 0) in vec4 x;
layout(location = 1) in float additive;
layout(location = 0) out vec4 data;
void main() {
vec4 temp = x;
for (uint i = 1; i <= 128; i *= 2) {
// The mask parameter of subgroupShuffleXor must either be a constant,
// or if used within a loop it must derive from a loop counter whose
// initial value is constant (i = 1), its stride must be constant
// (i *= 2) and its loop end condition must be constant (i <= 128). So
// instead of doing i <= gl_SubgroupSize above, we make the loop counter
// check less than 128 (which is the maximum supported subgroup size),
// and include an additional break here.
if (gl_SubgroupSize == i) {
break;
}
vec4 other = subgroupShuffleXor(temp, i);
temp = temp * other + additive;
}
data = temp;
}
#extension GL_KHR_shader_subgroup_shuffle_relative
The shuffle relative category introduces built-in functions that allow invocations to perform shifted data sharing across the invocations of a subgroup.
- T subgroupShuffleUp(T value, uint delta) returns the value whose gl_SubgroupInvocationID is equal to the current invocation's gl_SubgroupInvocationID minus delta.
- T subgroupShuffleDown(T value, uint delta) returns the value whose gl_SubgroupInvocationID is equal to the current invocation's gl_SubgroupInvocationID plus delta.
These built-ins are yet further specializations of the shuffle built-ins. Shuffle up and shuffle down are great if you want to concoct your own scan operations. Let's say we want a strided scan - e.g. all odd invocations will do a scan together, and all even invocations too:
#version 450
#extension GL_KHR_shader_subgroup_shuffle_relative: enable
layout(location = 0) in vec4 x;
layout(location = 0) out vec4 data;
void main() {
vec4 temp = x;
// This is a custom strided inclusive scan!
for (uint i = 2; i < gl_SubgroupSize; i *= 2) {
vec4 other = subgroupShuffleUp(temp, i);
if (i <= gl_SubgroupInvocationID) {
temp = temp * other;
}
}
data = temp;
}
Another really cool thing you can do with shuffle relative is reverse scans. As you may have realized, the inclusive and exclusive scans perform the scan operation from the lowest to the highest ID within the subgroup. There may well be situations where you want the scan performed in reverse, from the highest ID to the lowest:
#version 450
#extension GL_KHR_shader_subgroup_shuffle_relative: enable
layout(location = 0) in vec4 x;
layout(location = 0) out vec4 data;
void main() {
vec4 temp = x;
// This is a custom reverse inclusive scan!
for (uint i = 1; i < gl_SubgroupSize; i *= 2) {
vec4 other = subgroupShuffleDown(temp, i);
if ((gl_SubgroupSize - i) > gl_SubgroupInvocationID) {
temp = temp * other;
}
}
data = temp;
}
#extension GL_KHR_shader_subgroup_clustered
The clustered category takes the operations we introduced in the arithmetic category but allows only subsets of the invocations to interact with each other.
- T subgroupClusteredAdd(T value, uint clusterSize) returns the summation of all active invocations' value across clusters of size clusterSize.
- T subgroupClusteredMul(T value, uint clusterSize) returns the multiplication of all active invocations' value across clusters of size clusterSize.
- T subgroupClusteredMin(T value, uint clusterSize) returns the minimum value of all active invocations across clusters of size clusterSize.
- T subgroupClusteredMax(T value, uint clusterSize) returns the maximum value of all active invocations across clusters of size clusterSize.
- T subgroupClusteredAnd(T value, uint clusterSize) returns the binary and of all active invocations' value across clusters of size clusterSize.
- T subgroupClusteredOr(T value, uint clusterSize) returns the binary or of all active invocations' value across clusters of size clusterSize.
- T subgroupClusteredXor(T value, uint clusterSize) returns the binary xor of all active invocations' value across clusters of size clusterSize.
The clusterSize must be a compile-time constant, a power of two, and at least one. You'll also get undefined results if clusterSize is greater than gl_SubgroupSize.
The main idea with clustered operations is that sometimes you want to only share data with a selection of your closest neighbors within the subgroup.
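As a small warm-up before the larger example below, here is a minimal sketch (the values buffer and the choice of operation are hypothetical) that replaces each value with the average of its cluster of 4 neighboring invocations:
#version 450
#extension GL_KHR_shader_subgroup_clustered: enable
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(std430, set=0, binding=0) buffer layout_values {
float values[];
};
void main() {
    float value = values[gl_GlobalInvocationID.x];
    // Each cluster of 4 invocations sums its values; dividing by 4 gives the
    // average (this assumes all 4 invocations in the cluster are active)
    float average = subgroupClusteredAdd(value, 4) / 4.0;
    values[gl_GlobalInvocationID.x] = average;
}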
A really cool thing you can do with subgroup operations is implement a high-performance convolutional neural network. The clustered operations, in particular, can help us implement a specific part of the neural network: max pooling. In max pooling, we want to take a large data set and compress it into a smaller data set. We do this by dividing our large data set into NxN grids and outputting a single element per grid - the maximum value within that NxN grid. In the below example, we'll use a 4x4 grid using clusters of size 16:
#version 450
#extension GL_KHR_shader_subgroup_clustered: enable
// Using a spec constant for the x dimension here, and we'll set this to the
// subgroup size in the Vulkan API so that the work group is the exact size of
// the subgroup
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
layout(r16f, set = 0, binding = 0) uniform readonly image3D inImage;
layout(r16f, set = 0, binding = 1) uniform writeonly image3D outImage;
void main() {
// We're going to perform a 4x4 max pooling operation. So we divide up our
// subgroup size into chunks of 16. We'll split up the subgroup like so:
// x 0 1 2 3 4 5 6 7
// y+------------------------
// 0| 0 1 2 3 16 17 18 19
// 1| 4 5 6 7 20 21 22 23
// 2| 8 9 10 11 24 25 26 27
// 3| 12 13 14 15 28 29 30 31
uint numClusters = gl_SubgroupSize / 16;
uint clusterID = gl_SubgroupInvocationID / 16;
uint x = (clusterID * 4) + (gl_SubgroupInvocationID % 4);
uint y = (gl_SubgroupInvocationID / 4) % 4;
uvec3 inIndex = gl_WorkGroupID * uvec3(numClusters * 4, 4, 1) + uvec3(x, y, 0);
float load = imageLoad(inImage, ivec3(inIndex)).x;
float max = subgroupClusteredMax(load, 16);
if (0 == (gl_SubgroupInvocationID % 16)) {
uvec3 outIndex = gl_WorkGroupID;
outIndex.x = outIndex.x * numClusters + clusterID;
imageStore(outImage, ivec3(outIndex), vec4(max));
}
}
#extension GL_KHR_shader_subgroup_quad
The quad category introduces the concept of a subgroup quad, which is a cluster of size 4. This quad corresponds to 4 neighboring pixels in a 2x2 grid within fragment shaders. The quad operations allow for the efficient sharing of data within the quad.
- T subgroupQuadBroadcast(T value, uint id) returns the value in the quad whose gl_SubgroupInvocationID modulus 4 is equal to id.
- T subgroupQuadSwapHorizontal(T value) swaps values within the quad horizontally.
- T subgroupQuadSwapVertical(T value) swaps values within the quad vertically.
- T subgroupQuadSwapDiagonal(T value) swaps values within the quad diagonally.
Quad operations are not restricted to just fragment stages though. In other stages, you can think of the quad as just a special case of the clustered operations where the cluster size is 4.
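For instance, here is a minimal sketch (not a built-in of the extension, and assuming the standard fragment quad layout where quad indices map to pixels as 0=(0,0), 1=(1,0), 2=(0,1), 3=(1,1)) that approximates a horizontal screen-space difference with a quad swap, similar in spirit to dFdx():
#version 450
#extension GL_KHR_shader_subgroup_quad: enable
layout(location = 0) in vec4 x;
layout(location = 0) out vec4 data;
void main() {
    // Trade values with our horizontal neighbor in the 2x2 quad
    vec4 other = subgroupQuadSwapHorizontal(x);
    // With the layout above, bit 0 of the quad index is the x coordinate, so
    // flip the sign on right-hand pixels to get a consistent left-to-right
    // difference across both pixels of the row
    bool isRight = (gl_SubgroupInvocationID & 1u) == 1u;
    data = isRight ? (x - other) : (other - x);
}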
An example of where quad operations can be useful: imagine you've got a fragment shader, and you also want to output a lower resolution image of the fragment result (the next smaller mip level). We can do this by including a single extra image store in the fragment shader, using our quad subgroup functionality:
#version 450
#extension GL_KHR_shader_subgroup_quad: enable
layout(location = 0) in vec3 inColor;
layout(location = 0) out vec3 color;
layout(set = 0, binding = 0) uniform writeonly image2D lowResFrame;
void main() {
// Sum the colors across the 2x2 quad, then divide by 4 to average them
vec3 lowResColor = inColor + subgroupQuadSwapHorizontal(inColor);
lowResColor += subgroupQuadSwapVertical(lowResColor);
lowResColor *= 0.25f;
// Store the color out to the framebuffer
color = inColor;
// Only the 0'th pixel of each quad will enter here and do the image store
if (gl_SubgroupInvocationID == subgroupQuadBroadcast(gl_SubgroupInvocationID, 0)) {
ivec2 coord = ivec2(gl_FragCoord.xy / 2.0f);
imageStore(lowResFrame, coord, vec4(lowResColor, 1.0f));
}
}
One More Thing
While the GLSL lovers among the readership will be delighted with these new GLSL additions, it'd be wrong if we forgot our HLSL-loving brethren! Alongside the new GLSL functionality added to glslang, we're pleased to announce that all of the Shader Model 6.0 wave operations are also supported in glslang's HLSL mode in the 1.1.70 release of LunarG's Vulkan SDK, and available in the glslang and DXC GitHub repositories.
Minor note: some of the built-ins (like the clustered and shuffle operations) are only available in GLSL shaders as HLSL has no corresponding built-ins we can map to.
About the Author
Neil Henning is Principal Software Engineer for Vulkan & SPIR-V at Codeplay Software Ltd., and Codeplay's lead representative to the Vulkan and SPIR-V working groups as a member of Khronos. Neil dedicated 2 years of his blood, sweat, and tears as the primary engineer and author of Vulkan 1.1's new subgroup functionality. The best place to find Neil is on Twitter @sheredom.