Understanding Vulkan Synchronization

Graphics developers with a history of using DirectX and OpenGL may find many aspects of low-level GPU APIs such as Vulkan familiar. But as we explore how to get the highest performance from Vulkan, it becomes apparent how much of driving the GPU the DirectX and OpenGL drivers have been handling behind the scenes, and how much of it we now get to control explicitly.

For example, Vulkan enables developers to take more explicit control over the coordination of graphics tasks and memory management in their applications. These are development tasks any C/C++ developer should be able to handle, though they may come with a bit of a learning curve, or may just require dusting off some skills that haven't been used in a while.

The goal of this article is to help developers easily understand and not be intimidated or confused by one of the toughest aspects of Vulkan: Synchronization. We’ll go through individual concepts important to synchronization and demonstrate how to use them correctly and effectively.

Synchronization2 Released!

Synchronization is a critical but often misunderstood part of the Vulkan API. The new VK_KHR_synchronization2 extension includes several improvements to make Vulkan Synchronization easier to use, without major changes to the fundamental concepts described below.

We’ll highlight key differences introduced with Synchronization2 throughout the blog.

Why is Synchronization Important?

Vulkan gives us increased control over the render process to maximize the use of both CPU and GPU resources by running many tasks in parallel. Whereas previous generation APIs were presented as if operations ran sequentially, Vulkan is explicitly parallel and built for multithreading.

For example, the GPU and CPU can run various fragment and vertex operations of the current frame and the next frame all independently of each other. By being specific about which operations need to wait on one another and which operations do not need to wait, Vulkan can render scenes with maximum efficiency and minimal wait time.

By putting CPU and GPU cores to work in tandem with the correct coordinated timing we can keep resources from idling for longer than they need to, squeezing the most performance out of the user’s system. The key is making sure that any parallel tasks wait only when they need to, and only for as long as necessary.

This is where proper and effective synchronization comes into play.

For example, we need to keep the final post-processing shader effect of a game waiting until the current frame has been fully rendered to avoid render artifacts or other strangeness. Vulkan’s synchronization operations let us define these tasks and dependencies as part of the render pipeline so that it can process the work as efficiently as possible.

To understand how this works, we need to look at synchronization at two levels: within a single queue and across multiple queues. Let’s start by looking at in-queue synchronization.

Synchronization Within a Device Queue

Vulkan enables us to send command buffers into a queue to process graphics operations. This process is designed to be thread-friendly so we can submit work using command buffers from any CPU thread and they are eventually inserted into the same GPU queue. This gives us the ability to do our own multi-threading while Vulkan runs its commands, also often in parallel, computing vertices or loading textures to maximize the use of all CPU cores.

Note that our commands can depend on the completion of other commands even within the same queue, and they do not need to be in the same command buffer. Commands are also guaranteed to start in the exact order they were inserted, but because they can run in parallel, there is no guarantee that the commands will complete in that same order.

The in-queue tools for synchronization that ensure these commands wait correctly on their dependencies are pipeline barriers, events, and subpass dependencies. Let's take a look at how to use each one.

Pipeline Barriers

Pipeline barriers specify which data and which pipeline stages in previous commands must complete, and which stages in subsequent commands are blocked until they do.

Keep in mind that these barriers are GPU-only, which means that we cannot check when a pipeline barrier has been executed from our application running on the CPU. If we need to signal back to the application on the CPU, it’s possible to do so by instead using another tool called a fence or an event, which we will discuss later.

The two types of barriers are:

  • Execution barriers
  • Memory barriers

We can create either an execution barrier alone, or an execution barrier together with any number of memory barriers of one or more types, in a single call.

Here is the pipeline barrier function for reference as we discuss parts of it below:

void vkCmdPipelineBarrier(
   VkCommandBuffer                             commandBuffer,
   VkPipelineStageFlags                        srcStageMask,
   VkPipelineStageFlags                        dstStageMask,
   VkDependencyFlags                           dependencyFlags,
   uint32_t                                    memoryBarrierCount,
   const VkMemoryBarrier*                      pMemoryBarriers,
   uint32_t                                    bufferMemoryBarrierCount,
   const VkBufferMemoryBarrier*                pBufferMemoryBarriers,
   uint32_t                                    imageMemoryBarrierCount,
   const VkImageMemoryBarrier*                 pImageMemoryBarriers);

Vulkan Synchronization2 note:

Synchronization2 stores barrier pipeline stage masks in the barrier structure rather than passing them as separate parameters to vkCmdPipelineBarrier. This change simplifies resource tracking.

Execution Barriers

When we want to control the flow of commands and enforce the order of execution using pipeline barriers, we can insert a barrier between Vulkan action commands and specify the pipeline stages that previous commands must finish before execution continues. We can also specify the pipeline stages in subsequent commands that should be held until the barrier is satisfied.

These options are set using the srcStageMask and dstStageMask parameters of vkCmdPipelineBarrier. Since they are bit flags, we can specify multiple stages in these masks. The srcStageMask marks the stages to wait for in previous commands before allowing the stages given in dstStageMask to execute in subsequent commands. For execution barriers, the srcStageMask is expanded to include logically earlier stages. Likewise, the dstStageMask is expanded to include logically later stages. The stages in the dstStageMask (and later) will not start until the stages in srcStageMask (and earlier) complete. This is sufficient to guard read accesses by stages in srcStageMask from write accesses by stages in dstStageMask.

To avoid a common pitfall, note that stage mask expansion is not applied to memory barriers defined below.
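The expansion rule for execution barriers can be sketched with a small, self-contained C model. This is only an illustration, not Vulkan API: the TOY_STAGE_* values, their simplified linear ordering, and the toy_expand_src and toy_expand_dst helpers are hypothetical names invented for this sketch. The real pipeline has more stages, and logical ordering is defined separately for each pipeline type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified graphics pipeline listed in logical order.
 * These are NOT the real VkPipelineStageFlagBits values. */
enum {
    TOY_STAGE_DRAW_INDIRECT   = 1u << 0,
    TOY_STAGE_VERTEX_INPUT    = 1u << 1,
    TOY_STAGE_VERTEX_SHADER   = 1u << 2,
    TOY_STAGE_FRAGMENT_SHADER = 1u << 3,
    TOY_STAGE_COLOR_OUTPUT    = 1u << 4,
    TOY_STAGE_ALL             = (1u << 5) - 1u
};

/* Source side of an execution barrier: waiting for a stage also waits for
 * every logically earlier stage, so set every bit below the highest set bit. */
uint32_t toy_expand_src(uint32_t mask)
{
    assert((mask & ~(uint32_t)TOY_STAGE_ALL) == 0); /* only known stages */
    uint32_t expanded = mask;
    for (uint32_t shift = 1; shift < 32; shift <<= 1) {
        expanded |= expanded >> shift;
    }
    return expanded;
}

/* Destination side: blocking a stage also blocks every logically later
 * stage, so set every bit above the lowest set bit. */
uint32_t toy_expand_dst(uint32_t mask)
{
    assert((mask & ~(uint32_t)TOY_STAGE_ALL) == 0); /* only known stages */
    uint32_t expanded = mask;
    for (uint32_t shift = 1; shift < 32; shift <<= 1) {
        expanded |= expanded << shift;
    }
    return expanded & (uint32_t)TOY_STAGE_ALL; /* drop bits past the last stage */
}
```

In this model, a source mask of TOY_STAGE_VERTEX_SHADER expands to the draw-indirect, vertex-input, and vertex-shader stages, while a destination mask of TOY_STAGE_FRAGMENT_SHADER expands to the fragment-shader and color-output stages. Access masks in memory barriers get no such expansion.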

For a full reference of the pipeline stages available, see Pipeline Barriers in the Vulkan Specification.

Memory Barriers

To increase performance under the hood, CPUs and GPUs place fast caches (such as L1 and L2) between their cores and the relatively slow main memory.

When one core writes to memory (to a render target, for example), the updates could still only exist in a cache and not be available or visible to another core ready to work with it. Memory barriers are the tools we can use to ensure that caches are flushed and our memory writes from commands executed before the barrier are available to the pending after-barrier commands. They are also the tool we can use to invalidate caches so that the latest data is visible to the cores that will execute after-barrier commands.
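To make the two halves of a memory barrier concrete, here is a deliberately simplified C model of one writer core and one reader core, each with a private cache in front of shared memory. Everything in it (ToyMemorySystem and the toy_* functions) is hypothetical and invented for this sketch. Real hardware is far more complex, and the Vulkan specification defines availability and visibility abstractly rather than in terms of specific caches.

```c
#include <stdbool.h>

/* Hypothetical toy model: a writer core and a reader core, each with a
 * one-entry private cache in front of shared main memory. */
typedef struct {
    int  memory;        /* value currently in main memory */
    int  writer_cache;  /* value held in the writer's cache */
    bool writer_dirty;  /* writer's cache holds data not yet in memory */
    int  reader_cache;  /* value held in the reader's cache */
    bool reader_valid;  /* reader's cache may be used without refetching */
} ToyMemorySystem;

/* A write lands in the writer's cache only. */
void toy_write(ToyMemorySystem *m, int value)
{
    m->writer_cache = value;
    m->writer_dirty = true;
}

/* "Make available": flush the writer's dirty data out to main memory. */
void toy_make_available(ToyMemorySystem *m)
{
    if (m->writer_dirty) {
        m->memory = m->writer_cache;
        m->writer_dirty = false;
    }
}

/* "Make visible": invalidate the reader's cache so its next read refetches. */
void toy_make_visible(ToyMemorySystem *m)
{
    m->reader_valid = false;
}

/* A read is served from the reader's cache, refilling it only if invalid. */
int toy_read(ToyMemorySystem *m)
{
    if (!m->reader_valid) {
        m->reader_cache = m->memory;
        m->reader_valid = true;
    }
    return m->reader_cache;
}

/* A full memory barrier in this model performs both halves in order. */
void toy_memory_barrier(ToyMemorySystem *m)
{
    toy_make_available(m);
    toy_make_visible(m);
}
```

In this model, a reader that has already cached the old value keeps seeing it after toy_write, and still sees it after toy_make_available alone. Only once the reader's cache has also been invalidated does toy_read return the new value, which is why a memory barrier needs both a source and a destination side.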

In addition to the pipeline stage masks specified for execution barriers, memory barriers specify both the type of memory accesses to wait for, and the types of accesses that are blocked at the specified pipeline stages. Each memory barrier below contains a source access mask (srcAccessMask) and a destination access mask (dstAccessMask) to specify that the source accesses (typically writes) by the source stages in previous commands are available and visible to the destination accesses by the destination stages in subsequent commands.

In contrast to execution barriers, these access masks only apply to the precise stages set in the stage masks, and are not extended to logically earlier and later stages.

There are three types of memory barriers we can use: global, buffer, and image. Each of these defines the accesses that are guaranteed to be made available (the source accesses by the source stages) and the stages and access types to which these accesses will be made visible (the destination accesses by the destination stages).

  • Global memory barriers are added via the pMemoryBarriers parameter and apply to all memory objects.
  • Buffer memory barriers are added via the pBufferMemoryBarriers parameter and only apply to device memory bound to VkBuffer objects.
  • Image memory barriers are added via the pImageMemoryBarriers parameter and only apply to device memory bound to VkImage objects.

typedef struct VkMemoryBarrier {
   VkStructureType sType;
   const void* pNext;
   VkAccessFlags srcAccessMask;
   VkAccessFlags dstAccessMask;
} VkMemoryBarrier;

typedef struct VkBufferMemoryBarrier {
   VkStructureType sType;
   const void* pNext;
   VkAccessFlags srcAccessMask;
   VkAccessFlags dstAccessMask;
   uint32_t srcQueueFamilyIndex;
   uint32_t dstQueueFamilyIndex;
   VkBuffer buffer;
   VkDeviceSize offset;
   VkDeviceSize size;
} VkBufferMemoryBarrier;

typedef struct VkImageMemoryBarrier {
   VkStructureType sType;
   const void* pNext;
   VkAccessFlags srcAccessMask;
   VkAccessFlags dstAccessMask;
   VkImageLayout oldLayout;
   VkImageLayout newLayout;
   uint32_t srcQueueFamilyIndex;
   uint32_t dstQueueFamilyIndex;
   VkImage image;
   VkImageSubresourceRange subresourceRange;
} VkImageMemoryBarrier;

The parameters srcQueueFamilyIndex and dstQueueFamilyIndex are used to support sharing buffers or images among multiple device queues from different queue families, which requires additional care to synchronize memory accesses. For resources created with VK_SHARING_MODE_EXCLUSIVE, Queue Ownership Transfer barriers must be executed on both the source and destination queues. These barriers are similar to normal vkCmdPipelineBarrier operations, except the source stage and access mask part of the barrier is submitted to the source queue and the destination stage and access mask part is submitted to the destination queue. Using VK_SHARING_MODE_CONCURRENT when creating buffers and images avoids the need for these barriers, but usually results in worse performance.

Execution and Memory Dependency

One scenario where we might want to set up a pipeline barrier is when we write to a texture image in a compute shader and then read it in a fragment shader. That setup might look like this example from the Vulkan synchronization examples wiki:

vkCmdDispatch(...);

VkImageMemoryBarrier imageMemoryBarrier = {
   ...
   .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
   .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
   .oldLayout = VK_IMAGE_LAYOUT_GENERAL,
   .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
   /* .image and .subresourceRange should identify image subresource accessed */
};

vkCmdPipelineBarrier(
   ...
   VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // srcStageMask
   VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, // dstStageMask
   ...
   1,                                     // imageMemoryBarrierCount
   &imageMemoryBarrier,                   // pImageMemoryBarriers
   ...);

... // Render pass setup etc.

vkCmdDraw(...);

For more examples of setting up synchronization for various setups, we recommend the Synchronization Examples in the Vulkan wiki.

Events

Another tool for synchronization in Vulkan is the event, which uses source stage masks and destination stage masks just like pipeline barriers, and can be quite useful when we need to specify and run parallel computation. The key difference between events and pipeline barriers is that event barriers occur in two parts. The first part is setting an event using vkCmdSetEvent, and the second is waiting for an event with vkCmdWaitEvents. Events synchronize the execution and memory accesses that occur before vkCmdSetEvent calls with execution and memory accesses which occur after the vkCmdWaitEvents calls; commands that occur between vkCmdSetEvent and vkCmdWaitEvents are unaffected by the event. While events can be set both from the GPU in command buffers, and from the host, we will only discuss the GPU side event setting here.

An event is set by calling vkCmdSetEvent with a stageMask parameter that marks the stages to wait for in previous commands before signalling the event. The vkCmdWaitEvents call takes nearly identical parameters to vkCmdPipelineBarrier, with the meaning of the srcStageMask altered to be the union of the stageMask of all events in pEvents. Without memory barriers, vkCmdWaitEvents causes the dstStageMask stages in subsequent commands to wait until all events in pEvents are signaled. This creates an execution barrier between the stageMask stages of commands recorded before each vkCmdSetEvent and the dstStageMask stages of subsequent commands.

When present, the optional memory barriers function very similarly to pipeline barriers. srcStageMask and srcAccessMask combine to define the memory accesses to be complete, available, and visible to those matching dstStageMask and dstAccessMask in subsequent commands, creating a memory barrier with those accesses. However, the accesses guaranteed to be available and visible are limited to those matching stageMask in commands prior to each vkCmdSetEvent as well as being included in srcStageMask and srcAccessMask in vkCmdWaitEvents.

As with pipeline barriers, stage masks are expanded for execution barriers but not for memory barriers.

An example usage of events might look like this:

// Pseudocode: arguments are abbreviated to the ones that matter here

// Three dispatches that don’t have conflicting resource accesses
vkCmdDispatch( 1 );
vkCmdDispatch( 2 );
vkCmdDispatch( 3 );

// 4, 5, and 6 don’t share resources with 1, 2, and 3
// No reason for them to be blocked, so set an event to wait for later
vkCmdSetEvent( A, stageMask = COMPUTE );
vkCmdDispatch( 4 );
vkCmdDispatch( 5 );
vkCmdDispatch( 6 );

// 7 and 8 don’t use the same resources as 4, 5, and 6. So use an event
vkCmdSetEvent( B, stageMask = COMPUTE );

// 7 and 8 need the results of 1, 2, and 3
// So we’ll wait for them by waiting on A
vkCmdWaitEvents( A, dstStageMask = COMPUTE );
vkCmdDispatch( 7 );
vkCmdDispatch( 8 );

// 9 uses the same resources as 4, 5, and 6 so we wait.
// Also assumed is that 9 needs nothing from 7 and 8
vkCmdWaitEvents( B, dstStageMask = COMPUTE );
vkCmdDispatch( 9 );

By interleaving groups of unrelated work, we can reduce device stall time with events. Assuming the work between the Set/Wait pairs is sufficient to cover the write retirement and cache flushing, the stalls at wait time are minimized, while still guaranteeing synchronized memory accesses.
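To check our reading of the example above, we can model which dispatches end up waiting on which with a few lines of self-contained C. This is a toy dependency tracker, not Vulkan: ToyCommandBuffer, ToyEvent, and the toy_* functions are hypothetical names invented for this sketch, and the model tracks execution ordering only, ignoring stage masks and memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_COMMANDS 32u

/* Hypothetical toy command buffer that records dependencies as bitmasks. */
typedef struct {
    uint32_t recorded;               /* bit i set: dispatch i has been recorded */
    uint32_t waited;                 /* dispatches that all later commands wait for */
    uint32_t deps[TOY_MAX_COMMANDS]; /* deps[i]: dispatches that dispatch i waits for */
} ToyCommandBuffer;

typedef struct {
    uint32_t captured; /* dispatches recorded before the event was set */
} ToyEvent;

/* Record a dispatch; it inherits every wait issued so far. */
void toy_dispatch(ToyCommandBuffer *cb, unsigned id)
{
    assert(id < TOY_MAX_COMMANDS);
    cb->deps[id] = cb->waited;
    cb->recorded |= UINT32_C(1) << id;
}

/* Setting an event captures everything recorded before it. */
void toy_set_event(const ToyCommandBuffer *cb, ToyEvent *ev)
{
    ev->captured = cb->recorded;
}

/* Waiting makes later commands depend on what the event captured. */
void toy_wait_event(ToyCommandBuffer *cb, const ToyEvent *ev)
{
    cb->waited |= ev->captured;
}

/* Does dispatch `later` wait for dispatch `earlier`? */
bool toy_waits_for(const ToyCommandBuffer *cb, unsigned later, unsigned earlier)
{
    assert(later < TOY_MAX_COMMANDS && earlier < TOY_MAX_COMMANDS);
    return ((cb->deps[later] >> earlier) & 1u) != 0;
}

/* Records the same sequence as the article's nine-dispatch example. */
void toy_record_example(ToyCommandBuffer *cb)
{
    ToyEvent a = {0};
    ToyEvent b = {0};
    for (unsigned id = 1; id <= 3; ++id) toy_dispatch(cb, id);
    toy_set_event(cb, &a);
    for (unsigned id = 4; id <= 6; ++id) toy_dispatch(cb, id);
    toy_set_event(cb, &b);
    toy_wait_event(cb, &a);
    for (unsigned id = 7; id <= 8; ++id) toy_dispatch(cb, id);
    toy_wait_event(cb, &b);
    toy_dispatch(cb, 9);
}
```

After toy_record_example runs on a zero-initialized ToyCommandBuffer, dispatches 7 and 8 wait for 1 through 3 but not for 4 through 6, dispatches 4 through 6 wait for nothing, and dispatch 9 waits for 1 through 6 but not for 7 or 8. That matches the intent stated in the example's comments.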

Between each signalling of an event, the event must be reset with vkCmdResetEvent. The reset must be synchronized with both the vkCmdWaitEvents that precedes it and the vkCmdSetEvent that follows. Please see Events in the Vulkan Specification for full details.

Vulkan Synchronization2 note:

In Synchronization2, vkCmdSetEvent2KHR requires the pipeline barriers to be supplied when the event is set. This change enhances driver efficiency by allowing work to be scheduled at event ‘set’ time, rather than having to wait until ‘wait’ time for barrier information to become available.

Subpass Dependencies

One more way to synchronize within the device queue is through subpass dependencies. These are similar to pipeline barriers, but are used specifically to express dependencies between subpasses of a render pass, and between commands within a render pass instance and those outside it, either before or after.

These can be a bit tricky because they come with many restrictions. But they can come in handy if we’re working with data across render passes, like when rendering shadows or reflections, or if we need to wait on an external resource or event.

When using subpass dependencies, there are some things we want to keep in mind:

  • They contain only a single memory barrier for attachments specified with srcAccessMask and dstAccessMask.
  • Subpasses can wait to complete using pipeline stage flags srcStageMask and dstStageMask.
  • They can only make forward progress, meaning a subpass can depend on earlier subpasses or on itself, but cannot depend on later subpasses in the same render pass.

Synchronization Across Multiple Device Queues

Now that we’ve gone through some mechanisms for setting up dependencies within a single device queue, let’s check out how we can orchestrate synchronization across different queues.

The Vulkan API provides two options, each with different purposes: semaphores and fences.

Note that Vulkan 1.2 introduced timeline semaphores, which are the new and preferred approach to semaphores going forward. They aren’t widely available yet on mobile, so we'll talk about the original approaches first, then look at how the new timeline semaphore design is able to replace both of the original options.

Semaphores

Semaphores are simply signal identifiers that indicate when a batch of commands has been processed. When submitting work to a queue with vkQueueSubmit, we can pass in semaphores for the batch to wait on before it executes and semaphores for it to signal when it completes.

The key to understanding and using semaphores is recognizing that they are for synchronizing solely between GPU tasks, especially across multiple queues, and not for synchronizing between GPU and CPU tasks.

If multiple commands are busy crunching away on their tasks across cores and threads, a semaphore is like an announcement that a team of commands has finished. Semaphores are signaled only after all commands in the batch are complete. They make implicit memory guarantees, so we can access any memory following the semaphore without needing to think about adding memory barriers between them.

Vulkan Synchronization2 note:

Synchronization2 passes data for semaphores and command buffers in arrays of structures, rather than in separate arrays spread across multiple structures, to streamline queue submissions.

Fences

Fences are pretty straightforward: while semaphores were built for synchronizing GPU tasks, fences are designed for GPU-to-CPU synchronization.

Fences can attach to a queue submission and allow the application to check a fence status using vkGetFenceStatus or to wait for queues to complete using vkWaitForFences.

Fences make the same implicit memory guarantee as semaphores, and if we want to present the next frame in a swap buffer, we can use fences to know when to swap and start the render of the next frame.

Timeline Semaphores

Now to touch on the Vulkan 1.2 API’s new timeline semaphores.

The new timeline semaphore approach comes with various advantages because it’s flexible and works as a superset of both semaphores and fences while allowing signaling between GPU and CPU in both directions. Not only can we wait on semaphores from the application on the CPU, we can even signal a semaphore from the CPU for the GPU to wait on! And while fences only work at the coarse queue submission level, timeline semaphores have finer granularity for more flexibility.

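The way it works is very clever: each timeline semaphore holds a monotonically increasing 64-bit integer counter that acts as a signal timeline. A signal operation advances the counter to a specified value, and a wait operation blocks until the counter reaches or passes the value it is waiting for.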

We could think about it like a stadium crowd orchestrating a wave together where rows of people stand up and sit back down, passing the motion along to the next person in sequence without having to specifically tell the person sitting next to them. Each person already knows what their position is in the stadium (the timeline) and as the wave comes near, they recognize it is their turn. Pretty cool.
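The counter idea is simple enough to sketch in a few lines of C. This is a single-threaded toy model for building intuition, not the Vulkan API: ToyTimeline and the toy_timeline_* functions are hypothetical names invented here. The real entry points, such as vkSignalSemaphore, vkWaitSemaphores, and vkGetSemaphoreCounterValue, are not used, and there is no actual blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy timeline: just a counter that only moves forward. */
typedef struct {
    uint64_t value;
} ToyTimeline;

/* Signal: advance the counter to `value`. In this model, as with real
 * timeline semaphores, each signal must move the counter strictly forward. */
void toy_timeline_signal(ToyTimeline *t, uint64_t value)
{
    assert(value > t->value);
    t->value = value;
}

/* A wait for `value` is satisfied once the counter has reached or passed
 * it, so one signal can release many waiters at different points. */
bool toy_timeline_wait_satisfied(const ToyTimeline *t, uint64_t value)
{
    return t->value >= value;
}
```

Starting from a counter of 0, a wait for 1 is not yet satisfied. After signalling 5, waits for any value from 1 through 5 are satisfied and a wait for 6 is not. Because every waiter names the point on the timeline it cares about, a single object can stand in for many binary semaphores or fences.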

For more information, see the Vulkan Timeline Semaphores post on the Khronos Group blog.

Next Steps

Using synchronization to enable tasks to efficiently run in parallel is the key to getting maximum performance out of Vulkan. We hope this whirlwind tour of Vulkan synchronization has been helpful.

Now that you have a high-level understanding of how synchronization works, try browsing through the Synchronization and Cache Control section of the Vulkan Specification. Armed with your new knowledge, it should be much easier to approach. Happy coding and synchronizing!

About the Authors

Raphael Mun

Raphael Mun is a tech entrepreneur and educator who has been developing software professionally for over 20 years. He currently runs Lemmino, Inc and teaches and entertains through his Instafluff livestreams on Twitch building open source projects with his community.

John Zulauf

John Zulauf is a Senior Graphics Software Engineer at LunarG with 30 years of graphics experience across numerous platforms, from kernel drivers to application development.

Jeremy Gebben

Jeremy Gebben is a Senior Graphics Software Engineer at LunarG with 25 years of experience working on drivers for GPUs, high speed networking devices, and custom embedded hardware.

Jan-Harald Fredriksen

Jan-Harald Fredriksen is a Fellow at Arm with 16 years of experience working on GPU drivers, technology, and API standards.
