Reducing Draw Time Hitching with VK_EXT_graphics_pipeline_library
Khronos has introduced a new extension named VK_EXT_graphics_pipeline_library that allows shaders to be compiled much earlier than full Pipeline State Object (PSO) creation time. By leveraging this extension, I was able to avoid many of the frame hitches caused by PSOs being created late, at draw time, in the Source 2 Vulkan renderer. The extension spec was released today and SDK support will follow soon; you can track the release status at https://github.com/KhronosGroup/Vulkan-Docs/issues/1808.
The Source 2 engine was fairly heavily designed around the Direct3D11 model where shaders are created independently and state objects are provided at draw time. As such, there is a significant amount of information our engine does not know at the time the shaders are provided to our rendering abstraction: the pairing of shaders across stages, vertex formats, framebuffer formats, depth/stencil state, viewport information, MSAA state, and several others. This has meant that we delay creation of PSOs until draw time, which can lead to hitches, particularly with a cold pipeline cache.
Before going into detail on how we integrated VK_EXT_graphics_pipeline_library into our engine, I want to give a couple of caveats about this extension. First, there is a very good reason that Vulkan was designed so that shader compilation work happens at PSO creation time with a full view of all of the required state. While Direct3D11 drivers provide the illusion of compiling entirely from just the shader bytecode, the truth is that there are massive heroics happening inside the drivers to make this so. Drivers are often doing background compilation on multiple threads, and in fact Direct3D11 cannot guarantee that shader compilation doesn’t happen at draw time. In practice though, GPU vendors have gotten exceptionally good at these heroics, and the typical user experience on a Direct3D11 driver leads to significantly less hitching than our Vulkan renderer without fully prewarmed pipeline caches. However, Vulkan applications that are able to know all the shader and pipeline state ahead of time are guaranteed to avoid hitching because the shader compilation work will have entirely occurred at PSO creation time, whereas Direct3D11 offers no such guarantee.
A second caveat is that if you are designing a new engine for Vulkan, you should really consider whether having large numbers of shader permutations is a good idea. Some games, such as DOOM 2016 and DOOM Eternal, have kept to a very small number of PSOs. Describing this design space in detail is beyond the scope of this blog post, but I highly recommend reading this two-part blog series that explains why many engines have large numbers of shader permutations (which is one of the root causes of many draw time compile hitches): The Shader Permutation Problem: How Did We Get Here?
With all of that said, Khronos has heard from many developers (including us) that it simply is not possible in some scenarios to know the entire PSO state up front. This in part has led to the creation of several new extensions (core in Vulkan 1.3) that allow much more PSO state to be dynamic. VK_EXT_graphics_pipeline_library goes a step further, allowing for shaders to be fully lowered to machine instructions long before draw time. With this extension, Direct3D11-style engines such as ours have a way to provide a comparable (or even better!) experience than on Direct3D11 with respect to shader compilation. In the following sections, I will provide an overview of VK_EXT_graphics_pipeline_library and detail the process of integrating the extension into the Source 2 engine.
Graphics Pipeline Library Overview
For those looking for a detailed overview of the VK_EXT_graphics_pipeline_library extension, I would strongly encourage you to check out the proposal document. In brief, the extension breaks the PSO into four individual pipeline stages instead of one monolithic pipeline:
- Vertex Input Interface
- Pre-Rasterization Shaders
- Fragment Shader
- Fragment Output Interface
The Vertex Input Interface contains the information that would normally be provided to the full pipeline state object by VkPipelineVertexInputStateCreateInfo and VkPipelineInputAssemblyStateCreateInfo. For our engine, this information is not known until draw time, so a pipeline for this stage is still hashed and created at draw time. However, this stage has no shader code, so the driver can create it quickly, and there is also a fairly small number of these objects.
The Pre-Rasterization Shaders contain the vertex, tessellation, and geometry shader stages along with the state associated with VkPipelineViewportStateCreateInfo, VkPipelineRasterizationStateCreateInfo, VkPipelineTessellationStateCreateInfo, and VkRenderPass (or dynamic rendering). This may sound like more information than your engine would know at shader creation time; it definitely was for us. However, the key is that by combining pipeline libraries with dynamic state extensions, the only information you actually need to create the pre-rasterization shader is the SPIR-V code and pipeline layout. This is discussed in more detail below.
The fragment shader stage contains the fragment shader along with the state in VkPipelineDepthStencilStateCreateInfo and VkRenderPass (or dynamic rendering - although in that case only the viewMask is required). Much like with the pre-rasterization stage, if combined with dynamic rendering you can create the fragment shader pipeline with only the SPIR-V and the pipeline layout. This allows the driver to do the heavy lifting of lowering to hardware instructions for the pre-rasterization and fragment shaders with very little information.
Finally, there is the Fragment Output Interface, which contains the VkPipelineColorBlendStateCreateInfo, VkPipelineMultisampleStateCreateInfo, and VkRenderPass (or dynamic rendering). Like with the Vertex Input Interface, this stage requires information that we don’t know until draw time, so this state is also hashed and the Fragment Output Interface pipeline is created at draw time. It is expected to be very quick to create and also relatively small in number.
With all four individual pipeline library stages created, an application can perform a final link to a full pipeline. This final link is expected to be extremely fast - the driver will have done the shader compilation for the individual stages and thus the link can be performed at draw time at a reasonable cost. This is where the big benefit of the extension comes in: we’ve pre-created all of our pre-rasterization and fragment shaders, hashed the small number of vertex input/fragment output interfaces, and can on-demand create a fast linked pipeline library at draw time, thus avoiding a dreaded hitch.
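To make the four-part split concrete, here is a minimal sketch of the bookkeeping an application might do before attempting the final link. The bit values mirror VkGraphicsPipelineLibraryFlagBitsEXT, but `LibraryPart`, `PipelineLibrary`, and `canLinkCompletePipeline` are hypothetical names for illustration and are not part of Vulkan. It also assumes the common case where rasterization is enabled and all four parts are required.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bit values mirror VkGraphicsPipelineLibraryFlagBitsEXT; the names below are
// hypothetical stand-ins so the sketch compiles without the Vulkan headers.
enum LibraryPart : std::uint32_t {
    kVertexInputInterface    = 0x1,
    kPreRasterizationShaders = 0x2,
    kFragmentShader          = 0x4,
    kFragmentOutputInterface = 0x8,
};

constexpr std::uint32_t kAllParts = kVertexInputInterface | kPreRasterizationShaders |
                                    kFragmentShader | kFragmentOutputInterface;

// Hypothetical record for one pipeline library the engine has created.
struct PipelineLibrary {
    std::uint32_t parts;  // which of the four parts this library provides
};

// Returns true when the libraries together provide every part exactly once,
// which this sketch treats as the precondition for the final link.
inline bool canLinkCompletePipeline(const std::vector<PipelineLibrary>& libs) {
    std::uint32_t seen = 0;
    for (const PipelineLibrary& lib : libs) {
        assert(lib.parts != 0 && (lib.parts & ~kAllParts) == 0);
        if (seen & lib.parts) return false;  // a part was supplied twice
        seen |= lib.parts;
    }
    return seen == kAllParts;
}
```

A single library may provide more than one part, which is why the check works on bitmasks rather than counting libraries.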
Early Shader Compilation with Pipeline Libraries
In our engine, shaders are provided to our rendering abstraction layer during load time of our materials (which happens during startup or loading screens). In Direct3D11, these directly lead to calling ID3D11Device::Create*Shader methods. In Vulkan, prior to VK_EXT_graphics_pipeline_library, the only thing we could really do at this time was vkCreateShaderModule. This hands the SPIR-V to the driver, but does not actually trigger any significant shader compilation since the Vulkan driver needs to do it at PSO creation time when all shader stages, descriptor set layouts, and required state are known.
As such, our Vulkan renderer keeps a hash of state and will create the full pipeline at draw time when all of that state is finally known. With VK_EXT_graphics_pipeline_library, we can now compile shaders at the same time we would in Direct3D11. In the following sections I’ll describe the changes that were needed to make this possible.
While the use of VK_EXT_graphics_pipeline_library does not require that applications use dynamic state, practically speaking for our engine the two are inextricably linked. Without using dynamic state, it would not be possible for us to create pipeline libraries for the pre-rasterization and fragment shader stages at material load time. I’ll note now that we only create pre-rasterization pipelines for vertex shaders and don’t bother with tessellation and geometry shaders. We don’t have many instances where we use geometry and tessellation shaders, so for the purposes of the rest of this article the pre-rasterization stage for us refers just to vertex shaders. If a pipeline uses tessellation or geometry shaders, we fall back to full PSO creation.
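The fallback rule described above can be sketched as a small decision function. This is only an illustration under assumed names: `ShaderStage`, `PipelinePath`, and `choosePipelinePath` are hypothetical and are neither Vulkan enums nor the engine's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stage mask describing which shader stages a draw uses.
enum ShaderStage : std::uint32_t {
    kVertex      = 0x01,
    kTessControl = 0x02,
    kTessEval    = 0x04,
    kGeometry    = 0x08,
    kFragment    = 0x10,
};

enum class PipelinePath { LinkedLibraries, FullPso };

// Pipelines that use tessellation or geometry shaders take the classic
// full-PSO path; everything else can be linked from pipeline libraries.
inline PipelinePath choosePipelinePath(std::uint32_t stages) {
    assert(stages & kVertex);  // this sketch assumes a vertex shader is always present
    const std::uint32_t notPrecompiled = kTessControl | kTessEval | kGeometry;
    return (stages & notPrecompiled) ? PipelinePath::FullPso
                                     : PipelinePath::LinkedLibraries;
}
```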
The specific dynamic state extensions that we require in our engine to be able to use VK_EXT_graphics_pipeline_library are as follows:
- VK_EXT_extended_dynamic_state
- VK_EXT_extended_dynamic_state2
- VK_KHR_dynamic_rendering
Thankfully, all three of these extensions are part of Vulkan 1.3 so they can be expected to be supported anywhere VK_EXT_graphics_pipeline_library is supported.
For the vertex shader (pre-rasterization pipeline library), the information in the following table needs to be dynamic in order for us to be able to create the pipeline library immediately. That is, we do not know the viewport, depth bias, cull mode, or render targets (or their formats) at the time we are provided the vertex shader, so by making all of this state dynamic we are able to create a pre-rasterization pipeline library with just the SPIR-V (and the pipeline layout, more on that later).
Pre-Rasterization Stage Dynamic State
|VkRenderPass||Dynamic (VK_NULL_HANDLE) with VK_KHR_dynamic_rendering|
As with the vertex shader, for the fragment shader there are many states we do not know at the time we load the SPIR-V. Specifically, we don’t know the depth/stencil and renderpass-related state so we make those dynamic as detailed in the next table.
Fragment Stage Dynamic State
|VkRenderPass||Dynamic (VK_NULL_HANDLE) with VK_KHR_dynamic_rendering|
With the dynamic states used in the previous section, the only other information we need in order to be able to create the vertex/fragment shader pipeline libraries is the pipeline layout. This would seem on the surface to be pretty straightforward information to gather. We know from shader reflection what descriptors are consumed in a shader, so we should be able to know the descriptor set layouts for each stage. This would be very simple if our vertex and fragment shaders were created together in a pair, but that’s not how it works (nor, I would imagine, how many Direct3D11-based engines work). Although our shaders are both contained in the same file, the combination of which vertex/fragment shader pair will be used together isn’t known until draw time. For example, in the depth-only pass, vertex shader A is paired with fragment shader A (for example, one that fetches a texture to perform alpha test). In the forward pass, vertex shader A will be paired with fragment shader B, which does full forward lighting. And there are actually many other scenarios where the precise combination isn’t known until draw time.
This poses a problem for VK_EXT_graphics_pipeline_library, which wants the full pipeline layout when we create either the pre-rasterization or fragment shader stage. We simply don’t have that information - we know the descriptors consumed by the stage we’re compiling, but not the other stage. Thankfully, VK_EXT_graphics_pipeline_library contains a flag that allows you to create a pipeline layout where each stage only needs the descriptor sets it consumes (VK_PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_BIT_EXT). As long as the descriptor set layouts match for any shared descriptor sets, we can avoid providing the other stage’s descriptor set layouts at the time we create the individual stage libraries.
One easy way to handle this would be if your engine uses different descriptor sets for each shader stage. You would simply provide a pipeline layout containing the per-stage descriptor sets to each library. Our engine, however, does not do this. The reason we don’t do it is partially because of performance (so that we can allocate/update/bind one descriptor set for the dynamic resources in the VS/FS instead of two) and partially because some Vulkan implementations still have a very small number of total descriptor sets supported (specifically, some mobile GPUs have a limit of 4).
The way our engine partitions descriptor sets is roughly as follows:
- Descriptor Set 0 - dynamic resources not bound until draw time for all stages
- Descriptor Set 1 - vertex shader static descriptors
- Descriptor Set 2 - fragment shader static descriptors
- Descriptor Set 3 - bindless descriptors (shared across stages)
So for graphics pipeline libraries, the way we create our pipeline layouts is as follows:
- Descriptor Set 0 - an “uber set” that contains all possible consumed dynamically bound resources that can be shared across the VS/FS
- Descriptor Set 1 - only provided to pre-rasterization library (vertex shader) if used
- Descriptor Set 2 - only provided to fragment shader library if used
- Descriptor Set 3 - provided to both stages if used
So in other words, the vertex shader library is created with a pipeline layout containing descriptor sets 0, 1, and 3. The fragment shader library is created with a pipeline layout containing descriptor sets 0, 2, and 3. We know that sets 1, 2, and 3 will have identical layouts across any stages that use them, and we also guarantee this for descriptor set 0 by making it an “uber set” that contains all possible consumed resources.
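The partitioning above can be modeled in a few lines. This is a sketch under stated assumptions rather than the engine's implementation: `SetLayoutId` stands in for a descriptor set layout handle, `kNoLayout` plays the role of a slot the stage does not provide, and all function and struct names are hypothetical. For brevity it always includes sets 1, 2, and 3 for the stages that own them, ignoring the "if used" condition.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a descriptor set layout handle.
using SetLayoutId = std::uint32_t;
constexpr SetLayoutId kNoLayout = 0;  // slot not provided to this stage
constexpr std::size_t kNumSets = 4;
using StageSets = std::array<SetLayoutId, kNumSets>;

// The four layouts from the partitioning scheme described in the text.
struct EngineSetLayouts {
    SetLayoutId uberDynamic;  // set 0: "uber set" of dynamically bound resources
    SetLayoutId vsStatic;     // set 1: vertex shader static descriptors
    SetLayoutId fsStatic;     // set 2: fragment shader static descriptors
    SetLayoutId bindless;     // set 3: bindless descriptors shared across stages
};

// Sets handed to the pre-rasterization (vertex shader) library: 0, 1, and 3.
inline StageSets vertexLibrarySets(const EngineSetLayouts& e) {
    assert(e.uberDynamic != kNoLayout);
    return StageSets{{e.uberDynamic, e.vsStatic, kNoLayout, e.bindless}};
}

// Sets handed to the fragment shader library: 0, 2, and 3.
inline StageSets fragmentLibrarySets(const EngineSetLayouts& e) {
    assert(e.uberDynamic != kNoLayout);
    return StageSets{{e.uberDynamic, kNoLayout, e.fsStatic, e.bindless}};
}

// Two per-stage layouts agree when every set present in both is identical.
inline bool setsAgreeWhereShared(const StageSets& a, const StageSets& b) {
    for (std::size_t i = 0; i < kNumSets; ++i) {
        if (a[i] != kNoLayout && b[i] != kNoLayout && a[i] != b[i]) return false;
    }
    return true;
}
```

The agreement check is the property the "uber set" buys: because set 0 is the same layout for every shader, any vertex library can later be paired with any fragment library.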
A final note here is that the pipeline layout needs to also contain the immutable samplers and push constants. For us, the push constant is a shared resource across stages so we have the information to apply it to both shader stages when creating the per-stage layouts (it already can’t differ across stages because of the way we use it). Likewise, the immutable sampler state is known ahead of time so we’re able to include those in the pipeline layout.
Vertex Input Interface and Fragment Output Interface
Using what I’ve described so far, we are now able to immediately compile our vertex and fragment shaders with just the SPIR-V and the pipeline layout. There are two more stages we need to build: the vertex input interface and fragment output interface. This information is still not known until draw time, so we hash the subset of state required for the vertex input interface and fragment output interface and create those libraries on demand. These stages should be small in quantity (I measured fewer than forty in a workload in our engine) and are also fast to create. Unlike the other stages, there is no shader code for the driver to compile.
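A hash-keyed cache of this kind might look roughly like the following. It is a sketch only: the fields of `FragmentOutputKey` are an invented subset of state chosen for illustration, `LibraryHandle` is a stand-in for a pipeline handle, and the creation callback represents whatever code actually creates the interface library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical subset of state that identifies one fragment output interface.
struct FragmentOutputKey {
    std::uint32_t colorFormat;  // stand-in for a format enum value
    std::uint32_t sampleCount;
    bool blendEnable;
    bool operator==(const FragmentOutputKey& o) const {
        return colorFormat == o.colorFormat && sampleCount == o.sampleCount &&
               blendEnable == o.blendEnable;
    }
};

struct FragmentOutputKeyHash {
    std::size_t operator()(const FragmentOutputKey& k) const {
        std::size_t h = 17;
        h = h * 31 + std::hash<std::uint32_t>()(k.colorFormat);
        h = h * 31 + std::hash<std::uint32_t>()(k.sampleCount);
        h = h * 31 + (k.blendEnable ? 1u : 0u);
        return h;
    }
};

using LibraryHandle = std::uint64_t;  // stand-in for a pipeline library handle

// Creates each distinct interface library once, at first use, then reuses it.
class InterfaceLibraryCache {
public:
    using CreateFn = std::function<LibraryHandle(const FragmentOutputKey&)>;
    explicit InterfaceLibraryCache(CreateFn create) : create_(std::move(create)) {}

    LibraryHandle getOrCreate(const FragmentOutputKey& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // cache hit: no driver work
        LibraryHandle lib = create_(key);           // cache miss: create at draw time
        cache_.emplace(key, lib);
        return lib;
    }

    std::size_t size() const { return cache_.size(); }

private:
    CreateFn create_;
    std::unordered_map<FragmentOutputKey, LibraryHandle, FragmentOutputKeyHash> cache_;
};
```

Because these libraries contain no shader code, a miss here is expected to be cheap, and with only a few dozen distinct keys the cache stays tiny.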
Final Linked Pipeline
With all four stages ready, we can create the final linked pipeline library just before drawing with a new material. There are some decisions to make when creating the final linked pipeline that potentially trade fast link times on the CPU for reduced GPU performance. That is, assuming you created the individual pipeline libraries with VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT, you can choose whether you want the driver to create the final linked pipeline with cross-stage optimization. It is expected that doing cross-stage optimization will increase the amount of CPU time with the benefit of improved GPU performance.
Our goal is to achieve no draw time hitching, so we initially create our linked libraries without the VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT set. Without that bit set, it is expected that creating the linked pipeline library will be very fast in the driver. It will be particularly fast on implementations that set VkPhysicalDeviceGraphicsPipelineLibraryPropertiesEXT.graphicsPipelineLibraryFastLinking (which is true for at least all of the desktop vendors - NVIDIA, AMD, and Intel). Even on implementations that don’t set graphicsPipelineLibraryFastLinking, it is expected that pipeline library linking will be significantly faster than a full PSO link.
After creating the fast-linked pipeline library without optimization, we kick off compilation of a pipeline library with VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT on a background thread and swap that in when it’s ready. In this way, we avoid the hitch when first creating the pipeline library but can gain back the full GPU performance once we have time to do the cross stage optimization in the background. This choice is entirely up to the application: some applications which are less sensitive to stuttering might choose to always create the cross-stage linked pipeline library. They still should expect significant CPU improvement over creating a full PSO since much of the compilation will have moved earlier.
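The swap from the fast-linked pipeline to the optimized one can be sketched with a single atomic slot. The names here (`LinkedPipelineSlot`, `publishOptimized`, `pipelineForDraw`) are hypothetical, `PipelineHandle` is a stand-in for a real pipeline handle, and the sketch deliberately leaves out the background job itself and the question of when the fast-linked pipeline can safely be destroyed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using PipelineHandle = std::uint64_t;  // stand-in for a pipeline handle
constexpr PipelineHandle kNullPipeline = 0;

// One slot per linked pipeline: the fast-linked version is usable right away,
// and the optimized version replaces it once a background job publishes it.
struct LinkedPipelineSlot {
    PipelineHandle fastLinked;
    std::atomic<PipelineHandle> optimized;

    explicit LinkedPipelineSlot(PipelineHandle fast)
        : fastLinked(fast), optimized(kNullPipeline) {
        assert(fast != kNullPipeline);
    }

    // Called from the background thread when the optimized link finishes.
    void publishOptimized(PipelineHandle p) {
        optimized.store(p, std::memory_order_release);
    }

    // Called from the render thread at draw time; never blocks.
    PipelineHandle pipelineForDraw() const {
        const PipelineHandle p = optimized.load(std::memory_order_acquire);
        return p != kNullPipeline ? p : fastLinked;
    }
};
```

The render thread only ever performs one atomic load, so picking up the optimized pipeline adds no stall to the draw path.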
VK_EXT_graphics_pipeline_library provides a way to avoid draw time hitching by compiling shaders earlier. While it comes with a set of tradeoffs, we believe that for some engines constrained by existing content/design it will be extremely helpful in reducing the main cause of pipeline hitches. The combination of graphics pipeline libraries and dynamic state provides increased flexibility that can allow engines to avoid delaying shader compilation until draw time. While it is still the recommendation that you aim for a design that creates full PSOs ahead of time, for applications unable to do so this extension will be very useful.
Thanks to all of the many Khronos members that have participated in creating this extension, especially Chris Glover @ Google for chairing the effort. I’d also particularly like to thank Tobias Hector @ AMD for championing this extension and Piers Daniell @ NVIDIA for providing early driver support. Thanks to Baldur Karlsson for providing early support in RenderDoc to help with development and Nathaniel Cesario @ LunarG for working on validation support. Also thanks to Mike Blumenkrantz @ Valve for providing early support in lavapipe.