Vulkan for Cloud-based Transient Compute
In the world of simulation, we are accustomed to dealing with both extremely large datasets and very long compute times. Even with modern GPU acceleration and large amounts of memory, the domain resolution required to accurately simulate even a subset of real-world physics can result in compute times that run into days or even weeks, and in datasets that are many tens of gigabytes in size. With datasets this large, it can be difficult to distill them into something from which you can derive valuable insights. Keeping them in the cloud allows us to use scalable cloud resources to process the data. This has become a more pressing issue as the simulation capabilities of Autodesk Fusion 360 have expanded.
At Autodesk, we rely heavily on cloud computing to make all of this happen. For simulation post-processing (the processing of the data that results from a simulation), we have been leveraging GPU resources in the cloud to generate visualizations of simulation data. Clients need to be able to inspect the data by generating visualizations on the fly, which requires an architecture that can handle transient compute processes that don’t simply reserve cloud computing capacity for the duration of a user session.
Vulkan has brought us flexible compute and graphics together with a vendor-agnostic implementation in a way no other GPU API has before. We had faced limitations in the capabilities of OpenGL on some platforms, and an explicit API gives us much more control over how the GPU operates. While our various solvers use a variety of technologies to perform their computations on both the CPU and GPU, we decided to unify on a portable, modern, explicit GPU API for our post-processing system. Given that the results of the compute are targeted at interactive visualization, we wanted a solution that could be enabled for graphics as well, so Vulkan was the only viable choice.
So What Are We Actually Doing?
Depending on the simulation domain, we can be dealing with a variety of topologies, from uniform voxel domains to tetrahedral meshes to octree structures. The resulting datasets are often time series with hundreds of discrete timesteps over a number of physics types, so the output can be many thousands of individual datasets. Vulkan compute has enabled us to perform ‘operations’ on this data to generate iso-surfaces, iso-volumes, volumetric slices, flow lines, time-series probes and more from a variety of different datasets on demand, and to pass them to the client for rendering in a native or browser-based renderer.
As an example, a client can request an iso-surface to be generated for a given dataset at a specific value. The server is responsible for fetching the dataset, performing the iso-surface operation, mapping the requisite data to the geometry and passing the results to the client for rendering. The client and server may be on the same machine, they may even be the same application, or they may be running on different machines on opposite sides of the world.
Each of these operations is authored as a GLSL compute shader, compiled to SPIR-V and predominantly executed by our Vulkan runtime inside a Docker container in the cloud. The flexibility of Vulkan also means our runtime can execute client-side on just about any kind of device. This enables us to decide whether to use client-side or server-side compute depending on the dataset size, compute capability and available network bandwidth.
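To make the client-versus-server trade-off concrete, here is a minimal sketch of how such a decision could be modelled. It is not our production logic: the Request fields, the throughput figures and the simple "transfer time plus compute time" cost model are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical description of one operation request (all fields are assumptions)."""
    dataset_bytes: int         # size of the raw simulation dataset
    result_bytes: int          # estimated size of the generated geometry
    bandwidth_bytes_s: float   # usable client <-> server bandwidth
    client_throughput: float   # dataset bytes the client can process per second
    server_throughput: float   # dataset bytes the server can process per second
    dataset_on_client: bool = False  # True if the client already holds the data


def estimate_seconds(req: Request) -> dict:
    """Estimate wall-clock time for each execution site under a naive cost model."""
    # Client-side: the dataset must reach the client first, unless it is already there.
    transfer_in = 0.0 if req.dataset_on_client else req.dataset_bytes / req.bandwidth_bytes_s
    client = transfer_in + req.dataset_bytes / req.client_throughput
    # Server-side: compute next to the data, then ship only the result.
    server = req.dataset_bytes / req.server_throughput + req.result_bytes / req.bandwidth_bytes_s
    return {"client": client, "server": server}


def choose_site(req: Request) -> str:
    """Return the site with the lower estimated time ("client" wins an exact tie)."""
    estimates = estimate_seconds(req)
    return min(estimates, key=estimates.get)
```

Under this model a multi-gigabyte dataset behind a slow link favours the server, because only the much smaller result crosses the network, while a small dataset already resident on the client favours local execution.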
While our primary use case is executing our Vulkan code headless in the cloud, we can also pass the generated buffers directly to a Vulkan-based renderer, which was necessary for debugging and profiling work. Primary development was initially done on macOS thanks to MoltenVK, but it became necessary to use Linux (or Windows) systems with Nvidia GPUs to gain access to the Nsight debugging tools. The caveat was that Nsight requires a swapchain to debug Vulkan compute, so this meant connecting compute to rendering. That served as an effective visual debugging tool as well as allowing us to inspect state within Nsight. More recently, we have replicated this developer experience with AMD GPUs in RenderDoc.
The performance goal of these operations is to be as close to real time as possible. For example, a request to animate an iso-surface across the range of a dataset on a tetrahedral mesh with tens of millions of elements should run at 60+ fps. This meant that, for debugging purposes, it wasn’t too burdensome to always execute the compute pipeline and use its output as the input to the render pipeline on a per-frame basis, and then simply take an arbitrary frame capture to inspect the state of the compute pipeline.
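To put that target in numbers, the following back-of-the-envelope sketch computes the time budget per frame and the element throughput it implies. The figure of 20 million elements is an assumption standing in for "tens of millions", and the helper names are purely illustrative.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def required_elements_per_second(element_count: int, fps: int) -> int:
    """Elements that must be processed each second if every frame visits every element."""
    return element_count * fps


# Assumed mesh size for illustration only.
ELEMENTS = 20_000_000
TARGET_FPS = 60

budget = frame_budget_ms(TARGET_FPS)                              # about 16.7 ms
throughput = required_elements_per_second(ELEMENTS, TARGET_FPS)   # 1.2 billion per second
```

At that rate the compute and render passes together have less than 17 ms per frame, which is under a nanosecond per element on average.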
For development, we have tried to be as platform-agnostic as possible, which Vulkan enabled us to do better than any other GPU API option. We use the prebuilt Vulkan SDK provided by LunarG, which provides us with the glslang compiler and the Vulkan headers (except where we are using M1 Macs that don’t have a precompiled SDK available, in which case we build the required components manually). We also aim to be compiler-independent and support recent versions of Clang/LLVM, GCC and MSVC. MoltenVK has been hugely valuable to us by providing the ability to execute our Vulkan code on Apple platforms without any discernible performance penalty versus writing a native Metal implementation and, more importantly, without requiring any modification of the code itself.
This portability gives us options for execution on x86 and ARM CPU architectures; Intel, AMD, Nvidia and Apple GPU architectures; and Windows, Linux and macOS operating systems.
In our use case, the VkPipelineLayout is largely the same across many operations, which allows us, for the most part, to share the pipeline objects between them and avoid some VkPipeline creation costs. The remaining overhead has not been large enough for us to need a VkPipelineCache thus far. We currently target a known configuration on our cloud offerings, which gives us very predictable performance and a concrete optimization target when executing there.
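The general idea of reusing an expensive object across operations that share a layout can be sketched without any Vulkan calls. In the sketch below, LayoutCache, its key format and the create callback are all hypothetical; in a real runtime the callback would wrap whatever object creation is being amortised.

```python
from typing import Callable, Dict, Iterable, Tuple

# A layout is described here by (binding index, descriptor type name) pairs.
# This key format is an assumption made for illustration.
Binding = Tuple[int, str]
LayoutKey = Tuple[Binding, ...]


class LayoutCache:
    """Create one object per distinct layout description and reuse it afterwards."""

    def __init__(self, create: Callable[[LayoutKey], object]) -> None:
        self._create = create
        self._objects: Dict[LayoutKey, object] = {}
        self.creations = 0  # how many times the expensive create call actually ran

    def get(self, bindings: Iterable[Binding]) -> object:
        # Sort so that the same bindings listed in a different order share one entry.
        key: LayoutKey = tuple(sorted(bindings))
        if key not in self._objects:
            self._objects[key] = self._create(key)
            self.creations += 1
        return self._objects[key]
```

Two operations that both declare storage buffers at bindings 0 and 1 would receive the same object, so the creation cost is paid once rather than per operation.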
Keeping data as close to the processor as possible (in device memory, for a non-unified memory architecture) is essential for performance when you are doing on-demand processing. So, while we have all our data in place and our pipeline bound, we often execute appropriate variations of the input parameters and cache the results. This means we can serve pre-computed results, which works well for tasks like scrubbing over time-series data at a known resolution but isn’t viable for things like arbitrary slice planes. In the latter case, we try to keep the data in memory for as long as possible and only shift it out when memory pressure becomes too high and the data has not been touched for some time. It is flushed first from GPU memory, then from system memory and eventually from local disk storage.
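The eviction order described above can be illustrated with a small sketch. It is a simplified model under stated assumptions, not our implementation: the Tier and Entry types, the capacity figure and the idle threshold are all hypothetical, and a real system would also have to move the bytes between tiers.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Tier(IntEnum):
    """Residency tiers, ordered from closest to the processor to furthest away."""
    GPU = 0
    SYSTEM = 1
    DISK = 2
    EVICTED = 3  # no longer held locally; must be fetched again


@dataclass
class Entry:
    name: str
    size: int            # bytes occupied in the current tier
    tier: Tier
    last_touched: float  # timestamp of the most recent access, in seconds


def demote_idle(entries: List[Entry], tier: Tier, capacity: int,
                now: float, min_idle_s: float) -> List[str]:
    """Demote idle entries out of `tier`, oldest first, until usage fits `capacity`.

    An entry is only demoted if it has been untouched for at least `min_idle_s`.
    Returns the names of the entries that were moved one tier down.
    """
    if tier is Tier.EVICTED:
        return []  # nothing lies below the final tier
    resident = sorted((e for e in entries if e.tier == tier),
                      key=lambda e: e.last_touched)
    used = sum(e.size for e in resident)
    demoted = []
    for entry in resident:
        if used <= capacity:
            break  # memory pressure has been relieved
        if now - entry.last_touched < min_idle_s:
            break  # remaining entries were touched even more recently
        entry.tier = Tier(entry.tier + 1)
        used -= entry.size
        demoted.append(entry.name)
    return demoted
```

Calling demote_idle for Tier.GPU, then Tier.SYSTEM, then Tier.DISK reproduces the cascade from GPU memory to system memory to local disk, and recently touched data stays put even under pressure.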
Flexibility Of Execution
The goal of using Vulkan, a GPU API, was to accelerate our simulation post-processing using the GPU. What we found was that, in some instances where we had relatively tiny datasets, we could get comparable performance by executing our operations on the CPU, which is of course a much cheaper option in the cloud.
Our original approach was to leverage the Implicit SPMD Program Compiler (ISPC) backend that Intel created for SPIRV-Cross. That allowed us to take our compiled SPIR-V compute shaders and generate ISPC code, which we could then compile and execute on the CPU. The downside is that, while we still only had to author our compute operations once, we needed a separate runtime environment to execute them, and we faced limitations like the inability to use graphics features such as texture samplers.
Google’s SwiftShader is an implementation of Vulkan that runs on the CPU. While it is not as targeted at SPMD optimization as ISPC, it does expose the graphics elements of Vulkan, so it gave us the flexibility to use texture samplers without writing our own implementation and allowed us to run our code completely unmodified on the CPU or the GPU. We did find some cases where certain operations were significantly slower than their ISPC counterparts on the CPU, but we decided the reduced complexity was worth the cost and simply relegated such operations to the GPU exclusively.
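A routing rule of this kind could look like the sketch below. The operation name in the GPU-only set and the element-count threshold are invented for illustration; the real values would come from profiling each operation on the target hardware.

```python
# Hypothetical set of operations that profiled too slowly on the CPU.
GPU_ONLY_OPERATIONS = frozenset({"flow_lines"})

# Hypothetical size below which a dataset counts as "relatively tiny".
CPU_ELEMENT_LIMIT = 250_000


def choose_device(operation: str, element_count: int,
                  cpu_element_limit: int = CPU_ELEMENT_LIMIT,
                  gpu_only: frozenset = GPU_ONLY_OPERATIONS) -> str:
    """Return "cpu" or "gpu" for one operation request."""
    if operation in gpu_only:
        return "gpu"  # relegated to the GPU regardless of dataset size
    return "cpu" if element_count <= cpu_element_limit else "gpu"
```

Because the same SPIR-V runs unmodified on either device, a rule like this only has to pick a Vulkan device; the operation itself does not change.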
Heterogeneous Compute APIs
Given that our target is to generate renderables, being so closely tied to an API that supports graphics felt advantageous, though interop mechanisms in modern APIs like SYCL can achieve this to some degree. SYCL remains an enticing prospect, but access to GPU acceleration on Apple platforms, for example, is something Vulkan has afforded us out of the box, thanks in no small part to the Vulkan Portability Initiative.
Vulkan has solved so many of our issues around GPU acceleration, in particular the platform portability issues. The developer experience has improved thanks to the variety of tools and platform support, and we’ve only just scratched the surface in terms of potential optimization.