Over the past decade, the use of accelerator architectures, and GPUs in particular, in high performance computing (HPC) has skyrocketed. In the TOP500 list of supercomputers from June 2010, only three of the top 50 systems used accelerator architectures. In the June 2020 list, that number had increased to 27. In addition to the largest supercomputers in the world embracing the performance and efficiency advantages of accelerators for many data-parallel workloads, smaller GPU clusters are also seeing increasing use in research and production.
Software is one of the largest challenges to this development. HPC programs often outlive the hardware they run on, which means that in addition to initial development, ease of maintenance and porting are significant concerns.
There are three main ways to write software for an accelerator cluster today:
- A proprietary software stack might offer good vertically integrated support but locks your (long-lived) software into one given ecosystem.
- A domain-specific language or library works well if one actually exists for your domain, your feature requirements, and the hardware you are targeting, but that is not always the case.
- Standards such as Khronos® OpenCL™ or SYCL™ can be used to program the accelerators, typically combined with the de facto standard for cluster communication, the Message Passing Interface (MPI).
The third choice is the most viable if you are targeting a unique domain or functionality and want to maintain cross-vendor portability. However, from a programmability perspective, this approach may present difficulties for non-experts. When using OpenCL and MPI, you are intentionally working at a very low level of abstraction for both accelerator programming and cluster data transfer and synchronization. SYCL improves on the accelerator programmability aspect, but combining it with MPI introduces a mismatch between the high level of abstraction enabled by SYCL and the low-level responsibilities that come with developing an efficient MPI program.
This is where Celerity comes in. It is an open source project that focuses on scaling applications to a cluster of accelerators without requiring you to be an expert in distributed memory programming. In fact, the Celerity API does not make it apparent that a program is running on many nodes at all: there is no notion of MPI ranks or process IDs, and partitioning of work and data is taken care of transparently behind the scenes.
Celerity is built on top of SYCL, whose API hits a sweet spot between power and ease of use and is therefore a natural starting point. From that base, we set out to find the minimal set of extensions required to bring the SYCL API to distributed memory clusters, which makes it relatively easy to migrate an existing SYCL application to Celerity.
The code snippet above adds two matrices and writes the result to a third; while not particularly exciting in terms of semantics, it should look quite familiar to anyone who has worked with SYCL before. In fact, the only truly notable difference beyond namespaces and the use of a distributed queue instead of a SYCL queue is the addition of a range mapper when constructing buffer accessors.
The range mapper tells the Celerity runtime system which data is required for which range of work items the parallel kernel is being executed on. In principle it can be any C++ functor but, for ease of use, Celerity includes several common cases: the simple one_to_one mapping shown in the example, as well as slices, neighborhoods, and fixed ranges. Examples for the latter three options are shown above, with the outlined rectangles representing execution ranges and the colored boxes representing their associated data ranges.
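To make the idea concrete, here is a minimal sketch in plain C++ of what a two-dimensional range mapper computes. It does not use the Celerity headers: the types `range2` and `box2` and the four functions are hypothetical stand-ins named after the cases mentioned above, and the actual Celerity signatures may well differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration only, not Celerity types.
struct range2 { std::size_t rows, cols; };
// A rectangular region: position of its first element plus its extent.
struct box2 { std::size_t row0, col0, rows, cols; };

// one_to_one: a chunk of work items needs exactly the matching data elements.
inline box2 one_to_one(const box2& chunk, const range2& /*buffer*/) {
    return chunk;
}

// slice: extend the chunk to the full buffer along one dimension
// (dim 0 = all rows, dim 1 = all columns).
inline box2 slice(const box2& chunk, const range2& buffer, int dim) {
    assert(dim == 0 || dim == 1);
    box2 result = chunk;
    if (dim == 0) { result.row0 = 0; result.rows = buffer.rows; }
    else          { result.col0 = 0; result.cols = buffer.cols; }
    return result;
}

// neighborhood: grow the chunk by a border in each dimension,
// clamped to the buffer bounds (typical for stencil kernels).
inline box2 neighborhood(const box2& chunk, const range2& buffer,
                         std::size_t border_rows, std::size_t border_cols) {
    const std::size_t r0 = chunk.row0 > border_rows ? chunk.row0 - border_rows : 0;
    const std::size_t c0 = chunk.col0 > border_cols ? chunk.col0 - border_cols : 0;
    const std::size_t r1 = std::min(buffer.rows, chunk.row0 + chunk.rows + border_rows);
    const std::size_t c1 = std::min(buffer.cols, chunk.col0 + chunk.cols + border_cols);
    return {r0, c0, r1 - r0, c1 - c0};
}

// fixed: every chunk needs the same region, no matter which work items it covers.
inline box2 fixed(const box2& /*chunk*/, const box2& region) {
    return region;
}
```

For instance, if a node is assigned rows 2 and 3 of an 8x8 execution range, `neighborhood` with a border of one row yields rows 1 through 4, which is the data a runtime would have to make available on that node before the kernel can run there.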
Under the hood, the Celerity system asynchronously builds a task graph describing the relationships between kernel invocations. This task graph is later refined and extended into a command graph, which splits kernel execution across the accelerators in the cluster and ensures that the input data each of them requires is available on the correct node in the distributed memory system when needed.
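A heavily simplified sketch of both steps follows, assuming tasks are submitted in order and each one declares which buffers it reads and writes. All names (`task`, `build_task_graph`, `chunk`, `split_range`) are hypothetical, and the real runtime is considerably more involved: it works asynchronously and tracks regions within buffers rather than whole buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical model: one task per kernel submission, with its buffer accesses.
struct task {
    std::set<int> reads;   // ids of buffers the kernel reads
    std::set<int> writes;  // ids of buffers the kernel writes
};

using edge = std::pair<std::size_t, std::size_t>; // (producer, consumer)

// Step 1: derive the task graph. Only true (read-after-write) dependencies
// are tracked in this sketch.
std::set<edge> build_task_graph(const std::vector<task>& tasks) {
    std::set<edge> edges;
    std::map<int, std::size_t> last_writer; // buffer id -> most recent writer
    for (std::size_t t = 0; t < tasks.size(); ++t) {
        for (int buf : tasks[t].reads) {
            auto it = last_writer.find(buf);
            if (it != last_writer.end()) edges.insert({it->second, t});
        }
        for (int buf : tasks[t].writes) last_writer[buf] = t;
    }
    return edges;
}

// A contiguous part [begin, end) of a one-dimensional execution range.
struct chunk { std::size_t begin, end; };

// Step 2: split a kernel's execution range into one chunk per node.
// Chunk sizes differ by at most one work item.
std::vector<chunk> split_range(std::size_t size, std::size_t num_nodes) {
    assert(num_nodes > 0);
    std::vector<chunk> chunks;
    const std::size_t base = size / num_nodes;
    const std::size_t extra = size % num_nodes;
    std::size_t begin = 0;
    for (std::size_t n = 0; n < num_nodes; ++n) {
        const std::size_t len = base + (n < extra ? 1 : 0);
        chunks.push_back({begin, begin + len});
        begin += len;
    }
    return chunks;
}
```

In this model, two kernels that initialize buffers 0 and 1 followed by a kernel that reads both produce two edges into the third task. The command graph would then hold one command per chunk returned by `split_range`, plus whatever data transfers the range mappers imply.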
Of course, in practice, making the Celerity approach efficient is challenging, particularly on large clusters. To realize the comfortable developer experience Celerity aims for, we need to:
- track partial data dependencies and deduce required transfers across buffers and access patterns,
- build efficient task and command graphs for arbitrary programs and cluster sizes at runtime,
- make sure that all communication and synchronization is kept to a minimum, and
- hide latency as well as possible throughout the execution.
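The first of these requirements can be illustrated in one dimension. Assuming the runtime remembers which node last wrote each part of a buffer, the transfers needed before a node can run its chunk are the overlaps between the region it requests and the regions held by other nodes. The sketch below is plain C++ with hypothetical names (`interval`, `owned_region`, `required_transfers`); it is not Celerity's internal data structure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open element interval [begin, end) of a one-dimensional buffer.
struct interval { std::size_t begin, end; };

// Hypothetical bookkeeping: `node` holds the up-to-date data for `where`.
struct owned_region { interval where; int node; };

// A data movement that must complete before the reading kernel starts.
struct transfer { interval what; int from, to; };

// Computes which parts of `needed` must be sent to `reader` by other nodes.
std::vector<transfer> required_transfers(const std::vector<owned_region>& owners,
                                         interval needed, int reader) {
    assert(needed.begin <= needed.end);
    std::vector<transfer> result;
    for (const auto& o : owners) {
        if (o.node == reader) continue; // data is already local
        const std::size_t b = std::max(o.where.begin, needed.begin);
        const std::size_t e = std::min(o.where.end, needed.end);
        if (b < e) result.push_back({{b, e}, o.node, reader}); // non-empty overlap
    }
    return result;
}
```

With a 100-element buffer whose halves were last written by nodes 0 and 1, a stencil step in which node 1 requests elements 49 to 99 yields a single one-element transfer from node 0. Keeping transfers limited to such overlaps is what keeps communication small.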
While there is still plenty of room for improvement in all of these aspects, Celerity is currently deployed on several smaller clusters as well as the Marconi-100 supercomputer. It has been tested with multiple SYCL implementations such as ComputeCpp and hipSYCL, and future releases will support Intel DPC++/oneAPI as well as other implementations from the growing SYCL ecosystem. If you are interested in giving it a try, check out the tutorial or follow development on GitHub.