Recently, Simon McIntosh-Smith talked with a group of OpenCL and SYCL subject-matter experts during an online panel discussion at the 8th International Workshop on OpenCL and SYCL (IWOCL) about the recent announcements of OpenCL 3.0 and the SYCL 2020 provisional release.
Here’s a recap of Simon McIntosh-Smith’s discussion with these experts:
- Alastair Murray, Principal Software Engineer, Compilers at Codeplay Software
- Ben Ashbaugh, Software Architect, Intel
- Dennis Adams, Director of Technology, Sony Creative Software
- Eric Berdahl, Senior Engineering Manager, Adobe
- Hal Finkel, Lead, Compiler Technology and Programming Languages, Argonne National Laboratory
- Jeremy Kemp, Senior Software Engineer, Imagination Technologies
- Ronan Keryell, Principal Software Engineer, Xilinx Research Labs
- Kevin Petit, Software Architect, Arm Technology
- Neil Trevett, President, The Khronos Group
- Michael Wong, Working Group Chair, The Khronos Group
Simon McIntosh-Smith: “Starting with OpenCL 3.0, what are the most important changes?”
“There are three things that really jump out at me about OpenCL 3 that I'm excited about:
- OpenCL 3.0 is not that big a step code-wise. We can update very easily from OpenCL 2.X to OpenCL 3.0, so there's not a lot of work needed to start using OpenCL 3.0 implementations and then start incrementally adopting features.
- One of the things that I really like about the spec itself is the fact it's really opening up for the possibility of layered implementations. The trend toward layers becoming more prevalent in the industry is one that we are watching very carefully at Adobe and that I have been subtly influencing for several years. I think it's a very important direction for the future of GPU runtimes and GPU programming models.
- In some ways it’s a minor thing, but as a user, it's one of the most exciting pieces: The OpenCL 3.0 specification is now a universal specification. There isn't just an OpenCL 3.0 spec; there is an OpenCL spec where I can go and find information about every version of OpenCL ever shipped. This makes my ability to find information about OpenCL significantly easier, and I've very much enjoyed using that specification.”
“Speaking as someone who, until recently, had only implemented OpenCL 1.2 because of the fact that the hardware we target can't support some 2.X features, it's pretty good to be able to offer the features we can support without having to try and emulate the features we can't.”
“We have a large base of existing OpenCL C and are looking for deployment flexibility on runtimes other than the OpenCL runtime, so the layered implementation approach opens up a lot of avenues for getting onto Vulkan runtimes or on top of Metal runtimes or other things like that.
And the new initiative by Microsoft of OpenCL on top of DX is very exciting. OpenCL C is a very expressive language for high-performance image processing, allowing us to optimize using shared local memory and workgroups and things like that. And we would hate to have to rewrite those into some flavor-of-the-day compute language every time there is a new platform out there; so, the layered approach, together with being able to retain our investment in over 300 kernels, is very important to us.”
“The OpenCL C kernel programming language hadn’t been updated since OpenCL 2.0, so the addition of OpenCL C 3.0 is a significant upgrade. We now have a much better and more complete kernel programming language in OpenCL 3.0.”
“Generalizations, which essentially give us independence from a single backend, are a big change. We're still deeply tied to OpenCL, but we can now also adopt a Vulkan, CUDA, or OpenMP backend, or whatever you can put your imagination to.
The other big thing that I urge people to look at is unified shared memory. This enables code with pointers in data structures to work naturally, without being translated into buffers or accessors, and it helps us to port from lower-level languages or from CUDA, for instance. I think that's a big, big improvement.
The other one is modules, not standard C++ modules - unfortunately, that name is heavily overloaded - SYCL modules essentially embed multiple objects into archives even for different backends; this massively reduces ambiguity in the spec.”
“In HPC, having some efficient reduction to take advantage of the real hardware features is very important if you're using accelerators in order to reach the maximum performance--so that's something which is coming. There are also specialization constants which are actually similar to something pushed by Hal Finkel in ISO C++ about JITing C++ templates.
The typical use case is that often, for example in machine learning or things like that, you have a huge amount of machine learning kernels, but the difference is just a few constant coefficients; if the coefficients are compiled or JIT just before run time, you can achieve more efficient execution and reduce the code size, so you can basically lower your carbon footprint! I am pretty excited by the efficiency we can get from that, for example.”
“From a user’s point of view, what things are useful?”
“There are a large number of features that we've been working on for a while, inspired by a lot of our experience working on other programming models, such as Kokkos and RAJA and standard C++ itself. I think that it's really going to improve the user experience.”
“Is there a timeline for the OpenCL 3.0 to Vulkan interop?”
“Broadly, we've been working on this for a while now. It is rounding into shape. I can't give a target date, but certainly I would hope that this will come out in 2020 and, hopefully, sooner in 2020 rather than later.
The high-level way of working with it is a lot like the external memory in Vulkan itself. The good news here is that this won't be specific to Vulkan; we should be able to extend this to support other APIs, as well. We've heard requests for interop with DX12, and that should be able to use a similar mechanism. We're solving Vulkan first, but we do think that this is a general mechanism that can work for other APIs, too.”
“Is there any widely available mobile device support, OpenCL or SYCL in the pipeline?”
“Yes, Imagination will be shipping an OpenCL 3.0 implementation. We expect to have that once the spec has been finalized and conformance is ready, but from that point on, we'll be shipping on mobile GPUs.”
“Qualcomm, and others in the mobile space, support OpenCL quite extensively. Quite a few of the mobile chip vendors ship OpenCL drivers. It's the low-level foundation for many stacks, particularly for applications like video and imaging and, increasingly, inferencing acceleration. Google has not put OpenCL into the native platform definition for Android, but this is where the layering kicks in again. Google has been instrumental in the clspv project, which layers OpenCL over Vulkan, and Google is now bringing OpenCL apps and libraries over into Android through this layered implementation.”
“Arm is committed to implementing OpenCL 3.0. There is a bit of a disconnect between what vendors support and what is widely accessible to developers. Most mobile vendors implement and ship OpenCL, but it is not an official API on Android, which makes it hard to access sometimes.”
“In one of our Adobe apps, we are using OpenCL C kernels in our Android application despite the fact that there's no OpenCL runtime. We use the clspv SPIR-V compiler and custom Vulkan host code, but all of our shaders are still the same OpenCL kernels that we run on desktop.”
“Would you be interested in a layered OpenCL over Metal to guarantee OpenCL would always be available on Apple?”
Seventy percent of the panel responded yes, and 30% responded no.
Panelists’ responses to the survey results:
“This seems like an obvious thing going forward, and, obviously, Apple has deprecated OpenCL. It is still available right now, but the layering is going to be a vital piece to ensure that OpenCL remains available where everyone wants to run their applications.
And Apple is a significant platform. This is encouraging, and, hopefully, we will be able to get this going as a project pretty soon and then encourage people to get involved.”
“Does OpenCL 3.0 support unified shared memory? What's the relationship between SYCL and oneAPI, which Intel has been driving?”
“We have an Intel extension for unified shared memory that we're proving out first before we would add this to the core specification or even as a standard extension. I think this is a great candidate for a future extension in 2020.
oneAPI is a broad initiative; it covers things like tools and libraries. It also covers what we're calling direct programming, which is where you would write code and kernels that would run on accelerators. The component that we call direct programming is Data Parallel C++.
It is a specification and an implementation that includes C++ and SYCL, and it also includes extensions like the USM extension. The short answer is that the real question is about the relationship between Data Parallel C++ and SYCL, and SYCL is an important component of Data Parallel C++.
oneAPI is absolutely not strictly for Intel. It's an industry initiative. Anybody is welcome to join and implement parts of oneAPI. We actually have support for other devices in our Data Parallel C++ compiler today; so, you can use Data Parallel C++ to compile for NVIDIA GPUs, and you can find out all about this on the oneAPI website, oneAPI.com.”
“Is there something specific you're doing to minimize performance degradation between using SYCL and directly using the backend?”
“Yes. From a theoretical point of view, I can see no real difference, and with modern C++, you can express even more detail if you want, because of our new extensions and lots of things available in C++. You should be able to write more efficient and more adaptable code, because it's easier to write modern C++ code that really takes advantage of value data types in a way that you cannot, for example, in OpenCL, because OpenCL is not single-source.”
“We don't want to get into any flame war between C and C++ performance. But I will say that SYCL is being modelled after C++, and C++ has traditionally been shown to have reduced a lot of the overheads that people have been concerned about.
Without more concrete data, nobody can really say for sure, but we have seen that in some cases with C++, and with the improvement in compiler optimizations over the years by many compiler groups, they have been able to reduce overheads wherever necessary so that it's now comparable to inline C code.”
“Can you comment on the performance of SYCL compared to other lower level things or the backends before we move on?”
“One of the things that we are working on at Argonne with our compiler work is taking advantage of single source program models in order to specifically enable optimizations inside of kernels based on data that's outside the kernel. That's something that you can do in a model like SYCL and can't do in a model like OpenCL, where the kernels are compiled separately.
So there actually are cases where you can get even better performance with a single source program model than you can with separate source program models.
I think one area where we have seen OpenCL performance superior to other program models is specific cases where you're able to take advantage of just-in-time compilation from OpenCL C code. You can dynamically compose code that does just what you want, as opposed to pre-compiling a more generic version.
I think that, over time, this is something that we'll see improve in other program models as well, whether it's specialization or other capabilities that will provide some of the same kinds of functionality. I also think that will be important in part because it'll provide that functionality in a way which is cleaner and more usable and maintainable than just constructing source code as strings inside your program.”
“At Codeplay, we implement both OpenCL and SYCL, and, much like other panelists have already said, it's very rare that SYCL provides any disadvantages. In fact, often, it's faster because you can write better code. The only real case where we may not end up clearly winning over OpenCL is if you're using a low-level hardware feature that is not yet enabled in SYCL.
But really, that's just a matter of time, and it's just a function of the fact you have to implement it in the lower level before you can do it in a higher level. It is about exposing the hardware features that provide advantages.”
“Yes, what is also very important with this new backend interface in SYCL 2020 is that you have a very strong interoperability layer with the underlying implementations. That means that if you're not happy with your SYCL implementation, you can still very easily use, for example, an OpenCL kernel from inside SYCL programs, so you don't lose anything here.”
“It's great that OpenCL and SYCL have these different characteristics. I think they play to the strength of their specific communities in that way.”
“Would it be possible to merge SYCL features to future C++ standards, and where do we see SYCL and C++ and oneAPI and Kokkos? Where are they all going? Are they going to merge? Is one informing the other? What's happening there?”
“I've watched this space very closely for three, four, five years now. The authors of all the frameworks you mentioned--Kokkos, RAJA, SYCL, and, I imagine, oneAPI--have stated that their wish is to push their features into ISO C++. And that has already happened. Things like std::span came from Kokkos. We're pushing affinity upwards into C++, and we've tried pushing other things in, though it's not going to be exactly the way it looks in the frameworks where those features originated.
And now, I would say I’ve given talks in the past pointing out that there's this quiet cooperation happening within C++, where a number of people who are interested in heterogeneous C++ are collaborating to add this into ISO C++ slowly. I think for a long time, frameworks like SYCL, Kokkos, and RAJA will lead C++, and will have to, because things take time to settle in C++. It's an ISO standard, so it moves like a battleship - it needs to bring 200 people, 500 companies, and 20 countries along with it. There will always be a space for forward-looking frameworks that are trying to take advantage of the latest hardware, the latest programming innovations, and models. And, by the same token, I think a lot of these frameworks are also downloading features from ISO C++.
For instance, we definitely intend to download features like futures and executors and coroutines and modules and C++ ranges into SYCL in the future.”
“I'll quickly add two things. A lot of the same people are involved in all of these spaces, so we have good participation from a lot of people who are on the SYCL working group and within standard C++ itself. The same thing is true of developers of Kokkos and RAJA and other similar frameworks.
So, there's really a way in which there's a lot of learning going on between the different implementations, but also a lot of direct cross-pollination because it's the same people who are involved in these different areas. One of the interesting questions going forward from the standard C++ side of things has to do with the way that we evolve the memory model and how we adopt features that are proven out by SYCL and Kokkos and other frameworks for dealing with different memory spaces and execution and data locality concerns into standard C++.
And right now, we are learning how to best do this in the context of these other frameworks, and those learnings are then going to be moved forward into the C++ standard.”
“Will we ever be able to use OpenCL or SYCL with other languages, such as Fortran or Python or OpenMP?”
“I would just say we would if we had more of those experts in the Khronos community, because you need the local experts there to be able to make a good interface. I will say that there's nothing that prevents us from working with OpenMP. This has been asked about a lot in terms of language framework interoperability.
Obviously, when you have two parallel programming languages going together, you're going to run into things like scheduling issues, overloading your scheduling space, and your stack space--and there's nothing that seems to be able to create a topmost overload governor on that. That’s just a standard thing that you would have when you're interfacing any multiple parallel programming languages together.”
“In all seriousness, I think there are a couple of Python bindings for OpenCL at least, so give it a try if you haven't.”
“I was just going to add that we are dealing with these interoperability issues on a regular basis. And my expectation is that as we explore solutions in this space, as we figure out the things that seem to work best, we'll definitely bring those to the relevant standards committee. But part of the issue here is that in many of these cases, it's not obvious that there's a right answer for what the interoperability semantic should be. A lot of this ends up being somewhat empirical based on looking at the applications and what they want to do in the various use cases in trying to fashion some set of solutions that practically work. I think as we get that experience, we'll be able to bring that back to the standards bodies.”
“Has anyone looked at Fortran with OpenCL or SYCL? Fortran support for SYCL and OpenCL: Is that possible? Is it getting easier?”
“We have experience mixing Fortran with a number of different other programming models. Now sometimes, this is done in this trivial sense, where you have some kernels that you're going to write in some other language and then you're going to call those from Fortran. And practically speaking, we have a number of users who do precisely that.
With the implementation of a Fortran front end for LLVM, one of the things that that allows us to do is to use the front end to directly generate LLVM IR, and perhaps SPIR-V code, from Fortran. Now, there are a lot of questions you have to answer about what it means when you call Fortran built-in functions and the like.
But in the context of OpenMP, for example, this is an active area. We intend to support OpenMP offload in Fortran, and many of the same challenges and questions that come up there will come up in adaptations of other programming models for Fortran.”
“I think, if there is such a thing as what Hal alluded to--a Fortran front end for LLVM--then it is entirely possible for us to pair it with SYCL in some future way. And we would welcome that work.
And I understand where this question is coming from because in a high-performance computing domain, with the research, a lot of kernels are still purely written in Fortran. And they might be accessed from the outside by C or C++ or something like that. And people are just not going to spend all the time tuning the kernels from Fortran to some other language.”
“And a general comment from the OpenCL side: We have the community resource page, and it always amazes me how much there is there. I've just gone and searched on the community resource page, and there are about a dozen hits on Fortran for various tools doing various things with Fortran.
There's actually quite a lot of depth in the OpenCL ecosystem, which you can find on the community resource page, and if you know stuff that is not there, you can do a pull request and add your own favorite tools to let the community know. I encourage people to do that.”
“Just in a really general sense, speaking to bridges and interop and things like that--you mentioned these Fortran routines were written 20 years ago and fully debugged. And nobody wants to have to put an intern on it for a summer to rewrite it in a different language because it works just fine in the language it's in, which speaks towards the investment even in OpenCL C kernels between, for example, Adobe and us.
They still apply to the image processing that we want to do on new platforms, so having a really clean bridge to new runtimes, like Vulkan and things like that, and even bridging into SYCL-- you can certainly write new things in the new expressive language, but you do need a way to bring along the old things, whether that's OpenCL C or Fortran or whatnot.”
“With the recent announcements for OpenCL 3.0 and for SYCL, is this going to make a difference with how much NVIDIA is going to be able to support, in terms of how much of OpenCL it can officially support? Will the changes mean that NVIDIA can officially support more of OpenCL, for example, and more of SPIR?”
“It will definitely help. NVIDIA has publicly stated that we are going to implement OpenCL 3.0. The fact that we can now begin to have much more flexibility in shipping the functionality that is relevant to our customers will help us implement more than we are currently able to do for our customers going forward.
Personally, I would love NVIDIA to support SPIR-V in OpenCL. I loved the results of that poll. I'm going to take that back to NVIDIA and see if that will help us support it. Of course, we already support SPIR-V for Vulkan. The gap is potentially not too big, but as always, it comes down to what our customers are asking for.
But the amount of goodness in the SPIR-V ecosystem that we have been talking about now is beginning to get pretty persuasive.”
“Is this going to make it easier for Arm to support more of OpenCL, and is there going to be any official support for SYCL on Arm in the future?”
“Arm already supports OpenCL 2.1 on Mali GPUs, including SPIR-V--we’ve already started shipping that.
On SYCL, it's an interesting technology that we're watching. We don't have anything to say publicly about whether we will support SYCL directly at this stage. But I should add that via SPIR-V, it is already possible to take a third party SYCL implementation and run SYCL applications on Mali GPUs.”
“What do the panelists think are the biggest missing features still for both OpenCL and SYCL? What kinds of things might come in OpenCL 3.1 or next-gen SYCL?”
“We've talked about the real positive experiences around SPIR-V and what it brings to the table. Having SPIR-V either required or at least very widely supported across all of the platforms that a given company wants to use would be very beneficial.”
“The phrase that I like to use is that SPIR-V should be the common shading language. And then I would like to see a renaissance of shader coding languages, all generating SPIR-V. And once the runtimes all start consuming SPIR-V, you don't have to care what I write in. And that's good for me--that lets me be more expressive.
The other thing I'd like to see is continuing the direction of having additional layers on top of the dominant GPU runtimes. The things that the GPU runtimes give us are ubiquity and stability and quality, but the things that the layers give us are programming expressiveness and workflow--things that are very difficult to do down at the very low levels.
So that is why I'd like to see things like OpenCL and SYCL continue as high-level programming interfaces interfacing down into lower-level GPU runtimes via this layering strategy and pattern.”
“Some of the things that we are definitely looking forward to are things like subgroups. We want it to simplify the accessors. Accessors are great, and they certainly serve a certain community, but there's a community that doesn’t care about that that much, especially if you are a runtime (below SYCL) and you want more explicit control.
We also have things like a better, more formalized way of handling extensions. Address spaces are another thing that we want to serve. And we've already talked about a few other things in that direction.”
“For OpenCL, obviously 3.0 is all about flexibility, and that's great, but what I hope we'll be doing is more in that direction of making OpenCL even better for a wider range of devices. The ones I care about are things like DSPs and AI accelerators, but any heterogeneous device.”
See the keynote slides from IWOCL/SYCLcon by Neil Trevett to see a list of extensions and future features in OpenCL.
“What are the biggest opportunities in terms of increased adoption for OpenCL and SYCL for the next few years? Where do you see the growth areas for OpenCL and SYCL over the next couple of years?”
“Mobile. Clearly, the place where OpenCL is least available right now is in mobile devices. And despite the excellent work that the vendors have done to make OpenCL SDKs available for their individual parts, none of the mobile ecosystems really support OpenCL. And there's a lot of things going on that have an opportunity to change that over the next one, two, and three years.”
“And I would echo that for SYCL. SYCL has made surprisingly large inroads in high-performance computing and ADAS--advanced driver-assistance systems for self-driving cars. We've made some inroads into FPGAs, AI processors, machine-learning processors, and custom processors. And as Eric said, when you asked that question about SYCL and mobile, I realized that we haven't really made much inroad in that direction, so we could certainly do better there.
The SYCL direction, of course, depends on how many people we have in the working group--towards things like more C++ features, more safety-critical features. That's an underpinning of Khronos, as well, with our safety council: to make the language safer to use in the various future applications that we anticipate rapidly coming. And I think OpenCL also has that same safety direction as well.”
“OpenCL 3.0 is a good foundation for that. If we push forward to a flexible profile, I think that is probably, numerically, OpenCL's biggest adoption opportunity: bringing billions of embedded processors into the OpenCL ecosystem. And, of course, reducing API surface area, which the flexibility will enable us to do, is vital for safety-critical work.
And it's interesting: In the embedded space, applications hopping from embedded system to embedded system isn't so important, so the flexibility makes a lot of sense. We need to use the profiles in the desktop space, where application portability is more important, so we can retain portability between different vendors. The profiling and the flexibility that OpenCL 3.0 gives us are going to enable us to reach into different markets in very tuned ways--optimal ways--for each of those markets in turn.”
“That extends from mobile into the edge. On the desktop, you have some great choices and open standards and things like that. But as you move closer to the edge, you basically have a vendor API that you have to deal with. Being able to get these open standards onto the edge would give us the same flexibility that we have on the desktop, everywhere.”
All of the materials for the IWOCL and SYCL conference are available on the IWOCL website, where you can find presentations, videos, papers, and proceedings for free.
The Khronos specific sessions mentioned above can be found online.