
Thread: Official OpenCL 2.0 Feedback thread

  1. #1
    Administrator khronos
    Join Date
    Jun 2002

    Official OpenCL 2.0 Feedback thread

    Khronos Releases OpenCL 2.0 Provisional Specification for Public Review
    New generation of industry open standard for cross-platform parallel programming delivers increased flexibility, functionality and performance

    July 22nd 2013 – SIGGRAPH - Anaheim, CA – The Khronos™ Group today announced the ratification and public release of the OpenCL™ 2.0 provisional specification. OpenCL 2.0 is a significant evolution of the open, royalty-free standard that is designed to further simplify cross-platform, parallel programming while enabling a significantly richer range of algorithms and programming patterns to be easily accelerated. As the foundation for these increased capabilities, OpenCL 2.0 defines an enhanced execution model and a subset of the C11 and C++11 memory model, synchronization and atomic operations. The release of the specification in provisional form is to enable developers and implementers to provide feedback before specification finalization, which is expected within 6 months. The OpenCL 2.0 provisional specification and reference cards are available at

    “The OpenCL working group has combined developer feedback with emerging hardware capabilities to create a state-of-the-art parallel programming platform - OpenCL 2.0,” said Neil Trevett, chair of the OpenCL working group, president of the Khronos Group and vice president of mobile content at NVIDIA. “OpenCL continues to gather momentum on both desktop and mobile devices. In addition to enabling application developers, it is providing foundational, portable acceleration for middleware libraries, engines and higher-level programming languages that need to take advantage of heterogeneous compute resources including CPUs, GPUs, DSPs and FPGAs.”

    Updates and additions to OpenCL 2.0 include:

    Shared Virtual Memory
    Host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices.

    Dynamic Parallelism
    Device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.

    Generic Address Space
    Functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions to be written for each named address space used in an application.

    Images
    Improved image support including sRGB images and 3D image writes, the ability for kernels to read from and write to the same image, and the creation of OpenCL images from a mip-mapped or a multi-sampled OpenGL texture for improved OpenGL interop.

    C11 Atomics
    A subset of C11 atomics and synchronization operations to enable assignments in one work-item to be visible to other work-items in a work-group, across work-groups executing on a device or for sharing data between the OpenCL device and host.
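    As a rough illustration of the kind of visibility guarantee described above, here is a minimal host-side sketch in standard C11 (not OpenCL C). The `mailbox` type and its functions are hypothetical names invented for this example. The OpenCL 2.0 kernel-language atomics are modeled on these C11 calls, with extra memory-scope arguments in the explicit variants, so the shape should carry over, but treat this as an analogy rather than device code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox: `payload` is ordinary data, `ready` is the atomic
   flag that publishes it to other threads (or, by analogy, work-items). */
typedef struct {
    int payload;
    atomic_bool ready;
} mailbox;

void mailbox_init(mailbox *m) {
    assert(m != NULL);
    m->payload = 0;
    atomic_init(&m->ready, false);
}

/* Writer: store the data first, then set the flag with release ordering,
   so the payload write cannot be reordered after the flag write. */
void mailbox_publish(mailbox *m, int value) {
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Reader: an acquire load that observes `true` guarantees that the payload
   written before the matching release store is visible. Returns false if
   nothing has been published yet. */
bool mailbox_try_read(mailbox *m, int *out) {
    assert(m != NULL && out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire)) {
        return false;
    }
    *out = m->payload;
    return true;
}
```

    The release/acquire pair is what makes the plain assignment to `payload` safely visible to the reader; without it, the reader could see the flag set but a stale payload.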

    Pipes
    Pipes are memory objects that store data organized as a FIFO and OpenCL 2.0 provides built-in functions for kernels to read from or write to a pipe, providing straightforward programming of pipe data structures that can be highly optimized by OpenCL implementers.
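    To make the FIFO behaviour concrete, here is a small host-side model in plain C. It is only a sketch: `pipe_model`, `pipe_model_write`, `pipe_model_read` and `PIPE_CAPACITY` are hypothetical names for this example, not part of the OpenCL API. It mirrors the convention of returning 0 on success and a negative value when the operation cannot complete.

```c
#include <assert.h>
#include <stddef.h>

#define PIPE_CAPACITY 4  /* hypothetical maximum number of packets */

/* Hypothetical model of a pipe: a fixed-capacity FIFO of int packets. */
typedef struct {
    int packets[PIPE_CAPACITY];
    size_t head;   /* index of the oldest packet */
    size_t count;  /* number of packets currently stored */
} pipe_model;

void pipe_model_init(pipe_model *p) {
    assert(p != NULL);
    p->head = 0;
    p->count = 0;
}

/* Appends one packet. Returns 0 on success, -1 if the pipe is full. */
int pipe_model_write(pipe_model *p, const int *packet) {
    assert(p != NULL && packet != NULL);
    if (p->count == PIPE_CAPACITY) {
        return -1;
    }
    p->packets[(p->head + p->count) % PIPE_CAPACITY] = *packet;
    p->count++;
    return 0;
}

/* Removes the oldest packet. Returns 0 on success, -1 if the pipe is empty. */
int pipe_model_read(pipe_model *p, int *packet) {
    assert(p != NULL && packet != NULL);
    if (p->count == 0) {
        return -1;
    }
    *packet = p->packets[p->head];
    p->head = (p->head + 1) % PIPE_CAPACITY;
    p->count--;
    return 0;
}
```

    Packets come out in the order they went in, and a full or empty pipe is reported to the caller instead of blocking.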

    Android Installable Client Driver Extension
    Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems.

    OpenCL BOF at SIGGRAPH, Anaheim, CA July 24th 2013
    There is an OpenCL BOF “Birds of a Feather” Meeting on Wednesday July 24th at 4-5PM at the Hilton Anaheim, California Ballroom A & B, where attendees are invited to meet OpenCL implementers and developers and learn more about the new OpenCL 2.0 specification.

  2. #2
    Junior Member
    Join Date
    Aug 2010

    It's great to see depth texture support, but only for 32-bit float and 16-bit unsigned normalized integer formats. I suppose these would be GL_DEPTH_COMPONENT16 and GL_DEPTH_COMPONENT32F in OpenGL. Why is there no support for the good old 24-bit format, like GL_DEPTH24_STENCIL8? Can we get support for that, please?

  3. #3
    Pipes: What is the idea behind the new pipe objects? Maybe I am too narrow-minded, but I just can't come up with a good use case for them.

    clGetKernelArgInfo: Could you add a cl_kernel_arg_info constant named CL_KERNEL_ARG_INFO_TYPE_SIZE? It would be possible to infer this information from the type name, but that does not sound very nice to me.

    General: Could you fix the documentation and the documentation links? For example, enqueue_kernel is not listed in the sidebar of the online manpages, and I get asked for a password when trying to access get_default_queue.

  4. #4
    The work_group_prefixsum_{inclusive,exclusive}_{add,min,max} functions are not named correctly, since they are not necessarily additions. Is it too late to change them to
    work_group_{inclusive,exclusive}_prefix_{add,min,max} or something else that removes the "sum" part of the name?
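    To illustrate the naming point, here is a sequential sketch in plain C of inclusive and exclusive scans parameterized by the operator. All names (`binary_op`, `scan_inclusive`, `scan_exclusive`, `op_add`, `op_min`, `op_max`) are hypothetical and chosen for this example only. The real work-group built-ins operate across work-items rather than over an array, so this only shows the arithmetic.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef int (*binary_op)(int, int);

int op_add(int a, int b) { return a + b; }
int op_min(int a, int b) { return a < b ? a : b; }
int op_max(int a, int b) { return a > b ? a : b; }

/* Inclusive scan: out[i] = in[0] op in[1] op ... op in[i]. */
void scan_inclusive(const int *in, int *out, size_t n, binary_op op) {
    if (n == 0) {
        return;
    }
    assert(in != NULL && out != NULL);
    out[0] = in[0];
    for (size_t i = 1; i < n; i++) {
        out[i] = op(out[i - 1], in[i]);
    }
}

/* Exclusive scan: out[0] = identity, out[i] = in[0] op ... op in[i-1].
   The identity depends on the operator: 0 for add, INT_MAX for min,
   INT_MIN for max. */
void scan_exclusive(const int *in, int *out, size_t n, binary_op op,
                    int identity) {
    assert(n == 0 || (in != NULL && out != NULL));
    int acc = identity;
    for (size_t i = 0; i < n; i++) {
        out[i] = acc;
        acc = op(acc, in[i]);
    }
}
```

    For the input {3, 1, 4, 1, 5}, the inclusive scan with `op_min` yields {3, 1, 1, 1, 1}, which is a running minimum and not a sum of any kind.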

  5. #5
    Join Date
    Jul 2013
    Section 7.4 requires (for single-precision floating point numbers) an accuracy of 2.5 ulp for reciprocal and division. However, the fmod, modf, remainder, remquo functions require 0 ulp. This seems strange, since their implementation requires a division.

  6. #6
    For me, the spec is ok but lacks these important features:

    1. A way to specify different-sized images in an image array. Critical for OpenCL-accelerated hybrid renderers ( like MentalRay, FurryBall, Octane, etc... ) and rasterizers.
    You added get_image_width/height(), which is very good... now let us use opaque image pointers in a 2D/3D image array so we can do
    __kernel void main ( image2d_array_t arr )
    {
        int w = get_image_width ( arr[0] ) ; // w = 128
        int h = get_image_height ( arr[0] ) ; // h = 512

        int w2 = get_image_width ( arr[1] ) ; // w2 = 16
        int h2 = get_image_height ( arr[1] ) ; // h2 = 64
    }
    We need a system to fetch images from a large array without restrictions!
    Maybe you should also allow us to pass image2d_t handles to CL pipes?

    3. C++ support (at least a partial feature set, including virtual abstract functions/interfaces and perhaps some templates). That's critical to reduce code size and implementation effort. Without it, we must implement a zillion functions, for instance to perform lighting (with point lights, spot lights, directional lights, ambient lights...). It would be much better to let us use an interface class like ILight and then call a virtual ILight::doLighting(), as in DirectCompute/DX11.

    4. An extension to transmit kernels' GPU debug info to a CPU-side debugger app.
    Printf sucks, and debugging on CPU CL devices is no good because the GPU behaves differently.

    5. malloc/free and the C++ new/delete operators INSIDE the kernel source, so we can use linked lists and other kinds of containers!
    Last edited by pelotoescogorciao; 09-09-2013 at 01:26 PM.

  7. #7
    Join Date
    Jul 2013
    Nice work.

    I have just a suggestion:

    Very often, my kernels have the following structure:

    1) copy data from global to local memory. barrier.
    2) a subset of work-items in the work-group perform operations on the local data.
    3) barrier.
    4) another subset of work-items in the work-group perform operations on the local data in another order.
    5) barrier. copy the local data to the global memory.

    The problem is that in step 2) or 4) a significant number of work-items may be "idling" *.
    Of course, to avoid idling work-items, steps 2) and 4) can be put in different kernels, but that then requires a local/global memory transfer.

    Then comes my suggestion: add a function for explicitly describing to the OpenCL compiler the range of active work-items between two barriers. This could enhance performance and power efficiency.

    Maybe it is possible to do that with the new functions, but I have not seen how.

    *: actually, I have been told that on SIMD hardware the "idling" is only apparent, because all the work-items perform the same operations and only the results of the operations are not written.
    Last edited by vic2013; 07-24-2013 at 03:09 AM.

  8. #8
    Junior Member
    Join Date
    Aug 2010

    Wait for any event

    I would like to suggest the addition of a function that implements the concept of "wait for any event" (maybe clWaitForAnyEvent?), as described below:

  9. #9
    Join Date
    Jul 2013
    Regarding the importance of Pipes:

    This is one of the most important new features of OpenCL 2.0 for 2 reasons:

    1) Pipes enable a very general way to achieve near-optimal memory performance by 'hiding' the memory read/write latency behind the queueing mechanism. Optimizing memory access patterns is one of the most difficult areas of OpenCL code optimization, and Pipes give us a way to achieve excellent memory utilization with a very simple and general approach.

    2) Pipes can also mitigate performance issues that arise from code that spends most of its time in a loop where that loop contains conditional code paths that are only triggered rarely. For example, let's say we are searching a large dataset for certain types of patterns, and when a match is found then additional processing of that match is needed. Currently the GPU cores must pay the cost of that additional processing code on *every* iteration of the loop, even when no match is found (because of the way SIMD works). With pipes, every time a match is found we can simply write an entry to a pipe and delegate the additional processing to a separate kernel that is reading from that pipe. Essentially we can now extract the seldom-used (but always paid for) code from the loop and delegate it to a separate dedicated thread. The speedup to the original thread can be substantial, depending on how much processing needs to be done on each match and how often matches are actually encountered.

    Both of these uses are going to significantly speed up real-life production OpenCL code that I run every day.
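    The second use can be sketched on the host in plain C as a two-stage split. This is only a sequential model under stated assumptions: `match_queue`, `is_match`, `scan_and_defer`, `process_matches` and `MATCH_QUEUE_CAPACITY` are hypothetical names, the "match" test (a negative value) and the "expensive" work (summing magnitudes) are placeholders, and a plain array stands in for the pipe. On a device, the two stages would be separate kernels connected by a pipe.

```c
#include <assert.h>
#include <stddef.h>

#define MATCH_QUEUE_CAPACITY 8  /* hypothetical queue size */

/* Stand-in for the pipe: records where matches were found. */
typedef struct {
    size_t indices[MATCH_QUEUE_CAPACITY];
    size_t count;
} match_queue;

/* Placeholder predicate standing in for the real pattern test. */
static int is_match(int value) { return value < 0; }

/* Stage 1: the hot loop only tests each element and records match
   positions; it does no per-match processing. Returns the number of
   matches dropped because the queue was full. */
size_t scan_and_defer(const int *data, size_t n, match_queue *q) {
    assert(q != NULL && (n == 0 || data != NULL));
    size_t dropped = 0;
    q->count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_match(data[i])) {
            continue;
        }
        if (q->count < MATCH_QUEUE_CAPACITY) {
            q->indices[q->count++] = i;
        } else {
            dropped++;
        }
    }
    return dropped;
}

/* Stage 2: the rarely needed work runs separately over the queued
   entries. Here it just sums the magnitudes of the matched values. */
long process_matches(const int *data, const match_queue *q) {
    assert(data != NULL && q != NULL);
    long total = 0;
    for (size_t i = 0; i < q->count; i++) {
        total += -(long)data[q->indices[i]];
    }
    return total;
}
```

    The point of the split is that the body of the scanning loop stays small and uniform, while the cost of the follow-up work is paid only for entries that actually matched.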

    One thing that appears to be missing from the spec:

    We need a way to read/write Pipes from the host side (i.e., not within a kernel). Not all hardware will support this, but where it is supported it will give a fantastic capability to stream data to a live kernel, and receive streamed results back in a very straightforward way. Doesn't the HSA spec allow for host side queue read/writes as well? Why is this not in OpenCL?



  10. #10
    Join Date
    Jul 2012
    2 as mentioned in this thread, it would be better to have a kernel timeout feature for sloppy kernels. All my profiling sessions on time consuming kernels ended up crashing X
