Type: Posts; User: utnapishtim

  1. VGPRs are 32-bit wide.
  2. Private memory is a lot faster than global memory (roughly 100x faster). However, you must treat it as a scarce resource. As I stated earlier, the optimal maximum number of registers for a kernel...
  3. From what I have seen, CL_DEVICE_MAX_WORK_GROUP_SIZE is 256 on an HD7970.
  4. A wavefront is more or less the hardware counterpart of a work-group. Each work-group is split into blocks of 64 work-items; each block is executed as a wavefront by a compute unit. Several wavefronts...
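
As a rough illustration of the split described above, the following sketch counts how many 64-wide wavefronts one work-group occupies and how many lanes of its last wavefront do useful work. Only the wavefront width of 64 comes from the post; the helper names and the sizes in the usage note are assumptions made for illustration.

```c
#include <stddef.h>

/* Wavefront width quoted in the post for AMD GPUs. */
#define WAVEFRONT_SIZE 64

/* Number of wavefronts needed to cover one work-group (rounds up). */
size_t wavefronts_per_group(size_t local_size)
{
    return (local_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Work-items that occupy the last wavefront of a work-group.
   Assumes local_size > 0. */
size_t active_lanes_in_last_wavefront(size_t local_size)
{
    size_t rem = local_size % WAVEFRONT_SIZE;
    return rem == 0 ? WAVEFRONT_SIZE : rem;
}
```

With a hypothetical local size of 100, this gives 2 wavefronts, the second of which has only 36 of its 64 lanes occupied. That is one reason local sizes that are multiples of 64 tend to be preferred on such hardware.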
  5. This might be caused by register spilling. The full code may need more registers than each isolated part of the algorithm.
    In this case, values have to be temporarily stored to and read from global...
  6. If I understand correctly, you want to know what happens when global size=700 and local size=100.

    Obviously you request 7 work-groups of 100 work-items each.

    On AMD devices, work-items in a...
  7. GL_RGBA8 is mapped onto the CL_UNORM_INT8 data type. As a result, your kernel should use write_imagef() instead of write_imageui().
  8. If you try to execute a kernel with clEnqueueNDRangeKernel() with global_work_size = 256 and local_work_size = 100, you will even get a CL_INVALID_WORK_GROUP_SIZE error.

    The number of work-items...
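
The divisibility rule behind that error can be checked on the host before enqueueing. The sketch below is plain C with hypothetical helper names, and it assumes the OpenCL 1.x rule that the global size must be a whole multiple of the local size. Rounding the global size up is a common workaround, not something stated in the post.

```c
#include <stddef.h>

/* Returns 1 when global_size splits into whole work-groups of local_size. */
int is_valid_ndrange(size_t global_size, size_t local_size)
{
    return local_size != 0 && global_size % local_size == 0;
}

/* Smallest multiple of local_size that is >= global_size.
   Assumes local_size > 0. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```

For global size 256 and local size 100 the check fails, and rounding up gives 300, that is 3 work-groups. When the global size is padded this way, the kernel has to ignore work-items whose global id lies beyond the real problem size.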
  9. Yes, it is possible.
    It is often the case when the image contains an intermediate result produced by kernel A and consumed by kernel B.
  10. That's what the "endian" attribute is for: use __attribute__ ((endian(host))) or __attribute__ ((endian(device))) to tell OpenCL which kind of endianness a buffer uses. Default is device...
  11. OpenCL requires that sine has a minimum accuracy of 4 ulp.
    For example, in double precision, if the expected result is 0.5, one ulp is 2^-53, about 1.11e-16. So the maximum admissible error is 4 ulp, about 4.4e-16.

    So the...
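
The ulp arithmetic above can be reproduced with a small host-side check. This is a minimal sketch in plain C, assuming double precision and a positive, normal expected value. The function names are hypothetical, and the 4 ulp bound is the one quoted in the post.

```c
#include <float.h>

/* Spacing between adjacent doubles around x.
   Assumes x is positive and normal (not zero, subnormal, inf or NaN). */
double ulp_of(double x)
{
    double p = 1.0;
    while (p > x)
        p *= 0.5;            /* largest power of two <= x, for x < 1 */
    while (p * 2.0 <= x)
        p *= 2.0;            /* largest power of two <= x, for x >= 2 */
    return p * DBL_EPSILON;  /* DBL_EPSILON is the ulp of 1.0 */
}

/* Returns 1 when |got - expected| is at most max_ulps ulps of expected. */
int within_ulps(double got, double expected, double max_ulps)
{
    double err = got > expected ? got - expected : expected - got;
    return err <= max_ulps * ulp_of(expected);
}
```

For an expected value of 0.5, `ulp_of` returns 2^-53, so a result is accepted by `within_ulps(got, 0.5, 4.0)` as long as it is off by no more than 2^-51.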
  12. Check the max size of a constant buffer with CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE.
    It is generally 64KB on a GPU, so your buffer is probably too big to fit into a constant buffer.
  13. You can find a good introduction to reduction with OpenCL here:

    http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
  14. Intel CPU and GPU share physical memory, so mapping a buffer is very efficient if the following conditions are fulfilled:

    - The buffer is created with CL_MEM_ALLOC_HOST_PTR, or with...
  15. To put it simply, a kernel should never access a host memory buffer.

    In that case, the OpenCL implementation will either:

    - make a copy of the buffer between host and device before the kernel...
  16. Honestly, if your buffer is to be accessed by a GPU kernel, you shouldn't use a host buffer and expect that all transfers will be magically optimized.
    If you need a buffer for your GPU kernel, then...
  17. Calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won't make any redundant copy but will only synchronize memory between GPU and CPU if needed (whence the...
  18. In your scenario, you can use clEnqueueReadBuffer() with blocking_read=true and ptr set to the host memory pointer.
    This will synchronize the (host) buffer with the GPU cache. You can then release...
  19. You are using unnormalized integer coordinates with read_imagef(), so your sampler should be:

    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
    ...
  20. I've checked on NVIDIA GPU, AMD GPU and Intel CPU and your kernel is fine.

    How do you get the result from the device buffer on the host side?
  21. Try to cast plain to unsigned int instead of unsigned char, such as:

    W[t] = ((unsigned int) plain[t * 4]) << 24;

    and so on...
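
Spelled out for all four bytes, the cast suggested above might look like the sketch below. The post only shows the first term, so the big-endian byte order, the remaining three terms and the function name are assumptions here. Each byte is widened to 32 bits before shifting so that no bits are lost and the shift never reaches the sign bit of a signed type.

```c
#include <stdint.h>

/* Packs four consecutive bytes into one 32-bit word, most significant
   byte first. Byte order and name are assumptions for illustration. */
uint32_t pack_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

In the notation of the post, this would correspond to something like `W[t] = pack_be32(&plain[t * 4]);`.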
  22. The host buffer is not necessarily up-to-date when your kernel ends because its content can be cached in device memory.

    You have to use clEnqueueMapBuffer / clEnqueueUnmapMemObject to ensure that the...
  23. Check whether the extension is present in the string returned by clGetDeviceInfo() with CL_DEVICE_EXTENSIONS.
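
Once the extension string has been retrieved, the lookup itself is plain string handling. The sketch below assumes the string is a space-separated list of names, as CL_DEVICE_EXTENSIONS returns, and matches whole names only so that a longer extension name sharing the same prefix is not mistaken for a hit. The function name is hypothetical, and the OpenCL query itself is left out so the example stays self-contained.

```c
#include <string.h>

/* Returns 1 if name appears as a whole space-separated token in ext_list. */
int has_extension(const char *ext_list, const char *name)
{
    size_t len = strlen(name);
    const char *p = ext_list;

    if (len == 0)
        return 0;
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return 1;
        p += len;  /* keep searching after this partial match */
    }
    return 0;
}
```

A call such as `has_extension(extensions, "cl_khr_3d_image_writes")` would then tell whether the device advertises the extension, where `extensions` stands for the buffer filled by clGetDeviceInfo().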
  24. Are you sure that your device has support for the cl_khr_3d_image_writes extension?

    Also use clGetProgramBuildInfo() with CL_PROGRAM_BUILD_LOG to get more info about the reason why the build...
  25. Your kernels could be optimized, but the most important parameter when using a GPU is the local work size.

    NVIDIA GPUs for instance are optimized for a local work size of 128, so you should try...