Type: Posts; User: utnapishtim

  1. You can't, but there's nothing wrong with a loop...

    You can't, but there's nothing wrong with a loop in a kernel.
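
    A minimal sketch of the kind of loop meant here (kernel and argument names are hypothetical): each work-item strides over several elements instead of handling only one.

      __kernel void scale(__global float *data, uint n, float factor)
      {
          /* Each work-item loops over a strided range of elements. */
          for (size_t i = get_global_id(0); i < n; i += get_global_size(0))
              data[i] *= factor;
      }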
  2. CL_DEVICE_LOCAL_MEM_SIZE returns the max amount...

    CL_DEVICE_LOCAL_MEM_SIZE returns the max amount of local memory that a work-group can allocate (and use). Since a work-group can run on only one compute unit, this amount of memory is for each...
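
    A minimal host-side sketch of the query (assumes a valid cl_device_id named device):

      #include <stdio.h>
      #include <CL/cl.h>

      /* Print the local memory budget of one work-group / compute unit. */
      static void print_local_mem_size(cl_device_id device)
      {
          cl_ulong local_mem_size = 0;
          clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                          sizeof(local_mem_size), &local_mem_size, NULL);
          printf("Local memory per work-group: %llu bytes\n",
                 (unsigned long long)local_mem_size);
      }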
  3. If you use the CPU device and your app is...

    If you use the CPU device and your app is compiled for x64, get_global_id() returns a size_t value, which is 64 bits wide.
    In this case, as_uchar4(get_global_id(0)) is not legal.

    You should first...
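
    A sketch of the likely fix, assuming the intent is to narrow the id to a 32-bit type before reinterpreting it (kernel name is hypothetical):

      __kernel void example(__global uchar4 *out)
      {
          uint gid = (uint)get_global_id(0);  /* explicit 64-bit to 32-bit conversion */
          out[gid] = as_uchar4(gid);          /* as_uchar4() needs a 4-byte operand */
      }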
  4. Each compute unit has 32 ALU. So the device has a...

    Each compute unit has 32 ALUs, so the device has a total of 4x32 = 128 ALUs.
    Each compute unit can run a work-group of up to 512 work-items.
  5. A work-group runs on one compute unit. It cannot...

    A work-group runs on one compute unit. It cannot be split among several compute units (first of all because local memory is local to a compute unit).

    The max work-group size is an indication of...
  6. Adding "return ret;" at the end of getuint2()...

    Adding "return ret;" at the end of getuint2() will probably help...
  7. VGPRs are 32-bit wide.
  8. Private memory is a lot faster than global memory...

    Private memory is a lot faster than global memory (roughly 100x faster). However you must consider it as a scarce resource. As I stated earlier, the optimal maximum number of registers for a kernel...
  9. From what I have seen,...

    From what I have seen, CL_DEVICE_MAX_WORK_GROUP_SIZE is 256 on an HD7970.
  10. A wavefront is more or less the hardware...

    A wavefront is more or less the hardware counterpart of a work-group. Each work-group is split into blocks of 64 work-items; each block is executed as a wavefront by a compute unit. Several wavefronts...
  11. This might be caused by register spilling. The...

    This might be caused by register spilling. The full code may need more registers than each isolated part of the algorithm.
    In this case, values have to be temporarily stored to and read from global...
  12. If I understand well, you want to know what...

    If I understand correctly, you want to know what happens when global size=700 and local size=100.

    Obviously you request 7 work-groups of 100 work-items each.

    On AMD devices, work-items in a...
  13. GL_RGBA8 is mapped onto CL_UNORM_INT8 data type....

    GL_RGBA8 is mapped onto the CL_UNORM_INT8 data type. As a result, your kernel should use write_imagef() instead of write_imageui().
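
    A hypothetical kernel for such an image: the float channel values in [0, 1] are converted to 8-bit unsigned normalized storage by write_imagef().

      __kernel void fill_red(write_only image2d_t img)
      {
          int2 coord = (int2)(get_global_id(0), get_global_id(1));
          write_imagef(img, coord, (float4)(1.0f, 0.0f, 0.0f, 1.0f));
      }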
  14. If you try to execute a kernel with...

    If you try to execute a kernel with clEnqueueNDRangeKernel() with global_work_size = 256 and local_work_size = 100, you will get a CL_INVALID_WORK_GROUP_SIZE error.

    The number of work-items...
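
    A sketch of the host call (queue and kernel are assumed to be valid). In OpenCL 1.x the global work size must be a multiple of the local work size, so the global size is rounded up and the kernel must discard the extra work-items:

      size_t global_work_size = 300;  /* 3 work-groups of 100; 256 would fail */
      size_t local_work_size  = 100;
      cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                          &global_work_size, &local_work_size,
                                          0, NULL, NULL);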
  15. Yes, it is possible. It is often the case when...

    Yes, it is possible.
    It is often the case when the image contains an intermediate result produced by kernel A and to be consumed by kernel B.
  16. That's what the attribute "endian" is made for:...

    That's what the attribute "endian" is made for: use __attribute__ ((endian(host))) or __attribute__ ((endian(device))) to tell OpenCL which kind of endianness a buffer uses. Default is device...
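
    A sketch of how the attribute might be attached to a buffer argument (syntax as described in the OpenCL C specification; implementation support varies):

      /* The buffer behind p is declared to be in host byte order; the
         implementation byte-swaps if the device endianness differs. */
      __kernel void consume(__global int *p __attribute__((endian(host))),
                            __global int *out)
      {
          size_t i = get_global_id(0);
          out[i] = p[i];
      }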
  17. OpenCL requires that sine has a minimum accuracy...

    OpenCL requires that sin() be accurate to within 4 ulp.
    For example, if the expected result is 0.5 in double precision, one ulp is 2^-53 = 1.11e-16. So the maximum admissible error is 4 ulp ~ 4.5e-16.

    So the...
  18. Check the max size of a constant buffer with...

    Check the max size of a constant buffer with CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE.
    It is generally 64KB on a GPU, so your buffer is probably too big to fit into a constant buffer.
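
    A small host-side helper (names are hypothetical) to check whether a buffer fits into __constant memory:

      /* Returns 1 if a buffer of needed_size bytes fits in __constant memory. */
      static int fits_in_constant(cl_device_id device, size_t needed_size)
      {
          cl_ulong max_const = 0;
          clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                          sizeof(max_const), &max_const, NULL);
          return needed_size <= max_const;
      }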
  19. You can find a good introduction to reduction...

    You can find a good introduction to reduction with OpenCL here:

    http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
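
    A minimal local-memory reduction in the spirit of that article (kernel and argument names are hypothetical; assumes the local size is a power of two and the global size equals the element count). Each work-group writes one partial sum, which the host or a second pass then combines:

      __kernel void reduce_sum(__global const float *in,
                               __global float *partial_sums,
                               __local float *scratch)
      {
          uint lid = get_local_id(0);
          scratch[lid] = in[get_global_id(0)];
          barrier(CLK_LOCAL_MEM_FENCE);

          /* Tree reduction inside the work-group. */
          for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
              if (lid < stride)
                  scratch[lid] += scratch[lid + stride];
              barrier(CLK_LOCAL_MEM_FENCE);
          }

          if (lid == 0)
              partial_sums[get_group_id(0)] = scratch[0];
      }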
  20. Intel CPU and GPU share physical memory so...

    Intel CPU and GPU share physical memory so mapping a buffer is very efficient if the following conditions are fulfilled:

    - The buffer is created with CL_MEM_ALLOC_HOST_PTR, or with...
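
    A sketch of the zero-copy pattern under those conditions (context, queue and size are assumed to be valid; error checks omitted):

      cl_int err;
      cl_mem buf = clCreateBuffer(context,
                                  CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                  size, NULL, &err);

      /* Mapping returns a pointer into the shared allocation: no copy is made. */
      void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                     0, size, 0, NULL, NULL, &err);
      /* Fill ptr with input data, then unmap before using buf in a kernel. */
      clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);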
  21. To put it simply, a kernel should never access a...

    To put it simply, a kernel should never access a host memory buffer.

    If it does, the OpenCL implementation will either:

    - make a copy of the buffer between host and device before the kernel...
  22. Honestly, if your buffer is to be accessed by a...

    Honestly, if your buffer is to be accessed by a GPU kernel, you shouldn't use a host buffer and expect that all transfers will be magically optimized.
    If you need a buffer for your GPU kernel, then...
  23. Calling clEnqueueReadBuffer() with the pointer...

    Calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won't make any redundant copy but will only synchronize memory between GPU and CPU if needed (whence the...
  24. In your scenario, you can use...

    In your scenario, you can use clEnqueueReadBuffer() with blocking_read=true and ptr set to the host memory pointer.
    This will synchronize the (host) buffer with the GPU cache. You can then release...
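
    A sketch of that call (queue, buf, size and host_ptr are assumed to come from the scenario above):

      /* Blocking read back into the original host allocation. */
      cl_int err = clEnqueueReadBuffer(queue, buf, CL_TRUE /* blocking_read */,
                                       0, size, host_ptr, 0, NULL, NULL);
      clReleaseMemObject(buf);  /* host_ptr still holds the up-to-date data */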
  25. You are using unnormalized integer coordinates...

    You are using unnormalized integer coordinates with read_imagef(), so your sampler should be

    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
    ...
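
    A complete sampler along those lines (the addressing and filter flags below are a common choice, not necessarily the original ones), together with a trivial kernel using it:

      const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                                CLK_ADDRESS_CLAMP_TO_EDGE |
                                CLK_FILTER_NEAREST;

      __kernel void copy_image(read_only image2d_t src, write_only image2d_t dst)
      {
          int2 coord = (int2)(get_global_id(0), get_global_id(1));
          write_imagef(dst, coord, read_imagef(src, sampler, coord));
      }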