Type: Posts; User: Dithermaster

Page 1 of 8 1 2 3 4

  1. Replies: 2; Views: 331

    clFlush can certainly block the CPU; it won't...

    clFlush can certainly block the CPU; it won't return until the command queue has completely been flushed to the hardware, and if the hardware queue is full, the CPU will block.

    Except for CL/GL...
  2. I answered on SO (before I saw this).
  3. That kernel looks like it was code-generated, not...

    That kernel looks like it was code-generated, not hand coded. In any case, one source of slowdown is that each work item reads 16 doubles from global memory. While they can be broadcast within each...
  4. The API is designed to be async -- all of the...

    The API is designed to be async -- all of the clEnqueue calls are designed to return quickly. The OpenCL driver uses a separate thread to push work to the GPU. So once you've queued up work to the...
  5. If you have an OpenCL driver for CPU installed...

    If you have an OpenCL driver for CPU installed then CL_DEVICE_TYPE_CPU devices appear, so yes, it is a useful flag to have.

    You might, for example, try for a GPU device, and only if one is not...
  6. Replies: 1; Views: 484

    clBuildProgram is _required_ regardless of...

    clBuildProgram is _required_ regardless of whether you created the program using clCreateProgramWithSource or clCreateProgramWithBinary. The build will be faster when the program was created from a binary.
  7. Replies: 3; Views: 992

    OpenCL C is based on C99, so if it is ill-defined...

    OpenCL C is based on C99, so if it is ill-defined in C99, it's ill-defined in OpenCL C.
  8. Replies: 3; Views: 992

    No such limitation. You can do multiple reads and...

    No such limitation. You can do multiple reads and writes to global memory from within a kernel. You should go back and ask your past self what they meant in the comment.
  9. My cursory understanding is that it's up to the...

    My cursory understanding is that it's up to the vendor's driver and how it's implemented. From what I'm reading above, AMD's driver supports it. I think NVIDIA Tesla cards run in the non-graphics mode...
  10. > 256 is the work group size and 700 is the...

    > 256 is the work group size and 700 is the global size so it is evenly divisible.
    Um, no it's not. 256 goes into 768 but not 700.
    The common solution is to "round up" the global size to be an...
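The rounding step mentioned in this reply can be sketched in host-side C. This is only a minimal sketch, not code from the thread: the helper name `round_up_global_size` is hypothetical, and it assumes the kernel is passed the real element count and returns early for the padding work items (for example `if (get_global_id(0) >= n) return;`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is >= global_size. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    if (remainder == 0) {
        return global_size;               /* already evenly divisible */
    }
    return global_size + (local_size - remainder);
}
```

With the numbers from the quoted question, `round_up_global_size(700, 256)` gives 768, which is the value that would be passed as the global work size.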
  11. You have old knowledge. Intel and AMD are both...

    You have old knowledge. Intel and AMD are both shipping OpenCL 2.0 drivers.

    Intel: https://software.intel.com/en-us/articles/opencl-drivers (2014 r2 is OpenCL 2.0)

    AMD:...
  12. Does the device report slightly less local memory...

    Does the device report slightly less local memory for CL_DEVICE_LOCAL_MEM_SIZE when you're running the r340.xx driver? It had better!

    I did notice a while back that some older NVIDIA OpenCL 1.0...
  13. OpenCL 2.0 adds support for images with the...

    OpenCL 2.0 adds support for images with the read_write qualifier. This is not possible in OpenCL 1.x; you'll need to use two different images. Note: It might just be faster that way anyway.
  14. OpenCL 1.x supports 2D and 3D images and OpenCL...

    OpenCL 1.x supports 2D and 3D images and OpenCL 1.2 adds 1D images, and clEnqueueNDRangeKernel supports 1D, 2D, and 3D workgroups. Of course all of these ultimately map to linear memory, so it's just...
  15. Do you have any constants with a decimal point...

    Do you have any constants with a decimal point and no "f" on the end? Those are doubles, and anything they do math with will get promoted to a double.
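The promotion rule described in this reply can be checked on the host in plain C99, the language OpenCL C is based on. This is a small illustrative sketch; both function names are hypothetical.

```c
#include <stddef.h>

/* 0.5 has no suffix, so it is a double constant and the product has type double. */
size_t product_size_unsuffixed(float x)
{
    return sizeof(x * 0.5);
}

/* 0.5f is a float constant, so the product keeps type float. */
size_t product_size_suffixed(float x)
{
    return sizeof(x * 0.5f);
}
```

The first function reports the size of a double and the second the size of a float, which is why a single missing "f" can pull double-precision math into a kernel.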
  16. Replies: 3; Views: 1,206

    FPGA would be more applicable to a vertical...

    FPGA would be more applicable to a vertical market solution (where FPGAs have typically been used). OpenCL is now an alternative programming environment that may be more productive than learning other FPGA...
  17. Sounds like a driver bug.
  18. Replies: 2; Views: 538

    > I assume that inside a kernel I should bundle a...

    > I assume that inside a kernel I should bundle a set of such answers into a reasonably-sized scalar type, say ushort.
    Yes, you should do that. Bit set/clear across work items would not be...
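One way to bundle a set of yes/no answers into a 16-bit word, as suggested in this reply, is sketched below in plain C. The names `pack_answers` and `answer_at` are hypothetical, and the layout (answer i stored in bit i) is an assumption made for illustration; in a kernel, each work item would build and write its own whole word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack up to 16 answers (nonzero means "yes") into one word; answer i goes to bit i. */
uint16_t pack_answers(const unsigned char *answers, size_t count)
{
    assert(count <= 16);
    uint16_t packed = 0;
    for (size_t i = 0; i < count; ++i) {
        if (answers[i]) {
            packed |= (uint16_t)(1u << i);
        }
    }
    return packed;
}

/* Read back a single answer from the packed word. */
int answer_at(uint16_t packed, unsigned bit)
{
    assert(bit < 16);
    return (int)((packed >> bit) & 1u);
}
```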
  19. OpenCL and clEnqueueNDRangeKernel are all about...

    OpenCL and clEnqueueNDRangeKernel are all about parallel execution.

    The global work size is "how many work items do I want to compute?"

    Then, inside your parallel kernel execution,...
  20. Short answer: It will improve things. Long...

    Short answer: It will improve things.

    Long answer: Of course for big data bandwidth-limited problems where compute has been overlapped with transfer but the transfer is very large it will be a...
  21. I agree. This is a parallel reduction problem....

    I agree. This is a parallel reduction problem. OpenCL 2 even has operators for reduction, but you can write it in OpenCL 1.x yourself. I'd write the values to global memory and then run the reduction...
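The access pattern of a hand-written reduction can be sketched sequentially in C. This is an illustrative host-side simulation under stated assumptions, not the code from the thread: the function name `tree_reduce_sum` is hypothetical, the element count must be a power of two, and each iteration of the inner loop stands in for one work item, with a barrier implied between strides.

```c
#include <assert.h>
#include <stddef.h>

/* Sums data[0..n) in place using the halving pattern of a work-group reduction.
   n must be a power of two; the total ends up in data[0]. */
float tree_reduce_sum(float *data, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* In a kernel, each i would be a separate work item, followed by a barrier. */
        for (size_t i = 0; i < stride; ++i) {
            data[i] += data[i + stride];
        }
    }
    return data[0];
}
```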
  22. There are the flags you use when you create the...

    There are the flags you use when you create the image object, and there are the qualifiers you give when you pass it to a kernel. Allocate it using read/write, and then pass it as write for kernel A,...
  23. Replies: 8; Views: 977

    That's simply not possible since the runtime...

    That's simply not possible since the runtime doesn't know what you're storing in the buffers. It could be any size data types in any combination. There is no way for it to know which bytes to swap...
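The point about element sizes can be illustrated with a short C sketch: the same four bytes swap differently depending on whether they hold two 16-bit values or one 32-bit value, which is information only the application has. The function name `swap_elements` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the byte order inside each element of elem_size bytes.
   len must be a multiple of elem_size. */
void swap_elements(unsigned char *buf, size_t len, size_t elem_size)
{
    assert(elem_size > 0 && len % elem_size == 0);
    for (size_t base = 0; base < len; base += elem_size) {
        for (size_t i = 0, j = elem_size - 1; i < j; ++i, --j) {
            unsigned char tmp = buf[base + i];
            buf[base + i] = buf[base + j];
            buf[base + j] = tmp;
        }
    }
}
```

Swapping the bytes 1, 2, 3, 4 with an element size of 2 gives 2, 1, 4, 3, while an element size of 4 gives 4, 3, 2, 1.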
  24. With that amount of overlapped reads (work items...

    With that amount of overlapped reads (work items re-reading the same memory other work items just read) this is a good candidate for workgroup shared local memory. Make those global memory reads just...
  25. Replies: 8; Views: 977

    Likely true, but again, until such a mis-matched...

    Likely true, but again, until such a mis-matched implementation exists how would you test that you handled it correctly? Seems like a lot of extra work for an unlikely scenario.
Results 1 to 25 of 189