More threads than the local work size
I have a simple question. What happens if I use more threads than the local work size constraint allows?
Is it normal for the kernel to give me random values?
I hope that is clear, thank you!
If you specify a work-group size larger than your hardware or kernel supports, the clEnqueueNDRangeKernel call should fail and return an error code.
That does not happen. What's the mistake? It keeps running but gives back wrong values!
Your question isn't very clear. Presumably by "threads" you mean "work-items", but it is not clear whether you mean within a single work-group or across the whole NDRange. If you use more than the maximum size of a work-group, the runtime should give you an error. If you use more in the NDRange, you will simply have more than one work-group. OpenCL defines no synchronization between work-groups, so unless you construct your code carefully you may see unexpected behaviour if you expect the groups to run in any particular order. If, for example, you are relying on a barrier in your code, that barrier only synchronizes the work-items of a single work-group, not all of them, so your synchronization would be invalid.
The reason for this is that a GPU, like a CPU, can only actually have a certain number of thread contexts at a time and this number is abstracted away in OpenCL. Instead the model is based on streaming more work-items over that underlying set of thread contexts. You may not have enough capacity in the machine to run everything concurrently and hence it is always valid to serialize the set of work-groups. There can, therefore, be no global synchronization in the model.
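To make the "no global synchronization" point concrete, here is a minimal sketch in plain C (no OpenCL runtime needed) that simulates the usual workaround: split the work into two launches and let the launch boundary act as the global sync point. The function names, the `partials` buffer, and the choice of a sum reduction are all assumptions made for illustration. The first "launch" deliberately runs the groups in reverse order to show that the result does not depend on the order in which work-groups execute.

```c
#include <assert.h>
#include <stddef.h>

/* Launch 1, one simulated work-group: sum only this group's own slice.
 * Nothing here reads data produced by another group. */
long group_partial_sum(const int *in, size_t n, size_t group_id, size_t local)
{
    size_t begin = group_id * local;
    size_t end = (begin + local < n) ? begin + local : n;
    long sum = 0;
    for (size_t i = begin; i < end; ++i)
        sum += in[i];
    return sum;
}

/* Hypothetical host-side driver: two "launches" separated by a boundary. */
long two_pass_sum(const int *in, size_t n, size_t local,
                  long *partials, size_t num_groups)
{
    assert(local > 0 && num_groups * local >= n);

    /* Launch 1: groups run serially and in reverse order on purpose;
     * any order is valid because the groups are independent. */
    for (size_t g = num_groups; g-- > 0; )
        partials[g] = group_partial_sum(in, n, g, local);

    /* Launch 2: starts only after every partial sum has been written,
     * which is the guarantee a second kernel enqueue would give. */
    long total = 0;
    for (size_t g = 0; g < num_groups; ++g)
        total += partials[g];
    return total;
}
```

In real OpenCL code the two loops would be two separate kernel enqueues on an in-order queue; the sketch only shows why that structure does not depend on group scheduling.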
I am going to piggyback on this thread to ask another simple question. Is there a maximum number of work-groups? I know there is a maximum number of work-items per group, but is there a similar limit for work-groups? Or can I make as many groups as I would like?
> If you specify a work-group size larger than your hardware or kernel supports, the clEnqueueNDRangeKernel call should fail and return an error code.
That would be nice but I don't think you can reliably expect that from every driver. Some might crash or return incorrect results.
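Since the driver may not reject a bad launch, one option is to check the sizes yourself before enqueueing. Below is a rough sketch of such a pre-check in plain C. The enum and the function name are hypothetical (they are not OpenCL constants or API calls), and it assumes the caller has already queried the limit, for example the smaller of CL_DEVICE_MAX_WORK_GROUP_SIZE (from clGetDeviceInfo) and CL_KERNEL_WORK_GROUP_SIZE (from clGetKernelWorkGroupInfo). The divisibility check reflects the OpenCL 1.x rule that the global size must be a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; NOT OpenCL error codes. */
enum launch_check {
    LAUNCH_OK = 0,
    LAUNCH_LOCAL_ZERO,
    LAUNCH_LOCAL_TOO_LARGE,
    LAUNCH_GLOBAL_NOT_MULTIPLE
};

/* Validate a 1-D launch before handing it to the driver.
 * max_wg_size is assumed to be a limit the caller queried from the
 * runtime; this function does not talk to OpenCL itself. */
enum launch_check check_launch(size_t global, size_t local, size_t max_wg_size)
{
    assert(max_wg_size > 0); /* a real device always reports at least 1 */

    if (local == 0)
        return LAUNCH_LOCAL_ZERO;
    if (local > max_wg_size)
        return LAUNCH_LOCAL_TOO_LARGE;
    if (global % local != 0)
        return LAUNCH_GLOBAL_NOT_MULTIPLE; /* required by OpenCL 1.x */
    return LAUNCH_OK;
}
```

It is still worth checking the value returned by clEnqueueNDRangeKernel as well; the pre-check only catches the cases a lenient driver might let through silently.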
> Is there a maximum number of work groups?
The limit is pretty big. In principle the global work size is limited only by the largest number that fits in a size_t, but more likely the driver limits it to something smaller; I'm guessing somewhere between 2^16 and 2^31. If your global work size is larger than the maximum work-group size, it will run as many work-groups as necessary to get the work done. They might run in parallel, in serial, or a combination of both. In practice, on some platforms you will find the limit is time: the OS will kill the kernel if it takes more than a few seconds to run over your global work size.
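The "as many work-groups as necessary" arithmetic can be sketched like this. The helper names are made up for illustration, and the sketch assumes a 1-D range and the OpenCL 1.x requirement that the global size be a multiple of the local size, so the element count is padded up to the next multiple.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local that is >= n (overflow is not handled). */
size_t round_up_global(size_t n, size_t local)
{
    assert(local > 0);
    size_t rem = n % local;
    return rem == 0 ? n : n + (local - rem);
}

/* Number of work-groups the runtime has to schedule for a 1-D range. */
size_t num_work_groups(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return global / local;
}
```

For example, 1000 elements with a local size of 256 pad up to a global size of 1024, which is 4 work-groups. The kernel then needs a guard such as `if (get_global_id(0) < n)` so the 24 padding work-items do not touch memory past the end of the buffer.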
Thanks for the answer Dithermaster.
I am looking at a bitonic sort example, and as far as I can see each work-item only works on one element of the sequence it is sorting. So if the upper limit is 2^31 and the sequence is longer than that, some more trickery will have to be done to make it work. It seems to be a pretty basic implementation though.
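One possible form of that "trickery" is to let each work-item handle several elements with a strided loop, so the number of work-items no longer has to match the sequence length. The following is only a sketch in plain C that simulates the idea on the host; it is not the example mentioned above, and the function names and the `num_items` parameter are assumptions. Each (k, j) pass stands in for one kernel launch, and the simulated work-items within a pass touch disjoint pairs, so their execution order does not matter.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one simulated work-item for a single (k, j) pass.
 * Instead of handling one element, it strides over the whole array. */
void bitonic_step_item(int *a, size_t n, size_t k, size_t j,
                       size_t gid, size_t num_items)
{
    for (size_t i = gid; i < n; i += num_items) {
        size_t partner = i ^ j;
        if (partner > i) { /* the lower index of each pair does the work */
            int ascending = ((i & k) == 0);
            if ((ascending && a[i] > a[partner]) ||
                (!ascending && a[i] < a[partner])) {
                int tmp = a[i];
                a[i] = a[partner];
                a[partner] = tmp;
            }
        }
    }
}

/* Hypothetical host driver: one "launch" per (k, j) pass, with only
 * num_items work-items regardless of n. n must be a power of two. */
void bitonic_sort_strided(int *a, size_t n, size_t num_items)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    assert(num_items > 0);

    for (size_t k = 2; k <= n; k <<= 1)
        for (size_t j = k >> 1; j > 0; j >>= 1)
            for (size_t gid = 0; gid < num_items; ++gid)
                bitonic_step_item(a, n, k, j, gid, num_items);
}
```

In an actual kernel the stride would be `get_global_size(0)` and the starting index `get_global_id(0)`, with the host enqueueing one launch per pass.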