I'm trying to grasp how OpenCL programming for a GPU works. Is it strictly SIMD? That is, must each thread be executing the exact same instruction at any given time, or does each thread just need to be executing the same code (like multiple CPU threads, each with its own stack frame)? I imagine it's somewhere in between (SIMD only within a workgroup, for example).
Here's an example to help illustrate my confusion:
Let's say I want to write a kernel that simulates coin-flipping. Each thread represents a person; a thread finishes when its person has seen 10 heads, and it saves the total number of flips in some variable. Assuming the threads are all seeded differently, not every thread will finish at the same time. Is it still possible to run this on a GPU?
In the real program, would it be sufficient to call barrier() after this part of the code to make sure everything is where it should be?
Thanks for the help, and I apologize for any confusion in my explanation. I realize I probably just don't understand this very well, and any comments would be much appreciated.