1. Are wavefront and wavefront granularity on AMD equivalent to warp size and warp size granularity on NVIDIA?
2. When creating a new variable in a kernel without explicitly using an address space qualifier ("private/local/global/constant/...") in the declaration, for example "float newVar;", in what memory is it created and what is the priority? Is it automatically global?
3. Let's say that I want to operate on many small vectors of length 64, and the optimal work-group size for my platform is 256. Is it a bad idea (performance-wise) to set the work-group size to 32 or 64? Is it very important not to go too far below 256, and instead to spread a single work-group across several vectors? The reason I ask is that splitting the work-group up like that could be bad for some aspects of my implementation.
4. A question regarding flow control. I read the AMD Accelerated Parallel Processing OpenCL Programming Guide (section 1.3.2) and have a question about this statement:
   "If work-items within a wavefront diverge, all paths are executed serially. For example, if a work-item contains a branch with two paths, the wavefront first executes one path, then the second path. The total time to execute the branch is the sum of each path time. An important point is that even if only one work-item in a wavefront diverges, the rest of the work-items in the wavefront execute the branch."

   This can't possibly mean that all work-items in a wavefront are automatically synchronized, right? Only that every path of the branch is executed by each thread? If it did mean that, it seems the "barrier" command would be useless.
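
To make the cost model in the quoted passage concrete, here is a minimal host-side sketch in plain C of how I read it. It is not OpenCL code and not taken from the guide: the function name `wavefront_branch_cost`, the per-path cycle counts, and the wavefront size of 64 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64 work-items per wavefront. */
#define WAVEFRONT_SIZE 64

/* Hypothetical cost model for one if/else executed by one wavefront.
 * takes_if[lane] says which path each work-item wants to follow.
 * cost_if and cost_else are made-up cycle counts for the two paths. */
int wavefront_branch_cost(const bool takes_if[WAVEFRONT_SIZE],
                          int cost_if, int cost_else)
{
    assert(cost_if >= 0 && cost_else >= 0);

    bool any_if = false;
    bool any_else = false;
    for (size_t lane = 0; lane < WAVEFRONT_SIZE; ++lane) {
        if (takes_if[lane]) {
            any_if = true;
        } else {
            any_else = true;
        }
    }

    /* A path is issued for the whole wavefront as soon as at least one
     * work-item needs it; work-items not on that path are masked off but
     * still wait. Paths run one after the other, so their costs add up. */
    return (any_if ? cost_if : 0) + (any_else ? cost_else : 0);
}
```

With `cost_if = 10` and `cost_else = 30`, a wavefront in which every work-item takes the else path costs 30 under this model, while a wavefront in which a single work-item takes the if path costs 40. That is the "sum of each path time" from the quote; the model says nothing about synchronization between work-items.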