A few quick questions from a newbie



omgi
06-22-2011, 01:44 AM
1. Are wavefront size and wavefront granularity on AMD equivalent to warp size and warp size granularity on nVidia?

2. When creating a new variable in a kernel without explicitly using "private/local/global/const/..." in the declaration, for example "float newVar;", in what memory is it created and what is the priority? Is it automatically global?

4. Let's say that I want to operate on many small vectors of length 64, and the optimal work group size for my platform is 256. Is it a bad idea (performance-wise) to set the group size to 32 or 64? Is it very important not to go too far below 256, and instead to spread the same work group over different vectors? The reason I ask is that splitting the work group up like that could be bad for some aspects of my implementation.

3. A question regarding flow control. I read AMD Accelerated Parallel Processing OpenCL Programming Guide (http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf) (section 1.3.2) and got a question about this statement:

If work-items within a wavefront diverge, all paths are executed
serially. For example, if a work-item contains a branch with two paths, the
wavefront first executes one path, then the second path. The total time to
execute the branch is the sum of each path time. An important point is that even
if only one work-item in a wavefront diverges, the rest of the work-items in the
wavefront execute the branch.
This can't possibly mean that all work-items in a wavefront are automatically synchronized, right? Only that all cases of the statement are executed by each thread. If not, it seems that the "barrier" command would be useless.

omgi
06-22-2011, 02:56 AM
5. Is there any way to estimate how much private memory I have on my GPU (nVidia GTX 470 and ATI HD5850)?

omgi
06-22-2011, 03:21 AM
6. Is there any particular reason to use 2D or 3D work groups, other than it might be easier/prettier to map the threads to the work space? Performance gain for example?

david.garcia
07-06-2011, 03:43 PM
2. When creating a new variable in a kernel without explicitly using "private/local/global/const/..." in the declaration, for example "float newVar;", in what memory is it created and what is the priority? Is it automatically global?

It's private by default.


4. Let's say that I want to operate on many small vectors of length 64, and the optimal work group size for my platform is 256. Is it a bad idea (performance-wise) to set the group size to 32 or 64? Is it very important not to go too far below 256, and instead to spread the same work group over different vectors? The reason I ask is that splitting the work group up like that could be bad for some aspects of my implementation.

Very small work-group sizes usually perform significantly worse. Why not process four 64-element vectors in each work-group to reach the ideal work-group size of 256?
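
The packing suggested above comes down to a little index arithmetic. Here is a minimal sketch in plain C (a host-side model, not OpenCL kernel code), assuming the vectors are stored back to back. The helper names vector_index and element_index are made up for illustration; group_id and local_id stand in for what get_group_id(0) and get_local_id(0) would return inside a kernel.

```c
#include <assert.h>

enum {
    VEC_LEN = 64,                          /* vector length from the question */
    GROUP_SIZE = 256,                      /* preferred work-group size from the question */
    VECS_PER_GROUP = GROUP_SIZE / VEC_LEN  /* four vectors share one work-group */
};

/* Hypothetical helper: which vector a work-item operates on. */
static int vector_index(int group_id, int local_id) {
    assert(group_id >= 0 && local_id >= 0 && local_id < GROUP_SIZE);
    return group_id * VECS_PER_GROUP + local_id / VEC_LEN;
}

/* Hypothetical helper: which element of that vector the work-item owns. */
static int element_index(int local_id) {
    assert(local_id >= 0 && local_id < GROUP_SIZE);
    return local_id % VEC_LEN;
}
```

With this mapping, work-items 0..63 of a group handle one vector, 64..127 the next, and so on, so each 64-item slice can be treated as its own vector even though the group size is 256.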


This can't possibly mean that all work-items in a wavefront are automatically synchronized, right? Only that all cases of the statement are executed by each thread. If not, it seems that the "barrier" command would be useless.

It is true that all work-items inside a warp/wavefront are implicitly synchronized. A work-group will contain multiple warps/wavefronts, so the barrier function is still useful to synchronize between work-items in different warps/wavefronts.

However, this is an implementation detail, and code written on the assumption that it holds universally may not run correctly on other OpenCL implementations. I strongly recommend not making implementation-specific tweaks like these.
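
To make the guide passage quoted in the question concrete, here is a rough cost model in plain C (not OpenCL kernel code). It assumes a wavefront runs each branch path once if at least one of its work-items takes it, which is how the quoted text describes it. The function name branch_cost and the cycle counts passed to it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost model: time a wavefront spends on an if/else,
   given which path each of its n work-items takes. */
static int branch_cost(const bool *takes_if, size_t n, int if_cost, int else_cost) {
    assert(takes_if != NULL && n > 0);
    bool any_if = false;
    bool any_else = false;
    for (size_t i = 0; i < n; ++i) {
        if (takes_if[i]) {
            any_if = true;
        } else {
            any_else = true;
        }
    }
    /* Every path taken by at least one work-item is executed once
       by the whole wavefront, so the costs of the taken paths add up. */
    return (any_if ? if_cost : 0) + (any_else ? else_cost : 0);
}
```

In this model a wavefront whose work-items all take the same path pays for that path only, while a single diverging work-item makes the whole wavefront pay for both.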


5. Is there any way to estimate how much private memory I have on my GPU (nVidia GTX 470 and ATI HD5850)?

Only by reading the programming guides from those two hardware vendors. There's no standard way to query the amount of private memory AFAICR.


6. Is there any particular reason to use 2D or 3D work groups, other than it might be easier/prettier to map the threads to the work space? Performance gain for example?

It's simply to make it easier to map to your particular problem domain.