Hi,

If the underlying hardware can only operate on N threads simultaneously (say, the warp and half-warp sizes on NVIDIA's current cards, which are 32 and 16 threads respectively), how do the threads of a single 2D work unit map to these groups?

To illustrate, imagine that the work unit is 8x4 and the warp size is 4 threads.

Is it like this (each digit is the index of the warp that thread belongs to):

00001111
22223333
44445555
66667777

or like this:

00112233
00112233
44556677
44556677
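
To pin down what I mean by the two layouts, here is a small plain-Python sketch of the index arithmetic behind each one. The function names are made up for illustration, the sizes are the toy ones from above, and neither formula is meant as a statement of what any hardware actually does.

```python
WIDTH, HEIGHT, WARP = 8, 4, 4  # toy sizes from the example, not real hardware values


def warp_row_major(x, y):
    """First layout: linearise the work unit row by row, then cut it into warps."""
    return (y * WIDTH + x) // WARP


def warp_tiled_2x2(x, y):
    """Second layout: every warp covers one 2x2 tile of the work unit."""
    tiles_per_row = WIDTH // 2
    return (y // 2) * tiles_per_row + (x // 2)


def render(fn):
    """Return one string per row, each digit being the warp index of that thread."""
    return ["".join(str(fn(x, y)) for x in range(WIDTH)) for y in range(HEIGHT)]


if __name__ == "__main__":
    print("\n".join(render(warp_row_major)))
    print()
    print("\n".join(render(warp_tiled_2x2)))
```

Running it prints the two grids shown above, so the question is really which of these two formulas (if either) the hardware follows.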

Are there any guarantees, or any non-vendor-specific ways I can influence that (short of converting to 1D work units and doing the mapping myself)?
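
For reference, the 1D fallback I would like to avoid looks roughly like the sketch below (plain Python again, just to show the arithmetic; `tiled_xy` is a hypothetical helper). It assumes that consecutive 1D thread ids end up in the same warp, which is itself something I would want confirmed.

```python
WIDTH, HEIGHT, WARP = 8, 4, 4  # toy sizes from the example, not real hardware values


def tiled_xy(linear_id):
    """Map a 1D thread id to (x, y) so that each run of WARP consecutive ids
    covers one 2x2 tile (assumes warps are formed from consecutive 1D ids)."""
    tile, lane = divmod(linear_id, WARP)          # which tile, and position inside it
    tiles_per_row = WIDTH // 2
    tile_y, tile_x = divmod(tile, tiles_per_row)  # tile position in the work unit
    return tile_x * 2 + lane % 2, tile_y * 2 + lane // 2


if __name__ == "__main__":
    for linear_id in range(WIDTH * HEIGHT):
        print(linear_id, tiled_xy(linear_id))
```

In a kernel, the same arithmetic would be applied to the 1D local id at the top of the function, and the resulting (x, y) used in place of the built-in 2D ids.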

Cheers,
RCL