If the underlying hardware can only operate on N threads simultaneously (for example, the warp or half-warp sizes on NVIDIA's current cards, which are 32 and 16 threads respectively), how do the threads in a single 2D work unit map to these hardware units?
To illustrate, imagine that the work unit is 8x4 and the warp size is 4 threads.
Is it like this (the numbers denote the warp index):
Or like this:
Are there any guarantees, or any vendor-independent ways I can influence the mapping (short of converting to 1D work units and doing the mapping myself)?
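To make the question concrete, here is a small sketch of two candidate mappings for the 8x4 example. It rests on assumptions of mine: that "8x4" means 8 threads wide by 4 tall, that threads are linearized into a 1D index and then cut into consecutive groups of the warp size, and that the two linearization orders shown (x fastest and y fastest) are the relevant candidates. The function names are hypothetical, and this only models the arithmetic; it says nothing about what any particular vendor actually does.

```python
def warp_map_row_major(width, height, warp_size):
    """Warp index per thread, assuming x varies fastest in the linear index."""
    return [[(y * width + x) // warp_size for x in range(width)]
            for y in range(height)]


def warp_map_col_major(width, height, warp_size):
    """Warp index per thread, assuming y varies fastest in the linear index."""
    return [[(x * height + y) // warp_size for x in range(width)]
            for y in range(height)]


def show(grid):
    """Print one row of warp indices per line."""
    for row in grid:
        print(" ".join(str(w) for w in row))


if __name__ == "__main__":
    # Assumed example from the question: 8 wide, 4 tall, warp size 4.
    print("x fastest:")
    show(warp_map_row_major(8, 4, 4))
    print("y fastest:")
    show(warp_map_col_major(8, 4, 4))
```

Under the first assumption each warp covers half of a row; under the second each warp covers one full column. Which of these (if either) the hardware uses is exactly what I am asking about.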