I'm writing some code to perform image registration between two images. This involves working with values from two 64x64x64 3D images, which I pass into my kernel as OpenCL image objects. My block size (purely for testing purposes) is 64x1x1.
Here is the first code example (NB: the code is simplified to illustrate the problem I'm having): http://pastebin.com/m63f0afe4
Line 22 performs coalesced writes to _bins, which is stored in global memory, and the value written includes the result of the image access; note in particular the addition of 'res'. The metrics I get for this code from the NVIDIA OpenCL profiling tool are:
gst coalesced : 8192
gld uncoalesced : 393216
However, if I change line 22 so that it no longer includes the addition of 'res' (i.e. http://pastebin.com/m2cfbddc9), then there are 0 uncoalesced glds and the kernel runs significantly faster.
I'm having trouble getting my head round why it performs uncoalesced loads from global memory *only* when I use res in the addition, and not otherwise. Is this some sort of compiler optimisation that removes work when it can see that a variable is not used anywhere else?
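To make that guess concrete, here is a minimal CPU-side C sketch of the kind of optimisation I have in mind (dead-code elimination). It is not my actual kernel: `sample`, `store_with_res`, `store_without_res` and the `base` parameter are made-up names for illustration, and a plain array load stands in for the OpenCL image read. I'm not claiming this is what the NVIDIA compiler actually does, only showing the pattern I'm asking about.

```c
#include <stddef.h>

/* Hypothetical stand-in for the image read; the real kernel uses an
   OpenCL image object rather than a plain array load. */
static float sample(const float *image, size_t i)
{
    return image[i];
}

/* Variant 1: res contributes to the value stored in bins, so the load
   inside sample() has to be performed. */
void store_with_res(const float *image, float *bins, size_t gid, float base)
{
    float res = sample(image, gid);
    bins[gid] = base + res;
}

/* Variant 2: res is computed but never used. An optimising compiler is
   free to drop the call to sample() and, with it, the load. */
void store_without_res(const float *image, float *bins, size_t gid, float base)
{
    float res = sample(image, gid);
    (void)res;          /* value deliberately unused */
    bins[gid] = base;
}
```

If something like this applies to my kernel, the second version would never issue the read at all, which might explain why the loads disappear from the profile, but it would not explain why they are reported as uncoalesced global loads in the first place.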
I would appreciate any help in understanding this.