Essentially I'm cross-posting this issue from the NVIDIA forums (I posted there first, but there seemed to be little interest, so I figured I'd ask here as well).

Anyway, on to the matter at hand: when iterating over constant memory in kernels, the GPU has a tendency to simply return 0s rather than fetch the value at the corresponding address. I'm using a 15" MacBook (4,1) running Snow Leopard, with the NVIDIA CUDA SDK 3.0 installed and a GeForce 8600M GT. Following is a code example of when the aforementioned issue arises:
Code :
__kernel void add(__constant float *a, __global float *answer)
{
    const int id_x = (int)get_global_id(0);
    int i,j;
    float sum = 0;

    for(i = 0; i < 3; i++)
    {
        for(j = 0; j < 3; j++)
        {
            //This will not work
            sum += a[i+j];
            /*
            This will work:
            sum += a[j*i];
 
            So will this:
            sum += a[j];
 
            And this:
            sum += a[i]; 
            */
 
        }
    }
    //Write result
    answer[id_x] = sum;
}
In this instance, if I changed the iteration to the following:
Code :
int k = 0;
 
for(i = 0; i < 3; i++)
{
    for(j = 0; j < 3; j++, k++)
    {
        //This works
        sum += a[k];
    }
}
It worked just fine. Until today, that is. Now this code fails and returns 0s as well. Oddly enough, I'm using this workaround in one of my kernels (a simple naive convolution filter used for benchmarking) and it still works (code via pastebin). Now, I haven't read the specification cover to cover, so if there's something about pointer arithmetic I'm missing, please tell me. As it stands, though, this rather simple bug leaves me unable to use constant memory. A single-loop workaround with modulo arithmetic is not an option either, as it nearly doubled the running time of the kernel (god knows why). It's important to point out that this code works just fine on the CPU; it's only when run on the GPU that it starts returning 0s.