Code:
__kernel void deform(   __constant float4 *inCage,
                        __constant uint *wnum,
                        __constant uint *wskip,
                        __constant float *weights,
                        __constant uint *cageID,
                        __global float4 *outMesh,
                        const uint nend)
{
    uint i = get_global_id(0);
    if (i < nend)
    {
        float4 pos = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        uint j = wskip[i];          // offset of the first weight for vertex i
        uint end = j + wnum[i];     // one past the last weight for vertex i
        for (; j < end; j++)
            pos += inCage[cageID[j]] * weights[j];
        outMesh[i] = pos;
    }
}
I got the code above working correctly on the CPU, but on the GPU it outputs garbage and runs about 100x slower. My NDRange is over 30,000. I have an Nvidia GTX 260, a 4-core Athlon, and Windows XP 64-bit. I have ATI Stream installed, plus Nvidia's ForceWare 195.39 drivers (later versions output garbage that wasn't affected by the input in any way; these drivers at least react to the input, producing garbage that changes with it).

My first thought was that the slowdown might be because everything is in global/constant memory. Unfortunately, I am at a loss as to how to get inCage into local memory cleanly, since it's not possible to assign a pointer from __global to __local. How can I copy inCage to local memory efficiently?

I've dropped all zero weights from the data, so the length of the for loop depends on how many weights exist for each i. Every element of cageID and weights is read only once across the whole NDRange, so it would be logical to keep them as __constant or __global.
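
To make the data layout concrete, here is a minimal host-side sketch in plain C of how arrays like these could be packed from a dense weight matrix, together with a CPU loop that mirrors the kernel and can serve as a reference when checking GPU output. The function names pack_weights and deform_cpu, and the dense row-major input, are assumptions for illustration only; the array names follow the kernel arguments.

```c
#include <stddef.h>

/* Hypothetical helper: pack a dense row-major weight matrix
 * (nverts rows, ncage columns) into the compact arrays the kernel reads,
 * skipping zero weights. The output arrays must be large enough
 * (weights and cageID: up to nverts * ncage entries).
 * Returns the number of nonzero weights written. */
size_t pack_weights(const float *dense, size_t nverts, size_t ncage,
                    unsigned *wnum, unsigned *wskip,
                    float *weights, unsigned *cageID)
{
    size_t k = 0;
    for (size_t i = 0; i < nverts; i++) {
        wskip[i] = (unsigned)k;               /* where vertex i's weights start */
        for (size_t c = 0; c < ncage; c++) {
            float w = dense[i * ncage + c];
            if (w != 0.0f) {
                weights[k] = w;
                cageID[k] = (unsigned)c;      /* which cage point this weight uses */
                k++;
            }
        }
        wnum[i] = (unsigned)k - wskip[i];     /* how many weights vertex i has */
    }
    return k;
}

/* Hypothetical CPU reference with the same loop as the kernel.
 * Positions are stored as 4 consecutive floats (x, y, z, w). */
void deform_cpu(const float *inCage, const unsigned *wnum,
                const unsigned *wskip, const float *weights,
                const unsigned *cageID, float *outMesh, unsigned nend)
{
    for (unsigned i = 0; i < nend; i++) {
        float pos[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        unsigned end = wskip[i] + wnum[i];
        for (unsigned j = wskip[i]; j < end; j++)
            for (int c = 0; c < 4; c++)
                pos[c] += inCage[4 * cageID[j] + c] * weights[j];
        for (int c = 0; c < 4; c++)
            outMesh[4 * i + c] = pos[c];
    }
}
```

In this layout wnum and wskip have one entry per output vertex, while weights and cageID have one entry per nonzero weight, so with an NDRange over 30,000 the per-vertex arrays alone hold more than 30,000 uints each.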