register to global memory performance mystery



openclnewb
03-19-2010, 02:02 PM
Hi,

In my code, I have a private array:



__private float foo[10][2];


To make sure it stays in registers and out of high-latency local memory, I use array offsets that are computed at compile time. When I'm finished filling it with computed data, I want to transfer the array to a thread-specific offset in global memory.
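
For context, the overall pattern looks roughly like the sketch below. This is not the actual kernel: the name `fill_and_store`, the placeholder computation, the `host_gid`/`host_size` shim variables, and the strided output layout (element k of a work-item stored at `k * global_size + gid`) are all assumptions made for illustration. Whether `foo` really ends up in registers is still the compiler's decision; constant loop bounds only make that possible.

```c
/* Host-side shim (hypothetical): lets the sketch compile as plain C for a
   quick check. An OpenCL C compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __private
static size_t host_gid = 0;   /* stand-in for the work-item's global id */
static size_t host_size = 1;  /* stand-in for the global work size */
static size_t get_global_id(unsigned dim) { (void)dim; return host_gid; }
static size_t get_global_size(unsigned dim) { (void)dim; return host_size; }
#endif

#define ROWS 10
#define COLS 2

__kernel void fill_and_store(__global float *output_data)
{
    __private float foo[ROWS][COLS];
    const size_t gid = get_global_id(0);
    const size_t n = get_global_size(0);

    /* Constant trip counts: if the compiler unrolls these loops, every
       index into foo becomes a compile-time constant. */
    for (int i = 0; i < ROWS; ++i) {
        for (int j = 0; j < COLS; ++j) {
            /* Placeholder for the real computation. */
            foo[i][j] = (float)(i * COLS + j) + (float)gid;
        }
    }

    /* Assumed layout: element k of this work-item goes to k * n + gid, so
       neighbouring work-items write to neighbouring addresses. */
    for (int i = 0; i < ROWS; ++i) {
        for (int j = 0; j < COLS; ++j) {
            output_data[(size_t)(i * COLS + j) * n + gid] = foo[i][j];
        }
    }
}
```

On the device, the buffer behind `output_data` would need room for `ROWS * COLS * global_size` floats under this assumed layout.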

If I do the transfer this way:


__local float bar = foo[0][0];
*(output_data + get_global_id(0)) = bar;


then it's nice and fast. But if I do the transfer this way:



*(output_data + get_global_id(0)) = foo[0][0];


then it's horribly slow. Really s-l-o-w.

I've gone through the various docs multiple times, but I still can't figure out why this is. If it were a problem with global memory coalescing, wouldn't it show up in both examples? Can anyone enlighten me?

Thanks,

Polly