Hello everybody,

I'm still a beginner with OpenCL and I just started a small project for my university. During this project I experienced some problems which I can't explain.

First, some information about what I'm doing:
The NDRange problem size is 2^20 and the work group size is 2^8. The algorithm just tests all possible values within a given range and returns the result if a collision was found. The kernel is called many times so that a total range of 2^32 combinations can be checked.
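
To make these numbers concrete, here is a minimal host-side C sketch of the launch arithmetic. The partitioning (launch k covers the candidates from k * 2^20 up to (k + 1) * 2^20 - 1) and all function names are assumptions for illustration; the actual host code is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the post. */
#define TOTAL_BITS  32u  /* total search space: 2^32 candidates */
#define GLOBAL_BITS 20u  /* NDRange size per launch: 2^20 work-items */
#define LOCAL_BITS  8u   /* work-group size: 2^8 work-items */

/* Number of kernel launches needed to cover the whole search space. */
uint64_t launches_needed(void) {
    return (UINT64_C(1) << TOTAL_BITS) >> GLOBAL_BITS;
}

/* Number of work-groups in a single launch. */
uint64_t groups_per_launch(void) {
    return (UINT64_C(1) << GLOBAL_BITS) >> LOCAL_BITS;
}

/* First candidate handled by a given launch (assumed partitioning:
   launch k starts at k * 2^20). */
uint64_t launch_base(uint64_t launch) {
    return launch << GLOBAL_BITS;
}

/* Candidate tested by one work-item, given its global id in the launch. */
uint64_t candidate(uint64_t launch, uint64_t global_id) {
    assert(global_id < (UINT64_C(1) << GLOBAL_BITS)); /* id must fit the NDRange */
    return launch_base(launch) + global_id;
}
```

Under these assumptions the full range takes 4096 launches of 4096 work-groups each. At about 1400 ms per launch, that comes to roughly 96 minutes in total.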

The inputs to the kernel are only __constant variables (arrays) and the outputs are only __global variables (arrays). A single kernel runs for about 1400ms.

1) The first problem is that, even though the code runs on my GPU, it runs neither on a Tesla C2050 card from NVIDIA nor on my CPU. What might cause this issue? Is the NDRange too large?
2) The second problem involves a for loop that looks like this:

Code :
... some code
 
var = 0xFFFFFFFF;
 
for(i=0; i<8; i++) {
        var &= func(privateArray) ^ constantArray[i + 0];
}
 
... some code ...
 
for(i=0; i<8; i++) {
        var &= func(privateArray) ^ constantArray[i + 8];
}
... some code ...
The function is a simple sequence of some logical operations.
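
For reference, here is a minimal plain-C sketch of the rolled loop next to a manual unrolling that should behave identically. The body of func() is a hypothetical stand-in (the real function is not shown in this post), and the names rolled() and unrolled() are made up for illustration. It assumes func() has no side effects and that privateArray is not modified inside the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for func(): a few logical operations on a
   small private array. The real function is not shown in the post. */
uint32_t func(const uint32_t *p) {
    return (p[0] & p[1]) ^ (p[2] | ~p[3]);
}

/* Rolled form, as in the post. 'base' is 0 for the first loop and 8
   for the second, so constant_array needs at least 16 elements. */
uint32_t rolled(const uint32_t *priv, const uint32_t *constant_array, int base) {
    assert(base == 0 || base == 8); /* the only offsets used in the post */
    uint32_t var = 0xFFFFFFFFu;
    for (int i = 0; i < 8; i++) {
        var &= func(priv) ^ constant_array[i + base];
    }
    return var;
}

/* Manually unrolled form: one statement per iteration, same order.
   If these statements are merged into a single expression, each XOR
   term needs its own parentheses, because & binds tighter than ^. */
uint32_t unrolled(const uint32_t *priv, const uint32_t *constant_array, int base) {
    assert(base == 0 || base == 8);
    uint32_t var = 0xFFFFFFFFu;
    var &= func(priv) ^ constant_array[base + 0];
    var &= func(priv) ^ constant_array[base + 1];
    var &= func(priv) ^ constant_array[base + 2];
    var &= func(priv) ^ constant_array[base + 3];
    var &= func(priv) ^ constant_array[base + 4];
    var &= func(priv) ^ constant_array[base + 5];
    var &= func(priv) ^ constant_array[base + 6];
    var &= func(priv) ^ constant_array[base + 7];
    return var;
}
```

Comparing the two forms on the host with the same inputs is one way to check whether the unrolling itself is at fault or whether the difference only appears once the kernel is compiled for the device.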

When I manually unroll the first loop, the code breaks. Why? Of course I double checked and the loop is really that simple. I also asked a colleague to have a look at it. There was no obvious error.
I also tried copying constantArray to a private array first and then running these loops on the copy. That did not work either.

3) From my understanding, when accessing memory one must watch out for:
-coalesced memory read/writes for global memory
-bank conflicts for shared memory

What about constant memory? For ATI GPUs, I read that this is located in global memory (=high latency) and that access is cached. If within a warp multiple locations of the constant memory are accessed, then access is serialized -- Is that information correct?

That's what I'm doing in the beginning of each kernel --> accessConstantArray[lid].
Would it be better to use global memory and do a coalesced read (and possibly copy the data to shared memory)?
My only concern is preventing serialization and reducing latency.

4) One part of the computation involves an array shift:
for(i=0; i<100; i++) { array[i] = array[i+1]; }
Is there any better way to do that? ('memcpy' ?!)

I was about to use macros to emulate the array shift (in subsequent operations) -- but how can I ensure that the macros are fully expanded and the calculation of the index does not take place on the GPU?
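
To illustrate the two options, here is a minimal plain-C sketch. It is host-side C only: OpenCL C kernels do not have the C standard library, so the memmove() call would not carry over to a kernel as written. The array length of 101 follows from the loop above reading array[100]. The AT() macro and the function names are hypothetical. The preprocessor always expands macros before compilation, and an index built purely from literals is a constant expression that compilers typically fold, although that last point is an assumption about the compiler rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 101 /* the loop in the post reads array[100] */

/* Physical shift, equivalent to array[i] = array[i + 1] for i in 0..99.
   memmove (not memcpy) is used because source and destination overlap. */
void shift_physical(uint32_t *array) {
    memmove(array, array + 1, (N - 1) * sizeof array[0]);
}

/* Logical shift: the data stays in place and accesses are offset by the
   number of shifts so far. With literal arguments, AT(a, 2, 5) expands
   to ((a)[(2) + (5)]) before the compiler ever sees it. */
#define AT(a, head, i) ((a)[(head) + (i)])

/* Same access as a checked function, for offsets not known at compile
   time. Valid only while head + i stays inside the array. */
uint32_t logical_get(const uint32_t *array, int head, int i) {
    assert(head >= 0 && i >= 0 && head + i < N);
    return AT(array, head, i);
}
```

The logical variant trades the 100 copies per shift for one addition per access, which only pays off if the array is shifted more often than it is read in full.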

Okay ... I really hope I can get some answers.

Apologies that I can't post the full code here.

Just in case any of the problems above is related to the hardware/software I work on, here are the specs:
-GPU: AMD ATI Radeon HD 6750M
-CPU: 2.3 GHz Intel Core i7
-OS: Mac OS X 10.7.3 (Lion)
-IDE: Xcode (latest)