View Full Version : Parallelizing nested loops



Otterz
02-15-2011, 12:38 AM
Hi,

I am new to OpenCL, and I am porting some MPI code I have, as I am hoping to see a benefit from using a GPU.

The portion of the code I am having trouble with updates a 2D array, but it does so using a 5 deep nested loop.



for (int i = 0; i < L + 1; i++) {
    for (int j = 0; j < L + 1; j++) {
        for (int k = 0; k < L + 1; k++) {
            some_conditionals
            for (int l = 0; l < L + 1; l++) {
                some_conditionals
                G = 1.0;
                for (int m = 0; m < L + 1; m++) {
                    some_conditionals
                    G = some_math;
                } // end m loop

                blah[i][j] += some_math;
            } // end l loop
        } // end k loop
    } // end j loop
} // end i loop


My first reaction was to parallelize the outer two loops (i, j), because then each thread can work on a unique blah[i][j]. But that is still too much work per thread. I am doing this on Windows with an ATI 5870, so I need each batch of kernels to complete within the TDR limit, otherwise Windows will kill the kernel.

In the code, the some_conditionals are based on the indices i, j, k, l, m (e.g., i != m).
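To make the 2D (i, j) split concrete, here is a sketch of what each work-item would compute. In the real OpenCL kernel, i and j would come from get_global_id(0) and get_global_id(1); it is written as a plain function here so the structure is easy to see, and the conditionals and math are hypothetical stand-ins for the thread's some_conditionals / some_math:

```cpp
// Per-work-item body for the 2D (i, j) parallelization.
// In the kernel: i = get_global_id(0), j = get_global_id(1),
// and the result is accumulated into blah[i * (L + 1) + j].
// The index tests and arithmetic below are placeholders, not the
// original poster's actual math.
double update_cell(int i, int j, int L)
{
    double acc = 0.0;
    for (int k = 0; k < L + 1; k++) {
        if (k == i) continue;              // stand-in conditional
        for (int l = 0; l < L + 1; l++) {
            if (l == j) continue;          // stand-in conditional
            double G = 1.0;
            for (int m = 0; m < L + 1; m++) {
                if (m == i) continue;      // stand-in conditional
                G *= 1.0;                  // stand-in for G = some_math
            }
            acc += G;                      // stand-in for some_math
        }
    }
    return acc;  // kernel would do blah[i][j] += acc
}
```

Each work-item still does the full O(L^3) inner work, which is why the launches run long for large L.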

To parallelize i and j, I just use a 2D NDRangeKernel like so:



err = queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(L + 1, L + 1),
    cl::NDRange(1, 1),
    NULL,
    &event);
checkErr(err, "CommandQueue::enqueueNDRangeKernel()");


I would have liked to use a 3D NDRange kernel (parallelizing i, j, k), but then I would need to perform some kind of reduction on blah[i][j], which I don't know how to do yet. Am I on the right track? Any suggestions?
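For reference, the usual OpenCL pattern for that 3D case is: each work-group owns one (i, j) cell, its work-items split the k range and write partial sums into __local memory, and the group then combines them with a tree reduction separated by barrier(CLK_LOCAL_MEM_FENCE). Below is a plain serial sketch of just the tree-combine step (the parallel work-items are simulated by an inner loop; names are illustrative, not from the thread):

```cpp
// Simulates the in-group reduction an OpenCL kernel would do in
// __local memory: after each work-item stores its partial sum for its
// slice of k in partial[local_id], the active half of the group adds
// in the other half until partial[0] holds the total. In real OpenCL
// each halving step is followed by barrier(CLK_LOCAL_MEM_FENCE).
// group_size must be a power of two.
double tree_reduce(double *partial, int group_size)
{
    for (int stride = group_size / 2; stride > 0; stride /= 2) {
        for (int lid = 0; lid < stride; lid++) {  // "all work-items at once"
            partial[lid] += partial[lid + stride];
        }
        // barrier(CLK_LOCAL_MEM_FENCE) would go here
    }
    return partial[0];  // work-item 0 adds this into blah[i][j]
}
```

Work-item 0 is then the only one that touches global memory for that (i, j), so no atomics are needed.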

I also tried breaking up the third loop, running a host-side loop that queues the kernel with an additional argument (k=__).

Even that seems to be too much work per kernel: with L > 60 it still triggers the Windows TDR, which kills the kernel.

I am learning OpenCL as I go, and my background is MPI.

Thanks!

david.garcia
02-15-2011, 10:52 AM
But that is still too much work for each thread. I am doing this on Windows with a ATI 5870, so I want the batches of kernels to complete within the TDR, otherwise windows will kill the kernel.

Assuming that you don't want to change the 5-second timeout (http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx), the easiest fix is to make multiple calls to clEnqueueNDRangeKernel() instead of a single one. You can use the global_work_offset (http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueNDRangeKernel.html) argument to partition the work into smaller pieces.
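As a sketch of that suggestion, applied to the enqueue call from the first post: split the i range into chunks of a few rows each and enqueue one launch per chunk, passing the chunk's start as the offset. The chunking arithmetic is runnable below; the enqueue itself is shown in a comment since it needs a live OpenCL context, and rows_per_launch is a tuning knob you would pick so one launch finishes well inside the TDR window:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Splits the i range [0, L + 1) into chunks of at most rows_per_launch
// rows. Each pair is (offset, size) for the first NDRange dimension.
std::vector<std::pair<int, int>> make_chunks(int L, int rows_per_launch)
{
    std::vector<std::pair<int, int>> chunks;
    for (int off = 0; off < L + 1; off += rows_per_launch)
        chunks.push_back({off, std::min(rows_per_launch, L + 1 - off)});
    return chunks;
}

// Usage against the thread's enqueue call (needs a live context/queue):
// for (auto [off, rows] : make_chunks(L, 8)) {
//     err = queue.enqueueNDRangeKernel(
//         kernel,
//         cl::NDRange(off, 0),        // global_work_offset
//         cl::NDRange(rows, L + 1),   // this chunk's global size
//         cl::NDRange(1, 1),
//         NULL, &event);
//     checkErr(err, "CommandQueue::enqueueNDRangeKernel()");
// }
```

Since OpenCL 1.1, get_global_id() already includes the offset, so the kernel body does not need to change.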