PDA

View Full Version : Iterations And GlobalRange Difficulty (Related?)



wilson
04-03-2010, 06:55 PM
Hi everyone!

I'm benchmarking a parallelized algorithm and I wanted to run 1000 iterations of it.

The thing is, my global range for a single iteration is 82369. To get those 1000 iterations I decided to multiply 82369 by 1000 and then compute the offset inside the kernel. The problem is that this raises CL_OUT_OF_RESOURCES at enqueueReadBuffer.

Then I tried a for loop inside the kernel, 1000 times, keeping the global range at 82369, but the same CL_OUT_OF_RESOURCES came out, also at enqueueReadBuffer.

Then I tried some variations, like 82369*20 with a for loop of 50 iterations, but CL_OUT_OF_RESOURCES always came out.

I couldn't find any explanation for this, nor any information about the relation between the global range and the number of loop iterations inside the kernel.

If someone has an idea why this happens, I would appreciate it (if this is even possible).

This is the kernel code (most of it):


__kernel void square(__constant unsigned long NBTS,
                     __constant unsigned long GRID_SIZE_X,
                     __constant unsigned long GRID_SIZE_Y,
                     __constant unsigned long radix,
                     __global unsigned long *BTSET,
                     __global long *fitness,
                     __global int *covered_points)
{
    long globalID = get_global_id(0);

    __local long GRID_SIZE;

    if (globalID == 0)
        GRID_SIZE = GRID_SIZE_X * GRID_SIZE_Y;

    barrier(CLK_LOCAL_MEM_FENCE);

    int contador;
    long x, y;
    long x_k, y_k;
    long x1, y1, rx, ry;

    float cover_rate, fit;
    int check = 0, j, i, m;

    // offset
    long idy = globalID / GRID_SIZE;
    long idx = globalID - idy * GRID_SIZE;

    if (idx == 0)
    {
        *covered_points = 0;
    }

    for (j = 0; j < 1; j++) // <--- HERE'S WHERE I PUT THE 1000 ITERATIONS
    {
        contador = 0;

        for (m = 0; m < NBTS; m++)
        {
            if (BTSET[m] == idx)
            {
                ...
            }
        }

        if (check != 1)
        {
            for (i = 0; i < NBTS; i++)
            {
                ...
            }
        }
    }
}


If you would like some more information, please just ask.

Thanks in advance ...

dbs2
04-04-2010, 05:03 AM
Two questions:
1) What is your total global size? (I believe the NVIDIA driver currently has a terribly small limitation of a global size of 65k or something.)
2) What is GRID_SIZE set to if you're not work-item 0? It seems like it is uninitialized.

Rui
04-04-2010, 05:25 PM
I'm experiencing the same problem executing two nested for loops. Given the following code (launching 44100 work-items, the size of the addresses array), the card driver eventually crashes. It only allows me to run the outer for loop about 100 times. I don't know if it is due to memory usage..



long idx = get_global_id(0);
long x_p, y_p;
long x_a, y_a, rx, ry;

y_p = addresses[idx] / 287;
x_p = addresses[idx] - y_p * 287;

for (int i = 0; i < 1000; i++) {

    for (int j = 0; j < 49; j++) {

        y_a = bt[j] / 287;
        x_a = bt[j] - y_a * 287;

        rx = abs_diff(x_a, x_p);
        ry = abs_diff(y_a, y_p);

        if (rx <= 30 && ry <= 30)
            addresses[idx] = 1;
    }
}



What could be preventing it from running as it is supposed to?

Thanks

wilson
04-04-2010, 05:55 PM
Hi!
Thanks for the reply!
About question no. 2, I did it that way to try to optimize, but now I realize that I only initialize it for the first work-group :)

But even if I make it global, the problem persists.

The global size is at minimum 82369, but with this value I need to do 1000 iterations per kernel. And this is the problem, because it returns CL_OUT_OF_RESOURCES and I can't find any reason why this happens.

The reason I talked about multiplying 82369 by 1000 was to try to replace the loop inside the kernel with more work-items, if you know what I mean!

Thanks in advance!

dbs2
04-05-2010, 08:00 AM
Are you using NVIDIA's drivers? If so, check their release notes, because I don't think they support a global size > 65,535.

Also, if your kernel is taking too long (say, longer than 5 seconds), the system watchdog timer on most machines will kill your program. This is true on Mac OS X and, I believe, Windows/Linux, unless you are using a dedicated (i.e., non-display) card for computing.

Rui
04-05-2010, 08:03 AM
Yes, Nvidia drivers here...

Kratzy974
04-05-2010, 11:27 PM
A global memory size of around 140 MB is allowed. Constant memory is limited to 64 KB, and local memory depends on the chip, between 16 and 32 KB.

I had a similar problem when using loops that were too large. So my outer loop isn't in the kernel anymore; instead I use more work-groups (several thousand). If all work-groups use the same data, you should get a good timing measurement.

dbs2
04-06-2010, 05:52 AM
I think there's some confusion here. I'm not talking about global memory size when I say there is (was?) a limitation in the NVIDIA drivers. I have heard several people say that global_x*global_y*global_z must be <= 65535 to run on NVIDIA's drivers. Can anyone confirm whether this is still the case? I know it's not a hardware limitation, since the same card will happily run arbitrarily large global sizes under Mac OS X.

wilson
04-07-2010, 02:56 PM
Hello everyone!

Just writing to say that my problem was solved.

The problem wasn't the global range size, but bad management of resources in the kernel. Once I optimized my kernel, and by this I mean defining a proper local range and declaring only the needed variables with the needed types, it worked.

And that was it.

So, answering dbs2: I believe that the NVIDIA 65535 limit doesn't exist, at least with the latest drivers available!

Thank you all for the help!

Take care!