PDA

View Full Version : Block edges updates

arthur.sw
04-06-2011, 06:31 AM
Hello,

I am working on an algorithm where each pixel needs its four neighbors.

So for each 16*16 block of threads, I want to copy the global memory into 18*18 shared memory blocks (This way pixels on block edges will have correct neighbors) :

-> [z z z z z]
[x x x] -> [z x x x z]
[x x x] -> [z x x x z]
[x x x] -> [z x x x z]
-> [z z z z z]

This is my simple code:

__kernel void test(__global float* u, __local volatile float* uLocal)
{
// Block index
int bi = get_group_id(1);
int bj = get_group_id(0);

// Local coordinates
int li = get_local_id(1)+1;
int lj = get_local_id(0)+1;

// Global coordinates
int gi = get_global_id(1);
int gj = get_global_id(0);

// Local height and width
int lHeight = get_local_size(1)+2;
int lWidth = get_local_size(0)+2;

// Global height and width
int gHeight = get_global_size(1);
int gWidth = get_global_size(0);

if( (gi-1<0) || (gi+1>gHeight-1) || (gj-1<0) || (gj+1>gWidth-1) )
return;

uLocal[li*lWidth+lj] = u[gi*gWidth+gj];

for(int i=0 ; i<N ; i++)
{
// put edges in local memory
if(li-1 == 0)
uLocal[(li-1)*lWidth+lj] = u[(gi-1)*gWidth+gj];
else if(li+1 == lHeight-1)
uLocal[(li+1)*lWidth+lj] = u[(gi+1)*gWidth+gj];

if(lj-1 == 0)
uLocal[li*lWidth+lj-1] = u[gi*gWidth+gj-1];
else if(lj+1 == lWidth-1)
uLocal[li*lWidth+lj+1] = u[gi*gWidth+gj+1];

//do something with uLocal...
uLocal[li*lWidth+lj] = 0.5*uLocal[li*lWidth+lj+1] - 0.5*uLocal[li*lWidth+lj-1];

barrier(CLK_LOCAL_MEM_FENCE);

u[gi*gWidth+gj] = uLocal[li*lWidth+lj];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}

It does not work (even without the for loop).

It is different from the oclMedianFilter example (which also read and write edges), I really don't understand the point of this complex code.

What should I do ??
Help !

david.garcia
04-06-2011, 07:08 AM
Could you elaborate a bit on how the code doesn't work? Do you have sample inputs and outputs? How are the outputs different from what you expected?

arthur.sw
04-07-2011, 12:47 AM
Well, I have 3 problems, but what I want to know is: is my code correct ? Why does the "edge update" differ from the oclMedianFilter example ?

On of my problem is that with my previous code, I get some vertical black lines between the blocks depending on which thread has been executed first.

My second problem:

I tried this:

__kernel void test2(__global float* u, __local volatile float* uLocal)
{
// same init...

uLocal[li*lWidth+lj] = 0;

if( bi == 2 && bj == 2 )
uLocal[li*lWidth+lj] = u[gi*gWidth+gj];

if( bi == 2 && bj == 4 )
{
uLocal[li*lWidth+lj] = u[gi*gWidth+gj];
if(li-1 == 0)
uLocal[(li-1)*lWidth+lj] = 1;
else if(li+1 == lHeight-1)
uLocal[(li+1)*lWidth+lj] = 1;

if(lj-1 == 0)
uLocal[li*lWidth+lj-1] = 1;
else if(lj+1 == lWidth-1)
uLocal[li*lWidth+lj+1] = 1;

// do something....

if(li-1 == 0)
u[(gi-1)*gWidth+gj] = uLocal[(li-1)*lWidth+lj];
else if(li+1 == lHeight-1)
u[(gi+1)*gWidth+gj] = uLocal[(li+1)*lWidth+lj];

if(lj-1 == 0)
u[gi*gWidth+gj-1] = uLocal[li*lWidth+lj-1];
else if(lj+1 == lWidth-1)
u[gi*gWidth+gj+1] = uLocal[li*lWidth+lj+1];
}

u[gi*gWidth+gj] = uLocal[li*lWidth+lj];
}

And the bottom edge of the second block does note appear.
(I know that this program is not a perfect example because execution paths diverge)

And my third problem concerns the for loop:
How to synchronize threads when I must update edges (the threads on blocks' edges will have more work) in the for loop. The barrier() and if() else() statements are not compatible and the whole program is skipped when threads are not synchronized (the GPU does nothing, the execution time is almost 0).

Maybe I must remove the for loop from the kernel and put the kernel into a for loop, but I guess that would require many clEnqueueRead/WirteBuffer and slow down the algorithm.

Thanks for helping !
Arthur

david.garcia
04-07-2011, 05:04 AM
Thanks for the additional info. Notice that you are changing the image in place. What measures are you taking to prevent a previous work-group from updating the image before you load the border pixels into local memory?

Let's say that your work-groups are of size 3x3 and your original image looks like this:

a a a b b b
a a a b b b
a a a b b b

What you want is each work group to load a tile of data into local memory like this:

x x x x x
x a a a b
x a a a b
x a a a b
x x x x x

x x x x x
a b b b x
a b b b x
a b b b x
x x x x x

However, with the code you've shown, when work-group a (3x3) is executed the image is updated and now looks like this:

c c c b b b
c c c b b b
c c c b b b

...so when it's the turn of work-group 'b' it will read this in local memory:

x x x x x
c b b b x
c b b b x
c b b b x
x x x x x

x x x x x
a b b b x
a b b b x
a b b b x
x x x x x

oclMedianFilter possibly gets away with it because the values from 'a' and 'c' look very similar (it's computing a median filter after all). However, in your example 'a' and 'c' will look very different and that's why you notice the artifacts.

You will have to read from an image X and write into a different image Y if you wan to avoid the artifacts.

arthur.sw
04-07-2011, 05:32 AM
Thank you very much for this clear answer :-) !
If I understand correctly, this problem can not be fixed with barrier or mem_fence functions since it would only synchronize memory within blocks ?

Anyway I tried to write in another variable, that works well, thanks.

I still have the other problems though ;-)

david.garcia
04-07-2011, 06:00 AM
If I understand correctly, this problem can not be fixed with barrier or mem_fence functions since it would only synchronize memory within blocks ?

That's right. There's no way in OpenCL to communicate between work-groups.

I'll look at the other questions after work :)

arthur.sw
04-07-2011, 07:58 AM
ahah, thank you :-)

For my loop problem, I guess there is no other way than executing the kernel n times (instead of having the for loop inside the kernel).

It works but it really slows down the program...

And I still don't know why the local edges did not update (my second problem), I tried to write the result in another variable and that did not fix the problem...

david.garcia
04-07-2011, 03:05 PM
It works but it really slows down the program...

Does it make such a big difference? What size is the image and how do you make the calls to repeat the operation multiple times?

And I still don't know why the local edges did not update (my second problem)

It's basically the same problem as your first one, right? The white borders you are drawing in work group (2,4) are overwritten by the other work groups. This code is causing the issue:

if(li-1 == 0)
u[(gi-1)*gWidth+gj] = uLocal[(li-1)*lWidth+lj];
else if(li+1 == lHeight-1)
u[(gi+1)*gWidth+gj] = uLocal[(li+1)*lWidth+lj];

if(lj-1 == 0)
u[gi*gWidth+gj-1] = uLocal[li*lWidth+lj-1];
else if(lj+1 == lWidth-1)
u[gi*gWidth+gj+1] = uLocal[li*lWidth+lj+1];

If you want it to go away then comment the last line in the kernel:

//u[gi*gWidth+gj] = uLocal[li*lWidth+lj];

arthur.sw
04-08-2011, 12:10 AM
Yes that's right, this is the same problem. I realized it when I woke up this morning ;-)