Hey,

I am trying to synchronize all workgroups using a global variable as a semaphore. My barrier function inside the kernel is as follows:

Code :
#define WORKGROUP_COUNT 15
#define THREAD0_LOCAL (idx_Local == 0)
 
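inline void barrierGlobalRamp(__global volatile int* volatile synch, int idx_Local, int barrierIdx, char *direction)
{
	// Order this work-item's earlier loads and stores before the barrier logic.
	mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
	// Only the first work-item of each workgroup touches the global counter.
	if (THREAD0_LOCAL)
	{
		bool goOutFlag = 0;
		// BARRIER_INCREASE and BARRIER_DECREASE are defined elsewhere (not shown).
		switch (*direction)
		{
			case BARRIER_INCREASE:
				atomic_inc(&synch[barrierIdx]);
				// Spin until every workgroup has incremented the counter.
				while (!goOutFlag)
					if (synch[barrierIdx] >= WORKGROUP_COUNT)
						goOutFlag = 1;
				*direction = BARRIER_DECREASE;
				break;
			case BARRIER_DECREASE:
				atomic_dec(&synch[barrierIdx]);
				// Spin until every workgroup has decremented the counter back to 0.
				while (!goOutFlag)
					if (synch[barrierIdx] <= 0)
						goOutFlag = 1;
				*direction = BARRIER_INCREASE;
				break;
		}
	}
	// The rest of the workgroup waits here until work-item 0 leaves the spin loop.
	barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
	return;
}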

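The first thread of each workgroup atomically increments (or decrements, depending on the direction variable) synch[barrierIdx], then spins until the value reaches the total number of workgroups (or drops back to 0 in the decreasing case). The remaining threads of the workgroup wait at the barrier() call at the end of the function.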
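
To make the ramp idea concrete, here is a minimal sequential model in plain C (not OpenCL, and not the kernel itself). It only assumes that every workgroup starts with the same direction and contributes exactly one increment or decrement per barrier; simulate_ramp and the enum values are hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for the constants used by the kernel. */
enum { BARRIER_INCREASE = 0, BARRIER_DECREASE = 1 };

/* Sequential model: run `rounds` consecutive barriers with `workgroups`
 * participants and return the final counter value. Each workgroup's
 * thread 0 contributes exactly one increment or decrement per barrier. */
int simulate_ramp(int workgroups, int rounds)
{
	int counter = 0;                    /* models synch[barrierIdx] */
	char direction = BARRIER_INCREASE;  /* same start value in every workgroup */

	for (int r = 0; r < rounds; ++r) {
		/* Every workgroup arrives once at this barrier. */
		for (int wg = 0; wg < workgroups; ++wg)
			counter += (direction == BARRIER_INCREASE) ? 1 : -1;

		/* Exit condition that each spinning thread 0 is waiting for. */
		if (direction == BARRIER_INCREASE) {
			assert(counter >= workgroups);
			direction = BARRIER_DECREASE;
		} else {
			assert(counter <= 0);
			direction = BARRIER_INCREASE;
		}
	}
	return counter;
}
```

Because the counter ramps up to the workgroup count and then back down to 0, the same synch[barrierIdx] slot can be reused for consecutive barriers without resetting it, provided every workgroup actually arrives each time.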

I am using a GTX 570 card, which has 15 SMs, and this code works if the number of workgroups (WORKGROUP_COUNT) is 15 or less.

The problem, however, is that at least some workgroups never return from the function if the number of workgroups is set to 16 or higher. Does anyone have any idea how this might happen?

My initial guess is that one WG is starved by the rival WG sharing its SM and never enters the function, but I'm pretty sure there is more to it!
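
One way to state this guess as a toy model in plain C: suppose only a limited number of workgroups can be running at once, and a pending workgroup starts only after a running one has finished. Both are assumptions for illustration, not confirmed behaviour of the GTX 570 (how many workgroups fit on an SM depends on workgroup size and on register and local memory usage), and barrier_can_complete is a hypothetical helper.

```c
/* Toy model of the starvation guess.
 * total_wgs:    number of workgroups in the launch (WORKGROUP_COUNT).
 * resident_wgs: ASSUMED number of workgroups that can run concurrently.
 * Returns 1 if the spin barrier can be released, 0 if it would hang. */
int barrier_can_complete(int total_wgs, int resident_wgs)
{
	/* Only running workgroups can reach the barrier and bump the counter. */
	int arrived = (total_wgs < resident_wgs) ? total_wgs : resident_wgs;

	/* The spinners exit only once the counter reaches total_wgs. If some
	 * workgroup is still pending, the running ones never finish, so under
	 * the assumption above the pending one never starts. */
	return arrived >= total_wgs;
}
```

Under these assumptions, barrier_can_complete(15, 15) returns 1 and barrier_can_complete(16, 15) returns 0, which would match what I see if only one workgroup fits on each SM.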

Any hint is appreciated.