Hi,

I am new to OpenCL and have been tasked with some image processing work.

I am passing two sets of YUV data (left and right images of size 320 x 16) to a kernel function, which computes the horizontal gradient of each pixel using SAD (sum of absolute differences). For my first output I used only one set of YUV data, and it works nicely: the values are the same as those computed by the CPU.
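
A minimal CPU reference for this gradient might look like the sketch below. It is an illustration under assumptions, not the actual CPU code: the names `sad_yuv` and `grad_h_ref` are hypothetical, the U and V planes are assumed to have the same resolution as Y (the kernel indexes all three with the same index), and the sum is assumed to saturate at 255, which is what the `> 255` check in the kernel appears to intend.

```c
#include <stdlib.h>

/* Saturating sum of absolute differences between pixels a and b
   across the Y, U and V planes of one image. */
static unsigned char sad_yuv(const unsigned char *y, const unsigned char *u,
                             const unsigned char *v, int a, int b)
{
    int sum = abs(y[a] - y[b]) + abs(u[a] - u[b]) + abs(v[a] - v[b]);
    return (unsigned char)(sum > 255 ? 255 : sum);
}

/* Horizontal gradient: every pixel is compared with its right-hand
   neighbour; the last column of each row reuses its left-hand neighbour. */
void grad_h_ref(unsigned char *grad, const unsigned char *y,
                const unsigned char *u, const unsigned char *v,
                int width, int height)
{
    for (int i = 0; i < width * height; ++i) {
        int a = (i % width == width - 1) ? i - 1 : i;
        grad[i] = sad_yuv(y, u, v, a, a + 1);
    }
}
```

Calling `grad_h_ref` once per image (left, then right) gives two arrays that the two GPU output buffers can be compared against element by element.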

However, when I add the second set of YUV data to the kernel function and compute the second output array, nothing seems to work for that array. I even tried hard-coding all of its elements to 5, but the output array still shows values of its own.

Here is the kernel function implementation (output arrays are the first 2 parameters):

const char grad_l_h_cl[] = " \
__kernel void grad_l_h \
( \
  __global unsigned char* img_grad_left_hor \
, __global unsigned char* img_grad_right_hor \
, __global unsigned char* p1_y \
, __global unsigned char* p1_u \
, __global unsigned char* p1_v \
, __global unsigned char* p2_y \
, __global unsigned char* p2_u \
, __global unsigned char* p2_v \
, int width \
, int height \
) \
{ \
    /* The source is spliced into one line, so only block comments are safe. */ \
    const int index = (int)get_global_id(0); \
    /* Ignore work-items that fall outside the width x height image. */ \
    if (index >= width * height) return; \
    /* The last column compares with its left neighbour, all others with the right. */ \
    const int a = (index % width == width - 1) ? index - 1 : index; \
    const int b = a + 1; \
    /* uchar operands promote to int, so the sums cannot wrap before the clamp. */ \
    const int left  = abs_diff(p1_y[a], p1_y[b]) + abs_diff(p1_u[a], p1_u[b]) + abs_diff(p1_v[a], p1_v[b]); \
    const int right = abs_diff(p2_y[a], p2_y[b]) + abs_diff(p2_u[a], p2_u[b]) + abs_diff(p2_v[a], p2_v[b]); \
    /* Saturate to 255 before storing into the 8-bit outputs. */ \
    img_grad_left_hor[index]  = (unsigned char)min(left, 255); \
    img_grad_right_hor[index] = (unsigned char)min(right, 255); \
} \
";

Here is how I perform the operation (g_worksize = 8 x 320 x 168, l_worksize = 256):

error=clEnqueueNDRangeKernel(cq, k_cfg, 1, NULL, &g_worksize, &l_worksize, 0, NULL, NULL);
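
In OpenCL 1.x, when a local work size is passed, the global work size has to be a multiple of it, otherwise `clEnqueueNDRangeKernel` returns `CL_INVALID_WORK_GROUP_SIZE`. Since the kernel writes one output element per work-item, the global size should also be derived from the number of elements in the buffers. A small helper for that could look like this sketch (the name `round_up_global` is hypothetical, and it assumes one work-item per output pixel):

```c
#include <stddef.h>

/* Smallest multiple of `local` that is >= `items`.
   `local` must be non-zero. */
size_t round_up_global(size_t items, size_t local)
{
    return ((items + local - 1) / local) * local;
}
```

For a 320 x 16 image and a local size of 256 this gives 5120, which is already an exact multiple. When the rounded value is larger than the element count, the extra work-items need a bounds check in the kernel so that they do not read or write past the end of the buffers.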

I have created the input buffers like this (worksize = 320 x 16):

memp1_u=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);

I suspect some memory settings are required to hold the data in the second output array, but I have no idea how to set them up.

Please kindly help or advise.

*My system is:

GT220 - 6 multiprocessors, 48 CUDA cores, Compute Capability 1.2

GPU Computing SDK 3.2

WinXP Pro