I've been trying to triangulate a really stubborn bug for the past couple days. I'm testing an algorithm to compute high dimensional integrals using a parallel "divide and conquer" approach. The outline is fairly simple: starting from two input arrays of all boundary data, each integration work item gets the boundary data for its subregion, computes integral and error estimates, and writes these values into respective output arrays.

I'm building everything on a MacBook Air with an NVIDIA GeForce 320M using Apple's distribution of OpenCL 1.0.

I want to emphasize that when I run the OpenCL code for CL_DEVICE_TYPE_CPU, the algorithm works perfectly. The computations concur with the mathematically exact values up to numerical tolerance.

It is when I run the code for CL_DEVICE_TYPE_GPU that problems begin.

Instead of posting all of my code, I'll try to highlight what I think is relevant. The prototype of my kernel looks like this:
Code :
__kernel void
mgk13s(int dim, __global void* params, 
	__global const float* a, __global const float* b,
	__global float* result, __global float* rawerr, __global float* resabs)
In my main routine, I'm denoting the number of work items by global_ws. The dimension of the domain (and hence boundary data) is denoted by dim. Allocating memory, buffers and setting up kernel arguments looks like this:
Code :
	size_t out_size = global_ws * sizeof(float);
	size_t in_size =  dim * out_size;
	float* a = (float*) malloc(in_size);
	float* b = (float*) malloc(in_size);
	float* result = (float*) malloc(out_size);
	float* rawerr = (float*) malloc(out_size);
	float* resabs = (float*) malloc(out_size);
 
	cl_mem par_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(params), NULL, NULL);
	cl_mem a_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, in_size, NULL, NULL);
	cl_mem b_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, in_size, NULL, NULL);
	cl_mem res_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, out_size, NULL, NULL);
	cl_mem err_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, out_size, NULL, NULL);
	cl_mem abs_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, out_size, NULL, NULL);
 
	err  = clSetKernelArg(kernel, 0, sizeof(int), &dim);
	err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &par_buf);
	err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &a_buf);
	err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &b_buf);
	err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &res_buf);
	err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &err_buf);
	err |= clSetKernelArg(kernel, 6, sizeof(cl_mem), &abs_buf);
The code which executes the black box outlined above looks like this:
Code :
	err = clEnqueueWriteBuffer(queue, par_buf, CL_TRUE, 0, sizeof(params), params, 0, NULL, NULL);
	err |= clEnqueueWriteBuffer(queue, a_buf, CL_TRUE, 0, in_size, a, 0, NULL, NULL);
	err |= clEnqueueWriteBuffer(queue, b_buf, CL_TRUE, 0, in_size, b, 0, NULL, NULL);
	if (err != CL_SUCCESS) ABORT(err);
 
	err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
	if (err != CL_SUCCESS) ABORT(err);
	err = clEnqueueBarrier(queue);
 
	err = clEnqueueReadBuffer(queue, res_buf, CL_TRUE, 0, out_size, result, 0, NULL, NULL);
	if (err != CL_SUCCESS) ABORT(err);
	err |= clEnqueueReadBuffer(queue, err_buf, CL_TRUE, 0, out_size, rawerr, 0, NULL, NULL);
	err |= clEnqueueReadBuffer(queue, abs_buf, CL_TRUE, 0, out_size, resabs, 0, NULL, NULL);
The ABORT macro just displays the line number and CL_error information, which is how I'm isolating the problem.

For certain parameter values, the first clEnqueueReadBuffer call returns an error of type CL_INVALID_COMMAND_QUEUE, although clEnqueueNDRangeKernel call returns CL_SUCCESS.

Any ideas here?

Incidentally, global and local work sizes for this example are, local_ws = 64, global_ws = 4096. These are set at the beginning, using an algorithm that distinguishes between the CPU and GPU:
Code :
	size_t local_ws = (DEVICE == CL_DEVICE_TYPE_GPU) ? 64 : 1;
	size_t factor = N / local_ws;
	size_t global_ws = (N % local_ws) ? ((factor + 1) * local_ws) : N;