I'm trying to move some calcs onto OpenCL, but just calling empty kernels is causing quite a slowdown...

The initial serial calcs are effectively in a loop, calculating flows/pressure every xth of a second in a fluid network. All the flows have 2 associated pressure nodes and each node has 1 or more associated flows.

The basic structure (all on the CPU) is:

    Set Initial Conditions at time t = 0.0
    repeat
        for i = 0 to number pipes
            calc flows between 2 nodes (pipe)
        for i = 0 to number valves
            calc flows between 2 nodes (valve)
        for i = 0 to number nodes
            calc new pressures given flow for dT secs
        t += dT
    until bored
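For reference, here is a minimal C sketch of that serial loop. The `Link` struct, the linear flow law, and the `capacity` parameter are placeholders made up for illustration; the real pipe and valve equations are more involved. Only the ordering (pipes, valves, then nodes, then `t += dT`) is taken from the outline above.

```c
#include <assert.h>

/* Hypothetical layout: every link (pipe or valve) joins two pressure nodes. */
typedef struct { int from, to; double conductance; double flow; } Link;

/* Placeholder flow law (an assumption, not the real model):
   flow is proportional to the pressure drop across the link. */
static void calc_flows(Link *links, int n, const double *pressure) {
    for (int i = 0; i < n; ++i)
        links[i].flow = links[i].conductance *
                        (pressure[links[i].from] - pressure[links[i].to]);
}

/* Node-centric update: each node sums the flows of its own links,
   so no two iterations write to the same pressure entry. */
static void calc_pressures(double *pressure, int n_nodes,
                           const Link *links, int n_links,
                           double capacity, double dT) {
    assert(capacity > 0.0);
    for (int n = 0; n < n_nodes; ++n) {
        double net = 0.0;
        for (int i = 0; i < n_links; ++i) {
            if (links[i].to == n)   net += links[i].flow;
            if (links[i].from == n) net -= links[i].flow;
        }
        pressure[n] += net * dT / capacity;
    }
}

/* links holds n_pipes pipes followed by n_valves valves.
   Runs a fixed number of steps (a stand-in for "until bored")
   and returns the final simulation time. */
static double simulate(Link *links, int n_pipes, int n_valves,
                       double *pressure, int n_nodes,
                       double capacity, double dT, int steps) {
    double t = 0.0;
    for (int k = 0; k < steps; ++k) {
        calc_flows(links, n_pipes, pressure);             /* pipes  */
        calc_flows(links + n_pipes, n_valves, pressure);  /* valves */
        calc_pressures(pressure, n_nodes, links,
                       n_pipes + n_valves, capacity, dT); /* nodes  */
        t += dT;
    }
    return t;
}
```

Each of the three inner loops is independent per index, which is what makes them candidates for one kernel each.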

Obviously, I want to make these calcs parallel so I can do all 'n' associated calcs in one call.

I have created some kernels (one for pipe flow, one for valve flow and one for the pressure nodes) and set up the args and buffers using clCreateBuffer, clSetKernelArg and clEnqueueWriteBuffer. However, changing the above code to call empty kernels results in a slowdown compared to doing the full serial calcs.

I'm effectively performing, after copying data around:

    Setup Data for Initial Conditions at time t = 0.0

    clEnqueueWriteBuffer(s) of all data required on device.

    repeat
        clEnqueueNDRangeKernel( pipe_kernel )
        clEnqueueNDRangeKernel( valve_kernel )
        clEnqueueNDRangeKernel( pressure_kernel )
        t += dT
    until bored

But this loop exhibits a considerable slowdown even though the kernels are empty.

The actual kernels are listed below, but does anyone know why clEnqueueNDRangeKernel calls should be so slow? I would have assumed that once the data had been copied across to the device, clEnqueueNDRangeKernel just operated on the data in place, so there wouldn't be a slowdown.

I'm using the Intel SDK and developing the code on the CPU.
The test environment I'm working on consists of 1 valve, 18 nodes and 16 pipes.
All local work-group sizes are set to 16 and all global sizes are set to an integer multiple of 16.
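To make the sizing concrete, this is roughly how the global sizes work out. The helper names `padded_global_size` and `idle_work_items` are made up for this sketch; they are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Round n_items up to the next multiple of local_size, so the result
   can be passed as a global size alongside that local size. */
static size_t padded_global_size(size_t n_items, size_t local_size) {
    assert(local_size > 0);
    return ((n_items + local_size - 1) / local_size) * local_size;
}

/* Work-items that exist only as padding; in the kernels these are the
   ones that leave immediately at the "iBase >= iNum..." check. */
static size_t idle_work_items(size_t n_items, size_t local_size) {
    return padded_global_size(n_items, local_size) - n_items;
}
```

With a local size of 16, the 16 pipes need a global size of 16, the single valve also gets 16 (15 work-items idle), and the 18 nodes get 32 (14 idle).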

Has anyone got any ideas why calling clEnqueueNDRangeKernel on empty kernels should be so slow? I've not even got round to optimising the calculations or getting the results back onto the host...

Note: The kernels have been gradually reduced to empty by commenting out code while trying to find where the speed bottleneck was.


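// On OpenCL 1.0/1.1 the cl_khr_fp64 extension must be enabled before double is used.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void k_A( __global double* dGlobalA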
,__global double* dGlobalB
,__global double* dGlobalC
,__constant int* i_pA
,__constant int* i_pB
,__constant int* b_pC
,__constant int* b_pD
,__constant double* d_pE
,__constant double* d_pF
,__constant double* d_pG
,__constant double* d_pH
,__constant double* d_pI
,__constant double* d_pJ
,__global double* dGlobalD
,__global double* dGlobalE
, double d_pK
, double d_pL
, double d_pM
, double d_pN
, int iNumD )
{
    long iBase = get_global_id(0) ;

    // Padding work-items beyond the real item count have nothing to do.
    if (iBase >= iNumD )
        return ;
}

__kernel void k_B( __global double* dGlobalA
,__constant double* d_rA
,__constant int* i_rB
,__constant int* i_rC
,__constant int* i_rD
,__global double* dGlobalF
,__global int* iGlobalG
,__constant double* d_rE
,__constant double* d_rF
,__constant double* d_rG
,__global double* dGlobalH
, double d_rH
, double d_rI
, int iNumF )
{
    long iBase = get_global_id(0) ;

    // Padding work-items beyond the real item count have nothing to do.
    if (iBase >= iNumF )
        return ;
}

__kernel void k_C(__global double* dGlobalA
, __constant double* dGlobalI
, __global double* dGlobalJ
, __global double* dGlobalB
, __global double* dGlobalC
, __constant double* dGlobalK
, __constant int* i_nA
, __global double* dGlobalD
, __global double* dGlobalH
, __constant int* i_nB
, __constant int* i_nC
, __constant int* i_nD
, __constant int* i_nE
, __constant int* i_nF
, __constant int* i_nG
, __constant int* i_nH
, __constant int* i_nI
, __constant int* i_nJ
, __constant int* i_nK
, __constant int* i_nL
, __constant int* i_nM
, __constant int* i_nN
, __constant int* i_nO
, __constant int* i_nP
, __constant int* i_nQ
, __constant int* i_nR
, __constant int* i_nS
, __constant int* i_nT
, __constant int* i_nU
, __constant double* d_nA
, __constant double* d_nB
, __constant double* d_nC
, __constant double* d_nD
, __constant double* d_nE
, __constant double* d_nF
, __constant double* d_nG
, __constant double* d_nH
, __constant double* d_nI
, __constant double* d_nJ
, double d_nK
, double d_nL
, int iNumA )
{
    long iBase = get_global_id(0) ;

    // Padding work-items beyond the real item count have nothing to do.
    if (iBase >= iNumA )
        return ;
}
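
For anyone wondering what the `iBase >= iNum...` check is for, here is a plain-C emulation of a launch. This is host-side illustration only: `work_item`, `emulate_launch` and the `visits` array are hypothetical names, and the serial loop stands in for what the device does in parallel.

```c
#include <assert.h>

/* Plain-C stand-in for one work-item; gid plays the role of
   get_global_id(0) and n_items the role of iNumA / iNumD / iNumF. */
static void work_item(long gid, int n_items, int *visits) {
    if (gid >= n_items)
        return;            /* padding work-item: must not touch the data */
    visits[gid] += 1;      /* real work would go here */
}

/* Serial emulation of a 1-D NDRange: one call per global id,
   including the ids that only exist because of rounding up. */
static void emulate_launch(long global_size, int n_items, int *visits) {
    assert(n_items >= 0 && global_size >= n_items);
    for (long gid = 0; gid < global_size; ++gid)
        work_item(gid, n_items, visits);
}
```

Launching 32 ids over the 18 nodes touches each node exactly once; without the check, ids 18 to 31 would index past the end of the buffers.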