I have a function Run() that calls execution of two kernels:

void Run()
//I'am using C++ bindings
queue->enqueueNDRangeKernel(*kernelRow, cl::NullRange, *globalRangeRow, *localRangeRow, NULL, eventRow);
queue->enqueueNDRangeKernel(*kernelColumn, cl::NullRange, *globalRangeCol, *localRangeCol, NULL, eventCol);

// As you see, I'm using events (eventRow, eventCol) because of profiling.

How expensive (time performance) is calling enqueueNDRangeKernel (or clEnqueueNDRangeKernel ).

With Nvidia OpenCL Profiler, I got total time of execution (on GPU) 351 ms, but when I measured time of running of method Run()
I got 622 ms.

Why this difference is so large?

I tested on NVIDIA GT240.
I also tested on ATI HD 5670 and difference is much smaller.

When is data transfered to GPU, on calling clEnqueueNDRangeKernel or when buffer is created (clCreateBuffer)?