bandwidth test



matrem
12-17-2009, 09:08 AM
I want to test my OpenCL memory bandwidth.
I work on an NVIDIA GT280, so my kernel should read or write global memory at a maximum of 118 GB/s.

- I tried with the simplest kernel:

__kernel void main(__global float * array)
{
    array[ get_global_id(0) ] = 123.321f;
}
- I work on an array of 16,777,216 floats, in a non-host (device) memory buffer.
- Each thread writes one float, so each warp (32 threads) writes 128 bytes (the best case for NVIDIA GPUs of compute capability 1.3).
- I use the OpenCL profiler to time the kernel (see the sketch below).

=> My kernel executes in 0.955 ms, so the bandwidth is 65.462 GB/s.
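(For reference, a minimal sketch of how such a figure is obtained with standard OpenCL event profiling; it assumes `event` is the cl_event returned by clEnqueueNDRangeKernel on a queue created with CL_QUEUE_PROFILING_ENABLE. The arithmetic suggests the figure above is in 1024-based GB:)

/* Sketch: time one kernel launch via standard OpenCL event profiling.
   Assumes `event` comes from clEnqueueNDRangeKernel and the queue was
   created with CL_QUEUE_PROFILING_ENABLE. */
cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

double seconds = (end - start) * 1e-9;          /* timestamps are in nanoseconds */
double bytes   = 16777216.0 * sizeof(float);    /* 67,108,864 bytes written */
printf("%.3f GiB/s\n", bytes / seconds / (1024.0 * 1024.0 * 1024.0));
/* 67,108,864 B / 0.955 ms = 70.3e9 B/s = 65.4 GiB/s, matching the figure above. */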

What is the problem with my experiment? :?

matrem
12-17-2009, 09:15 AM
In fact it's each half-warp that should access a coalesced zone of 128 bytes.
So here each half-warp uses a 64-byte block (16 threads x 4 bytes = 64 bytes). But that should not be a problem.

coleb
12-17-2009, 11:30 AM
Maybe it's 118 GB/s only when going in both directions? In only one direction it would be half that?

The bandwidth testing example in the NVIDIA OpenCL best practices guide goes in both directions: http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf

Try that to see what you get.

matrem
12-21-2009, 11:25 AM
In fact I use a GTX 275 (it's the GT280 that has 240 GB/s bandwidth).

You're right; with this sample I get 99 GB/s bandwidth:


...
cl::PlatformList platforms;

cl::DeviceList devices;
cl::Device & device (cl::GetPlatformList(&platforms).front().GetDeviceList(cl::ALL_DEVICE_TYPE, &devices).front());
cl::Context context (device);
cl::CommandQueue commandQueue (context, device, cl::CommandQueue::IN_ORDER_EXECUTION, cl::CommandQueue::ENABLE_PROFILING);
cl::Program const program (context, cl::util::GetFileSource("sample4.cl"));
cl::Kernel kernel (program, "main");
cl::Buffer buffer1 (context, cl::Mem::WRITE_ONLY, ARRAY_SIZE * sizeof(float));
cl::Buffer buffer2 (context, cl::Mem::READ_ONLY, ARRAY_SIZE * sizeof(float));
cl::Kernel::Args const args (buffer1, buffer2);
cl::Kernel::WorkSize const workSize ( glm::size3(ARRAY_SIZE, 1, 1) );

cl::Event event = kernel.EnqueueNDRange(commandQueue, args, workSize);

event.Wait();

std::cout << cl::util::BandWidth(2 * ARRAY_SIZE * sizeof(float), cl::Event::ProfilingInformation(event)) << std::endl;
...



__kernel
void main(__global float * array1, __global float const * array2)
{
    array1[ get_global_id(0) ] = array2[ get_global_id(0) ];
}
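
(For readers without this wrapper, a minimal sketch of the same copy test against the plain OpenCL C API; error checking is omitted, the kernel is renamed "copy", and the buffer flags and sizes mirror the post:)

#include <stdio.h>
#include <CL/cl.h>

#define ARRAY_SIZE 16777216

static const char *source =
    "__kernel void copy(__global float *dst, __global const float *src)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    dst[i] = src[i];\n"
    "}\n";

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device,
                                                  CL_QUEUE_PROFILING_ENABLE, NULL);

    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "copy", NULL);

    size_t bytes = ARRAY_SIZE * sizeof(float);
    cl_mem dst = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
    cl_mem src = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(dst), &dst);
    clSetKernelArg(kernel, 1, sizeof(src), &src);

    size_t global = ARRAY_SIZE;
    cl_event event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &event);
    clWaitForEvents(1, &event);

    cl_ulong start, end;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    /* One read plus one write per element => 2 * bytes moved. */
    double seconds = (end - start) * 1e-9;
    printf("%.2f GB/s\n", 2.0 * bytes / seconds / 1e9);
    return 0;
}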

PS: I changed my API a bit due to our discussion and the problem about return values :D

matrem
12-21-2009, 11:31 AM
I don't know if we can say the GTX 275 has a 118 GB/s bandwidth... :?
Perhaps it's an OpenCL implementation limitation? Has anyone tested the same sample with CUDA?

dbs2
01-04-2010, 01:03 AM
Two suggestions:
- Run your kernel once to warm up the card, then average your results over a few dozen/hundred runs (see the sketch below). You can easily get very strange results from both the first kernel execution and any single execution.
- Use the vload built-ins to load a larger chunk of data at once. On some architectures this can make a big difference; on others it may not.
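
(A minimal sketch of both ideas; the run count of 100 is arbitrary, and vload4/vstore4 are standard OpenCL built-ins:)

/* Kernel side: copy with vload4/vstore4 so each work-item moves 16 bytes.
   Launch with a global size of ARRAY_SIZE / 4. */
__kernel void copy4(__global float *dst, __global const float *src)
{
    size_t i = get_global_id(0);
    vstore4(vload4(i, src), i, dst);
}

/* Host side: discard the first (warm-up) launch, average the rest. */
double total = 0.0;
const int runs = 100;
for (int r = 0; r <= runs; ++r) {
    cl_event event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &event);
    clWaitForEvents(1, &event);
    cl_ulong start, end;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    if (r > 0)                        /* r == 0 is the warm-up run */
        total += (end - start) * 1e-9;
    clReleaseEvent(event);
}
double avgSeconds = total / runs;     /* average over the measured runs */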

Also, the NVIDIA OpenCL drivers are apparently not as mature as the CUDA ones (not surprising given that they are far newer), so you probably won't get the same performance in some areas.