1.) Using only create/free buffer:

platform[0]=AMD Accelerated Parallel Processing
device[0]=Juniper
end-start time 15.666632 usec

device[1]=Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
end-start time 4.736423 usec

platform[1]=NVIDIA CUDA
device[0]=GeForce 8600 GT
end-start time 8.015486 usec

platform[2]=Intel(R) OpenCL
device[0]=Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
end-start time 24.046458 usec

2.) Create/free buffer and queue:

platform[0]=AMD Accelerated Parallel Processing
device[0]=Juniper
end-start time 12.023229 usec

device[1]=Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
end-start time 10.201528 usec

platform[1]=NVIDIA CUDA
device[0]=GeForce 8600 GT
end-start time 13.844930 usec

platform[2]=Intel(R) OpenCL
device[0]=Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
end-start time 23.317777 usec

3.) Create/free buffer, queue and do map/unmap:

platform[0]=AMD Accelerated Parallel Processing
device[0]=Juniper
end-start time 740.339427 usec

device[1]=Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
end-start time 22.224756 usec

platform[1]=NVIDIA CUDA
device[0]=GeForce 8600 GT
end-start time 9735.171989 usec

platform[2]=Intel(R) OpenCL
device[0]=Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
end-start time 101.650935 usec

I moved the malloc outside of the loop and added 128-byte alignment. Thanks to your example, I also found some timing overhead in my own code. AMD does show relatively low overhead for the CPU device (about 22 usec with map/unmap), but that is still a lot more than a pointer copy or a call to clSetKernelArg. Anyhow, you did say that using the pointer without map/unmap is fine for the CPU device, so I guess that solves the overhead problem. Usually when you copy data you need to wait for the queue to finish anyway.
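For anyone repeating the measurement, here is a rough sketch of what hoisting the allocation out of the timed loop with 128-byte alignment could look like. It is not my actual benchmark code: the helper names (alloc_aligned, is_aligned, now_usec, time_per_call_usec, touch_first_byte) are made up for illustration, it assumes a POSIX system with posix_memalign and clock_gettime, and the OpenCL calls are left out so that op stands in for whatever is being timed.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Assumed alignment, matching the 128 bytes mentioned in the post. */
#define BUFFER_ALIGNMENT 128

/* Allocate size bytes aligned to BUFFER_ALIGNMENT; returns NULL on failure. */
static void *alloc_aligned(size_t size)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, BUFFER_ALIGNMENT, size) != 0)
        return NULL;
    return ptr;
}

/* Return nonzero if ptr sits on a BUFFER_ALIGNMENT boundary. */
static int is_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % BUFFER_ALIGNMENT) == 0;
}

/* Monotonic clock reading in microseconds. */
static double now_usec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/*
 * Average time per call of op(host_buffer) in microseconds.
 * The buffer is allocated by the caller, outside the timed loop,
 * so the allocation cost is not counted as API overhead.
 */
static double time_per_call_usec(void (*op)(void *), void *host_buffer,
                                 int iterations)
{
    assert(iterations > 0);
    double start = now_usec();
    for (int i = 0; i < iterations; ++i)
        op(host_buffer);
    double end = now_usec();
    return (end - start) / iterations;
}

/* Hypothetical stand-in for the operation being timed. */
static void touch_first_byte(void *buf)
{
    ((volatile unsigned char *)buf)[0] = 1;
}
```

In use, you would call alloc_aligned once, pass the pointer to time_per_call_usec together with the function wrapping the create/map/unmap/free sequence, and free the buffer after the loop.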

Thanks!
Atmapuri