Since we cannot use memcpy in OpenCL, I am wondering if there
is a similar function available that can be used to copy chunks of
data from __global to __private (or to __local) inside a kernel.

For example, say I wish to copy 10 elements from global memory to
__private memory (per thread). I would rather not write a loop like:

Code :
for (int i=0; i<n_elements.....
...

How is this generally achieved in OpenCL?

The goal is to get a list of data into each thread. I am writing a ray tracer
in which each thread needs the list of surface data stored in its grid cell
(or tree node, if I use a tree instead).
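
To make the question concrete, here is a minimal sketch of the kind of copy being asked about, using two built-ins that exist in OpenCL C: async_work_group_copy for __global to __local, and vloadn for __global to __private. The kernel name, argument names and the fixed size N are hypothetical, and the layout (N floats per work-group, stored contiguously) is an assumption, not something taken from the ray tracer itself.

```c
// Sketch only: kernel name, argument names and N are hypothetical.
#define N 10  // elements per grid cell, taken from the example above

__kernel void copy_sketch(__global const float *cells,  // assumed: N floats per work-group
                          __global float *out)          // one float per work-item
{
    __local float cell[N];

    __global const float *src = cells + get_group_id(0) * N;

    // __global -> __local: one cooperative copy per work-group.
    // Every work-item in the group must make this call with identical arguments.
    event_t ev = async_work_group_copy(cell, src, N, 0);
    wait_group_events(1, &ev);

    // __global -> __private: vector loads into private variables, 8 + 2 = 10 floats.
    float8 lo = vload8(0, src);  // elements 0..7
    float2 hi = vload2(4, src);  // elements 8..9 (the offset counts in units of float2)

    // Dummy use so both copies are visibly consumed.
    out[get_global_id(0)] = cell[0] + lo.s7 + hi.y;
}
```

Both paths assume the element count is known at compile time. For a list whose length varies per cell, a plain loop (or async_work_group_copy with a runtime count into a __local buffer sized for the worst case) still seems to be needed.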