I have a kernel in which every thread reads one particular element (of a data structure) from global memory.
In other words, all the threads executing the kernel use the data at the same address in global memory.
I am trying to use async_work_group_copy to bring the data into shared (local) memory as the first thing in the kernel. However, as per the OpenCL specification, async_work_group_copy has to be executed by all threads of the work group.
Is the following possible?
One thread executes the async copy function and brings the data into shared memory, and the remaining threads of the work group use the data brought in by that thread.
Or is it better to allow the cache to handle the data accesses in this case?