I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 Million)
Each of the threads can require up to 1Mb of global memory (exact size not known in advance)
So i figured... ok, I have 6Gb and I can run 2880 cores in parrallel, more than enough right ?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread pointing to a specific global memory area (with the coalescence and stuff, but you get the idea...)
My problem is, How do I know which thread is being used in the kernel to point to the right memory area ?
I did find the cl_arm_get_core_id extension but this only gives me the workgroup, not the acutal thread being used, plus this does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernels of global size 2880, and there I obviously know where to point to with the global Id.
But won't this lead to a lot of overhead because of the 5Million / 2880 kernel calls ?
Any ideas to do this properly are very welcome !