Can GPGPU support not only HPC workloads but also more "general" tasks, such as OS acceleration?
But these workloads are very different: latency is more important than throughput, and tasks arrive at random rather than in batches as in HPC. So we can't wait for the CPU to collect a large number of threads and then launch a kernel; instead, we must deliver a workgroup to the GPU as soon as the number of work-items in it reaches the minimum granularity of hardware scheduling (64 for an AMD wavefront, 32 for an NVIDIA warp). However, if we just launch a kernel with that few threads, it will be inefficient, because there won't be enough threads to hide cache-miss latency.
So, is it possible to dynamically add workgroups to a running kernel?
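
To make the dispatch policy described above concrete, here is a minimal host-side sketch in Python. It only models the batching rule (hand off a group as soon as one wavefront's worth of work-items has arrived); it does not talk to a GPU. The names `WavefrontBatcher`, `submit`, and `dispatch` are hypothetical and made up for illustration, and `dispatch` is a stand-in for whatever the real submission mechanism would be.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Scheduling granularity from the question: 64 for an AMD wavefront,
# 32 for an NVIDIA warp.
WAVEFRONT_SIZE = 64


@dataclass
class WavefrontBatcher:
    """Hypothetical host-side batcher (illustration only).

    Collects work-items as they arrive and hands off one group as soon as
    a full wavefront's worth is pending.
    """
    dispatch: Callable[[List[int]], None]  # stand-in for the real GPU submission
    width: int = WAVEFRONT_SIZE
    pending: List[int] = field(default_factory=list)

    def submit(self, work_item: int) -> None:
        self.pending.append(work_item)
        if len(self.pending) >= self.width:
            # Split off exactly one wavefront-sized group and dispatch it.
            group = self.pending[:self.width]
            self.pending = self.pending[self.width:]
            self.dispatch(group)


# Demo: 150 tasks arriving one at a time, modelled as integers.
dispatched: List[List[int]] = []
batcher = WavefrontBatcher(dispatch=dispatched.append)
for item in range(150):
    batcher.submit(item)
```

In this demo two groups of 64 are handed off and 22 work-items are left pending. Each handed-off group is exactly the small unit of work that a separate kernel launch would handle poorly, which is why I am asking about adding it to a kernel that is already running.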