I would like something like DispatchIndirect in the OpenCL world: setting the global and local work-group sizes from GPU memory (a cl_mem object), which avoids a host sync in some cases, such as marching cubes.
It could be called EnqueueNDKernelIndirect.
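To make the request a bit more concrete, here is a rough sketch in plain C of the kind of record a kernel might write into a cl_mem buffer for the runtime to consume. The struct `nd_range_indirect_args`, its layout, and the helpers `round_up` and `fill_indirect_args` are all hypothetical names made up for illustration; nothing here is part of the OpenCL API. It also shows the rounding that the host has to do today after reading the element count back from the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of an indirect-launch record stored in a cl_mem
   buffer. This is an assumption for illustration, not an OpenCL type. */
typedef struct {
    size_t global_size[3]; /* work-items per dimension */
    size_t local_size[3];  /* work-group size per dimension */
} nd_range_indirect_args;

/* Round n up to the next multiple of m (m must be non-zero). */
size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

/* Build a 1D launch covering `count` items with work-groups of `local`
   items. A zero count yields a zero global size; whether that would be
   a valid launch is left to the (hypothetical) runtime. */
nd_range_indirect_args fill_indirect_args(size_t count, size_t local)
{
    nd_range_indirect_args a;
    a.global_size[0] = round_up(count, local);
    a.global_size[1] = 1;
    a.global_size[2] = 1;
    a.local_size[0] = local;
    a.local_size[1] = 1;
    a.local_size[2] = 1;
    return a;
}
```

With an indirect enqueue, the equivalent of `fill_indirect_args` would run on the GPU (for example at the end of the marching cubes classification pass), so the host would never need to read the count back.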
Also expose 2D image arrays, similar to RWTexture2D, since on Fermi they are more than 2x faster than 3D images.
Also, I mentioned a year ago bringing in new integer instructions such as find first bit set, pop count, reverse bits, etc.
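For reference, this is roughly what has to be written by hand when those instructions are not exposed. It is a minimal sketch in portable C; the function names are my own, and I am assuming "find first bit set" means the index of the lowest set bit (the other convention, highest set bit, would work the same way from the other end).

```c
#include <stdint.h>

/* Population count: number of set bits in x. */
uint32_t popcount32(uint32_t x)
{
    uint32_t n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* 0-based index of the lowest set bit, or -1 if x is zero
   (assumed convention, see the note above). */
int find_first_set32(uint32_t x)
{
    int i = 0;
    if (x == 0)
        return -1;
    while ((x & 1u) == 0) {
        x >>= 1;
        i++;
    }
    return i;
}

/* Reverse the bit order of a 32-bit word by swapping progressively
   larger groups: single bits, pairs, nibbles, bytes, half-words. */
uint32_t reverse_bits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}
```

Each of these loops or shift chains is a single instruction on hardware that has the operation, which is why exposing them as built-ins matters.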
Also, as other suggestions point out:
* recursion (mainly to bring true function calls) and other CUDA goodness such as malloc/free in kernels, etc.
* accessing host memory from the GPU, etc.