I am currently writing a software rasterizer/renderer that uses OpenCL as the engine for the fragment shading stage.
I eventually plan to move as much of the pipeline as is practical to OpenCL.
In this list's opinion, given the current limitations of OpenCL with threads, and also the host<->GPU communication overhead,
what would be the best practical strategy for optimizing my scenario?
I know that modifying command queues is not thread safe (I tried it ;>).
Right now the thread hierarchy looks like this:
(CPUthread0:TransformGeometry) .. (CPUthread63:TransformGeometry) (using a thread pool)
thread safe (but not locked) screen-space per material per screen tile post transform buckets
(CPUthread0:RasterizePt1) .. (CPUthread15:RasterizePt1) (using the same thread pool)
\/ (SCAN CONVERT TRIANGLES INTO PRE-SHADED FRAGMENTS)
thread safe (but not locked) tile-space per material per screen tile preshaded-fragment buffers
(Locked OpenCL Device: Fragment Shading) // CPU THREADS SERIALIZED HERE (Most time spent per frame is also here)
\/ (SHADE FRAGMENTS)
thread safe (but not locked) tile-space per material per screen tile postshaded-fragment A-Buffers
(CPUthread0:RasterizePt3) .. (CPUthread15:RasterizePt3) (using the same thread pool, actually the same workqueue job as RasterizePt1)
\/ (ZSort, A-Buffer Composite and AntiAlias Resolve TileBuffer to FrameBuffer)
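For illustration only, the serialized shading stage above can be pictured as a single submitter thread that owns the device, so the rasterizer threads only enqueue batches and never touch the command queue themselves. This is a minimal C++ sketch under assumptions, not the renderer's actual code: FragmentBatch, ShadeSubmitter, and the shade callback are hypothetical names, and the callback stands in for whatever would write buffers, enqueue the kernel, and read results back on the OpenCL side.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <iterator>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical unit of work: the pre-shaded fragments of one (tile, material) bucket.
struct FragmentBatch {
    int tile;
    int material;
    std::vector<float> fragments;  // placeholder payload
};

// Only the internal thread ever calls shade_, so whatever sits behind it
// (assumed here: one OpenCL command queue) is used by a single thread.
class ShadeSubmitter {
public:
    using ShadeFn = std::function<void(std::vector<FragmentBatch>&)>;

    explicit ShadeSubmitter(ShadeFn shade)
        : shade_(std::move(shade)), worker_([this] { run(); }) {}

    // Drains anything still queued, then stops the submitter thread.
    ~ShadeSubmitter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Callable from any rasterizer thread; the lock is held only for the push.
    void submit(FragmentBatch batch) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stopping_ && "submit after shutdown");
            pending_.push_back(std::move(batch));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::vector<FragmentBatch> group;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
                if (pending_.empty()) return;  // stopping and fully drained
                // Take everything queued so far, so one device round trip
                // can cover many tiles instead of one.
                group.assign(std::make_move_iterator(pending_.begin()),
                             std::make_move_iterator(pending_.end()));
                pending_.clear();
            }
            shade_(group);  // runs outside the lock; producers keep queuing
        }
    }

    ShadeFn shade_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<FragmentBatch> pending_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};
```

In this sketch the rasterizer threads would call submit() as each tile bucket fills, and the callback would receive whatever has accumulated since the previous device call. Whether grouping batches like this actually reduces the host<->GPU overhead on a given setup is something only measurement can show.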
If it matters, I am not currently concerned with all hardware platforms, just mine. I will be at some point, but I am not there yet...
I am using dual Xeon E5520s and a GeForce GTX 260 Core 216.
you can see some performance tables at