I am working on a multi-threaded program that runs an OpenCL kernel on several compute devices simultaneously. I first launch N threads, where N is the number of devices the user has selected (CPU, GPU1, GPU2, ...). Each thread runs the same host code: it creates a context, creates a command queue, creates buffers, compiles the program, and so on.

Is this overall structure safe in a multi-threaded environment? Is there any specification of which OpenCL API functions are thread-safe and which are not?

Moreover, if GPU1 and GPU2 are the same kind of device, say the two GPUs on a Radeon 6990, is there a way to avoid one of the two clBuildProgram calls? I ask because the build is quite slow at this point: it takes about 5 seconds to compile a 650-line OpenCL program with ATI Catalyst 11.3.