Since OpenCL lets you schedule work on multiple devices simultaneously and asynchronously, it's very important to be able to tell which device is the main host's CPU, so that its load can be reduced.

Currently, if you distribute asynchronous jobs across all the CPUs and GPUs available in the system (full load), the additional devices spend most of their time waiting, because the main host's CPU is too busy to process their requests.
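To illustrate the first idea, here is a minimal sketch in C of how an application might split work if it could tell which device is the host's CPU. Everything in it is an assumption made for illustration: `device_desc`, its `is_host_cpu` flag, `device_weight`, `split_work`, and the `reserved_units` parameter are hypothetical names, not part of OpenCL. It simply weights each device by its compute units and holds a few units back on the host's CPU so the host thread stays responsive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device description; NOT an OpenCL type. */
typedef struct {
    unsigned compute_units; /* e.g. queried via CL_DEVICE_MAX_COMPUTE_UNITS */
    int is_host_cpu;        /* assumed flag: OpenCL does not report this */
} device_desc;

/* Weight of a device: its compute units, minus the units held back
   for the host thread when the device is the host's CPU. */
static unsigned device_weight(const device_desc *d, unsigned reserved_units)
{
    if (!d->is_host_cpu)
        return d->compute_units;
    return d->compute_units > reserved_units
         ? d->compute_units - reserved_units
         : 0;
}

/* Split total_items across n devices in proportion to their weights.
   Writes one count per device into out. Returns 0 on success, or -1
   if no device has any weight left. */
static int split_work(const device_desc *devs, size_t n,
                      unsigned reserved_units, size_t total_items,
                      size_t *out)
{
    assert(devs != NULL && out != NULL);

    size_t total_weight = 0;
    size_t best = 0; /* index of the heaviest device */
    for (size_t i = 0; i < n; ++i) {
        unsigned w = device_weight(&devs[i], reserved_units);
        total_weight += w;
        if (w > device_weight(&devs[best], reserved_units))
            best = i;
    }
    if (total_weight == 0)
        return -1;

    size_t assigned = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t w = device_weight(&devs[i], reserved_units);
        /* Same as total_items * w / total_weight, rounded down, but
           split in two terms to avoid overflowing the product. */
        out[i] = (total_items / total_weight) * w
               + (total_items % total_weight) * w / total_weight;
        assigned += out[i];
    }
    /* Items lost to rounding go to the heaviest device. */
    out[best] += total_items - assigned;
    return 0;
}
```

For example, with an 8-unit host CPU, a 24-unit GPU, and 2 reserved units, 3000 work items would be split 600 / 2400. The sketch only works if the application already knows which device is the host's CPU, which is exactly the information asked for here.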

Another solution would be to automatically reduce the host CPU's load whenever callbacks or native kernels are queued, or to assign them a higher priority.