Having separate command queues on separate threads is just as difficult to program as a single out-of-order command queue in OpenCL. Both need the same level of synchronization and dependency tracking to operate correctly: with an out-of-order queue you express the dependencies through cl_events, with separate threads through OS locks.

Keep in mind that all of the command queues feed the same device in the runtime, so if the commands can be overlapped at the device level, an out-of-order queue should achieve it. (Indeed, if overlap is possible, a good runtime should do it whether or not the queue is out-of-order, as long as the commands have no dependency issues.) If the out-of-order queue doesn't overlap them, there's no reason to believe separate CPU threads will. (The commands will most likely just be re-ordered in some internal queue in the runtime.)

I would not recommend having multiple command queues unless it simplifies your program, which would imply that they are truly independent.