I don't know why I didn't think about this sooner, but I would like a second opinion on this method.
Often the global_size for a clEnqueueNDRangeKernel() call is not an even multiple of preferred_work_group_size_multiple, or even worse it's a prime number, which is a problem if you explicitly pass the local_size, because global_size then has to be evenly divisible by it. Option one is to pass the global_size and pass NULL for local_size, which leaves the choice of work-group size to the implementation. Option two is to pad global_size so that it's a multiple of local_size (which is itself a multiple of preferred_work_group_size_multiple), pass the original desired global_size as a kernel argument, and do bounds checking in the kernel so the padding work-items are ignored. I had been doing this, but started to think it's really wasteful for every work-item to do a bounds check when only a tiny fraction of them should be ignored.
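For reference, the padding in option two can be sketched roughly like this. The helper name pad_global_size is hypothetical (it is not part of the OpenCL API), and the kernel-side check in the comment assumes a one-dimensional range with the original size passed as an argument called n.

```c
#include <assert.h>
#include <stddef.h>

/* Option two: round global_size up to the next multiple of local_size.
 * Hypothetical helper, not an OpenCL call.
 *
 * The kernel is then expected to guard against the padding itself,
 * along the lines of (OpenCL C, with n being the original global_size):
 *
 *     if (get_global_id(0) >= n) return;
 */
static size_t pad_global_size(size_t global_size, size_t local_size)
{
    assert(local_size != 0);
    size_t remainder = global_size % local_size;
    if (remainder == 0) {
        return global_size;               /* already a multiple, no padding */
    }
    return global_size + (local_size - remainder);
}
```

With a local_size of 64, a global_size of 1000 would be padded to 1024, so 24 work-items do nothing but fail the bounds check.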
Option three recently dawned on me as I became more familiar with OpenCL: enqueue the kernel with zero offset and global_size_a equal to the greatest multiple of local_size_a (which again is a multiple of preferred_work_group_size_multiple) that is less than or equal to global_size, and then enqueue the kernel a second time with a global offset of global_size_a, global_size_b = global_size - global_size_a, and local_size_b = global_size_b (which is less than local_size_a). The second enqueue is only needed when global_size_b is nonzero. Then there's no more need for explicit bounds checking inside the kernel like in option two.
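To make the arithmetic of option three concrete, here is a minimal sketch of the split for a one-dimensional range. The struct and function names (ndrange_split, split_ndrange) are made up for illustration, and the enqueue calls in the comment assume that queue and kernel already exist and that the implementation accepts a non-NULL global_work_offset (OpenCL 1.1 or later, as far as I know).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two launches of option three. */
typedef struct {
    size_t offset_a, global_a, local_a;   /* main launch, full work-groups */
    size_t offset_b, global_b, local_b;   /* tail launch, one small group  */
} ndrange_split;

/* Split global_size into the largest multiple of local_size (launch a)
 * plus whatever is left over (launch b). Either part may come out as 0,
 * in which case that launch has to be skipped. */
static ndrange_split split_ndrange(size_t global_size, size_t local_size)
{
    assert(local_size != 0);
    ndrange_split s;
    s.offset_a = 0;
    s.global_a = global_size - (global_size % local_size);
    s.local_a  = local_size;
    s.offset_b = s.global_a;              /* tail starts where a ended */
    s.global_b = global_size - s.global_a;
    s.local_b  = s.global_b;              /* a single work-group for the tail */
    return s;
}

/* Assumed usage on the host (queue and kernel set up elsewhere):
 *
 *   ndrange_split s = split_ndrange(global_size, local_size);
 *   if (s.global_a > 0)
 *       clEnqueueNDRangeKernel(queue, kernel, 1, &s.offset_a,
 *                              &s.global_a, &s.local_a, 0, NULL, NULL);
 *   if (s.global_b > 0)
 *       clEnqueueNDRangeKernel(queue, kernel, 1, &s.offset_b,
 *                              &s.global_b, &s.local_b, 0, NULL, NULL);
 */
```

For a global_size of 1000 and a local_size of 64, this gives a main launch of 960 work-items and a tail launch of 40 work-items starting at offset 960. Because get_global_id() includes the offset, the kernel indexes the same elements as it would in a single launch.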
Maybe option one does something like option three internally, but option three gives the developer the choice of local_size_a and local_size_b. Which option do you think is best? Are there any problems or issues with any of these options that I didn't mention?