Nice work.

I have just one suggestion:

Very often, my kernels have the following structure:

1) copy data from global to local memory, then barrier.
2) a subset of work-items in the work-group...