Local work size seems pretty small to me. On NVidia architecture, it must be a multiple of 32 for maximum efficiency. When no collaboration is required inside work group threads (ie no communication...