My GPU has 18 compute units, and each work-group supports a maximum of 256 work-items. When I run my kernel with 16 * 256 work-items, OpenCL creates 16 work-groups and I get the correct result. But when I run it with 32 * 256 work-items, OpenCL creates 32 work-groups and I get the wrong result.
Is the maximum number of work-items equal to compute_units * max_work_group_size? Or is there a way to write kernels that support more work-items?
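To make the numbers concrete, here is a small C sketch of the arithmetic behind the two launches. The function names are illustrative only (they are not from my host code or from the OpenCL API), and the limit it checks is just the guess from my question, not something I know to be true.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups a 1-D launch is split into.
   Assumes global_size is a multiple of local_size, as in both launches. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size != 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* The limit I am guessing at: compute_units * max_work_group_size.
   Whether this is a real limit is exactly what I am asking. */
size_t hypothesized_item_limit(size_t compute_units, size_t max_work_group_size)
{
    return compute_units * max_work_group_size;
}

/* Returns 1 if the launch stays within the guessed limit, 0 otherwise. */
int within_hypothesized_limit(size_t global_size,
                              size_t compute_units,
                              size_t max_work_group_size)
{
    return global_size <= hypothesized_item_limit(compute_units,
                                                  max_work_group_size);
}
```

With my numbers, 18 * 256 = 4608, so the 16 * 256 = 4096 launch is under this figure and the 32 * 256 = 8192 launch is over it.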
How do the extra work-groups access local memory if there are only 18 local memory blocks on the device? For example, my kernel uses barrier(CLK_LOCAL_MEM_FENCE) to synchronize local memory access. Is that causing the problem?