I have a kernel that basically does nothing but a table lookup. The total number of lookups is about 32,000.

Question: Is the overhead to start & execute ONE thread PER lookup worth it?

Code:
__constant int lookup_table[256] = { /* ...some values... */ };
__kernel void some_kernel(__global const int* in, __global int* out){
	uint tid = get_global_id(0);
	// one lookup per work-item; in[tid] must be in the range 0..255
	out[tid] = lookup_table[in[tid]];
}

Or would it be faster to put a small loop inside the kernel so that each thread does several lookups, let's say 16? The idea is just to avoid the overhead of creating too many threads.
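
To make the second variant concrete, here is a rough, untested sketch of what I have in mind. The names LOOKUPS_PER_THREAD and some_kernel_batched, the element count n, and passing the table as a kernel argument are all made up for illustration. The strided indexing (each thread steps by the global size) is just one possible layout, chosen on the assumption that neighbouring threads should touch neighbouring elements.

```c
// Hypothetical batch size: how many lookups each work-item performs.
#define LOOKUPS_PER_THREAD 16

// Assumption: the table is passed in as an argument here so the sketch
// stands alone; n is the total number of elements (about 32,000 in my case).
__kernel void some_kernel_batched(__global const int* in,
                                  __global int* out,
                                  __constant int* lookup_table,
                                  uint n)
{
	uint tid = get_global_id(0);
	uint stride = get_global_size(0);

	// Work-item tid handles elements tid, tid + stride, tid + 2*stride, ...
	for (uint k = 0; k < LOOKUPS_PER_THREAD; ++k) {
		uint idx = tid + k * stride;
		if (idx < n)  // guard in case n is not a multiple of the batch size
			out[idx] = lookup_table[in[idx]];
	}
}
```

With 32,000 lookups and 16 per thread, this would be launched with a global size of 2,000 instead of 32,000.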

The code will run on state-of-the-art ATI/AMD and Nvidia GPUs.