If you are only copying something like a single int, then it's not worth putting that piece of data in local memory. And you are right, in that case a single warp would do all the work and the rest would be idle... assuming that your hardware doesn't use a DMA engine for global->local copies.
How about this case?

- Each thread needs, let's say, 1000 elements to complete its work
- Number of threads in one work-group = 1024

Even in this case, the first thread or the first warp would already have brought all 1000 of these elements into local memory.

It doesn't make sense to me that the threads in the other warps also execute the async copy when the data is already in local (shared) memory.

-- Bharath