I've been trying to use OpenCL's prefetch() to improve my kernel's performance, but I haven't seen any impact one way or the other. I wonder whether I'm calling it correctly; I haven't been able to find any code samples that show its proper use.
My kernel uses a loop to read chunks of data from global memory into local memory and then process them. At the top of each iteration I call barrier() to synchronize the work-items, then async_work_group_copy() followed by wait_group_events() to transfer the current chunk into local memory. Right after that, I call prefetch() on the next chunk of global memory, and then process the data that is now in local memory. I *think* the transfer at the top of the next iteration should then happen faster, since that chunk has already been prefetched, but as I said, I don't see any performance payoff.
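To make the question concrete, here is a simplified sketch of the loop structure described above. It is not my exact code: the kernel name, CHUNK, the buffer layout (each work-group owning a contiguous run of chunks), and the "multiply by 2" processing step are all placeholders I made up for illustration. Only the order of the barrier(), async_work_group_copy(), wait_group_events(), and prefetch() calls reflects what I'm actually doing.

```c
// Simplified sketch only; names and the processing step are placeholders.
#define CHUNK 256   // elements per chunk (assumed value)

__kernel void process_chunks(__global const float *in,
                             __global float *out,
                             const int num_chunks)
{
    __local float tile[CHUNK];

    const size_t lid = get_local_id(0);
    const size_t lsz = get_local_size(0);

    // Assumed layout: each work-group owns num_chunks consecutive chunks.
    const size_t base = get_group_id(0) * (size_t)num_chunks * CHUNK;
    __global const float *src = in + base;
    __global float *dst = out + base;

    for (int c = 0; c < num_chunks; ++c) {
        // Wait until every work-item has finished with the previous tile.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Copy the current chunk from global to local memory.
        event_t ev = async_work_group_copy(tile, src + (size_t)c * CHUNK,
                                           CHUNK, 0);
        wait_group_events(1, &ev);

        // Hint that the next chunk will be needed soon.
        if (c + 1 < num_chunks)
            prefetch(src + (size_t)(c + 1) * CHUNK, CHUNK);

        // Placeholder processing: each work-item handles a strided slice.
        for (size_t i = lid; i < CHUNK; i += lsz)
            dst[(size_t)c * CHUNK + i] = tile[i] * 2.0f;
    }
}
```

In this sketch the prefetch() call takes a pointer to the start of the next chunk and the number of elements in it, which is how I read the function's description.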
Am I misunderstanding how to use prefetch()? Can anyone point me to the correct usage?
BTW, I'm using an NVIDIA card with compute capability 1.1.