Question about global memory access
I have a question about global memory access.
I compared two ways of accessing global memory.
In the first, all threads read the same location; in the second, each thread reads its own location.
I found that the first took less time.
The first seems to involve a memory conflict, so why is it faster? Is it due to the cache?
On modern hardware, the case where all work items read the same location is handled as a "broadcast": the reads do not conflict, so they are not serialized, which is why this case is fast.
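To make the difference between the access patterns concrete, here is a rough sketch in Python of a simplified cost model. It is not how any particular device works: it assumes that a group of 32 work items issues its reads together, and that the hardware performs one memory transaction per distinct aligned 128-byte segment the group touches. The names `SEGMENT_BYTES`, `GROUP_SIZE`, `ELEM_BYTES`, and `transactions` are made up for this illustration, and the real numbers vary by device.

```python
# Simplified model (assumption): one memory transaction per distinct
# aligned segment touched by a group of work items that execute together.
SEGMENT_BYTES = 128   # assumed transaction size; varies by device
GROUP_SIZE = 32       # assumed number of work items issued together
ELEM_BYTES = 4        # 4-byte float elements


def transactions(indices):
    """Count the distinct aligned segments touched by the element indices."""
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})


# Every work item reads element 0 (the "broadcast" case).
same_location = [0] * GROUP_SIZE
# Work item i reads element i (each has its own, contiguous location).
own_location = list(range(GROUP_SIZE))
# Work item i reads element 32 * i (each has its own, scattered location).
strided = [i * 32 for i in range(GROUP_SIZE)]

print(transactions(same_location))  # 1
print(transactions(own_location))   # 1
print(transactions(strided))        # 32
```

Under these assumptions the same-location read never needs more than one transaction per group, while the per-work-item read needs anywhere from one transaction to one per work item depending on how the locations are laid out. The model ignores caching and the total amount of data touched across all groups, both of which can also favor the same-location case in a real measurement.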