I need to implement an access function for a discrete, circle-shaped object in a kernel. I have two options:
1. Calculate each access with floor(n*sin(t)+0.5f)
2. Calculate all access indices once, store them in local memory, and look them up from there
Does anyone know how many cycles a local memory access takes? The CUDA figures I found are ~16 cycles for the sin() function and ~400 cycles for a global memory access.