My GPU has 384 cores and 8 compute units (streaming multiprocessors), so there are 384/8 = 48 streaming processors (SPs) on each compute unit. Given that the NVIDIA warp size is 32, meaning 32 threads execute in lockstep, doesn't that mean 48 - 32 = 16 SPs are sitting idle on each cycle? That doesn't make sense to me. Can someone help clarify?
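
To make the arithmetic explicit, here is a minimal sketch of my reasoning in Python. The variable names are only for illustration, and the assumption that exactly one warp is issued per compute unit per cycle is mine; it may well be the part I have wrong.

```python
# Figures for my GPU, as stated in the question.
total_cores = 384
compute_units = 8
warp_size = 32  # threads per warp on NVIDIA hardware

# Streaming processors available on each compute unit.
cores_per_unit = total_cores // compute_units

# ASSUMPTION (possibly wrong): only one warp is issued
# per compute unit on each cycle.
idle_if_one_warp = cores_per_unit - warp_size

print(cores_per_unit)    # 48
print(idle_if_one_warp)  # 16
```

If that single-warp assumption does not hold, the 16 "idle" SPs may simply be an artifact of how I am counting.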