As far as we can tell, the problem is that the kernel uses too many registers per thread. As a result, the kernel fails even when we set the block size to a small value such as 8 x 8....
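
To make the register argument concrete, here is a rough host-side sketch of the arithmetic. It assumes the only limit is a per-block register budget, and it ignores register allocation granularity, shared memory, and every other launch limit. The helper names `maxThreadsByRegisters` and `blockFits` are hypothetical, and the numbers in the note below are made up for illustration; the real values should come from the compiler's resource usage report and the device's `regsPerBlock` property.

```cpp
#include <cassert>

// Hypothetical helper: upper bound on threads per block imposed by
// register usage alone. regsPerBlock is an assumed per-block register
// budget; regsPerThread is what the compiler reports for the kernel.
inline int maxThreadsByRegisters(int regsPerThread, int regsPerBlock) {
    assert(regsPerThread > 0 && regsPerBlock >= 0);
    return regsPerBlock / regsPerThread;  // integer division rounds down
}

// Hypothetical helper: true if a blockX x blockY block stays within
// the register budget under the simplified model above.
inline bool blockFits(int blockX, int blockY,
                      int regsPerThread, int regsPerBlock) {
    assert(blockX > 0 && blockY > 0);
    return blockX * blockY <= maxThreadsByRegisters(regsPerThread, regsPerBlock);
}
```

For example, with an assumed budget of 8192 registers per block and a hypothetical 130 registers per thread, `maxThreadsByRegisters(130, 8192)` gives 63, so an 8 x 8 block (64 threads) would not fit under this model.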