I have written both an OpenCL and a CUDA version of my CFD code. Both use the same algorithm and, as far as I can tell, the same optimization level, since I did not pass any extra flags when compiling either one.
I expected the two versions to run at roughly the same speed on the same machine. However, on a GTX 260 the CUDA version is about 2 times faster than the OpenCL version (3:1 speed).
Did I do something wrong, or is this difference expected? Could someone give me some suggestions? Any thoughts would be appreciated!