I compared the performance of the BlackScholes implementations in CUDA and OpenCL that ship with the NVIDIA SDK.
The kernels are identical except that the CUDA implementation uses the intrinsic functions __expf and __logf, whereas the OpenCL implementation uses the standard exp and log. In this configuration, the OpenCL kernel takes about twice as long to execute as the CUDA kernel.
Switching the OpenCL kernel to the native functions (native_exp and native_log) makes it as fast as the CUDA one, but the results become less accurate.
I wrote a small kernel that only computes the exponential of its input and translated it to PTX, once with and once without the native_ prefix. With native_exp, the call appears to be translated to ex2.approx.f32, whereas the non-native exp is translated to a long sequence of instructions (which, strangely, also includes ex2.approx.f32).
The same can be observed when the CUDA code is translated to PTX. However, the CUDA implementation using __expf is as accurate as the one using expf, whereas in OpenCL the native function is much less accurate than the non-native one.
Any ideas why that is the case? How well are native_ function calls supported in the NVIDIA OpenCL SDK?
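For anyone who wants to reproduce the accuracy comparison, here is a minimal host-side sketch in C of one way to quantify "less accurate". It assumes the outputs of the two kernels have already been copied back into plain float arrays and that all values are finite. The function names (ordered_bits, ulp_distance, max_ulp_error) are my own and are not taken from the SDK samples. It reports the largest difference in units in the last place (ULP) between a reference array and a test array.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto an ordered integer line so that
   adjacent representable floats differ by exactly 1.
   Assumes IEEE 754 single precision and finite inputs (no NaN). */
int64_t ordered_bits(float x) {
    int32_t i;
    memcpy(&i, &x, sizeof i);  /* reinterpret bits without aliasing issues */
    /* Negative floats have the sign bit set; mirror them below zero. */
    return (i < 0) ? (int64_t)INT32_MIN - i : (int64_t)i;
}

/* Number of representable floats between a and b (0 if equal). */
int64_t ulp_distance(float a, float b) {
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

/* Largest ULP distance between corresponding elements of two arrays,
   e.g. ref = output of the exp kernel, test = output of the native_exp kernel. */
int64_t max_ulp_error(const float *ref, const float *test, size_t n) {
    int64_t worst = 0;
    for (size_t i = 0; i < n; ++i) {
        int64_t d = ulp_distance(ref[i], test[i]);
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

Calling max_ulp_error once with the native_exp output and once with the __expf output, each against the same reference array, would put both accuracy gaps on the same scale.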