Questions on OpenCL Built-in functions?
I am trying to use the OpenCL built-in function "rotate" in one of my kernels, as shown below, but I get the following errors when compiling:
clBuildProgram Error for -11 Error Number
error: call to 'rotate' is ambiguous
W = rotate(A, 5);
<built-in>:2784:22: note: candidate function
int __OVERLOADABLE__ rotate(int, int);
<built-in>:2785:23: note: candidate function
uint __OVERLOADABLE__ rotate(uint, uint);
<built-in>:2780:23: note: candidate function
char __OVERLOADABLE__ rotate(char, char);
<built-in>:2781:24: note: candidate function
uchar __OVERLOADABLE__ rotate(uchar, uchar);
<built-in>:2782:24: note: candidate function
short __OVERLOADABLE__ rotate(short, short);
<built-in>:2783:25: note: candidate function
ushort __OVERLOADABLE__ rotate(ushort, ushort);
(and similar candidates for the other data types)
The kernel I used for the demo is:
__kernel void demoKernel(__global unsigned int *d_io_2d, long max_size)
{
    long i = get_global_id(0);
    if (i >= max_size)
        return;
    unsigned int A;
    A = d_io_2d[i];
    unsigned int W;
    W = rotate(A, 5);
}
My questions are:
1. How do I use OpenCL built-in functions? (In the example above, what am I missing?)
2. What is the purpose of the built-in functions?
3. Do they improve performance?
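1. Try rotate(A, (uint)5). The literal 5 has type int while A is a uint, so the call matches none of the overloads exactly and the compiler cannot choose between them.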
2. Some of them can use dedicated hardware circuitry (e.g. rsqrt, mad). They also circumvent the fact that OpenCL C has no standard library.
3. At least they don't degrade performance. You can expect that they are optimally implemented, i.e. you won't be able to do better.
Originally Posted by utnapishtim
1. After trying rotate(A, (uint)5), my kernel compiled and I got the correct result.
3. My implementation of the rotate function is:
uint rotate1(int n, uint x)
{
    return (x << n) | (x >> (32-n));
}
When I benchmarked my application with the built-in "rotate" function and with the user-defined "rotate1" function (shown above), the built-in "rotate" was slightly slower than "rotate1".
How, then, can we expect the built-ins to be optimally implemented?
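Before comparing timings, it helps to confirm that both versions compute the same result. The following is a minimal host-side sketch in plain C (not OpenCL C), assuming 32-bit values. rotl_ref and rotate1_matches_reference are hypothetical helper names, and rotate1 is a port of the function above, exercised only for counts 1 to 31, where its shifts are well defined in C:

```c
#include <stdint.h>

/* Host-side port of the thread's rotate1(); note the (count, value) order. */
static uint32_t rotate1(int n, uint32_t x)
{
    /* In plain C this is well defined only for 1 <= n <= 31. */
    return (x << n) | (x >> (32 - n));
}

/* Slow reference (hypothetical helper): rotate left one bit at a time. */
static uint32_t rotl_ref(uint32_t x, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        x = (x << 1) | (x >> 31);
    return x;
}

/* Returns 1 if rotate1 agrees with the reference for every count in 1..31. */
static int rotate1_matches_reference(uint32_t x)
{
    for (int n = 1; n <= 31; ++n)
        if (rotate1(n, x) != rotl_ref(x, (unsigned)n))
            return 0;
    return 1;
}
```

A check like this only establishes that the two functions agree on the tested inputs; it says nothing about which one is faster on a given device.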
Please note that your implementation of rotate1() breaks down when n>32, whereas the built-in function rotate() is guaranteed to work for any n.
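The "works for any n" property can be illustrated on the host. This is a minimal sketch in plain C (not OpenCL C), assuming a 32-bit value; the name rotl32 is made up for this example and is not part of any OpenCL API. It reduces the count modulo 32 before shifting, so counts of 0, 32 or more stay well defined:

```c
#include <stdint.h>

/* Rotate x left by n bits. Any n is accepted; only n modulo 32 matters. */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31u;  /* reduce the count modulo 32 */
    /* The second mask turns a shift by 32 (when n == 0) into a shift by 0. */
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

With this helper, rotl32(0x80000001u, 1) gives 0x00000003, and a count of 33 gives the same result as a count of 1.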
I have checked the assembly code produced by three calls in a row to rotate() with NVIDIA OpenCL.
Using built-in rotate(), 17 instructions are generated (and my NVIDIA hardware has no rotate instruction).
Using your function rotate1(), 19 instructions are generated.
So using built-in rotate() is safer and faster.
Furthermore, if this code were compiled for a device that has a rotate instruction, you would have to rely on the compiler to detect that "(x << n) | (x >> (32-n))" is in fact a rotation that can be replaced by that instruction.