Hi,

I'm new to OpenCL and I consider using it for some graphics computation where using an OpenGL shader seems not to be natural. Before I actually do so I thought I'd try how much of a performance improvement I could get using OpenCL on my Nvidia GTX 460 over my CPU. For this reason, I implemented a simple skeleton skinning algorithm, once on the CPU, without multithreading but using the Eigen library, which provides SSE-optimized vector and matrix libraries, and once in an OpenCL kernel executing on the GPU. The vertices, bone matrices etc. are generated randomly on application start. I repeat the whole skinning several times so that it executes long enough to get meaningful timing results.

First I simply tried a kernel where I have as much work-items as I have vertices, each one generating one output vertex. I quickly saw that this is not a good idea because performance was even worse than on the CPU. I figured this was in essence a problem of too many memory accesses, mainly to the bone matrices, which are an array of float16-vectors that is addressed four times in each work-item. Then I changed the algorithm so that each work-item handles multiple output vertices, one after the other, so that I have less work-items. In each work-group I create a copy of the bone matrices in local space, and further accesses to these matrices come from local space. The interesting part of my C++ code looks like this:

Code :#define NUM_BONES 30 #define NUM_VERTICES 30000 #define NUM_VERTICES_PER_WORK_ITEM 100 #define NUM_ANIM_REPEAT 1000 uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices) { File kernelFile("/home/alemariusnexus/test/skelanim.cl"); char opts[256]; sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM); cl_program prog = BuildOpenCLProgram(kernelFile, opts); cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL); cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL); cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL); cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL); cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL); cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, NUM_VERTICES*sizeof(Vector4), NULL, NULL); uint64_t s, e; s = GetTickcount(); clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf); clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf); clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf); clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf); clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf); size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM }; size_t localWorkSize[] = { NUM_BONES }; for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) { clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL); } clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL); e = GetTickcount(); return e-s; }

The associated program/kernel looks like this:

Code :inline float4 MultiplyMatrixVector(float16 m, float4 v) { return (float4) ( dot(m.s048C, v), dot(m.s159D, v), dot(m.s26AE, v), dot(m.s37BF, v) ); } kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices) { int gid = get_global_id(0); int lid = get_local_id(0); local float16 lBoneMats[NUM_BONES]; lBoneMats[lid] = boneMats[lid]; barrier(CLK_LOCAL_MEM_FENCE); for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) { int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i; float4 vertex = vertices[vidx]; float4 w = weights[vidx]; uint4 idx = indices[vidx]; resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x) + MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y) + MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z) + MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w)); } }

Now, per work-item I have only one access to the global boneMats, when I create the local copy, and it's even a lot less work-items executing altogether. Then I have NUM_VERTICES_PER_WORK_ITEM*4 accesses to the local array afterwards. As I understand, local memory should be way faster than global memory, so I thought this would greatly improve performance. Well, the opposite is the cause: When I let lBoneMats alias to the global boneMats instead, I get actually better performance than with the kernel listed above.

What did I get wrong here?

Thanks in advance!