For a master's thesis that is close to submission, we urgently need some help with cache utilization on the graphics card.
We're developing a rather complex spring-mass system using OpenCL. It is based on a tetrahedral topology and uses springs on edges, triangles and tetrahedra. Each spring applies forces to its adjacent vertices, and these forces are accumulated per vertex in a reduction operation.
Here's a brief description of the general algorithm:
- Compute forces on vertices per edge and store in temp buffer
- Compute forces on vertices per triangle and store in temp buffer
- Compute forces on vertices per tetrahedron and store in temp buffer
- Accumulate forces per vertex from temp buffers
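To make the first step concrete, here is a minimal CPU-side sketch in Python for edge springs only. It is not our actual kernel: the linear (Hooke) spring model, the names `edge_forces`, `rest_lengths` and `stiffness`, and the layout of two temp-buffer slots per edge are all assumptions chosen for illustration. The loop body corresponds to one work-item per edge.

```python
import math


def edge_forces(positions, edges, rest_lengths, stiffness):
    """Compute the two endpoint forces of every edge spring.

    Assumed layout (hypothetical): slot 2*e holds the force on the first
    vertex of edge e, slot 2*e + 1 the force on the second vertex.
    """
    temp = [(0.0, 0.0, 0.0)] * (2 * len(edges))
    for e, (i, j) in enumerate(edges):
        # Indirect reads: i and j are arbitrary vertex indices, so these
        # two position lookups are the scattered accesses.
        pi, pj = positions[i], positions[j]
        d = (pj[0] - pi[0], pj[1] - pi[1], pj[2] - pi[2])
        length = math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
        if length == 0.0:
            continue  # degenerate edge: no defined direction, leave zeros
        # Hooke's law (assumed model): magnitude k * (length - rest).
        scale = stiffness * (length - rest_lengths[e]) / length
        f = (scale * d[0], scale * d[1], scale * d[2])
        temp[2 * e] = f                          # pulls i toward j if stretched
        temp[2 * e + 1] = (-f[0], -f[1], -f[2])  # equal and opposite on j
    return temp
```

The writes to `temp` are contiguous per edge; only the reads from `positions` depend on the (unordered) vertex indices.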
The problem is that in the first three kernels, each edge/triangle/tetrahedron must look up the positions of its adjacent vertices. The vertex indices are unordered, so these reads are scattered across memory and hardly benefit from the cache. In the last kernel it's even worse: each vertex needs to look up the forces from all adjacent edges/triangles/tetrahedra. For this, we use three index arrays (one per element type) that point to the elements, and then fetch the force vectors from the temp buffers written by the previous kernels. These lookups are also essentially random and don't benefit from the cache either.
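The accumulation step can be sketched in the same way, again in Python and for edges only. The flat `offsets`/`slots` layout (one contiguous run of temp-buffer slot indices per vertex) and the function names are assumptions for illustration; our real index arrays may be organized differently. The sketch assumes two temp-buffer slots per edge, slot 2*e for the first vertex and 2*e + 1 for the second.

```python
def build_vertex_slots(num_vertices, edges):
    """Build per-vertex index arrays into the edge temp buffer.

    Returns (offsets, slots) such that slots[offsets[v]:offsets[v + 1]]
    lists the temp-buffer slots that hold forces acting on vertex v.
    """
    counts = [0] * num_vertices
    for i, j in edges:
        counts[i] += 1
        counts[j] += 1
    # Prefix sum of the counts gives the start of each vertex's run.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]
    cursor = offsets[:-1]  # slicing copies; next free position per vertex
    slots = [0] * offsets[-1]
    for e, (i, j) in enumerate(edges):
        slots[cursor[i]] = 2 * e
        cursor[i] += 1
        slots[cursor[j]] = 2 * e + 1
        cursor[j] += 1
    return offsets, slots


def accumulate(offsets, slots, temp):
    """Gather: one 'work-item' per vertex sums its forces from temp."""
    totals = []
    for v in range(len(offsets) - 1):
        fx = fy = fz = 0.0
        for k in range(offsets[v], offsets[v + 1]):
            f = temp[slots[k]]  # scattered read: slots[k] is arbitrary
            fx += f[0]
            fy += f[1]
            fz += f[2]
        totals.append((fx, fy, fz))
    return totals
```

Reading `slots` is sequential per vertex, but each `temp[slots[k]]` can land anywhere in the temp buffer, which is exactly the access pattern the profiler is complaining about.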
The AMD profiler tells us that the cache hit rate is close to 0% and that ALU utilization is only about 10%, which isn't surprising, as the ALUs sit idle while waiting for global memory reads.
So, does anybody have suggestions on how to optimize the memory access? We believe this is a well-known problem (probably also in applications other than spring-mass simulations).
Any help is really appreciated!