
Thread: Profiling Code

  1. #1

    Profiling Code

    Ok, so I have my code running, but I have to say I'm disappointed with the performance.

    It's a particle system and I get the following approximate performance statistics:

    Scalar version on the CPU: 1 Million particles per second.
    GPGPU version using GLSL: 55 Million particles per second.
    OpenCL version on CPU: 5 Million particles per second.
    OpenCL version on GPU: 4 Million particles per second.

    OpenCL on the CPU seems about right. I'm doing calculations on 3-component float vectors (stored in float4s) and I'm on a Core 2 Duo, so two cores. Two cores times four SSE lanes, with only three of the four components doing useful work, gives a theoretical maximum of about a six-times speed-up, and that's not including the fact that there's some unavoidable scalar calculation. I'm happy with that result.

    The problem is obviously the GPU-based OpenCL. At 4M vs 55M it's roughly 14x slower than my GPGPU implementation, and it's even slower than the CPU OpenCL. Obviously something is going very wrong. I suspect it's down to memory access, but I don't know for sure.

    How can I find out what is making my code slow?
    What profiling tools are there?

    I'm currently on Snow Leopard, but could probably get my code to Linux if there were better tools there.

  2. #2
    Junior Member
    Join Date
    Sep 2009
    Posts
    8

    Re: Profiling Code

    Can you post the kernel you're using? What work-sizes are you using with it?

    Beyond enabling profiling on your command queue (CL_QUEUE_PROFILING_ENABLE, which probably isn't too helpful here) there isn't a standard way of profiling.

    Have you tried commenting out various parts of your kernel and seeing where the 'slowness' comes from?

  3. #3
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Profiling Code

    Work-group size is a really important factor. If you're setting it to 1 you will get terrible utilization on GPUs! You also need to make sure that the data transfer is not killing you. On the CPU the data doesn't have to move over the PCI bus, but it does on the GPU. This means you want to move as little data as possible and do a lot of computation. Given that your GLSL performance is high I doubt this is the issue, though.

  4. #4

    Re: Profiling Code

    This is the kernel:
    Code :
    kernel void particle(const int numberOfGalaxies,
     
                         global float4 * galaxyPositions,
                         global float * galaxyMasses,
                         float G,
                         float dT,
     
                         global float4 * starPositions,
                         global float4 * starVelocities,
     
                         global float4 * newStarPositions,
                         global float4 * newStarVelocities) {
     
        int gid = get_global_id(0);
     
        float4 starPosition = starPositions[gid];
        float4 starVelocity = starVelocities[gid];
     
        for (int galaxy = 0; galaxy < numberOfGalaxies; galaxy++) {
           float4 galaxyPosition = galaxyPositions[galaxy];
           float  galaxyMass     = galaxyMasses[galaxy];
     
           float4 dP = starPosition - galaxyPosition;
           float d = length(dP);
           float acceleration = galaxyMass * G / (d * d * d) * dT; 
     
           starVelocity = starVelocity - dP * acceleration;
        }
     
        starPosition = starPosition + starVelocity * dT;
     
        newStarPositions[gid]  = starPosition;
        newStarVelocities[gid] = starVelocity;
    }

    Each time I enqueue this I use a 1-dimensional, million-item global work-size. I've been leaving the local work-size as NULL to allow the driver to decide an optimal value.
    • starPositions/starVelocities are read/write buffer objects, but used for input only.
    • newStarPositions/newStarVelocities are read/write buffer objects, but used for output only.
    • After each iteration the buffer objects are swapped new<->old, avoiding any copies.
    • One call to enqueueReadBuffer is performed on newStarPositions after an iteration to get the locations back for display. It's non-blocking, but I do wait for the event before display. Removing this read and the display routines doesn't make a huge difference.
    • The main load appears to be the loop. I would have liked to put the galaxyPositions/galaxyMasses into constant space, but making that change (i.e. changing global to constant) crashes the compiler.
    • numberOfGalaxies = 20.
    • The speeds I gave before were for 3-element vectors. I changed it to 4 as an experiment and got a speed increase of about +1 million particles per second on both the CPU and GPU.

  5. #5
    Junior Member
    Join Date
    Sep 2009
    Posts
    8

    Re: Profiling Code

    Hmm, the code looks quite sensible and there are no barriers/fences in there that might mess things up. My best guess is that the reads from the non-const input arrays are causing something bad (and unneeded) to happen with the memory/cache on the device. This would hurt quite a lot, especially given that every work-item reads every single galaxy.

    Does changing "global float4 *" to "global const float4 *" change the timings at all? It might let the compiler optimize the loads better. You mentioned as well that the buffers are both read/write. Did you do a comparison at all with marking them either read or write exclusively and doing copies? Might be interesting to see if that makes any difference.

    I'm speculating here, so I could be miles off though!

    Alan

  6. #6

    Re: Profiling Code

    Does changing "global float4 *" to "global const float4 *" change the timings at all?
    It brings the GPU to approximate parity with the CPU (6M each). Worthwhile, but it's not the order of magnitude I'm looking for.

    You mentioned as well that the buffers are both read/write. Did you do a comparison at all with marking them either read or write exclusively and doing copies? Might be interesting to see if that makes any difference.
    I just gave it a go, and it cost me a little (about 300k particles a second on the GPU - about 800k on the CPU). I have also tried just having a single buffer that's read and written to, and that didn't seem to be a win or a loss. I was wondering if I might benefit from a smaller cache footprint.

  7. #7
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Profiling Code

    Paul,
    Use async_work_group_copy to copy your 20 galaxies into local memory. That should give you a tremendous speed boost. Currently you are reading the same data from global memory for every operation, which is probably causing a huge slowdown. (On the CPU this data gets put in the cache, so you don't see the hit as much.)

  8. #8

    Re: Profiling Code

    That's a big improvement. Up to about 15.5 Million now, so we're nearly 4x from where we started. The kernel now looks like this:
    Code :
    kernel void particle(const int numberOfGalaxies,
     
                         global const float4 * galaxyPositions,
                         global const float * galaxyMasses,
     
                         local float4 * localGalaxyPositions,
                         local float  * localGalaxyMasses,
     
                         const float G,
                         const float dT,
     
                         global const float4 * starPositions,
                         global const float4 * starVelocities,
     
                         global float4 * newStarPositions,
                         global float4 * newStarVelocities) {
     
        event_t galaxyEvent[2];
     
        galaxyEvent[0] = async_work_group_copy(localGalaxyPositions, galaxyPositions, numberOfGalaxies, 0);
        galaxyEvent[1] = async_work_group_copy(localGalaxyMasses, galaxyMasses, numberOfGalaxies, 0);
     
        int gid = get_global_id(0);
     
        float4 starPosition = starPositions[gid];
        float4 starVelocity = starVelocities[gid];
     
        wait_group_events(2, galaxyEvent);
     
        for (int galaxy = 0; galaxy < numberOfGalaxies; galaxy++) {
           float4 galaxyPosition = localGalaxyPositions[galaxy];
           float  galaxyMass     = localGalaxyMasses[galaxy];
     
           float4 dP = starPosition - galaxyPosition;
           float d = length(dP);
           float acceleration = galaxyMass * G / (d * d * d) * dT; 
     
           starVelocity = starVelocity - dP * acceleration;
        }
     
        starPosition = starPosition + starVelocity * dT;
     
        newStarPositions[gid]  = starPosition;
        newStarVelocities[gid] = starVelocity;
    }
    It took me a little while to work out how to deal with the local memory. Does that look ok to people?
    I set the local-memory kernel arguments by passing the buffer size with a NULL pointer:
    Code :
        err = clSetKernelArg(particleKernel, 3, sizeof(float) * 4 * numberOfGalaxies, NULL); // localGalaxyPositions
        err = clSetKernelArg(particleKernel, 4, sizeof(float) * numberOfGalaxies, NULL);     // localGalaxyMasses

    I should add that the async copy cost a little performance on the CPU, but not much.

  9. #9

    Re: Profiling Code

    I've now taken that improvement over to a version that shares an OpenGL VBO as the starPosition memory buffer object. This eliminates the read back and re-submit of all the position data to display the system, which was now becoming significant.

    We're now up to 30 Million particles per second, which is within a factor of 2 of my best GPGPU results on this machine, and 7.5 times what we started with.

    I expect some of the difference vs. my GLSL kernel is that the galaxyPositions/Masses were defined as uniforms, so they could be loaded once into fast memory and left alone. This kernel has to copy them to local memory for each work-group. Am I right in saying that loading the values into constant memory would have the same effect?

    Not sure how much more there is to squeeze out of this, but it's been an interesting experiment. Hope it's been useful to others too.

    Could still do with some form of profiling; I know it's not that easy, though.

  10. #10
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Profiling Code

    Paul,

    You should put in a memory barrier after your async_work_group_copy calls to make sure all outstanding memory accesses across the work-group are done before your kernel continues. (This shouldn't matter on current hardware, but may be needed in the future.)

    The other thing you should do is enable MADs. Pass the -cl-mad-enable build option to clBuildProgram to allow the compiler to use the mad instruction. This is off by default in CL, but on by default in GLSL, so you may be able to get a boost out of that.


