
Thread: Profiling Code

  1. #11

    Re: Profiling Code

    You should put in a memory barrier after your async_work_group copies to make sure all outstanding memory accesses across the work-group are done before your kernel continues. (This shouldn't matter on current hardware, but may be needed in the future.)
    I thought that was what I was doing with the wait_group_events call. Would a memory barrier do something that doesn't?

    The other thing you should do is enable MADs
    I'll give it a go and report back when I have some time.
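    For reference, a minimal host-side sketch of how MADs are usually enabled, via the -cl-mad-enable build option; the helper name is made up, and the program and device handles are assumed to come from the usual setup:

    #include <CL/cl.h>

    /* Sketch: build an existing program with mul+add fusion allowed.
       "program" and "device" are assumed to come from the usual
       clCreateProgramWithSource / clGetDeviceIDs setup. */
    static cl_int build_with_mads(cl_program program, cl_device_id device)
    {
        return clBuildProgram(program, 1, &device, "-cl-mad-enable", NULL, NULL);
    }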

  2. #12
    Senior Member (joined Jul 2009, Northern Europe, 311 posts)

    Re: Profiling Code

    The wait_group_events waits for the copies to finish, but does not guarantee that the memory operations are finished; they may still be in flight after the copies complete. You need to make sure that all memory operations have finished before any member of the work-group starts using the data.
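    To make that concrete, here is a minimal kernel sketch of the copy-wait-barrier pattern; the argument names and the cacheCount parameter are only illustrative:

    // Sketch of the suggested pattern (names are illustrative).
    __kernel void use_cached_positions(__global const float4 *galaxyPositions,
                                       __local float4 *cache,
                                       const int cacheCount)
    {
        // Start the work-group-wide copy into local memory.
        event_t ev = async_work_group_copy(cache, galaxyPositions,
                                           (size_t)cacheCount, 0);

        // Wait for the copy to be reported complete...
        wait_group_events(1, &ev);

        // ...then fence local memory so no work-item reads the cache before
        // every outstanding write has landed (the extra step suggested above).
        barrier(CLK_LOCAL_MEM_FENCE);

        // cache[0..cacheCount-1] is now safe for the whole work-group to read.
    }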

  3. #13

    Re: Profiling Code

    The wait_group_events waits for the copies to finish, but does not guarantee that the memory operations are finished
    That's certainly a subtle distinction, but worth being aware of.

    Enabling MADs seems to reduce the code size and register usage a bit (I've discovered that reading back the binary gives you an intermediate representation of the compiled form), but there's no measurable difference in speed.

    I suspect something else is the bottleneck here.
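    As an aside, a host-side sketch of how the compiled binary mentioned above can be read back for inspection; the helper name is made up, a single-device program is assumed, and error checking is trimmed:

    #include <stdlib.h>
    #include <CL/cl.h>

    /* Sketch: fetch the program binary for the first device so it can be
       dumped to a file and inspected. */
    static unsigned char *get_program_binary(cl_program program, size_t *size_out)
    {
        size_t binary_size = 0;
        clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                         sizeof(binary_size), &binary_size, NULL);

        unsigned char *binary = malloc(binary_size);
        unsigned char *binaries[1] = { binary };
        clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                         sizeof(binaries), binaries, NULL);

        *size_out = binary_size;
        return binary;
    }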

  4. #14
    Senior Member (joined Jul 2009, Northern Europe, 311 posts)

    Re: Profiling Code

    Paul, what is your global size? If it's big (>10k), you could try rewriting your kernel so that each work-item processes X positions, and shrink your global size by a factor of X. You want to make sure your global size stays above ~1k, but beyond ~4k you won't see much benefit (and may even see a small decrease in performance) from using much larger sizes. This is because the hardware only supports a certain maximum kernel launch size, and to handle larger sizes the runtime has to break the work up into multiple separate launches, each of which carries some overhead. This might be worth experimenting with, because it also lets you amortize the get_local_id overhead that OpenCL has vs. GLSL.
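    A minimal sketch of that restructuring, with an illustrative POSITIONS_PER_ITEM factor and a placeholder update standing in for the real per-position work:

    #define POSITIONS_PER_ITEM 4   // illustrative batching factor

    // Sketch: each work-item handles POSITIONS_PER_ITEM consecutive positions,
    // so the host launches totalPositions / POSITIONS_PER_ITEM work-items.
    __kernel void integrate(__global float4 *positions,
                            __global const float4 *velocities,
                            const float dt,
                            const int totalPositions)
    {
        const int base = get_global_id(0) * POSITIONS_PER_ITEM;

        for (int k = 0; k < POSITIONS_PER_ITEM; ++k) {
            const int i = base + k;
            if (i >= totalPositions)   // guard the tail if sizes don't divide evenly
                break;
            positions[i] += velocities[i] * dt;   // placeholder for the real work
        }
    }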

  5. #15
    Senior Member (joined Sep 2002, Santa Clara, 105 posts)

    Re: Profiling Code

    Paul,

    You should try using constant memory for galaxyPositions and galaxyMasses. Declare them as

    constant float4 *galaxyPositions,
    constant float *galaxyMasses.

    This will be similar to uniforms in GLSL. If you do this, you no longer need the async_work_group_copy and wait_group_events calls.

    Another interesting thing to try is to use fast_length instead of length.

    Also, do you really need a float4 for positions and velocities, or do you only need 3 of the float4's components?
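    A sketch of what that could look like, folding in the fast_length suggestion as well; the kernel body and the bodyPositions/accelerations arguments are just placeholders:

    // Sketch: galaxy data read straight from constant memory, so no
    // async_work_group_copy / wait_group_events is needed. Body is illustrative.
    __kernel void accumulate_accel(constant float4 *galaxyPositions,
                                   constant float *galaxyMasses,
                                   __global const float4 *bodyPositions,
                                   __global float4 *accelerations,
                                   const int galaxyCount)
    {
        const int gid = get_global_id(0);
        const float4 myPos = bodyPositions[gid];
        float4 accel = (float4)(0.0f);

        for (int j = 0; j < galaxyCount; ++j) {
            const float4 d = galaxyPositions[j] - myPos;
            const float r = fast_length(d);   // cheaper, lower-precision length
            accel += d * (galaxyMasses[j] / (r * r * r + 1e-6f));  // softened to avoid divide-by-zero
        }
        accelerations[gid] = accel;
    }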

  6. #16

    Re: Profiling Code

    Pulling this up from the depths, as I've been working on other things the last couple of weeks, but wanted to reply to these points.

    You should try using constant memory for galaxyPositions and galaxyMasses. Declare them as

    constant float4 *galaxyPositions,
    constant float *galaxyMasses.

    This will be similar to uniforms in GLSL. If you do this, you no longer need the async_work_group_copy and wait_group_events calls.
    On Snow Leopard at the moment, declaring those parameters as constant buffers causes the compiler to crash. It seems to come down to using a loop counter to index into them. This has been reported to Apple, and I've had a response back saying it's a known issue.

    Another interesting thing to try is to use fast_length instead of length.
    Just gave that a quick try and it does boost performance a little. Thanks.

    Also, do you really need a float4 for positions and velocities, or do you only need 3 of the float4's components?
    No, I don't, but using float4s is faster. You have to jump through hoops to load and store 3-component vectors, and that costs. My code originally used GLSL vec3s; I moved it to vec4s early on for a fairer comparison, though. If OpenCL had a 3-component vector type, I'd be using it.
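    To illustrate the hoop-jumping, here's a small sketch contrasting the two layouts; the helper names are made up:

    // Tightly packed xyz data: with no 3-component type, each vector has to be
    // reassembled from three scalar loads.
    float4 load_packed3(__global const float *p, int i)
    {
        return (float4)(p[3 * i + 0], p[3 * i + 1], p[3 * i + 2], 0.0f);
    }

    // Padded float4 data: a single aligned vector load.
    float4 load_padded4(__global const float4 *p, int i)
    {
        return p[i];
    }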
