Page 1 of 2 12 LastLast
Results 1 to 10 of 13

Thread: Workgroups and global IDs

  1. #1
    Junior Member
    Join Date
    Jul 2011
    Posts
    6

    Workgroups and global IDs

    I have a bit of a beginners question on how to set the number of work items correctly for my program.

    My program does some calculations on a grid, and my parallel calculation runs column wise. I need all the calculations on a column to finish before I start the next, so I want a single work group with a barrier, and every work item then processes a single row, with a barrier after it finishes working on each column. I cannot use more than one work group as each cell depends on all values in the previously calculated columns in the grid.

    I was then planning on using the global_id (get_global_id(0)) to basically set the row number for each work item, but if my program has a very large number of rows (e.g. 250k ,or 2.5m) I'm not sure that I can do this.

    Can I use the global_id in this way, and if so how would I call clEnqueueNDRangeKernel?
    Does my program structure sound sensible?

    Many thanks

  2. #2
    Member
    Join Date
    Jul 2011
    Location
    Moscow, Russia
    Posts
    41

    Re: Workgroups and global IDs

    Do you understand that the performance with single workgroup willl be very poor? The kernel will utilize just one compute unit. While AMD 6950 has 22 CUs, GF114 has 8 CUs. Even Intel Core i2500 has 4 cores.
    Blog (in russian)

  3. #3
    Junior Member
    Join Date
    Jul 2011
    Posts
    6

    Re: Workgroups and global IDs

    I do understand thanks, but as I need the synchronisation I don't believe I can split my work to use more than 1 - it's a dynamic programming problem, and as such each row and column depend on all the previous rows and columns.

    Cheers

  4. #4
    Member
    Join Date
    Jul 2011
    Location
    Moscow, Russia
    Posts
    41

    Re: Workgroups and global IDs

    Quote Originally Posted by stephennt
    I do understand thanks, but as I need the synchronisation I don't believe I can split my work to use more than 1 - it's a dynamic programming problem, and as such each row and column depend on all the previous rows and columns.

    Cheers
    What about organizing work in the following way:
    1) The kernel generates one value for the single column using values from previous columns generated by previous kernel runs.
    2) Global work size (number of work items) is the row count. 250k is much more than enough to make device busy.
    3) The amount of kernel runs (enqeueNDRange) is equal to the number of columns.

    But wait... Do you mean that the value for the specific cell is dependent not only on the values in the previous columns but on some values in the same column?
    Blog (in russian)

  5. #5
    Junior Member
    Join Date
    Jul 2011
    Posts
    6

    Re: Workgroups and global IDs

    The value is only dependent upon previous columns

    This sounds like an interesting suggestion - enqueue guarantees that the commands are executed in order, so it should work.

  6. #6
    Junior Member
    Join Date
    Jul 2011
    Posts
    6

    Re: Workgroups and global IDs

    I knocked up a quick version using the method you suggested. The results are quite interesting as if I run this OpenCL code on CPU (max workgroup size = 1) it's about 25% faster than a software equivalent; if I run it on my GPU (max workgroup size = 320) it's about 50% slower.

    My approach with barriers on the other hand runs about 80% faster on CPU (OpenCL) and 15% faster on GPU, relative to my pure software benchmark.

    Strange results, as in theory the GPU should blitz this, but I guess the memory xfers are too expensive.

  7. #7
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Workgroups and global IDs

    I don't understand how the workload looks like. You've described it as a grid. That's all I get. What are the columns? What are the rows? Can you execute all items in a column in parallel? And once that is finished you can execute the next column?

    What is the typical height and width of that grid?
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  8. #8
    Member
    Join Date
    Jul 2011
    Location
    Moscow, Russia
    Posts
    41

    Re: Workgroups and global IDs

    Quote Originally Posted by stephennt
    I knocked up a quick version using the method you suggested. The results are quite interesting as if I run this OpenCL code on CPU (max workgroup size = 1) it's about 25% faster than a software equivalent; if I run it on my GPU (max workgroup size = 320) it's about 50% slower.
    What is the global work size?
    Blog (in russian)

  9. #9
    Junior Member
    Join Date
    Jul 2011
    Posts
    6

    Re: Workgroups and global IDs

    @david:
    On my test I was running on a grid about 1000 rows * 10k columns.
    Every item in a column can be executed in parallel. So i have two approaches at the moment:
    1) single work group, every work item is assigned to a row (so the entire work group works on a column). I then use a barrier to block before the work item moves onto the next column.
    I call enqueue numCols/workGroupSize times, and the global and local parameters in enqueue are just set to workGroupSize (320 on my GPU for this kernel).

    2) suggestion as per Maxim, where the kernel works on a single column. I use multiple work groups to maximise the number of work items.
    I call enqueue numCols times, and the command queuing thus ensures that every column is processed in sequence.
    Perhaps if I rotated the grid such that there were more rows than columns this approach would be more performant.


    @Maxim
    as per your suggestion I set the global number to the number of rows in the grid (and NULL for the next param to let the implementation decide upon the best way to allocate work between groups).

  10. #10
    Member
    Join Date
    Jul 2011
    Location
    Moscow, Russia
    Posts
    41

    Re: Workgroups and global IDs

    Quote Originally Posted by stephennt
    @david:
    On my test I was running on a grid about 1000 rows * 10k columns.
    ...
    @Maxim
    as per your suggestion I set the global number to the number of rows in the grid (and NULL for the next param to let the implementation decide upon the best way to allocate work between groups).
    1) 1000 work items most probably is not enough to fully load high-end GPUs.
    2) Make global work-size to be multiple of 128. So if it is 10,000, then change it to 10,112. In fact, you can play here with that number, try changing it to 64 or 256. The reason is that some implementations are VERY bad in defining local worksize.
    Blog (in russian)

Page 1 of 2 12 LastLast

Similar Threads

  1. OpenCL Synchronization between workgroups.
    By abhisrn in forum OpenCL
    Replies: 3
    Last Post: 10-21-2012, 04:14 PM
  2. using workgroups
    By dlw in forum OpenCL
    Replies: 1
    Last Post: 07-06-2011, 03:15 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •