Regarding async_work_group_copy(global to local)



bharath.s.jois
03-05-2011, 03:44 PM
Hi folks,

I have a kernel where every work-item reads a particular element (of a data structure) from global memory.
In other words, all the threads executing the kernel use the data at the same address in global memory.

I am trying to use async_work_group_copy to get the data into shared memory as the first thing in the kernel. Also, as per the OpenCL specification, async_work_group_copy is executed by all threads.

Is the following possible?

One thread executes the async copy function and gets the data into shared memory, and the rest of the threads of the work-group use the data brought in by that one thread.

Or is it better to allow the cache to handle the data accesses in this case?

-- Bharath

david.garcia
03-05-2011, 05:35 PM
All work-items from the same work-group share the same local memory. async_work_group_copy() is a function that loads data from global memory into local memory and it is executed by all work-items in a work-group. In other words, all work-items in the work-group must call async_work_group_copy() with the same arguments.

After async_work_group_copy() has finished performing the memory transfer, all work-items in the work-group can read from local memory to access the data that was transferred.
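
Roughly, the usual pattern looks like this (a minimal sketch; the array size, the work-group size of 256, and the names are just examples, not taken from any particular code):

__kernel void example(__global const float *src, __global float *dst)
{
    // Assumes a work-group size of 256.
    __local float tile[256];

    // Every work-item in the work-group makes this exact same call.
    event_t evt = async_work_group_copy(tile, src, 256, (event_t)0);

    // Block until the whole transfer has completed.
    wait_group_events(1, &evt);

    // Now any work-item may read any element of tile, including elements
    // that were (conceptually) brought in by other work-items.
    size_t lid = get_local_id(0);
    dst[get_global_id(0)] = tile[lid] + tile[255 - lid];
}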

bharath.s.jois
03-05-2011, 06:02 PM
I am not sure if my understanding of the local memory and async copy is correct. If I may ask a few questions...

I would like to know why the requirement of "same arguments" comes in.

Eg: The kernel has the following lines...

__local char temp;
async_work_group_copy((__local char *)&temp, (__global char *)globalvar, 1, (event_t)0);

Assuming a work-group has 100 threads, how many variables are present in local memory due to the declaration "__local char temp"? Put another way, if I were able to print the value of &temp, would it be the same for all threads?

-- Bharath

david.garcia
03-05-2011, 07:18 PM
I would like to know why the requirement of "same arguments" comes in.

Short answer: because the OpenCL specification requires it.

Long answer: because all work-items in the work-group will perform the copy together. It's not a single thread doing the work. All threads collaborate.


Assuming a work-group has 100 threads, how many variables are present in local memory due to the declaration "__local char temp"?

Only one variable (one byte).


Put another way, if I were able to print the value of &temp, would it be the same for all threads?

Yes, it will be exactly the same.

bharath.s.jois
03-05-2011, 08:58 PM
I get the point. But when several threads try to access global memory, wouldn't there be clashes that further increase the time to complete the copy?

Also,

Assuming the number of threads in a work-group is 512 and 32 (e.g. a warp/wavefront) are scheduled at a time, it would be sufficient for the first 32 (actually, only 1 IMO) to perform the global-to-local copy. Am I right in thinking so?

david.garcia
03-05-2011, 09:20 PM
But when several threads try to access global memory, wouldn't there be clashes that further increase the time to complete the copy?

There should be no issue.


Assuming the number of threads in a work-group is 512 and 32 (e.g. a warp/wavefront) are scheduled at a time, it would be sufficient for the first 32 (actually, only 1 IMO) to perform the global-to-local copy. Am I right in thinking so?

Are you asking whether the copy is performed by a single warp? That doesn't have a single answer. For instance, I would expect some hardware to use a DMA engine for this while other designs would not.

bharath.s.jois
03-06-2011, 07:59 AM
Are you asking whether the copy is performed by a single warp? That doesn't have a single answer. For instance, I would expect some hardware to use a DMA engine for this while other designs would not.

You got my question right, but I don't think I understand the explanation. If the 1st warp that was scheduled already got the required data to the local memory, why would the later ones be required to do the same, since the required data is already present?

I was hoping for something close to prefetch (http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/prefetch.html) but to the shared memory.

Off topic, I am guessing that the global cache (L2 cache??) to which the prefetch gets the data is slower than the local (shared) memory. Is this right?

-- Bharath

david.garcia
03-06-2011, 01:08 PM
If the 1st warp that was scheduled already got the required data to the local memory, why would the later ones be required to do the same, since the required data is already present?

Because each warp will only do part of the copy. Again, this will be done differently in different hardware.


Off topic, I am guessing that the global cache (L2 cache??) to which the prefetch gets the data is slower than the local (shared) memory. Is this right?

I suggest referring to your hardware vendor's documentation. Some hardware doesn't even have a global memory cache.

bharath.s.jois
03-06-2011, 01:49 PM
Because each warp will only do part of the copy. Again, this will be done differently in different hardware.

So, there is no point in having all the threads execute the async copy unless they fetch different data, is there? What about the cases where the number of elements to be fetched is at most the size of a warp? Worse, the number of elements is just one, as in my case.

-- Bharath

david.garcia
03-06-2011, 05:40 PM
So, there is no point in having all the threads execute the async copy unless they fetch different data, is there? What about the cases where the number of elements to be fetched is at most the size of a warp? Worse, the number of elements is just one, as in my case.

If you are only copying something like a single int, then it's not worth putting that piece of data in local memory. And you are right, in that case a single warp would do all the work and the rest would be idle... assuming that your hardware doesn't use a DMA engine for global->local copies.

bharath.s.jois
03-07-2011, 10:16 AM
If you are only copying something like a single int, then it's not worth putting that piece of data in local memory. And you are right, in that case a single warp would do all the work and the rest would be idle... assuming that your hardware doesn't use a DMA engine for global->local copies.

How about this case?

- Each thread needs, let's say, 1000 elements to complete its work
- Number of threads in one work-group = 1024

Even in this case, the first thread or the first warp would have brought in all of these 1000 elements.

Somehow it does not make sense to me that all the threads from the other warps also execute the async copy when the data is already there in shared memory.

-- Bharath

david.garcia
03-07-2011, 10:40 AM
How about this case?

- Each thread needs, let's say, 1000 elements to complete its work
- Number of threads in one work-group = 1024

Even in this case, the first thread or the first warp would have brought in all of these 1000 elements.

In most implementations, that's not true. If you have to copy 1000 elements and your work-group size is 1024, the first 1000 work-items will copy one element each and the last 24 work-items will not do any work.

Again, this is somewhat hardware-dependent.


Somehow it does not make sense to me that all the threads from the other warps also execute the async copy when the data is already there in shared memory.

Each work-item only does a small part of the copy. When you put together all the pieces copied by all the work items you get the full copy. I don't know how to explain this any better. At the end of the day you will have to trust that the people who implemented async_work_group_copy() knew what they were doing.
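
Purely as an illustration of the idea (this is not how any particular vendor actually implements it), a software emulation of the copy could look something like a strided loop in which every work-item moves a subset of the elements; dst_local, src_global and num_elements are placeholder names:

// Hypothetical sketch: each work-item copies every lsz-th element,
// so together the work-group covers the whole range.
size_t lid = get_local_id(0);
size_t lsz = get_local_size(0);

for (size_t i = lid; i < num_elements; i += lsz)
    dst_local[i] = src_global[i];

// Something equivalent to a barrier is still needed before any
// work-item reads dst_local, which is roughly what wait_group_events()
// provides after an async_work_group_copy().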

bharath.s.jois
03-07-2011, 11:56 AM
Each work-item only does a small part of the copy. When you put together all the pieces copied by all the work items you get the full copy. I don't know how to explain this any better. At the end of the day you will have to trust that the people who implemented async_work_group_copy() knew what they were doing.

I quite get the point regarding how the contents are brought from global to local memory by separate threads. But I would still like to stick to the point that when every thread depends on the complete set of data being fetched, the thread with local ID 1 will be stalled until the thread with local ID 1000 has (at least) executed the async copy function, which might be much later.

But yes, I also understand that OpenCL was not tailored for my application. :)

-- Bharath

david.garcia
03-07-2011, 12:13 PM
But I would still like to stick to the point that when every thread depends on the complete set of data being fetched, the thread with local ID 1 will be stalled until the thread with local ID 1000 has (at least) executed the async copy function, which might be much later.

I honestly don't understand where the problem is. When you put some data in local memory it's because you want all work-items in the work-group to access all that data. In that case the cost of copying the data from global to local memory is usually negligible compared to the alternative of fetching global memory over and over. If each work-item is only going to access a small piece, then local memory is not needed.

Perhaps it would be a good idea to share with us what your algorithm looks like so that we can give advice on how to adapt it to OpenCL.

bharath.s.jois
03-08-2011, 02:07 AM
Perhaps it would be a good idea to share with us what your algorithm looks like so that we can give advice on how to adapt it to OpenCL.

Actually, I am solving a knapsack problem (http://en.wikipedia.org/wiki/Knapsack_problem).

We have N items with values V(0)..V(N-1) and weights W(0)..W(N-1) and a bag of capacity C. I am currently using the dynamic programming technique, and the kernel would look like


For i = 0 : N-1
    For j = 1 : C
        // some code - trivial arithmetic using V[i] and W[i]
    endFor
endFor

What goes into the OpenCL kernel is "//some code", and I launch C threads at a time and the kernel is enqueued N times (corresponding to the outer loop).

During the ith call to the kernel, the code uses the ith element of the V array and the W array.
I am currently getting some speedup using OpenCL (global memory) for good values of N and C, but I am wondering if I could use shared memory to improve the performance significantly.

-- Bharath

david.garcia
03-08-2011, 06:25 AM
Ah, I see. Interesting :)

If all you need in each kernel execution is the value V[i] and W[i] then why not pass them directly to the kernel? The following is easy to implement and puts v_i and w_i in private memory, which is almost synonymous with "in a register".



__kernel void knapsack(..., float v_i, float w_i)
{
// ...
}
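
On the host side that simply means updating two scalar arguments before each enqueue. A rough sketch of that loop (error checking omitted; the names knapsack_kernel, queue, h_val and h_wgt are made up for illustration, and the argument indices depend on your actual kernel signature):

/* Hypothetical host loop: re-set v_i and w_i for each iteration of the
   outer loop, then enqueue the kernel over the capacity range. */
for (int i = 0; i < N; ++i) {
    clSetKernelArg(knapsack_kernel, 4, sizeof(float), &h_val[i]);  /* v_i */
    clSetKernelArg(knapsack_kernel, 5, sizeof(float), &h_wgt[i]);  /* w_i */

    size_t global_size = C;   /* one work-item per capacity value */
    clEnqueueNDRangeKernel(queue, knapsack_kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);
}
clFinish(queue);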


That said, I would recommend reading about parallel solutions to the knapsack problem. I know nothing about the topic, but Google shows quite a few hits (http://www.google.ca/search?sourceid=chrome&ie=UTF-8&q=parallel+knapsack+problem).

bharath.s.jois
03-09-2011, 05:21 AM
A bit late on this. I was held up debugging a similar implementation.


If all you need in each kernel execution is the value V[i] and W[i] then why not pass them directly to the kernel? The following is easy to implement and puts v_i and w_i in private memory, which is almost synonymous with "in a register".

This helped a bit. :) But I will have to move back to the shared memory usage when the number of elements required by one thread is "many".

I did manage to understand and make use of the async copy in another similar context, although I did not quite get the speed-up I initially expected. I realized there was another bottleneck.


That said, I would recommend reading about parallel solutions to the knapsack problem. I know nothing about the topic, but Google shows quite a few hits.

Thanks for this suggestion. It will take a while before I digest these. :D

-- Bharath

bharath.s.jois
03-14-2011, 10:32 AM
Well, I am back with a few more questions.

Previously, all the threads in a work-group used a single value from each of val and wgt, namely val[i] and wgt[i].

Currently, I am working on a variant of the knapsack problem, called the multiple-choice knapsack problem. For this, each of the threads would need access to the complete val and wgt arrays. I thought it would be appropriate to use shared memory for this. So I fetch the whole of the val and wgt arrays into shared memory. Something like...



__kernel ... (__global value_t *val, ...)
{
    __local value_t localvals[NUM_ITEMS];

    // fetch using async_work_group_copy (global -> local)

    loop: 1 to number_of_values_fetched
        // Work using the values fetched
        // Use localvals instead of val
    end loop
}


I see a good decrease in the number of global load requests (OK) but the amount of GPU time increases when compared to the global memory implementation (val and wgt are in global memory).

IMO, the performance benefit of the shared memory implementation should increase as number_of_values_fetched increases. Am I right in thinking so?

I can also see the output from the profiler (from Nvidia), but cannot make good use of it to see where I am losing the time I should have gained from the shared memory accesses.

I guess I am being a bit vague, but any suggestions on what numbers I could look at to understand what is happening?

-- Bharath

bharath.s.jois
03-14-2011, 10:56 AM
To add to what I have said, I see that the number of branches and the number of divergent branches have increased in the shared memory implementation. Do async_work_group_copy or wait_group_events contribute to the branches in any way? In the rest of the kernel, the conditional branches remain the same for both the shared and global memory implementations.

-- Bharath

david.garcia
03-14-2011, 03:56 PM
I thought it would be appropriate to use shared memory for this. So I fetch the whole of the val and wgt arrays into shared memory.

If val and wgt fit in local memory they almost certainly fit in constant memory as well. Have you tried that? The only difference for you is that instead of declaring them as __global you declare them as __constant.


IMO, the performance benefit of the shared memory implementation should increase as number_of_values_fetched increases.

What is number_of_values_fetched? Is it the same as NUM_ITEMS? If data is read only once from global memory then there will be no benefit in using local memory.

Where local memory is a win is when the kernel would otherwise fetch from the same global memory over and over.


Do async_work_group_copy or wait_group_events contribute to the branches in any way?

Sure they can. It's implementation-dependent.

bharath.s.jois
03-14-2011, 05:00 PM
If val and wgt fit in local memory they almost certainly fit in constant memory as well. Have you tried that? The only difference for you is that instead of declaring them as __global you declare them as __constant.

Although it's a good idea to put val and wgt into the __constant address space, I am not sure it helps with performance. At best I have come across an Nvidia article (or post) which says that the constant cache is faster than the global cache. Or are you trying to say something else here? :? I did try this and swept across the range of variables with no good result.


What is number_of_values_fetched? Is it the same as NUM_ITEMS? If data is read only once from global memory then there will be no benefit in using local memory.


The number_of_values_fetched actually might vary in each kernel invocation, between, say, 10 and 2000. I will try to give a better picture of the kernel here.




//Shared memory

__kernel void mckp(__global value_t *val,
                   __global weight_t *wgt,
                   __global value_t *soln,
                   __global int *keep,
                   const int i,
                   const unsigned int capacity,
                   const unsigned int num_choices,
                   const unsigned int offset)
{
    __local value_t locval[SB_SIZE];
    __local weight_t locwgt[SB_SIZE];

    event_t copydone;

    copydone = async_work_group_copy((__local value_t *)locval,
                                     (__global value_t *)(val + offset),
                                     num_choices, (event_t)0);

    copydone = async_work_group_copy((__local weight_t *)locwgt,
                                     (__global weight_t *)(wgt + offset),
                                     num_choices, (event_t)copydone);

    // Some irrelevant code

    wait_group_events(1, &copydone);

    for (unsigned int k = 0; k < num_choices; ++k) {
        // Code which accesses locval[k] once and locwgt[k] once
        // More irrelevant code
    }
}




//Without shared memory
__kernel void mckp(__constant value_t *val,
                   __constant weight_t *wgt,
                   __global value_t *soln,
                   __global int *keep,
                   const int i,
                   const unsigned int capacity,
                   const unsigned int num_choices,
                   const unsigned int offset)
{
    // Some code

    for (unsigned int k = 0; k < num_choices; ++k) {
        // Code accessing val[offset+k] once and wgt[offset+k] once
        // More irrelevant code
    }
}





Where local memory is a win is when the kernel would otherwise fetch from the same global memory over and over.

Well, although a single thread uses a particular value only once, a bigger perspective considering the whole work-group leads me to think...

Without shared memory:
Total number of load requests = 2 * num_choices * work_group_size (Global), with at least 2 * num_choices cache misses.

With shared memory:
Total number of load requests = 2 * num_choices (Global) +
2 * work_group_size * num_choices (Local / Shared)
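
As a purely illustrative example with made-up numbers: if num_choices = 250 and work_group_size = 512, that is roughly 2 * 250 * 512 = 256,000 global loads without shared memory, versus 2 * 250 = 500 global loads plus 2 * 512 * 250 = 256,000 much cheaper local loads with it.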


But I might be wrong with the above *thoughts*.

I am not sure if this is the right place for the question, but is there any way to check how the GPU time inside the kernel was spent? I am not too sure if the Nvidia profiler helps me there.

-- Bharath

david.garcia
03-14-2011, 05:08 PM
OK, I get it. Yes, the greater num_choices is, the more benefit you will see from local memory. Your analysis of the memory accesses looks right. It's possible that for smallish values of num_choices local memory will not be worth it.


but is there any way to check how the GPU time inside the kernel was spent?

That's a great question. Unfortunately there's no standard way to do this. You will have to use proprietary profilers. I'm quite sure that NVidia supports this feature -- try asking in their forums :)

bharath.s.jois
03-20-2011, 12:21 PM
That's a great question. Unfortunately there's no standard way to do this. You will have to use proprietary profilers. I'm quite sure that NVidia supports this feature -- try asking in their forums :)

Although I didn't get a response from the Nvidia forums, I figured out that there is a way to get some information from inside the kernel using a profiler trigger, but I don't think I will be able to use it: 1. the kernel compiler cannot recognise the function, and 2. I am not sure the trigger works with Compute Capability 2.0, since the documentation does not specify this.

But apart from these, I have tried out a few things which gave me results which are tough to understand, but are good (not completely :wink: )

I tried my hand at CUDA, with global and shared memory implementations. The shared memory was used like...



// somewhere inside the kernel

local_thread_ID = get_local_thread_ID()
if local_thread_ID < num_choices
    fetch the (local_thread_ID)th element from global memory into shared memory

// use the shared memory


I could clearly see that the shared memory gives a good speed-up even with not-so-large values of num_choices.

With this result, I tried doing the same with OpenCL, i.e. without using async_work_group_copy and wait_group_events. Surprisingly, it was at least 20% faster than the async_work_group_copy version, which was as fast as or slower than the global memory version. And it was functionally correct too.

Then I remembered that I had forgotten to synchronise the threads after collecting the data from global memory into shared memory (both in OpenCL and CUDA). I added a barrier (barrier(CLK_LOCAL_MEM_FENCE) for OpenCL and __syncthreads() for CUDA) before using the shared memory and saw that the timings went up: beyond the global memory implementation in the OpenCL case, while the impact of the barrier was not too much in the case of CUDA. To sum up...

Slowest to fastest:

CPU: 158629.718750
GPU-CUDA-GLOBAL: 919.958984
GPU-CUDA-SHARED-WITH-BARRIER: 790.421021
GPU-CUDA-SHARED-WITHOUT-BARRIER: 779.372986
GPU-OPENCL-SHARED-WITH-BARRIER: 701.237000
GPU-OPENCL-SHARED-WITH-ASYNC-WITH-GROUP-WAIT: 700.955017
GPU-OPENCL-SHARED-WITH-ASYNC-WITHOUT-GROUP-WAIT: 698.851990
GPU-OPENCL-GLOBAL: 680.328003
GPU-OPENCL-SHARED-WITHOUT-BARRIER: 607.349976

For the runs, the number of items to be fetched varied from 200 to 300 and the number of threads in a work-group (thread-block for CUDA) was 512.

I cannot make a particular remark about the implementation since I don't have any other reference, but in general...

- async_work_group_copy seems to be slower than program guided fetching
- Barriers are damn costly!! Is this the case?

Any comments on these?

-- Bharath

david.garcia
03-20-2011, 12:50 PM
- Barriers are damn costly!! Is this the case?

Yes, they are rather costly. They are also necessary.


async_work_group_copy seems to be slower than program guided fetching

That's not very surprising either. async_work_group_copy has to support odd ratios between work-items and the amount of memory to copy so there's some overhead to support all cases. In your kernel each work-item copies at most a single item from global memory, right?

bharath.s.jois
03-20-2011, 01:08 PM
Yes, they are rather costly. They are also necessary.

So that would mean we need to have enough work with the shared data to overcome the prefetch + sync time. I see that the CUDA implementation does better here, but I guess it depends on the implementation.


That's not very surprising either. async_work_group_copy has to support odd ratios between work-items and the amount of memory to copy so there's some overhead to support all cases. In your kernel each work-item copies at most a single item from global memory, right?

Yes, that is right, at most one. Ah, and I understand that async_work_group_copy will have to handle many cases, including cases where amount-of-data > number-of-threads.

-- Bharath

bharath.s.jois
03-23-2011, 05:08 AM
Regarding barriers, does the cost of using a barrier depend on the work-group size? Like, the fewer threads per work-group, the less time it takes?

-- Bharath

david.garcia
03-23-2011, 03:18 PM
Like, the fewer threads per work-group, the less time it takes?

Yeah, I think that's a fair assumption. However, overall small work-groups are typically less efficient than large work-groups. At least on GPUs.

UMDdev
05-11-2011, 09:33 AM
Bharath, thanks for asking these questions. I'm a complete newbie, so this was a great read.

steavy
05-20-2011, 12:16 AM
That thread was useful for me too. Thanks, guys.