Thread: Moving data from __global to __private

  1. #1
    Junior Member
    Join Date
    Oct 2009
    Posts
    10

    Moving data from __global to __private

    Since we cannot use memcpy in OpenCL, I am wondering if there
    is a similar function available that can be used to copy chunks of
    data from __global to __private (or to __local) inside a kernel.

    For example say I wish to copy 10 elements from global memory to
    __private memory (per thread). I do not wish to make a loop like:

    Code :
    for (int i=0; i<n_elements.....
    ...

    How is this generally achieved in OpenCL?

    The purpose is to get a list of data into each thread. I am making a raytracer
    where I need to grab a list of surface data contained within each grid cell
    (or tree node if I use that).

  2. #2
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Moving data from __global to __private

    You have to do the copy for each work-item in the kernel code. There is a provided function, async_work_group_copy, which will copy from global to local using all the work-items in a work-group, but there is no provided function for copying to private memory.

    Your example is about what you need to do. At the start of the kernel you copy in the data you need from global memory. Remember that if you are copying to memory that will be shared across the work-group (local) you need to insert a barrier after the copy to ensure that all work-items have finished before any try to access it.
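
    A minimal sketch of both variants, assuming a hypothetical kernel with a
    __global float buffer named data, 10 elements per work-item, and a
    work-group that shares the first N_SHARED floats (all of these names and
    sizes are made up for illustration). Note that the built-in
    async_work_group_copy is completed with wait_group_events; the barrier
    mentioned above is what a hand-written copy to local memory needs.

    Code :
    #define N_PRIVATE 10   // assumed element count per work-item
    #define N_SHARED  64   // assumed number of floats shared by the group

    __kernel void copy_sketch(__global const float *data,
                              __global float *out)
    {
        // Local memory must be declared at kernel function scope.
        __local float shared[N_SHARED];

        const size_t gid = get_global_id(0);

        // Per-work-item copy: global -> private, done with a plain loop.
        float priv[N_PRIVATE];
        for (int i = 0; i < N_PRIVATE; i++) {
            priv[i] = data[gid * N_PRIVATE + i];
        }

        // Work-group copy: global -> local using the built-in async copy.
        event_t ev = async_work_group_copy(shared, data, N_SHARED, 0);
        wait_group_events(1, &ev);   // every work-item waits for the copy

        // Use both copies so the compiler cannot drop them.
        float sum = 0.0f;
        for (int i = 0; i < N_PRIVATE; i++) {
            sum += priv[i];
        }
        out[gid] = sum + shared[get_local_id(0) % N_SHARED];
    }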

  3. #3
    Junior Member
    Join Date
    Oct 2009
    Posts
    10

    Re: Moving data from __global to __private

    I was afraid of that...

    Will the included async copy to local mem be significantly faster than
    loop-copying it to private? It will be difficult to implement because:

    I am making Monte Carlo forward raytracing software,
    where each work-item is one independent ray traced through
    a geometry. The geometry is split in a grid right now (may use
    kd-trees later depending on what happens) and each time a ray/photon
    enters a new grid cell it must check if there are surfaces inside this
    cell, or it must intersect one of the bounding planes.

    Copying each element float by float takes a lot of time
    (I assume this is due to global memory access times).
    I think I was able to reduce the time spent by grabbing
    them as float4s from image object memory, but I am not certain.

    I assume image memory objects are the same as CUDA textures,
    which this guy here recommends?
    http://bouliiii.blogspot.com/2008/08/re ... a-100.html

  4. #4
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Moving data from __global to __private

    The speed of the async copy global to local will depend entirely on the implementation (e.g., Cell could use a DMA engine, but I don't think there are DMA engines for this on most GPUs) so I doubt it will be a win over a for-loop for you.

    You should take a look at the vload functions which can do optimized vector loads of data. This will get you the best performance if your data is aligned to a vector size.
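
    As a rough sketch of what that might look like (the kernel and buffer
    names here are hypothetical), vload4(i, p) reads the four floats at
    p[4*i] .. p[4*i + 3] in one call:

    Code :
    __kernel void sum_float4(__global const float *nodes,
                             __global float *out)
    {
        const size_t gid = get_global_id(0);

        // Loads nodes[4*gid] .. nodes[4*gid + 3] as one float4.
        float4 v = vload4(gid, nodes);

        out[gid] = v.x + v.y + v.z + v.w;
    }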

    It sounds like there is no way to get around the copying of each float from what you are saying. Copying them into private/local memory is only a win if you have reuse. Note that private memory on most GPUs today is just registers, so there's no real benefit over just keeping them around in your kernel as variables. Copying to local has a win because multiple work-items can share them so you can get more reuse.

    Image memory objects are textures, so on GPUs that have texture caches (i.e., all of them) you will get the benefits of caching which can be far faster than buffer accesses if you have good spatial locality.

    The other big issue is memory access coalescing. I know that on Nvidia GPUs this can make an order-of-magnitude difference in your memory bandwidth. Devices before the GTX 280 could only coalesce accesses from a work-group that were sequential (e.g., each work-item accesses the next item). The GTX 280 is more flexible so it should do better. However, if each work-item is doing its own random accesses, you will get very little coalescing so you will never be able to get close to the maximum bandwidth. This may just be a problem with mapping the algorithm to the hardware. Pre-loading data into the local memory can help if you can predict what data you are likely to use and it fits.
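
    A sketch of that kind of pre-load, assuming a hypothetical table of
    TABLE_SIZE floats that every work-item in the group will reuse (the
    kernel, the buffers and the size are all invented for illustration).
    Adjacent work-items read adjacent addresses, which is the sequential
    pattern that coalesces, and the barrier makes sure the table is
    complete before anyone reads it:

    Code :
    #define TABLE_SIZE 256   // assumed size; must fit in local memory

    __kernel void preload_sketch(__global const float *table,
                                 __global const uint *indices,
                                 __global float *out)
    {
        __local float cache[TABLE_SIZE];

        const size_t lid = get_local_id(0);
        const size_t lsz = get_local_size(0);

        // Work-item lid loads elements lid, lid + lsz, lid + 2*lsz, ...
        for (size_t i = lid; i < TABLE_SIZE; i += lsz) {
            cache[i] = table[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // Scattered per-work-item lookups now go to local memory.
        const size_t gid = get_global_id(0);
        out[gid] = cache[indices[gid] % TABLE_SIZE];
    }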

  5. #5
    Junior Member
    Join Date
    Oct 2009
    Posts
    10

    Re: Moving data from __global to __private

    Unfortunately I do not see any way to predict what data a work-item will
    require, since it is a ray with random direction but also with a spread in origin.

    There is no way to manually pre-cache data into local memory from what I see,
    because of this randomness...

    So I will try to compare the approach of using vload-functions vs
    loading from textures/image objects to see if I can find a speed
    advantage in either method.

  6. #6
    Junior Member
    Join Date
    Oct 2009
    Posts
    10

    Re: Moving data from __global to __private

    One more thing: I am using a GTX260 to do most of the work,
    while the final program will run on a system with multiple Tesla 1060 cards.
    (Funnily enough, so far my program has run at identical speeds
    on the Tesla and the GTX260, with even a slight advantage to the GTX260.)

  7. #7
    Junior Member
    Join Date
    Oct 2009
    Posts
    10

    Re: Moving data from __global to __private

    Results

    I let my kernel run some test code and timed the execution.
    These results are on my GTX260. I don't have the Tesla cards
    here at home (they're at my workplace) but I think I ran a similar
    test on them.

    On the GTX260 I had inside my kernel :

    Code :
     
     
    	__private float aX = 0;
    	__private int TR = 100;
     
    	// Test 1: one-by-one access (four scalar loads from the global buffer)
    	for (int i=0; i<TR; i++) {
    		__private float4 R = (float4)(nodes[0], nodes[1], nodes[2], nodes[3]);
    		aX += R.x + R.y + R.z + R.w;
    	}
     
     
    	// Test 2: vector access (one vload4 from the global buffer)
    	for (int i=0; i<TR; i++) {
    		__private float4 R = vload4(0, nodes);
    		aX += R.x + R.y + R.z + R.w;
    	}
     
     
    	// Test 3: image access (one texel read through the sampler)
    	for (int i=0; i<TR; i++) {
    		__private int2 coord = (int2)(0, 0);
    		__private float4 R = read_imagef( src_image, samplerA, coord );
    		aX += R.x + R.y + R.z + R.w;
    	}
     
    	// Assign some arbitrary data to test read back
    	for (int i=0; i < MAX_HITS; i++) {
    		energies[thread_id*MAX_HITS + i] = aX;
    	}

    Reference : The kernel without any of the tests took about 3-5 ms to execute.

    Test 1 : (one by one access) took about 95-100 ms for a global work size of 1,000,000

    Test 2 : (using vload4) took about 120-125 ms for a global work size of 1,000,000

    Test 3 : (image objects) took about 20-25 ms for a global work size of 1,000,000

    So I guess it was a good idea to stick with image objects, then.

  8. #8
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Moving data from __global to __private

    Image objects are cached, so any spatial locality will get you a big boost in performance. Note that variables are by default private so you don't have to put __private (or just "private") in any of those cases.

    I'd also be a bit careful here. It looks like you are reading the same data everywhere. That will mean that the first image read will load the cache and every single read thereafter will hit. The other ones are not cached so they will have to do the read each time. This appears to cover the case where you have extremely good locality. If your real code is accessing all over the place you will see far less benefit from the image access.
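
    One way to test the less favourable case is to give every work-item a
    different, scattered coordinate. This is only a sketch: the kernel name,
    the width/height parameters, the sampler settings and the integer hash
    are all assumptions made for illustration, not part of the code posted
    above.

    Code :
    __kernel void scattered_read(__read_only image2d_t src_image,
                                 int width, int height,
                                 __global float *out)
    {
        // Unnormalized integer coordinates, no filtering (assumed setup).
        const sampler_t samplerA = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP_TO_EDGE |
                                   CLK_FILTER_NEAREST;

        const uint gid = (uint)get_global_id(0);
        float aX = 0.0f;

        for (uint i = 0; i < 100; i++) {
            // Cheap multiplicative hash so neighbours land far apart.
            uint h = (gid * 100u + i) * 2654435761u;
            int2 coord = (int2)((int)(h % (uint)width),
                                (int)((h / (uint)width) % (uint)height));
            float4 R = read_imagef(src_image, samplerA, coord);
            aX += R.x + R.y + R.z + R.w;
        }
        out[gid] = aX;
    }

    Timing this against the same-texel loop should show how much of the
    image advantage comes from cache hits.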

  9. #9
    Junior Member
    Join Date
    Oct 2009
    Posts
    10

    Re: Moving data from __global to __private

    Quote Originally Posted by dbs2
    Image objects are cached, so any spatial locality will get you a big boost in performance. Note that variables are by default private so you don't have to put __private (or just "private") in any of those cases.

    I'd also be a bit careful here. It looks like you are reading the same data everywhere. That will mean that the first image read will load the cache and every single read thereafter will hit. The other ones are not cached so they will have to do the read each time. This appears to cover the case where you have extremely good locality. If your real code is accessing all over the place you will see far less benefit from the image access.

    Thanks. That is good to know.
    The important part is just that it won't be significantly slower than reading it element by element.

    At least the grid which is only about 20x10x20 should be cached to some degree, as
    it is quite sparse. (each entry in the grid holds 2 float4s)
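    (For scale, assuming 16 bytes per float4: 20 x 10 x 20 = 4000 cells,
    and 4000 x 2 x 16 bytes = 128000 bytes, or about 125 KiB in total.)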

    Right now I just let every work item read the same data to test performance.
    How much data do you think could be cached?
