
Thread: How to avoid double allocation on CPU

  1. #11
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: How to avoid double allocation on CPU

    Quote Originally Posted by CNugteren
    First of all, thank you for your very long and thorough responses; it is greatly appreciated.

    ...

    I understand. I'm targeting an Intel CPU only, no GPUs involved at any point. I'm
    OK. I just find that puzzling. There are probably easier ways to parallelise code if you only have a CPU to work on.

    I understand all that, I do have some experience programming CUDA and OpenCL for GPUs. The only thing I'm trying to do now is to omit the memory copies on the CPU.
    Just as an intellectual exercise?

    This is what I understand, please correct me if I'm wrong:
    * If I have a memory object which is created using 'CL_MEM_USE_HOST_PTR', it is meant to be accessed by the accelerator only (read/write).
    * After I've created such an object, I should not access the host version of it, as it contains undefined data.
    Well, it depends on who is writing to it and on the read/write flags, but in general, if both sides are writing, then yes.
    * If I map the memory object, it is accessible by the host from that point on (either for read or write, specified as a flag to the API call), but the accelerator should not access it anymore, as it contains undefined data from accelerator perspective.
    Again, it depends on who is writing it. If you're only mapping it for read then both sides can still read it.
    * If I unmap the memory object, it goes back to the state it previously was (accessible by the accelerator, not by the host).
    Well, the heap memory won't be unmapped from the process: you will still have access to it. It's just that if you subsequently invoke a kernel, and have written to it in the meantime, there's no guarantee the kernel will get any of those writes.

    If you're only reading a result or never use it for a kernel, the data will stay around and be valid after you unmap it.
    Therefore, I print inside this map/unmap region (and at various other places), but it does not seem to work. I've made a link to the full version of the code here: http://dl.dropbox.com/u/26669157/opencl-cpu.tar.gz (I'm not asking you to go through the code, but maybe somebody is interested anyway - the printf is in line 247 of the example6_host.c file).
    245 void* pointer_to_B = clEnqueueMapBuffer(bones_queue, device_B, CL_TRUE, CL_MAP_READ, (N * 1)*sizeof(int), 0, 0, NULL, NULL, &bones_errors); error_check(bones_errors);

    This looks wrong: you're passing the size as the offset and mapping 0 bytes, i.e. pointer_to_B will end up pointing N*4 bytes past the start of B (&B[N]), not at &B[0].

    (another example of why actual source is much better than fragments/discussions).
    A small question to end with: why do I want to 'unmap' the object? After my OpenCL kernel has run I will do a lot of computations on the resulting data outside of OpenCL, no kernels anymore. Ideally I just want to 'map' the object directly after the kernel has ended to give it back to the CPU and never 'unmap' it. That doesn't seem to be possible, as the 'map' requires you to specify either read or write.

    Thanks again for the help!
    Because it's part of the API? You've effectively allocated a resource and it's just a resource management issue.

    But anyway, unmapping will work if you are either only reading it, or never using that same buffer again in a kernel. If you need to do some processing and subsequently invoke another kernel on it, you will either need to keep the map around during the whole host-side update, or alternatively release the buffer and create a new one when you need it again.
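
    To make the pattern concrete, here is a minimal sketch of the CL_MEM_USE_HOST_PTR plus map/unmap flow (the context/queue/kernel setup is assumed, not taken from the attached code, and host_B is assumed to come from a suitably aligned allocator):

    Code :
    #include <CL/cl.h>
    
    // Sketch only: context, queue and kernel are assumed to be set up elsewhere.
    void run_and_read(cl_context context, cl_command_queue queue, cl_kernel kernel,
                      int *host_B, size_t N) {
      cl_int err;
      size_t global = N;
    
      // Wrap the existing host allocation; on a CPU device this can be zero-copy
      cl_mem device_B = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                       N * sizeof(int), host_B, &err);
    
      clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_B);
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    
      // Blocking map for reading: note the argument order, offset (0) before size
      int *result = (int *)clEnqueueMapBuffer(queue, device_B, CL_TRUE, CL_MAP_READ,
                                              0, N * sizeof(int), 0, NULL, NULL, &err);
      // ... read result[] on the host; with CL_MEM_USE_HOST_PTR it aliases host_B ...
      clEnqueueUnmapMemObject(queue, device_B, result, 0, NULL, NULL);
      clReleaseMemObject(device_B);
    }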

  2. #12
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: How to avoid double allocation on CPU

    BTW I know you're only investigating but you really don't need all those clFinish() calls.

    If you do a blocking map (or any blocking operation), it implicitly ensures that particular operation has completely finished (and, by extension, all previous operations on a single in-order queue).

    clFinish() is handy for timing and debugging, but again it's only needed when you don't end the sequence on a blocking call.
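
    For example (a sketch; the queue, kernel and buffer are assumed to exist already):

    Code :
    #include <CL/cl.h>
    
    // The blocking map (CL_TRUE) waits for the kernel and, on an in-order queue,
    // for everything enqueued before it, so no clFinish() is required.
    void *run_then_map(cl_command_queue queue, cl_kernel kernel, cl_mem buffer,
                       size_t global, size_t size, cl_int *err) {
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
      return clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ,
                                0, size, 0, NULL, NULL, err);
    }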

  3. #13
    Junior Member
    Join Date
    Mar 2012
    Posts
    6

    Re: How to avoid double allocation on CPU

    Quote Originally Posted by notzed
    Quote Originally Posted by CNugteren
    First of all, thank you for your very long and thorough responses; it is greatly appreciated.

    ...

    I understand. I'm targeting an Intel CPU only, no GPUs involved at any point. I'm
    OK. I just find that puzzling. There are probably easier ways to parallelise code if you only have a CPU to work on.
    True, true. But you only figure that out after you try, right? I guess since I am quite familiar with CUDA and have a little bit of OpenCL GPU experience, I should be able to write some OpenCL CPU code. The good thing about the Intel OpenCL compiler is that it does the vectorization for you and also uses multiple CPU threads.

    Quote Originally Posted by notzed

    I understand all that, I do have some experience programming CUDA and OpenCL for GPUs. The only thing I'm trying to do now is to omit the memory copies on the CPU.
    Just as an intellectual exercise?

    This is what I understand, please correct me if I'm wrong:
    * If I have a memory object which is created using 'CL_MEM_USE_HOST_PTR', it is meant to be accessed by the accelerator only (read/write).
    * After I've created such an object, I should not access the host version of it, as it contains undefined data.
    Well, it depends on who is writing to it and on the read/write flags, but in general, if both sides are writing, then yes.
    * If I map the memory object, it is accessible by the host from that point on (either for read or write, specified as a flag to the API call), but the accelerator should not access it anymore, as it contains undefined data from accelerator perspective.
    Again, it depends on who is writing it. If you're only mapping it for read then both sides can still read it.
    * If I unmap the memory object, it goes back to the state it previously was (accessible by the accelerator, not by the host).
    Well, the heap memory won't be unmapped from the process: you will still have access to it. It's just that if you subsequently invoke a kernel, and have written to it in the meantime, there's no guarantee the kernel will get any of those writes.

    If you're only reading a result or never use it for a kernel, the data will stay around and be valid after you unmap it.
    Therefore, I print inside this map/unmap region (and at various other places), but it does not seem to work. I've made a link to the full version of the code here: http://dl.dropbox.com/u/26669157/opencl-cpu.tar.gz (I'm not asking you to go through the code, but maybe somebody is interested anyway - the printf is in line 247 of the example6_host.c file).
    245 void* pointer_to_B = clEnqueueMapBuffer(bones_queue, device_B, CL_TRUE, CL_MAP_READ, (N * 1)*sizeof(int), 0, 0, NULL, NULL, &bones_errors); error_check(bones_errors);

    This looks wrong: you're passing the size as the offset and mapping 0 bytes, i.e. pointer_to_B will end up pointing N*4 bytes past the start of B (&B[N]), not at &B[0].

    (another example of why actual source is much better than fragments/discussions).

    I guess I understand the map/unmap procedure now. And the bug indeed turned out to be a small mistake: I had swapped the offset and the size. I've fixed that and the code behaves correctly again!

    Quote Originally Posted by notzed
    A small question to end with: why do I want to 'unmap' the object? After my OpenCL kernel has run I will do a lot of computations on the resulting data outside of OpenCL, no kernels anymore. Ideally I just want to 'map' the object directly after the kernel has ended to give it back to the CPU and never 'unmap' it. That doesn't seem to be possible, as the 'map' requires you to specify either read or write.

    Thanks again for the help!
    Because it's part of the API? You've effectively allocated a resource and it's just a resource management issue.

    But anyway, unmapping will work if you are either only reading it, or never using that same buffer again in a kernel. If you need to do some processing and subsequently invoke another kernel on it, you will either need to keep the map around during the whole host-side update, or alternatively release the buffer and create a new one when you need it again.
    Quote Originally Posted by notzed
    BTW I know you're only investigating but you really don't need all those clFinish() calls.

    If you do a blocking map (or any blocking operation), it implicitly ensures that particular operation has completely finished (and, by extension, all previous operations on a single in-order queue).

    clFinish() is handy for timing and debugging, but again it's only needed when you don't end the sequence on a blocking call.
    I've removed some of the clFinish calls; I just put them there to be sure they were not causing the problem.

    Anyway, many thanks for your help and explanations! It works as expected now! The only thing left is to make it actually perform zero-copy, because the execution time still suggests that a copy is being made.

  4. #14
    Junior Member
    Join Date
    Mar 2012
    Posts
    6

    Re: How to avoid double allocation on CPU

    For those who are interested, the malloc/free implementations that I use to allocate on a 128-byte boundary are as follows:

    Code :
    #include <stdlib.h>
    
    // Allocate a 128-byte aligned pointer
    void *malloc128(size_t size) {
      char *pointer;
      char *pointer2;
      char *aligned_pointer;
     
      // Allocate the memory plus a little bit extra
      pointer = (char *)malloc(size + 128 + sizeof(int));
      if(pointer==NULL) { return(NULL); }
     
      // Create the aligned pointer
      pointer2 = pointer + sizeof(int);
      aligned_pointer = pointer2 + (128 - ((size_t)pointer2 & 127));
     
      // Set the padding size
      pointer2 = aligned_pointer - sizeof(int);
      *((int *)pointer2) = (int)(aligned_pointer - pointer);
     
      // Return the 128-byte aligned pointer
      return (aligned_pointer);
    }

    Code :
    // Free a pointer that was allocated with malloc128
    void free128(void *pointer) {
      // The padding size was stored in the int just before the aligned block
      int *pointer2 = (int *)pointer - 1;
      // Step back to the pointer originally returned by malloc and free that
      free((char *)pointer - *pointer2);
    }

    This allocator, in combination with 'CL_MEM_USE_HOST_PTR', 'clEnqueueMapBuffer' and 'clEnqueueUnmapMemObject', gives me zero-copy behaviour on an Intel CPU using Intel's OpenCL SDK.
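
    As a usage note (a sketch only; the context variable and N are assumed), the teardown order matters: with CL_MEM_USE_HOST_PTR the host block must stay valid for the lifetime of the buffer, so release the cl_mem before freeing the memory it wraps:

    Code :
    #include <CL/cl.h>
    
    // Sketch: wrap a 128-byte aligned host block in a buffer, then tear down
    // in the right order (release the cl_mem before freeing its host memory).
    int wrap_and_release(cl_context context, size_t N) {
      cl_int err;
      int *A = (int *)malloc128(N * sizeof(int));
      if (A == NULL) return -1;
    
      cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                  N * sizeof(int), A, &err);
      // ... enqueue kernels and map/unmap the buffer here ...
      clReleaseMemObject(buf);  // release the buffer first...
      free128(A);               // ...then free the host allocation it wrapped
      return (int)err;
    }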

  5. #15
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: How to avoid double allocation on CPU

    What about posix_memalign()?
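
    For reference, a sketch of what that could look like (assuming a POSIX system; the helper name is just for illustration). posix_memalign() hands back a 128-byte aligned block that can be released with plain free(), so the malloc128/free128 pair would not be needed:

    Code :
    #include <stdlib.h>
    
    // Returns a 128-byte aligned block, or NULL on failure; free() it normally.
    void *malloc128_posix(size_t size) {
      void *pointer = NULL;
      if (posix_memalign(&pointer, 128, size) != 0)
        return NULL;
      return pointer;
    }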
