
Thread: Sharing host memory with clSetKernelArg!

  1. #1

    Sharing host memory with clSetKernelArg!

    Hi!

    For CPU and AMD Fusion devices, which share the same (host) memory, there is no point in relying on clCreateBuffer to copy data to the device and back. In such cases it would make sense for clSetKernelArg to accept a (properly aligned) pointer to host memory directly. clSetKernelArg also has very small overhead compared to the "buffer" functions, which were designed especially with "split" memory scenarios in mind.
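
    For illustration, here is a rough sketch (hypothetical names, error handling omitted) of the round trip that the buffer API forces today, against the one-call form I am suggesting:
    Code :
    /* Sketch: assumes context, queue and kernel were created earlier. */
    size_t n = 1024, gws = n;
    cl_int err;
    float *data = (float *) malloc(n * sizeof(float));   /* host data */
    cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), data, &err);
    err = clSetKernelArg(kernel, 0, sizeof(buf), &buf);
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buf);
    /* Proposed (NOT valid OpenCL today): pass the host pointer directly,
       skipping the buffer object on shared-memory devices:             */
    /* err = clSetKernelArg(kernel, 0, n * sizeof(float), data);        */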

    Thanks!
    Atmapuri

  2. #2

    Re: Sharing host memory with clSetKernelArg!

    Your phrase "share the same (host) memory" is also known as host-unified memory. However, you still need to use clCreateBuffer. You seem to assume that an implementation must always use a "split" memory scenario, but that is not the case on a host-unified memory device. Naturally it all depends on the implementation and the device, but the ones I'm familiar with do not make any extra copies or splits in this case. When clSetKernelArg is called, it just uses the host memory referenced by the buffer. As a result there is no need for clSetKernelArg "to accept (properly aligned) pointer to host memory directly".
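
    For example, on a host-unified device a buffer created over the application's existing allocation should involve no copy at all. A minimal sketch (untested, names assumed):
    Code :
    /* Sketch: the buffer simply wraps the application's own allocation. */
    float *array = (float *) malloc(1024 * 1024 * sizeof(float));
    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                                1024 * 1024 * sizeof(float), array, &err);
    /* On such implementations, clSetKernelArg(kernel, 0, sizeof(buf), &buf)
       hands the kernel a pointer into this very same memory.             */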

  3. #3

    Re: Sharing host memory with clSetKernelArg!

    Quote Originally Posted by bwatt
    Naturally it all depends on the implementation and the device, but the ones I'm familiar with do not make any extra copies or splits in this case. When clSetKernelArg is called, it just uses the host memory referenced by the buffer. As a result there is no need for clSetKernelArg "to accept (properly aligned) pointer to host memory directly".
    I measured overhead between 50 and 2000us, and you say it is not there?
    Going through clCreateBuffer or clEnqueueReadBuffer/clEnqueueWriteBuffer even for CPU devices adds overhead considerably (1000x) above the optimum (a pointer copy).

    Thanks!
    Atmapuri

  4. #4

    Re: Sharing host memory with clSetKernelArg!

    What parameters are you passing to clCreateBuffer? Can you cut & paste that line of code?

  5. #5

    Re: Sharing host memory with clSetKernelArg!

    I can't post the complete code, as it is scattered across lots of other code. I call clCreateBuffer with:

    CL_MEM_READ_WRITE

    clEnqueueMapBuffer has CL_TRUE for blocking and CL_MAP_READ when reading and CL_MAP_WRITE when writing.

    I additionally tried:

    CL_MEM_ALLOC_HOST_PTR

    (clCreateBuffer) but the ATI GPU device appears to constantly mirror (copy) all changes from host to GPU and back in the background when this flag is specified (thus slowing down the computation). When this flag is specified, the time to copy data from the GPU using Map/Unmap is the same as for the CPU device (1.5ms for 4MBytes of data). When this flag is not specified, the time to copy the data is 17ms (which makes sense).

    Copying 4MBytes of memory in C++ takes 200us on my machine, so with a 1.5ms overhead for the CPU device that is not "zero cost".

    Is there some special reason why I couldn't use CL_MEM_USE_HOST_PTR with clCreateBuffer and then do a clFinish before copying the memory with C++ code (when the device is a CPU)? (That is, directly referencing the host pointer passed to clCreateBuffer.)
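
    To make the question concrete, the pattern I mean is roughly this (a sketch; hostptr, count, result and gws are hypothetical, declarations and error checks omitted):
    Code :
    cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                                count * sizeof(float), hostptr, &err);
    err = clSetKernelArg(kernel, 0, sizeof(buf), &buf);
    err = clEnqueueNDRangeKernel(cpuqueue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(cpuqueue);                   /* wait for the kernel to complete */
    memcpy(result, hostptr, count * sizeof(float));   /* no map/unmap pair   */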

    I create the buffer in a context shared by both CPU and GPU, but kernels are currently always enqueued to one device during the lifetime of the buffer.

    Thanks!
    Atmapuri

  6. #6

    Re: Sharing host memory with clSetKernelArg!

    No need to "post complete code"; just the one line with the clCreateBuffer call was all I wanted to see. However, you have explained much more about your application and your use of OpenCL - thanks. I now understand that you are trying to use both the CPU and GPU, operating on one or more buffers that are shared between them.

    Background: The only valid choices of the clCreateBuffer "HOST_PTR" flags are the following combinations (ignoring the "READ/WRITE" flags, which are orthogonal):
    • none
    • CL_MEM_COPY_HOST_PTR
    • CL_MEM_ALLOC_HOST_PTR
    • CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR
    • CL_MEM_USE_HOST_PTR


    The use of each combination depends upon a number of factors or strategies. For example here is one (and there are more, if you are interested please ask and I'll write more).

    • Your host application has already allocated and computed some data outside of OpenCL, for example, 1M floats, that is,
      Code :
      float array[1024*1024];
      or
      Code :
      float *array = (float*)malloc(1024*1024*sizeof(float));
      I think this might be what you are doing.
    • Now you wish to access it from your OpenCL kernel. So you should issue a clCreateBuffer w/ CL_MEM_USE_HOST_PTR. This call takes a pointer to your existing application data, array. For example,
      Code :
      cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 1024*1024*sizeof(float), array, &error);
      This is preferred when the application has already allocated the data. Any other flag choice causes an allocation. Once created, you should assume that the host application NO longer has access to the data, that is, only the OpenCL devices can access the data until you release the buffer.
    • For the CPU device the runtime uses the data directly in the array and invokes the kernel passing a pointer to this data. There should be no need to move the data or make copies of it. This is what I think you're trying to accomplish. For example,
      Code :
      error = clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
      and
      Code :
      error = clEnqueueTask(cpucommandqueue, kernel, 0, NULL, NULL);
    • For the GPU, however, the runtime MUST transfer the array from host memory to device memory and invoke the kernel using the data that is now in device memory. Naturally this transfer takes time, depending upon how much data there is. For example,
      Code :
      error = clEnqueueTask(gpucommandqueue, kernel, 0, NULL, NULL);
    • If another GPU device kernel is enqueued for this data, then the runtime knows that the data is already on the device, so no data transfer should happen. For example,
      Code :
      error = clEnqueueTask(gpucommandqueue, kernel2, 0, NULL, NULL);
    • After the GPU device kernel completes execution, the runtime can transfer the data to host memory when requested by either the host application or the CPU device.
    • If the CPU device kernel needs this data, then the runtime must transfer the data from device memory to host memory, incurring the transfer time, and invoke the kernel passing a pointer to the data. For example,
      Code :
      error = clEnqueueTask(cpucommandqueue, kernel2, 0, NULL, NULL);
    • If the application needs this data, then the runtime may or may not transfer the data, depending on whether it is already in host memory. However, in general, for buffers created with CL_MEM_USE_HOST_PTR it is best to use clEnqueueMapBuffer, because if the data is already in host memory there is no need to transfer it back. For example,
      Code :
      void *mapaddr = clEnqueueMapBuffer(cpucommandqueue, buffer, CL_TRUE, CL_MAP_READ, 0, 1024*1024*sizeof(float), 0, NULL, NULL, &error);
      access the data at mapaddr, then
      Code :
      error = clEnqueueUnmapMemObject(cpucommandqueue, buffer, mapaddr, 0, NULL, NULL);
    • If you are done using OpenCL, then release the buffer to regain access to the data. For example,
      Code :
      error = clReleaseMemObject(buffer);

    Note: I have not compiled any of this code, so there might be some typos in it.

  7. #7

    Re: Sharing host memory with clSetKernelArg!

    I appreciate the detailed and well-put answer. Here are some timings that I performed for the CPU device:
    Code :
    cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 1024*1024*2, array, &error);
    error = clReleaseMemObject(buffer);
    AMD driver: 25us
    Intel driver: 40us

    Time to copy the array in C++ (not in cache / cached):

    Cold: 750us
    Warm: 180us

    Timing with mapping:
    Code :
    cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 1024*1024*2, array, &error);
    void *mapaddr = clEnqueueMapBuffer(cpucommandqueue, buffer, CL_TRUE, CL_MAP_READ, 0, 1024*1024, 0, NULL, NULL, &error);
    error = clEnqueueUnmapMemObject(cpucommandqueue, buffer, mapaddr, 0, NULL, NULL);
    error = clReleaseMemObject(buffer);
    AMD driver: 37us
    Intel driver (release version): ~70us

    What are the possible side-effects if I use the array pointer directly, without calling the map/unmap pair to obtain back the already known value? Instead I only make sure that the queue has finished (clFinish).

    Even though 37us does not seem like much for AMD, it is still 37x more than clSetKernelArg (not to mention the utter simplicity of one function call against four which must be properly configured out of many options). In terms of computational power, 37us is enough to compute 4x 1024-point FFTs (on one core). 12us is what the map/unmap pair alone costs, and that is still enough for 1x 1024-point FFT.

    It may be that in the world of GPUs these numbers are "small", but the CPU device is a different ballgame.

    Thanks!
    Atmapuri

    P.S.
    The Intel driver requires 1024-byte array alignment in order not to copy the memory.
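
    For reference, a sketch of allocating with that alignment (assuming a POSIX host; _aligned_malloc would be the Windows equivalent):
    Code :
    #include <stdlib.h>
     
    /* Request 1024-byte alignment so the driver can use the memory in place. */
    float *array = NULL;
    if (posix_memalign((void **) &array, 1024, 1024 * 1024 * sizeof(float)) != 0) {
        /* allocation failed */
    }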

  8. #8

    Re: Sharing host memory with clSetKernelArg!

    On the hgpu.org machine, whose platform parameters are:
    • OS: OpenSUSE 11.4
    • SDK: AMD Accelerated Parallel Processing (APP) SDK 2.4
    • GPU device 0: ATI Radeon HD 5870 2GB, 850MHz
    • GPU device 1: ATI Radeon HD 6970 2GB, 880MHz
    • CPU: AMD Phenom II X6 1055T @ 2.8GHz
    • RAM: 12GB
    • HDD: 2TB, RAID-0


    With the following program
    Code :
    #define _BSD_SOURCE
    #include <sys/time.h>
    #include <CL/cl.h>
    #include <stdio.h>
    #include <malloc.h>
    #include <stdlib.h>
     
    #define CHECK(function_call) \
    do { \
      /* printf("CHECK(" #function_call ") in: %s, line: %d\n", __FILE__, __LINE__); */ \
      int _rc = (function_call); \
      if (_rc != 0) { \
    	printf("ERROR:  function rc = %d\n", _rc); \
    	fflush(stdout); \
    	exit(_rc); \
      } \
    } while(0);
     
    #define CHECK_ERR(function_call, _rc) \
    do { \
      /* printf("CHECK(" #function_call ") in: %s, line: %d\n", __FILE__, __LINE__); */ \
      (function_call); \
      if (_rc != 0) { \
    	printf("ERROR:  function rc = %d\n", _rc); \
    	fflush(stdout); \
    	exit(_rc); \
      } \
    } while(0);
     
    int main(int argc, char **argv) {
     
    	int err;
    	struct timeval start, end, diff;
    	cl_uint num_platforms;
    	cl_platform_id *platforms;
    	cl_uint num_devices;
    	cl_device_id *devices;
    	cl_context context;
    	cl_mem buffer;
     
    	// Get platforms
    	CHECK(clGetPlatformIDs(0, NULL, &num_platforms));
    	platforms = (cl_platform_id *) malloc(
    			num_platforms * sizeof(cl_platform_id));
    	CHECK(clGetPlatformIDs(num_platforms, platforms, NULL));
     
    	// Loop through all platforms
    	unsigned int p;
    	for (p = 0; p < num_platforms; p++) {
     
    		// Output platform name
    		size_t platform_name_size;
    		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, 0, NULL, &platform_name_size));
    		char *platform_name = (char *) malloc(platform_name_size);
    		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, platform_name_size, platform_name, NULL));
    		printf("platform[%u]=%s\n", p, platform_name);
    		free(platform_name);
     
    		// Get devices
    		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices));
    		devices = (cl_device_id *) malloc(num_devices * sizeof(cl_device_id));
    		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL));
     
    		// Loop through all devices
    		unsigned int d;
    		for (d = 0; d < num_devices; d++) {
     
    			// Output device name
    			size_t device_name_size;
    			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 0, NULL, &device_name_size));
    			char *device_name = (char *) malloc(device_name_size);
    			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, device_name_size, device_name, NULL));
    			printf("device[%u]=%s\n", d, device_name);
    			free(device_name);
     
    			// Create Context
    			cl_context_properties context_properties[3] = {
    					CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[p],
    					0 };
    			CHECK_ERR(context = clCreateContext(context_properties, num_devices, devices,
    					NULL, NULL, &err), err);
     
    			// Start timing
    			err = gettimeofday(&start, NULL);
    			if (err != 0) {
    				printf("gettimeofday(start, NULL) failed err=%d\n", err);
    				exit(err);
    			}
     
    			// Allocate Buffer
    			int *array = (int *) malloc(1024 * 1024 * sizeof(int));
     
    			// Create Buffer
    			CHECK_ERR(buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
    					1024 * 1024 * sizeof(int), array, &err), err);
     
    			// Release Buffer
    			CHECK(clReleaseMemObject(buffer));
     
    			// End timing
    			err = gettimeofday(&end, NULL);
    			if (err != 0) {
    				printf("gettimeofday(end, NULL) failed err=%d\n", err);
    				exit(err);
    			}
     
    			// Get end-start difference and print it
    			timersub(&end, &start, &diff);
    			float time = ((float) diff.tv_sec * 1000000)
    					+ ((float) diff.tv_usec);
    			printf("end-start time %f usec\n", time);
     
    			// Free the host array now that the buffer wrapping it has been released
    			free(array);
     
    			CHECK(clReleaseContext(context));
     
    		}
     
    		free(devices);
     
    	}
     
    	free(platforms);
     
    	return 0;
    }
    I get the following timing results
    Code :
    platform[0]=AMD Accelerated Parallel Processing
    device[0]=Cypress
    end-start time 11.000000 usec
    device[1]=Cayman
    end-start time 4.000000 usec
    device[2]=AMD Phenom(tm) II X6 1055T Processor
    end-start time 4.000000 usec
    This is faster than what you reported. Does this program match what you have used? What do you get when you run this program on your system? If you wish, please update this program to include more timings and repost it here.

    Quote Originally Posted by Atmapuri
    What are the possible side-effects if I use the array pointer directly, without calling the map/unmap pair to obtain back the already known value? Instead I only make sure that the queue has finished (clFinish).
    Your host memory will be stale (out-of-date) if you use a GPU device because the values in the device memory will not be read back (or mapped) into host memory.
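
    That is, after a GPU kernel completes you must map (or read) the buffer before the host touches the pointer again. A minimal sketch, reusing the names from my earlier examples:
    Code :
    /* Synchronize device memory back to the host before reading it. */
    void *p = clEnqueueMapBuffer(gpucommandqueue, buffer, CL_TRUE, CL_MAP_READ,
                                 0, 1024*1024*sizeof(float), 0, NULL, NULL, &error);
    /* ... host reads the data through p (== array for CL_MEM_USE_HOST_PTR) ... */
    error = clEnqueueUnmapMemObject(gpucommandqueue, buffer, p, 0, NULL, NULL);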

    Quote Originally Posted by Atmapuri
    It may be that in the world of GPUs these numbers are "small", but the CPU device is a different ballgame.
    I'm not sure what you are saying; please explain. If you are saying that using a GPU has more overhead than dispatching a simple call from a CPU host application to a CPU compute function, then yes, you are right. A call on the CPU is typically only a few instructions, whereas the GPU hardware is an I/O-attached device which requires much more overhead to transfer data to it and dispatch a function on it. However, a GPU has tremendous parallel processing capability. So it is all a game of hiding the bandwidth and latency by doing enough work to make the overhead worthwhile.

  9. #9

    Re: Sharing host memory with clSetKernelArg!

    Continuing... Adding in a command queue and a map/unmap, I get the following results
    Code :
    platform[0]=AMD Accelerated Parallel Processing
    device[0]=Cypress
    end-start time 122.000000 usec
    device[1]=Cayman
    end-start time 114.000000 usec
    device[2]=AMD Phenom(tm) II X6 1055T Processor
    end-start time 165.000000 usec
    The program is now
    Code :
    #define _BSD_SOURCE
    #include <sys/time.h>
    #include <CL/cl.h>
    #include <stdio.h>
    #include <malloc.h>
    #include <stdlib.h>
     
    #define CHECK(function_call) \
    do { \
      /* printf("CHECK(" #function_call ") in: %s, line: %d\n", __FILE__, __LINE__); */ \
      int _rc = (function_call); \
      if (_rc != 0) { \
    	printf("ERROR:  function rc = %d\n", _rc); \
    	fflush(stdout); \
    	exit(_rc); \
      } \
    } while(0);
     
    #define CHECK_ERR(function_call, _rc) \
    do { \
      /* printf("CHECK(" #function_call ") in: %s, line: %d\n", __FILE__, __LINE__); */ \
      (function_call); \
      if (_rc != 0) { \
    	printf("ERROR:  function rc = %d\n", _rc); \
    	fflush(stdout); \
    	exit(_rc); \
      } \
    } while(0);
     
    int main(int argc, char **argv) {
     
    	int err;
    	struct timeval start, end, diff;
    	cl_uint num_platforms;
    	cl_platform_id *platforms;
    	cl_uint num_devices;
    	cl_device_id *devices;
    	cl_context context;
    	cl_command_queue commandqueue;
    	cl_mem buffer;
     
    	// Get platforms
    	CHECK(clGetPlatformIDs(0, NULL, &num_platforms));
    	platforms = (cl_platform_id *) malloc(
    			num_platforms * sizeof(cl_platform_id));
    	CHECK(clGetPlatformIDs(num_platforms, platforms, NULL));
     
    	// Loop through all platforms
    	unsigned int p;
    	for (p = 0; p < num_platforms; p++) {
     
    		// Output platform name
    		size_t platform_name_size;
    		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, 0, NULL, &platform_name_size));
    		char *platform_name = (char *) malloc(platform_name_size);
    		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, platform_name_size, platform_name, NULL));
    		printf("platform[%u]=%s\n", p, platform_name);
    		free(platform_name);
     
    		// Get devices
    		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices));
    		devices = (cl_device_id *) malloc(num_devices * sizeof(cl_device_id));
    		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL));
     
    		// Create Context with all devices
    		cl_context_properties context_properties[3] = { CL_CONTEXT_PLATFORM,
    				(cl_context_properties) platforms[p], 0 };
    		CHECK_ERR(context = clCreateContext(context_properties, num_devices, devices,
    						NULL, NULL, &err), err);
     
    		// Loop through all devices
    		unsigned int d;
    		for (d = 0; d < num_devices; d++) {
     
    			// Output device name
    			size_t device_name_size;
    			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 0, NULL, &device_name_size));
    			char *device_name = (char *) malloc(device_name_size);
    			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, device_name_size, device_name, NULL));
    			printf("device[%u]=%s\n", d, device_name);
    			free(device_name);
     
    			// Create command queue
    			CHECK_ERR(commandqueue = clCreateCommandQueue(context, devices[d], 0, &err), err);
     
    			// Start timing
    			CHECK(gettimeofday(&start, NULL));
     
    			// Allocate Buffer
    			int *array = (int *) malloc(1024 * 1024 * sizeof(int));
     
    			// Create Buffer
    			CHECK_ERR(buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
    							1024 * 1024 * sizeof(int), array, &err), err);
     
    			// Map Buffer
    			void *mapaddr;
    			CHECK_ERR(mapaddr = clEnqueueMapBuffer(commandqueue, buffer, CL_TRUE, CL_MAP_WRITE, 0, 1024 * 1024 * sizeof(int), 0, NULL, NULL, &err), err);
     
    			// Unmap Memory Object
    			CHECK(clEnqueueUnmapMemObject(commandqueue, buffer, mapaddr, 0, NULL, NULL));
     
    			// Release Buffer
    			CHECK(clReleaseMemObject(buffer));
     
    			// End timing
    			CHECK(gettimeofday(&end, NULL));
     
    			// Compute end-start timing difference and print it
    			timersub(&end, &start, &diff);
    			float time = ((float) diff.tv_sec * 1000000)
    					+ ((float) diff.tv_usec);
    			printf("end-start time %f usec\n", time);
     
    			// Free the host array now that the buffer wrapping it has been released
    			free(array);
     
    			// Release command queue
    			CHECK(clReleaseCommandQueue(commandqueue));
     
    		}
     
    		// Release context
    		CHECK(clReleaseContext(context));
     
    		free(devices);
     
    	}
     
    	free(platforms);
     
    	return 0;
    }

  10. #10

    Re: Sharing host memory with clSetKernelArg!

    Changing the above program to specify only CL_DEVICE_TYPE_CPU (no GPU usage) gives the following results with map/unmap
    Code :
    platform[0]=AMD Accelerated Parallel Processing
    device[0]=AMD Phenom(tm) II X6 1055T Processor
    end-start time 71.000000 usec
