Page 1 of 2 12 LastLast
Results 1 to 10 of 17

Thread: are clCreateProgramWithSource & clBuildProgram thread-safe?

  1. #1
    Junior Member
    Join Date
    Apr 2011
    Posts
    13

    are clCreateProgramWithSource & clBuildProgram thread-safe?

    I am working on a multi-threaded code to run an OpenCL kernel on several Compute Devices simultaneously. In this code, I first launch N threads, with N equal to the user specified work device count (CPU, GPU1, GPU2 ...). In each thread, I run the same host code, creating context, create command queue, create buffers and compile a program etc.

    I am wondering if this overall structure is safe and sound in a multi-threaded environment? Are there any specifications on which cl function is thread-safe or not?

    Moreover, if my GPU1 and GPU2 are both of the same kind, say the two cores in a Radeon 6990, is there a way to save one clBuildProgram? the reason I ask is because the building process is kind of long at this point. It takes about 5 seconds to compile a 650 lines cl code with ati catalyst 11.3.

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    In each thread, I run the same host code, creating context, create command queue, create buffers and compile a program etc.

    I am wondering if this overall structure is safe and sound in a multi-threaded environment? Are there any specifications on which cl function is thread-safe or not?
    It depends on whether you are targeting OpenCL 1.0 or 1.1. In OpenCL 1.1. all functions are thread-safe except for clSetKernelArg(), so it is strongly recommended to have different threads use different kernel objects -- you simply call clCreateKernel() multiple times, one for each thread. Note that it's okay to have multiple kernel objects for the same kernel function in your source code.

    In OpenCL 1.0 the only thread-safe functions are clCreateXXX, clRetainXXX and clReleaseXXX.

    Moreover, if my GPU1 and GPU2 are both of the same kind, say the two cores in a Radeon 6990, is there a way to save one clBuildProgram? the reason I ask is because the building process is kind of long at this point. It takes about 5 seconds to compile a 650 lines cl code with ati catalyst 11.3.
    If you have multiple devices from the same vendor I recommend putting all of them into the same context. That way you can share the same program object(s) across both devices and you only need to call clBuildProgram() once. Notice that when you call to clBuildProgram() you pass a list of device IDs. On sensible implementations of OpenCL you would expect that passing multiple devices would be faster than calling clBuildProgram() multiple times.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Junior Member
    Join Date
    Apr 2011
    Posts
    13

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    thanks!

    just to make sure I understand, when I say "thread-safe", I meant calling the same function(s) simultaneously in multiple threads. Is this the same definition in your comment?

    For example, if I have a computer with a CPU, a Radeon 4xxx GPU and a 6xxx GPU, are you suggesting me to use two threads to build the program: 1 for CPU, 1 for the two Radeons?

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    just to make sure I understand, when I say "thread-safe", I meant calling the same function(s) simultaneously in multiple threads. Is this the same definition in your comment?
    Thread safety in OpenCL is defined in Appendix A.2. The idea is: any number of threads can call any of the thread-safe functions at any time and it must work, including things like multiple threads calling the same function simultaneously.

    For example, if I have a computer with a CPU, a Radeon 4xxx GPU and a 6xxx GPU, are you suggesting me to use two threads to build the program: 1 for CPU, 1 for the two Radeons?
    I'm not actually suggesting that. I mean, it will work but I doubt it will buy you much so why do it?

    Generally speaking, what's the reason you implemented this app with multiple threads? Do you expect to see any benefits?

    Threads = headaches in my book OpenCL driver developers go through quite a bit of pain enable developers to write apps that scale with multicore systems without having to write multithreaded code. That's one of the big reasons why OpenCL exists.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5
    Junior Member
    Join Date
    Apr 2011
    Posts
    13

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    Quote Originally Posted by david.garcia
    I'm not actually suggesting that. I mean, it will work but I doubt it will buy you much so why do it?

    Generally speaking, what's the reason you implemented this app with multiple threads? Do you expect to see any benefits?

    Threads = headaches in my book OpenCL driver developers go through quite a bit of pain enable developers to write apps that scale with multicore systems without having to write multithreaded code. That's one of the big reasons why OpenCL exists.
    hi David, thanks again for the insightful comments.

    My motivation of doing so is based on an observation: when running a simulation kernel on the GPU, my CPU load drops to very low. That means, if I can have another thread to maintain a kernel running on the CPU simultaneously, I might be able to maximize the simulation performance. This naive view can be extended to multiple CPUs/GPUs on a heterogeneous system.

    When writing this code, I first wrote a host function to compile/launch/post-process data for a single device at a time. To make it to work for multiple devices simultaneously, I came up with a lazy (or portable) approach where I added OpenMP constructs to parallize the host function, something like:

    Code :
    ...
    #ifdef _OPENMP
         omp_set_num_threads(activedev);
    #endif
     
    #pragma omp parallel private(threadid)
    {
    #ifdef _OPENMP
         threadid=omp_get_thread_num();
    #endif
         // this launches the simulation
         run_simulation(&mcxconfig,threadid,activedev,fluence,&totalenergy);
    }
    ...

    where "activedev" is the number of working-devices that I want to use for my simulation; for each working-device, I launch an OpenMP thread to execute run_simulation(), where it builds/launches/post-processes on each device. Not surprisingly, several omp barrier/critical sections are needed in run_simulation(), but overall, this approach let me extend the simulation to multiple devices with minimum coding.

    I wrote this code in early 2010, and did not read deeply into OpenCL, and did not follow the progress in OpenCL 1.1. I am not sure if there are easier-to-use OpenCL functions that can use multiple devices without involving multi-threading.

    Although it works, I did observe a low efficiency running the above code on both CPU and GPU at the same time, as you have suspected. The time-saving is not attractive at all. Sometimes it takes even longer than running purely on the GPU. I am still investigating what's going on.

    In your opinion, is my above approach justified ? are you aware of any other robust alternatives that can fully exploit the computational power from a heterogeneous platform? or this is out of the design-scope of OpenCL?

  6. #6
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    My motivation of doing so is based on an observation: when running a simulation kernel on the GPU, my CPU load drops to very low. That means, if I can have another thread to maintain a kernel running on the CPU simultaneously, I might be able to maximize the simulation performance. This naive view can be extended to multiple CPUs/GPUs on a heterogeneous system.
    You don't need multiple threads for that. All you need is multiple command queues, one for each device. That will easily keep all devices at 100% usage.

    I can't think of any case where it makes sense to mix OpenMP with OpenCL. OpenCL basically does everything OpenMP does and then some... at the cost of being quite a lot more complex to program for, of course.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  7. #7
    Junior Member
    Join Date
    Apr 2011
    Posts
    13

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    Quote Originally Posted by david.garcia
    You don't need multiple threads for that. All you need is multiple command queues, one for each device. That will easily keep all devices at 100% usage.

    I can't think of any case where it makes sense to mix OpenMP with OpenCL. OpenCL basically does everything OpenMP does and then some... at the cost of being quite a lot more complex to program for, of course.
    ah, looks like I haven't really understood opencl

    I will definitely do more readings on this, but to get me started, I am wondering if you can give me some quick suggestion on what needs to be modified to achieve what you recommended.

    The following is a pseudo-code outline of what were done inside the run_simulation() function in my old implementation:

    Code :
    run_simulation(int devid,int iscpu){
      context=clCreateContextFromType(iscpu?CL_DEVICE_TYPE_CPU:CL_DEVICE_TYPE_GPU)
      clGetContextInfo(context,...,devices)
      commands=clCreateCommandQueue(context,devices[devid])
      var1=clCreateBuffer(context,...)
      var2=clCreateBuffer(context,...)
      ...
      program=clCreateProgramWithSource(context)
      status=clBuildProgram(program)
      kernel = clCreateKernel(program)
     
      clSetKernelArg(kernel,0,var1)
      clSetKernelArg(kernel,1,var1)
      ...
      clEnqueueNDRangeKernel(commands,kernel)
      clEnqueueReadBuffer(commands,var1)
      ...
    }

    As you see, it is designed for a single device per call. To make it work for all available OpenCL devices, I would imagine I need to do

    • in clCreateContextFromType, I should use CL_DEVICE_TYPE_ALL[/*:m:7mj2y5c6]
    • in the clCreateCommandQueue call, I do a loop for all members in the devices[] array to create commands[i][/*:m:7mj2y5c6]
    • it looks like clCreateBuffer, clCreateProgramWithSource, clBuildProgram and clSetKernelArg are only associated with the context and not specific device, so, I don't need to do anything[/*:m:7mj2y5c6]
    • for the clEnqueueNDRangeKernel() call, I need to use another loop, launching my kernel for each commands[i]; after launching, I will wait for the results with a blocking or non-blocking read.[/*:m:7mj2y5c6]


    do the above changes look reasonable to you?

    what if I want to launch kernels with diff arguments on each device (for example kernel(10) on CPU and kernel(1000) on a GPU for load-balancing purposes)?

    thank you again for your kind suggestions.

  8. #8
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    Don't worry, we are all learning one thing or another.

    Answering your question, what the code looks like will depend on how you want to split the work across devices and read the results back. In the simplest case it could look like this:

    Code :
    run_simulation()
    {
      context=clCreateContextFromType(CL_DEVICE_TYPE_ALL)
      clGetContextInfo(context,...,devices)
     
      program=clCreateProgramWithSource(context)
      status=clBuildProgram(program, all devices in context)
     
      for each device in context
      {
        commands=clCreateCommandQueue(context,devices[devid])
        // having different buffers for each device can prevent unnecessary
        // data transfers between devices
        var1=clCreateBuffer(context,...)
        var2=clCreateBuffer(context,...)
        ...
        // it's fine to create multiple kernel objects
        // all associated with the same kernel function
        // (or different ones for that matter)
        kernel = clCreateKernel(program)
     
        clSetKernelArg(kernel,0,var1)
        clSetKernelArg(kernel,1,var1)
        ...
        clEnqueueNDRangeKernel(commands,kernel)
        clEnqueueReadBuffer(commands, NON_BLOCKING, var1)
        ...
      }
      clWaitForEvents(all the ReadBuffer events above)
    }

    If you have multiple incompatible platforms on your system, such as an NVidia GPU and an x86 CPU (assuming NVidia's OpenCL driver doesn't support x86) then you will have to create one context for each device.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  9. #9
    Junior Member
    Join Date
    Apr 2011
    Posts
    13

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    Quote Originally Posted by david.garcia
    Answering your question, what the code looks like will depend on how you want to split the work across devices and read the results back. In the simplest case it could look like this:

    If you have multiple incompatible platforms on your system, such as an NVidia GPU and an x86 CPU (assuming NVidia's OpenCL driver doesn't support x86) then you will have to create one context for each device.
    thanks a lot David, I briefly tested the work-flow as suggested in your previous post, it seems to work very well!

    I have one more question: in the kernel parameter list, I have a scalar input to control how many photons to simulate (this application is designed for optical imaging, the CUDA version is at http://mcx.sf.net ). When launching this kernel for different devices, I came up with a load-balancing scheme by setting different photon numbers for each device: a smaller number for CPU and a higher number for GPU (can be a function of core numbers etc). From the code outline, the clSetKernelArg() calls only deal with a kernel, which does not seem to distinguish devices. To achieve the load-balancing, do I have to define another kernel for the CPU target and set arg. separately? Or I have to pass the full workload and total device/core numbers to the kernel and compute the device load inside the kernel?

    Any other mechanism that I can use in OpenCL's specification?

  10. #10
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: are clCreateProgramWithSource & clBuildProgram thread-sa

    Monte Carlo simulations of light transport! That's one of my favorite topics

    From the code outline, the clSetKernelArg() calls only deal with a kernel, which does not seem to distinguish devices. To achieve the load-balancing, do I have to define another kernel for the CPU target and set arg. separately? Or I have to pass the full workload and total device/core numbers to the kernel and compute the device load inside the kernel?
    Since we have split the input data into different buffer objects for different devices, you will need different kernel objects for different devices as well. For example, if we have assigned buffer B1 and B2 to device D1, then you will create a kernel object K1 and set the kernel arguments to B1 and B2, then you will enqueue an NDRange using K1 on D1. Do the same for buffers B3 and B4 assigned to device D2: create a kernel K2, set the kernel arguments to B3 and B4, then enqueue an NDRange using K2 on D2. Etc.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

Page 1 of 2 12 LastLast

Similar Threads

  1. WFD Error Codes Thread Safe?
    By lewk in forum OpenWF
    Replies: 2
    Last Post: 07-12-2011, 09:04 AM
  2. is clGetEventInfo thread-safe?
    By diegor in forum OpenCL
    Replies: 0
    Last Post: 12-02-2009, 04:55 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •