
Thread: cpu working, gpu not

  1. #1
    Junior Member | Join Date: Aug 2010 | Posts: 3

    cpu working, gpu not

    Code :
    __kernel void deform(   __constant float4 *inCage,   /* cage control points            */
                            __constant uint   *wnum,     /* weight count per output vertex */
                            __constant uint   *wskip,    /* offset of first weight         */
                            __constant float  *weights,  /* packed non-zero weights        */
                            __constant uint   *cageID,   /* cage point index per weight    */
                            __global   float4 *outMesh,  /* deformed output vertices       */
                            const uint nend)             /* number of output vertices      */
    {
        uint i = get_global_id(0);
        if(i < nend)
        {
            float4 pos = (float4) (0.0f, 0.0f, 0.0f, 0.0f);
            uint j   = wskip[i];
            uint end = j + wnum[i];
     
            /* weighted sum of the cage points that influence vertex i */
            for(; j < end; j++)
            {
                pos += inCage[cageID[j]] * weights[j];
            }
            outMesh[i] = pos;
        }
    }
    I got the above code working correctly on the CPU, but on the GPU it outputs garbage and runs about 100x slower. My NDRange is over 30 000. I've got an Nvidia GTX 260, a quad-core Athlon, and Windows XP 64-bit, with ATI Stream installed and Nvidia's 195.39 ForceWare drivers (later drivers output garbage that wasn't affected by the input in any way; at least these react to the input by outputting garbage that changes).

    My first thought was that the slowdown might be because everything is in global/constant memory. Unfortunately I'm at a loss as to how to get inCage into local memory nicely, since it isn't possible to assign a __global pointer to a __local one. How do I copy inCage to local memory efficiently?

    I've stripped all zero weights from the data, so the for-loop length depends on how many weights exist per i. Every element of cageID and weights is read only once across the whole NDRange, so it seems logical to keep them in __constant or __global memory.
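    For clarity, here's roughly how the packed weight buffers relate on the host side (just a sketch; numVerts and the prefix-sum construction of wskip are how I happen to build it, not anything the kernel depends on):

    Code :
    /* wskip is an exclusive prefix sum of wnum, so that
       weights[wskip[i] .. wskip[i] + wnum[i] - 1] and the matching
       cageID entries belong to output vertex i */
    unsigned int offset = 0;
    for (unsigned int i = 0; i < numVerts; i++)
    {
        wskip[i] = offset;   /* index of vertex i's first weight */
        offset += wnum[i];   /* advance past vertex i's weights  */
    }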

  2. #2
    Senior Member | Join Date: May 2010 | Location: Toronto, Canada | Posts: 845

    Re: cpu working, gpu not

    Have you made sure that the GPU is not returning any error codes when you call clEnqueueNDRangeKernel()? It looks like you are using a ton of __constant arguments, and there are limits on both their number and their total size. Try replacing __constant with __global in the source code and see if the code runs correctly this time.
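    You can query those limits like this (a sketch; "device" stands for whatever cl_device_id you built the context with):

    Code :
    cl_uint  max_const_args;
    cl_ulong max_const_size;
    /* maximum number of __constant kernel arguments */
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_ARGS,
                    sizeof(max_const_args), &max_const_args, NULL);
    /* maximum total size of __constant buffers, in bytes */
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_const_size), &max_const_size, NULL);
    printf("max __constant args: %u, max __constant buffer size: %lu bytes\n",
           max_const_args, (unsigned long)max_const_size);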

    Also, instead of passing that "nend" argument, I recommend changing the values you pass to clEnqueueNDRangeKernel() so that you only spawn "nend" work-items. Do something like this:

    Code :
    size_t global_work_size = nend;
    errcode = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL,
                                     num_wait, wait_list, &event);


    How do I copy inCage to local memory efficiently?
    If you are using OpenCL 1.1, then you can use async_work_group_copy() (look it up in the spec).
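    Something like this inside the kernel (just a sketch, untested; I've left out the other arguments, the cage has to come in as __global rather than __constant for this, and localCage is a __local buffer argument you size from the host with clSetKernelArg(kernel, idx, ncage * sizeof(cl_float4), NULL)):

    Code :
    __kernel void deform(__global const float4 *inCage,
                         __local        float4 *localCage,  /* sized by the host      */
                         const uint ncage)                  /* number of cage points  */
    {
        /* every work-item in the group takes part in the copy */
        event_t e = async_work_group_copy(localCage, inCage, ncage, 0);
        wait_group_events(1, &e);
     
        /* ...then read localCage[cageID[j]] instead of inCage[cageID[j]]
           in the same weight loop as before... */
    }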
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Junior Member | Join Date: Aug 2010 | Posts: 3

    Re: cpu working, gpu not

    Thanks for your reply.

    There were no error codes returned at any stage. I'm going to give __global a go, though I'm quite sure I've tried it before. As for nend and the if, I observed that on the CPU it's faster to keep the if, set the local work size by hand, and make the global work size a multiple of the local work size. As far as I've understood, the global work size has to be a multiple of the local work size, or doesn't it?

    I just went through sample code where the global data was pulled into local memory by different work-items, followed by a memory barrier to make sure all the data had been transferred. Is this a good way to go in case I have to settle for the OpenCL 1.0 spec? Hopefully I won't have to, as async_work_group_copy() looks rather simple to use.

  4. #4
    Senior Member | Join Date: May 2010 | Location: Toronto, Canada | Posts: 845

    Re: cpu working, gpu not

    As far as I've understood, the global work size has to be a multiple of the local work size, or doesn't it?
    The global work size has to be a multiple of the local work size, but you don't need to specify a local work size every time. Unless you need a particular size for some reason (usually due to your algorithm), simply pass NULL as the local_work_size argument of clEnqueueNDRangeKernel(). The driver will pick a value that is most suitable for your device.

    I just went through sample code where the global data was pulled into local memory by different work-items, followed by a memory barrier to make sure all the data had been transferred. Is this a good way to go in case I have to settle for the OpenCL 1.0 spec?
    Yes, that's the right way to do it.
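    For reference, that pattern usually looks something like this (a sketch; ncage is the number of cage points and localCage is a __local buffer sized from the host or declared with a compile-time size):

    Code :
    /* each work-item copies a strided slice of the cage into local memory,
       then the whole work-group waits at the barrier before using it */
    uint lid = get_local_id(0);
    uint lsz = get_local_size(0);
    for (uint k = lid; k < ncage; k += lsz)
        localCage[k] = inCage[k];
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ...now read localCage[cageID[j]] in the weight loop... */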
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

