Khronos.org Message Boards

Public discussions about the Khronos Dynamic Media APIs

All times are UTC - 8 hours [ DST ]




 Post subject: Data Feedback
Posted: Wed Feb 03, 2010 2:43 am
Joined: Wed Feb 03, 2010 2:22 am
Posts: 3
Hi

I'm new to OpenCL and am trying to port the source code below to an Nvidia Quadro FX1700 GPU. There is data feedback in the nested loops (i.e. alpha_t[s]=new_alpha_t[s]), so the intermediate results of one iteration are used in the next. How do I achieve this data feedback in a kernel? I used a global work size of 752 and a local work size of 8 for my kernel. In addition, I unrolled the innermost loop (over z) to compute sum[0], sum[1], sum[2] and sum[3] directly.

Code:
int sm_lut[4][8] = {{1,5,5,7,6,3,0,5},
                {2,6,4,6,5,2,2,7},
                {7,7,3,1,3,5,2,2},
                {0,1,2,0,4,4,6,1}
               };

int s,m,z;
int alpha_t[8]={0};
int new_alpha_t[8];

for (m=0; m<752; m++)
{
    for (s=0; s<8; s++)
    {
        int sum[4];

        for (z=0; z<4; z++)
        {
            int sm1;
            sm1 = sm_lut[z][s];
            sum[z] = alpha_t[sm1];
        }
        new_alpha_t[s]=max4(sum[0],sum[1],sum[2],sum[3]);
    }

    for (s=0; s<8; s++)
        alpha_t[s]=new_alpha_t[s];
}


Thanks in advance for your help.


 Post subject: Re: Data Feedback
Posted: Thu Feb 04, 2010 2:53 pm
Joined: Thu Dec 17, 2009 5:33 pm
Posts: 22
Copy it back to the host.


 Post subject: Re: Data Feedback
Posted: Sat Feb 06, 2010 9:19 am
Joined: Tue Jul 21, 2009 11:23 am
Posts: 311
Location: Northern Europe
If I understand you correctly, you want to feed data back between separate work-groups within one kernel execution. You can do this, but OpenCL does *not* allow synchronization between work-groups. This means you need to ensure at the algorithm level that you do not have any data races, and then you can do it through global memory. If you need synchronization (e.g., all work-groups are done before the next iteration) you have to do it through multiple kernel executions with the results stored to global memory.
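As a rough illustration of the multiple-execution route, here is a plain C sketch that stands in for the host loop. It is not OpenCL host code: forward_step plays the role of one kernel execution, and in a real program each call would be a kernel enqueue with the two buffer arguments swapped. The names forward_step and forward_all, the flat table layout lut[s * 4 + z], and the indexing beta[z][t * 8 + s] are assumptions made for this sketch, loosely modelled on the kernel posted in this thread.

```c
#include <assert.h>

#define NUM_STATES 8  /* work-items per launch, one per state s */
#define NUM_PREV   4  /* predecessor states per state (the z loop) */

/* Stand-in for ONE kernel execution (hypothetical name). Each "work-item" s
   reads only alpha_in and writes only alpha_out, so there is no data race. */
void forward_step(const int *alpha_in, int *alpha_out, const int *lut,
                  const int *const beta[NUM_PREV], int t)
{
    for (int s = 0; s < NUM_STATES; s++) {  /* on the GPU: one work-item per s */
        int best = alpha_in[lut[s * NUM_PREV]] + beta[0][t * NUM_STATES + s];
        for (int z = 1; z < NUM_PREV; z++) {
            int cand = alpha_in[lut[s * NUM_PREV + z]]
                     + beta[z][t * NUM_STATES + s];
            if (cand > best)
                best = cand;
        }
        alpha_out[s] = best;
    }
}

/* Host loop: one "launch" per step, with the two buffers swapped in between.
   result must hold steps * NUM_STATES ints. */
void forward_all(const int *lut, const int *const beta[NUM_PREV],
                 int steps, int *result)
{
    int buf_a[NUM_STATES] = {0};  /* alpha_t starts at zero, as in the post */
    int buf_b[NUM_STATES];
    int *in = buf_a, *out = buf_b;

    assert(lut && beta && result && steps >= 0);
    for (int t = 0; t < steps; t++) {
        forward_step(in, out, lut, beta, t);  /* completes before the next one */
        for (int s = 0; s < NUM_STATES; s++)
            result[t * NUM_STATES + s] = out[s];
        int *tmp = in; in = out; out = tmp;   /* this output feeds the next step */
    }
}
```

The point of the two buffers is that a step never reads a value that the same step is writing, which is the data-race condition mentioned above.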


 Post subject: Re: Data Feedback
Posted: Mon Feb 08, 2010 1:32 am
Joined: Wed Feb 03, 2010 2:22 am
Posts: 3
I need synchronization (e.g., all work-groups are done before the next iteration) because the innermost loop (i.e., z) uses the alpha_t values from the previous iteration to compute the sum values in the next one.

I used a global work size of 6016 (= 752 * 8) and a local work size of 8 for my kernel. _beta0 through _beta3 each hold 6016 elements and were computed by another kernel. However, I am unable to get the correct results using the kernel below:
Code:
__kernel void forward(__global int *_dResult,
                      __global int *_beta0,
                      __global int *_beta1,
                      __global int *_beta2,
                      __global int *_beta3)
{
    const int sm_lut[32] = {0, 1, 6, 7, 2, 3, 4, 5,
                            5, 4, 3, 2, 7, 6, 1, 0,
                            1, 0, 7, 6, 3, 2, 5, 4,
                            4, 5, 2, 3, 6, 7, 0, 1
                           };

    volatile int alpha_t[8];
    volatile int new_alpha_t[8];
    int sum[4] = {0};

    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int idx = lid << 2;
    int size = get_global_size(0) / get_local_size(0);

    for (int i = 0; i < 8; i++)
        new_alpha_t[i] = alpha_t[i] = 0;

    for (int t = 0; t < size; t++)
    {
        sum[0] = alpha_t[sm_lut[idx]]     + _beta0[gid];
        sum[1] = alpha_t[sm_lut[idx + 1]] + _beta1[gid];
        sum[2] = alpha_t[sm_lut[idx + 2]] + _beta2[gid];
        sum[3] = alpha_t[sm_lut[idx + 3]] + _beta3[gid];

        new_alpha_t[lid] = max4(sum[0], sum[1], sum[2], sum[3]);
        alpha_t[lid] = new_alpha_t[lid];
    }
    _dResult[gid] = alpha_t[lid];
}


Could you please let me know how I could compute the 6016 results through multiple kernel executions?

Thanks for your help.


 Post subject: Re: Data Feedback
Posted: Mon Feb 08, 2010 11:39 am
Joined: Tue Jul 21, 2009 11:23 am
Posts: 311
Location: Northern Europe
If you're trying to synchronize across work-groups (which is not allowed in OpenCL except via multiple kernel executions) you just write out your new alpha_t values to global memory when you're done with one execution, and then read them in with the next execution. That way you know that before the second kernel execution starts all the values from the first kernel execution have been written out. Ideally you would re-factor your algorithm to only need synchronization within a work-group and you wouldn't have to incur that overhead. (If your kernel isn't doing a lot of work on an iteration, the overhead of starting it can be a substantial amount of the total execution.)
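To make the work-group-only variant a bit more concrete, here is a single-threaded C emulation of one work-group of 8 work-items running the whole recurrence in a single launch. It is only a sketch under assumptions: the name forward_one_group, the flat table layout lut[lid * 4 + z], and the indexing beta[z][t * 8 + lid] are made up for illustration, and the loops over lid stand in for work-items that would run in parallel. The comments mark where a real kernel would call barrier(CLK_LOCAL_MEM_FENCE).

```c
#include <assert.h>

#define NUM_STATES 8  /* one work-group of 8 work-items, one per state */
#define NUM_PREV   4  /* predecessor states per state */

/* Emulates a single work-group that loops over all steps inside one launch
   (hypothetical name). result must hold steps * NUM_STATES ints. */
void forward_one_group(const int *lut, const int *const beta[NUM_PREV],
                       int steps, int *result)
{
    int alpha[NUM_STATES] = {0};  /* would live in __local memory */
    int next[NUM_STATES];         /* next[lid] is work-item lid's private value */

    assert(lut && beta && result && steps >= 0);
    for (int t = 0; t < steps; t++) {
        /* Phase 1: every work-item reads the shared alpha values. */
        for (int lid = 0; lid < NUM_STATES; lid++) {
            int best = alpha[lut[lid * NUM_PREV]]
                     + beta[0][t * NUM_STATES + lid];
            for (int z = 1; z < NUM_PREV; z++) {
                int cand = alpha[lut[lid * NUM_PREV + z]]
                         + beta[z][t * NUM_STATES + lid];
                if (cand > best)
                    best = cand;
            }
            next[lid] = best;
        }
        /* A barrier would go here: all reads of alpha must finish
           before any work-item overwrites its slot. */

        /* Phase 2: every work-item publishes its own slot. */
        for (int lid = 0; lid < NUM_STATES; lid++) {
            alpha[lid] = next[lid];
            result[t * NUM_STATES + lid] = next[lid];
        }
        /* A second barrier would go here: all writes must be visible
           before the next step's reads. */
    }
}
```

With this structure the steps run back to back inside one work-group, so it only keeps the GPU busy if there are several independent sequences to hand to separate work-groups.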


 Post subject: Re: Data Feedback
Posted: Tue Feb 09, 2010 6:59 pm
Joined: Wed Feb 03, 2010 2:22 am
Posts: 3
Got it. Thanks for your help. :lol:

