Page 1 of 2 12 LastLast
Results 1 to 10 of 15

Thread: Information concerning memory access

  1. #1
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Information concerning memory access

    Hello,
    I've a problem of wrong code and/or memory access. I'm kind of stuck optimizing a program for OpenCL.
    My program is about Dijsktra algorithm: I've written 3 kernels so far, one for each step -> initialization, min search and update distances.
    I've manage to never re-read the memory for the update part so that doesn't loose time but I'm stuck with the find min part.
    My kernel is as follow:
    Code :
    __kernel void find_min(__global int* distanceVector,
        __global int* usedSummitVector, const int size,
        __global int* min, __global int* lMin)
    {
        int k = get_global_id(0);
        int gap = get_global_size(0);
     
        lMin[k] = 0;
     
        int startIndex = k * gap;
        int endIndex = startIndex + gap;
     
        if (endIndex > size) endIndex = size;
     
        int i = startIndex;
     
        for (; i < endIndex; i++)
        {
            if (distanceVector[lMin[k]] <= 0 && usedSummitVector[i] == 0
                || usedSummitVector[lMin[k]])
                lMin[k] = i;
            else
            {
                if (distanceVector[i] > 0
                    && distanceVector[i] < distanceVector[lMin[k]]
                    && usedSummitVector[i] == 0)
                {
                    lMin[k] = i;
                }
            }
        }
     
        barrier(CLK_GLOBAL_MEM_FENCE);
     
        if (k == 0)
        {
            *min = 0;
     
            int i = 0;
            for (; i < gap; i++)
            {
                if (distanceVector[*min] <= 0 && usedSummitVector[lMin[i]] == 0
                    || usedSummitVector[*min])
                    *min = lMin[i];
                else
                {
                    if (distanceVector[lMin[i]] > 0
                        && distanceVector[lMin[i]] < distanceVector[*min]
                        && usedSummitVector[lMin[i]] == 0)
                    {
                        *min = lMin[i];
                    }
                }
            }
        }
     
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    With this it's working but each time I run the kernel I need to either call clEnqueueReadBuffer or clFinish and that's taking like 10ms just for that and I don't understand why.
    If anybody could tell give me a hint on where to search for this problem that's would be very helpfull.
    Thank you in advance,
    Benjamin.

  2. #2
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Information concerning memory access

    It looks like you are trying to get work-item 0 to sum across the results of all work-items. The barrier you have in your code will only work across work-items in the *same* work-group, so unless you only have one work-group this code is invalid for OpenCL. (You're only allowed to synchronize between work-items in a work-group, because there is no guarantee on the scheduler.) If you need to do a reduction (which is what this looks like) and you can't fit into one work-group, you have to do it as multiple kernel executions.

    I'm not sure why you need a clFinish, though.

  3. #3
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: Information concerning memory access

    Thank you for the reply,

    I tought of that so I put a somme like that:
    Code :
    err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1, NULL,
                                         &global, &global, 0, NULL, NULL);

    And if I remember well that still didn't work. My point is I can't let my program waste 10ms into memory sync in each loop because it growth with the scale so my sequential program is always better than my multi-core version. I really don't understand why putting the command up there the barrier function didn't worked at all.

    Thank in advance for any help or hint.
    Benjamin.

  4. #4
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Information concerning memory access

    If you set your global and local sizes to be the same you will only execute one work-group on the device and will only use a small fraction (1/4 to 1/16th on a discrete GPU) of the processors. That will dramatically hurt your performance, besides which you'll be limited to a maximum of 512 for your global size.

    I'm afraid I'm not really following what you are saying. Why do you need to read the data back at all? (I'm assuming you mean a clEnqueueRead when you say "read back".) You should be able to just enqueue the kernel executions and let each one run right after each other. Or perhaps I'm not understanding the algorithm correctly.

  5. #5
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: Information concerning memory access

    Thank for the info about the workgroup size so how can I choose manually a work group size or how can I sync work group in my function ?
    My problem, is I think, linked directly with the sync of the work groups what I do now to have the good result is to do a CLFinish or a CLEnqueueReadBuffer on the min part of my alogorithm to pass it to the update kernel. But, now, this operation takes like 3-5ms sec and this time is growing in size with my data size. What I want to know is to insure that the data of the min is good so I can directly pass it to the update kernel without the result with the CLEnqueueReadBuffer function.

    I hope this is a little clearer and thank you,
    Benjamin.

  6. #6
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Information concerning memory access

    Benjamin, I'm still not quite following you.

    It seems like you are trying to find the minimum value across a large amount of data. This is effectively a reduction operation which requires synchronization between elements. If that's the case, then you will need to either fit the whole reduction in one work-group (bad for performance) or split it across multiple kernel executions such that each work-group's results are guaranteed to be done between them. That would look like this:

    enqueueKernel(first_reduction_step) -> generates partial min results
    enqueueKernel(second_reduction_step) -> combines partial min results
    enqueueKernel(third_reduction_step) -> combines partial min results further
    etc.

    You do not need to finish or read back the data between them (as the kernel will finish before the next one starts on current GPUs or in-order-queues).

    For more info on this, try doing a google search for "reduction synchronization opencl".

  7. #7
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: Information concerning memory access

    Ok thank you so much for the hint I'm going to try it.
    And sorry for not being real clear that's hard to explain exactly what I mean whitout showing all my code.
    So thank you again.

  8. #8
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: Information concerning memory access

    Hello again, I've tried to implement the kernel in a different way and run it in a different way too.
    It's really close to the result but for some reason I can see, some values of the partial min are overriden and I don't know why.
    Would you mind poiting out to me where is my mistake ?
    Here is the kernel:
    Code :
    __kernel void find_min(__global int* distanceVector,
        __global int* usedSummitVector, const int size,
        __global int* oldMin, __global int* lMin, const int firstPass,
        const int lmax)
    {
        int k = get_global_id(0);
     
        int i = k * 2;
     
        if (firstPass == 1) {
            lMin[k] = i;
     
            if (i + 1 < size)
            {
                if (distanceVector[lMin[k]] <= 0 || usedSummitVector[lMin[k]] != 0)
                    lMin[k] = i + 1;
                else
                {
                    if (distanceVector[i + 1] > 0 
                        && distanceVector[i + 1] < distanceVector[lMin[k]]
                        && usedSummitVector[i + 1] == 0)
                    {
                        lMin[k] = i + 1;
                    }
                }
            }
        }
        else
        {
            lMin[k] = oldMin[k];
     
            if (i + 1 < lmax)
            {
                if (distanceVector[lMin[k]] <= 0 || usedSummitVector[lMin[k]] != 0)
                    lMin[k] = oldMin[k + 1];
                else
                {
                    if (distanceVector[oldMin[k + 1]] > 0
                        && distanceVector[oldMin[k + 1]] < distanceVector[lMin[k]]
                        && usedSummitVector[oldMin[k + 1]] == 0)
                    {
                        lMin[k] = oldMin[k + 1];
                    }
                }
            }
        }
    }
    And here is how I use it:
    Code :
    int firstPass = 1;
     
            int lsize = size / 2;
            int lmax = lsize;
            if (size % 2 != 0)
                lsize++;
     
            while (lsize > 2)
            {
                size_t global = lsize;
     
                // Set the arguments to our compute kernel
                err = 0;
                err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                                     &this->distanceVectorMem);
                err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                                      &this->usedSummitVectorMem);
                err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                                      &size);
                err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                                      &this->pFindMinMem);
                err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                                      &this->findMinMem);
                err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                                      &firstPass);
                err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                                      &lmax);
     
                if (err != CL_SUCCESS)
                {
                    printf("Error: Failed to set kernel arguments\n");
     
                    exit(EXIT_FAILURE);
                }
     
                // Execute the kernel over the entire range of the data set
                err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                             NULL,
                                             &global, NULL, 0, NULL, NULL);
     
                if (err != CL_SUCCESS)
                {
                    printf("Error: Failed to execute the kernel\n");
     
                    exit(EXIT_FAILURE);
                }
     
                this->pFindMinMem = this->findMinMem;
     
                lmax = lsize;
     
                int oldSize = lsize;
                lsize = oldSize / 2;
                if (oldSize % 2 != 0)
                    lsize++;
     
                if (lsize % 2 != 0)
                    lmax--;
     
                if (firstPass == 1)
                    firstPass = 0;
            }
     
            err = 0;
            err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                                 &this->distanceVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                                  &this->usedSummitVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                                  &size);
            err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                                  &this->pFindMinMem);
            err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                                  &this->findMinMem);
            err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                                  &firstPass);
            err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                                  &lmax);
     
            size_t global = 1;
            err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                         NULL,
                                         &global, NULL, 0, NULL, NULL);
     
            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to execute the kernel\n");
     
                exit(EXIT_FAILURE);
            }
     
            clFinish(this->queue);
    I'm sure there is a problem somewhere something I don't see because I still need to use a clFinish for the result to work. I use the result of the kernel directly like this:
    Code :
    err |= clSetKernelArg(this->updateDistancesKernel, 3, sizeof(cl_mem),
                                  &this->findMinMem);
    In the next kernel.
    I've searched over internet and found out an exemple of reduction but I've not managed to implement it the way I need it for my min. Because my min kernel does not work on values but on indices of values.

    Thank you for the help in advance,
    Benjamin.

  9. #9
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: Information concerning memory access

    It sounds like you've got a problem with a data race between work-items. I don't fully understand your code, so I would suggest drawing out a picture and making sure there is no way two work-items can be writing to the same location, or one reading the results another has written.

  10. #10
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: Information concerning memory access

    Hello again,

    I managed to modify my code so that now I'm sure that anything override anything else. But still I have problem with the memory object and I really need an explaination here. I think I'm completly mistaken in how to make a good use of the cl_mem objects.

    Here is the code of the creation of my mem objects:
    Code :
    this->distanceVectorMem = clCreateBuffer(context,
                CL_MEM_READ_WRITE, sizeof(int) * summitCount,
                NULL, NULL);
     
            if (!this->distanceVectorMem)
            {
                cout << "Error: Failed to create distanceVectorMem" << endl;
     
                exit(EXIT_FAILURE);
            }
     
            this->usedSummitVectorMem = clCreateBuffer(context,
                CL_MEM_READ_WRITE, sizeof(int) * summitCount,
                NULL, NULL);
     
            if (!this->usedSummitVectorMem)
            {
                cout << "Error: Failed to create usedSummitVectorMem" << endl;
     
                exit(EXIT_FAILURE);
            }

    Now I run it this way:
    Code :
    // Set the arguments to our compute kernel
            err = 0;
            err = clSetKernelArg(this->initKernel, 0, sizeof(int), &summitIndex);
            err |= clSetKernelArg(this->initKernel, 1, sizeof(cl_mem),
                                  &this->distanceVectorMem);
            err |= clSetKernelArg(this->initKernel, 2, sizeof(cl_mem),
                                  &this->usedSummitVectorMem);
            err |= clSetKernelArg(this->initKernel, 3, sizeof(int),
                                  &size);
            err |= clSetKernelArg(this->initKernel, 4, sizeof(int),
                                  &startIndex);
            err |= clSetKernelArg(this->initKernel, 5, sizeof(int),
                                  &count);
            err |= clSetKernelArg(this->initKernel, 6, sizeof(cl_mem),
                                  &this->adjacencyListMem);
            err |= clSetKernelArg(this->initKernel, 7, sizeof(cl_mem),
                                  &this->adjacencyListDistancesMem);
     
            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to set kernel arguments\n");
     
                exit(EXIT_FAILURE);
            }
     
            size_t global = size;
            if (size % 2 != 0) global++;
     
            // Execute the kernel over the entire range of the data set
            err = clEnqueueNDRangeKernel(this->queue, this->initKernel, 1, NULL,
                                         &global, NULL, 0, NULL, NULL);
     
            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to execute the kernel\n");
     
                exit(EXIT_FAILURE);
            }

    With this kernel code:
    Code :
    __kernel void init_int(const int summitIndex, __global int* distanceVector,
        __global int* usedSummitVector,
        const int size, const int startIndex, const int count,
        __global const int* adjacencyList,
        __global const int* adjacencyListDistances)
    {
        int i = get_global_id(0);
     
        printf("entering in init (%d) \n", i);
     
        if (i < size)
        {
            distanceVector[i] = -1;
            usedSummitVector[i] = 0;
        }
     
        distanceVector[summitIndex] = 0;
        usedSummitVector[summitIndex] = 1;
     
        if (i == 0)
        {
            for (int k = startIndex + 1; k < startIndex + 1 + count; k++)
            {
                distanceVector[adjacencyList[k] - 1] = adjacencyListDistances[k];
                printf("distanceVector[%d] = %d\n", adjacencyList[k] - 1, distanceVector[adjacencyList[k] - 1]);
            }
        }
     
        printf("exiting in init (%d) \n", i);
    }

    The trace of printf result is actually good something like that :
    entering in init (0)
    distanceVector[1] = 10
    distanceVector[3] = 4
    distanceVector[5] = 10
    exiting in init (0)

    After that I directly use distanceVector and usedSummitVector in my find min function:
    Code :
    err = 0;
                err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                                     &this->distanceVectorMem);
                err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                                      &this->usedSummitVectorMem);
                err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                                      &size);
                err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                                      &this->pFindMinMem);
                err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                                      &this->findMinMem);
                err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                                      &firstPass);
                err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                                      &lmax);
     
                if (err != CL_SUCCESS)
                {
                    printf("Error: Failed to set kernel arguments\n");
     
                    exit(EXIT_FAILURE);
                }
     
                // Execute the kernel over the entire range of the data set
                err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                             NULL,
                                             &global, NULL, 0, NULL, NULL);
     
                if (err != CL_SUCCESS)
                {
                    printf("Error: Failed to execute the kernel\n");
     
                    exit(EXIT_FAILURE);
                }
    without doing anything between the kernels executions. Here is the find min kernel:
    Code :
    __kernel void find_min(__global int* distanceVector,
        __global int* usedSummitVector, const int size,
        __global int* oldMin, __global int* lMin, const int firstPass,
        const int lmax)
    {
        int k = get_global_id(0);
     
        printf("entering in find min (%d) \n", k);
     
        int i = k * 2;
     
        if (firstPass == 1) {
            int index = i;
     
            printf("select %d as index in %d with size %d\n", i, k, size);
     
            if (index + 1 < size)
            {
                printf("select %d as index + 1 in %d with size %d\n", i + 1, k, size);
     
                printf("distanceVector[%d] = %d\n", index, distanceVector[index]);
                printf("usedSummitVector[%d] = %d\n", index + 1, usedSummitVector[index + 1]);
                printf("usedSummitVector[%d] = %d\n", index, usedSummitVector[index]);
     
                if (distanceVector[index] <= 0 && usedSummitVector[index + 1] == 0
                    || usedSummitVector[index] != 0)
                {
                    index = i + 1;
                }
                else
                {
                    printf("distanceVector[%d] = %d\n", index + 1, distanceVector[index + 1]);
                    printf("distanceVector[%d] = %d\n", index, distanceVector[index]);
                    printf("usedSummitVector[%d] = %d\n", index + 1, usedSummitVector[index + 1]);
     
                    if (distanceVector[index + 1] > 0
                        && distanceVector[index + 1] < distanceVector[index]
                        && usedSummitVector[index + 1] == 0)
                    {
                        index = i + 1;
                    }
                }
            }
     
            lMin[k] = index;
        }
        else
        {
     
            int index = oldMin[i];
     
            printf("distanceVector[%d] = %d\n", index, distanceVector[index]);
            printf("usedSummitVector[%d] = %d\n", index, usedSummitVector[index]);
     
            if (i + 1 < lmax)
            {
                int index2 = oldMin[i + 1];
     
                printf("distanceVector[%d] = %d\n", index2, distanceVector[index2]);
                printf("usedSummitVector[%d] = %d\n", index2, usedSummitVector[index2]);
     
                if (distanceVector[index] <= 0
                    && usedSummitVector[index2] == 0
                    || usedSummitVector[index] != 0)
                    index = index2;
                else
                {
                    if (distanceVector[index2] > 0
                        && distanceVector[index2] < distanceVector[index]
                        && usedSummitVector[index2] == 0)
                    {
                        index = index2;
                    }
                }
            }
     
            lMin[k] = index;
        }
     
        printf("found %d as min in %d first pass ? %d\n", lMin[k], k, firstPass);
     
        printf("exiting find min (%d) \n", k);
    }
    And the printf trace tells me here:
    distanceVector[1] = -1

    Why on the earth the memory isn't committed between the two kernels execution ? What am I missing to do or what am I doing wrong here ?

    Thank you so much in advance for the answer,
    Benjamin.

Page 1 of 2 12 LastLast

Similar Threads

  1. Memory access pattern
    By nachovall in forum OpenCL
    Replies: 2
    Last Post: 10-13-2011, 04:20 PM
  2. Memory Access
    By lcnepo in forum OpenCL
    Replies: 1
    Last Post: 12-09-2010, 04:22 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •