
Thread: Affine transform, floatnxm

  1. #1
    Junior Member
    Join Date
    Jun 2010
    Posts
    6

    Affine transform, floatnxm

    Hi,
    I want to apply a transform to a bunch of points. I see OpenCL has floatnxm, but I can't find any mention of a function that takes this data type as an argument. Furthermore, I'm using the ATI Stream SDK, and declaring float4x4 myMatrix; gives an "identifier undefined" error. I don't know if I'm using it wrong or if ATI doesn't support it, even though I don't see this type listed as optional.

    So are there any built-in ways to do an affine transform? If I have to write my own, what's a good way to load the matrix into local memory for all threads? I.e., maybe there's a way to load the matrix once per work-group, rather than each thread having to parse the float* argument into a data structure before doing the transform.

    Thanks.

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Affine transform, floatnxm

    I see OpenCL has floatnxm
    No, it doesn't. Those are reserved keywords; they are not actually used for anything yet.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  3. #3
    Junior Member
    Join Date
    Jun 2010
    Posts
    6

    Re: Affine transform, floatnxm

    I see. What about my other question? If my global work size = numPoints and I pass the matrix as __global float *, then each thread will be creating its own copy of the matrix and reading it from global memory. There will also be memory bank contention.

    So is this the wrong way to partition the problem, or the wrong way to pass the argument? What's a good way?

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Affine transform, floatnxm

    If my global work size = numPoints and I pass the matrix as __global float * then each thread will be creating a copy of the matrix and reading from global memory.
    I don't understand why you would do that. Instead, you should write your functions so that they accept a "__global float* matrix" argument rather than requiring the data to be packed into a struct. The only difference [1] between a "float* m" and a "float m[4][4]" is that the latter has nicer syntax for accessing the matrix elements. If all you have is a "float* m", you will have to index the elements by hand, i.e. "m[4*row + column]" instead of "m[row][column]". You could even create a macro to make the code more readable if you want.

    Code :
    #define IDX(row,column) (4*(row) + (column))

    You may also want to take advantage of the lower latency of __constant kernel arguments.
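
    For example, the kernel could look something like this (just a sketch; the kernel name transform_points and its arguments are made up for illustration):

    Code :
    #define IDX(row,column) (4*(row) + (column))

    /* Hypothetical kernel: applies one 4x4 affine matrix to an array of points. */
    __kernel void transform_points(__constant float* matrix,
                                   __global float4* points)
    {
        size_t gid = get_global_id(0);
        float4 p = points[gid];
        float4 r;
        r.x = matrix[IDX(0,0)]*p.x + matrix[IDX(0,1)]*p.y + matrix[IDX(0,2)]*p.z + matrix[IDX(0,3)];
        r.y = matrix[IDX(1,0)]*p.x + matrix[IDX(1,1)]*p.y + matrix[IDX(1,2)]*p.z + matrix[IDX(1,3)];
        r.z = matrix[IDX(2,0)]*p.x + matrix[IDX(2,1)]*p.y + matrix[IDX(2,2)]*p.z + matrix[IDX(2,3)];
        r.w = 1.0f; /* assumes the bottom row of the matrix is (0, 0, 0, 1) */
        points[gid] = r;
    }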


    [1] Again, don't shoot me

  5. #5
    Junior Member
    Join Date
    Jun 2010
    Posts
    6

    Re: Affine transform, floatnxm

    Well, what I mean is that my multiplication function is something like this:
    Code :
    float4 multiply(float4 point, __global float * matrix)
    {
        float4 result;
        result.x = matrix[0] * point.x + matrix[1] * point.y + matrix[2]  * point.z + matrix[3];
        result.y = matrix[4] * point.x + matrix[5] * point.y + matrix[6]  * point.z + matrix[7];
        result.z = matrix[8] * point.x + matrix[9] * point.y + matrix[10] * point.z + matrix[11];
        result.w = 1.0f; /* affine: the bottom matrix row is assumed to be (0, 0, 0, 1) */
        return result;
    }

    From what I understand, if I have 32 threads calling multiply(), that's 32 threads reading the same 16 values from global memory. Maybe the reads aren't even cached in local memory. I was just wondering whether there's a way to avoid doing that many reads at all.

    The NVIDIA tutorial on matrix multiplication uses local memory to reduce the number of reads from __global memory: each thread copies a portion of the matrix from global to local. However, they partition their problem so that the number of threads launched equals the number of blocks required. In my case, I don't know whether a copy from global to local would help, because I would still have 32 threads copying the same 16 values into the local address space.

    Anyway, I thought there might be a magic bullet: some way to specify that, for this group of work items, I'm creating a piece of local read-only memory and copying data from the global to the local address space, with that copy applying to all threads in the group. I will use __constant to get faster reads, like you suggested.

  6. #6
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Affine transform, floatnxm

    Ah, I see. I think you are doing the right thing. You can assume that all devices will have some sort of cache for global memory, and even if it's very small, sixteen floats will not be a problem.

    Maybe the reads aren't even cached to local memory
    Correct. Unless you explicitly copy the data to local memory, I don't think it's reasonable to expect the CL implementation to do it "automagically".

    Just some way to specify that for this group of work items I'm creating a piece of local read only memory and copying data from global to local address space and that applies to all threads in the group.
    You could use the work-group to cooperatively move the matrix into a __local variable, but declaring the matrix as __constant is not just going to be faster; it's also easier to code and to understand.
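
    For reference, the cooperative copy would look roughly like this (a sketch only, not tested; the kernel name and arguments are hypothetical, and it assumes a 1-D work-group):

    Code :
    __kernel void transform_points(__global const float* matrix,
                                   __global float4* points)
    {
        __local float lmat[16];

        /* Each work-item copies a share of the 16 matrix elements. */
        for (size_t i = get_local_id(0); i < 16; i += get_local_size(0))
            lmat[i] = matrix[i];

        /* Wait until the whole work-group sees the copied matrix. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* ... transform points[get_global_id(0)] using lmat ... */
    }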
