Results 1 to 3 of 3

Thread: Enqueue/Finish Scheme for FFT

  1. #1
    Junior Member
    Join Date
    Feb 2013
    Posts
    10

    Enqueue/Finish Scheme for FFT

    This is likely the first part of 2 posts related to some trouble I have involving an FFT signal cross correlation module I'm creating, (makes use of circular convolution theorem, etc etc). I'd like to just confirm that the following scheme will ensure that particular recursion levels of FFT butterfly computation are finished before the next level starts and that the buffers containing the data are fully written to/done with. So the circular correlation/convolution involves an FFT, a vector-wise inner product and then an IFFT.

    Because of this scheme, I have no kernels which order the data in bit-reversed index order. The forward FFT kernel produces a bit-reversed-order FFT and, after the inner products, the IFFT just uses this result to compute a natural order solution.

    I should mention I have multiple GPU's.

    Anyways, here is a pseudo-code representation of what is going on for every FFT/IFFT, (access/operation algorithms are equivalent, other than conjugated twiddle factors, the normalization kernel comes later:

    Code :
     
    for numDevices:
        data -> buffers
        buffers -> kernel arguments
     
        for fftRecursionLevels:
            for numDevices:
                recursionLevel -> kernel arguments
                deviceCommandQueue -> enqueueNDRangeKernel
                deviceCommandQueue ->  flush()
     
            for numDevices:
                 deviceCommandQueue -> finish()

    Can I get away with that? As far as I understand finish() is a blocking function and that very last for loop would not complete until every kernel is finished computing across its global range, (here fftSize / 2, see any literature on Radix-2 butterfly operations), and, for bonus points, some of the kernels are executing already due to flush() while I'm enqueueing the remaining kernels.

    Overall, I'm geting some funky/garbage results using openCL/c++ for this particular software. I have implemented the full data pipeline in python, (the algorithm is, "topologically equivalent" if you will, obviously there is no host<-->device buffer/instructions w/ the python), and simulated how the kernel should run and it produces IDENTICAL results to when I use scipy.fftpack modules and just operate on the signal data vectors.

    Thanks in advance.

  2. #2
    Junior Member
    Join Date
    Feb 2013
    Posts
    10
    As a follow up, I'm also wondering if maybe the problem is the number of intra-kernel operations I've got going on is too many for just using global memory and registers, (is that even a thing? a variable register limit? ). As far as I can tell, I do not make any use of local memory, (no __local address space usage).

    Sorry if this sounds vague, I'm loath to post code at the moment and really get into it, I want to see if there is any obvious stuff I'm missing first. Please let me know if you'd like me to post something specific.

  3. #3
    Junior Member
    Join Date
    Feb 2013
    Posts
    10
    I guess some pictures would help. Here is exactly what is happening in both programs.

    1) generate gaussian vector
    2) zero pad gaussian vector to next next highest power of 2 length
    3) forward FFT, resulting in a natural order (in w ) result
    4) plot

    Here is the python simulation of my kernel, compared to just using scipy.fftpack.fft( vector ) :

    http://i.imgur.com/pGcYTrL.png

    They are the same. Now compare that to either of:

    http://i.imgur.com/pbiYGpR.png


    They are all the same starting data. And they should all look like the green/blue line in image one, but they do not. My eyes have glazed over from how long I've been staring at the host/device code for that second program and I do not see any typos or incorrect algorithms. I highly suspect that something is happening device side that I am not aware of, hence my posting here. Clearly the algorithm LOOKS like its operating correctly, (the general shape of the red/red is approximately that of blue/green no matter the starting data. I've ran the algorithm on different starting sets and it consistently looks like blue/green but with that nonsensical noise/error), but something is amiss.

    I greatly appreciate/welcome any assistance.
    Last edited by steveStevens; 06-16-2013 at 01:27 PM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •