BCM2835 "GPU_FFT" release 2.0 BETA by Andrew Holme, 2014.

GPU_FFT is an FFT library for the Raspberry Pi which exploits the BCM2835 SoC
3D hardware to deliver roughly ten times the data throughput possible on the
700 MHz ARM.  Kernels are provided for all power-of-2 FFT lengths between 256
and 1,048,576 points inclusive.  A transpose function, which also uses the 3D
hardware, is provided to support 2-dimensional transforms.


*** Accuracy ***

GPU_FFT uses single-precision floats for data and twiddle factors.  The output
is not scaled.  The relative root-mean-square (rms) error in parts-per-million
(ppm) for different transform lengths (N) is typically:

log2(N) |   8  |   9  |  10  |  11  |  12 |  13 |  14 |  15 | 16 | 17
ppm rms | 0.27 | 0.42 | 0.50 | 0.70 | 2.3 | 4.4 | 7.6 | 9.2 | 18 | 70

log2(N) |  18 |  19 |  20
ppm rms | 100 | 180 | 360

Measured with a batch of 10 for log2(N) = 8...17 and a batch of 1 for 18...20.


*** Throughput ***

GPU_FFT 1.0 had to be invoked through a "mailbox" which added a 100us overhead
on every call.  To mitigate this, batches of transforms could be submitted via
a single call.  GPU_FFT 2.0 avoids this 100us overhead by poking GPU registers
directly from the ARM if total batch runtime will be short; but still uses the
mailbox for longer jobs to avoid busy waiting at 100% CPU for too long.
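
The trade-off described above can be put into numbers with a small sketch.
The 100us figure is taken from the text; choose_dispatch() and its
threshold_us parameter are hypothetical and only illustrate the idea of
switching on expected runtime.  The cut-off GPU_FFT actually uses is not
stated here.

```c
#include <assert.h>

/* Fixed mailbox cost per call, as quoted above for GPU_FFT 1.0. */
#define MAILBOX_OVERHEAD_US 100.0

/* Mailbox overhead attributable to each transform when `jobs`
   transforms share a single call. */
double overhead_per_transform_us(int jobs)
{
    assert(jobs >= 1);
    return MAILBOX_OVERHEAD_US / jobs;
}

enum dispatch { DISPATCH_DIRECT, DISPATCH_MAILBOX };

/* Hypothetical policy: poke registers and busy-wait for short batches,
   use the mailbox for long ones.  threshold_us is illustrative only. */
enum dispatch choose_dispatch(double est_runtime_us, double threshold_us)
{
    return est_runtime_us <= threshold_us ? DISPATCH_DIRECT
                                          : DISPATCH_MAILBOX;
}
```

With a batch of 10 the mailbox overhead falls from 100us to 10us per
transform, which is why batching helped in release 1.0.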

Typical per-transform runtimes for batch sizes of 1 and 10; and comparative
figures for FFTW (FFTW_MEASURE mode) are:

 log2(N) |   8   |   9   |  10   |  11   |  12  |  13  |  14  |  15  |
batch  1 | 0.036 | 0.051 | 0.070 | 0.11  | 0.24 | 0.58 |  1.2 |  3.3 |
batch 10 | 0.016 | 0.027 | 0.045 | 0.095 | 0.25 | 0.61 |  1.2 |  3.2 |
    FFTW | 0.092 | 0.22  | 0.48  | 0.95  | 3.0  | 5.1  | 12   | 31   |

 log2(N) |  16  |  17 |  18 |  19 |   20 |
batch  1 |  6.8 |  16 |  42 |  95 |  190 |
    FFTW | 83   | 180 | 560 | 670 | 1600 |

All times are in milliseconds, to 2 significant figures.
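
To compare entries in the tables above, the ratio of the FFTW time to the
GPU_FFT time gives the speed-up, and the reciprocal of a runtime gives the
transform rate.  The helpers below are a hypothetical convenience, not part
of the library.

```c
#include <assert.h>

/* How many times faster the GPU transform is than the FFTW one,
   given both per-transform runtimes in milliseconds. */
double speedup(double fftw_ms, double gpu_ms)
{
    assert(gpu_ms > 0.0);
    return fftw_ms / gpu_ms;
}

/* Transforms completed per second for a per-transform runtime in ms. */
double transforms_per_second(double ms)
{
    assert(ms > 0.0);
    return 1000.0 / ms;
}
```

For example, at log2(N)=12 with a batch of 1, speedup(3.0, 0.24) is 12.5; at
log2(N)=8 with a batch of 1, speedup(0.092, 0.036) is about 2.6.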


*** API functions ***

    gpu_fft_prepare()       Call once to allocate memory and initialise data
                            structures.  Returns 0 for success.

    gpu_fft_execute()       Call one or more times to execute a previously
                            prepared FFT batch.  Returns 0 for success.

    gpu_fft_release()       Call once to release resources after use.
                            GPU memory is permanently lost if not freed.


*** Parameters ***

    int mb          Mailbox file descriptor obtained by calling mbox_open()

    int log2_N      log2(FFT length) = 8 to 20

    int direction   FFT direction:  GPU_FFT_FWD for forward FFT
                                    GPU_FFT_REV for inverse FFT

    int jobs        Number of transforms in batch = 1 or more

    GPU_FFT **      Output parameter from prepare: control structure.
    GPU_FFT *       Input parameter to execute and release
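
The documented ranges can be checked before calling into the library.  The
following is a minimal sketch; params_in_range(), fft_length() and the
LOG2_N_* macros are hypothetical names introduced here, not GPU_FFT
identifiers.

```c
#include <assert.h>
#include <stddef.h>

#define LOG2_N_MIN 8    /* smallest documented log2(FFT length) */
#define LOG2_N_MAX 20   /* largest documented log2(FFT length)  */

/* Returns 1 if log2_N and jobs fall within the documented ranges. */
int params_in_range(int log2_N, int jobs)
{
    return log2_N >= LOG2_N_MIN && log2_N <= LOG2_N_MAX && jobs >= 1;
}

/* FFT length in points for a given log2_N. */
size_t fft_length(int log2_N)
{
    assert(log2_N >= LOG2_N_MIN && log2_N <= LOG2_N_MAX);
    return (size_t)1 << log2_N;
}
```

fft_length(8) is 256 and fft_length(20) is 1,048,576, matching the range of
kernels provided.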


*** Data format ***

Complex data arrays are stored as alternate real and imaginary parts:

    struct GPU_FFT_COMPLEX {
        float re, im;
    };

The GPU_FFT struct created by gpu_fft_prepare() contains pointers to the input
and output arrays:

    struct GPU_FFT {
       struct GPU_FFT_COMPLEX *in, *out;
       ....                               // other members omitted
    };

When executing a batch of transforms, buffer pointers are obtained as follows:

    struct GPU_FFT *fft;
    int ret = gpu_fft_prepare( ... , jobs, &fft); // returns 0 for success
    for (int j=0; j<jobs; j++) {
       struct GPU_FFT_COMPLEX *in  = fft->in  + j*fft->step;
       struct GPU_FFT_COMPLEX *out = fft->out + j*fft->step;
       ....
    }

GPU_FFT.step is greater than FFT length because a guard space is left between
buffers for caching and alignment reasons.
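
The addressing rule above can be tried on ordinary memory without the GPU.
In this sketch struct cpx stands in for GPU_FFT_COMPLEX, and job_buffer() and
load_job() are hypothetical helpers; step is measured in complex elements, as
in the loop above.

```c
#include <assert.h>
#include <stddef.h>

struct cpx { float re, im; };   /* same layout as GPU_FFT_COMPLEX */

/* Start of the buffer for transform j, given the base pointer and the
   stride between consecutive buffers (in complex elements). */
struct cpx *job_buffer(struct cpx *base, int step, int j)
{
    assert(step > 0 && j >= 0);
    return base + (size_t)j * (size_t)step;
}

/* Copy n complex samples into the buffer of transform j.  The guard
   space between buffers (step - n elements) is left untouched. */
void load_job(struct cpx *base, int step, int j,
              const struct cpx *src, int n)
{
    assert(n >= 0 && n <= step);
    struct cpx *dst = job_buffer(base, step, j);
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}
```

With the real library, base would be fft->in and step would be fft->step.
Indexing by the FFT length instead of step would address the wrong samples
for every transform after the first.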

GPU_FFT performs multiple passes between ping-pong buffers.  The final output
lands in the same buffer as input after an even number of passes.  Transforms
where log2_N=12...16 use an odd number of passes and the final result is left
out-of-place.  The input data is never preserved.
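
The parity rule can be illustrated with a toy model of two buffers swapping
roles on every pass.  This shows only the bookkeeping; it does not reproduce
the library's kernels, and both function names are hypothetical.

```c
#include <assert.h>

/* Index (0 or 1) of the buffer holding the result after `passes`
   passes, where buffer 0 holds the input and every pass reads one
   buffer and writes the other. */
int result_buffer(int passes)
{
    int src = 0, dst = 1;
    assert(passes >= 0);
    for (int p = 0; p < passes; p++) {
        /* a real pass would transform data from src into dst here */
        int tmp = src;          /* swap roles for the next pass */
        src = dst;
        dst = tmp;
    }
    /* src now names the buffer written last (or the input buffer
       if no pass was run) */
    return src;
}

/* Per the text above: lengths with log2_N = 12...16 finish
   out-of-place. */
int finishes_out_of_place(int log2_N)
{
    return log2_N >= 12 && log2_N <= 16;
}
```

An even pass count returns 0 (the input buffer) and an odd count returns 1.
In both cases the results are read through the out pointer of the GPU_FFT
struct, as in the loop shown earlier.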


*** Example program ***

The code that produced the above accuracy and performance figures is included
as a demo with the latest Raspbian distro.  Build and run it as follows:

cd /opt/vc/src/hello_pi/hello_fft
make
sudo mknod char_dev c 100 0
sudo ./hello_fft.bin 12

It accepts three optional command-line arguments: <log2_N> <batch> <loops>

The special character device is required for the ioctl mailbox through which
the ARM communicates with the VideoCore GPU.


*** With OpenGL ***

GPU_FFT and OpenGL will run concurrently if the GPU_FFT_MEM_* defines in
file gpu_fft.c are changed as follows:

#define GPU_FFT_MEM_FLG 0x4        // cached=0xC; direct=0x4
#define GPU_FFT_MEM_MAP 0x20000000 // cached=0x0; direct=0x20000000

Overall performance will probably be higher if GPU_FFT and OpenGL take turns
at using the 3D hardware.  Since eglSwapBuffers() returns immediately without
waiting for rendering, call glFlush() and glFinish() afterwards as follows:

    for (;;) {
        ....
        eglSwapBuffers(....); // non-blocking call returns immediately
        glFlush();
        glFinish(); // wait until V3D hardware is idle
        ....
        gpu_fft_execute(....); // blocking call
        ....
    }


*** 2-dimensional FFT ***

Please study the hello_fft_2d demo source, which is built and executed thus:

make hello_fft_2d.bin
sudo ./hello_fft_2d.bin

This generates a Windows BMP file: "hello_fft_2d.bmp"

The demo uses a square 512x512 array; however, rectangular arrays are allowed.
The following lines in gpu_fft_trans.c limit the transposed region to what
fits in both the source and the destination:

    ptr.arm.uptr[6] = src->x < dst->y? src->x : dst->y;
    ptr.arm.uptr[7] = src->y < dst->x? src->y : dst->x;
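
The same clamping can be written as a plain CPU transpose, which may be
useful as a reference when checking results.  This is a sketch under
assumptions: struct grid and its interpretation of x as columns, y as rows
and step as the row stride are hypothetical, and the code does not reproduce
what the GPU kernel does internally.

```c
#include <assert.h>
#include <stddef.h>

struct cpx { float re, im; };   /* same layout as GPU_FFT_COMPLEX */

/* Hypothetical description of a 2-D array: x columns, y rows, and
   `step` elements between the starts of consecutive rows. */
struct grid {
    struct cpx *data;
    int x, y, step;
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* CPU reference transpose: dst[c][r] = src[r][c], limited to the
   region that exists in both arrays. */
void transpose_clamped(const struct grid *src, const struct grid *dst)
{
    assert(src && dst && src->data && dst->data);
    int cols = min_int(src->x, dst->y);   /* source columns -> dest rows */
    int rows = min_int(src->y, dst->x);   /* source rows -> dest columns */
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            dst->data[(size_t)c * dst->step + r] =
                src->data[(size_t)r * src->step + c];
}
```

Transposing a source of 3 columns by 2 rows into a 2x2 destination copies
only the first two source columns; the third is ignored rather than written
out of bounds.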

One may transpose the output from the second FFT pass back into the first pass
input buffer, by preparing and executing a second transposition; however, this
is probably unnecessary.  It depends on how the final output will be accessed.
