SkylondaWorks FFT Scaling

A Scrapbook of Digital Signal Processing

FFT Scaling

FFT scaling covered here falls into two categories:

Assignment of magnitudes in the frequency domain.
Scaling required to implement fixed-point transforms.

ASSIGNMENT OF MAGNITUDE IN THE FREQUENCY DOMAIN

The usual assymetrically-scaled Discrete Fourier Transform scaling for a transform of size N looks as follows:

     Forward Transform (Time to Frequency)            1/N
     Reverse Transform (Frequency to Time)            1

A pure sine wave of maximum 8-bit binary amplitude 01111111/10000000 (+127/-128)) in the time domain will yield an 8-bit binary Fourier coefficient of 00000000 for the real part and 01111111 for the imaginary part. If this Fourier coefficient value is treated as a binary fraction equal to decimal .999.., (not as a decimal integer equal to 127) then the log of this value will always be 0 dB, and all other possible values will be in -dB (less than unity). This will cause the coherent SIGNAL value to remain constant, independent of FFT size, whereas the non-coherent NOISE floor will decrease with increasing FFT size (see Coherent vs. Non-Coherent Integration and Dynamic Range and Real-Time Log Power Computation). This has the effect of making the value of 0 dB a consistent marker for the maximum possible value, independent of integration gain.

SCALING REQUIRED TO IMPLEMENT FIXED-POINT TRANSFORMS

In applications where high-speed floating-point hardware is not available, or where it is desired to maximize dynamic range by devoting all of the available bits to the mantissa (and none to the exponent), it is possible to use scaled fixed-point transforms. This is normally done by placing the data in the uppermost bits of the fixed-point mantissa and then using arithmetic down-shifting (dividing by 2) at the end of each transform step (an alternative, less-used option is to use block scaling, whereby down-shifting is done only when overflow is imminent in the next step). The question to be addressed here is -- if arithmetic down-shifting is used at the end of each step, how many places (guard bits) should be allowed for overflow, that is -- should the scaling in an 8-bit signed binary word look like SSMMMMM or SSSMMMMM (S is sign, M is mantissa)? In the following, it will be seen that a scaling of SSMMMMMM (rather than SSSMMMMM) can, in some cases,result in undesirable arithmetic overflow.

Each stage of a radix-2 Fast Fourier Transform operates on pairs of complex points using an expression of the form:

     (a1 + j b1) + (a3 + j b3) x (cos + j sin) = (a1 + (a3 cos - b3 sin)) + j (b1 + (a3 sin + b3 cos))

To show how overflow can occur, we will assume the use of a 16-bit processor word, and that the binary representation of the A/D samples is placed in the upper 8 bits of the 16-bit word and then sign-extended by one place to look like:

          A = 00111111 10000000 for the positive maximum
          A = 11000000 00000000 for the negative maximum

If we assume only realizable components of a unit rotating vector, overflow should, in principle, not occur, since the a+jb vectors are constrained to contain only the sine and cosine components of the maximum A/D value. However, if either noise is added or if a small amount of non-linearity occurs in either the 'a' or the 'b' components of the a+jb vector, then overflow can occur as illustrated below (each decimal integer shown has been up-shifted 8 binary places in the 16-bit words, but is shown in unshifted decimal form to simplify the arithmetic conversion -- for the same reason, lower-order bits from the multiply operations have been truncated):

      angle  = - pi/4   a1 =  63 (00111111 00000000)   a1     =  63 (00111111 00000000)
  cos(-pi/4) =  .707    a3 =  45 (00101101 00000000)   a3 cos =  32 (00100000 00000000)
  sin(-pi/4) = -.707    b3 =  47 (00101111 00000000)   b3 sin = -33 (11011111 00000000)

then                     a1         00111111 00000000
                     +   a3 cos   + 00100000 00000000 
                     -   b3 sin   + 00100001 00000000
                                    -----------------
                                    10000000 00000000

that is, 63 + 32 + 33 = 128, which, upshifted by 8 places as shown, is equal to the most negative binary number (1000000000000000). Note that both a3 and b3 could be either very slightly non-linear numbers (the square root of the sum of their squares is 65, not 63) or could have had a small amount of noise added to them. Because this situation is quite likely to occur in real life, especially if large amounts of noise are present, a reasonably robust scaling scheme should be used to avoid it. Therefore, to prevent intermediate sum overflow in radix-2 transforms which use a single down-shift at the end of each FFT state, the minimum scaling for the input samples should be:

          A = 000111111 for the positive maximum  (00111111 00000000 for 16-bit mantissas)
          A = 111000000 for the negative maximum  (11100000 00000000 for 16-bit mantissas)

Interestingly, a widely circulated fixed-point FFT algorithm contained this bug for many years, before it was uncovered while doing unrelated algorithmic testing. Before the advent of 32-bit processors and 32-bit floating-point DSP chips, this sort of problem was not atypical of many algorithms which struggled to surmount the dynamic-range problems of 16-bit words, but which not infrequently turned out to be mischievously data-sensitive.

Home Page or Table of Contents