This paper shows that staging and abstract interpretation can in fact be used to produce circuits with the same number of floating-point operations as each of split-radix and FFTW. In addition, choosing between two standard implementations of complex multiplication produces results that match each of the two algorithms. Thus, we provide a constructive method for deriving the two distinct algorithms.
Generating radix-2 FFT kernels with the operation count
exactly matching that of FFTW is much harder than one may think.
It is emphatically not enough to specialize the textbook FFT code
to a particular sample size N, and apply a partial evaluator
or a smart compiler to eliminate trivial multiplications (by one and zero)
and trivial additions (to zero), and to lift common factors.
First of all, the specialized FFT code will contain factors like
-2.44921270764475452e-16 (the computed IEEE double value of the OCaml
expression sin((pi /. 2.0) *. 4.0)). That factor is not zero and the
corresponding multiplication is not trivial, although it should be.
A frequently occurring expression sin(pi/4)*x + cos(pi/4)*y presents a
subtler problem: although the ideal sin(pi/4) and cos(pi/4) are equal,
their computed values are not. Therefore, a partial evaluator or a
compiler, which sees only the numeric values of the FFT factors, is not
able to convert the expression to sin(pi/4)*(x + y), saving one
multiplication. Thus a necessary condition for generating the optimal
FFT kernels is the use of exact trigonometry: the roots of unity
exp(-I*j*2pi/N) must be manipulated symbolically rather than
numerically. This condition is not sufficient: an additional
insight -- the use of symmetry -- was required to match the operation
count of FFTW (see the function add_sub_aa in the generator code
below).
fft-neo.ml [58K]
The self-contained commented MetaOCaml code
The code derives radix-2 FFT kernels with the number of
additions and multiplications exactly matching those in
FFTW kernels. A slight change in the generation algorithm
produces split-radix FFT kernels.
<http://www.fftw.org>
FFTW: Fastest Fourier Transform in the West
Generating optimal code with confidence
The explanation of the general framework of staging and
abstract interpretation
An example of speed being far more important than precision is Software Defined Radio (SDR), specifically FM Radio reception. Doing almost all the work of an FM Radio in software, on an ordinary laptop, is quite a challenge: the HackRF One board often used to acquire and digitize radio signals sends data at a rate of 2 to 20 million samples per second. Each sample is a pair of 32-bit FP numbers. A typical 3GHz CPU may then spend no more than 1500 CPU cycles per sample. For comparison, on a high-end Intel Sandy Bridge-E CPU, an L3 cache access may take up to 38 cycles; a main memory access, more than 100 cycles. The cycle budget is tight.
One of the most demanding operations in FM Radio reception is
demodulation: specifically, computing the arc tangent, the so-called
atan2. Although the x86 instruction set has a special instruction for
it, FPATAN, using it is not a good idea: it is slow (taking up to 160
cycles) and is not vectorizable. Calling libc's atan2f is also not a
good idea, because of the overhead of the function call and the
saving/restoring of registers; it also obstructs auto-vectorization.
Both FPATAN and atan2f compute the arc tangent to the precision
required by the IEEE FP arithmetic. However, the received radio signal
is inherently noisy: high precision is not necessary. Therefore, we
can get by with a slightly less precise atan calculation and aim for
speed. (FM Radio code can also ensure the input samples are normal FP
numbers -- rather than NaN, infinities, or -0.0 -- and take advantage
of that fact in the atan calculation.)
For all these reasons GNU Radio -- the state of the art in SDR -- uses
the so-called fast_atan2f.cc. Its key idea is replacing the atan
calculation with a look-up in a table of pre-computed values, with
linear interpolation. The idea is excellent, but its implementation is
not on par: the code uses too many comparisons, resulting in too many
branches.
This article presents the re-written, pure C, fast atan code, which is
tighter, smaller, and has fewer comparisons. We confirmed that
computations with our atan2 are auto-vectorizable (in vector %ymm
registers). The precision hardly suffers: compared to libc's atan2f,
the largest absolute error is 1.4e-06, the largest relative error
4e-06, and the average error 5.3e-07.

The interface matches atan2f of libc:
    float fast_atan2f(float y, float x);

Just like libc's atan2f, it calculates the principal value of the arc
tangent of y/x, using the signs of the two arguments to determine the
quadrant of the result. The return value is in radians and within
[-pi,pi]. The code assumes the arguments are normal floats: no NaNs,
no infinities, and no -0.0.
Like GNU Radio's fast_atan2f, it relies on table lookup and linear
interpolation.
We check the performance on a benchmark representative of FM Radio processing: take complex numbers from a buffer, compute their phase, and store it in another buffer:

    void bench(float complex const x [], float out [], int const n) {
      for (int i = 0; i < n; i++)
        out[i] = fast_atan2f(cimagf(x[i]), crealf(x[i]));
    }
Here is the running time, for arrays of 200,000,000 elements, on an
Intel Core i5-1135G7 (turbo-boosted to 4186 MHz). The code is compiled
by clang 13.0.0 with flags -Ofast -march=native -ffp-contract=on.
[Table: running times of libc's atan2f, GNU Radio's fast_atan2f, and our fast_atan2f]
We can see the factor of 15 improvement over libc, and a notable improvement over the original GNU Radio code.
To summarize, we presented an improved fast_atan2f code that can be
used in GNU Radio and other SDR software. It is just as precise and
portable as GNU Radio's, but notably faster. The performance is also
portable: our code performs well on x86-64 (with SSE and AVX2) and
also on AArch32 (Raspberry Pi Zero).
The implementation is in a .h file so that it can be inlined.
Makefile [2K]
The benchmark and test code is in the same directory as the Makefile.
2021 Francesco Mazzoli: Speeding up atan2f by 50x (2021-08-16)
A different way to optimize atan2f, using polynomial approximation
and manual vectorization. It gives an additional factor of 3
speed improvement, at the cost of portability.
The precision is also about 3 times worse.
LinAlg provides Matrix, Vector, sub-matrices and LAStreams -- and many
operations on them, from basic to SVD. Distinctive features are
efficient matrix views (row, column, diagonal), matrix streams, and
LazyMatrices. Lazy construction lets us write matrix expressions in a
natural way without introducing hidden temporaries, deep copying, or
any reference counting.
LinAlg stresses Matrix streams, which provide sequential access to a
matrix or its parts. LABlockStreams may span an arbitrary rectangular
block, from a single element to the whole matrix. Matrix streams are
seek-able and subrange-able. A stream or a substream is always created
in-line; they are safe and allocate no heap storage.
LinAlg demonstrates that many linear algebra algorithms indeed require only sequential access to a matrix or its sub-blocks. Sequential access is simpler and faster than random access. Notable examples of such ``serial algorithms'' are matrix inverse and Householder bidiagonalization. Every feature of the package is extensively tested by the included validation code.
Other features: a linear-system solver for Ax=b, where the matrix A
does not have to be square, and b does not have to be a vector.
In the code fragment

    Matrix A = inverse(D); C = A*B;

there is what looks like a matrix assignment, e.g., of the result of
A*B to C. However, in reality no deep copying is done: the product of
A and B is computed right in the place of C.
assert(of_every(ConstMatrixRow(m,i)).max_abs(of_every(vect)) == 2.0 );
All the code is in Public Domain.
LinAlg.tar.gz [115K]
Complete source code, validation code, sample code, and test
run outputs
LinAlg.h [58K]
LinAlg interface definition
LAStreams.h [28K]
LAStreams interface definition
Functional Style in C++: Closures, Late Binding, and
Lambda Abstractions
uses many code snippets from LinAlg
more_promises.txt [9K]
(lazy) SVD promising a matrix in extended LinAlg.shar
An old but still relevant article posted on
comp.c++.moderated, comp.sys.mac.programmer.tools,
comp.sys.mac.scitech, sci.math.num-analysis newsgroups on Tue Feb 27
11:31:08 CST 1996
LinAlg was also submitted to Netlib
The code computes sin, cos, and complex Fourier integrals of an evenly
tabulated function. The input can be either real or complex, with or
without zero padding; you can request either the full complex
transform, or only the real, im, or abs part of it. N may be as big as
2000.
The code is the intended solution to the problem presented at the 1995 Programming Contest organized by University of North Texas' Chapter of ACM. The full text of the problem is given in the title comments to the code.
What counts is the overall speed of computing the result and
converting it to ASCII. And fast the code is: on an HP 9000/770/J210,
it computes 3^2000 in under 0.09 seconds, whereas bc takes 0.3 seconds
of user time. The present code uses no multiplications or divisions to
compute and print 3^N.