Numerical Math and Scientific Computing

 

 

Generating optimal FFT code and relating FFTW and Split-Radix

Recent work showed that staging and abstract interpretation can be used to derive correct families of combinatorial circuits, and illustrated this technique with an in-depth analysis of the Fast Fourier Transform (FFT) for sizes 2n. While the quality of the generated code was promising, it used more floating-point operations than the well-known FFTW codelets and the split-radix algorithm.

This paper shows that staging and abstract interpretation can in fact be used to produce circuits with the same number of floating-point operations as each of split-radix and FFTW. In addition, choosing between two standard implementations of complex multiplication produces results that match each of the two algorithms. Thus, we provide a constructive method for deriving the two distinct algorithms.

Generating radix-2 FFT kernels with the operation count exactly matching that of FFTW is much harder than one may think. It is emphatically not enough to specialize the textbook FFT code to a particular sample size N, and apply a partial evaluator or a smart compiler to eliminate trivial multiplications (by one and zero) and trivial additions (to zero), and to lift common factors. First of all, the specialized FFT code will contain factors like -2.44921270764475452e-16 (which is the computed IEEE double value of sin((pi/. 2.0) *. 4.0)). That factor is not zero and the corresponding multiplication is not trivial, although it should be.

A frequently occurring expression (sin(pi/4)*x + cos(pi/4)*y) presents a subtler problem: although the ideal sin(pi/4) and cos(pi/4) are equal, their computed values are not. Therefore, a partial evaluator or a compiler, which see only the numeric values of the FFT factors, are not able to convert the expression to sin(pi/4)*(x + y), saving one multiplication. Thus the necessary condition for generating the optimal FFT kernels is the use of the exact trigonometry. The roots of unity exp(-I j*2pi/N) must be manipulated symbolically rather than numerically. This condition is not sufficient: an additional insight -- the use of symmetry -- was required to match the operation count of FFTW (see the function add_sub_aa in the generator code below).

Version
The current version is December 2004
References
fft.pdf [69K]
Relating FFTW and Split-Radix. Joint work with Walid Taha.
Proc. of ICESS'04, the First International Conference on Embedded Software and System, December 9-10 2004, Hangzhou (Zhejiang), China.

fft-neo.ml [58K]
The self-contained commented MetaOCaml code
The code derives radix-2 FFT kernels with the number of additions and multiplications exactly matching those in FFTW kernels. A slight change in the generation algorithm produces split-radix FFT kernels.

<http://www.fftw.org>
FFTW: Fastest Fourier Transform in the West

Generating optimal code with confidence
The explanation of the general framework of staging and abstract interpretation

 

Fast and vectorizable atan

Speed vs. precision trade-off is pervasive in numerical computing. We present one example: computing arc tangent (atan) with the acceptable for the target domain precision but portably fast. Our code performs well on x86-64 architectures (where atan computations are auto-vectorized) and also on a AArch32 architecture (Raspberry Pi Zero).

An example of speed being far more important than precision is Software Defined Radio (SDR), specifically, FM Radio reception. Doing almost all the work of an FM Radio in software, on an ordinary laptop is quite a challenge: the HackRF One board often used to acquire and digitize radio signals sends data at a rate of 2 to 20 million samples per second. Each sample is a pair of two 32-bit FP numbers. A typical 3GHz CPU may then spend no more than 1500 CPU cycles per sample. For comparison, on a high-class Sandy Bridge class E Intel CPU, L3 cache access may take up to 38 cycles; main memory access more than 100 cycles. The cycle budget is tight.

One of the most demanding operations in FM Radio reception is demodulation: specifically, computing arc tangent, or so-called atan2. Although the x86 instruction set has a special instruction, FPATAN, for that, using it is not a good idea. It is slow (taking up to 160 cycles) and is not vectorizable. Calling libc's atan2f is also not a good idea, because of the overhead of the function call and saving/restoring of registers. It also obstructs auto-vectorization. Both FPATAN and atan2f compute arc tangent to the precision required by the IEEE FP arithmetic. However, the received radio signal is inherently noisy: high precision is not necessary. Therefore, we can get by with a slightly less-precise atan calculation and aim for speed. (FM Radio code can also ensure the input samples are normal FP numbers -- rather than NaN, infinities, -0.0 -- and take advantage of that fact in the atan calculation.)

For all these reasons GNU Radio -- the state of the art in SDR -- uses so-called fast_atan2f.cc. Its key idea is replacing the atan calculation with a look-up in a table of pre-computed values, with linear interpolation. The idea is excellent, but its implementation is not on par. The code uses too many comparisons, resulting in too many branches.

This article presents the re-written, pure C, fast atan code, which is tighter, smaller, and with fewer comparisons. We confirmed that computations with our atan2 are auto-vectorizable (in vector %ymm registers). The precision hardly suffers: compared to libc's atan2f, the largest absolute error is 1.4e-06, the largest relative error 4e-06 and the average error 5.3e-07.

The interface matches atan2f of libc:

    float fast_atan2f(float y, float x);
Just like libc's atan2f, it calculates the principal value of the arc tangent of y/x, using the signs of the two arguments to determine the quadrant of the result. The return value is in radians and within [-pi,pi]. The code assumes the arguments are normal floats: no NaNs, no infinites, and no -0.0. Like GNU Radio's fast_atan2f, it relies on table lookup and linear interpolation.

We check the performance on the benchmark representative of FM Radio processing: take complex numbers from a buffer, compute their phase and store in another buffer:

    void bench(float complex const x [], float out [], int const n) {
      for(int i=0; i<n; i++) 
        out[i] = fast_atan2f(crealf(x[i]), cimagf(x[i]));
    }

Here is the running time, for the arrays of 200,000,000 elements on Intel Core i5-1135G7 SkyLakeX (turbo-boosted to 4186 MHz). The code is compiled by clang 13.0.0 with flags -Ofast -march=native -ffp-contract=on.

Baseline: libc
3480 ms
GNU Radio fast_atan2f
323 ms
Our fast_atan2f
217 ms

We can see the factor of 15 improvement over libc, and a notable improvement over the original GNU Radio code.

To summarize, we presented an improved fast_atan2f code that can be used in GNU Radio and other SDR software. It is just as precise and portable as GNU Radio's, but notably faster. The performance is also portable: our code performs well on x86-64 (with SSE and AVX2) and also on AArch32 (Raspberry Pi Zero).

Version
The current version is September 2024
References
fast_atanf.h [7K]
The fast atan code. It is placed into an .h file so that it can be inlined.

Makefile [2K]
The benchmark and test code is in the same directory as the Makefile.

2021 Francesco Mazzoli: Speeding up atan2f by 50x (2021-08-16)
A different way to optimize atan2f, using polynomial approximation and manual vectorization. It gives an additional factor of 3 speed improvement, at the cost of portability. The precision is also about 3 times worse.

 

LinAlg: Basic Linear Algebra and Optimization classlib

This C++ class library implements real-valued Matrix, Vector, sub-matrices and LAStreams -- and many operations on them, from basic to SVD. Distinctive features are efficient matrix views (row, column, diagonal), matrix streams, and LazyMatrices. Lazy construction lets us write matrix expressions in a natural way without introducing hidden temporaries, deep copying, or any reference counting.

LinAlg stresses Matrix streams, which provide a sequential access to a matrix or its parts. LABlockStreams may span over an arbitrary rectangular block, from a single element to the whole matrix. Matrix streams are seek-able and subrange-able. A stream or a substream are always created in-line. They are safe and allocate no heap storage.

LinAlg demonstrates that many of Linear Algebra algorithms indeed require only a sequential access to a matrix or its sub-blocks. The sequential access is simpler and faster than a random one. Notable examples of such ``serial algorithms'' are matrix inverse and Householder bidiagonalizations. Every feature of the package is extensively tested by the included validation code.

Other features:

The README file tells far more about the features and how to use them. The file contains many commented code snippets.

All the code is in Public Domain.

Version
The current version is 4.4, November 29, 2002
References
README.txt [49K]

LinAlg.tar.gz [115K]
Complete source code, validation code, sample code, and test run outputs

LinAlg.h [58K]
LinAlg interface definition

LAStreams.h [28K]
LAStreams interface definition

Functional Style in C++: Closures, Late Binding, and Lambda Abstractions
uses many code snippets from LinAlg

more_promises.txt [9K]
(lazy) SVD promising a matrix in extended LinAlg.shar
An old but still relevant article posted on comp.c++.moderated, comp.sys.mac.programmer.tools, comp.sys.mac.scitech, sci.math.num-analysis newsgroups on Tue Feb 27 11:31:08 CST 1996

LinAlg was also submitted to Netlib

 

Programming Language Support for Emerging Memory Technologies

Optimal placement and movement of data between storage, memory and processing components has been the main driver for higher performance and lower energy consumption of all software. Attaining (or even approaching) optimality requires an accurate model. A flat homogeneous address space abstraction (the von Neumann model), presented by hardware and by the hardware abstraction layers of the OS/runtime, is becoming harder and harder to maintain and is less and less accurate as a model.

What new abstractions should compiler writers and programmers use instead? In the emerging landscape of new memory devices, with radically different processing elements in systems spanning from super-computers to IoT devices, how can we save users from being overwhelmed with the diversity and complexity of memory architectures? Conversely, how can we give programmers and compiler writers some idea of the memory architecture and some degree of control over data placement and movement?

These questions were the topic of the seminar that took place at Shonan Village Center, Kanagawa, Japan on May 7--10, 2024.

References
shonan-memory.pdf [5996K]
The final report, including the abstracts of the talks and the bibliography

<https://shonan.nii.ac.jp/seminars/181/>
The web page of the Seminar

 

Fourier integrals and Discrete/Fast Fourier transform

This C++ class library performs radix-2 Discrete Fourier transform of a real or complex sequence, and sin, cos, and complex Fourier integrals of an evenly tabulated function. The input can be either real or complex with/without zero padding; you can request either the full complex transform, or only real, im, or abs part of it.
Version
The current version is 1.2, December 25, 1998
References
fft12.tar.gz [36K]
Commented C++ source code, verification code and test run outputs. The verification code demonstrates many possible usage scenarios. LinAlg is required as a dependency.

 

A neuron that learns how to play blackjack

A very simple C code that learns how to play blackjack, by playing it with itself. The code implements a bona fide neural network with back-propagation. The network is made of a single neuron, possessing a single byte of intelligence. The neuron has ``weights'' and a ``threshold''; the neuron ``fires'' when the value of the activation function computed over the current input exceeds the current threshold. In game terms, using the current score and the number of aces on hand the neuron decides if to draw another card or to stay. The threshold (the neuron's memory) is adjusted as the game proceeds, using back-propagation -- if you can discern it in such a primitive setting.
Version
The current version is 1.1, December 18, 1997
References
black-jack.c [10K]
ANSI C source code, albeit written in a C++ style. The code and its sample output are well annotated.

 

Powers of three

A simple self-contained C code that computes and prints out 3N very fast. A non-negative integer exponent N may be as big as 2000.

The code is the intended solution to the problem presented at the 1995 Programming Contest organized by University of North Texas' Chapter of ACM. The full text of the problem is given in the title comments to the code.

What counts is the overall speed, of computing the result and converting it to ASCII. And fast the code is: on HP 9000/770/J210, it completes 32000 under 0.09 seconds, whereas bc takes 0.3 seconds of user time. The present code uses no multiplications or divisions to compute and print 3N.

Version
The current version is 1.1, March 1995
References
pow3.c [6K]
Commented C source code