Project Nayuki


Fast Fourier transform in x86 assembly

I have a simple fast Fourier transform implementation on the page Free small FFT in multiple languages, whose asymptotic time complexity is the optimal Θ(n log n). I wanted to know how much additional speed-up could be obtained by reorganizing memory access patterns and using SIMD vector arithmetic instructions. The best assembly language implementation here has under 150 lines of clear code and runs 2× to 4× faster than the simple C++ implementation.

The assembly code requires x86-64 with AVX and FMA3, which are supported by Intel Haswell and later CPUs. The library uses the same FFT equation as the other page; only the forward radix-2 FFT is provided, and the transform is unscaled.
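For reference, the transform that all the cores compute is the standard unscaled forward radix-2 FFT. The following is a minimal, self-contained sketch in the spirit of the simple core; the function name is mine and the actual source files organize the computation differently, so treat this as illustration, not the published code:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Unscaled forward radix-2 Cooley-Tukey FFT, iterative and in place.
// The vector length must be a power of 2.
void fftSimple(std::vector<std::complex<double>> &vec) {
	const std::size_t n = vec.size();
	if (n <= 1)
		return;
	std::size_t levels = 0;  // n == 2^levels
	for (std::size_t t = n; t > 1; t >>= 1)
		levels++;
	assert((std::size_t(1) << levels) == n);

	// Bit-reversal permutation of the input
	for (std::size_t i = 0; i < n; i++) {
		std::size_t j = 0;
		for (std::size_t k = 0; k < levels; k++)
			j = (j << 1) | ((i >> k) & 1);
		if (j > i)
			std::swap(vec[i], vec[j]);
	}

	// Butterfly levels; note the twiddle factor depends on j/size,
	// which is where the varying table strides come from
	const double PI = std::acos(-1.0);
	for (std::size_t size = 2; size <= n; size *= 2) {
		std::size_t half = size / 2;
		for (std::size_t i = 0; i < n; i += size) {
			for (std::size_t j = 0; j < half; j++) {
				std::complex<double> w = std::polar(1.0, -2 * PI * j / size);
				std::complex<double> a = vec[i + j];
				std::complex<double> b = vec[i + j + half] * w;
				vec[i + j] = a + b;
				vec[i + j + half] = a - b;
			}
		}
	}
}
```

Because the transform is unscaled, transforming the all-ones vector of length n yields n in bin 0 and zeroes elsewhere; an inverse transform would need to divide by n.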

Source code

All the C++ source files are fully portable and do not require x86-64 or vector instructions. All the .s files (assembly language) are only for Unix on x86-64.

  • FftTest.cpp: Runnable main test program
  • Fft.cpp: FFT library implementation, excluding the transform core
  • Fft.hpp: FFT library header
  • Fft-transformImpl-simple.cpp: FFT core which computes each level fully and reads the exponential table with varying strides, simply adapted from FftComplex.cpp
  • Fft-transformImpl-linearVector.cpp: FFT core which computes each level fully and reads the exponential table with unit stride, and imitates the length-2 and length-4 vector operations of the AVX code
  • Fft-transformImpl-linearVector.s: FFT core which computes each level fully and reads the exponential table with unit stride, only for Unix on x86-64
  • Fft-transformImpl-recurseVector.cpp: FFT core which computes levels recursively in a bottom-up way and reads the exponential table with unit stride, and imitates the length-2 and length-4 vector operations of the AVX code
  • Fft-transformImpl-recurseVector.s: FFT core which computes levels recursively in a bottom-up way and reads the exponential table with unit stride, only for Unix on x86-64

License: MIT (open source)

To compile, use a command of the form ⟨compiler⟩ -std=c++17 ⟨optimizations⟩ -o FftTest FftTest.cpp Fft.cpp ⟨transform-impl⟩.cpp/.s. For example:

  • c++ -std=c++17 -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-simple.cpp
  • g++ -std=c++17 -O -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-linearVector.s
  • g++ -std=c++17 -Os -ftree-vectorize -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-linearVector.cpp
  • g++ -std=c++17 -O2 -march=native -mtune=native -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-recurseVector.s
  • clang++ -std=c++17 -O3 -march=native -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-recurseVector.cpp

Source code notes:

  • While there are 5 implementations of the FFT core, the precomputation of tables has only one implementation in C++ because it’s not as time-critical.

  • Each of the two assembly implementations has a pure-C++ version which uses only scalar operations to almost perfectly mimic the vector operations and memory access patterns. This allows the reader to understand what the AVX code is doing in more familiar notation, without having to look up instruction names in the Intel manual. In theory, if the C++ compiler is good at vectorization, it should also generate machine code nearly identical to the hand-written vector assembly.

  • All the C++ code files are portable and can work on any system with any definition of size_t. But when linking Fft.cpp to Fft-transformImpl-*.s (x86-64 assembly code), it is required that size_t = uint64_t (in other words, std::numeric_limits<size_t>::digits == 64), otherwise the assembly function implementation will corrupt memory and crash. This constraint is not checked at compile time because it would require extra C++ source files.

  • A previous version of this page featured code that: was implemented in C, arranged the real and imaginary parts of the complex vector in separate arrays instead of interleaving them, and didn’t have a recursive transform. The old codebase is slower and less readable, and is no longer supported.
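Regarding the size_t constraint above: if one were willing to add a line to an existing source file, a guard like the following would reject a mismatched build. This is a hypothetical sketch, not something the published sources do:

```cpp
#include <cstddef>
#include <limits>

// Hypothetical guard: placed in any translation unit linked with the
// .s transform cores, this fails the build if size_t is not exactly
// 64 bits wide.
static_assert(std::numeric_limits<std::size_t>::digits == 64,
              "the x86-64 assembly transform cores assume size_t == uint64_t");
```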

Memory access patterns

This animation visualizes how the various FFT core implementations read/write different elements of the two arrays over time.

[Interactive animation omitted: controls for FFT size, speed, and implementation, plus two animated panels showing accesses to the input/output vector and the exponential table.]
Note: Each cell represents an std::complex<double>, whose size is 16 bytes.
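The key difference the animation highlights is how the exponential (twiddle-factor) table is read: the simple core reads it with a level-dependent stride of n/size, while the vector cores read a linearized table with unit stride. A rough sketch of such a linearized layout follows; the function name and exact layout are my assumptions for illustration, not the page's actual code:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Build a "linearized" twiddle-factor table in which each FFT level's
// factors are stored contiguously, so the transform proper can read
// them with unit stride instead of stride n/size.
std::vector<std::complex<double>> makeLinearTwiddles(std::size_t n) {
	const double PI = std::acos(-1.0);
	// Standard half-size table: expTable[k] = e^(-2*pi*i*k/n)
	std::vector<std::complex<double>> expTable(n / 2);
	for (std::size_t k = 0; k < n / 2; k++)
		expTable[k] = std::polar(1.0, -2 * PI * k / n);

	// Concatenate per-level sub-tables. The level with butterfly size
	// `size` needs e^(-2*pi*i*j/size) = expTable[j * (n / size)] for
	// j < size/2; the strided reads happen only once, at setup time.
	std::vector<std::complex<double>> linear;
	for (std::size_t size = 2; size <= n; size *= 2)
		for (std::size_t j = 0; j < size / 2; j++)
			linear.push_back(expTable[j * (n / size)]);
	return linear;  // total length n - 1
}
```

The linearized table costs a little extra memory (n − 1 entries instead of n/2) in exchange for sequential, cache-friendly reads during the transform.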

Benchmark

The main numbers in the tables are the number of nanoseconds per FFT computation for a given FFT size and transform core implementation.

Compiler: GCC 10.2.0

Compile options per transform implementation:

  • simple.cpp: -O2 -march=native -mtune=native -ftree-vectorize
  • linearVector.cpp: -O1 -march=native -mtune=native -ftree-vectorize
  • linearVector.s: -O1 -march=native -mtune=native
  • recurseVector.cpp: -O1 -march=native -mtune=native -ftree-vectorize
  • recurseVector.s: -Os -march=native -mtune=native

FFT size | simple.cpp | linearVector.cpp | linearVector.s | recurseVector.cpp | recurseVector.s
8 | 76 | 44 | 41 | 40 | 85
16 | 127 | 61 | 59 | 62 | 112
32 | 238 | 107 | 101 | 112 | 166
64 | 488 | 218 | 212 | 234 | 292
128 | 1 016 | 471 | 443 | 500 | 563
256 | 2 232 | 1 050 | 1 011 | 1 127 | 1 181
512 | 4 847 | 2 371 | 2 316 | 2 536 | 2 561
1 024 | 10 634 | 5 431 | 5 203 | 5 701 | 5 668
2 048 | 23 595 | 12 374 | 12 079 | 12 931 | 12 534
4 096 | 60 421 | 34 200 | 34 505 | 32 288 | 32 007
8 192 | 145 182 | 82 865 | 82 734 | 79 071 | 77 133
16 384 | 328 538 | 202 016 | 196 209 | 179 904 | 178 106
32 768 | 794 685 | 472 896 | 473 776 | 399 099 | 395 295
65 536 | 1 848 536 | 1 011 156 | 1 017 823 | 912 735 | 896 032
131 072 | 4 048 978 | 2 229 407 | 2 271 899 | 1 919 836 | 1 856 634
262 144 | 9 568 106 | 5 377 714 | 5 509 565 | 4 463 868 | 4 555 393
524 288 | 29 331 856 | 18 053 211 | 18 372 963 | 11 717 066 | 11 640 852
1 048 576 | 84 874 361 | 44 012 739 | 44 233 226 | 27 502 046 | 27 383 013
2 097 152 | 197 736 251 | 94 209 742 | 95 788 139 | 58 103 293 | 61 914 542
4 194 304 | 477 559 022 | 200 105 030 | 197 380 868 | 125 288 316 | 126 381 213
8 388 608 | 1 150 845 702 | 442 110 966 | 443 317 765 | 297 457 984 | 284 732 072
16 777 216 | 2 701 063 686 | 985 539 583 | 1 015 481 605 | 702 461 388 | 694 133 979
33 554 432 | 6 634 030 343 | 2 170 786 691 | 2 219 600 086 | 1 593 194 794 | 1 542 642 833
67 108 864 | 14 389 143 994 | 4 795 731 589 | 4 782 791 173 | 4 319 947 607 | 3 634 788 206
134 217 728 | 34 308 652 911 | 10 767 479 096 | 10 808 237 981 | 10 046 959 831 | 10 735 480 888
268 435 456 | 78 795 861 504 | 23 323 416 514 | 23 280 238 633 | 18 613 868 128 | 19 604 131 740
536 870 912 | 191 424 874 237 | 55 481 062 948 | 58 530 032 212 | 48 347 285 488 | 35 887 147 074

Compiler: Clang 11.0.0

Compile options per transform implementation:

  • simple.cpp: -O3 -march=x86-64
  • linearVector.cpp: -O3 -march=native
  • linearVector.s: -O2 -march=native
  • recurseVector.cpp: -O2 -march=native
  • recurseVector.s: -O2 -march=x86-64

FFT size | simple.cpp | linearVector.cpp | linearVector.s | recurseVector.cpp | recurseVector.s
8 | 69 | 40 | 41 | 37 | 85
16 | 130 | 55 | 59 | 59 | 111
32 | 272 | 99 | 100 | 108 | 165
64 | 581 | 203 | 206 | 227 | 290
128 | 1 269 | 443 | 445 | 490 | 561
256 | 2 872 | 1 027 | 1 010 | 1 120 | 1 180
512 | 6 351 | 2 358 | 2 303 | 2 552 | 2 576
1 024 | 13 939 | 5 428 | 5 264 | 5 852 | 5 886
2 048 | 30 282 | 12 800 | 12 104 | 13 468 | 12 677
4 096 | 71 968 | 34 991 | 33 753 | 34 826 | 32 480
8 192 | 170 657 | 85 148 | 81 707 | 84 985 | 76 661
16 384 | 378 781 | 210 178 | 210 709 | 195 597 | 176 178
32 768 | 922 679 | 490 980 | 473 141 | 424 887 | 394 891
65 536 | 2 125 945 | 1 048 564 | 1 066 460 | 961 162 | 854 839
131 072 | 4 510 866 | 2 218 843 | 2 177 340 | 1 969 001 | 1 856 780
262 144 | 10 488 899 | 5 596 053 | 5 814 941 | 4 742 676 | 4 477 989
524 288 | 31 566 168 | 17 693 103 | 17 897 829 | 11 727 889 | 11 621 123
1 048 576 | 84 814 585 | 45 435 877 | 43 211 773 | 26 532 912 | 26 297 249
2 097 152 | 217 636 774 | 97 135 549 | 94 059 053 | 60 634 895 | 58 207 924
4 194 304 | 510 150 349 | 205 862 060 | 196 633 667 | 132 114 212 | 121 363 532
8 388 608 | 1 264 917 458 | 437 783 298 | 440 487 726 | 314 048 050 | 304 741 011
16 777 216 | 2 969 336 243 | 984 612 216 | 988 402 710 | 717 541 640 | 697 949 297
33 554 432 | 6 552 070 064 | 2 187 774 518 | 2 145 750 296 | 1 650 733 006 | 1 539 819 002
67 108 864 | 14 178 321 121 | 5 321 209 567 | 4 661 419 579 | 4 222 166 271 | 3 962 320 705
134 217 728 | 32 760 302 811 | 11 068 710 488 | 11 110 860 590 | 8 870 709 543 | 8 686 626 184
268 435 456 | 71 142 211 054 | 21 113 781 611 | 24 373 299 955 | 18 847 519 484 | 19 521 796 294
536 870 912 | 186 516 806 728 | 55 904 453 622 | 57 198 381 577 | 48 821 368 142 | 46 479 427 904

Methodology and observations:

  • All timing numbers were obtained on an Intel Core i5-4690 (Haswell) 3.50 GHz CPU (Turbo Boost disabled), running Ubuntu 20.10.

  • For the GCC compiler, I tried almost all permutations of these optimization flags: -O0 -O1 -O2 -O3 -Os -march=x86-64/native -mtune=generic/native -ftree-vectorize

  • For the Clang compiler, I tried almost all permutations of these optimization flags: -O0 -O1 -O2 -O3 -Os -Oz -march=x86-64/native

  • Although I tested many sets of compiler flags, for each transform core, I selected the set of flags that produced the fastest run times overall and published just those flags and numbers.

  • For the simple transform core, the best code produced by Clang is about 10% slower than GCC. For the other 4 cores, there is no clear winner between GCC and Clang; hence I consider both compilers to be competitive in the quality of their generated code.

  • The primary difference between the simple core and the vector cores is that the simple algorithm reads the complex exponential (trigonometric) table with varying strides, whereas the other cores read the table contiguously without skipping any elements. The result of this change is that all the other cores are at least 2× and up to 4× faster than the simple core.

  • For the linear vector transform core algorithm, the C++ implementation is about the same speed as the hand-written assembly. The compilers do a good job at vectorizing the code.

  • For the recursive vector transform core algorithm, the C++ implementation is a few percent slower than the hand-written assembly. The compilers still do a decent job at vectorization.

  • The linear vector core is faster than the recursive vector core up to around FFT size 1024. The recursive vector core is faster than the linear vector core around FFT size 4096 and up. Hence, the linear core is better at smaller sizes and the recursive core is better at larger sizes, which is what one would expect from the memory hierarchy and cache-oblivious algorithms.

  • The largest FFT size of 2^29 complex elements yielded significant variance in the timings (based on unpublished raw data), so I wouldn’t trust that row for drawing conclusions. This is strange because the memory consumption for that size, including both the data vector and precomputed tables, is about 20 GiB, which clearly fits in the machine’s 32 GiB of RAM without swapping.
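To illustrate why recursion wins at large sizes, here is the textbook recursive decimation-in-time formulation: once a sub-transform's working set fits in cache, all of its levels complete without touching main memory again. Note this is only a sketch of the locality argument; the recurseVector cores recurse over levels bottom-up rather than splitting into even/odd halves, and this naive version allocates scratch buffers at every level:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Classic recursive decimation-in-time FFT (unscaled forward transform).
// Small sub-transforms stay cache-resident, which is the essence of the
// cache-oblivious behavior observed in the benchmarks.
void fftRecursive(std::vector<std::complex<double>> &v) {
	const std::size_t n = v.size();
	if (n <= 1)
		return;
	// Split into even- and odd-indexed halves
	std::vector<std::complex<double>> even(n / 2), odd(n / 2);
	for (std::size_t i = 0; i < n / 2; i++) {
		even[i] = v[2 * i];
		odd[i] = v[2 * i + 1];
	}
	fftRecursive(even);
	fftRecursive(odd);
	// Combine with one level of butterflies
	const double PI = std::acos(-1.0);
	for (std::size_t k = 0; k < n / 2; k++) {
		std::complex<double> t = std::polar(1.0, -2 * PI * k / n) * odd[k];
		v[k] = even[k] + t;
		v[k + n / 2] = even[k] - t;
	}
}
```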