Fast Fourier transform in x86 assembly
I have a simple fast Fourier transform implementation on the page Free small FFT in multiple languages, whose asymptotic time complexity is the optimal Θ(n log n). I wanted to know how much additional speed-up could be obtained by reorganizing memory access patterns and using SIMD vector arithmetic instructions. The best assembly language implementation here has under 150 lines of clear code and runs 2× to 4× faster than the simple C++ implementation.
The assembly code requires x86-64 with AVX and FMA3, which are supported by Intel Haswell and later CPUs. The library uses the same transform equation as the other page; only the forward radix-2 FFT is provided, and the output is unscaled.
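For reference, the transform computed here is the conventional unscaled forward DFT (the same definition as on the other page):

\[ X_k = \sum_{t=0}^{n-1} x_t \, e^{-2\pi i \, t k / n}, \qquad 0 \le k < n. \]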
Source code
All the C++ source files are fully portable and do not require x86-64 or vector instructions. All the .s files (assembly language) are only for Unix on x86-64.
- FftTest.cpp: Runnable main test program
- Fft.cpp: FFT library implementation, excluding the transform core
- Fft.hpp: FFT library header
- Fft-transformImpl-simple.cpp: FFT core which computes each level fully and reads the exponential table with varying strides, simply adapted from FftComplex.cpp
- Fft-transformImpl-linearVector.cpp: FFT core which computes each level fully and reads the exponential table with unit stride, and imitates the length-2 and length-4 vector operations of the AVX code
- Fft-transformImpl-linearVector.s: FFT core which computes each level fully and reads the exponential table with unit stride, only for Unix on x86-64
- Fft-transformImpl-recurseVector.cpp: FFT core which computes levels recursively in a bottom-up way and reads the exponential table with unit stride, and imitates the length-2 and length-4 vector operations of the AVX code
- Fft-transformImpl-recurseVector.s: FFT core which computes levels recursively in a bottom-up way and reads the exponential table with unit stride, only for Unix on x86-64
License: MIT (open source)
To compile, use a command of the form Compiler -std=c++17 Optimizations -o FftTest FftTest.cpp Fft.cpp TransformImpl.cpp/s. For example:
c++ -std=c++17 -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-simple.cpp
g++ -std=c++17 -O -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-linearVector.s
g++ -std=c++17 -Os -ftree-vectorize -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-linearVector.cpp
g++ -std=c++17 -O2 -march=native -mtune=native -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-recurseVector.s
clang++ -std=c++17 -O3 -march=native -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-recurseVector.cpp
Source code notes:
While there are 5 implementations of the FFT core, the precomputation of tables has only one implementation in C++ because it’s not as time-critical.
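As a rough illustration of what that precomputation produces for the simple core — presumably the n/2 forward-transform twiddle factors e^(−2πi·k/n) — a minimal sketch could look like this (hypothetical code, not taken from Fft.cpp; the vector cores presumably use the same values arranged so that each level's twiddles are contiguous):

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Build the table of n/2 complex exponentials e^(-2*pi*i*k/n) used by the
// forward radix-2 transform (n must be a power of 2).
std::vector<std::complex<double>> makeExpTable(std::size_t n) {
    const double PI = std::acos(-1.0);
    std::vector<std::complex<double>> table(n / 2);
    for (std::size_t k = 0; k < n / 2; k++)
        table[k] = std::polar(1.0, -2.0 * PI * static_cast<double>(k) / static_cast<double>(n));
    return table;
}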
Each of the two assembly implementations has a pure-C++ counterpart which uses only scalar operations to almost perfectly mimic the vector operations and memory access patterns. This lets the reader understand what the AVX code is doing in a more familiar notation, without having to look up instruction names in the Intel manual. Also, in theory, if the C++ compiler is good at vectorization, then it should generate machine code that is nearly identical to the hand-written vector assembly code.
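To make the idea concrete, here is a hypothetical illustration (not the library's actual code) of what imitating a length-4 vector operation means: two complex multiplications on data stored in interleaved {re0, im0, re1, im1} order, written as plain scalar arithmetic in the same lane-wise pattern that a single 256-bit AVX register would process.

#include <cstddef>

// Multiply two pairs of complex doubles, out[j] = a[j] * b[j], with the four
// doubles of each argument laid out as {re0, im0, re1, im1} like one YMM register.
void mulComplexPairs(const double a[4], const double b[4], double out[4]) {
    for (std::size_t i = 0; i < 4; i += 2) {
        double re = a[i] * b[i]     - a[i + 1] * b[i + 1];
        double im = a[i] * b[i + 1] + a[i + 1] * b[i];
        out[i]     = re;
        out[i + 1] = im;
    }
}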
All the C++ code files are portable and can work on any system with any definition of size_t. But when linking Fft.cpp to Fft-transformImpl-*.s (x86-64 assembly code), it is required that size_t = uint64_t (in other words, std::numeric_limits<size_t>::digits == 64); otherwise the assembly function implementation will corrupt memory and crash. This constraint is not checked at compile time because it would require extra C++ source files.
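For illustration, a compile-time check of this assumption could be placed in any C++ translation unit that is linked with the assembly files; this is a sketch of the idea, not something present in the library:

#include <cstddef>
#include <limits>

// The assembly cores assume size_t is a 64-bit unsigned integer; enforce that here.
static_assert(std::numeric_limits<std::size_t>::digits == 64,
    "Fft-transformImpl-*.s requires size_t to be a 64-bit unsigned integer");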
A previous version of this page featured code that was implemented in C, arranged the real and imaginary parts of the complex vector in separate arrays instead of interleaving them, and didn't have a recursive transform. The old codebase is slower and less readable, and is no longer supported.
Memory access patterns
This animation visualizes how the various FFT core implementations read and write the input/output vector and read the exponential table, with selectable FFT size, speed, and implementation. Note: Each cell represents an std::complex<double>, whose size is 16 bytes.
Benchmark
Each number in the tables is the time in nanoseconds per FFT computation for a given FFT size and transform core implementation.
Compiler: GCC 10.2.0
Transform implementation | simple.cpp | linearVector.cpp | linearVector.s | recurseVector.cpp | recurseVector.s |
---|---|---|---|---|---|
Compile options | -O2 -march=native -mtune=native -ftree-vectorize | -O1 -march=native -mtune=native -ftree-vectorize | -O1 -march=native -mtune=native | -O1 -march=native -mtune=native -ftree-vectorize | -Os -march=native -mtune=native |
FFT size | |||||
8 | 76 | 44 | 41 | 40 | 85 |
16 | 127 | 61 | 59 | 62 | 112 |
32 | 238 | 107 | 101 | 112 | 166 |
64 | 488 | 218 | 212 | 234 | 292 |
128 | 1 016 | 471 | 443 | 500 | 563 |
256 | 2 232 | 1 050 | 1 011 | 1 127 | 1 181 |
512 | 4 847 | 2 371 | 2 316 | 2 536 | 2 561 |
1 024 | 10 634 | 5 431 | 5 203 | 5 701 | 5 668 |
2 048 | 23 595 | 12 374 | 12 079 | 12 931 | 12 534 |
4 096 | 60 421 | 34 200 | 34 505 | 32 288 | 32 007 |
8 192 | 145 182 | 82 865 | 82 734 | 79 071 | 77 133 |
16 384 | 328 538 | 202 016 | 196 209 | 179 904 | 178 106 |
32 768 | 794 685 | 472 896 | 473 776 | 399 099 | 395 295 |
65 536 | 1 848 536 | 1 011 156 | 1 017 823 | 912 735 | 896 032 |
131 072 | 4 048 978 | 2 229 407 | 2 271 899 | 1 919 836 | 1 856 634 |
262 144 | 9 568 106 | 5 377 714 | 5 509 565 | 4 463 868 | 4 555 393 |
524 288 | 29 331 856 | 18 053 211 | 18 372 963 | 11 717 066 | 11 640 852 |
1 048 576 | 84 874 361 | 44 012 739 | 44 233 226 | 27 502 046 | 27 383 013 |
2 097 152 | 197 736 251 | 94 209 742 | 95 788 139 | 58 103 293 | 61 914 542 |
4 194 304 | 477 559 022 | 200 105 030 | 197 380 868 | 125 288 316 | 126 381 213 |
8 388 608 | 1 150 845 702 | 442 110 966 | 443 317 765 | 297 457 984 | 284 732 072 |
16 777 216 | 2 701 063 686 | 985 539 583 | 1 015 481 605 | 702 461 388 | 694 133 979 |
33 554 432 | 6 634 030 343 | 2 170 786 691 | 2 219 600 086 | 1 593 194 794 | 1 542 642 833 |
67 108 864 | 14 389 143 994 | 4 795 731 589 | 4 782 791 173 | 4 319 947 607 | 3 634 788 206 |
134 217 728 | 34 308 652 911 | 10 767 479 096 | 10 808 237 981 | 10 046 959 831 | 10 735 480 888 |
268 435 456 | 78 795 861 504 | 23 323 416 514 | 23 280 238 633 | 18 613 868 128 | 19 604 131 740 |
536 870 912 | 191 424 874 237 | 55 481 062 948 | 58 530 032 212 | 48 347 285 488 | 35 887 147 074 |
Compiler: Clang 11.0.0
Transform implementation | simple.cpp | linearVector.cpp | linearVector.s | recurseVector.cpp | recurseVector.s |
---|---|---|---|---|---|
Compile options | -O3 -march=x86-64 | -O3 -march=native | -O2 -march=native | -O2 -march=native | -O2 -march=x86-64 |
FFT size | |||||
8 | 69 | 40 | 41 | 37 | 85 |
16 | 130 | 55 | 59 | 59 | 111 |
32 | 272 | 99 | 100 | 108 | 165 |
64 | 581 | 203 | 206 | 227 | 290 |
128 | 1 269 | 443 | 445 | 490 | 561 |
256 | 2 872 | 1 027 | 1 010 | 1 120 | 1 180 |
512 | 6 351 | 2 358 | 2 303 | 2 552 | 2 576 |
1 024 | 13 939 | 5 428 | 5 264 | 5 852 | 5 886 |
2 048 | 30 282 | 12 800 | 12 104 | 13 468 | 12 677 |
4 096 | 71 968 | 34 991 | 33 753 | 34 826 | 32 480 |
8 192 | 170 657 | 85 148 | 81 707 | 84 985 | 76 661 |
16 384 | 378 781 | 210 178 | 210 709 | 195 597 | 176 178 |
32 768 | 922 679 | 490 980 | 473 141 | 424 887 | 394 891 |
65 536 | 2 125 945 | 1 048 564 | 1 066 460 | 961 162 | 854 839 |
131 072 | 4 510 866 | 2 218 843 | 2 177 340 | 1 969 001 | 1 856 780 |
262 144 | 10 488 899 | 5 596 053 | 5 814 941 | 4 742 676 | 4 477 989 |
524 288 | 31 566 168 | 17 693 103 | 17 897 829 | 11 727 889 | 11 621 123 |
1 048 576 | 84 814 585 | 45 435 877 | 43 211 773 | 26 532 912 | 26 297 249 |
2 097 152 | 217 636 774 | 97 135 549 | 94 059 053 | 60 634 895 | 58 207 924 |
4 194 304 | 510 150 349 | 205 862 060 | 196 633 667 | 132 114 212 | 121 363 532 |
8 388 608 | 1 264 917 458 | 437 783 298 | 440 487 726 | 314 048 050 | 304 741 011 |
16 777 216 | 2 969 336 243 | 984 612 216 | 988 402 710 | 717 541 640 | 697 949 297 |
33 554 432 | 6 552 070 064 | 2 187 774 518 | 2 145 750 296 | 1 650 733 006 | 1 539 819 002 |
67 108 864 | 14 178 321 121 | 5 321 209 567 | 4 661 419 579 | 4 222 166 271 | 3 962 320 705 |
134 217 728 | 32 760 302 811 | 11 068 710 488 | 11 110 860 590 | 8 870 709 543 | 8 686 626 184 |
268 435 456 | 71 142 211 054 | 21 113 781 611 | 24 373 299 955 | 18 847 519 484 | 19 521 796 294 |
536 870 912 | 186 516 806 728 | 55 904 453 622 | 57 198 381 577 | 48 821 368 142 | 46 479 427 904 |
Methodology and observations:
All timing numbers were obtained on an Intel Core i5-4690 (Haswell) 3.50 GHz CPU (Turbo Boost disabled), running Ubuntu 20.10.
For the GCC compiler, I tried almost all combinations of these optimization flags: -O0 -O1 -O2 -O3 -Os -march=x86-64/native -mtune=generic/native -ftree-vectorize
For the Clang compiler, I tried almost all combinations of these optimization flags: -O0 -O1 -O2 -O3 -Os -Oz -march=x86-64/native
Although I tested many sets of compiler flags, for each transform core, I selected the set of flags that produced the fastest run times overall and published just those flags and numbers.
For the simple transform core, the best code produced by Clang is about 10% slower than GCC's. For the other 4 cores, there is no clear winner between GCC and Clang; hence I consider both compilers to be competitive in the quality of their generated code.
The primary difference between the simple core and the vector cores is that the simple algorithm reads the complex exponential (trigonometric) table with varying strides, whereas the other cores read the table contiguously without skipping any elements. The result of this change is that all the other cores are at least 2× and up to 4× faster than the simple core.
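Concretely, the simple core's butterfly passes look roughly like the following (adapted from FftComplex.cpp on the other page; the bit-reversal permutation that precedes these passes is omitted). Note how the inner loop steps through expTable with stride n/size, which changes at every level, whereas the vector cores lay out each level's twiddles consecutively so that the corresponding index can simply increment by 1.

#include <complex>
#include <cstddef>
#include <vector>

// Butterfly passes of the simple core (sketch). Assumes vec.size() == n is a
// power of 2, the elements are already in bit-reversed order, and expTable
// holds the n/2 twiddle factors e^(-2*pi*i*k/n).
void butterflyPasses(std::vector<std::complex<double>> &vec,
        const std::vector<std::complex<double>> &expTable) {
    std::size_t n = vec.size();
    for (std::size_t size = 2; size <= n; size *= 2) {
        std::size_t halfsize = size / 2;
        std::size_t tablestep = n / size;  // Table stride; varies per level
        for (std::size_t i = 0; i < n; i += size) {
            for (std::size_t j = i, k = 0; j < i + halfsize; j++, k += tablestep) {
                std::complex<double> temp = vec[j + halfsize] * expTable[k];
                vec[j + halfsize] = vec[j] - temp;
                vec[j] += temp;
            }
        }
    }
}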
For the linear vector transform core algorithm, the C++ implementation is about the same speed as the hand-written assembly. The compilers do a good job at vectorizing the code.
For the recursive vector transform core algorithm, the C++ implementation is a few percent slower than the hand-written assembly. The compilers still do a decent job at vectorization.
The linear vector core is faster than the recursive vector core up to around FFT size 1024. The recursive vector core is faster than the linear vector core around FFT size 4096 and up. Hence, the linear core is better at smaller sizes and the recursive core is better at larger sizes, which is what one would expect from the memory hierarchy and cache-oblivious algorithms.
The largest FFT size of 2²⁹ complex elements yielded significant variance in the timings (based on unpublished raw data), so I wouldn’t trust that row for drawing conclusions. This is strange because the memory consumption for that size, including both the data vector and precomputed tables, is about 20 GiB, which clearly fits in the machine’s 32 GiB of RAM without swapping.