
Fourier Neural Operators Just Got a Turbo Boost: UC Riverside Researchers Introduce TurboFNO, a Fully Fused FFT-GEMM-iFFT GPU Kernel That Achieves Up to 150% Speedup over PyTorch

Fourier Neural Operator (FNO) is a powerful framework for learning solution operators of partial differential equations, but existing implementations lack architecture-aware optimization: the Fourier layer executes its forward FFT, frequency filtering, GEMM, zero padding, and inverse FFT as separate stages, incurring multiple kernel launches and heavy global memory traffic. The FFT -> GEMM -> iFFT compute pattern has received little attention in terms of GPU kernel fusion and memory-layout optimization. Existing HPC codes such as Quantum ESPRESSO, Octopus, and CP2K likewise make separate calls to FFT and BLAS routines, and they share three limitations: extra memory-copy operations to handle partial frequency utilization, the absence of native frequency-filtering capabilities in cuFFT, and excessive memory transactions between processing stages.
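
For reference, below is a minimal sketch of this conventional unfused pipeline as it would look in PyTorch. The function name, shapes, and mode count are illustrative assumptions, not the paper's code; the point is that each numbered stage is a separate kernel launch with its own trip through global memory.

```python
import torch

def spectral_conv1d_unfused(x, w, modes):
    """Conventional (unfused) FNO spectral layer. Each stage below maps to a
    separate GPU kernel launch plus a round trip through global memory.
    A hypothetical helper for illustration, not the paper's implementation.
    x: (batch, in_ch, n) real input; w: (in_ch, out_ch, modes) complex weights."""
    x_ft = torch.fft.rfft(x, dim=-1)                 # 1) forward FFT kernel
    x_ft = x_ft[..., :modes]                         # 2) frequency truncation (memory copy)
    y_ft = torch.einsum("bim,iom->bom", x_ft, w)     # 3) complex GEMM on the channel dim
    pad = torch.zeros(x.size(0), w.size(1),
                      x.size(-1) // 2 + 1 - modes,
                      dtype=y_ft.dtype, device=x.device)
    y_ft = torch.cat([y_ft, pad], dim=-1)            # 4) zero padding (another copy)
    return torch.fft.irfft(y_ft, n=x.size(-1), dim=-1)  # 5) inverse FFT kernel

x = torch.randn(8, 32, 256)
w = torch.randn(32, 64, 16, dtype=torch.cfloat)
print(spectral_conv1d_unfused(x, w, modes=16).shape)  # torch.Size([8, 64, 256])
```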

FNO's Fourier layer implements a pipeline that starts with a forward FFT on the input feature map, applies spectral filtering, and reconstructs the output via an inverse FFT. The process requires frequency-domain truncation and zero-padding steps, and because cuFFT lacks native support for input/output pruning, current frameworks such as PyTorch execute these steps as separate memory-copy kernels. Leading FFT libraries, including cuFFT and VkFFT, offer no built-in data truncation. Moreover, a conventional 2D FFT applies two 1D FFT stages along the spatial dimensions, whereas FNO applies its spectral weights along the channel dimension, which opens an opportunity to defer the second 1D FFT stage and keep only the first 1D FFT along a spatial axis ahead of the GEMM.
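
Because the spectral weights act only on the channel dimension while the FFT acts on a spatial axis, the two operations are linear on independent axes and therefore commute. The sketch below (with illustrative shapes, not the paper's code) verifies this numerically, which is what makes deferring the second 1D FFT stage legal.

```python
import torch

# FNO's channel-mixing GEMM commutes with an FFT taken along a spatial axis,
# since both are linear and touch different dimensions.
torch.manual_seed(0)
x = torch.randn(32, 64, 64, dtype=torch.cfloat)   # (channels, h, w)
w = torch.randn(48, 32, dtype=torch.cfloat)       # (out_ch, in_ch)

# Path A: second spatial FFT first, then the channel GEMM.
a = torch.einsum("oc,chw->ohw", w, torch.fft.fft(x, dim=2))
# Path B: channel GEMM first, then the second spatial FFT.
b = torch.fft.fft(torch.einsum("oc,chw->ohw", w, x), dim=2)

print(torch.allclose(a, b, rtol=1e-3, atol=1e-3))  # True: stages can be reordered
```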

Researchers at the University of California, Riverside proposed TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. The approach begins with developing FFT and GEMM kernels from scratch that perform comparably to, or faster than, the closed-source state-of-the-art libraries cuFFT and cuBLAS. The team then introduced an FFT variant that fuses efficiently with the GEMM workload, in which a thread block iterates over the hidden dimension, mirroring the k-loop in GEMM. Additionally, the FFT output is forwarded to the GEMM in a layout designed to achieve 100% shared-memory bank utilization and to let the inverse FFT retrieve GEMM results directly from shared memory.
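
The k-loop alignment can be mimicked at the algorithm level in plain PyTorch: instead of materializing the full FFT output before the GEMM, each iteration transforms one tile of the hidden (channel) dimension and immediately accumulates its partial product, just as a GEMM thread block consumes K-tiles. Tile size and names below are assumptions for illustration, verified against the unfused reference.

```python
import torch

torch.manual_seed(0)
K, M, TILE_K, MODES = 64, 48, 16, 32
x = torch.randn(K, 256)                        # (in_ch, spatial)
w = torch.randn(M, K, dtype=torch.cfloat)      # (out_ch, in_ch)

acc = torch.zeros(M, MODES, dtype=torch.cfloat)
for k0 in range(0, K, TILE_K):
    # "FFT as producer": transform only the current K-tile, keep MODES outputs.
    tile_ft = torch.fft.rfft(x[k0:k0 + TILE_K], dim=-1)[:, :MODES]
    acc += w[:, k0:k0 + TILE_K] @ tile_ft      # GEMM k-loop partial sum

# Reference: unfused FFT-then-GEMM over the whole hidden dimension.
ref = w @ torch.fft.rfft(x, dim=-1)[:, :MODES]
print(torch.allclose(acc, ref, rtol=1e-3, atol=1e-3))  # True
```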

TurboFNO integrates these optimized FFT and CGEMM kernels to enable effective fusion on top of the built-in FFT optimizations. The kernel-fusion strategy proceeds through three levels: FFT-GEMM fusion, GEMM-iFFT fusion, and full FFT-GEMM-iFFT fusion. Each level involves aligning the FFT workload with the GEMM dataflow, resolving data-layout mismatches, and eliminating shared-memory bank conflicts. Key techniques include modifying the FFT output layout to match the GEMM input format, applying thread swizzling for conflict-free shared-memory access, and integrating the inverse FFT into the epilogue phase of the CGEMM to bypass intermediate global-memory writes and improve memory locality.
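
Continuing the previous sketch, the three fusion levels culminate in the pattern below: the (simulated) k-loop accumulator is zero-padded and inverse-transformed in the epilogue, so the intermediate spectrum never round-trips through global memory. This is plain Python standing in for on-GPU shared-memory dataflow, not the actual kernel.

```python
import torch

def fused_epilogue_sketch(x, w, modes, n, tile_k=16):
    """Algorithm-level stand-in for full FFT-GEMM-iFFT fusion: in the real
    kernel the accumulator lives in registers/shared memory and the iFFT runs
    in the CGEMM epilogue; here the same dataflow is sequenced in Python."""
    acc = torch.zeros(w.size(0), modes, dtype=torch.cfloat)
    for k0 in range(0, x.size(0), tile_k):                  # GEMM k-loop
        tile_ft = torch.fft.rfft(x[k0:k0 + tile_k], dim=-1)[:, :modes]
        acc += w[:, k0:k0 + tile_k] @ tile_ft
    # Epilogue: zero-pad the kept modes and apply the inverse FFT in place of
    # a plain store-to-global-memory step.
    spectrum = torch.zeros(w.size(0), n // 2 + 1, dtype=torch.cfloat)
    spectrum[:, :modes] = acc
    return torch.fft.irfft(spectrum, n=n, dim=-1)

x = torch.randn(64, 256)
w = torch.randn(48, 64, dtype=torch.cfloat)
print(fused_epilogue_sketch(x, w, modes=32, n=256).shape)  # torch.Size([48, 256])
```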

TurboFNO performed well in both 1D and 2D FNO evaluations. In the 1D FNO tests, the optimized FFT-CGEMM-iFFT workflow reaches up to 100% speedup over PyTorch, with an average improvement of 50%. These gains come largely from FFT pruning, which reduces computation by 25%-67.5%. The fully fused FFT-CGEMM-iFFT kernel delivers up to 150% speedup over PyTorch and a further 10%-20% improvement over the partial-fusion strategies. Similarly, in 2D FNO the optimized workflow outperforms PyTorch with an average speedup of more than 50% and a maximum improvement of 100%. The fully fused 2D kernel achieves a 50%-105% speedup without performance degradation, despite the extra work of aligning the FFT workload layout with the CGEMM dataflow.

In conclusion, the researchers introduced TurboFNO, the first fully fused GPU kernel that integrates FFT, CGEMM, and iFFT to accelerate Fourier Neural Operators. They developed a range of architecture-aware optimizations to overcome the inefficiencies of conventional FNO implementations, such as excessive kernel launches and global memory traffic. These include a custom FFT kernel with built-in frequency filtering and zero padding, a GEMM-compatible FFT variant that mimics k-loop behavior, and a shared-memory swizzling strategy that boosts bank utilization from 25% to 100%. TurboFNO achieves speedups of up to 150% and maintains an average performance gain of 67% across all tested configurations.
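
The 25%-to-100% bank-utilization figure is easy to reproduce with a toy model of the 32 shared-memory banks. The snippet below uses the classic padded-stride trick as a stand-in for the paper's thread-swizzling scheme, which is not spelled out here: a stride-4 word access from one warp hits only 8 banks, while a stride of 5, coprime with 32, hits all of them.

```python
BANKS = 32  # CUDA shared memory has 32 four-byte banks

def utilization(addresses):
    """Fraction of banks touched by one warp's 32 simultaneous word accesses."""
    return len({a % BANKS for a in addresses}) / BANKS

# Naive layout: 32 threads each read at a 4-word stride -> 4-way conflicts.
naive = [t * 4 for t in range(32)]
# Padded layout (a classic fix, standing in for the paper's thread swizzling):
# stride 5 is coprime with 32, so the same 32 accesses hit every bank once.
padded = [t * 5 for t in range(32)]

print(f"naive:  {utilization(naive):.0%}")   # 25%
print(f"padded: {utilization(padded):.0%}")  # 100%
```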





Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.
