Abstract
We investigate the utilization of vector registers (VRs) on reducing memory references for single instruction multiple data fast Fourier transform calculation. We propose to group the butterfly computations in several consecutive stages to maximize utilization of the available VRs and take the advantage of the symmetries in twiddle factors. All the butterflies sharing identical twiddle factors are clustered and computed together to further improve performance. The relationship between the number of fused stages and the number of available VRs is then examined. Experimental results on different platforms show that the proposed method is effective.