CLC number: TP302.7
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2011-11-04
Cited: 2
Clicked: 9106
Dan Wu, Xue-cheng Zou, Kui Dai, Jin-li Rao, Pan Chen, Zhao-xia Zheng. Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture[J]. Journal of Zhejiang University Science C, 2011, 12(12): 976-989.
@article{title="Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture",
author="Dan Wu, Xue-cheng Zou, Kui Dai, Jin-li Rao, Pan Chen, Zhao-xia Zheng",
journal="Journal of Zhejiang University Science C",
volume="12",
number="12",
pages="976-989",
year="2011",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1100027"
}
%0 Journal Article
%T Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture
%A Dan Wu
%A Xue-cheng Zou
%A Kui Dai
%A Jin-li Rao
%A Pan Chen
%A Zhao-xia Zheng
%J Journal of Zhejiang University SCIENCE C
%V 12
%N 12
%P 976-989
%@ 1869-1951
%D 2011
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1100027
TY - JOUR
T1 - Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture
A1 - Dan Wu
A1 - Xue-cheng Zou
A1 - Kui Dai
A1 - Jin-li Rao
A1 - Pan Chen
A1 - Zhao-xia Zheng
J0 - Journal of Zhejiang University Science C
VL - 12
IS - 12
SP - 976
EP - 989
%@ 1869-1951
Y1 - 2011
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C1100027
Abstract: The fast Fourier transform (FFT) is a fundamental kernel of many computation-intensive scientific applications. This paper deals with an implementation of the FFT on the accelerator system, a heterogeneous multi-core architecture to accelerate computation-intensive parallel computing in scientific and engineering applications. The Engineering and Scientific Computation Accelerator (ESCA) consists of a control unit and a single instruction multiple data (SIMD) processing element (PE) array, in which PEs communicate with each other via a hierarchical two-level network-on-chip (NoC) with high bandwidth and low latency. We exploit the architecture features of ESCA to implement a parallel FFT algorithm efficiently. Experimental results show that both the proposed parallel FFT algorithm and the ESCA architecture are scalable. The 16-bit fixed-point parallel FFT performance of ESCA is compared with a published work to prove the superiority of the mapping algorithm and the hardware architecture. The floating-point parallel FFT performances of ESCA are evaluated and compared with those of the IBM Cell processor and GPU to demonstrate the computing power of the ESCA system for high performance applications.
[1]Agarwal, R.C., Gustavson, F.G., Zubair, M., 1994. A High Performance Parallel Algorithm for 1-D FFT. Proc. Supercomputing, p.34-40.
[2]Bahn, J.H., Yang, J., Bagherzadeh, N., 2008. Parallel FFT Algorithms on Network-on-Chips. 5th Int. Conf. on Information Technology: New Generation, p.1087-1093.
[3]Barker, K.J., Davis, K., Hoisie, A., Kerbyson, D.J., Lang, M., Pakin, S., Sancho, J.C., 2008. Entering the Petaflop Era: the Architecture and Performance of Roadrunner. Int. Conf. for High Performance Computing, Networking, Storage and Analysis, p.1-12.
[4]Barua, S., Thulasiram, R.K., Thulasiraman, P., 2004. Improving Data Locality in Parallel Fast Fourier Transform Algorithm for Pricing Financial Derivatives. Proc. 18th Int. Parallel and Distributed Processing Symp., p.235-240.
[5]Bellens, P., Perez, J.M., Badia, R.M., Labarta, J., 2006. CellSs: a Programming Model for the Cell BE Architecture. Proc. ACM/IEEE SC Conf., p.5-15.
[6]benchFFT, 2003. FFT Benchmark Methodology. Available from http://www.fftw.org/speed/method.html [Accessed on Jan. 16, 2011].
[7]Cooley, J.W., Tukey, J.W., 1965. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19(90):297-301.
[8]Cvetanovic, Z., 1987. Performance analysis of the FFT algorithm on a shared-memory parallel architecture. IBM J. Res. Dev., 31(4):435-451.
[9]Deng, Y.D., Maly, W.P., 2010. 3-Dimensional VLSI: a 2.5-Dimensional Integration Scheme. Tsinghua University Publishing House, Beijing, China, p.144-158.
[10]Frigo, M., Johnson, S.G., 2005. The design and implementation of FFTW3. Proc. IEEE, 93(2):216-231.
[11]Frigo, M., Johnson, S.G., 2007. FFTW on the Cell Processor. Available from http://www.fftw.org/cell/index.html [Accessed on Jan. 16, 2011].
[12]IBM, 2005. The Cell Architecture. Available from http://www.research.ibm.com/cell/home.html [Accessed on Jan. 16, 2011].
[13]Joint Cell Competence Center, 2009. FFT Performance Results of IBM QS22 Cell Blade. Available from http://cell.icm.edu.pl/index.php/FFTW_on_Cell [Accessed on May 10, 2011].
[14]Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D., 2005. Introduction to the Cell multiprocessor. IBM J. Res. Dev., 49(4-5):589-604.
[15]Kistler, M., Gunnels, J., Brokenshire, D., Benton, B., 2009. Programming the Linpack benchmark for the IBM PowerXCell 8i processor. Sci. Progr., 17(1-2):43-57.
[16]Nishikawa, Y., Koibuchi, M., Yoshimi, M., Miura, K., Amano, H., 2007. Performance Improvement Methodology for ClearSpeed’s CSX600. Int. Conf. on Parallel Processing, p.77.
[17]NVIDIA, 2009. High Performance Computing — Supercomputing with Tesla GPUs. Available from http://www.nvidia.com/object/tesla_computing_solutions.html [Accessed on May 10, 2011].
[18]NVIDIA, 2010. Tesla C2050 Performance Benchmarks. Available from http://nvworld.ru/files/articles/calculations-on-gpu-advantages-fermi/fermipeformance.pdf [Accessed on May 10, 2011].
[19]Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Philips, J.C., 2008. GPU computing. Proc. IEEE, 96(5):879-899.
[20]Swarztrauber, P.N., 1984. FFT algorithms for vector computers. Parall. Comput., 1(1):45-63.
[21]Takahashi, D., 2000. High-Performance Parallel FFT Algorithms for the HITACHI SR8000. Proc. 4th Int. Conf./Exhibition on High Performance Computing in the Asia-Pacific Region, p.192-199.
[22]Takahashi, D., 2002. A blocking Algorithm for Parallel 1-D FFT on Shared-Memory Parallel Computers. 6th Int. Conf. of Applied Parallel Computing, Advanced Scientific Computing, p.380-389.
[23]Taylor, M.B., Psota, J., Saraf, A., Shnidman, N., Strumpen, V., Frank, M., Amarasinghe, S., Agarwal, A., Lee, W., Miller, J., et al., 2004. Evaluation of the Raw Microprocessor: an Exposed-Wire-Delay Architecture for ILP and Streams. Proc. 31st Annual Int. Symp. on Computer Architecture, p.2-13.
[24]Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K., 2006. The Potential of the Cell Processor for Scientific Computing. Proc. 3rd Conf. on Computing Frontiers, p.9-20.
[25]Wu, D., Dai, K., Zou, X.C., Rao, J.L., Chen, P., 2010. A High Efficient on-Chip Interconnect Network in SIMD CMPs. 10th Int. Conf. on Algorithms and Architecture for Parallel Processing, p.149-162.
Open peer comments: Debate/Discuss/Question/Opinion
<1>