CLC number: TP312
On-line Access: 2015-11-04
Received: 2015-01-30
Revision Accepted: 2015-06-30
Crosschecked: 2015-10-19
Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen. Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations[J]. Frontiers of Information Technology & Electronic Engineering, 2015, 16(11): 899-916.
%0 Journal Article
%T Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
%A Mei Wen
%A Da-fei Huang
%A Chang-qing Xun
%A Dong Chen
%J Frontiers of Information Technology & Electronic Engineering
%V 16
%N 11
%P 899-916
%@ 2095-9184
%D 2015
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1500032
Abstract: OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may harm CPU performance. Typically, the use of OpenCL’s local memory on multi-core/many-core CPUs may degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can then be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form and is accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel’s many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable to the corresponding OpenMP performance.
In this paper, the authors present a transformation approach for GPU-specific OpenCL kernels targeting multi-/many-core CPUs. In particular, they remove local memory usage (and the related synchronization) when found unnecessary, and introduce post-optimizations taking both vectorization and data locality into account. The experimental evaluation shows that their method leads to good performance compared to Intel’s OpenCL implementation and OpenMP.
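To make the idea of removing local-memory arrays and their barriers concrete, the following is a minimal sketch in Python that emulates work-items with ordinary loops. It is not the authors' tool chain and does not implement the array-access-descriptor analysis; the three-point stencil, the function names gpu_style and cpu_friendly, and the work-group size of 4 are all assumptions chosen for illustration. The first function mimics a GPU-specific kernel that stages a tile in local memory and synchronizes before using it; the second shows the same computation once the tile and the barrier are gone and each work-group is a single coalesced loop.

```python
# Illustrative emulation only: all names and sizes below are hypothetical.
WG_SIZE = 4  # assumed work-group size


def gpu_style(src, wg_size=WG_SIZE):
    """Emulate a GPU-specific kernel: stage a tile in local memory,
    synchronize, then compute a three-point sum from the tile."""
    n = len(src)
    assert n % wg_size == 0, "input length must be a multiple of the work-group size"
    dst = [0.0] * n
    for group in range(n // wg_size):
        base = group * wg_size
        tile = [0.0] * (wg_size + 2)  # stands in for a __local array with a halo
        # Phase 1: each work-item copies its element (edge items also copy the halo).
        for lid in range(wg_size):
            gid = base + lid
            tile[lid + 1] = src[gid]
            if lid == 0:
                tile[0] = src[gid - 1] if gid > 0 else 0.0
            if lid == wg_size - 1:
                tile[wg_size + 1] = src[gid + 1] if gid + 1 < n else 0.0
        # A barrier would sit here; on a CPU it forces the work-item loop to split.
        # Phase 2: each work-item reads its neighbours from the tile.
        for lid in range(wg_size):
            dst[base + lid] = tile[lid] + tile[lid + 1] + tile[lid + 2]
    return dst


def cpu_friendly(src, wg_size=WG_SIZE):
    """Same computation with the tile and barrier removed: one coalesced
    loop per work-group that reads the global array directly."""
    n = len(src)
    assert n % wg_size == 0, "input length must be a multiple of the work-group size"
    dst = [0.0] * n
    for group in range(n // wg_size):
        base = group * wg_size
        for lid in range(wg_size):
            gid = base + lid
            left = src[gid - 1] if gid > 0 else 0.0
            right = src[gid + 1] if gid + 1 < n else 0.0
            dst[gid] = left + src[gid] + right
    return dst


data = [float(i) for i in range(8)]
same = gpu_style(data) == cpu_friendly(data)  # both forms should agree
```

In this toy case the tile only duplicates data that a CPU cache would already hold, so dropping it removes a copy phase and a synchronization point without changing the result. Whether such a removal is legal and profitable for a real kernel is what the paper's access-pattern analysis is meant to decide.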