
CLC number: TP393

On-line Access: 2013-11-06

Received: 2013-04-02

Revision Accepted: 2013-09-12

Crosschecked: 2013-10-15


Journal of Zhejiang University SCIENCE C 2013 Vol.14 No.11 P.859-872


Efficient fine-grained shared buffer management for multiple OpenCL devices

Author(s):  Chang-qing Xun, Dong Chen, Qiang Lan, Chun-yuan Zhang

Affiliation(s):  College of Computer, National University of Defense Technology, Changsha 410073, China

Corresponding email(s):   xunchangqing@nudt.edu.cn

Key Words:  Shared buffer, OpenCL, Heterogeneous programming, Fine grained

Chang-qing Xun, Dong Chen, Qiang Lan, Chun-yuan Zhang. Efficient fine-grained shared buffer management for multiple OpenCL devices[J]. Journal of Zhejiang University Science C, 2013, 14(11): 859-872.

Publisher:  Zhejiang University Press & Springer

ISSN:  1869-1951

DOI:  10.1631/jzus.C1300078

Abstract:  OpenCL provides full code portability across hardware platforms, which makes it a good candidate for programming heterogeneous systems that typically consist of a host processor and several accelerators. To make full use of the computing capacity of such a system, however, programmers must manage the diverse OpenCL-enabled devices explicitly, which includes distributing the workload among devices and managing data transfers between them. These tedious tasks pose a considerable challenge for programmers. This paper presents distributed shared OpenCL memory (DSOM), which relieves users of explicit data transfer management by supporting buffers shared across devices. DSOM allocates shared buffers in system memory and treats on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, DSOM includes a kernel parser for buffer access range analysis. A basic modified, shared, invalid (MSI) cache coherence protocol keeps the cache buffers coherent. In addition, we propose a novel strategy that minimizes communication cost between devices by launching each necessary data transfer as early as possible, which allows data transfers to overlap with kernel execution. Experimental results show that the buffer access range analysis is broadly applicable and that DSOM is efficient.
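The coherence idea in the abstract can be illustrated with a small sketch. It is not DSOM's implementation: the class, the method names, the fixed block granularity, and the read-for-ownership step before a write are all assumptions made for illustration. The only ideas taken from the abstract are that system memory holds the shared buffer, that device memory acts as a software-managed cache, and that each cached range carries a modified, shared, or invalid state. The access ranges passed in stand for what a kernel parser might report.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


HOST = "host"  # hypothetical name for the copy held in system memory


class SharedBuffer:
    """Tracks an MSI state per (holder, block) for one shared buffer.

    Illustrative only: block granularity and all names are assumptions.
    """

    def __init__(self, n_blocks, devices):
        # The system-memory copy starts valid; every device cache starts empty.
        self.state = {HOST: [State.SHARED] * n_blocks}
        for dev in devices:
            self.state[dev] = [State.INVALID] * n_blocks
        # Log of (source, destination, block) copies a runtime would issue.
        self.transfers = []

    def _fetch(self, dev, block):
        """Make `block` valid on `dev`, going through system memory."""
        if self.state[dev][block] is not State.INVALID:
            return  # already cached, no transfer needed
        owner = next(
            (h for h, s in self.state.items() if s[block] is State.MODIFIED),
            None,
        )
        if owner is not None:
            if owner != HOST:
                # Write the dirty block back to system memory first.
                self.transfers.append((owner, HOST, block))
                self.state[HOST][block] = State.SHARED
            self.state[owner][block] = State.SHARED
        if dev != HOST:
            self.transfers.append((HOST, dev, block))
        self.state[dev][block] = State.SHARED

    def read(self, dev, lo, hi):
        """Prepare blocks [lo, hi) for a kernel on `dev` that only reads them."""
        for block in range(lo, hi):
            self._fetch(dev, block)

    def write(self, dev, lo, hi):
        """Prepare blocks [lo, hi) for a kernel on `dev` that writes them."""
        for block in range(lo, hi):
            self._fetch(dev, block)  # read-for-ownership (an assumption)
            for holder in self.state:
                self.state[holder][block] = State.INVALID
            self.state[dev][block] = State.MODIFIED


buf = SharedBuffer(n_blocks=4, devices=["gpu0", "gpu1"])
buf.write("gpu0", 0, 2)  # gpu0 writes blocks 0 and 1
buf.read("gpu1", 1, 3)   # gpu1 reads blocks 1 and 2
for src, dst, block in buf.transfers:
    print(f"copy block {block}: {src} -> {dst}")
```

In this sketch only the blocks inside an accessed range ever move: block 0 stays dirty on `gpu0` and block 3 is never copied, which is the benefit that fine-grained tracking is meant to provide over whole-buffer coherence. A runtime could issue the logged copies as soon as the access ranges are known, so that they overlap with kernels that are still running.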



