CLC number: TP368.1
On-line Access: 2018-04-09
Received: 2017-01-19
Revision Accepted: 2017-08-29
Crosschecked: 2018-02-14
Yang Zhang, Zuo-cheng Xing, Cang Liu, Chuan Tang. CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(2): 206-220.
@article{title="CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs",
author="Yang Zhang, Zuo-cheng Xing, Cang Liu, Chuan Tang",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="2",
pages="206-220",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1700059"
}
%0 Journal Article
%T CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs
%A Yang Zhang
%A Zuo-cheng Xing
%A Cang Liu
%A Chuan Tang
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 2
%P 206-220
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1700059
TY - JOUR
T1 - CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs
A1 - Yang Zhang
A1 - Zuo-cheng Xing
A1 - Cang Liu
A1 - Chuan Tang
JO - Frontiers of Information Technology & Electronic Engineering
VL - 19
IS - 2
SP - 206
EP - 220
SN - 2095-9184
Y1 - 2018
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1700059
ER -
Abstract: As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator used widely in most recent supercomputers. It uses a large number of threads to hide long latencies with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor (SM). The GPU cache is inefficient because of a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency because of the GPU's poor warp scheduling method. Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling methods, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to improve cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line, and employ a prioritised cache allocation unit (PCAU) that combines the data reuse information with time-stamp information to evict the lines least likely to be reused. Moreover, the warp scheduler uses the locality information to build an intelligent warp reordering scheme that captures locality and hides latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
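The eviction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`LocalityDetector`, `CacheSet`), the saturating-counter width, and the exact priority rule (lowest PC reuse score first, then oldest time stamp) are assumptions chosen only to show how PC-based reuse information and time stamps might be combined when choosing a victim line.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    pc: int               # PC of the load that allocated the line
    last_use: int         # time stamp of the most recent access
    reused: bool = False  # set once the line is hit after allocation


class LocalityDetector:
    """Hypothetical PC-indexed table of saturating reuse counters."""

    def __init__(self, max_count=3):
        self.max_count = max_count
        self.counters = defaultdict(int)

    def on_hit(self, pc):
        # A hit suggests that lines allocated by this PC tend to be reused.
        self.counters[pc] = min(self.max_count, self.counters[pc] + 1)

    def on_evict_unused(self, pc):
        # A line evicted without any hit lowers the PC's reuse score.
        self.counters[pc] = max(0, self.counters[pc] - 1)

    def reuse_score(self, pc):
        return self.counters[pc]


class CacheSet:
    """One cache set whose victim choice mixes reuse score and time stamp."""

    def __init__(self, ways, detector):
        self.ways = ways
        self.detector = detector
        self.lines = []
        self.clock = 0

    def access(self, tag, pc):
        """Return True on a hit, False on a miss (the line is then allocated)."""
        self.clock += 1
        for line in self.lines:
            if line.tag == tag:
                line.last_use = self.clock
                line.reused = True
                self.detector.on_hit(line.pc)
                return True
        if len(self.lines) == self.ways:
            # Assumed priority rule: lowest reuse score first, oldest line on ties.
            victim = min(
                self.lines,
                key=lambda l: (self.detector.reuse_score(l.pc), l.last_use),
            )
            if not victim.reused:
                self.detector.on_evict_unused(victim.pc)
            self.lines.remove(victim)
        self.lines.append(Line(tag, pc, self.clock))
        return False
```

In a two-way set, if the load at PC `0x10` touches tag `0xA` twice and a streaming load at PC `0x20` then brings in tags `0xB` and `0xC`, this sketch evicts `0xB` rather than the older `0xA`, so a later access to `0xA` still hits. Plain LRU would have evicted `0xA` instead.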