Publishing Service

Polishing & Checking

Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs

Abstract: As we approach the exascale era in supercomputing, designing a balanced computer system with a powerful computing ability and low power requirements has becoming increasingly important. The graphics processing unit (GPU) is an accelerator used widely in most of recent supercomputers. It adopts a large number of threads to hide a long latency with a high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor (SM). The GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency due to GPU&x2019;s poor warp scheduling method. Thus, benefits of GPU&x2019;s high computing ability are reduced dramatically by the poor cache management and warp scheduling methods, which limit the system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to promote cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU) which coordinates the data reuse information with the time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme to capture locality and hide latency. Simulation results show that CWLP provides a speedup up to 19.8% and an average improvement of 8.8% over the baseline methods.

Key words: Locality, Graphics processing unit (GPU), Cache allocation, Warp scheduling

Chinese Summary  <30> CWLP:一种在GPU中协同的线程束调度和局部性保护的高速缓存分配策略

概要:随着我们正在接近百亿亿次超级计算机的时代,一个拥有强大运算能力和低能耗的均衡的计算机系统变得越来越重要。GPUs是在最近投入运营的超级计算机中被广泛使用的加速器。它采用大规模多块程来隐藏长访存延迟,同时它拥有高能效。相对于其强大的运算能力,GPUs的每个流多核处理器只有几兆的片上资源。面向吞吐率的执行模型与它的高速缓存层次结构设计不匹配,使得GPUs缓存表现出较差的运行效率。由于片上存储器的严重缺少,受较差的缓存性能影响,GPU的计算能力急剧下降,限制了系统性能和能效。提出一种协同的线程束调度和局部性保护的缓存分配策略(CWLP),以充分利用数据局部性和隐藏延迟。首先,设计了一种基于指令PC的局部性保护方法(LPC)以提升GPU性能。使用一个基于PC的收集器收集每个高速缓存块的重用信息。在获取缓存块的动态重用信息后,采用一个智能缓存分配单元(PCAU),它结合了重用信息和LRU(最近最少使用)替换策略,以找到拥有最少局部性的缓存块并将其逐出。此外,局部性信息被线程束调度器用来实现一个智能的重排序策略,用以获取局部性和隐藏延迟。实验结果表明,CWLP能够提供高达19.8%的性能加速比和超过基准策略平均8.8%的性能提升。

关键词组:局部性GPU;cache分配;线程束调度


Share this article to: More

Go to Contents

References:

<Show All>

Open peer comments: Debate/Discuss/Question/Opinion

<1>

Please provide your name, email address and a comment





DOI:

10.1631/FITEE.1700059

CLC number:

TP368.1

Download Full Text:

Click Here

Downloaded:

2396

Download summary:

<Click Here> 

Downloaded:

1749

Clicked:

7121

Cited:

0

On-line Access:

2024-08-27

Received:

2023-10-17

Revision Accepted:

2024-05-08

Crosschecked:

2018-02-14

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE