Publishing Service

Polishing & Checking

Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit

Abstract: The density peak (DP) algorithm has been widely used in scientific research due to its novel and effective peak density-based clustering approach. However, the DP algorithm uses each pair of data points several times when determining cluster centers, yielding high computational complexity. In this paper, we focus on accelerating the time-consuming density peaks algorithm with a graphics processing unit (GPU). We analyze the principle of the algorithm to locate its computational bottlenecks, and evaluate its potential for parallelism. In light of our analysis, we propose an efficient parallel DP algorithm targeting on a GPU architecture and implement this parallel method with compute unified device architecture (CUDA), called the ‘CUDA-DP platform’. Specifically, we use shared memory to improve data locality, which reduces the amount of global memory access. To exploit the coalescing accessing mechanism of GPU, we convert the data structure of the CUDA-DP program from array of structures to structure of arrays. In addition, we introduce a binary search-and-sampling method to avoid sorting a large array. The results of the experiment show that CUDA-DP can achieve a 45-fold acceleration when compared to the central processing unit based density peaks implementation.

Key words: Density peak, Graphics processing unit, Parallel computing, Clustering

Chinese Summary  <28> 基于GPU的密度峰值并行聚类算法

概要:基于密度峰值的聚类方法DP(density peak)由于其新颖有效的特点而广泛应用于科学研究。然而,当确定集群中心时,DP会对每对数据点操作多次,从而导致较高的计算复杂度。在本文中,我们提出了一种基于GPU(graphics processing unit)的高效并行密度峰值算法。我们分析密度峰值聚类算法的原理来研究其计算瓶颈,并评估其并行的潜力。根据分析,我们提出了CUDA-DP(compute unified device architecture-DP),一种针对GPU架构的高效并行密度峰值聚类算法,并用CUDA实现了这种并行方法。具体来说,我们使用共享内存减少了全局内存访问量。更进一步,为了利用GPU的合并访问机制,我们将CUDA-DP程序的数据结构从AOS(array of structures)重构为SOA(structure of arrays)。另外,我们分别引入二进制搜索方法和采样方法,以避免对距离矩阵进行排序造成的计算开销。实验结果表明,与基于CPU的密度峰值实现相比,CUDA-DP可以实现超过45倍的加速。

关键词组:GPU;密度峰值;聚类;并行计算


Share this article to: More

Go to Contents

References:

<Show All>

Open peer comments: Debate/Discuss/Question/Opinion

<1>

Please provide your name, email address and a comment





DOI:

10.1631/FITEE.1601786

CLC number:

TP301.6

Download Full Text:

Click Here

Downloaded:

2640

Clicked:

7041

Cited:

0

On-line Access:

2017-07-31

Received:

2016-12-17

Revision Accepted:

2017-04-17

Crosschecked:

2017-07-14

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE