CLC number: TP319
On-line Access: 2020-06-12
Received: 2019-02-09
Revision Accepted: 2019-07-12
Crosschecked: 2020-05-09
Cited: 0
Clicked: 5410
Citations: Bibtex RefMan EndNote GB/T7714
Yi-shui Li, Xin-hai Chen, Jie Liu, Bo Yang, Chun-ye Gong, Xin-biao Gan, Sheng-guo Li, Han Xu. OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(6): 939-949.
@article{title="OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype",
author="Yi-shui Li, Xin-hai Chen, Jie Liu, Bo Yang, Chun-ye Gong, Xin-biao Gan, Sheng-guo Li, Han Xu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="21",
number="6",
pages="939-949",
year="2020",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1900075"
}
%0 Journal Article
%T OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype
%A Yi-shui Li
%A Xin-hai Chen
%A Jie Liu
%A Bo Yang
%A Chun-ye Gong
%A Xin-biao Gan
%A Sheng-guo Li
%A Han Xu
%J Frontiers of Information Technology & Electronic Engineering
%V 21
%N 6
%P 939-949
%@ 2095-9184
%D 2020
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1900075
TY - JOUR
T1 - OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype
A1 - Yi-shui Li
A1 - Xin-hai Chen
A1 - Jie Liu
A1 - Bo Yang
A1 - Chun-ye Gong
A1 - Xin-biao Gan
A1 - Sheng-guo Li
A1 - Han Xu
J0 - Frontiers of Information Technology & Electronic Engineering
VL - 21
IS - 6
SP - 939
EP - 949
%@ 2095-9184
Y1 - 2020
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/FITEE.1900075
Abstract: With the rapid increase of the size of applications and the complexity of the supercomputer architecture, topology-aware process mapping becomes increasingly important. High communication cost has become a dominant constraint of the performance of applications running on the supercomputer. To avoid a bad mapping strategy which can lead to terrible communication performance, we propose an optimized heuristic topology-aware mapping algorithm (OHTMA). The algorithm attempts to minimize the hop-byte metric that we use to measure the mapping results. OHTMA incorporates a new greedy heuristic method and pair-exchange-based optimization. It reduces the number of long-distance communications and effectively enhances the locality of the communication. Experimental results on the Tianhe-3 exascale supercomputer prototype indicate that OHTMA can significantly reduce the communication costs.
[1]Agarwal T, Sharma A, Laxmikant A, et al., 2006. Topology-aware task mapping for reducing communication contention on large parallel machines. Proc 20th IEEE Int Parallel & Distributed Processing Symp, p.1-10.
[2]Bailey DH, Barszcz E, Barton JT, et al., 1991. The NAS parallel benchmarks—summary and preliminary results. Proc ACM/IEEE Conf on Supercomputing, p.158-165.
[3]Bhatele A, 2010. Automating Topology Aware Mapping for Supercomputers. PhD Thesis, University of Illinois at Urbana-Champaign, Urbana, USA.
[4]Bhatele A, Laxmikant V, 2009. An evaluative study on the effect of contention on message latencies in large supercomputers. Proc IEEE Int Symp on Parallel & Distributed Processing, p.1-8.
[5]Brandfass B, Alrutz T, Gerhold T, 2013. Rank reordering for MPI communication optimization. Comput Fluid, 80:372-380.
[6]Chen X, Liu J, Li S, et al., 2018. TAMM: a new topology-aware mapping method for parallel applications on the Tianhe-2A supercomputer. Proc 18th Int Conf on Algorithms and Architectures for Parallel Processing, p.242-256.
[7]Deveci M, Kaya K, Uccar B, et al., 2015. Fast and high quality topology-aware task mapping. Proc IEEE Int Parallel and Distributed Processing Symp, p.197-206.
[8]Hoefler T, Snir M, 2011. Generic topology mapping strategies for large-scale parallel architectures. Proc Int Conf on Supercomputing, p.75-84.
[9]Hoefler T, Jeannot E, Mercier G, 2014. An overview of topology mapping algorithms and techniques in high-performance computing. In: Jeannot E, Žilinskas J (Eds.), High-Performance Computing on Complex Environments. Wiley, Hoboken, New Jersey, USA.
[10]Jeannot E, Mercier G, 2010. Near-optimal placement of MPI processes on hierarchical NUMA architectures. In: D’Ambra P, Guarracino M, Talia D (Eds.), Euro-Par 2010 Parallel Processing. Springer Berlin Heidelberg, Germany, p.199-210.
[11]Jeannot E, Mercier G, Tessier F, 2014. Process placement in multicore clusters: algorithmic issues and practical techniques. IEEE Trans Parall Distrib Syst, 25(4):993-1002.
[12]Karypis G, Kumar V, 1998. METIS—A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes and Computing Fill-Reducing Ordering of Sparse Matrices. Technical Report, University of Minnesota, Minneapolis, USA.
[13]Liao X, Pang Z, Wang K, et al., 2015. High performance interconnect network for Tianhe system. J Comput Sci Technol, 30(2):259-272.
[14]Mercier G, Clet-Ortega J, 2009. Towards an efficient process placement policy for MPI applications in multicore environments. In: Ropo M, Westerholm J, Dongarra J (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface. Springer Berlin Heidelberg, Germany, p.104-115.
[15]Mirsadeghi SH, Afsahi A, 2016. PTRAM: a parallel topology-and routing-aware mapping framework for large-scale HPC systems. Proc IEEE Int Parallel and Distributed Processing Symp Workshops, p.386-396.
[16]Pellegrini F, Roman J, 1996. SCOTCH: a software package for static mapping by dual recursive bipartitioning of process and architecture graphs. Proc Int Conf and Exhibition on High-Performance Computing and Networking, p.493-498.
[17]Rodrigues E, Madruga F, Navaux P, et al., 2009. Multi-core aware process mapping and its impact on communication overhead of parallel applications. Int Symp on Computers and Communications, p.811-817.
[18]Sahni S, Gonzalez T, 1976. P-complete approximation problems. J ACM, 23(3):555-565.
[19]Sudheer CD, Srinivasan A, 2012. Optimization of the hop-byte metric for effective topology aware mapping. Proc 19th Int Conf on High Performance Computing, p.1-9.
[20]Tuncer O, Leung VJ, Coskun AK, 2015. PaCMap: topology mapping of unstructured communication patterns onto non-contiguous allocations. Proc 29th ACM on Int Conf on Supercomputing, p.37-46.
[21]Walshaw C, Cross M, 2007. JOSTLE–-parallel multilevel graph-partitioning software: an overview. In: Magoul‘es F (Ed.), Mesh Partitioning Techniques and Domain Decomposition Methods. Saxe-Coburg Publications, Stirlingshire, UK, p.22-58.
[22]Wang T, Qing P, Wei D, et al., 2015. Optimization of process-to-core mapping based on clustering analysis. Chin J Comput, 38(5):1044-1055 (in Chinese).
[23]Wylie BJN, Böhme D, Mohr B, et al., 2010. Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset. Proc IEEE Int Symp on Parallel & Distributed Processing, Workshops and PhD Forum, p.1-8.
[24]Zerr RJ, Baker RS, 2013. Snap: SN (Discrete Ordinates) Application Proxy-Proxy Description. Technical Report, LA-UR-13-21070, Los Alamos National Laboratory, Los Alamos, USA.
Open peer comments: Debate/Discuss/Question/Opinion
<1>