CLC number: TP181
On-line Access: 2025-02-10
Received: 2023-10-10
Revision Accepted: 2023-10-17
Crosschecked: 2025-02-18
ORCID: https://orcid.org/0000-0001-9743-2034
Yanqi SHI, Peng LIANG, Hao ZHENG, Linbo QIAO, Dongsheng LI. Automatic parallelism strategy generation with minimal memory redundancy[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(1): 109-118.
@article{Shi2025automatic,
title="Automatic parallelism strategy generation with minimal memory redundancy",
author="Yanqi SHI and Peng LIANG and Hao ZHENG and Linbo QIAO and Dongsheng LI",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="1",
pages="109-118",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2300684"
}
%0 Journal Article
%T Automatic parallelism strategy generation with minimal memory redundancy
%A Yanqi SHI
%A Peng LIANG
%A Hao ZHENG
%A Linbo QIAO
%A Dongsheng LI
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 1
%P 109-118
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2300684
TY - JOUR
T1 - Automatic parallelism strategy generation with minimal memory redundancy
A1 - Yanqi SHI
A1 - Peng LIANG
A1 - Hao ZHENG
A1 - Linbo QIAO
A1 - Dongsheng LI
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 1
SP - 109
EP - 118
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2300684
ER -
Abstract: Large-scale deep learning models are trained in a distributed fashion because of memory and computing resource limitations. Few existing strategy generation approaches take memory minimization as the objective. To fill this gap, we propose a novel algorithm that generates optimal parallelism strategies under the constraint of minimal memory redundancy. We first propose a redundant memory cost model that calculates the memory overhead of each operator under a given parallelism strategy. To generate the optimal strategy, we formulate the parallelism strategy search as an integer linear programming problem and use an efficient solver to find minimal-memory intra-operator parallelism strategies. We further extend and implement the algorithm in a multi-dimensional parallel training framework, which achieves high throughput together with minimal memory redundancy. Experimental results show that our approach saves up to 67% of memory compared with the latest Megatron-LM strategies, while the throughput gap between our approach and its counterparts remains small.
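The abstract describes assigning each operator a parallelism strategy so that total redundant memory is minimized, solved in the paper via integer linear programming. The following toy sketch illustrates the shape of that search problem only; the operator names, cost numbers, and exhaustive enumeration (used here in place of an ILP solver) are illustrative assumptions, not the paper's actual model or implementation.

```python
from itertools import product

# Hypothetical per-operator candidates: each maps a strategy name to
# (redundant_memory_cost, communication_cost). Numbers are made up.
ops = {
    "matmul1": {"data_parallel": (8.0, 1.0), "tensor_parallel": (2.0, 3.0)},
    "matmul2": {"data_parallel": (6.0, 1.0), "tensor_parallel": (1.5, 2.5)},
}

def search_min_memory(ops):
    """Enumerate joint strategy assignments and pick the one minimizing
    total redundant memory, breaking ties by communication cost.
    An ILP solver would explore the same 0/1 assignment space efficiently."""
    best = None
    for choice in product(*([(op, s) for s in strats]
                            for op, strats in ops.items())):
        mem = sum(ops[op][s][0] for op, s in choice)
        comm = sum(ops[op][s][1] for op, s in choice)
        if best is None or (mem, comm) < (best[0], best[1]):
            best = (mem, comm, dict(choice))
    return best

mem, comm, plan = search_min_memory(ops)
print(mem, plan)  # here tensor parallelism minimizes redundant memory
```

In a real formulation, each (operator, strategy) pair becomes a binary variable, one-hot constraints force a single strategy per operator, and the objective sums the per-operator redundant memory costs.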