CLC number: TP181
On-line Access: 2025-02-10
Received: 2023-10-10
Revision Accepted: 2023-10-17
Crosschecked: 2025-02-18
ORCID: https://orcid.org/0000-0001-9743-2034
Yanqi SHI, Peng LIANG, Hao ZHENG, Linbo QIAO, Dongsheng LI. Automatic parallelism strategy generation with minimal memory redundancy[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300684
Automatic parallelism strategy generation with minimal memory redundancy
National Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology, Changsha 410000, China
Abstract: Because of limited memory and computing resources, large-scale deep learning models are usually trained in a distributed manner. However, existing strategy-generation approaches rarely take minimizing memory consumption as their objective. To fill this gap, we propose a novel algorithm that generates automatic parallelism strategies with the goal of minimizing memory redundancy. A redundant-memory cost model is introduced to compute the memory overhead of each operator under a given parallel strategy. To guarantee that the generated strategy is optimal, we formulate the strategy search problem as an integer linear programming problem and use an efficient solver to find the intra-operator parallelism strategy with the minimal memory footprint. The proposed approach is implemented in a multi-dimensional parallel training framework. Experimental results show that it saves up to 67% of the memory overhead of the state-of-the-art Megatron-LM method while achieving comparable throughput.
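The abstract casts intra-operator parallel strategy search as an integer linear programming (ILP) problem whose objective is the total redundant memory. As an illustration only, the sketch below shows how such a formulation could be set up with an off-the-shelf ILP solver (PuLP in Python); the operator names, candidate strategies, and memory costs are invented placeholders, and the paper's actual objective is supplied by its redundant-memory cost model rather than by these numbers.

```python
# Minimal sketch of an ILP-based strategy search, NOT the authors' implementation.
# The per-operator candidate strategies and their redundant-memory costs (in MB)
# are hypothetical placeholders for illustration.
import pulp

candidates = {
    "embedding": {"replicate": 512.0, "shard_vocab": 64.0},
    "attention": {"replicate": 768.0, "shard_heads": 96.0},
    "mlp":       {"replicate": 1024.0, "shard_columns": 128.0},
}

prob = pulp.LpProblem("min_memory_redundancy", pulp.LpMinimize)

# x[(op, s)] = 1 if operator `op` adopts candidate strategy `s`.
x = {
    (op, s): pulp.LpVariable(f"x_{op}_{s}", cat=pulp.LpBinary)
    for op, strategies in candidates.items()
    for s in strategies
}

# Objective: minimize the total redundant memory over all operators.
prob += pulp.lpSum(
    cost * x[(op, s)]
    for op, strategies in candidates.items()
    for s, cost in strategies.items()
)

# Each operator must choose exactly one strategy.
for op, strategies in candidates.items():
    prob += pulp.lpSum(x[(op, s)] for s in strategies) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (op, s), var in x.items():
    if pulp.value(var) == 1:
        print(f"{op}: {s}")
```

In the paper's formulation, the cost of each candidate strategy comes from the redundant-memory cost model, so the solver returns the combination of intra-operator strategies with the minimal memory footprint.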
References
[1] Brown TB, Mann B, Ryder N, et al., 2020. Language models are few-shot learners. Proc 34th Int Conf on Neural Information Processing Systems, Article 159.
[2] Cai ZK, Yan X, Ma KH, et al., 2022. TensorOpt: exploring the tradeoffs in distributed DNN training with auto-parallelism. IEEE Trans Parall Distrib Syst, 33(8):1967-1981.
[3] Chowdhery A, Narang S, Devlin J, et al., 2022. PaLM: scaling language modeling with pathways. https://arxiv.org/abs/2204.02311
[4] Dan YH, Lei ZK, Gu YY, et al., 2023. EduChat: a large-scale language model-based chatbot system for intelligent education. https://arxiv.org/abs/2308.02773
[5] Devlin J, Chang MW, Lee K, et al., 2019. BERT: pre-training of deep bidirectional Transformers for language understanding. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.4171-4186.
[6] Guan L, Sun T, Qiao LB, et al., 2020. An efficient parallel and distributed solution to nonconvex penalized linear SVMs. Front Inform Technol Electron Eng, 21(4):587-603.
[7] Harlap A, Narayanan D, Phanishayee A, et al., 2018. PipeDream: fast and efficient pipeline parallel DNN training. https://arxiv.org/abs/1806.03377
[8] He XB, Chen X, Guo H, et al., 2023. Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system. Front Inform Technol Electron Eng, 24(1):41-58.
[9] Huang YP, Cheng YL, Bapna A, et al., 2019. GPipe: efficient training of giant neural networks using pipeline parallelism. Proc 33rd Int Conf on Neural Information Processing Systems, Article 10.
[10] Jia ZH, Lin SN, Qi CR, et al., 2018. Exploring hidden dimensions in accelerating convolutional neural networks. Proc 35th Int Conf on Machine Learning, p.2274-2283.
[11] Krizhevsky A, Sutskever I, Hinton GE, 2012. ImageNet classification with deep convolutional neural networks. Proc 26th Annual Conf on Neural Information Processing Systems, p.1106-1114.
[12] Lan Q, Qiao LB, Wang YJ, 2018. Stochastic extra-gradient based alternating direction methods for graph-guided regularized minimization. Front Inform Technol Electron Eng, 19(6):755-762.
[13] Li SG, Liu HX, Bian ZD, et al., 2023. Colossal-AI: a unified deep learning system for large-scale parallel training. Proc 52nd Int Conf on Parallel Processing, p.766-775.
[14] Liu YL, Li SG, Fang JR, et al., 2023. Colossal-Auto: unified automation of parallelization and activation checkpoint for large-scale models. https://arxiv.org/abs/2302.02599
[15] Liu ZM, Cheng SG, Zhou HT, et al., 2023. Hanayo: harnessing wave-like pipeline parallelism for enhanced large model training efficiency. Proc Int Conf for High Performance Computing, Networking, Storage and Analysis, Article 56.
[16] Mo ZY, 2018. Extreme-scale parallel computing: bottlenecks and strategies. Front Inform Technol Electron Eng, 19(10):1251-1260.
[17] Narayanan D, Shoeybi M, Casper J, et al., 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. Proc Int Conf for High Performance Computing, Networking, Storage and Analysis, Article 58.
[18] Naumov M, Mudigere D, Shi HJM, et al., 2019. Deep learning recommendation model for personalization and recommendation systems. https://arxiv.org/abs/1906.00091
[19] Rajbhandari S, Rasley J, Ruwase O, et al., 2020. ZeRO: memory optimizations toward training trillion parameter models. Proc Int Conf for High Performance Computing, Networking, Storage and Analysis, Article 20.
[20] Shazeer N, Cheng YL, Parmar N, et al., 2018. Mesh-TensorFlow: deep learning for supercomputers. Proc 32nd Int Conf on Neural Information Processing Systems, p.10435-10444.
[21] Shoeybi M, Patwary M, Puri R, et al., 2019. Megatron-LM: training multi-billion parameter language models using model parallelism. https://arxiv.org/abs/1909.08053
[22] Vaswani A, Shazeer N, Parmar N, et al., 2017. Attention is all you need. Proc 31st Int Conf on Neural Information Processing Systems, p.6000-6010.
[23] Wang MJ, Huang CC, Li JY, 2019. Supporting very large models using automatic dataflow graph partitioning. Proc 14th EuroSys Conf, Article 26.
[24] Zheng LM, Li ZH, Zhang H, et al., 2022. Alpa: automating inter- and intra-operator parallelism for distributed deep learning. Proc 16th USENIX Symp on Operating Systems Design and Implementation, p.559-578.
[25] Zhuang YT, Wu F, Chen C, et al., 2017. Challenges and opportunities: from big data to knowledge in AI 2.0. Front Inform Technol Electron Eng, 18(1):3-14.