CLC number: TP183
On-line Access: 2025-06-04
Received: 2024-07-17
Revision Accepted: 2025-02-23
Crosschecked: 2025-06-04
Peng LIANG, Linbo QIAO, Yanqi SHI, Hao ZHENG, Yu TANG, Dongsheng LI. Memory-efficient tensor parallelism for long-sequence Transformer training[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400602
Memory-efficient tensor parallelism for long-sequence Transformer training
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Abstract: In recent years, large language models (LLMs) built on the Transformer architecture have attracted wide attention for their outstanding performance. Industrial-grade LLMs must handle long input sequences to provide high-quality services, yet memory consumption grows quadratically with sequence length, which limits the scalability of long-sequence training. Existing parallel methods create redundant tensors during execution, leaving room for memory optimization, and tensor parallelism (TP) cannot effectively overlap computation with communication. To address these problems, we propose a general parallel method, memory-efficient tensor parallelism (METP), designed for the core computational unit of Transformer training: two consecutive matrix multiplications with an optional function in between, O = f(AB)C. METP distributes the subtasks of computing O across multiple devices and exchanges sub-matrices through point-to-point communication (send/recv) instead of collective communication, thereby avoiding redundant tensors. Double buffering is used to deeply overlap computation with communication, and we derive the theoretical condition for full overlap to guide long-sequence Transformer training. Theoretical analysis shows that, with parallelism degree p, METP incurs O(1/p^3) memory overhead when attention is computed without FlashAttention, and saves at least 41.7% of memory compared with TP when FlashAttention is used for multi-head self-attention. Experiments on 8 A100 GPUs show that METP supports sequence lengths 2.38–2.99 times longer than those of other methods.
Key words:
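The following is a minimal, hypothetical PyTorch sketch of the O = f(AB)C pattern described in the abstract: sub-matrices are exchanged in a ring with point-to-point send/recv, and double buffering lets the local matrix multiplications overlap with the transfers. The sharding layout (A split by sequence rows, B by columns, C by rows), the function name metp_style_block, and the use of torch.distributed isend/irecv are illustrative assumptions, not the authors' METP implementation.

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F

    def metp_style_block(A_i, B_i, C_i, f=F.gelu):
        """Hypothetical sketch: compute this rank's sequence slice of O = f(A B) C.

        A_i: (s/p, h)   local row (sequence) block of A
        B_i: (h, h'/p)  local column block of B
        C_i: (h'/p, h)  local row block of C
        f must be element-wise (e.g., GELU) for the blockwise sum below to be exact.
        """
        world, rank = dist.get_world_size(), dist.get_rank()
        send_to, recv_from = (rank + 1) % world, (rank - 1) % world

        # Double buffers: compute with (B_cur, C_cur) while (B_nxt, C_nxt)
        # receive the next shard pair, so communication hides behind the matmuls.
        B_cur, C_cur = B_i.contiguous(), C_i.contiguous()
        B_nxt, C_nxt = torch.empty_like(B_cur), torch.empty_like(C_cur)

        O_i = A_i.new_zeros(A_i.shape[0], C_i.shape[1])
        for step in range(world):
            reqs = []
            if step < world - 1:
                # Post asynchronous p2p transfers for the next shard pair ...
                reqs = [dist.isend(B_cur, dst=send_to),
                        dist.isend(C_cur, dst=send_to),
                        dist.irecv(B_nxt, src=recv_from),
                        dist.irecv(C_nxt, src=recv_from)]
            # ... while the partial product with the shard pair currently on this
            # device is computed; O_i accumulates sum_j f(A_i B_j) C_j, i.e. the
            # rows of f(A B) C owned by this rank.
            O_i += f(A_i @ B_cur) @ C_cur
            for r in reqs:
                r.wait()
            B_cur, B_nxt = B_nxt, B_cur
            C_cur, C_nxt = C_nxt, C_cur
        return O_i

With the NCCL backend, the four transfers may instead be posted together with torch.distributed.batch_isend_irecv to avoid ordering issues. Note that softmax in attention is not element-wise, so multi-head self-attention would need the blockwise treatment the paper discusses in connection with FlashAttention rather than this simple accumulation.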