CLC number: TP183
On-line Access: 2025-06-04
Received: 2024-07-17
Revision Accepted: 2025-02-23
Crosschecked: 2025-06-04
Peng LIANG, Linbo QIAO, Yanqi SHI, Hao ZHENG, Yu TANG, Dongsheng LI. Memory-efficient tensor parallelism for long-sequence Transformer training[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(5): 770-787.
@article{Liang2025METP,
title="Memory-efficient tensor parallelism for long-sequence Transformer training",
author="Peng LIANG and Linbo QIAO and Yanqi SHI and Hao ZHENG and Yu TANG and Dongsheng LI",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="26",
number="5",
pages="770-787",
year="2025",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2400602"
}
%0 Journal Article
%T Memory-efficient tensor parallelism for long-sequence Transformer training
%A Peng LIANG
%A Linbo QIAO
%A Yanqi SHI
%A Hao ZHENG
%A Yu TANG
%A Dongsheng LI
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 5
%P 770-787
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400602
TY - JOUR
T1 - Memory-efficient tensor parallelism for long-sequence Transformer training
A1 - Peng LIANG
A1 - Linbo QIAO
A1 - Yanqi SHI
A1 - Hao ZHENG
A1 - Yu TANG
A1 - Dongsheng LI
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 5
SP - 770
EP - 787
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400602
ER -
Abstract: Transformer-based models such as large language models (LLMs) have attracted significant attention in recent years owing to their superior performance. Long input sequences are essential for industrial LLMs to provide better user services. However, memory consumption grows quadratically with sequence length, which poses challenges for scaling up long-sequence training. Current parallelism methods produce duplicated tensors during execution, leaving room for improving memory efficiency. Additionally, tensor parallelism (TP) cannot effectively overlap computation with communication. To address these weaknesses, we propose a general parallelism method called memory-efficient tensor parallelism (METP), designed for the computation of two consecutive matrix multiplications and a possible function between them.
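For context, the following is a minimal, illustrative Python/NumPy sketch (not the METP algorithm itself) of the computation pattern the abstract describes: two consecutive matrix multiplications with a function between them, here taken to be the Transformer MLP block Y = GeLU(X W1) W2, together with the standard Megatron-style tensor-parallel split used as the baseline. Cross-device communication is simulated on a single process by summing partial results.

import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(0)
s, h, f, tp = 16, 8, 32, 4               # sequence length, hidden size, FFN size, TP degree
X = rng.standard_normal((s, h))          # input activations (replicated on every TP device)
W1 = rng.standard_normal((h, f))         # first weight matrix
W2 = rng.standard_normal((f, h))         # second weight matrix

# Reference: the unpartitioned two-matmul pattern, Y = GeLU(X @ W1) @ W2
Y_ref = gelu(X @ W1) @ W2

# Baseline Megatron-style TP: split W1 by columns and W2 by rows; each device
# computes a partial output, and an all-reduce (simulated here by a sum) combines them.
W1_shards = np.split(W1, tp, axis=1)
W2_shards = np.split(W2, tp, axis=0)
partials = [gelu(X @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
Y_tp = sum(partials)

assert np.allclose(Y_ref, Y_tp)

Note that in this baseline split the full activation X is replicated on every device, an example of the tensor duplication the abstract refers to; METP is proposed to address such duplication and the lack of computation-communication overlap.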