Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2025 Vol.26 No.5 P.770-787
Memory-efficient tensor parallelism for long-sequence Transformer training
Abstract: Transformer-based models such as large language models (LLMs) have attracted significant attention in recent years due to their superior performance. Long sequences of input tokens are essential for industrial LLMs to provide better user services. However, memory consumption grows quadratically with sequence length, posing challenges for scaling up long-sequence training. Current parallelism methods produce duplicated tensors during execution, leaving room to improve memory efficiency. Additionally, tensor parallelism (TP) cannot effectively overlap computation with communication. To address these weaknesses, we propose a general parallelism method called memory-efficient tensor parallelism (METP), designed for the computation of two consecutive matrix multiplications and a possible function between them (i.e., O = f(AB)C), which is the core computation unit of Transformer training. METP distributes the subtasks of computing O across multiple devices and exchanges submatrices through point-to-point communication (send/recv) instead of collective communication, thereby avoiding duplicated tensors. It applies double buffering to overlap computation with communication, and we present theoretical conditions for achieving full overlap to guide long-sequence Transformer training. Theoretical analysis shows that, with a parallel degree of p, METP has O(1/p^3) memory cost when computing attention without FlashAttention, and saves at least 41.7% memory compared with TP when using FlashAttention to compute multi-head self-attention. Experiments on eight A100 GPUs show that METP supports 2.38-2.99 times longer sequences than other methods.
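The abstract describes METP only at a high level. To make the O = f(AB)C pattern concrete, below is a minimal single-process NumPy sketch of one plausible block decomposition, with a ring-style rotation of the B and C blocks standing in for the send/recv exchanges. The blocking scheme, the function names (gelu, metp_style_o), and the use of an elementwise f are assumptions for illustration only and are not taken from the paper.

# Minimal single-process NumPy sketch of the block-partitioned computation
# O = f(A B) C that METP targets. The row/column blocking, the ring-style
# rotation standing in for send/recv, and the choice of f (GELU) are
# illustrative assumptions, not the paper's actual partitioning or schedule.
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here as the elementwise f between the
    # two matrix multiplications (in attention, f would be a softmax and
    # would additionally require online renormalization across blocks).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def metp_style_o(A, B, C, p, f=gelu):
    """Compute O = f(A @ B) @ C with p 'devices': each owns one row block of A
    and accumulates its own row block of O, while the blocks of B and C are
    rotated ring-style. On real hardware each rotation would be a
    point-to-point send/recv instead of a local index shift."""
    n, k = A.shape
    m, d = C.shape
    assert n % p == 0 and m % p == 0
    A_blocks = np.split(A, p, axis=0)      # A_i: (n/p, k), stays on device i
    B_blocks = np.split(B, p, axis=1)      # B_j: (k, m/p), rotates
    C_blocks = np.split(C, p, axis=0)      # C_j: (m/p, d), rotates with B_j
    O_blocks = [np.zeros((n // p, d)) for _ in range(p)]
    for step in range(p):                  # p ring steps cover all (i, j) pairs
        for rank in range(p):
            j = (rank + step) % p          # block currently held by this rank
            O_blocks[rank] += f(A_blocks[rank] @ B_blocks[j]) @ C_blocks[j]
    return np.vstack(O_blocks)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, n, k, m, d = 4, 8, 5, 12, 3
    A, B, C = rng.normal(size=(n, k)), rng.normal(size=(k, m)), rng.normal(size=(m, d))
    print(np.allclose(metp_style_o(A, B, C, p), gelu(A @ B) @ C))  # True

The double-buffering overlap that METP relies on is not modeled in this sketch; in a real multi-device implementation, each rotation would be an asynchronous point-to-point exchange pipelined with the block computation of the previous step.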
Key words: Distributed learning; Large language model (LLM); Long sequence; Machine learning system; Memory efficiency; Tensor parallelism
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
DOI: 10.1631/FITEE.2400602
CLC number: TP183
On-line Access: 2025-06-04
Received: 2024-07-17
Revision Accepted: 2025-02-23
Crosschecked: 2025-06-04