CLC number: TP389.1
On-line Access: 2025-04-03
Received: 2023-10-17
Revision Accepted: 2024-03-31
Crosschecked: 2025-04-07
Yu TANG, Linbo QIAO, Lujia YIN, Peng LIANG, Ao SHEN, Zhilin YANG, Lizhi ZHANG, Dongsheng LI. Training large-scale language models with limited GPU memory: a survey[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300710
Training large-scale language models with limited GPU memory: a survey
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science, National University of Defense Technology, Changsha 410073, China

Abstract: Large models have attracted broad attention in computer vision, natural language processing, and other fields owing to their outstanding performance across a wide range of applications. However, training such models is severely constrained by the memory capacity of graphics processing units (GPUs). This paper systematically surveys optimization techniques for training large models under limited GPU memory. It first analyzes the three core contributors to GPU memory consumption during training: model parameters, model states, and model activations. It then reviews existing work from each of these three perspectives. Finally, it discusses future directions for the field, emphasizing that continued innovation in memory optimization techniques is essential to the advancement of large language models. This survey provides researchers with a systematic reference for understanding the memory optimization challenges and the evolution of related techniques in large language model training.

Key words:
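To make the three memory components named in the abstract concrete, the following Python sketch gives a rough per-GPU memory estimate for a dense model. It assumes mixed-precision training with Adam and the commonly cited 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights, momentum, variance) bytes per parameter; these figures are standard conventions from the memory-optimization literature, not numbers taken from this paper, and the function name and parameters are illustrative only.

    # Back-of-the-envelope estimate of the three GPU memory components discussed in
    # the survey: model parameters, model states (gradients + optimizer states), and
    # activations. Assumes mixed-precision training with Adam; no parallelism,
    # sharding, offloading, or recomputation is applied.

    def estimate_training_memory_gb(num_params: float,
                                    activation_bytes: float = 0.0) -> dict:
        """Return a rough per-GPU memory breakdown in gigabytes."""
        GB = 1024 ** 3
        params_fp16 = 2 * num_params            # fp16 model weights
        grads_fp16 = 2 * num_params             # fp16 gradients
        adam_states = (4 + 4 + 4) * num_params  # fp32 master weights, momentum, variance
        return {
            "parameters (GB)": params_fp16 / GB,
            "model states (GB)": (grads_fp16 + adam_states) / GB,
            "activations (GB)": activation_bytes / GB,
        }

    if __name__ == "__main__":
        # Example: a hypothetical 7-billion-parameter model. Activations are left at
        # zero here because they depend on batch size, sequence length, and the
        # recomputation/offloading policy, which the techniques surveyed address.
        for name, gigabytes in estimate_training_memory_gb(7e9).items():
            print(f"{name}: {gigabytes:.1f}")

Under these assumptions, parameters and model states alone already exceed 100 GB for a 7-billion-parameter model, which illustrates why the survey treats memory optimization for parameters, states, and activations as separate problems.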