CLC number: TP389.1
On-line Access: 2025-04-03
Received: 2023-10-17
Revision Accepted: 2024-03-31
Crosschecked: 2025-04-07
Yu TANG, Linbo QIAO, Lujia YIN, Peng LIANG, Ao SHEN, Zhilin YANG, Lizhi ZHANG, Dongsheng LI. Training large-scale language models with limited GPU memory: a survey[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(3): 309-331.
@article{Tang2025survey,
title="Training large-scale language models with limited GPU memory: a survey",
author="Yu TANG and Linbo QIAO and Lujia YIN and Peng LIANG and Ao SHEN and Zhilin YANG and Lizhi ZHANG and Dongsheng LI",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="26",
number="3",
pages="309-331",
year="2025",
issn="2095-9184",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2300710"
}
Abstract: Large-scale models have attracted significant attention in fields such as computer vision and natural language processing because of their effectiveness across a wide range of applications. A major obstacle to training these models, however, is the limited memory capacity of graphics processing units (GPUs). In this paper, we present a comprehensive survey of training large-scale models under limited GPU memory. We first examine the factors that consume GPU memory during training, namely model parameters, model states, and model activations, and then review in depth the research that addresses each of these aspects. Finally, we offer an outlook on the future of memory optimization for training large-scale language models, emphasizing the need for continued research and innovation in this area. This survey is intended as a resource for researchers and practitioners who want to understand the challenges and advances in training large-scale language models with limited GPU memory.
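To make the memory breakdown in the abstract concrete, the sketch below gives a rough, back-of-envelope estimate of per-GPU memory for mixed-precision Adam training. It assumes the widely used accounting of roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 optimizer states) and a commonly quoted per-layer Transformer activation estimate; the function names and the GPT-2-XL-like configuration are illustrative assumptions, not figures taken from the survey.

```python
# Back-of-envelope GPU memory estimate for mixed-precision (fp16/fp32) Adam training.
# Illustrative sketch only: the constants follow the common accounting of ~16 bytes per
# parameter (2 fp16 weights + 2 fp16 gradients + 12 fp32 optimizer states) and a rough
# per-layer Transformer activation estimate; real frameworks differ in the details.

def model_and_state_bytes(num_params: int) -> int:
    """Parameters, gradients, and Adam optimizer states under mixed precision."""
    fp16_weights = 2 * num_params   # half-precision copy of the weights
    fp16_grads = 2 * num_params     # gradients kept in half precision
    fp32_states = 12 * num_params   # fp32 master weights + momentum + variance
    return fp16_weights + fp16_grads + fp32_states


def activation_bytes(layers: int, batch: int, seq: int, hidden: int, heads: int) -> int:
    """Rough fp16 activation footprint of a Transformer without recomputation.

    Uses the frequently quoted per-layer estimate s*b*h*(34 + 5*a*s/h) bytes;
    treat it as an approximation, not an exact measurement.
    """
    per_layer = seq * batch * hidden * (34 + 5 * heads * seq / hidden)
    return int(layers * per_layer)


if __name__ == "__main__":
    # Hypothetical GPT-2-XL-like configuration (~1.5B parameters).
    n_params, n_layers, n_heads = 1_500_000_000, 48, 25
    d_hidden, seq_len, batch = 1600, 1024, 8

    states = model_and_state_bytes(n_params)
    acts = activation_bytes(n_layers, batch, seq_len, d_hidden, n_heads)
    gib = 1024 ** 3
    print(f"parameters + gradients + optimizer states: {states / gib:.1f} GiB")
    print(f"activations (no recomputation, est.):      {acts / gib:.1f} GiB")
    # The combined footprint far exceeds a single 40 GiB A100, which is why each of
    # the three memory consumers needs its own optimization techniques.
```

Under these assumptions, model states alone take about 22 GiB and activations roughly 67 GiB, which is why the techniques surveyed here (recomputation, offloading, compression, sharding, and parallelism) target each component separately.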