
CLC number: TP302

On-line Access: 2026-03-02

Received: 2025-09-12

Revision Accepted: 2026-01-23

Crosschecked: 2026-03-02


 ORCID:

Yun TENG

https://orcid.org/0000-0001-5425-5111

Guangyan ZHANG

https://orcid.org/0000-0002-3480-5902


Frontiers of Information Technology & Electronic Engineering  2026 Vol.27 No.2 P.1-13

http://doi.org/10.1631/ENG.ITEE.2025.0034


FastCheck: fast checkpointing and recovery for DNN training via parallel transmission and compression


Author(s):  Yun TENG, Dawei SUN, Shipeng HU, Zhiyue LI, Guangyan ZHANG, Haidong TIAN, Rui CHANG

Affiliation(s):  1. School of Artificial Intelligence, China University of Geosciences (Beijing), Beijing 100083, China; 2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 3. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, ZTE Corporation, Shenzhen 518057, China

Corresponding email(s):   gyzh@tsinghua.edu.cn

Key Words:  Deep neural network models, Critical failures, Parallel transmission, Data compression, Checkpointing and recovery


Yun TENG, Dawei SUN, Shipeng HU, Zhiyue LI, Guangyan ZHANG, Haidong TIAN, Rui CHANG. FastCheck: fast checkpointing and recovery for DNN training via parallel transmission and compression[J]. Journal of Zhejiang University Science C, 2026, 27(2): 1-13.

@article{title="FastCheck: fast checkpointing and recovery for DNN training via parallel transmission and compression",
author="Yun TENG, Dawei SUN, Shipeng HU, Zhiyue LI, Guangyan ZHANG, Haidong TIAN, Rui CHANG",
journal="Journal of Zhejiang University Science C",
volume="27",
number="2",
pages="1-13",
year="2026",
publisher="Zhejiang University Press & Springer",
doi="10.1631/ENG.ITEE.2025.0034"
}

%0 Journal Article
%T FastCheck: fast checkpointing and recovery for DNN training via parallel transmission and compression
%A Yun TENG
%A Dawei SUN
%A Shipeng HU
%A Zhiyue LI
%A Guangyan ZHANG
%A Haidong TIAN
%A Rui CHANG
%J Frontiers of Information Technology & Electronic Engineering
%V 27
%N 2
%P 1-13
%@ 1869-1951
%D 2026
%I Zhejiang University Press & Springer
%DOI 10.1631/ENG.ITEE.2025.0034

TY - JOUR
T1 - FastCheck: fast checkpointing and recovery for DNN training via parallel transmission and compression
A1 - Yun TENG
A1 - Dawei SUN
A1 - Shipeng HU
A1 - Zhiyue LI
A1 - Guangyan ZHANG
A1 - Haidong TIAN
A1 - Rui CHANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 27
IS - 2
SP - 1
EP - 13
SN - 1869-1951
Y1 - 2026
PB - Zhejiang University Press & Springer
DO - 10.1631/ENG.ITEE.2025.0034
ER -


Abstract: 
Training large-scale deep neural networks (DNNs) is prone to software and hardware failures, with critical failures often requiring full-machine reboots that substantially prolong training. Existing checkpoint–recovery solutions either cannot tolerate such critical failures or suffer from slow checkpointing and recovery due to constrained input/output bandwidth. In this paper, we propose FastCheck, a checkpoint–recovery framework that accelerates checkpointing and recovery through parallel transmission and tailored compression. First, FastCheck partitions checkpoints into shards and leverages multiple nodes for parallel checkpointing and recovery. Second, it further reduces checkpoint size and overhead with delta compression for weights and index compression for momentum. Third, FastCheck employs lightweight and consistent health status maintenance that accurately tracks node health, preventing checkpoint transmission to failed nodes. We implement FastCheck in PyTorch and evaluate it on multiple DNN models against two baselines. Experimental results show that FastCheck reduces the checkpointing time by up to 78.42% and the recovery time by up to 77.41%, while consistently improving efficiency across different training stages.
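The abstract describes delta compression of weights: since weights drift only slightly between adjacent checkpoints, storing the compressed difference against the previous checkpoint is much smaller than storing the weights themselves. The sketch below illustrates the general idea only; the function names and the use of NumPy with zlib are illustrative assumptions, not FastCheck's actual implementation.

```python
import zlib
import numpy as np

def delta_compress(curr: np.ndarray, prev: np.ndarray) -> bytes:
    """Compress the element-wise delta between two weight snapshots.

    Adjacent checkpoints differ little, so the delta stream is dominated
    by small values and compresses far better than the raw weights.
    """
    delta = curr.astype(np.float32) - prev.astype(np.float32)
    return zlib.compress(delta.tobytes())

def delta_decompress(blob: bytes, prev: np.ndarray) -> np.ndarray:
    """Recover the current weights from a delta blob and the previous snapshot."""
    delta = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
    return prev.astype(np.float32) + delta.reshape(prev.shape)
```

In a real system the reference snapshot would be the last full checkpoint, and recovery would replay the stored deltas on top of it; a floating-point-aware delta encoding (as in Delta-DNN-style schemes) would typically outperform the generic byte-level compressor used here.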

FastCheck: a fast checkpointing and recovery method for DNN training via parallel transmission and tailored compression

Yun TENG1, Dawei SUN1, Shipeng HU2, Zhiyue LI2, Guangyan ZHANG2, Haidong TIAN3, Rui CHANG3
1School of Artificial Intelligence, China University of Geosciences (Beijing), Beijing 100083, China
2Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
3State Key Laboratory of Mobile Network and Mobile Multimedia Technology, ZTE Corporation, Shenzhen 518057, China
Abstract: Training large-scale deep neural networks often suffers from software and hardware failures, and critical failures typically require whole-machine reboots that greatly prolong training time. Among existing checkpointing and recovery solutions, some cannot tolerate such critical failures, while others are limited by input/output bandwidth and therefore checkpoint and recover slowly. We propose FastCheck, a framework that accelerates checkpointing and recovery through parallel transmission and tailored compression. First, FastCheck partitions checkpoint data into shards and uses multiple nodes to perform checkpointing and recovery in parallel. Second, it further reduces checkpoint size and overhead through delta compression of weights and index compression of momentum. Finally, it adopts a lightweight consensus protocol to accurately track node health status, avoiding the transmission of checkpoint data to failed nodes. We implement FastCheck in PyTorch and evaluate it on multiple DNN models against two baselines. Experimental results show that FastCheck reduces checkpointing time by up to 78.42% and recovery time by up to 77.41%, and consistently improves system efficiency across different training stages.

Key words: Deep neural network models; Critical failures; Parallel transmission; Data compression; Checkpointing and recovery




Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2026 Journal of Zhejiang University-SCIENCE