On-line Access: 2026-02-02

Received: 2025-09-12

Revision Accepted: 2026-01-23


Journal of Zhejiang University SCIENCE C

http://doi.org/10.1631/ENG.ITEE.2025.0034


FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression


Author(s):  Yun Teng (1), Dawei Sun (1), Shipeng Hu (2), Zhiyue Li (2), Guangyan Zhang (2), Haidong Tian (3), Rui Chang (3)

Affiliation(s):
(1) School of Artificial Intelligence, China University of Geosciences Beijing, Beijing 100083, China
(2) Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
(3) State Key Laboratory of Mobile Network and Mobile Multimedia Technology, ZTE Corporation, Shenzhen 518057, China

Corresponding email(s):   gyzh@tsinghua.edu.cn

Key Words:  Deep neural network (DNN) models, Critical failures, Parallel transmission, Data compression, Checkpointing and recovery


@article{teng_fastcheck,
title="FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression",
author="Yun Teng and Dawei Sun and Shipeng Hu and Zhiyue Li and Guangyan Zhang and Haidong Tian and Rui Chang",
journal="Journal of Zhejiang University Science C",
issn="1869-1951",
publisher="Zhejiang University Press & Springer",
doi="10.1631/ENG.ITEE.2025.0034"
}


Abstract:
Training large-scale deep neural networks (DNNs) is prone to software and hardware failures, and critical failures often require full-machine reboots that substantially prolong training. Existing checkpoint-recovery solutions either cannot tolerate such critical failures or suffer from slow checkpointing and recovery because of constrained input/output (I/O) bandwidth. In this paper, we propose FastCheck, a checkpoint-recovery framework that accelerates checkpointing and recovery through parallel transmission and tailored compression. First, FastCheck partitions checkpoints into shards and leverages multiple nodes for parallel checkpointing and recovery. Second, it reduces checkpoint size and overhead by applying delta compression to weights and index compression to momentum. Third, FastCheck applies a lightweight consensus protocol that accurately tracks node health, preventing checkpoint transmission to failed nodes. We implement FastCheck in PyTorch and evaluate it on multiple DNN models against two baselines. Experimental results show that FastCheck reduces checkpointing time by up to 78.42% and recovery time by up to 77.41%, while consistently improving efficiency across different training stages.
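
The abstract does not give implementation details, so the following is only an illustrative sketch of two of the ideas it names: partitioning a checkpoint into shards, and delta-encoding weights against the previous checkpoint. It is not FastCheck's code. The function names (`shard`, `delta_encode`, `delta_decode`), the contiguous sharding scheme, and the "store only changed entries" form of delta compression are all assumptions made for illustration, and plain Python lists stand in for PyTorch tensors.

```python
from typing import Dict, List


def shard(values: List[float], num_shards: int) -> List[List[float]]:
    """Split a flat parameter list into contiguous, near-equal shards.

    Hypothetical helper: each shard could then be sent to a different
    node in parallel.
    """
    base, extra = divmod(len(values), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        shards.append(values[start:start + size])
        start += size
    return shards


def delta_encode(prev: List[float], curr: List[float]) -> Dict[int, float]:
    """Keep only entries that differ from the previous checkpoint.

    Assumes `prev` and `curr` have the same length.
    """
    return {i: c for i, (p, c) in enumerate(zip(prev, curr)) if p != c}


def delta_decode(prev: List[float], delta: Dict[int, float]) -> List[float]:
    """Rebuild the current values from the previous ones plus the delta."""
    restored = list(prev)
    for i, v in delta.items():
        restored[i] = v
    return restored


if __name__ == "__main__":
    prev = [0.0] * 10
    curr = list(prev)
    curr[3], curr[7] = 0.5, -1.25  # only two weights changed

    prev_shards = shard(prev, 3)
    curr_shards = shard(curr, 3)
    deltas = [delta_encode(p, c) for p, c in zip(prev_shards, curr_shards)]

    # Recovery: decode each shard independently, then concatenate.
    restored = [x for p, d in zip(prev_shards, deltas) for x in delta_decode(p, d)]
    assert restored == curr
    print("entries stored per shard:", [len(d) for d in deltas])
```

Because each shard is encoded and decoded independently, the per-shard work in this sketch could run on separate nodes, which is the property the parallel transmission idea relies on. How FastCheck actually chooses shard boundaries and encodes deltas is not stated in the abstract.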


Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2026 Journal of Zhejiang University-SCIENCE