
On-line Access: 2026-02-02
Received: 2025-09-12
Revision Accepted: 2026-01-23
Yun Teng1, Dawei Sun1, Shipeng Hu2, Zhiyue Li2, Guangyan Zhang2, Haidong Tian3, Rui Chang3. FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression[J]. Journal of Zhejiang University Science C, 2026 (in press). https://doi.org/10.1631/ENG.ITEE.2025.0034
@article{Teng2026FastCheck,
title="FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression",
author="Yun Teng and Dawei Sun and Shipeng Hu and Zhiyue Li and Guangyan Zhang and Haidong Tian and Rui Chang",
journal="Journal of Zhejiang University Science C",
year="2026",
note="in press",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/ENG.ITEE.2025.0034"
}
%0 Journal Article
%T FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression
%A Yun Teng
%A Dawei Sun
%A Shipeng Hu
%A Zhiyue Li
%A Guangyan Zhang
%A Haidong Tian
%A Rui Chang
%J Journal of Zhejiang University Science C
%@ 1869-1951
%D 2026
%I Zhejiang University Press & Springer
%R 10.1631/ENG.ITEE.2025.0034
TY - JOUR
T1 - FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression
A1 - Yun Teng
A1 - Dawei Sun
A1 - Shipeng Hu
A1 - Zhiyue Li
A1 - Guangyan Zhang
A1 - Haidong Tian
A1 - Rui Chang
JO - Journal of Zhejiang University Science C
SN - 1869-1951
Y1 - 2026
PB - Zhejiang University Press & Springer
DO - 10.1631/ENG.ITEE.2025.0034
ER -
Abstract: Training large-scale deep neural networks (DNNs) is prone to software and hardware failures, and critical failures often require full-machine reboots that substantially prolong training. Existing checkpoint-recovery solutions either cannot tolerate such critical failures or suffer from slow checkpointing and recovery due to constrained input/output (I/O) bandwidth. In this paper, we propose FastCheck, a checkpoint-recovery framework that accelerates checkpointing and recovery through parallel transmission and tailored compression. First, FastCheck partitions checkpoints into shards and leverages multiple nodes for parallel checkpointing and recovery. Second, it further reduces checkpoint size and overhead with delta compression for weights and index compression for momentum. Third, FastCheck applies a lightweight consensus protocol that accurately tracks node health, preventing checkpoint transmission to failed nodes. We implement FastCheck in PyTorch and evaluate it on multiple DNN models against two baselines. Experimental results show that FastCheck reduces checkpointing time by up to 78.42% and recovery time by up to 77.41%, while consistently improving efficiency across different training stages.
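
The abstract's first idea is shard-based parallel transmission. Below is a minimal sketch, not the authors' implementation, of how a PyTorch state dict could be split into size-balanced shards and copied to several peer nodes concurrently so that no single link's I/O bandwidth becomes the bottleneck; the names shard_checkpoint, send_to_node, and PEER_NODES are hypothetical, and the transport is a placeholder.

```python
# Hedged sketch: partition a checkpoint into shards and ship each shard
# to a different peer node in parallel. Illustrative only; FastCheck's
# actual sharding and transport are not specified in the abstract.
import io
import torch
from concurrent.futures import ThreadPoolExecutor

PEER_NODES = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical peers

def shard_checkpoint(state_dict, num_shards):
    """Greedily assign tensors to the least-loaded shard, by byte size."""
    shards = [dict() for _ in range(num_shards)]
    sizes = [0] * num_shards
    # Largest tensors first keeps the greedy assignment well balanced.
    items = sorted(state_dict.items(),
                   key=lambda kv: kv[1].numel() * kv[1].element_size(),
                   reverse=True)
    for name, tensor in items:
        i = sizes.index(min(sizes))  # least-loaded shard so far
        shards[i][name] = tensor
        sizes[i] += tensor.numel() * tensor.element_size()
    return shards

def send_to_node(node, shard):
    """Serialize one shard and ship it to a peer (placeholder transport)."""
    buf = io.BytesIO()
    torch.save(shard, buf)
    payload = buf.getvalue()
    # A real system would use RDMA/TCP/NCCL here; we just report the size.
    print(f"sent {len(payload)} bytes to {node}")

def parallel_checkpoint(model):
    shards = shard_checkpoint(model.state_dict(), len(PEER_NODES))
    with ThreadPoolExecutor(max_workers=len(PEER_NODES)) as pool:
        list(pool.map(send_to_node, PEER_NODES, shards))  # surface errors

parallel_checkpoint(torch.nn.Linear(1024, 1024))
```

Recovery would mirror this path: each peer streams its shard back in parallel, and the shards are merged into a single state dict before resuming training.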
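The second idea is tailored compression. The abstract does not spell out the schemes, so the sketch below shows one plausible reading: delta compression of weights by XOR-ing the raw bits of consecutive checkpoints (neighbouring values share sign and exponent bits, so the XOR stream deflates well), and index compression of momentum by storing only entries that changed beyond a threshold, with delta-encoded indices. All function names and the tolerance are assumptions.

```python
# Hedged sketch of delta compression (weights) and index compression
# (momentum). Illustrative reading of the abstract, not FastCheck's code.
import zlib
import numpy as np

def delta_compress_weights(curr: np.ndarray, prev: np.ndarray) -> bytes:
    """XOR current float32 weights against the previous checkpoint's bit
    patterns, then deflate the (mostly zero-prefixed) XOR stream."""
    xor = curr.view(np.uint32) ^ prev.view(np.uint32)
    return zlib.compress(xor.tobytes())

def delta_decompress_weights(blob: bytes, prev: np.ndarray) -> np.ndarray:
    """Exact inverse: XOR is self-inverting, so recovery is lossless."""
    xor = np.frombuffer(zlib.decompress(blob), dtype=np.uint32)
    return (xor ^ prev.view(np.uint32)).view(np.float32)

def index_compress_momentum(curr, prev, tol=1e-6):
    """Keep only momentum entries that moved by more than tol; store the
    new values plus delta-encoded (gap) indices, which compress well."""
    idx = np.flatnonzero(np.abs(curr - prev) > tol)
    gaps = np.diff(idx, prepend=0).astype(np.uint32)
    return zlib.compress(gaps.tobytes()), curr[idx].copy()

rng = np.random.default_rng(0)
prev_w = rng.standard_normal(1 << 20).astype(np.float32)
curr_w = prev_w + 1e-3 * rng.standard_normal(1 << 20).astype(np.float32)
blob = delta_compress_weights(curr_w, prev_w)
print(f"weights: {curr_w.nbytes} B -> {len(blob)} B compressed delta")
assert np.array_equal(delta_decompress_weights(blob, prev_w), curr_w)

prev_m = np.zeros(1 << 20, dtype=np.float32)
curr_m = prev_m.copy(); curr_m[::4096] = 0.5   # few entries changed
gap_blob, vals = index_compress_momentum(curr_m, prev_m)
print(f"momentum: stored {vals.size} of {curr_m.size} entries")
```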
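The third idea is health tracking so that shards are never transmitted to failed nodes. The consensus protocol itself is not described in the abstract; the sketch below shows only its simplest building block, a heartbeat-based failure detector consulted before each transmission. HEARTBEAT_TIMEOUT, record_heartbeat, and healthy_nodes are illustrative names.

```python
# Hedged sketch: a heartbeat table with a timeout, one building block a
# lightweight consensus protocol could use to track node health.
import time

HEARTBEAT_TIMEOUT = 5.0          # seconds without a heartbeat => suspect
_last_seen: dict[str, float] = {}

def record_heartbeat(node: str) -> None:
    """Called whenever a heartbeat message arrives from a peer."""
    _last_seen[node] = time.monotonic()

def healthy_nodes(candidates: list[str]) -> list[str]:
    """Filter out peers whose last heartbeat is older than the timeout."""
    now = time.monotonic()
    return [n for n in candidates
            if now - _last_seen.get(n, 0.0) < HEARTBEAT_TIMEOUT]

record_heartbeat("node-1")
record_heartbeat("node-2")
print(healthy_nodes(["node-1", "node-2", "node-3"]))  # node-3 never seen
```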