
On-line Access: 2026-02-02
Received: 2025-09-12
Revision Accepted: 2026-01-23
Yun Teng1, Dawei Sun1, Shipeng Hu2, Zhiyue Li2, Guangyan Zhang2, Haidong Tian3, Rui Chang3. FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression[J]. Journal of Zhejiang University Science C, 2026 (in press). https://doi.org/10.1631/ENG.ITEE.2025.0034
@article{Teng2026FastCheck,
title="FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression",
author="Yun Teng and Dawei Sun and Shipeng Hu and Zhiyue Li and Guangyan Zhang and Haidong Tian and Rui Chang",
journal="Journal of Zhejiang University Science C",
year="2026",
note="in press",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/ENG.ITEE.2025.0034"
}
%0 Journal Article
%T FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression
%A Yun Teng
%A Dawei Sun
%A Shipeng Hu
%A Zhiyue Li
%A Guangyan Zhang
%A Haidong Tian
%A Rui Chang
%J Journal of Zhejiang University Science C
%@ 1869-1951
%D 2026
%I Zhejiang University Press & Springer
%R 10.1631/ENG.ITEE.2025.0034
TY - JOUR
T1 - FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression
A1 - Yun Teng
A1 - Dawei Sun
A1 - Shipeng Hu
A1 - Zhiyue Li
A1 - Guangyan Zhang
A1 - Haidong Tian
A1 - Rui Chang
JO - Journal of Zhejiang University Science C
SN - 1869-1951
Y1 - 2026
PB - Zhejiang University Press & Springer
DO - 10.1631/ENG.ITEE.2025.0034
ER -
Abstract: Training large-scale deep neural networks (DNNs) is prone to software and hardware failures, and critical failures often require full-machine reboots that substantially prolong training. Existing checkpoint-recovery solutions either cannot tolerate such critical failures or suffer from slow checkpointing and recovery due to constrained input/output (I/O) bandwidth. In this paper, we propose FastCheck, a checkpoint-recovery framework that accelerates checkpointing and recovery through parallel transmission and tailored compression. First, FastCheck partitions checkpoints into shards and leverages multiple nodes for parallel checkpointing and recovery. Second, it further reduces checkpoint size and overhead with delta compression for weights and index compression for momentum. Third, FastCheck applies a lightweight consensus protocol that accurately tracks node health, preventing checkpoint transmission to failed nodes. We implement FastCheck in PyTorch and evaluate it on multiple DNN models against two baselines. Experimental results show that FastCheck reduces checkpointing time by up to 78.42% and recovery time by up to 77.41%, while consistently improving efficiency across different training stages.
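
The abstract's first idea is shard-based parallel transmission. Below is a minimal sketch, not the authors' implementation, of how a PyTorch state dict could be split into size-balanced shards and copied to several peer nodes concurrently so that no single link's I/O bandwidth becomes the bottleneck; the names shard_checkpoint, send_to_node, and PEER_NODES are hypothetical, and the transport is a placeholder.

```python
# Hedged sketch: partition a checkpoint into shards and ship each shard
# to a different peer node in parallel. Illustrative only; FastCheck's
# actual sharding and transport are not specified in the abstract.
import io
import torch
from concurrent.futures import ThreadPoolExecutor

PEER_NODES = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical peers

def shard_checkpoint(state_dict, num_shards):
    """Greedily assign tensors to the least-loaded shard, by byte size."""
    shards = [dict() for _ in range(num_shards)]
    sizes = [0] * num_shards
    # Largest tensors first keeps the greedy assignment well balanced.
    items = sorted(state_dict.items(),
                   key=lambda kv: kv[1].numel() * kv[1].element_size(),
                   reverse=True)
    for name, tensor in items:
        i = sizes.index(min(sizes))  # least-loaded shard so far
        shards[i][name] = tensor
        sizes[i] += tensor.numel() * tensor.element_size()
    return shards

def send_to_node(node, shard):
    """Serialize one shard and ship it to a peer (placeholder transport)."""
    buf = io.BytesIO()
    torch.save(shard, buf)
    payload = buf.getvalue()
    # A real system would use RDMA/TCP/NCCL here; we just report the size.
    print(f"sent {len(payload)} bytes to {node}")

def parallel_checkpoint(model):
    shards = shard_checkpoint(model.state_dict(), len(PEER_NODES))
    with ThreadPoolExecutor(max_workers=len(PEER_NODES)) as pool:
        list(pool.map(send_to_node, PEER_NODES, shards))  # surface errors

parallel_checkpoint(torch.nn.Linear(1024, 1024))
```

Recovery would mirror this path: each peer streams its shard back in parallel, and the shards are merged into a single state dict before resuming training.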
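The second idea is tailored compression. The abstract does not spell out the schemes, so the sketch below shows one plausible reading: delta compression of weights by XOR-ing the raw bits of consecutive checkpoints (neighbouring values share sign and exponent bits, so the XOR stream deflates well), and index compression of momentum by storing only entries that changed beyond a threshold, with delta-encoded indices. All function names and the tolerance are assumptions.

```python
# Hedged sketch of delta compression (weights) and index compression
# (momentum). Illustrative reading of the abstract, not FastCheck's code.
import zlib
import numpy as np

def delta_compress_weights(curr: np.ndarray, prev: np.ndarray) -> bytes:
    """XOR current float32 weights against the previous checkpoint's bit
    patterns, then deflate the (mostly zero-prefixed) XOR stream."""
    xor = curr.view(np.uint32) ^ prev.view(np.uint32)
    return zlib.compress(xor.tobytes())

def delta_decompress_weights(blob: bytes, prev: np.ndarray) -> np.ndarray:
    """Exact inverse: XOR is self-inverting, so recovery is lossless."""
    xor = np.frombuffer(zlib.decompress(blob), dtype=np.uint32)
    return (xor ^ prev.view(np.uint32)).view(np.float32)

def index_compress_momentum(curr, prev, tol=1e-6):
    """Keep only momentum entries that moved by more than tol; store the
    new values plus delta-encoded (gap) indices, which compress well."""
    idx = np.flatnonzero(np.abs(curr - prev) > tol)
    gaps = np.diff(idx, prepend=0).astype(np.uint32)
    return zlib.compress(gaps.tobytes()), curr[idx].copy()

rng = np.random.default_rng(0)
prev_w = rng.standard_normal(1 << 20).astype(np.float32)
curr_w = prev_w + 1e-3 * rng.standard_normal(1 << 20).astype(np.float32)
blob = delta_compress_weights(curr_w, prev_w)
print(f"weights: {curr_w.nbytes} B -> {len(blob)} B compressed delta")
assert np.array_equal(delta_decompress_weights(blob, prev_w), curr_w)

prev_m = np.zeros(1 << 20, dtype=np.float32)
curr_m = prev_m.copy(); curr_m[::4096] = 0.5   # few entries changed
gap_blob, vals = index_compress_momentum(curr_m, prev_m)
print(f"momentum: stored {vals.size} of {curr_m.size} entries")
```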
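The third idea is health tracking so that shards are never transmitted to failed nodes. The consensus protocol itself is not described in the abstract; the sketch below shows only its simplest building block, a heartbeat-based failure detector consulted before each transmission. HEARTBEAT_TIMEOUT, record_heartbeat, and healthy_nodes are illustrative names.

```python
# Hedged sketch: a heartbeat table with a timeout, one building block a
# lightweight consensus protocol could use to track node health.
import time

HEARTBEAT_TIMEOUT = 5.0          # seconds without a heartbeat => suspect
_last_seen: dict[str, float] = {}

def record_heartbeat(node: str) -> None:
    """Called whenever a heartbeat message arrives from a peer."""
    _last_seen[node] = time.monotonic()

def healthy_nodes(candidates: list[str]) -> list[str]:
    """Filter out peers whose last heartbeat is older than the timeout."""
    now = time.monotonic()
    return [n for n in candidates
            if now - _last_seen.get(n, 0.0) < HEARTBEAT_TIMEOUT]

record_heartbeat("node-1")
record_heartbeat("node-2")
print(healthy_nodes(["node-1", "node-2", "node-3"]))  # node-3 never seen
```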