Affiliation(s):
1. School of Artificial Intelligence, China University of Geosciences (Beijing), Beijing 100083, China
2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
3. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, ZTE Corporation, Shenzhen 518057, China
Yun Teng, Dawei Sun, Shipeng Hu, Zhiyue Li, Guangyan Zhang, Haidong Tian, Rui Chang. FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/ENG.ITEE.2025.0034
@article{teng_fastcheck_inpress,
  title     = "FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression",
  author    = "Yun Teng and Dawei Sun and Shipeng Hu and Zhiyue Li and Guangyan Zhang and Haidong Tian and Rui Chang",
  journal   = "Frontiers of Information Technology \& Electronic Engineering",
  note      = "in press",
  publisher = "Zhejiang University Press \& Springer",
  doi       = "10.1631/ENG.ITEE.2025.0034"
}
%0 Journal Article
%T FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression
%A Yun Teng
%A Dawei Sun
%A Shipeng Hu
%A Zhiyue Li
%A Guangyan Zhang
%A Haidong Tian
%A Rui Chang
%J Frontiers of Information Technology & Electronic Engineering
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R https://doi.org/10.1631/ENG.ITEE.2025.0034
TY - JOUR
T1 - FastCheck: Fast Checkpointing and Recovery for DNN Training via Parallel Transmission and Compression
A1 - Yun Teng
A1 - Dawei Sun
A1 - Shipeng Hu
A1 - Zhiyue Li
A1 - Guangyan Zhang
A1 - Haidong Tian
A1 - Rui Chang
JO - Frontiers of Information Technology & Electronic Engineering
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - https://doi.org/10.1631/ENG.ITEE.2025.0034
ER -
Abstract: Training large-scale deep neural networks (DNNs) is prone to software and hardware failures, and critical failures often require full-machine reboots that substantially prolong training. Existing checkpoint-recovery solutions either cannot tolerate such critical failures or suffer from slow checkpointing and recovery due to constrained input/output (I/O) bandwidth. In this paper, we propose FastCheck, a checkpoint-recovery framework that accelerates checkpointing and recovery through parallel transmission and tailored compression. First, FastCheck partitions checkpoints into shards and leverages multiple nodes for parallel checkpointing and recovery. Second, it further reduces checkpoint size and overhead with delta compression for weights and index compression for momentum. Third, FastCheck applies a lightweight consensus protocol that accurately tracks node health, preventing checkpoint transmission to failed nodes. We implement FastCheck in PyTorch and evaluate it on multiple DNN models against two baselines. Experimental results show that FastCheck reduces checkpointing time by up to 78.42% and recovery time by up to 77.41%, while consistently improving efficiency across different training stages.
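Two of the mechanisms named in the abstract, partitioning a checkpoint into shards for parallel transmission to peer nodes and delta-compressing weights against the previous checkpoint, are concrete enough to sketch in PyTorch. The sketch below is an assumption-laden illustration, not the paper's implementation: the function names (shard_state_dict, delta_compress), the round-robin shard assignment, the bitwise-XOR delta, and zlib as the codec are all hypothetical stand-ins.

# Illustrative sketch only -- NOT FastCheck's actual implementation.
# (1) Partition a checkpoint into shards assigned round-robin to peer
# nodes for parallel transmission; (2) delta-compress weights against
# the previous checkpoint. All names here are hypothetical.
import io
import zlib
import torch

def shard_state_dict(state_dict, num_nodes):
    """Partition tensors round-robin so each node receives ~1/num_nodes
    of the checkpoint and shards can be sent in parallel."""
    shards = [dict() for _ in range(num_nodes)]
    for i, (name, tensor) in enumerate(state_dict.items()):
        shards[i % num_nodes][name] = tensor
    return shards

def delta_compress(curr, prev):
    """Lossless delta: XOR the raw float32 bits against the previous
    checkpoint. Unchanged or slightly changed weights yield mostly-zero
    deltas that compress well (zlib stands in for whatever codec the
    paper actually uses)."""
    delta = {k: curr[k].view(torch.int32) ^ prev[k].view(torch.int32)
             for k in curr}
    buf = io.BytesIO()
    torch.save(delta, buf)
    return zlib.compress(buf.getvalue())

def delta_decompress(blob, prev):
    """Invert delta_compress: decode the XOR delta and apply it to the
    previous checkpoint to recover the current weights bit-exactly."""
    delta = torch.load(io.BytesIO(zlib.decompress(blob)))
    return {k: (delta[k] ^ prev[k].view(torch.int32)).view(torch.float32)
            for k in delta}

# Toy usage: a 2-layer model checkpointed across 3 peer nodes.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
prev_ckpt = {k: v.detach().clone() for k, v in model.state_dict().items()}
# ... a training step would update the weights here ...
curr_ckpt = {k: v.detach().clone() for k, v in model.state_dict().items()}
for shard in shard_state_dict(curr_ckpt, num_nodes=3):
    base = {k: prev_ckpt[k] for k in shard}
    blob = delta_compress(shard, base)       # sent to one peer node
    restored = delta_decompress(blob, base)  # recovery path
    assert all(torch.equal(restored[k], shard[k]) for k in shard)

The XOR-on-raw-bits choice in the sketch makes the round trip bit-exact regardless of floating-point arithmetic; whether FastCheck uses this or another delta encoding is not stated in the abstract.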