CLC number: O242
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2012-11-12
Xi-sheng Xiao, Ying-ping Huang, Xi-hui Zhang. Optimizing checkpoint for scientific simulations[J]. Journal of Zhejiang University Science C, 2012, 13(12): 891-900.
@article{title="Optimizing checkpoint for scientific simulations",
author="Xi-sheng Xiao and Ying-ping Huang and Xi-hui Zhang",
journal="Journal of Zhejiang University Science C",
volume="13",
number="12",
pages="891-900",
year="2012",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1200135"
}
%0 Journal Article
%T Optimizing checkpoint for scientific simulations
%A Xi-sheng Xiao
%A Ying-ping Huang
%A Xi-hui Zhang
%J Journal of Zhejiang University SCIENCE C
%V 13
%N 12
%P 891-900
%@ 1869-1951
%D 2012
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1200135
TY - JOUR
T1 - Optimizing checkpoint for scientific simulations
A1 - Xi-sheng Xiao
A1 - Ying-ping Huang
A1 - Xi-hui Zhang
JO - Journal of Zhejiang University Science C
VL - 13
IS - 12
SP - 891
EP - 900
SN - 1869-1951
Y1 - 2012
PB - Zhejiang University Press & Springer
DO - 10.1631/jzus.C1200135
ER -
Abstract: Restarting a long-running simulation from the beginning after a failure is extremely time-consuming. Checkpointing is a viable solution that enables a simulation to be resumed from the most recent checkpoint rather than from the beginning. We study three models to determine the optimal interval between consecutive checkpoints so that the total execution time is minimized, and we demonstrate that optimal checkpointing can facilitate self-optimization. This study greatly advances our knowledge of and practice in optimizing long-running scientific simulations.
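The trade-off behind an optimal checkpoint interval can be illustrated with the classical first-order approximation of Young (1974), which is not one of the three models studied in this paper. The sketch below assumes a constant checkpoint cost C, failures arriving at a constant rate with mean time between failures M, and C much smaller than M. The function names and the sample values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal interval: sqrt(2 * C * M)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order overhead per unit of useful work.

    C / T is the cost of writing checkpoints; T / (2 * M) is the expected
    rework, assuming a failure loses half an interval on average.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 2-minute checkpoints, one failure every 100 minutes.
    cost, mtbf = 2.0, 100.0
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(interval, overhead_fraction(interval, cost, mtbf))
```

Checkpointing more often than the computed interval wastes time writing checkpoints, while checkpointing less often increases the work lost per failure; under these assumptions the overhead is smallest at the computed interval.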