CLC number: TP391
On-line Access: 2025-07-28
Received: 2024-06-26
Revision Accepted: 2024-09-18
Crosschecked: 2025-07-30
Jiajia JIAO, Ran WEN, Hong YANG. An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(7): 1099-1114.
@article{title="An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults",
author="Jiajia JIAO, Ran WEN, Hong YANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="7",
pages="1099-1114",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400547"
}
%0 Journal Article
%T An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults
%A Jiajia JIAO
%A Ran WEN
%A Hong YANG
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 7
%P 1099-1114
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2400547
TY - JOUR
T1 - An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults
A1 - Jiajia JIAO
A1 - Ran WEN
A1 - Hong YANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 7
SP - 1099
EP - 1114
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400547
ER -
Abstract: Hardware transient faults have been shown to significantly affect deep neural networks (DNNs), increasing safety-critical misclassification (SCM) in autonomous vehicles, healthcare, and space applications by up to four times. However, evaluating the resulting accuracy loss with accurate fault injection is time-consuming, requiring several hours or even a couple of days on a complete simulation platform. To accelerate the evaluation of hardware transient faults on DNNs, we design a unified, end-to-end automatic methodology, A-Mean. It uses the silent data corruption (SDC) rate of basic operations (such as convolution, addition, multiplication, ReLU, and max-pooling) and a static two-level mean calculation mechanism to rapidly compute the overall SDC rate, from which it estimates the general classification metric, accuracy, and the application-specific metric, SCM. More importantly, a max-policy is used to determine the SDC boundary of non-sequential structures in DNNs. A worst-case scheme then calculates the enlarged SCM and halved accuracy under transient faults by merging the static SDC results with the original data from a one-time dynamic fault-free execution. All of these steps are implemented automatically, so the resulting easy-to-use tool can be employed for prompt evaluation of transient faults on diverse DNNs. In addition, a novel metric, "fault sensitivity," is defined to characterize the variation of the transient fault-induced higher SCM and lower accuracy. Comparisons with a state-of-the-art fault injection method, TensorFI+, on five DNN models and four datasets show that A-Mean achieves up to 922.80 times speedup, with just 4.20% SCM loss and 0.77% accuracy loss on average. The artifact of A-Mean is publicly available at https://github.com/breatrice321/A-Mean.
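The estimation flow described in the abstract can be pictured with a small sketch. This is only one plausible reading, not the A-Mean implementation: the abstract gives no formulas, so the per-operation SDC rates, the network encoding, the function names, and both worst-case formulas below are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical per-operation SDC rates (illustrative values, not from the paper).
OP_SDC = {"conv": 0.020, "add": 0.010, "mul": 0.015, "relu": 0.005, "maxpool": 0.008}


def layer_sdc(ops):
    """Level 1 (assumed): mean SDC rate over the basic operations in one layer."""
    return mean(OP_SDC[op] for op in ops)


def block_sdc(block):
    """SDC rate of a sequential layer or of a non-sequential (branching) block.

    Assumed encoding: a layer is a list of operation names; a branching
    block is a dict {"branches": [[layer, ...], ...]}.
    """
    if isinstance(block, dict):
        # Max-policy: bound the block by its most vulnerable branch.
        return max(network_sdc(branch) for branch in block["branches"])
    return layer_sdc(block)


def network_sdc(blocks):
    """Level 2 (assumed): mean SDC rate over the blocks of the network."""
    return mean(block_sdc(b) for b in blocks)


def worst_case_accuracy(clean_accuracy, sdc):
    """Assumption: every SDC turns a correct prediction into a wrong one."""
    return clean_accuracy * (1.0 - sdc)


def worst_case_scm(clean_scm, sdc):
    """Assumption: every SDC on a non-SCM input becomes a safety-critical miss."""
    return clean_scm + (1.0 - clean_scm) * sdc


# Toy network: one layer, one two-branch block, one layer.
toy_net = [
    ["conv", "relu", "maxpool"],
    {"branches": [[["conv", "relu"]], [["conv", "add"], ["mul"]]]},
    ["add", "relu"],
]
sdc = network_sdc(toy_net)
# The 0.90 accuracy and 0.05 SCM stand in for one fault-free run (made-up values).
print(f"estimated SDC rate: {sdc:.4f}")
print(f"worst-case accuracy: {worst_case_accuracy(0.90, sdc):.4f}")
print(f"worst-case SCM: {worst_case_scm(0.05, sdc):.4f}")
```

The point of the sketch is the cost structure: the SDC aggregation is purely static, and the only dynamic input is the fault-free accuracy and SCM from a single run, which is consistent with the abstract's claim of avoiding repeated fault injection campaigns.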