JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering 2025 Vol.26 No.7 P.1099-1114

An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults

Author(s): Jiajia JIAO, Ran WEN, Hong YANG
Affiliation(s): College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
Corresponding email(s): jiaojiajia@shmtu.edu.cn
Key Words: Analytical model, Deep neural networks, Hardware transient faults, Fast evaluation, Automatic evaluation tool


Jiajia JIAO, Ran WEN, Hong YANG. An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(7): 1099-1114.

@article{title="An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults",
author="Jiajia JIAO, Ran WEN, Hong YANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="7",
pages="1099-1114",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400547"
}

%0 Journal Article
%T An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults
%A Jiajia JIAO
%A Ran WEN
%A Hong YANG
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 7
%P 1099-1114
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2400547

TY - JOUR
T1 - An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults
A1 - Jiajia JIAO
A1 - Ran WEN
A1 - Hong YANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 7
SP - 1099
EP - 1114
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400547
ER -


Abstract: Hardware transient faults have been shown to significantly affect deep neural networks (DNNs), whose safety-critical misclassification (SCM) in autonomous vehicles, healthcare, and space applications increases up to fourfold. However, evaluating the resulting accuracy degradation with accurate fault injection is time-consuming, requiring several hours or even a couple of days on a complete simulation platform. To accelerate the evaluation of hardware transient faults on DNNs, we design a unified, end-to-end automatic methodology, A-Mean. It uses the silent data corruption (SDC) rate of basic operations (such as convolution, addition, multiplication, ReLU, and max-pooling) and a static two-level mean calculation mechanism to rapidly compute the overall SDC rate, which is then used to estimate the general classification metric, accuracy, and the application-specific metric, SCM. More importantly, a max-policy is used to determine the SDC boundary of non-sequential structures in DNNs. Then, a worst-case scheme is used to calculate the enlarged SCM and halved accuracy under transient faults by merging the static SDC results with the original data from a one-time dynamic fault-free execution. All of these steps are implemented automatically, so the resulting easy-to-use tool can be employed for prompt evaluation of transient faults on diverse DNNs. In addition, a novel metric, "fault sensitivity", is defined to characterize the variation of the transient fault-induced higher SCM and lower accuracy. Comparative results against a state-of-the-art fault injection method, TensorFI+, on five DNN models and four datasets show that the proposed estimation method A-Mean achieves up to 922.80 times speedup, with just 4.20% SCM loss and 0.77% accuracy loss on average. The artifact of A-Mean is publicly available at https://github.com/breatrice321/A-Mean.
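
The static estimation flow described in the abstract can be illustrated with a minimal Python sketch. It is not the A-Mean implementation: the per-operation SDC rates, the network encoding, the function names, and in particular the worst-case formula (every corrupted output is assumed to turn a correct prediction into a safety-critical misclassification) are all assumptions made for illustration, since the abstract does not give the exact equations.

```python
from statistics import mean

# Hypothetical per-operation SDC rates (illustrative values, not from the paper).
OP_SDC = {"conv": 0.020, "add": 0.010, "mul": 0.015, "relu": 0.005, "maxpool": 0.008}


def layer_sdc(ops):
    """Level 1: mean SDC rate over the basic operations of one layer."""
    return mean(OP_SDC[op] for op in ops)


def block_sdc(block):
    """SDC rate of a block: a layer (list of op names) or a set of parallel branches."""
    if isinstance(block, dict):  # assumed encoding: {"branches": [[block, ...], ...]}
        # Max-policy: a non-sequential structure is bounded by its worst branch.
        return max(network_sdc(branch) for branch in block["branches"])
    return layer_sdc(block)


def network_sdc(blocks):
    """Level 2: mean SDC rate over the blocks of a network or branch."""
    return mean(block_sdc(b) for b in blocks)


def worst_case_metrics(fault_free_acc, fault_free_scm, sdc):
    """Merge the static SDC rate with fault-free results (assumed worst case)."""
    corrupted = fault_free_acc * sdc  # correct predictions assumed to be lost
    return fault_free_acc - corrupted, fault_free_scm + corrupted


# Toy network: one layer, a two-branch (non-sequential) structure, one layer.
example_net = [
    ["conv", "relu"],
    {"branches": [[["conv", "relu"], ["conv"]], [["add"]]]},
    ["maxpool"],
]
sdc = network_sdc(example_net)
acc, scm = worst_case_metrics(0.90, 0.02, sdc)
print(f"SDC={sdc:.5f} accuracy={acc:.5f} SCM={scm:.5f}")
```

Only one fault-free inference run is needed to obtain the inputs to `worst_case_metrics`; everything else is computed statically from the network structure, which is the source of the speedup over fault injection claimed in the abstract.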

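Chinese Summary (English translation)

An end-to-end automatic methodology to accelerate the accuracy evaluation of deep neural networks under hardware transient faults

Jiajia JIAO, Ran WEN, Hong YANG
College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
Abstract: Hardware transient faults have been shown to have a significant impact on deep neural networks; in autonomous vehicles, healthcare, and space applications in particular, the probability of safety-critical misclassification increases by up to 4 times. However, evaluating this inaccuracy with accurate fault injection is very time-consuming and may take several hours or even days on a complete simulation platform. To speed up the evaluation of hardware transient faults on deep neural networks, a unified end-to-end automatic methodology, A-Mean, is designed. It uses the silent data corruption rate of basic operations (such as convolution, addition, multiplication, activation functions, and max-pooling) and a static two-level mean calculation mechanism to quickly compute the overall silent data corruption rate, in order to estimate the general classification metric, accuracy, and the application-specific metric, safety-critical misclassification. More importantly, a max-policy is adopted to determine the silent data corruption boundary of non-sequential structures in deep neural networks. Then, the static safety-critical misclassification results are merged with the original data from a one-time dynamic fault-free execution, and a worst-case scheme is used to further calculate the enlarged safety-critical misclassification and halved accuracy under transient faults. In addition, all of the above steps have been automated, so that this easy-to-use automatic tool can be used to quickly evaluate transient faults on a variety of deep neural networks. Meanwhile, a new metric, "fault sensitivity", is defined to characterize the variation in the transient fault-induced increase in safety-critical misclassification and decrease in accuracy. Comparison with the state-of-the-art fault injection method TensorFI+ on 5 deep neural network models and 4 datasets shows that the proposed evaluation method A-Mean achieves a speedup of up to 922.80 times, with average safety-critical misclassification loss and accuracy loss of only 4.20% and 0.77%, respectively. The results of A-Mean are available at https://github.com/breatrice321/A-Mean.

Key words: Analytical model; Deep neural networks; Hardware transient faults; Fast evaluation; Automatic evaluation tool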


