
CLC number: TP393

On-line Access: 2026-01-09

Received: 2025-02-16

Revision Accepted: 2025-10-17

Crosschecked: 2026-01-11


 ORCID:

Shicheng ZHOU

https://orcid.org/0000-0001-9686-3836

Jingju LIU

https://orcid.org/0009-0005-9506-6903

Yuliang LU

https://orcid.org/0000-0002-8502-9907

Yue ZHANG

https://orcid.org/0009-0007-3570-2132


Frontiers of Information Technology & Electronic Engineering  2025 Vol.26 No.12 P.2511-2528

http://doi.org/10.1631/FITEE.2500100


Mind the Gap: towards generalizable autonomous penetration testing via domain randomization and meta-reinforcement learning


Author(s):  Shicheng ZHOU, Jingju LIU, Yuliang LU, Jiahai YANG, Yue ZHANG, Jie CHEN

Affiliation(s):
1. College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
2. Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
3. Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing 100084, China
4. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

Corresponding email(s):   zhoushicheng@nudt.edu.cn, liujingju17@nudt.edu.cn, luyuliang@nudt.edu.cn, zhangyue@nudt.edu.cn

Key Words:  Cybersecurity, Penetration testing, Reinforcement learning, Domain randomization, Meta-reinforcement learning, Large language model


Shicheng ZHOU, Jingju LIU, Yuliang LU, Jiahai YANG, Yue ZHANG, Jie CHEN. Mind the Gap: towards generalizable autonomous penetration testing via domain randomization and meta-reinforcement learning[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(12): 2511-2528.

@article{title="Mind the Gap: towards generalizable autonomous penetration testing via domain randomization and meta-reinforcement learning",
author="Shicheng ZHOU, Jingju LIU, Yuliang LU, Jiahai YANG, Yue ZHANG, Jie CHEN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="12",
pages="2511-2528",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2500100"
}

%0 Journal Article
%T Mind the Gap: towards generalizable autonomous penetration testing via domain randomization and meta-reinforcement learning
%A Shicheng ZHOU
%A Jingju LIU
%A Yuliang LU
%A Jiahai YANG
%A Yue ZHANG
%A Jie CHEN
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 12
%P 2511-2528
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2500100

TY - JOUR
T1 - Mind the Gap: towards generalizable autonomous penetration testing via domain randomization and meta-reinforcement learning
A1 - Shicheng ZHOU
A1 - Jingju LIU
A1 - Yuliang LU
A1 - Jiahai YANG
A1 - Yue ZHANG
A1 - Jie CHEN
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 12
SP - 2511
EP - 2528
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2500100
ER -


Abstract: 
With the increasing number of vulnerabilities exposed on the Internet, autonomous penetration testing (pentesting) has emerged as a promising research area, and reinforcement learning (RL) is a natural fit for studying it. However, two key challenges limit the applicability of RL-based autonomous pentesting in real-world scenarios. The first is the training environment dilemma: training agents in simulated environments is sample-efficient, but ensuring that those environments are realistic is difficult. The second is poor generalization ability: agents' policies often perform poorly when transferred to unseen scenarios, and even slight environmental changes can cause a significant generalization gap. To address both challenges, we propose a generalizable autonomous pentesting framework termed GAP, which aims to achieve efficient policy training in realistic environments and to train generalizable agents capable of drawing inferences about unseen cases from a single instance. GAP introduces a real-to-sim-to-real pipeline that enables end-to-end policy learning in unknown real environments while constructing realistic simulations, and it improves agents' generalization ability by combining domain randomization with meta-RL. We are among the first to apply domain randomization to autonomous pentesting, and we propose a large language model (LLM)-powered domain randomization method for synthetic environment generation. We further apply meta-RL over the synthetic environments to improve agents' generalization ability in unseen environments. Combining the two methods effectively bridges the generalization gap and improves agents' policy adaptation performance. Experiments are conducted on various vulnerable virtual machines, with results showing that GAP enables policy learning in various realistic environments, achieves zero-shot policy transfer in similar environments, and achieves rapid policy adaptation in dissimilar environments.
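The abstract describes domain randomization for generating a distribution of synthetic training environments (in the paper, powered by an LLM). As a rough, non-authoritative illustration of the underlying idea of randomizing environment parameters, the following sketch samples toy pentesting environments with varied host counts, exposed services, and vulnerabilities. All names here (SyntheticEnv, SERVICE_POOL, CVE_POOL, the sampling probabilities) are hypothetical and not from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class SyntheticEnv:
    """A toy pentesting environment configuration (illustrative only)."""
    num_hosts: int
    services: dict    # host id -> list of exposed services
    vulnerable: dict  # host id -> list of (service, vulnerability id) pairs

SERVICE_POOL = ["ssh", "http", "ftp", "smb", "mysql"]
CVE_POOL = {"ssh": ["CVE-A"], "http": ["CVE-B", "CVE-C"], "ftp": ["CVE-D"],
            "smb": ["CVE-E"], "mysql": ["CVE-F"]}

def randomize_env(rng: random.Random, min_hosts=3, max_hosts=8) -> SyntheticEnv:
    """Sample one synthetic environment by randomizing its parameters."""
    n = rng.randint(min_hosts, max_hosts)
    services, vulnerable = {}, {}
    for h in range(n):
        # Each host exposes 1-3 services drawn from the pool.
        svcs = rng.sample(SERVICE_POOL, k=rng.randint(1, 3))
        services[h] = svcs
        # Each exposed service is vulnerable with some probability.
        vulnerable[h] = [(s, rng.choice(CVE_POOL[s])) for s in svcs
                         if rng.random() < 0.5]
    return SyntheticEnv(n, services, vulnerable)

rng = random.Random(0)
envs = [randomize_env(rng) for _ in range(100)]  # a randomized training distribution
```

An agent trained across such a distribution, rather than a single fixed environment, is less likely to overfit to one network layout; the paper's LLM-based variant replaces the hand-written pools with model-generated configurations.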

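The abstract's other ingredient is meta-RL: meta-training across the randomized environments so that a few adaptation steps suffice in a new one (the reference list includes MAML, Finn et al., 2017). The following is a minimal first-order (Reptile-style) sketch on a toy quadratic objective standing in for per-environment loss; it illustrates the inner-adapt/outer-update structure only, not the paper's actual algorithm, and every name and hyperparameter here is an assumption.

```python
import numpy as np

def task_loss(theta, target):
    """Toy per-task objective: quadratic distance to a task-specific target."""
    return 0.5 * np.sum((theta - target) ** 2)

def grad(theta, target):
    return theta - target

def inner_adapt(theta, target, steps=5, lr=0.1):
    """Inner loop: a few gradient steps on one task (one synthetic environment)."""
    th = theta.copy()
    for _ in range(steps):
        th -= lr * grad(th, target)
    return th

# Outer loop: Reptile-style first-order meta-update across randomized tasks.
rng = np.random.default_rng(0)
theta = np.zeros(2)            # meta-initialization being learned
meta_lr = 0.5
tasks = [rng.normal(0.0, 1.0, size=2) for _ in range(200)]  # task = target vector
for target in tasks:
    adapted = inner_adapt(theta, target)
    theta += meta_lr * (adapted - theta)   # move init toward the adapted solution
```

After meta-training, theta sits near the center of the task distribution, so a handful of inner-loop steps adapt it quickly to any newly sampled task; this mirrors the zero-shot-transfer versus rapid-adaptation behavior the abstract reports for similar versus dissimilar environments.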


Reference

[1]Beck J, Vuorio R, Liu EZ, et al., 2023. A survey of meta-reinforcement learning.

[2]Bo L, Zhang TZ, Zhang HX, et al., 2024. 3D UAV path planning in unknown environment: a transfer reinforcement learning method based on low-rank adaption. Adv Eng Inform, 62:102920.

[3]Chen J, Wu DD, Xie RY, 2023. Artificial intelligence algorithms for cyberspace security applications: a technological and status review. Front Inform Technol Electron Eng, 24(8):1117-1142.

[4]Chen JY, Hu SL, Zheng HB, et al., 2023. GAIL-PT: an intelligent penetration testing framework with generative adversarial imitation learning. Comput Secur, 126:103055.

[5]Chen XY, Hu JC, Jin C, et al., 2022. Understanding domain randomization for sim-to-real transfer. Proc 10th Int Conf on Learning Representations, p.1-28.

[6]Cobbe K, Klimov O, Hesse C, et al., 2019. Quantifying generalization in reinforcement learning. https://arxiv.org/abs/1812.02341

[7]Feng S, Sun HW, Yan XT, et al., 2023. Dense reinforcement learning for safety validation of autonomous vehicles. Nature, 615:620-627.

[8]Finn C, Abbeel P, Levine S, 2017. Model-agnostic meta-learning for fast adaptation of deep networks. Proc 34th Int Conf on Machine Learning, p.1126-1135.

[9]Guo X, Chen YQ, 2024. Generative AI for synthetic data generation: methods, challenges and the future.

[10]Holm H, 2023. Lore a red team emulation tool. IEEE Trans Depend Secur Comput, 20(2):1596-1608.

[11]Horváth D, Erdös G, Istenes Z, et al., 2023. Object detection using sim2real domain randomization for robotic applications. IEEE Trans Robotics, 39(2):1225-1243.

[12]Hospedales TM, Antoniou A, Micaelli P, et al., 2022. Meta-learning in neural networks: a survey. IEEE Trans Pattern Anal Mach Intell, 44(9):5149-5169.

[13]Huang HC, Ye DH, Shen L, et al., 2023. Curriculum-based asymmetric multi-task reinforcement learning. IEEE Trans Pattern Anal Mach Intell, 45(6):7258-7269.

[14]Ilic N, Dasic D, Vucetic M, et al., 2024. Distributed web hacking by adaptive consensus-based reinforcement learning. Artif Intell, 326:104032.

[15]Schwartz J, Kurniawati H, 2019. NetworkAttackSimulator. https://github.com/Jjschwartz/NetworkAttackSimulator [Accessed on Feb. 16, 2025].

[16]Kirk R, Zhang A, Grefenstette E, et al., 2023. A survey of zero-shot generalisation in deep reinforcement learning. J Artif Intell Res, 76:201-264.

[17]Li QY, Wang RP, Li D, et al., 2024. DynPen: automated penetration testing in dynamic network scenarios using deep reinforcement learning. IEEE Trans Inform Forens Secur, 19:8966-8981.

[18]Li ZY, Zhu HX, Lu ZR, et al., 2023. Synthetic data generation with large language models for text classification: potential and limitations. Proc Conf on Empirical Methods in Natural Language Processing, p.10443-10461.

[19]Lyle C, Rowland M, Dabney W, et al., 2022. Learning dynamics and generalization in deep reinforcement learning. Proc Int Conf on Machine Learning, p.14560-14581.

[20]Maeda R, Mimura M, 2021. Automating post-exploitation with deep reinforcement learning. Comput Secur, 100:102108.

[21]Metelli AM, 2024. Recent advancements in inverse reinforcement learning. Proc 38th AAAI Conf on Artificial Intelligence, p.22680.

[22]Microsoft Defender Research Team, 2021. CyberBattleSim. https://github.com/microsoft/cyberbattlesim [Accessed on Feb. 16, 2025].

[23]Nguyen HPT, Hasegawa K, Fukushima K, et al., 2025. PenGym: realistic training environment for reinforcement learning pentesting agents. Comput Secur, 148:104140.

[24]Parisi GI, Kemker R, Part JL, et al., 2019. Continual lifelong learning with neural networks: a review. Neur Netw, 113:54-71.

[25]Schulman J, Wolski F, Dhariwal P, et al., 2017. Proximal policy optimization algorithms. https://arxiv.org/abs/1707.06347

[26]Shuster K, Poff S, Chen MY, et al., 2021. Retrieval augmentation reduces hallucination in conversation. Proc Findings of the Association for Computational Linguistics, p.3784-3803.

[27]Takaesu I, 2018. DeepExploit. https://github.com/13o-bbr-bbq/machine_learning_security/blob/master/DeepExploit [Accessed on Feb. 16, 2025].

[28]Team GLM, 2024. ChatGLM: a family of large language models from GLM-130B to GLM-4 All Tools.

[29]Tobin J, Fong R, Ray A, et al., 2017. Domain randomization for transferring deep neural networks from simulation to the real world. Proc IEEE/RSJ Int Conf on Intelligent Robots and Systems, p.23-30.

[30]Tran K, Akella A, Standen M, et al., 2021. Deep hierarchical reinforcement agents for automated penetration testing.

[31]Wang KX, Kang BY, Shao J, et al., 2020. Improving generalization in reinforcement learning with mixture regularization. Proc Annual Conf on Neural Information Processing Systems, p.7968-7978.

[32]Wang KX, Reimers N, Gurevych I, 2021. TSDAE: using Transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning. Proc Findings of the Association for Computational Linguistics, p.671-688.

[33]Yang YZ, Chen MX, Fu HH, et al., 2023. SetTron: towards better generalisation in penetration testing with reinforcement learning. Proc IEEE Global Communications Conf, p.4662-4667.

[34]Yang YZ, Chen LD, Liu S, et al., 2025. Behaviour-diverse automatic penetration testing: a coverage-based deep reinforcement learning approach. Front Comput Sci, 19(3):193309.

[35]Ye DY, Zhu TQ, Gao K, et al., 2024. Defending against label-only attacks via meta-reinforcement learning. IEEE Trans Inform Forens Secur, 19:3295-3308.

[36]Zhao WS, Queralta JP, Westerlund T, 2020. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. Proc IEEE Symp Series on Computational Intelligence, p.737-744.

[37]Zhou SC, Liu JJ, Lu YL, et al., 2024. APRIL: towards scalable and transferable autonomous penetration testing in large action space via action embedding. IEEE Trans Depend Secur Comput, 22(3):2443-2459.

[38]Zhu ZD, Lin KX, Jain AK, et al., 2023. Transfer learning in deep reinforcement learning: a survey. IEEE Trans Pattern Anal Mach Intell, 45(11):13344-13362.


Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2026 Journal of Zhejiang University-SCIENCE