Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2025 Vol.26 No.5 P.671-694
A survey of binary code representation technology
Abstract: Binary analysis is a foundational technology that supports numerous applications in software engineering and security research. As software grows in scale and software architectures become more complex, binary analysis faces new challenges. To break through existing bottlenecks, researchers have applied artificial intelligence (AI) to the understanding and analysis of binary code. The core problem is how to characterize binary code, i.e., how to use intelligent methods to generate representation vectors that carry semantic information for binary code and then apply them to multiple downstream binary analysis tasks. In this paper, we provide a comprehensive survey of recent advances in binary code representation technology and introduce the workflow of existing research in two parts: binary code feature selection methods and binary code feature embedding methods. The feature selection part covers two topics: the definition and classification of features, and feature construction. First, the abstract definition and classification of features are systematically explained; second, the process of constructing specific representations of features is introduced in detail. In the feature embedding part, the embedding methods are classified into four categories according to the intelligent semantic understanding models they use, specifically their usage of text-embedding models and graph-embedding models. Finally, we summarize the overall development of existing research and discuss potential research directions for binary code representation technology.
Key words: Binary analysis; Binary code representation; Binary code feature selection; Binary code feature embedding
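1College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
2Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China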
DOI: 10.1631/FITEE.2400088
CLC number: TP312
On-line Access: 2025-06-04
Received: 2024-02-06
Revision Accepted: 2024-06-24
Crosschecked: 2025-06-04