Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

A survey of binary code representation technology

Abstract: Binary analysis is a foundational technology that supports numerous applications in software engineering and security research. As software grows in scale and its architectures become more complex, binary analysis faces new challenges. To break through existing bottlenecks, researchers have applied artificial intelligence (AI) to the understanding and analysis of binary code. The core problem is how to characterize binary code, i.e., how to use intelligent methods to generate representation vectors that carry semantic information about the binary code and can be applied to multiple downstream binary analysis tasks. In this paper, we comprehensively survey recent advances in binary code representation technology and present the workflow of existing research in two parts: binary code feature selection methods and binary code feature embedding methods. The feature selection part covers two topics: the definition and classification of features, and feature construction. We first systematically explain the abstract definition and classification of features, and then describe in detail how specific representations of features are constructed. In the feature embedding part, we classify the embedding methods into four categories according to the intelligent semantic understanding models they use, namely, how they employ text-embedding models and graph-embedding models. Finally, we summarize the overall development of existing research and discuss several potential research directions for binary code representation technology.

Key words: Binary analysis; Binary code representation; Binary code feature selection; Binary code feature embedding

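Chinese Summary: A survey of research progress in binary code representation technology

Taiyan WANG (王泰彦)1,2, Qingsong XIE (谢清松)1,2, Lu YU (于璐)1,2, Zulie PAN (潘祖烈)1,2, Min ZHANG (张旻)1,2
1College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
2Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
Abstract: Binary analysis is an important foundational technology that supports numerous applications in software engineering and security research. As software continues to grow in scale and software architectures evolve in complexity, binary analysis technology faces entirely new challenges. To break through existing bottlenecks, researchers have applied artificial intelligence technology to the understanding and analysis of binary code. The core issue is how to represent binary code, i.e., how to use intelligent methods to generate representation vectors containing semantic information for binary code so that they can be applied to a variety of downstream binary analysis tasks. This paper surveys and analyzes the latest research progress in binary code representation technology, and presents the workflow of existing research in two parts: binary code feature selection methods and binary code feature embedding methods. The feature selection part mainly covers the definition and classification of features, and feature construction. It first systematically explains the abstract definition and classification of features, and then describes in detail the process of constructing specific representations of features. In the feature embedding part, the embedding methods are divided into four categories according to the intelligent semantic understanding models used, taking the usage of text-embedding models and graph-embedding models as the classification criterion. Finally, the paper summarizes the overall development of existing research and discusses some potential research directions related to binary code representation technology.

Key words: Binary analysis; Binary code representation; Binary code feature selection; Binary code feature embedding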


DOI: 10.1631/FITEE.2400088
CLC number: TP312


On-line Access: 2025-06-04
Received: 2024-02-06
Revision Accepted: 2024-06-24
Crosschecked: 2025-06-04

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE