Journal of Zhejiang University

ENGINEERING Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

A survey of binary code representation technology

Author(s): Taiyan WANG, Qingsong XIE, Lu YU, Zulie PAN, Min ZHANG
Affiliation(s): College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
Corresponding email(s): zhangmindy@nudt.edu.cn
Key Words: Binary analysis; Binary code representation; Binary code feature selection; Binary code feature embedding


Taiyan WANG, Qingsong XIE, Lu YU, Zulie PAN, Min ZHANG. A survey of binary code representation technology[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400088

@article{title="A survey of binary code representation technology",
author="Taiyan WANG, Qingsong XIE, Lu YU, Zulie PAN, Min ZHANG",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2400088"
}

%0 Journal Article
%T A survey of binary code representation technology
%A Taiyan WANG
%A Qingsong XIE
%A Lu YU
%A Zulie PAN
%A Min ZHANG
%J Frontiers of Information Technology & Electronic Engineering
%P 671-694
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400088

TY - JOUR
T1 - A survey of binary code representation technology
A1 - Taiyan WANG
A1 - Qingsong XIE
A1 - Lu YU
A1 - Zulie PAN
A1 - Min ZHANG
JO - Frontiers of Information Technology & Electronic Engineering
SP - 671
EP - 694
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400088
ER -


Abstract: Binary analysis is a foundational technology that supports numerous applications in software engineering and security research. As software grows in scale and its architecture becomes more complex, binary analysis faces new challenges. To break through existing bottlenecks, researchers have applied artificial intelligence (AI) to the understanding and analysis of binary code. The core problem is binary code representation, i.e., how to use intelligent methods to generate representation vectors that carry semantic information for binary code and apply them to multiple downstream binary analysis tasks. In this paper, we provide a comprehensive survey of recent advances in binary code representation technology and present the workflow of existing research in two parts: binary code feature selection methods and binary code feature embedding methods. The feature selection part covers the definition and classification of features and feature construction: first, the abstract definition and classification of features are systematically explained; second, the process of constructing specific representations of features is introduced in detail. In the feature embedding part, the embedding methods are classified into four categories according to the intelligent semantic understanding models used, that is, according to how they use text-embedding models and graph-embedding models. Finally, we summarize the overall development of existing research and discuss some potential research directions for binary code representation technology.
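
The two-part workflow named in the abstract (feature selection, then feature embedding) can be illustrated with a minimal Python sketch. It is not a method from the survey: the function names `select_features`, `embed`, and `cosine` are hypothetical, the operand normalisation scheme is an assumption, and a simple bag-of-tokens count stands in for a learned text-embedding model. Function similarity is used as one example of a downstream task.

```python
import math
import re
from collections import Counter


def select_features(asm_lines):
    """Feature selection (illustrative): turn each instruction into a
    normalised token such as 'mov_REG_IMM' so that concrete register
    names and constants do not dominate the representation."""
    tokens = []
    for line in asm_lines:
        line = line.strip()
        if not line:
            continue
        opcode, _, rest = line.partition(" ")
        kinds = []
        for operand in (o.strip() for o in rest.split(",")):
            if not operand:
                continue
            if "[" in operand:
                kinds.append("MEM")   # memory reference, e.g. [rbp - 8]
            elif re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", operand):
                kinds.append("IMM")   # immediate constant or address
            else:
                kinds.append("REG")   # assumed to be a register name
        tokens.append("_".join([opcode.lower()] + kinds))
    return tokens


def embed(tokens):
    """Feature embedding (stand-in): a sparse bag-of-tokens vector.
    A real system would use a trained text- or graph-embedding model."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical toy functions, written as x86-style assembly text.
func_a = ["mov eax, 1", "add eax, ebx", "ret"]
func_b = ["mov ecx, 7", "add ecx, edx", "ret"]
func_c = ["push rbp", "mov rbp, rsp", "call 0x401000", "ret"]

vec_a, vec_b, vec_c = (embed(select_features(f)) for f in (func_a, func_b, func_c))
print("sim(a, b) =", cosine(vec_a, vec_b))
print("sim(a, c) =", cosine(vec_a, vec_c))
```

In this sketch `func_a` and `func_b` differ only in register names and constants, so they normalise to the same tokens and score higher than the structurally different `func_c`. Learned embedding models aim to capture this kind of semantic closeness without hand-written normalisation rules.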

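A survey of research progress in binary code representation technology

Taiyan WANG^1,2, Qingsong XIE^1,2, Lu YU^1,2, Zulie PAN^1,2, Min ZHANG^1,2
^1College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
^2Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
Summary: Binary analysis is an important foundational technology that supports numerous applications in software engineering and security research. As software continues to grow in scale and software architecture evolves in complexity, binary analysis technology faces entirely new challenges. To break through existing bottlenecks, researchers have applied artificial intelligence technology to binary code understanding and analysis. The core question is how to represent binary code, i.e., how to use intelligent methods to generate representation vectors containing semantic information for binary code so that they can be applied to a variety of downstream binary analysis tasks. This paper surveys and analyzes the latest research progress in binary code representation technology and presents the workflow of existing research in two parts: binary code feature selection methods and binary code feature embedding methods. The feature selection part mainly covers the definition and classification of features and feature construction. It first systematically explains the abstract definition and classification of features, and then describes in detail the process of constructing specific representations of features. In the feature embedding part, according to the different intelligent semantic understanding models used, and taking the usage of text-embedding models and graph-embedding models as the classification criterion, the embedding methods are divided into four categories and introduced. Finally, the paper summarizes the overall development of existing research and discusses some potential research directions related to binary code representation technology.

Key words: Binary analysis; Binary code representation; Binary code feature selection; Binary code feature embedding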


