CLC number: TP312
On-line Access: 2025-06-04
Received: 2024-02-06
Revision Accepted: 2024-06-24
Crosschecked: 2025-06-04
Taiyan WANG, Qingsong XIE, Lu YU, Zulie PAN, Min ZHANG. A survey of binary code representation technology[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(5): 671-694.
@article{Wang2025,
title="A survey of binary code representation technology",
author="Taiyan WANG and Qingsong XIE and Lu YU and Zulie PAN and Min ZHANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="5",
pages="671-694",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400088"
}
%0 Journal Article
%T A survey of binary code representation technology
%A Taiyan WANG
%A Qingsong XIE
%A Lu YU
%A Zulie PAN
%A Min ZHANG
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 5
%P 671-694
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2400088
TY - JOUR
T1 - A survey of binary code representation technology
A1 - Taiyan WANG
A1 - Qingsong XIE
A1 - Lu YU
A1 - Zulie PAN
A1 - Min ZHANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 5
SP - 671
EP - 694
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400088
ER -
Abstract: Binary analysis is a foundational technology that supports numerous applications in software engineering and security research. As software grows in scale and its architecture becomes more complex, binary analysis faces new challenges. To break through existing bottlenecks, researchers have applied artificial intelligence (AI) to the understanding and analysis of binary code. The core problem is binary code representation, i.e., how to use intelligent methods to generate representation vectors that capture the semantic information of binary code and can be applied to multiple downstream binary analysis tasks. In this paper, we provide a comprehensive survey of recent advances in binary code representation technology and present the workflow of existing research in two parts: binary code feature selection methods and binary code feature embedding methods. The feature selection part covers the definition and classification of features, followed by feature construction: the abstract definition and classification of features are explained systematically, and then the process of constructing specific feature representations is introduced in detail. In the feature embedding part, the embedding methods are classified into four categories according to how they use text-embedding models and graph-embedding models. Finally, we summarize the overall development of existing research and discuss potential research directions for binary code representation technology.
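To make the two-stage workflow in the abstract concrete, the following is a minimal Python sketch, not a method taken from the survey. It assumes a toy feature selection step (normalizing assembly tokens and masking immediates), replaces a learned text-embedding model with a simple hashed bag-of-tokens vector, and uses cosine similarity as a stand-in downstream task. The names `normalize`, `embed`, `cosine`, and `DIM`, as well as the sample functions, are hypothetical.

```python
import hashlib
import math
import re

DIM = 64  # embedding size; an arbitrary choice for this sketch


def normalize(instruction: str) -> list[str]:
    """Feature selection (toy): split one assembly line into tokens
    and mask numeric constants so they do not dominate the representation."""
    raw = re.findall(r"0x[0-9a-f]+|\d+|[a-z_][a-z0-9_]*", instruction.lower())
    return ["<imm>" if tok[0].isdigit() else tok for tok in raw]


def embed(function_asm: list[str]) -> list[float]:
    """Feature embedding (stand-in): hashed bag-of-tokens, L2-normalized.
    A real system would use a learned text- or graph-embedding model here."""
    vec = [0.0] * DIM
    for line in function_asm:
        for tok in normalize(line):
            digest = hashlib.sha256(tok.encode()).digest()
            vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Downstream task (toy): similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical functions: func_b differs from func_a only in a constant.
func_a = ["push rbp", "mov rbp, rsp", "mov eax, 0x10", "pop rbp", "ret"]
func_b = ["push rbp", "mov rbp, rsp", "mov eax, 0x20", "pop rbp", "ret"]
func_c = ["xor eax, eax", "call foo", "ret"]

print("a vs b:", cosine(embed(func_a), embed(func_b)))
print("a vs c:", cosine(embed(func_a), embed(func_c)))
```

Because immediates are masked during normalization, `func_a` and `func_b` map to the same vector, while the unrelated `func_c` scores lower. The hashed vector ignores instruction order and control flow, which is the kind of semantic information the learned embedding models covered by the survey aim to capture.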