Journal of Zhejiang University

Frontiers of Information Technology & Electronic Engineering 2017 Vol.18 No.9 P.1336-1347

Automatic malware classification and new malware detection using machine learning

Author(s): Liu Liu, Bao-sheng Wang, Bo Yu, Qiu-xi Zhong
Affiliation(s): 1. College of Computer, National University of Defense Technology, Changsha 410073, China
Corresponding email(s): hotmailliuliu@163.com, wbshengnudt@163.com, BoYUnudt@sina.com, Qiuxizhong@163.com
Key Words: Malware classification, Machine learning, n-gram, Gray-scale image, Feature extraction, Malware detection


Liu Liu, Bao-sheng Wang, Bo Yu, Qiu-xi Zhong. Automatic malware classification and new malware detection using machine learning[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(9): 1336-1347.

@article{title="Automatic malware classification and new malware detection using machine learning",
author="Liu Liu, Bao-sheng Wang, Bo Yu, Qiu-xi Zhong",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="9",
pages="1336-1347",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601325"
}

%0 Journal Article
%T Automatic malware classification and new malware detection using machine learning
%A Liu Liu
%A Bao-sheng Wang
%A Bo Yu
%A Qiu-xi Zhong
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 9
%P 1336-1347
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601325

TY - JOUR
T1 - Automatic malware classification and new malware detection using machine learning
A1 - Liu Liu
A1 - Bao-sheng Wang
A1 - Bo Yu
A1 - Qiu-xi Zhong
JO - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 9
SP - 1336
EP - 1347
SN - 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1601325
ER -


Abstract: The explosive growth of malware variants poses a major threat to information security. Traditional signature-based anti-virus systems fail to classify unknown malware into their corresponding families and to detect new kinds of malware programs. Therefore, we propose a machine-learning-based malware analysis system composed of three modules: data processing, decision making, and new malware detection. The data processing module extracts malware features from gray-scale images, Opcode n-grams, and import functions. The decision-making module uses these features to classify the malware and to identify suspicious malware. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. Our approach is evaluated on more than 20 000 malware instances collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify unknown malware with a best accuracy of 98.9% and successfully detect 86.7% of the new malware.
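As a rough illustration of two of the feature types named in the abstract, the sketch below lays a byte string out as a gray-scale pixel grid and counts Opcode n-grams with a sliding window. It is not the authors' implementation: the function names, the image width, the zero padding, and the choice of n are all assumptions made for the example, and the abstract does not state how the real system sets them.

```python
from collections import Counter
from typing import List, Sequence


def bytes_to_gray_image(data: bytes, width: int = 4) -> List[List[int]]:
    """Lay raw bytes out row by row; each byte (0-255) is one gray pixel.

    The final row is zero-padded so every row has `width` pixels
    (the padding scheme is an assumption of this sketch).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded = data + bytes((-len(data)) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def opcode_ngrams(opcodes: Sequence[str], n: int = 2) -> Counter:
    """Count every run of n consecutive opcodes (a sliding window)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


if __name__ == "__main__":
    # Hypothetical inputs: a few header-like bytes and a toy opcode trace.
    print(bytes_to_gray_image(b"\x4d\x5a\x90\x00\x03\x00", width=4))
    print(opcode_ngrams(["push", "mov", "push", "mov", "call"], n=2))
```

In a pipeline of this kind, the pixel grid would typically feed a texture descriptor and the n-gram counts would become a frequency vector for a classifier; both steps are left out here.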

Automated malware classification and new malware detection based on machine learning

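Summary: The explosive growth of malware poses a major threat to information security. Traditional signature-based anti-virus systems cannot classify unknown malware into the corresponding malware families or detect new malware. We therefore propose a machine-learning-based malware analysis system composed of three subsystems: data processing, decision making, and new malware detection. The data processing subsystem includes three feature extraction methods: texture features of gray-scale images, Opcode features, and API features. The decision-making subsystem is used to classify malware and to confirm suspicious malware. Finally, the detection subsystem uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware. We evaluated the proposed method on a set of more than 20 000 malware samples collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify unknown malware with an accuracy of up to 98.9%, and that the detection rate for new malware is 86.7%.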
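The idea behind shared nearest neighbor similarity can be sketched briefly: two samples count as similar when their k-nearest-neighbor lists overlap. The code below is a minimal illustration under assumptions, not the system described here. It uses Euclidean distance, a hypothetical `k`, and the common convention that similarity is zero unless the two points appear in each other's neighbor lists. The density and core-point steps of a full SNN clustering algorithm are omitted.

```python
from math import dist
from typing import List, Sequence, Set


def knn_sets(points: Sequence[Sequence[float]], k: int) -> List[Set[int]]:
    """For each point, return the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors: List[Set[int]], i: int, j: int) -> int:
    """Shared-neighbor count, kept only when i and j list each other."""
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])


if __name__ == "__main__":
    # Two well-separated toy groups standing in for feature vectors.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    nb = knn_sets(pts, k=2)
    print(snn_similarity(nb, 0, 1), snn_similarity(nb, 0, 3))
```

Points from different groups share no neighbors, so their similarity is zero. That gap is what lets a clustering step separate a previously unseen family from known ones.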

Keywords: Malware classification; Machine learning; n-gram; Gray-scale image; Feature extraction; Malware detection


