CLC number: TP3

On-line Access: 2018-03-10

Received: 2017-11-25

Revision Accepted: 2018-01-22

Crosschecked: 2018-01-26


 ORCID:

Jian Cheng

http://orcid.org/0000-0003-1289-2758


Frontiers of Information Technology & Electronic Engineering  2018 Vol.19 No.1 P.64-77

http://doi.org/10.1631/FITEE.1700789


Recent advances in efficient computation of deep convolutional neural networks


Author(s):  Jian Cheng, Pei-song Wang, Gang Li, Qing-hao Hu, Han-qing Lu

Affiliation(s):  National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Corresponding email(s):   jcheng@nlpr.ia.ac.cn, peisong.wang@nlpr.ia.ac.cn, gang.li@nlpr.ia.ac.cn, qinghao.hu@nlpr.ia.ac.cn

Key Words:  Deep neural networks, Acceleration, Compression, Hardware accelerator


Jian Cheng, Pei-song Wang, Gang Li, Qing-hao Hu, Han-qing Lu. Recent advances in efficient computation of deep convolutional neural networks[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(1): 64-77.

@article{FITEE1700789,
title="Recent advances in efficient computation of deep convolutional neural networks",
author="Jian Cheng and Pei-song Wang and Gang Li and Qing-hao Hu and Han-qing Lu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="1",
pages="64-77",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1700789"
}

%0 Journal Article
%T Recent advances in efficient computation of deep convolutional neural networks
%A Jian Cheng
%A Pei-song Wang
%A Gang Li
%A Qing-hao Hu
%A Han-qing Lu
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 1
%P 64-77
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1700789

TY - JOUR
T1 - Recent advances in efficient computation of deep convolutional neural networks
A1 - Jian Cheng
A1 - Pei-song Wang
A1 - Gang Li
A1 - Qing-hao Hu
A1 - Han-qing Lu
JO - Frontiers of Information Technology & Electronic Engineering
VL - 19
IS - 1
SP - 64
EP - 77
SN - 2095-9184
Y1 - 2018
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1700789
ER -


Abstract: 
Deep neural networks have evolved remarkably over the past few years and are currently the fundamental tools of many intelligent systems. At the same time, the computational complexity and resource consumption of these networks continue to increase. This poses a significant challenge to the deployment of such networks, especially in real-time applications or on resource-limited devices. Thus, network acceleration has become a hot topic within the deep learning community. For hardware implementation of deep neural networks, a number of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both the algorithm and hardware points of view. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher–student networks, compact network design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.

Recent advances in efficient computation of deep convolutional neural networks

Summary: Deep neural networks have developed rapidly in recent years and have become fundamental tools of many intelligent systems. At the same time, the computational complexity and resource consumption of deep networks keep increasing, which poses severe challenges to their deployment, especially in real-time applications or on devices with limited resources. Network acceleration has therefore become a hot topic in deep learning. To improve the hardware performance of deep neural networks, a large number of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have emerged in recent years. This paper gives a detailed and comprehensive summary of progress in network acceleration, compression, and hardware/software co-designed accelerators. In particular, network pruning, low-rank approximation, network quantization, teacher–student networks, compact network design, and hardware accelerators are analyzed in depth. Finally, some future research directions in this field are discussed.

Keywords: Deep neural networks; Acceleration; Compression; Hardware accelerator
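
As a rough illustration of three of the surveyed technique families, the sketch below applies magnitude-based weight pruning, low-rank (truncated singular value decomposition) approximation, and symmetric uniform weight quantization to a single weight matrix in NumPy. It is a minimal sketch under illustrative assumptions: the helper names (prune_by_magnitude, low_rank_approx, quantize_uniform) and the 50% sparsity, rank-32, and 8-bit settings are chosen for the example only and do not reproduce any specific method from this paper or the works it surveys.

import numpy as np

def prune_by_magnitude(weights, sparsity):
    # Zero out the smallest-magnitude entries so that roughly `sparsity` of them become zero.
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= threshold)

def low_rank_approx(weights, rank):
    # Truncated SVD: factor an m-by-n matrix W into A (m x rank) and B (rank x n) with W ~= A @ B.
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

def quantize_uniform(weights, num_bits=8):
    # Symmetric uniform quantization to 2^(num_bits-1)-1 integer levels, then de-quantization
    # back to floats to simulate low-precision weights.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    if scale == 0:
        return weights
    return np.clip(np.round(weights / scale), -qmax, qmax) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)   # toy fully connected layer
    w_pruned = prune_by_magnitude(w, sparsity=0.5)        # about half of the weights become zero
    a, b = low_rank_approx(w, rank=32)                    # 2*256*32 parameters instead of 256*256
    w_quant = quantize_uniform(w, num_bits=8)             # 8-bit simulated quantization
    print("nonzero fraction after pruning:", np.count_nonzero(w_pruned) / w_pruned.size)
    print("relative low-rank error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
    print("max quantization error:", np.max(np.abs(w - w_quant)))

In practice, such transforms are applied layer by layer and are usually followed by fine-tuning to recover the accuracy lost to sparsification, factorization, or reduced precision.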


