CLC number: TP391.4

On-line Access: 2019-06-10

Received: 2018-08-06

Revision Accepted: 2018-12-23

Crosschecked: 2019-05-13

Yan-min Qian, Xu Xiang. Binary neural networks for speech recognition[J]. Frontiers of Information Technology & Electronic Engineering, 2019, 20(5): 701-715.

@article{Qian2019binary,
title="Binary neural networks for speech recognition",
author="Yan-min Qian and Xu Xiang",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="20",
number="5",
pages="701--715",
year="2019",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1800469"
}

%0 Journal Article
%T Binary neural networks for speech recognition
%A Yan-min Qian
%A Xu Xiang
%J Frontiers of Information Technology & Electronic Engineering
%V 20
%N 5
%P 701-715
%@ 2095-9184
%D 2019
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1800469

TY  - JOUR
T1  - Binary neural networks for speech recognition
A1  - Yan-min Qian
A1  - Xu Xiang
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 20
IS  - 5
SP  - 701
EP  - 715
SN  - 2095-9184
Y1  - 2019
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1800469
ER  -

**Abstract:** In recent years, deep neural networks (DNNs) have significantly outperformed Gaussian mixture models in acoustic modeling for speech recognition. However, the substantial increase in computational load during the inference stage makes deep models difficult to deploy directly on low-power embedded devices. To alleviate this issue, structure sparseness and low-precision fixed-point quantization have been widely applied. In this work, binary neural networks for speech recognition are developed to reduce the computational cost during the inference stage. A fast implementation of binary matrix multiplication is introduced. On modern central processing unit (CPU) and graphics processing unit (GPU) architectures, a 5–7 times speedup over full-precision floating-point matrix multiplication can be achieved in real applications. Several kinds of binary neural networks and related model optimization algorithms are developed for large vocabulary continuous speech recognition acoustic modeling. In addition, to improve the accuracy of binary models, knowledge distillation from the normal full-precision floating-point model to the compressed binary model is explored. Experiments on the standard Switchboard speech recognition task show that the proposed binary neural networks can deliver a 3–4 times speedup over the normal full-precision deep models. With knowledge distillation from the normal floating-point models, the binary DNNs or binary convolutional neural networks (CNNs) can restrict the word error rate (WER) degradation to within 15.0%, compared to the normal full-precision floating-point DNNs or CNNs, respectively. In particular, for the binary CNN with binarization only on the convolutional layers, the WER degradation is almost negligible with the proposed approach.
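
The abstract does not spell out how the fast binary matrix multiplication is implemented. As a rough illustration only, the sketch below uses a technique that is common in the binarized-network literature: pack the +1/-1 values into machine words, then compute each dot product from a bitwise XOR and a population count. The function names (`pack_signs`, `binary_dot`, `binary_matmul`) are hypothetical and are not taken from the paper, and this pure-Python version shows only the arithmetic, not the reported speedups.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("expected only +1 or -1")
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Matching positions contribute +1 and differing positions contribute -1,
    so dot = (n - mismatches) - mismatches = n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # population count of the XOR
    return n - 2 * mismatches


def binary_matmul(a_rows, b_cols):
    """Multiply A (given as +1/-1 rows) by B (given as +1/-1 columns)."""
    n = len(a_rows[0])
    packed_a = [pack_signs(row) for row in a_rows]
    packed_b = [pack_signs(col) for col in b_cols]
    return [[binary_dot(a, b, n) for b in packed_b] for a in packed_a]


# Illustrative input: a 2x3 matrix times a 3x2 matrix (B supplied column by column).
result = binary_matmul([[1, -1, 1], [-1, -1, 1]], [[1, 1, -1], [-1, 1, 1]])
print(result)
```

In an optimized implementation the packed words would be fixed-width registers and the population count a single hardware instruction, which is where savings over floating-point multiply-accumulate would come from; the figures quoted in the abstract refer to the authors' own implementation, not to this sketch.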

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China

Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn

Copyright © 2000 - 2022 Journal of Zhejiang University-SCIENCE
