
Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Binary neural networks for speech recognition

Abstract: Recently, deep neural networks (DNNs) have significantly outperformed Gaussian mixture models in acoustic modeling for speech recognition. However, the substantial increase in computational load during the inference stage makes deep models difficult to deploy directly on low-power embedded devices. To alleviate this issue, structured sparsity and low-precision fixed-point quantization have been widely applied. In this work, binary neural networks for speech recognition are developed to reduce the computational cost during the inference stage. A fast implementation of binary matrix multiplication is introduced. On modern central processing unit (CPU) and graphics processing unit (GPU) architectures, a 5–7 times speedup over full-precision floating-point matrix multiplication can be achieved in real applications. Several kinds of binary neural networks and related model optimization algorithms are developed for acoustic modeling in large vocabulary continuous speech recognition. In addition, to improve the accuracy of binary models, knowledge distillation from the normal full-precision floating-point model to the compressed binary model is explored. Experiments on the standard Switchboard speech recognition task show that the proposed binary neural networks can deliver a 3–4 times speedup over the normal full-precision deep models. With knowledge distillation from the normal floating-point models, the binary DNNs or binary convolutional neural networks (CNNs) can restrict the word error rate (WER) degradation to within 15.0% of the normal full-precision floating-point DNNs or CNNs, respectively. In particular, for the binary CNN with binarization applied only to the convolutional layers, the WER degradation under the proposed approach is almost negligible.
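To make the source of the speedup concrete, the following is a minimal illustrative sketch (not the authors' optimized CPU/GPU kernels) of the XNOR-popcount trick that underlies fast binary matrix multiplication, assuming activations and weights binarized to {-1, +1} and packed into 64-bit words. The helper names pack_bits and binary_dot are hypothetical.

    # Sketch of a binary dot product via XNOR + popcount (illustration only).
    # Vectors are binarized to {-1, +1} and packed into 64-bit integer words
    # (bit 1 encodes +1, bit 0 encodes -1).

    def pack_bits(values):
        """Pack a list of +/-1 values into 64-bit words."""
        words = []
        for i in range(0, len(values), 64):
            word = 0
            for j, v in enumerate(values[i:i + 64]):
                if v > 0:
                    word |= 1 << j
            words.append(word)
        return words

    def binary_dot(words_a, words_b, n):
        """Dot product of two packed +/-1 vectors of length n.

        Per 64-bit word, XNOR marks positions where the signs agree and
        popcount counts them. With m agreeing positions out of n, the
        +/-1 dot product equals m - (n - m) = 2*m - n.
        """
        mask = (1 << 64) - 1
        matches = 0
        for wa, wb in zip(words_a, words_b):
            matches += bin(~(wa ^ wb) & mask).count("1")
        # Padding bits in the last word are zero in both vectors, so they
        # always "agree"; subtract them from the match count.
        matches -= len(words_a) * 64 - n
        return 2 * matches - n

For example, binary_dot(pack_bits(a), pack_bits(b), len(a)) reproduces the exact +/-1 dot product while processing 64 elements per word operation; replacing multiply-accumulate with such word-level bit operations is where the reported 5–7 times speedup of optimized kernels comes from.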

Key words: Speech recognition, Binary neural networks, Binary matrix multiplication, Knowledge distillation, Population count
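The knowledge distillation referred to above trains the compressed binary (student) network to match the outputs of the full-precision floating-point (teacher) network. As an illustration only, here is a common form of the distillation objective in the style of Hinton et al. (2015), written in PyTorch; the temperature T, mixing weight alpha, and the function name distillation_loss are assumptions rather than the paper's exact recipe.

    # Sketch of a standard knowledge-distillation loss (illustration only;
    # the paper's exact formulation may differ).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T=2.0, alpha=0.5):
        # Soft-target loss: KL divergence between temperature-softened
        # teacher and student distributions (scaled by T^2, as is customary).
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        # Hard-target loss: standard cross entropy against the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

In acoustic modeling, the labels would typically be frame-level senone targets, and the teacher logits come from the trained full-precision DNN or CNN.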

Chinese Summary (translated): Binary neural networks for speech recognition

Abstract: In recent years, deep neural networks (DNNs) have clearly outperformed Gaussian mixture models in acoustic modeling for speech recognition. However, the heavy computation of the inference stage makes such models hard to deploy on low-power embedded devices. To this end, sparsity and low-precision fixed-point quantization techniques have been widely used. To reduce the computation of the inference stage, this paper develops binary neural networks for speech recognition and implements fast binary matrix multiplication. On central processing units (CPUs) and graphics processing units (GPUs), binary matrix multiplication runs 5–7 times faster than floating-point matrix multiplication. For acoustic modeling in large-scale continuous speech recognition, several binary neural networks and related model optimization algorithms are proposed. To improve the accuracy of binary models, knowledge distillation from floating-point models to binary models is explored. On the standard Switchboard speech recognition task, the binary neural network models run 3–4 times faster than the floating-point models. With knowledge distillation, the increase in word error rate of binary deep neural networks or convolutional neural networks relative to their floating-point counterparts can be kept within 15%. When only the convolutional layers of a convolutional neural network are binarized, the increase in word error rate is almost negligible.

Key words (translated): speech recognition; binary neural networks; binary matrix multiplication; knowledge distillation; population count



DOI: 10.1631/FITEE.1800469

CLC number: TP391.4


On-line Access: 2019-06-10

Received: 2018-08-06

Revision Accepted: 2018-12-23

Crosschecked: 2019-05-13
