CLC number: TN912.3
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2022-01-09
Wei ZHAO, Li XU. Efficient decoding self-attention for end-to-end speech synthesis[J]. Frontiers of Information Technology & Electronic Engineering, 2022, 23(7): 1127-1138. https://doi.org/10.1631/FITEE.2100501
Abstract: Self-attention has been innovatively applied to text-to-speech (TTS) because of its parallel structure and its strength in modeling sequential data. However, when used in end-to-end speech synthesis with an autoregressive decoding scheme, its inference speed becomes relatively low due to the quadratic complexity in sequence length. This problem is particularly severe on devices without graphics processing units (GPUs). To alleviate it, we propose an efficient decoding self-attention (EDSA) module as an alternative. Combined with a dynamic programming decoding procedure, TTS model inference can be effectively accelerated to linear computation complexity. Studies on Mandarin and English datasets show that our proposed model with EDSA achieves 720% and 50% higher inference speed on the central processing unit (CPU) and GPU respectively, with almost the same performance. This method may therefore ease the deployment of such models when GPU resources are limited. In addition, our model may perform better than the baseline Transformer TTS on out-of-domain utterances.
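The quadratic-versus-linear decoding cost described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual EDSA module (whose details are not given here); it is a hypothetical toy that contrasts an autoregressive decoder which re-reads its entire history at every step, giving O(T^2) total work, with a decoder that maintains a constant-size running state, giving O(T) total work. Uniform (average) attention weights are assumed purely so the two variants are provably equivalent; the function names and shapes are illustrative.

```python
import numpy as np

def quadratic_decode(values):
    # Standard autoregressive pattern: step t attends over all t
    # previous frames, so total work grows as O(T^2).
    outputs = []
    for t in range(1, len(values) + 1):
        outputs.append(np.mean(values[:t], axis=0))
    return np.stack(outputs)

def linear_decode(values):
    # Equivalent result from a cached running sum: each step does O(1)
    # work on top of a constant-size state, so total work is O(T).
    running = np.zeros(values.shape[1])
    outputs = []
    for t, v in enumerate(values, start=1):
        running += v
        outputs.append(running / t)
    return np.stack(outputs)

# Toy "mel-frame" sequence: 8 steps, 4 dims.
frames = np.random.RandomState(0).randn(8, 4)
assert np.allclose(quadratic_decode(frames), linear_decode(frames))
```

With uniform weights the two decoders are numerically identical, which is the point of the toy: the speedup comes purely from replacing the re-scan of history with an incrementally updated state, the same general idea that linear-complexity attention variants exploit.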