CLC number: TN912.3
On-line Access: 2022-07-21
Received: 2021-10-21
Revision Accepted: 2022-07-21
Crosschecked: 2022-01-09
Wei ZHAO, Li XU. Efficient decoding self-attention for end-to-end speech synthesis[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2100501
Efficient decoding self-attention for end-to-end speech synthesis
1 College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
2 Robotics Institute, Zhejiang University, Yuyao 315400, China
Abstract: Self-attention networks are widely used in text-to-speech (TTS) synthesis because of their parallel structure and strong sequence-modeling capability. However, when autoregressive decoding is used for end-to-end speech synthesis, inference is relatively slow because the decoder's complexity is quadratic in the sequence length. This efficiency problem becomes more severe when the deployment device is not equipped with a graphics processing unit (GPU). To address it, we propose an efficient decoding self-attention network (EDSA) as a replacement. Through a dynamic-programming decoding procedure, EDSA effectively accelerates TTS model inference and achieves linear computational complexity. Experimental results on Mandarin and English datasets show that the proposed EDSA model improves inference speed by 720% on a central processing unit (CPU) and 50% on a GPU, with almost identical performance. This makes such models easier to deploy when GPU resources are limited. Moreover, the proposed model may perform better than the baseline Transformer TTS on out-of-domain utterances.
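The complexity gap described in the abstract can be illustrated with a generic linear-attention recurrence: instead of re-attending to all previous tokens at every step (quadratic overall), the decoder keeps running sums of key/value statistics so each step costs a constant amount of work. The sketch below uses a simple positive feature map and follows the well-known "transformers are RNNs" formulation; it is only an illustration of the idea of linear-complexity autoregressive decoding, not the paper's actual EDSA model, whose exact formulation is not given in the abstract.

```python
import numpy as np

def phi(x):
    # Positive feature map (an illustrative choice, not from the paper)
    return np.maximum(x, 0.0) + 1e-6

def causal_attention_quadratic(Q, K, V):
    """Naive causal self-attention: step t attends over all t+1 past
    tokens, so total cost grows quadratically with sequence length."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for t in range(n):
        scores = Q[t] @ K[:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t] = w @ V[:t + 1]
    return out

def causal_attention_linear(Q, K, V):
    """Incremental decoding with running sums: S accumulates phi(k) v^T
    and z accumulates phi(k), so each step costs O(d^2) regardless of
    how many tokens precede it -- linear total complexity."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))  # running sum of outer(phi(k), v)
    z = np.zeros(d)                # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(n):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q = phi(Q[t])
        out[t] = (q @ S) / (q @ z + 1e-9)
    return out
```

The two functions are not numerically interchangeable (the linear variant replaces softmax with a kernelized weighting), but they show the structural difference: the quadratic version's per-step work grows with `t`, while the linear version's per-step state (`S`, `z`) has fixed size, which is what makes CPU-only deployment of autoregressive TTS decoders practical.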