CLC number: TN912.3
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2020-09-08
Citations: Bibtex RefMan EndNote GB/T7714
Jing-jing Chen, Qi-rong Mao, You-cai Qin, Shuang-qing Qian, Zhi-shen Zheng. Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(11): 1639-1650.
@article{Chen2020,
title="Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder",
author="Jing-jing Chen and Qi-rong Mao and You-cai Qin and Shuang-qing Qian and Zhi-shen Zheng",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="21",
number="11",
pages="1639--1650",
year="2020",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2000019"
}
%0 Journal Article
%T Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder
%A Jing-jing Chen
%A Qi-rong Mao
%A You-cai Qin
%A Shuang-qing Qian
%A Zhi-shen Zheng
%J Frontiers of Information Technology & Electronic Engineering
%V 21
%N 11
%P 1639-1650
%@ 2095-9184
%D 2020
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2000019
TY  - JOUR
T1  - Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder
A1  - Jing-jing Chen
A1  - Qi-rong Mao
A1  - You-cai Qin
A1  - Shuang-qing Qian
A1  - Zhi-shen Zheng
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 21
IS  - 11
SP  - 1639
EP  - 1650
SN  - 2095-9184
Y1  - 2020
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2000019
ER  - 
Abstract: Much recent progress in monaural speech separation (MSS) has come from deep learning architectures based on autoencoders, which use an encoder to condense the input signal into compressed features and then feed these features into a decoder to construct a specific audio source of interest. However, these approaches neither learn generative factors of the original input for MSS nor construct every audio source in the mixed speech. In this study, we propose a novel weighted-factor autoencoder (WFAE) model for MSS, which adds a regularization loss to the objective function so that each isolated source does not contain components of the other sources. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE can learn source-specific generative factors and a set of discriminative features for each source, which improves MSS performance. Experiments on benchmark datasets show that our approach outperforms existing methods. On three important metrics, WFAE performs particularly well on a relatively challenging MSS case, namely speaker-independent MSS.
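The abstract gives no equations, so the following is only an illustrative sketch of the general idea, not the WFAE model itself. It assumes a single linear encoder, a softmax over sources as a stand-in for the latent attention mechanism, one linear decoder per source, and an overlap penalty as a stand-in for the regularization loss. All names (`init_params`, `forward`, `separation_loss`, `lam`) and all layer shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(n_freq, n_latent, n_sources, seed=0):
    """Random toy parameters; the shapes are assumptions made for this sketch."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(scale=0.1, size=(n_freq, n_latent)),
        "A": rng.normal(size=(n_sources, n_latent)),  # attention logits
        "W_dec": rng.normal(scale=0.1, size=(n_sources, n_latent, n_freq)),
    }

def forward(mix, params):
    """Encode a mixture of shape (T, F), weight latent factors per source, decode."""
    latent = np.tanh(mix @ params["W_enc"])              # (T, D) shared factors
    # Softmax across sources: each latent factor is shared out among the sources.
    weights = softmax(params["A"], axis=0)               # (S, D)
    weighted = weights[:, None, :] * latent[None, :, :]  # (S, T, D)
    # Batched matmul gives one decoder per source; ReLU keeps magnitudes >= 0.
    estimates = np.maximum(weighted @ params["W_dec"], 0.0)  # (S, T, F)
    return estimates, weights

def separation_loss(estimates, targets, lam=0.1):
    """Reconstruction error plus an assumed penalty on cross-source leakage."""
    recon = np.mean((estimates - targets) ** 2)
    n_sources = estimates.shape[0]
    leak = 0.0
    for s in range(n_sources):
        for r in range(n_sources):
            if r != s:
                # Overlap between the estimate of source s and the true source r.
                leak += np.mean(estimates[s] * targets[r])
    return recon + lam * leak

if __name__ == "__main__":
    params = init_params(n_freq=6, n_latent=4, n_sources=2)
    mix = np.abs(np.random.default_rng(1).normal(size=(5, 6)))
    estimates, weights = forward(mix, params)
    print(estimates.shape, weights.sum(axis=0))
```

In this toy setup the penalty is zero only when the estimate of one source has no energy where another source is active, which is one simple way to express "isolating one source without containing other sources"; the regularizer used in the paper may differ.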