CLC number: TP391.4

On-line Access: 2018-03-10

Received: 2017-12-08

Revision Accepted: 2018-01-17

Crosschecked: 2018-01-25

ORCID: Yan-min Qian, http://orcid.org/0000-0002-0314-3790

Frontiers of Information Technology & Electronic Engineering  2018 Vol.19 No.1 P.40-63

http://doi.org/10.1631/FITEE.1700814


Past review, current progress, and challenges ahead on the cocktail party problem


Author(s):  Yan-min Qian, Chao Weng, Xuan-kai Chang, Shuai Wang, Dong Yu

Affiliation(s):  Tencent AI Lab, Tencent, Bellevue 98004, USA

Corresponding email(s):   yanminqian@tencent.com

Key Words:  Cocktail party problem, Computational auditory scene analysis, Non-negative matrix factorization, Permutation invariant training, Multi-talker speech processing


Yan-min Qian, Chao Weng, Xuan-kai Chang, Shuai Wang, Dong Yu. Past review, current progress, and challenges ahead on the cocktail party problem[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(1): 40-63.

@article{Qian2018cocktail,
title="Past review, current progress, and challenges ahead on the cocktail party problem",
author="Yan-min Qian, Chao Weng, Xuan-kai Chang, Shuai Wang, Dong Yu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="1",
pages="40-63",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1700814"
}

Abstract: 
The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved before automatic speech recognition (ASR) systems can be widely deployed. In this overview paper, we review the techniques proposed in the last two decades to attack this problem. We focus our discussion on the speech separation problem, given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models; the conventional multi-channel techniques such as beamforming and multi-channel blind source separation; and the newly developed deep learning-based techniques such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself using a more powerful model, together with better optimization objectives and techniques, will be the approach to solving the cocktail party problem.

This article has been corrected; see doi:10.1631/FITEE.19e0001.
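To make the label-permutation issue behind PIT concrete, the following is a minimal NumPy sketch of the utterance-level permutation invariant training objective described in the abstract: the separation error is evaluated under every possible assignment of network outputs to reference speakers, and the cheapest assignment defines the loss. The magnitude-spectrogram shapes, the MSE criterion, and all names (e.g., pit_mse_loss) are illustrative assumptions for this sketch, not the authors' implementation.

import itertools

import numpy as np

def pit_mse_loss(estimates, references):
    """Utterance-level PIT loss over S speakers.

    estimates, references: arrays of shape (S, T, F) holding S estimated and
    S reference magnitude spectrograms (T frames, F frequency bins).
    Returns the minimum mean-squared error over all speaker permutations,
    together with the permutation that achieves it.
    """
    n_spk = estimates.shape[0]
    # Pairwise MSE between every (estimate, reference) pair: shape (S, S).
    pair_mse = np.array([[np.mean((estimates[i] - references[j]) ** 2)
                          for j in range(n_spk)] for i in range(n_spk)])
    # Score each of the S! assignments and keep the cheapest one.
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_spk)):
        loss = np.mean([pair_mse[i, j] for i, j in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy check: two speakers whose estimates come back in swapped order.
rng = np.random.default_rng(0)
references = rng.random((2, 100, 257))
estimates = references[::-1]
loss, perm = pit_mse_loss(estimates, references)
print(loss, perm)  # -> 0.0 (1, 0): swapping the outputs recovers the targets

In practice such a loss sits on top of a neural separation network, with gradients flowing through the selected permutation; for larger speaker counts the O(S!) search is typically replaced by an optimal-assignment solver.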


