CLC number: TP391.4

On-line Access: 2024-08-27

Received: 2023-10-17

Revision Accepted: 2024-05-08

Crosschecked: 2018-01-25

 ORCID:

Yan-min Qian

http://orcid.org/0000-0002-0314-3790

Frontiers of Information Technology & Electronic Engineering  2018 Vol.19 No.1 P.40-63

http://doi.org/10.1631/FITEE.1700814


Past review, current progress, and challenges ahead on the cocktail party problem


Author(s):  Yan-min Qian, Chao Weng, Xuan-kai Chang, Shuai Wang, Dong Yu

Affiliation(s):  Tencent AI Lab, Tencent, Bellevue 98004, USA

Corresponding email(s):   yanminqian@tencent.com

Key Words:  Cocktail party problem, Computational auditory scene analysis, Non-negative matrix factorization, Permutation invariant training, Multi-talker speech processing



Abstract: 
The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed over the last two decades to attack this problem. We focus our discussion on the speech separation problem, given its central role in the cocktail party environment, and describe the conventional single-channel techniques, such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models; the conventional multi-channel techniques, such as beamforming and multi-channel blind source separation; and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting the information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.
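The abstract names permutation invariant training (PIT) without defining it. As an illustration only, the sketch below shows the general idea as it is commonly described: a separation model emits one output stream per speaker in arbitrary order, so the training loss is evaluated under every assignment of outputs to reference speakers and the smallest value is used. The function names `mse` and `pit_loss`, the use of mean squared error, and the flat-list feature representation are assumptions made for this example, not the paper's implementation.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length feature sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Return (lowest loss, best permutation) over all output-to-speaker assignments.

    estimates, references: one flat feature list per speaker (hypothetical layout).
    best_perm[i] is the index of the reference assigned to output stream i.
    """
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(len(references))):
        # Average the per-stream error under this particular assignment.
        loss = sum(mse(estimates[i], references[j]) for i, j in enumerate(perm)) / len(perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two speakers whose outputs come back in swapped order.
refs = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
ests = [[0.0, 3.0, 1.0], [1.0, 0.0, 2.0]]
loss, perm = pit_loss(ests, refs)
```

The exhaustive search visits S! assignments for S speakers, which is manageable for the two- or three-talker mixtures typical of this setting but grows quickly beyond that.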

This article has been corrected, see doi:10.1631/FITEE.19e0001


Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2025 Journal of Zhejiang University-SCIENCE