
CLC number: TP391.4
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2024-09-17
Li Chen, Ying-chun Yang, Zhao-hui Wu. Mismatched feature detection with finer granularity for emotional speaker recognition[J]. Journal of Zhejiang University Science C (now Frontiers of Information Technology & Electronic Engineering), in press. https://doi.org/10.1631/jzus.C1400002
Mismatched feature detection and modification with finer granularity for emotional speaker recognition

Objective: When a speaker's emotional state changes, the articulatory organs deform, and the distributions of some speech features drift away from their neutral-condition distributions. These drifted features, termed "mismatched features", sharply degrade speaker recognition performance and need to be removed or modified to improve an emotional speaker recognition system.

Innovation: Since the drift of mismatched features differs across phonemes, finer-granularity detection models and modification methods are proposed on three kinds of acoustic classes: phoneme classes, Gaussian tokenization, and probabilistic Gaussian tokenization.

Method: Manifold analysis of the mismatched feature distributions shows that the farther a feature drifts from the neutral feature space, the weaker its power to discriminate speakers. A feature whose speaker-discriminating power falls below a threshold is detected as mismatched (Fig. 1). For acoustic classes represented by phoneme classes or Gaussian tokenization, support vector machines are used to build the reliable/mismatched feature detectors; for acoustic classes represented by probabilistic Gaussian tokenization, fuzzy support vector machines are used. To make the modified features approach the true neutral case without losing speaker individuality, the modification maps the mismatched feature space to the reliable feature space while ensuring that the distance between the transformed mismatched feature space and other speakers' reliable feature spaces does not decrease.

Conclusions: Emotion shifts the distributions of some speech features, turning them into mismatched features. Finer-granularity detection and modification on the three acoustic classes handle these features effectively and improve recognition performance. The best variant, mismatched feature modification under probabilistic Gaussian tokenization, raises the identification rate of the baseline GMM-UBM system by 6.77% and that of the i-vector system by 3.32%.

Keywords: emotional speaker recognition; fuzzy support vector machine; mismatched feature detection; feature modification
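The reliable/mismatched detection step can be illustrated with a minimal sketch. This is not the paper's implementation: the data, the fuzzy memberships, and the feature dimensionality are all synthetic, and the fuzzy SVM idea (per-sample error costs scaled by membership degrees) is approximated here with scikit-learn's `sample_weight` argument to a standard RBF SVM.

```python
# Sketch: classifying speech-feature frames as reliable (near the neutral
# feature space) or mismatched (drifted away from it), with per-frame fuzzy
# memberships weighting the SVM's error cost. All data below are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic MFCC-like frames; label 1 = reliable, 0 = mismatched.
reliable = rng.normal(0.0, 1.0, size=(200, 13))
mismatched = rng.normal(2.5, 1.0, size=(200, 13))
X = np.vstack([reliable, mismatched])
y = np.array([1] * 200 + [0] * 200)

# Fuzzy memberships in (0, 1]. In the paper these would come from the
# probabilistic Gaussian tokenization (UBM component posteriors); here they
# are faked as random confidences purely for illustration.
membership = rng.uniform(0.5, 1.0, size=len(y))

# Standard RBF SVM with per-sample cost scaled by membership, which is the
# core mechanism of the fuzzy SVM formulation.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=membership)

# At test time, frames predicted as 0 would be treated as mismatched and
# removed or modified before speaker scoring.
test = np.vstack([rng.normal(0.0, 1.0, size=(5, 13)),
                  rng.normal(2.5, 1.0, size=(5, 13))])
pred = clf.predict(test)
print(pred)
```

In this weighted formulation, a frame with low membership contributes less to the training loss, so uncertain tokenization assignments pull the decision boundary less strongly.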

