Journal of Zhejiang University

Journal of Zhejiang University SCIENCE C 2014 Vol.15 No.10 P.903-916

Mismatched feature detection with finer granularity for emotional speaker recognition

Author(s): Li Chen, Ying-chun Yang, Zhao-hui Wu
Affiliation(s): 1. College of Computer Science & Technology, Zhejiang University, Hangzhou 310027, China
Corresponding email(s): stchenli@zju.edu.cn, yyc@zju.edu.cn, wzh@zju.edu.cn
Key Words: Emotional speaker recognition, Mismatched feature detection, Feature regulation

Li Chen, Ying-chun Yang, Zhao-hui Wu. Mismatched feature detection with finer granularity for emotional speaker recognition[J]. Journal of Zhejiang University Science C, 2014, 15(10): 903-916.

@article{title="Mismatched feature detection with finer granularity for emotional speaker recognition",
author="Li Chen and Ying-chun Yang and Zhao-hui Wu",
journal="Journal of Zhejiang University Science C",
volume="15",
number="10",
pages="903-916",
year="2014",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1400002"
}

%0 Journal Article
%T Mismatched feature detection with finer granularity for emotional speaker recognition
%A Li Chen
%A Ying-chun Yang
%A Zhao-hui Wu
%J Journal of Zhejiang University SCIENCE C
%V 15
%N 10
%P 903-916
%@ 1869-1951
%D 2014
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1400002

TY - JOUR
T1 - Mismatched feature detection with finer granularity for emotional speaker recognition
A1 - Li Chen
A1 - Ying-chun Yang
A1 - Zhao-hui Wu
JO - Journal of Zhejiang University Science C
VL - 15
IS - 10
SP - 903
EP - 916
SN - 1869-1951
Y1 - 2014
PB - Zhejiang University Press & Springer
DO - 10.1631/jzus.C1400002
ER -

Abstract: The shape of a speaker's vocal organs changes with emotional state. This shifts the emotional acoustic space of short-time features away from the neutral acoustic space and degrades speaker recognition performance. Features that deviate greatly from the neutral acoustic space are considered mismatched features, and they negatively affect speaker recognition systems. Emotion variation deforms features differently for different phonemes, so it is reasonable to build a finer model that detects mismatched features under each phoneme. However, given the difficulty of phoneme recognition, three kinds of acoustic class recognition (phoneme classes, Gaussian mixture model (GMM) tokenizer, and probabilistic GMM tokenizer) are proposed to replace phoneme recognition. We propose feature pruning and feature regulation methods to process the mismatched features and improve speaker recognition performance. For feature regulation, the transformation matrix that regulates the mismatched features is trained by maximizing the between-class distance and minimizing the within-class distance. Experiments on the Mandarin affective speech corpus (MASC) show that our feature pruning and feature regulation methods increase the identification rate (IR) by 3.64% and 6.77%, respectively, compared with the baseline GMM-UBM (universal background model) algorithm. Corresponding IR increases of 2.09% and 3.32% are obtained when our methods are applied to the state-of-the-art i-vector algorithm.
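
The difference between the GMM tokenizer and the probabilistic GMM tokenizer named in the abstract can be illustrated with a minimal sketch. It assumes a diagonal-covariance GMM whose parameters are already trained; the function name `component_posteriors` and the toy parameter values are hypothetical and not taken from the paper. A hard tokenizer labels each frame with its most likely mixture component, while a probabilistic tokenizer keeps the full posterior vector.

```python
import numpy as np

def component_posteriors(frames, weights, means, variances):
    """Posterior of each diagonal-covariance Gaussian component per frame.

    frames: (T, D), weights: (K,), means: (K, D), variances: (K, D).
    Returns an array of shape (T, K) whose rows sum to 1.
    """
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                  # (T, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)        # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

# Toy two-component model (illustrative values only).
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
frames = np.array([[0.1, -0.2], [9.8, 10.1], [0.3, 0.0]])

soft_tokens = component_posteriors(frames, weights, means, variances)  # probabilistic tokens
hard_tokens = soft_tokens.argmax(axis=1)                               # hard tokens
print(hard_tokens)
```

In this reading, hard tokens give one acoustic class per frame, and soft tokens give a per-frame membership weight for every class.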

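Fine-grained mismatched feature detection and regulation for emotional speaker recognition

Objective: When a speaker's emotional state changes, the vocal organs deform, which shifts the distribution of some speech features away from the neutral condition. These shifted features, termed "mismatched features", substantially degrade speaker recognition performance and must be pruned or regulated to improve emotional speaker recognition systems.
Innovation: Because the distribution changes of mismatched features differ across phonemes, fine-grained mismatched feature detection models and regulation methods are proposed over three kinds of acoustic classes: phoneme classes, the GMM tokenizer, and the probabilistic GMM tokenizer.
Methods: Manifold analysis is used to observe the distribution of mismatched features, leading to the conclusion that the farther a feature lies from the neutral feature space, the weaker its ability to discriminate speakers. A feature is detected as mismatched if its speaker-discriminative ability falls below a threshold (Fig. 1). For acoustic classes represented by phoneme classes and the GMM tokenizer, a support vector machine is used to build the reliable-mismatched feature detection model; for acoustic classes represented by the probabilistic GMM tokenizer, a fuzzy support vector machine is used. To ensure that regulated features approximate the true neutral condition without losing speaker characteristics, regulation maps the mismatched feature space onto the reliable feature space while keeping the distance between the transformed mismatched feature space and other speakers' reliable feature spaces from decreasing.
Conclusions: Emotion changes the distribution of some of a speaker's speech features, turning them into mismatched features. Fine-grained detection and regulation over the three kinds of acoustic classes handle these features effectively and improve recognition performance. The best-performing method, mismatched feature regulation under the probabilistic GMM tokenizer, raises the identification rate of the baseline GMM-UBM algorithm by 6.77% and that of the i-vector algorithm by 3.32%.
Key Words: Emotional speaker recognition; Fuzzy support vector machine; Mismatched feature detection; Feature regulation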
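
The thresholded detection and pruning described in the summary can be sketched as follows. This is only an illustration under stated assumptions: the detector output `reliability`, the per-class `thresholds`, and the function name `prune_and_score` are hypothetical. In the paper the reliable-mismatched decision comes from an SVM or fuzzy SVM, which is not reproduced here. The sketch shows only how a per-acoustic-class threshold could drop mismatched frames before the utterance score is averaged.

```python
import numpy as np

def prune_and_score(frame_llr, reliability, token_ids, thresholds):
    """Average per-frame scores over the frames judged reliable.

    frame_llr   : (T,) per-frame log-likelihood ratio (speaker model vs. UBM)
    reliability : (T,) assumed detector output, higher means closer to neutral
    token_ids   : (T,) acoustic class index of each frame
    thresholds  : (K,) one decision threshold per acoustic class (hypothetical)
    """
    keep = reliability >= thresholds[token_ids]
    if not keep.any():
        # Fall back to all frames rather than averaging an empty set.
        keep = np.ones_like(keep, dtype=bool)
    return frame_llr[keep].mean(), keep

# Illustrative values only.
frame_llr = np.array([1.0, -2.0, 0.5, -3.0])
reliability = np.array([0.9, 0.2, 0.6, 0.4])
token_ids = np.array([0, 0, 1, 1])
thresholds = np.array([0.5, 0.5])

score, keep = prune_and_score(frame_llr, reliability, token_ids, thresholds)
print(score, keep)
```

Indexing `thresholds` by `token_ids` is what makes the decision class-specific, which is one way to read the "finer granularity" of the title.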
