JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering 2015 Vol.16 No.5 P.358-366

Speech emotion recognition with unsupervised feature learning

Author(s): Zheng-wei Huang, Wen-tao Xue, Qi-rong Mao
Affiliation(s): Department of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
Corresponding email(s): zhengwei.hg@gmail.com, striveyou@163.com, mao_qr@mail.ujs.edu.cn
Key Words: Speech emotion recognition, Unsupervised feature learning, Neural network, Affective computing


Zheng-wei Huang, Wen-tao Xue, Qi-rong Mao. Speech emotion recognition with unsupervised feature learning[J]. Frontiers of Information Technology & Electronic Engineering, 2015, 16(5): 358-366.

@article{Huang2015,
title="Speech emotion recognition with unsupervised feature learning",
author="Zheng-wei Huang and Wen-tao Xue and Qi-rong Mao",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="16",
number="5",
pages="358-366",
year="2015",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.1400323"
}

%0 Journal Article
%T Speech emotion recognition with unsupervised feature learning
%A Zheng-wei Huang
%A Wen-tao Xue
%A Qi-rong Mao
%J Frontiers of Information Technology & Electronic Engineering
%V 16
%N 5
%P 358-366
%@ 2095-9184
%D 2015
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1400323

TY  - JOUR
T1  - Speech emotion recognition with unsupervised feature learning
A1  - Zheng-wei Huang
A1  - Wen-tao Xue
A1  - Qi-rong Mao
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 16
IS  - 5
SP  - 358
EP  - 366
SN  - 2095-9184
Y1  - 2015
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1400323
ER  -


Abstract: Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, these features are difficult to develop because the ground truth is ambiguous. In this paper, we apply several unsupervised feature learning algorithms (K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which show promise for learning task-related features from unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of two important factors in the model setup: the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that a two-layer network does not clearly improve performance over a single-layer network.
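To make the K-means variant of unsupervised feature learning concrete, the following is a minimal sketch, not the authors' implementation. It assumes the input is a frame-by-band spectrogram, that a "content window" is a run of consecutive frames flattened into one vector, and that learned centroids are turned into features with the "triangle" activation commonly used in single-layer K-means feature learning. The function names, the synthetic data, and all parameter values are illustrative assumptions.

```python
import numpy as np

def extract_windows(spectrogram, window_size):
    """Slice a (frames, bands) array into overlapping windows, each flattened to a vector."""
    n_frames = spectrogram.shape[0]
    n_windows = n_frames - window_size + 1
    return np.stack([spectrogram[i:i + window_size].ravel() for i in range(n_windows)])

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns the (k, dim) centroid matrix."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid: (n, k)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def triangle_features(X, centroids):
    """Sparse activation: max(0, mean distance - distance to each centroid)."""
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)

# Synthetic stand-in for an unlabeled spectrogram (200 frames, 40 bands).
rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((200, 40))

windows = extract_windows(spectrogram, window_size=5)   # content window of 5 frames
centroids = kmeans(windows, k=16)                       # 16 "hidden nodes"
features = triangle_features(windows, centroids)
```

In this sketch, `window_size` and `k` play the roles of the two factors the abstract analyzes: the content window size and the number of hidden nodes. The triangle activation sets roughly the farther-than-average centroids to zero, which is one simple way to obtain sparse features.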

Reviewer comment: The paper addresses a very interesting issue, unsupervised feature extraction for speech emotion recognition.

A speech emotion recognition method based on unsupervised feature learning

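Objective: Speech emotion recognition is one of the key technologies in human-computer interaction, and good emotion features strongly affect the performance of a speech emotion recognition system. Current speech emotion features are mainly hand-crafted, and researchers have not reached a consensus on whether they characterize emotion well or whether an optimal emotion feature set exists. Further in-depth study of speech emotion feature extraction is therefore necessary.
Innovation: A data-driven, unsupervised emotion feature learning method is proposed. It automatically learns emotion-related feature mapping functions from unlabeled speech data, which are then used to extract speech emotion features.
Methods: Three unsupervised learning algorithms (K-means clustering, the sparse auto-encoder, and the sparse restricted Boltzmann machine) are used to learn task-related feature extractors from a number of unlabeled speech patches. Features are then extracted from each whole speech sample (convolution and pooling), and finally a linear support vector machine is trained to recognize unseen samples. The hyperparameters of the model (patch size and number of hidden-layer nodes) are also selected.
Conclusions: Compared with traditional raw features, the learned features are somewhat sparse and show a degree of robustness to speaker variation and other disturbing factors. Experimental results show that larger patches and more hidden-layer nodes help improve system performance (Figs. 4 and 5).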
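The extraction and classification stages described in the summary (apply the learned extractor across the whole utterance, pool, then train a linear SVM) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the centroids are random stand-ins for a learned extractor, pooling is assumed to be a mean over four time segments, the task is reduced to two classes, and the SVM is a small hinge-loss subgradient trainer written here for self-containment. All names and parameter values are hypothetical.

```python
import numpy as np

def encode(windows, centroids):
    """Triangle K-means activation: max(0, mean distance - distance to each centroid)."""
    d = np.sqrt(((windows[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)

def utterance_vector(spectrogram, centroids, window_size, n_pools=4):
    """Slide the encoder over every window (convolution), then mean-pool over time segments."""
    n_frames = spectrogram.shape[0]
    windows = np.stack([spectrogram[i:i + window_size].ravel()
                        for i in range(n_frames - window_size + 1)])
    activations = encode(windows, centroids)            # (n_windows, k)
    segments = np.array_split(activations, n_pools)     # split along the time axis
    return np.concatenate([seg.mean(axis=0) for seg in segments])  # (n_pools * k,)

def train_linear_svm(X, y, lam=0.01, lr=0.1, n_epochs=200):
    """Binary linear SVM (labels in {-1, +1}) via subgradient descent on the regularized hinge loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        active = y * (X @ w + b) < 1                     # samples violating the margin
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / len(X)
        grad_b = -y[active].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic stand-ins: 20 utterances of 60 frames x 40 bands, two classes.
rng = np.random.default_rng(0)
window_size, k = 5, 8
centroids = rng.standard_normal((k, window_size * 40))  # placeholder for learned centroids
y = np.array([1.0 if i % 2 == 0 else -1.0 for i in range(20)])
X = np.stack([utterance_vector(rng.standard_normal((60, 40)) + 0.5 * label,
                               centroids, window_size) for label in y])

w, b = train_linear_svm(X, y)
preds = np.where(X @ w + b >= 0, 1, -1)
```

Pooling is what turns utterances of different lengths into vectors of one fixed size (`n_pools * k`), so a single linear classifier can be applied to all of them. A real multi-class emotion task would need a one-vs-rest or similar extension of the binary trainer shown here.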

Key words: Speech emotion recognition; Unsupervised feature learning; Neural network; Affective computing


