Journal of Zhejiang University SCIENCE C
ISSN 1869-1951 (Print), 1869-196X (Online), Monthly
2014 Vol.15 No.10 P.903-916
Mismatched feature detection with finer granularity for emotional speaker recognition
Abstract: The shapes of speakers’ vocal organs change with their emotional states. This causes the emotional acoustic space of short-time features to deviate from the neutral acoustic space and thereby degrades speaker recognition performance. Features that deviate greatly from the neutral acoustic space are regarded as mismatched features, and they harm speaker recognition systems. Because emotion variation deforms the features of different phonemes differently, it is reasonable to build finer models that detect mismatched features for each phoneme. Since phoneme recognition is itself difficult, however, we propose three kinds of acoustic class recognition to replace it: phoneme classes, a Gaussian mixture model (GMM) tokenizer, and a probabilistic GMM tokenizer. We then propose feature pruning and feature regulation methods to process the mismatched features and improve speaker recognition performance. For feature regulation, the transformation matrix that regulates the mismatched features is trained with a strategy that maximizes the between-class distance and minimizes the within-class distance. Experiments on the Mandarin affective speech corpus (MASC) show that, compared with the baseline GMM-UBM (universal background model) algorithm, our feature pruning and feature regulation methods increase the identification rate (IR) by 3.64% and 6.77%, respectively. When applied to the state-of-the-art i-vector algorithm, the two methods yield corresponding IR increases of 2.09% and 3.32%.
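The abstract states the training objective for the regulation matrix only in words (maximize between-class distance, minimize within-class distance). One common way to realize that objective is the trace-difference criterion: choose an orthonormal W that maximizes tr(W^T (S_b - S_w) W), whose solution is the top eigenvectors of S_b - S_w. The sketch below assumes that criterion and speaker labels as the classes; the function names `scatter_matrices`, `regulation_matrix`, and `regulate` are hypothetical, and the paper's additional constraint of mapping mismatched features onto the reliable feature space is not modelled.

```python
import numpy as np


def scatter_matrices(X, y):
    """Within-class (S_w) and between-class (S_b) scatter of features X (n x d)
    grouped by speaker labels y (n,)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = Xc - mu_c
        S_w += diff.T @ diff                    # spread of frames around their speaker mean
        m = (mu_c - mu)[:, None]
        S_b += len(Xc) * (m @ m.T)              # spread of speaker means around the global mean
    return S_w, S_b


def regulation_matrix(X, y, k):
    """Hypothetical k x d transform maximizing tr(W^T (S_b - S_w) W) with orthonormal W
    (assumed trace-difference criterion, not necessarily the paper's exact objective)."""
    S_w, S_b = scatter_matrices(X, y)
    _, vecs = np.linalg.eigh(S_b - S_w)         # symmetric matrix: eigenvalues ascending
    W = vecs[:, ::-1][:, :k]                    # columns = top-k eigenvectors
    return W.T


def regulate(features, W):
    """Project (n x d) features through the learned (k x d) transform."""
    return features @ W.T
```

In the toy case below, two speakers differ only along the first dimension while the second dimension carries large within-speaker variation, so the learned direction keeps the first axis and suppresses the second:

```python
X = np.array([[-1.0, -5.0], [-1.0, 5.0], [1.0, -5.0], [1.0, 5.0]])
W = regulation_matrix(X, np.array([0, 0, 1, 1]), k=1)
Z = regulate(X, W)  # each frame is reduced to ±1, its speaker's position on axis 0
```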
Key words: Emotional speaker recognition, Mismatched feature detection, Feature regulation
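Innovation points: Because different phonemes cause the distribution of mismatched features to change in different ways, we propose fine-grained mismatched feature detection models and regulation methods built on three kinds of acoustic classes: phoneme classes, GMM tokenization, and probabilistic GMM tokenization.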
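The GMM tokenizer and the probabilistic GMM tokenizer differ in how a frame is assigned to an acoustic class: a hard tokenizer picks the single best-scoring Gaussian component, while a probabilistic tokenizer keeps the posterior probability of every component. The sketch below illustrates that distinction for a diagonal-covariance GMM; it assumes the mixture parameters are already trained (for example, taken from a UBM), and the function names are hypothetical rather than the authors' code.

```python
import numpy as np


def component_log_likelihoods(X, weights, means, variances):
    """log(w_k * N(x | mu_k, diag(var_k))) for every frame and component.
    X: (n, d); weights: (K,); means, variances: (K, d)."""
    diff = X[:, None, :] - means[None, :, :]                              # (n, K, d)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)        # (K,)
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)     # (n, K)
    return np.log(weights)[None, :] + log_norm[None, :] + log_exp


def gmm_tokenize(X, weights, means, variances):
    """Hard tokenizer: each frame gets the index of its best-scoring component."""
    return np.argmax(component_log_likelihoods(X, weights, means, variances), axis=1)


def probabilistic_gmm_tokenize(X, weights, means, variances):
    """Soft tokenizer: posterior P(k | x) for every component; rows sum to 1."""
    ll = component_log_likelihoods(X, weights, means, variances)
    ll -= ll.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    post = np.exp(ll)
    return post / post.sum(axis=1, keepdims=True)
```

A frame lying exactly between two equally weighted components receives an arbitrary hard token but an even 0.5/0.5 split from the probabilistic tokenizer, which is the kind of ambiguity a soft (fuzzy) class membership can represent.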
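Methods: Using manifold analysis, we observe the distribution of mismatched features and find that the farther a feature deviates from the neutral feature space, the weaker its ability to discriminate between speakers. A feature whose speaker-discrimination ability falls below a threshold is detected as a mismatched feature (Fig. 1). For acoustic classes represented by phoneme classes or the GMM tokenizer, reliable/mismatched feature detection models are built with support vector machines (SVMs); for acoustic classes represented by the probabilistic GMM tokenizer, they are built with fuzzy SVMs. To make the regulated features approximate the true neutral condition without losing speaker characteristics, the detected mismatched features are regulated by mapping the mismatched feature space onto the reliable feature space, while ensuring that the distance between the transformed mismatched feature space and the reliable feature spaces of other speakers does not decrease as a result.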
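The methods summary labels a feature as mismatched when its speaker-discrimination ability falls below a threshold, and feature pruning then discards those frames. The summary does not say how discrimination ability is measured, so the sketch below assumes a hypothetical per-frame margin: the log-likelihood of the true speaker's model minus that of the best competing speaker. The SVM and fuzzy SVM detectors, which in the described method are trained on such reliable/mismatched labels, are omitted; all names here are illustrative.

```python
import numpy as np


def discrimination_scores(target_ll, competitor_ll):
    """Hypothetical score: log-likelihood margin of the true speaker over the best
    competitor. target_ll: (n,); competitor_ll: (n, S - 1) for the other speakers."""
    return target_ll - competitor_ll.max(axis=1)


def label_mismatched(scores, threshold):
    """True where a frame's speaker-discrimination ability is below the threshold."""
    return scores < threshold


def prune(features, mismatched):
    """Feature pruning: keep only the frames judged reliable."""
    return features[~mismatched]
```

Frames with a small or negative margin are the ones a competing speaker model explains almost as well as, or better than, the true speaker's model, so removing them before scoring is the intuition behind pruning.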
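Important conclusions: Emotion shifts the distribution of some of a speaker's speech features, turning them into mismatched features. Fine-grained mismatched feature detection and regulation on the three acoustic classes handle these features effectively and improve system recognition performance. The best-performing method, mismatched feature regulation under the probabilistic GMM tokenizer, raises the identification rate of the baseline GMM-UBM algorithm by 6.77% and that of the i-vector algorithm by 3.32%.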
DOI: 10.1631/jzus.C1400002
CLC number: TP391.4
Cited: 2
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2014-09-17