
CLC number: TN912.34

On-line Access: 2017-05-24

Received: 2015-11-03

Revision Accepted: 2016-04-18

Crosschecked: 2017-02-21


Frontiers of Information Technology & Electronic Engineering  2017 Vol.18 No.5 P.738-750


Histogram equalization using a reduced feature set of background speakers’ utterances for speaker recognition

Author(s):  Myung-jae Kim, Il-ho Yang, Min-seok Kim, Ha-jin Yu

Affiliation(s):  School of Computer Science, University of Seoul, Seoul 02504, Korea

Corresponding email(s):   mj@uos.ac.kr, heisco@hanmail.net, ms@uos.ac.kr, hjyu@uos.ac.kr

Key Words:  Speaker recognition, Histogram equalization, i-vector


Myung-jae Kim, Il-ho Yang, Min-seok Kim, Ha-jin Yu. Histogram equalization using a reduced feature set of background speakers’ utterances for speaker recognition[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(5): 738-750.

@article{Kim2017HEQ,
  title="Histogram equalization using a reduced feature set of background speakers’ utterances for speaker recognition",
  author="Myung-jae Kim and Il-ho Yang and Min-seok Kim and Ha-jin Yu",
  journal="Frontiers of Information Technology \& Electronic Engineering",
  publisher="Zhejiang University Press \& Springer",
  volume="18",
  number="5",
  pages="738-750",
  year="2017",
  issn="2095-9184",
  doi="10.1631/FITEE.1500380"
}

We propose a method for histogram equalization using supplement sets to improve the performance of speaker recognition when the training and test utterances are very short. The supplement sets are derived from the background speakers’ utterances by selection or clustering algorithms. The proposed approach serves as a feature normalization method for building histograms when the input utterance samples are insufficient. In addition, it is used as an i-vector normalization method in an i-vector-based probabilistic linear discriminant analysis (PLDA) system, the current state of the art in speaker verification. The ranks of the sample values for histogram equalization are estimated in ascending order from both the input utterance and the supplement set, and new ranks are obtained by summing the two kinds of ranks. The proposed method then determines the cumulative distribution function of the test utterance from the newly defined ranks. The proposed method is compared with conventional feature normalization methods, such as cepstral mean normalization (CMN), cepstral mean and variance normalization (MVN), histogram equalization (HEQ), and the European Telecommunications Standards Institute (ETSI) advanced front-end method. In addition, performance is compared among cases in which the greedy selection, fuzzy C-means, and K-means algorithms are used to derive the supplement sets. The YOHO and Electronics and Telecommunications Research Institute (ETRI) databases are used for the evaluation in the feature space, and the test sets are passed through the Opus VoIP codec to simulate transmission. The 2008 National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) corpus is used for the i-vector system. The experimental results demonstrate that the proposed method improves the average system performance compared with the conventional feature normalization methods.
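The rank-combination idea described in the abstract can be sketched roughly as follows. This is a hypothetical, simplified illustration assuming a standard normal target distribution; the function `heq_with_supplement` and its per-dimension rank summing are an assumed reading of the abstract, not the authors' exact formulation.

```python
import numpy as np
from scipy.stats import norm


def heq_with_supplement(x, supplement):
    """Per-dimension histogram equalization toward a standard normal target.

    x          : (T, D) features of a (possibly very short) input utterance.
    supplement : (S, D) features derived from background speakers' utterances,
                 e.g. by a selection or clustering algorithm.

    Sketch only: combines each sample's rank among the input samples with its
    rank within the supplement set, then maps the resulting empirical CDF
    through the inverse CDF of the target distribution.
    """
    T, D = x.shape
    S = supplement.shape[0]
    out = np.empty_like(x, dtype=float)
    for d in range(D):
        # rank of each input sample among the input samples themselves (1..T)
        rank_in = np.argsort(np.argsort(x[:, d])) + 1
        # rank of each input sample within the supplement set (1..S+1)
        rank_sup = np.searchsorted(np.sort(supplement[:, d]), x[:, d]) + 1
        # new rank = sum of the two kinds of ranks
        rank = rank_in + rank_sup
        # empirical CDF from the combined ranks (denominator keeps it in (0, 1)),
        # mapped through the inverse CDF of the standard normal target
        cdf = rank / (T + S + 2)
        out[:, d] = norm.ppf(cdf)
    return out
```

Because both kinds of ranks are nondecreasing in the sample value, the mapping preserves the ordering of the input samples in each dimension while reshaping their distribution with the help of the larger supplement set.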








Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2024 Journal of Zhejiang University-SCIENCE