CLC number: TP391.42
On-line Access: 2009-08-14
Received: 2008-12-18
Revision Accepted: 2009-04-04
Crosschecked: 2009-07-20
Bahram VAZIRNEZHAD, Farshad ALMASGANJ, Seyed Mohammad AHADI, Ari CHANEN. Speaker adapted dynamic lexicons containing phonetic deviations of words[J]. Journal of Zhejiang University Science A, 2009, 10(10): 1461-1475.
@article{Vazirnezhad2009,
title="Speaker adapted dynamic lexicons containing phonetic deviations of words",
author="Bahram VAZIRNEZHAD and Farshad ALMASGANJ and Seyed Mohammad AHADI and Ari CHANEN",
journal="Journal of Zhejiang University Science A",
volume="10",
number="10",
pages="1461-1475",
year="2009",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.A0820761"
}
%0 Journal Article
%T Speaker adapted dynamic lexicons containing phonetic deviations of words
%A Bahram VAZIRNEZHAD
%A Farshad ALMASGANJ
%A Seyed Mohammad AHADI
%A Ari CHANEN
%J Journal of Zhejiang University SCIENCE A
%V 10
%N 10
%P 1461-1475
%@ 1673-565X
%D 2009
%I Zhejiang University Press & Springer
%R 10.1631/jzus.A0820761
TY - JOUR
T1 - Speaker adapted dynamic lexicons containing phonetic deviations of words
A1 - Bahram VAZIRNEZHAD
A1 - Farshad ALMASGANJ
A1 - Seyed Mohammad AHADI
A1 - Ari CHANEN
JO - Journal of Zhejiang University Science A
VL - 10
IS - 10
SP - 1461
EP - 1475
SN - 1673-565X
Y1 - 2009
PB - Zhejiang University Press & Springer
DO - 10.1631/jzus.A0820761
ER -
Abstract: Speaker variability is an important source of speech variation and makes continuous speech recognition a difficult task. Adapting automatic speech recognition (ASR) models to speaker variation is a well-known strategy for coping with this challenge. Almost all such techniques develop adaptation solutions within the acoustic models of the ASR system. Although variation in the acoustic features accounts for an important portion of inter-speaker variation, it does not cover variation at the phonetic level. Phonetic variations are known to form an important part of speech variation and are influenced by both micro-segmental and suprasegmental factors. Inter-speaker phonetic variations are influenced by the structure and anatomy of the speaker’s articulatory system and by the speaker’s speaking style, which is driven by background characteristics such as accent, gender, age, and socioeconomic and educational class. In the feature space, inter-speaker variation may cause explicit phone recognition errors. These errors can be compensated for later by giving the lexicon entries pronunciation variants that account for likely phone misclassifications as well as pronunciation variation. In this paper, we introduce speaker-adaptive dynamic pronunciation models, which generate different lexicons for different speaker clusters and different ranges of speech rate. The models are hybrids of speaker-adapted contextual rules and dynamic generalized decision trees, which take into account word phonological structure, rate of speech, unigram probabilities, and stress to generate pronunciation variants of words. Using the set of speaker-adapted dynamic lexicons in a Farsi (Persian) continuous speech recognition task reduces the word error rate by as much as 10.1% in a speaker-dependent scenario and 7.4% in a speaker-independent scenario.
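To make the idea of a lexicon that changes with speaker cluster and speech rate more concrete, the following is a minimal Python sketch of the contextual-rule half of such a model only. It is not the authors' implementation: the rule format, the rule probabilities, the cluster and rate labels (`cluster_a`, `fast`, `slow`), the pruning threshold `min_prob`, and the example word are all hypothetical, and the decision-tree component described in the abstract is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Phones = Tuple[str, ...]


@dataclass(frozen=True)
class ContextRule:
    """Rewrite `target` as `replacement` when its neighbours match."""
    left: Optional[str]   # required left neighbour; None matches anything
    target: str
    right: Optional[str]  # required right neighbour; "#" is a word boundary
    replacement: str      # empty string deletes the phone
    prob: float           # hypothetical rule probability


def apply_rule(phones: Phones, rule: ContextRule) -> List[Phones]:
    """Return each variant produced by firing the rule at one position."""
    variants = []
    for i, phone in enumerate(phones):
        if phone != rule.target:
            continue
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i + 1 < len(phones) else "#"
        if rule.left not in (None, left) or rule.right not in (None, right):
            continue
        repl = (rule.replacement,) if rule.replacement else ()
        variants.append(phones[:i] + repl + phones[i + 1:])
    return variants


def build_lexicon(canonical: Dict[str, Phones], rules: List[ContextRule],
                  min_prob: float = 0.1) -> Dict[str, List[Tuple[Phones, float]]]:
    """Expand each canonical entry with the variants licensed by `rules`."""
    lexicon = {}
    for word, phones in canonical.items():
        scored = {phones: 1.0}  # the canonical form is always kept
        for rule in rules:
            if rule.prob < min_prob:
                continue  # prune unlikely rules
            for variant in apply_rule(phones, rule):
                scored[variant] = max(scored.get(variant, 0.0), rule.prob)
        lexicon[word] = sorted(scored.items(), key=lambda kv: -kv[1])
    return lexicon


# Hypothetical canonical lexicon and rule sets, one per (cluster, rate range).
CANONICAL = {"dast": ("d", "a", "s", "t")}
RULES = {
    ("cluster_a", "fast"): [ContextRule("s", "t", "#", "", 0.4)],
    ("cluster_a", "slow"): [],
}

fast_lexicon = build_lexicon(CANONICAL, RULES[("cluster_a", "fast")])
slow_lexicon = build_lexicon(CANONICAL, RULES[("cluster_a", "slow")])
```

In this sketch the recognizer would select `fast_lexicon` or `slow_lexicon` according to the speaker cluster and the measured speech rate, so the same word can carry a different set of pronunciation variants in different conditions.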