
CLC number: TP18

On-line Access: 2026-01-08

Received: 2025-06-13

Revision Accepted: 2025-11-02

Crosschecked: 2026-01-08

Citation formats: GB/T 7714, BibTeX, EndNote, RefMan (RIS)

 ORCID:

Fei WU

https://orcid.org/0000-0003-2139-8807

Yuxiao LIN

https://orcid.org/0000-0002-3954-5927

Frontiers of Information Technology & Electronic Engineering  2025 Vol.26 No.11 P.2310-2323

http://doi.org/10.1631/FITEE.2500411


Multi-talker audio–visual speech recognition towards diverse scenarios


Author(s):  Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU

Affiliation(s):  College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Corresponding email(s):   yuxiaolinling@zju.edu.cn, jint_zju@zju.edu.cn, chengxize@zju.edu.cn, zhaozhou@zju.edu.cn, wufei@zju.edu.cn

Key Words:  Speech recognition and synthesis, Multi-modal recognition, Curriculum learning, Multi-talker speech recognition


Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU. Multi-talker audio–visual speech recognition towards diverse scenarios[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(11): 2310-2323.

@article{Lin2025multitalker,
title="Multi-talker audio–visual speech recognition towards diverse scenarios",
author="Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="11",
pages="2310-2323",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2500411"
}

%0 Journal Article
%T Multi-talker audio–visual speech recognition towards diverse scenarios
%A Yuxiao LIN
%A Tao JIN
%A Xize CHENG
%A Zhou ZHAO
%A Fei WU
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 11
%P 2310-2323
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2500411

TY - JOUR
T1 - Multi-talker audio–visual speech recognition towards diverse scenarios
A1 - Yuxiao LIN
A1 - Tao JIN
A1 - Xize CHENG
A1 - Zhou ZHAO
A1 - Fei WU
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 11
SP - 2310
EP - 2323
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2500411
ER -


Abstract: 
Recently, audio–visual speech recognition (AVSR) has attracted increasing attention. However, most existing works simplify the complex challenges of real-world applications and focus only on scenarios with two speakers and perfectly aligned audio–video clips. In this work, we study the effects of speaker number and modal misalignment on the AVSR task, and propose an end-to-end AVSR framework for this more realistic condition. Specifically, we propose a speaker-number-aware mixture-of-experts (SA-MoE) mechanism to explicitly model the characteristic differences across scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module for robust handling of asynchronous inputs. We further exploit the inherent difficulty differences among samples and introduce a new training strategy, challenge-based curriculum learning (CBCL), which steers the model toward challenging data rather than simple data to improve training efficiency.

Multi-talker audio–visual speech recognition towards diverse scenarios

Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Abstract: Recently, audio–visual speech recognition (AVSR) has attracted increasing attention. However, most existing studies simplify the complex challenges of practical applications and focus only on two-speaker scenarios with perfectly aligned audio–video clips. This paper studies the effects of speaker number and modal misalignment on the AVSR task and proposes an end-to-end AVSR framework under a more realistic condition. Specifically, a speaker-number-aware mixture-of-experts (SA-MoE) mechanism is proposed to explicitly model the characteristic differences across scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module is designed for robust handling of asynchronous inputs. In addition, exploiting the inherent difficulty differences among samples, a new training strategy named challenge-based curriculum learning (CBCL) is introduced, which steers the model toward challenging data rather than simple data to improve efficiency.

Keywords: Speech recognition and synthesis; Multi-modal recognition; Curriculum learning; Multi-talker speech recognition
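To make the SA-MoE idea described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of a speaker-number-aware mixture-of-experts layer: a gating network estimates the speaker count from time-pooled audio–visual features and softly routes the sequence through one expert per speaker-count class. All module names, dimensions, and the gating scheme are assumptions made for illustration; this is not the authors' implementation.

# Minimal, hypothetical sketch of a speaker-number-aware mixture-of-experts
# (SA-MoE) layer. Names, dimensions, and the gating scheme are assumptions
# for illustration only, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerAwareMoE(nn.Module):
    """Routes fused audio-visual features to experts, one per speaker-count class."""

    def __init__(self, dim: int = 512, max_speakers: int = 3):
        super().__init__()
        # One feed-forward expert per assumed speaker count (1..max_speakers).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(max_speakers)
        ])
        # Gating network predicts a distribution over speaker counts
        # from the time-pooled features of the mixed input.
        self.gate = nn.Linear(dim, max_speakers)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim) fused audio-visual representation.
        gate_logits = self.gate(x.mean(dim=1))           # (batch, max_speakers)
        weights = F.softmax(gate_logits, dim=-1)          # soft speaker-count routing
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, T, D)
        out = torch.einsum("be,betd->btd", weights, expert_out)
        # Returning the logits lets an auxiliary speaker-count loss supervise the gate.
        return out, gate_logits


if __name__ == "__main__":
    moe = SpeakerAwareMoE(dim=512, max_speakers=3)
    feats = torch.randn(2, 100, 512)                      # 2 clips, 100 frames each
    out, count_logits = moe(feats)
    print(out.shape, count_logits.shape)                  # (2, 100, 512), (2, 3)

In practice, the gate could additionally be supervised with a cross-entropy loss against the known speaker count during training, so that routing becomes explicitly speaker-number aware rather than purely learned from the recognition objective.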


