JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

2025 Vol.26 No.11 P.2310-2323

Multi-talker audio–visual speech recognition towards diverse scenarios

Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU

College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

yuxiaolinling@zju.edu.cn, jint_zju@zju.edu.cn, chengxize@zju.edu.cn, zhaozhou@zju.edu.cn, wufei@zju.edu.cn

Abstract: Audio–visual speech recognition (AVSR) has attracted increasing attention in recent years. However, most existing work simplifies the challenges of real-world applications and considers only scenarios with two speakers and perfectly aligned audio–video clips. In this work, we study the effect of the number of speakers and of modal misalignment on the AVSR task, and propose an end-to-end AVSR framework for more realistic conditions. Specifically, we propose a speaker-number-aware mixture-of-experts (SA-MoE) mechanism that explicitly models how scenarios with different numbers of speakers differ in their characteristics, and a cross-modal realignment (CMR) module that handles asynchronous inputs robustly. We also exploit the inherent difference in difficulty among training data and introduce a new training strategy, challenge-based curriculum learning (CBCL), which forces the model to focus on difficult data rather than simple data in order to improve efficiency.

Key words: Speech recognition and synthesis; Multi-modal recognition; Curriculum learning; Multi-talker speech recognition
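The abstract does not say how CBCL selects or orders training data, so the following is only an illustrative sketch of the general idea of steering training toward hard examples, not the authors' algorithm. It assumes a per-sample loss is available as a difficulty signal and draws batches with probability proportional to a softmax over those losses. The names `difficulty_weights`, `sample_batch`, and `temperature` are hypothetical.

```python
import math
import random


def difficulty_weights(losses, temperature=1.0):
    """Softmax over per-sample losses: a higher loss gives a higher sampling weight.

    ASSUMPTION: loss is used here as the difficulty proxy; the paper may
    define difficulty differently (e.g. by speaker count or misalignment).
    """
    scaled = [loss / temperature for loss in losses]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(losses, batch_size, temperature=1.0, rng=None):
    """Draw sample indices with replacement, favouring difficult samples."""
    rng = rng or random.Random()
    weights = difficulty_weights(losses, temperature)
    return rng.choices(range(len(losses)), weights=weights, k=batch_size)


if __name__ == "__main__":
    # Hypothetical per-sample losses for an easy, a medium, and a hard clip.
    losses = [0.1, 1.0, 2.0]
    print(difficulty_weights(losses))
    print(sample_batch(losses, batch_size=8, rng=random.Random(0)))
```

A lower `temperature` concentrates sampling on the hardest samples, while a large value approaches uniform sampling, which is one simple way to trade off focus on hard data against coverage of the full training set.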

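Chinese Summary: Multi-talker audio–visual speech recognition for diverse scenarios

Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Abstract: In recent years, audio–visual speech recognition (AVSR) has received growing attention. However, most existing studies simplify the complex challenges of practical applications and consider only two-speaker scenarios with perfectly aligned audio–video clips. This paper studies the impact of the number of speakers and of modal misalignment on the AVSR task, and proposes an end-to-end AVSR framework under more realistic conditions. Specifically, a speaker-number-aware mixture-of-experts (SA-MoE) mechanism is proposed to explicitly model the feature differences among scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module is designed to handle asynchronous inputs robustly. In addition, exploiting the inherent difference in difficulty, a new training strategy named challenge-based curriculum learning (CBCL) is proposed, which forces the model to focus on difficult and challenging data rather than simple data, thereby improving efficiency.

Key words: Speech recognition and synthesis; Multi-modal recognition; Curriculum learning; Multi-talker speech recognition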

DOI: 10.1631/FITEE.2500411

CLC number: TP18

On-line Access: 2026-01-08

Received: 2025-06-13

Revision Accepted: 2025-11-02

Crosschecked: 2026-01-08

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE
