Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2025 Vol.26 No.11 P.2310-2323
Multi-talker audio–visual speech recognition towards diverse scenarios
Abstract: Recently, audio–visual speech recognition (AVSR) has attracted increasing attention. However, most existing works simplify the complex challenges of real-world applications, focusing only on scenarios with two speakers and perfectly aligned audio–video clips. In this work, we study the effects of speaker number and modal misalignment on the AVSR task, and propose an end-to-end AVSR framework under more realistic conditions. Specifically, we propose a speaker-number-aware mixture-of-experts (SA-MoE) mechanism to explicitly model the characteristic differences across scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module for robust handling of asynchronous inputs. We also exploit the underlying difficulty differences among samples and introduce a new training strategy, challenge-based curriculum learning (CBCL), which forces the model to focus on challenging data rather than simple data to improve training efficiency.
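The abstract does not detail the SA-MoE mechanism; a minimal sketch of the general idea of speaker-number-aware expert gating, with all names, dimensions, and the gating input being hypothetical placeholders rather than the authors' design, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical setup: one small linear expert per speaker-count regime
# (e.g., 1, 2, or 3 concurrent talkers), plus a learned gating projection.
D = 8
experts = [rng.standard_normal((D, D)) * 0.1 for _ in range(3)]
gate_w = rng.standard_normal((D, 3)) * 0.1

def sa_moe(frame):
    """Route an audio-visual feature frame through speaker-count experts.

    Here the gate logits come from the frame itself; in a real
    speaker-number-aware design the gate would also be conditioned on an
    estimated speaker count (an assumption in this sketch).
    """
    gates = softmax(frame @ gate_w)                 # (3,) mixture weights
    outs = np.stack([frame @ E for E in experts])   # (3, D) expert outputs
    return gates @ outs                             # gated combination, (D,)

x = rng.standard_normal(D)
y = sa_moe(x)
assert y.shape == (D,)
```

The key design point the abstract suggests is that scenarios with different speaker numbers have different characteristics, so dedicated experts with a soft gate let the model specialize per regime while remaining end-to-end trainable.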
Key words: Speech recognition and synthesis; Multi-modal recognition; Curriculum learning; Multi-talker speech recognition
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
DOI: 10.1631/FITEE.2500411
CLC number: TP18
On-line Access: 2026-01-08
Received: 2025-06-13
Revision Accepted: 2025-11-02
Crosschecked: 2026-01-08