Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Multi-talker audio–visual speech recognition towards diverse scenarios

Abstract: Audio–visual speech recognition (AVSR) has recently attracted increasing attention. However, most existing works simplify the challenges of real-world applications and focus only on scenarios with two speakers and perfectly aligned audio–video clips. In this work, we study the effects of speaker number and modal misalignment on the AVSR task, and propose an end-to-end AVSR framework for more realistic conditions. Specifically, we propose a speaker-number-aware mixture-of-experts (SA-MoE) mechanism to explicitly model the differing characteristics of scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module to handle asynchronous inputs robustly. We also exploit the underlying differences in difficulty and introduce a new training strategy named challenge-based curriculum learning (CBCL), which forces the model to focus on difficult data rather than simple data to improve efficiency.

Key words: Speech recognition and synthesis; Multi-modal recognition; Curriculum learning; Multi-talker speech recognition
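
As a rough illustration of two ideas named in the abstract, the following toy Python sketch shows (a) blending expert outputs with weights selected by the speaker count and (b) sampling probabilities that favour difficult examples. It is not the authors' implementation: the abstract does not specify how the SA-MoE gate is computed, how difficulty is scored, or how the CBCL curriculum is scheduled. All names here (`speaker_aware_mix`, `challenge_weights`, `gate_logits`, `focus`) are hypothetical, and the lookup-table gate and softmax weighting are assumptions made only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_aware_mix(features, num_speakers, experts, gate_logits):
    """Blend expert outputs with weights chosen by the speaker count.

    Assumption: gate_logits[k] holds one logit per expert for scenes with
    k speakers. This lookup table stands in for a learned gate. All
    experts are assumed to return vectors of the same length.
    """
    weights = softmax(gate_logits[num_speakers])
    outputs = [expert(features) for expert in experts]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(outputs[0]))
    ]


def challenge_weights(difficulties, focus):
    """Sampling probabilities that favour hard examples.

    Assumption: each example has a scalar difficulty score. focus = 0
    gives uniform sampling; larger values move probability mass toward
    the examples with the highest scores.
    """
    return softmax([focus * d for d in difficulties])


# Toy usage: two stand-in "experts" and a gate keyed by speaker count.
experts = [
    lambda x: [2.0 * v for v in x],   # hypothetical expert 0
    lambda x: [v + 1.0 for v in x],   # hypothetical expert 1
]
gate_logits = {1: [5.0, -5.0], 3: [-5.0, 5.0]}
mixed = speaker_aware_mix([1.0, 2.0], 3, experts, gate_logits)
sampling_probs = challenge_weights([0.1, 0.5, 0.9], focus=4.0)
```

In this sketch, raising `focus` over the course of training would be one possible way to shift attention toward harder data; whether CBCL does anything similar is not stated in the abstract.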

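Chinese Summary (in English translation): Multi-talker audio–visual speech recognition towards diverse scenarios

Yuxiao LIN (林宇箫), Tao JIN (金涛), Xize CHENG (成曦泽), Zhou ZHAO (赵洲), Fei WU (吴飞)
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Abstract: In recent years, audio–visual speech recognition (AVSR) has received growing attention. However, most existing studies simplify the complex challenges of real-world applications and consider only two-speaker scenarios with perfectly aligned audio–video clips. This paper studies how the number of speakers and modal misalignment affect the AVSR task, and proposes an end-to-end AVSR framework under more realistic conditions. Specifically, a speaker-number-aware mixture-of-experts (SA-MoE) mechanism is proposed to explicitly model the feature differences across scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module is designed to handle asynchronous inputs robustly. In addition, exploiting the inherent differences in difficulty, a new training strategy named challenge-based curriculum learning (CBCL) is proposed, which forces the model to focus on difficult, challenging data rather than simple data, thereby improving efficiency.

Key words: Speech recognition and synthesis; Multi-modal recognition; Curriculum learning; Multi-talker speech recognition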


DOI: 10.1631/FITEE.2500411
CLC number: TP18
Full text downloaded: 258
Summary downloaded: 343
Clicked: 337
Cited: 0
On-line Access: 2026-01-08
Received: 2025-06-13
Revision Accepted: 2025-11-02
Crosschecked: 2026-01-08

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE