Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning

Abstract: In recent years, multi-label zero-shot learning (ML-ZSL) has attracted increasing attention because of its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL is to predict multiple labels for unseen classes without requiring any labeled training data, in contrast to conventional supervised learning paradigms. However, existing methods face several significant challenges, including the substantial semantic gap between modalities, which impedes effective knowledge transfer, and the intricate relationships among multiple labels, which are difficult to model accurately. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach. The proposed method combines the strengths of multimodal large language models with graph-based structures, significantly enhancing the reasoning process involved in multi-label prediction. First, we present a novel multimodal chain-of-thought reasoning framework that imitates human-like step-by-step reasoning to produce multi-label predictions. Second, we present a technique for integrating label graphs into the reasoning process. This technique captures complex semantic relationships among labels, thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that the proposed GMCoT approach outperforms state-of-the-art methods in ML-ZSL.

Key words: Chain-of-thought; Multi-label zero-shot learning; Multimodal reasoning; Large language model

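Chinese Summary  GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning

Xiang WEN (温翔)1, Haobo WANG (王皓波)3, Ke CHEN (陈珂)1,2, Tianlei HU (胡天磊)1,2, Gang CHEN (陈刚)1,2
1State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou 310027, China
2Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Zhejiang University, Hangzhou 310027, China
3School of Software, Zhejiang University, Hangzhou 310027, China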
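Abstract: In recent years, multi-label zero-shot learning (ML-ZSL) has attracted growing attention owing to its broad potential applications in areas such as image annotation, text classification, and bioinformatics. The core challenge of ML-ZSL is to predict multiple labels on unseen classes without relying on any labeled training data, in sharp contrast to the conventional supervised learning paradigm. However, existing methods still face several important challenges: a significant semantic gap exists between different modalities, which hinders effective knowledge transfer; and the relationships among multiple labels are complex and highly coupled, which makes it difficult to model them reasonably and precisely. To address these problems, this paper proposes a graph-augmented multimodal chain-of-thought (GMCoT) reasoning method. The method combines the strengths of multimodal large language models with graph structures, significantly improving reasoning ability in multi-label prediction. First, a novel multimodal chain-of-thought reasoning framework is proposed, which generates multi-label predictions by simulating the human step-by-step reasoning process. Second, a technique is proposed for integrating label graphs into the reasoning process. This technique captures complex semantic relationships among labels, thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on multiple benchmark datasets show that the proposed GMCoT method outperforms a range of existing state-of-the-art methods on the ML-ZSL task.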

Key words: Chain-of-thought; Multi-label zero-shot learning; Multimodal reasoning; Large language model



DOI: 10.1631/FITEE.2500429

CLC number: TP18


On-line Access: 2026-01-09

Received: 2025-06-23

Revision Accepted: 2025-12-02

Crosschecked: 2026-01-11

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE