
CLC number: TP18
On-line Access: 2026-01-09
Received: 2025-06-23
Revision Accepted: 2025-12-02
Crosschecked: 2026-01-11
Xiang WEN, Haobo WANG, Ke CHEN, Tianlei HU, Gang CHEN. GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(12): 2623-2637.
@article{wen2025gmcot,
title="GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning",
author="Xiang WEN and Haobo WANG and Ke CHEN and Tianlei HU and Gang CHEN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="12",
pages="2623-2637",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2500429"
}
%0 Journal Article
%T GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning
%A Xiang WEN
%A Haobo WANG
%A Ke CHEN
%A Tianlei HU
%A Gang CHEN
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 12
%P 2623-2637
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2500429
TY - JOUR
T1 - GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning
A1 - Xiang WEN
A1 - Haobo WANG
A1 - Ke CHEN
A1 - Tianlei HU
A1 - Gang CHEN
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 12
SP - 2623
EP - 2637
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2500429
ER -
Abstract: In recent years, multi-label zero-shot learning (ML-ZSL) has attracted increasing attention owing to its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL is predicting multiple labels for unseen classes without any labeled training data, in contrast to conventional supervised learning. Existing methods, however, face two significant obstacles: the substantial semantic gap between modalities, which impedes effective knowledge transfer, and the intricate relationships among labels, which are difficult to model accurately. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach that combines the strengths of multimodal large language models with graph-based structures to enhance the reasoning process underlying multi-label prediction. First, we present a multimodal chain-of-thought reasoning framework that imitates human-like step-by-step reasoning to produce multi-label predictions. Second, we present a technique for integrating label graphs into the reasoning process, capturing the complex semantic relationships among labels and thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that GMCoT outperforms state-of-the-art methods in ML-ZSL.
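The abstract describes GMCoT only at a high level; the full method is not reproduced here. As a rough illustration of its two ingredients, the minimal PyTorch sketch below first propagates label word embeddings over a toy co-occurrence graph to obtain graph-aware per-label classifiers, and then refines an initial confident label set through graph affinity as a crude stand-in for the step-by-step reasoning chain. All class names, dimensions, thresholds, and the refinement heuristic are illustrative assumptions, not the authors' implementation (which reasons with a multimodal large language model rather than a lightweight GCN).

import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGraphPropagation(nn.Module):
    """Two-layer GCN that turns label word embeddings into graph-aware
    per-label classifiers (hypothetical component for illustration)."""
    def __init__(self, embed_dim, feat_dim, adj):
        super().__init__()
        # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_hat", d.unsqueeze(1) * a * d.unsqueeze(0))
        self.fc1 = nn.Linear(embed_dim, feat_dim)
        self.fc2 = nn.Linear(feat_dim, feat_dim)

    def forward(self, label_embeds):
        h = F.relu(self.a_hat @ self.fc1(label_embeds))
        return self.a_hat @ self.fc2(h)  # (num_labels, feat_dim)

num_labels, embed_dim, feat_dim = 8, 300, 512
adj = (torch.rand(num_labels, num_labels) > 0.7).float()
adj = ((adj + adj.t()) > 0).float()                # toy symmetric co-occurrence graph

gcn = LabelGraphPropagation(embed_dim, feat_dim, adj)
label_embeds = torch.randn(num_labels, embed_dim)  # stand-in for label word vectors
image_feat = torch.randn(feat_dim)                 # stand-in for a vision-backbone feature

# Step 1: score every label (seen or unseen) against the image feature.
classifiers = gcn(label_embeds)
scores = torch.sigmoid(classifiers @ image_feat)

# Step 2: a crude stand-in for chain-of-thought refinement -- accept the
# confident labels first, then boost their graph neighbors before the
# final prediction, so that related labels are produced consistently.
accepted = scores > 0.5
affinity = adj[:, accepted].sum(dim=1)             # neighbors of accepted labels
refined = scores + 0.1 * torch.tanh(affinity)
print("top labels:", refined.topk(3).indices.tolist())

In practice, the random tensors would be replaced by word vectors for the seen and unseen labels and by features from a pretrained vision backbone, so that unseen labels are scored purely through their embeddings and graph neighborhood.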