
CLC number: TP18
On-line Access: 2026-01-09
Received: 2025-06-23
Revision Accepted: 2025-12-02
Crosschecked: 2026-01-11
Xiang WEN, Haobo WANG, Ke CHEN, Tianlei HU, Gang CHEN. GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(12): 2623-2637.
@article{wen2025gmcot,
title="GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning",
author="Xiang WEN and Haobo WANG and Ke CHEN and Tianlei HU and Gang CHEN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="12",
pages="2623-2637",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2500429"
}
%0 Journal Article
%T GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning
%A Xiang WEN
%A Haobo WANG
%A Ke CHEN
%A Tianlei HU
%A Gang CHEN
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 12
%P 2623-2637
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2500429
TY - JOUR
T1 - GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning
A1 - Xiang WEN
A1 - Haobo WANG
A1 - Ke CHEN
A1 - Tianlei HU
A1 - Gang CHEN
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 12
SP - 2623
EP - 2637
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2500429
ER -
Abstract: In recent years, multi-label zero-shot learning (ML-ZSL) has attracted increasing attention owing to its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL is predicting multiple labels for unseen classes without any labeled training data, in contrast to conventional supervised learning. Existing methods, however, face two significant obstacles: the substantial semantic gap between modalities, which impedes effective knowledge transfer, and the intricate relationships among labels, which are difficult to model accurately. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach that combines the strengths of multimodal large language models with graph-based structures to enhance the reasoning process underlying multi-label prediction. First, we present a multimodal chain-of-thought reasoning framework that imitates human-like step-by-step reasoning to produce multi-label predictions. Second, we present a technique for integrating label graphs into the reasoning process, capturing the complex semantic relationships among labels and thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that GMCoT outperforms state-of-the-art methods in ML-ZSL.
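The abstract describes GMCoT only at a high level; the full method is not reproduced here. As a rough illustration of its two ingredients, the minimal PyTorch sketch below first propagates label word embeddings over a toy co-occurrence graph to obtain graph-aware per-label classifiers, and then refines an initial confident label set through graph affinity as a crude stand-in for the step-by-step reasoning chain. All class names, dimensions, thresholds, and the refinement heuristic are illustrative assumptions, not the authors' implementation (which reasons with a multimodal large language model rather than a lightweight GCN).

import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGraphPropagation(nn.Module):
    """Two-layer GCN that turns label word embeddings into graph-aware
    per-label classifiers (hypothetical component for illustration)."""
    def __init__(self, embed_dim, feat_dim, adj):
        super().__init__()
        # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_hat", d.unsqueeze(1) * a * d.unsqueeze(0))
        self.fc1 = nn.Linear(embed_dim, feat_dim)
        self.fc2 = nn.Linear(feat_dim, feat_dim)

    def forward(self, label_embeds):
        h = F.relu(self.a_hat @ self.fc1(label_embeds))
        return self.a_hat @ self.fc2(h)  # (num_labels, feat_dim)

num_labels, embed_dim, feat_dim = 8, 300, 512
adj = (torch.rand(num_labels, num_labels) > 0.7).float()
adj = ((adj + adj.t()) > 0).float()                # toy symmetric co-occurrence graph

gcn = LabelGraphPropagation(embed_dim, feat_dim, adj)
label_embeds = torch.randn(num_labels, embed_dim)  # stand-in for label word vectors
image_feat = torch.randn(feat_dim)                 # stand-in for a vision-backbone feature

# Step 1: score every label (seen or unseen) against the image feature.
classifiers = gcn(label_embeds)
scores = torch.sigmoid(classifiers @ image_feat)

# Step 2: a crude stand-in for chain-of-thought refinement -- accept the
# confident labels first, then boost their graph neighbors before the
# final prediction, so that related labels are produced consistently.
accepted = scores > 0.5
affinity = adj[:, accepted].sum(dim=1)             # neighbors of accepted labels
refined = scores + 0.1 * torch.tanh(affinity)
print("top labels:", refined.topk(3).indices.tolist())

In practice, the random tensors would be replaced by word vectors for the seen and unseen labels and by features from a pretrained vision backbone, so that unseen labels are scored purely through their embeddings and graph neighborhood.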