CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-02-16
Jinyi GUO, Jieyu DING. Robust cross-modal retrieval with alignment refurbishment[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2200514
Robust cross-modal retrieval with alignment refurbishment
1 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2 School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China
Abstract: Cross-modal retrieval builds a consistent alignment across data of different modalities so that each modality can be used to retrieve the other. Many cross-modal retrieval methods have been proposed and have achieved good performance. These methods are trained on cleanly aligned cross-modal data. Although such data are semantically matched, they are expensive to annotate compared with the noisily aligned data (paired but semantically mismatched) that are readily available on the Internet. When these models are trained on noisily aligned data, their performance degrades drastically. Therefore, this paper proposes a robust cross-modal retrieval algorithm with alignment refurbishment (RCAR), which significantly reduces the effect of noisy data on the model. Specifically, RCAR first performs multi-task learning to slow down the model's overfitting to the noise so that the data become separable. It then uses a two-component beta-mixture model to divide the samples into clean and noisy data, and refurbishes the alignment labels according to the posterior probability. In addition, two noise types are defined in the noisy-alignment paradigm: partially noisy data and completely noisy data. Experimental results show that, compared with popular cross-modal retrieval methods, RCAR achieves more robust performance under both noise types.
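The clean/noisy separation step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function names (`beta_mixture_split`, `refurbish`), the method-of-moments EM updates, and the convex form of the label refurbishment are all assumptions made for the sketch; the fixed point is that per-sample losses are fitted with a two-component Beta mixture and each sample's clean posterior drives the label correction.

```python
import math

def beta_pdf(x, a, b):
    # Beta density, computed via log-gamma for numerical stability
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def fit_weighted_beta(xs, ws):
    # Method-of-moments Beta fit under per-sample responsibilities ws
    total = sum(ws) + 1e-12
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    var = max(var, 1e-4)  # floor keeps the fitted parameters moderate
    common = max(mean * (1 - mean) / var - 1, 1e-3)
    return max(mean * common, 1e-3), max((1 - mean) * common, 1e-3)

def beta_mixture_split(losses, n_iter=10):
    """Fit a two-component Beta mixture to per-sample losses via EM and
    return P(clean | loss) for each sample."""
    lo, hi = min(losses), max(losses)
    # Rescale losses into the open interval (0, 1)
    xs = [0.999 * (l - lo) / (hi - lo + 1e-12) + 5e-4 for l in losses]
    # Initial guess: a low-loss (clean) and a high-loss (noisy) component
    a1, b1, a2, b2, pi_noisy = 2.0, 5.0, 5.0, 2.0, 0.5
    post = [0.5] * len(xs)
    for _ in range(n_iter):
        # E-step: responsibility of the clean component for each sample
        post = []
        for x in xs:
            pc = (1.0 - pi_noisy) * beta_pdf(x, a1, b1) + 1e-12
            pn = pi_noisy * beta_pdf(x, a2, b2) + 1e-12
            post.append(pc / (pc + pn))
        # M-step: refit each component and the mixing weight
        a1, b1 = fit_weighted_beta(xs, post)
        a2, b2 = fit_weighted_beta(xs, [1 - p for p in post])
        pi_noisy = sum(1 - p for p in post) / len(xs)
    return post

def refurbish(label, prediction, p_clean):
    # Assumed convex refurbishment: trust the given alignment label in
    # proportion to its clean posterior, the model's prediction otherwise
    return p_clean * label + (1 - p_clean) * prediction
```

In training, `losses` would be the per-pair matching losses collected after the multi-task warm-up; pairs assigned a low clean posterior have their alignment labels softened toward the model's own similarity score rather than being discarded outright.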
Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000-2025 Journal of Zhejiang University-SCIENCE