CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-02-16
Jinyi GUO, Jieyu DING. Robust cross-modal retrieval with alignment refurbishment[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2200514
Robust cross-modal retrieval with alignment refurbishment
1 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2 School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China
Abstract: Cross-modal retrieval builds a consistent alignment across data of different modalities so that each modality can be used to retrieve the other. Many cross-modal retrieval methods have been proposed and have achieved good performance. These methods are trained on cleanly aligned cross-modal data. Although such data are semantically matched, they are expensive to annotate compared with the noisily aligned data (paired but semantically mismatched) that are readily available on the Internet. When these models are trained on noisily aligned data, their performance degrades drastically. Therefore, this paper proposes a robust cross-modal retrieval algorithm with alignment refurbishment (RCAR), which significantly reduces the effect of noisy data on the model. Specifically, RCAR first performs multi-task learning to slow down the model's overfitting to the noise so that the data become separable. It then uses a two-component beta-mixture model to divide the samples into clean and noisy data, and refurbishes the alignment labels according to the posterior probability. In addition, two noise types are defined in the noisy-alignment paradigm: partially noisy data and completely noisy data. Experimental results show that, compared with popular cross-modal retrieval methods, RCAR achieves more robust performance under both noise types.
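The clean/noisy separation step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function names (`beta_mixture_split`, `refurbish`), the method-of-moments EM updates, and the convex form of the label refurbishment are all assumptions made for the sketch; the fixed point is that per-sample losses are fitted with a two-component Beta mixture and each sample's clean posterior drives the label correction.

```python
import math

def beta_pdf(x, a, b):
    # Beta density, computed via log-gamma for numerical stability
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def fit_weighted_beta(xs, ws):
    # Method-of-moments Beta fit under per-sample responsibilities ws
    total = sum(ws) + 1e-12
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    var = max(var, 1e-4)  # floor keeps the fitted parameters moderate
    common = max(mean * (1 - mean) / var - 1, 1e-3)
    return max(mean * common, 1e-3), max((1 - mean) * common, 1e-3)

def beta_mixture_split(losses, n_iter=10):
    """Fit a two-component Beta mixture to per-sample losses via EM and
    return P(clean | loss) for each sample."""
    lo, hi = min(losses), max(losses)
    # Rescale losses into the open interval (0, 1)
    xs = [0.999 * (l - lo) / (hi - lo + 1e-12) + 5e-4 for l in losses]
    # Initial guess: a low-loss (clean) and a high-loss (noisy) component
    a1, b1, a2, b2, pi_noisy = 2.0, 5.0, 5.0, 2.0, 0.5
    post = [0.5] * len(xs)
    for _ in range(n_iter):
        # E-step: responsibility of the clean component for each sample
        post = []
        for x in xs:
            pc = (1.0 - pi_noisy) * beta_pdf(x, a1, b1) + 1e-12
            pn = pi_noisy * beta_pdf(x, a2, b2) + 1e-12
            post.append(pc / (pc + pn))
        # M-step: refit each component and the mixing weight
        a1, b1 = fit_weighted_beta(xs, post)
        a2, b2 = fit_weighted_beta(xs, [1 - p for p in post])
        pi_noisy = sum(1 - p for p in post) / len(xs)
    return post

def refurbish(label, prediction, p_clean):
    # Assumed convex refurbishment: trust the given alignment label in
    # proportion to its clean posterior, the model's prediction otherwise
    return p_clean * label + (1 - p_clean) * prediction
```

In training, `losses` would be the per-pair matching losses collected after the multi-task warm-up; pairs assigned a low clean posterior have their alignment labels softened toward the model's own similarity score rather than being discarded outright.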
Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000-2025 Journal of Zhejiang University-SCIENCE