CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2021-04-22
Yahong Han, Aming Wu, Linchao Zhu, Yi Yang. Visual commonsense reasoning with directional visual connections[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(5): 625-637.
@article{Han2021VCR,
title="Visual commonsense reasoning with directional visual connections",
author="Yahong Han and Aming Wu and Linchao Zhu and Yi Yang",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="22",
number="5",
pages="625-637",
year="2021",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2000722"
}
%0 Journal Article
%T Visual commonsense reasoning with directional visual connections
%A Yahong Han
%A Aming Wu
%A Linchao Zhu
%A Yi Yang
%J Frontiers of Information Technology & Electronic Engineering
%V 22
%N 5
%P 625-637
%@ 2095-9184
%D 2021
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2000722
TY - JOUR
T1 - Visual commonsense reasoning with directional visual connections
A1 - Yahong Han
A1 - Aming Wu
A1 - Linchao Zhu
A1 - Yi Yang
JO - Frontiers of Information Technology & Electronic Engineering
VL - 22
IS - 5
SP - 625
EP - 637
SN - 2095-9184
Y1 - 2021
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2000722
ER -
Abstract: Visual commonsense reasoning (VCR) has been proposed to boost research into cognition-level visual understanding, i.e., making accurate inferences based on a thorough understanding of visual details. Compared with traditional visual question answering, which requires models only to select correct answers, VCR requires models to select both the correct answers and the correct rationales. Recent research into human cognition indicates that brain function, or cognition, can be regarded as a global and dynamic integration of local neuron connectivity, and that this integration helps in solving specific cognitive tasks. Inspired by this idea, we propose a directional connective network for VCR that dynamically reorganizes visual neuron connectivity, contextualizes it with the meaning of questions and answers, and leverages directional information to enhance reasoning. Specifically, we first develop a GraphVLAD module that captures visual neuron connectivity to fully model correlations among visual content. Then, a contextualization process is proposed to fuse sentence representations with visual neuron representations. Finally, based on the contextualized connectivity, we propose directional connectivity, including a ReasonVLAD module, to infer answers and rationales. Experimental results on the VCR dataset and visualization analyses demonstrate the effectiveness of our method.
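To make the first step of the pipeline concrete, below is a minimal sketch of a NetVLAD-style soft-assignment aggregation followed by a single graph-convolution step, approximating the GraphVLAD idea of clustering detector region features into "visual neuron" nodes and modeling their connectivity. The class name, layer sizes, and the similarity-based adjacency construction are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: assumes region features from an object detector
# (e.g., 36 regions of 512 dimensions per image); all names and sizes are
# hypothetical and do not reproduce the paper's exact GraphVLAD module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphVLADSketch(nn.Module):
    def __init__(self, feat_dim=512, num_clusters=16):
        super().__init__()
        # Learnable cluster centres ("visual neurons") and soft-assignment weights.
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.assign = nn.Linear(feat_dim, num_clusters)
        # One graph-convolution layer applied over the aggregated cluster nodes.
        self.gcn = nn.Linear(feat_dim, feat_dim)

    def forward(self, regions):
        # regions: (batch, num_regions, feat_dim) region features.
        a = F.softmax(self.assign(regions), dim=-1)               # (B, R, K) soft assignments
        residual = regions.unsqueeze(2) - self.centers            # (B, R, K, D) residuals to centres
        nodes = (a.unsqueeze(-1) * residual).sum(dim=1)           # (B, K, D) VLAD-style node features
        nodes = F.normalize(nodes, dim=-1)
        # Build a dense adjacency from node similarity (an assumption), then one GCN step.
        adj = F.softmax(torch.bmm(nodes, nodes.transpose(1, 2)), dim=-1)  # (B, K, K)
        return F.relu(self.gcn(torch.bmm(adj, nodes)))            # (B, K, D) connected representation

# Usage with illustrative numbers: 2 images, 36 regions, 512-d features.
feats = torch.randn(2, 36, 512)
print(GraphVLADSketch()(feats).shape)  # torch.Size([2, 16, 512])
```

In the full model described by the abstract, the resulting node representations would then be contextualized with question and answer embeddings and passed to the directional reasoning stage; those steps are omitted here.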
[1]Anderson P, He XD, Buehler C, et al., 2018. Bottom-up and top-down attention for image captioning and visual question answering. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6077-6086.
[2]Antol S, Agrawal A, Lu JS, et al., 2015. VQA: visual question answering. Proc IEEE Int Conf on Computer Vision, p.2425-2433.
[3]Arandjelović R, Gronat P, Torii A, et al., 2018. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Trans Patt Anal Mach Intell, 40(6):1437-1451.
[4]Badrinarayanan V, Kendall A, Cipolla R, 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Patt Anal Mach Intell, 39(12):2481-2495.
[5]Bansal A, Zhang YT, Chellappa R, 2020. Visual question answering on image sets. European Conf on Computer Vision, p.51-67.
[6]Ben-younes H, Cadene R, Cord M, et al., 2017. MUTAN: multimodal tucker fusion for visual question answering. Proc IEEE Int Conf on Computer Vision, p.2631-2639.
[7]Bola M, Sabel BA, 2015. Dynamic reorganization of brain functional networks during cognition. NeuroImage, 114:398-413.
[8]Cadene R, Ben-younes H, Cord M, et al., 2019. MUREL: multimodal relational reasoning for visual question answering. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1989-1998.
[9]Chen L, Yan X, Xiao J, et al., 2020. Counterfactual samples synthesizing for robust visual question answering. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.10797-10806.
[10]Chen LC, Papandreou G, Kokkinos I, et al., 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Patt Anal Mach Intell, 40(4):834-848.
[11]Chen YP, Rohrbach M, Yan ZC, et al., 2019. Graph-based global reasoning networks. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.433-442.
[12]Devlin J, Chang MW, Lee K, et al., 2019. BERT: pre-training of deep bidirectional transformers for language understanding. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), p.4171-4186.
[13]Feltovich PJ, Ford KM, Hoffman RR, 1997. Expertise in Context: Human and Machine. MIT Press, Cambridge, MA, USA, p.67-99.
[14]Gao P, Li H, Li S, et al., 2018. Question-guided hybrid convolution for visual question answering. European Conf on Computer Vision, p.485-501.
[15]Girshick R, 2015. Fast R-CNN. Proc IEEE Int Conf on Computer Vision, p.1440-1448.
[16]Goyal Y, Khot T, Summers-Stay D, et al., 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.6325-6334.
[17]He KM, Zhang XY, Ren SQ, et al., 2016. Deep residual learning for image recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.770-778.
[18]Hochreiter S, Schmidhuber J, 1997. Long short-term memory. Neur Comput, 9(8):1735-1780.
[19]Jégou H, Douze M, Schmid C, et al., 2010. Aggregating local descriptors into a compact image representation. Proc IEEE Computer Society Conf on Computer Vision and Pattern Recognition, p.3304-3311.
[20]Kim KM, Choi SH, Kim JH, et al., 2018. Multimodal dual attention memory for video story question answering. https://arxiv.org/abs/1809.07999
[21]Kipf TN, Welling M, 2016. Semi-supervised classification with graph convolutional networks. https://arxiv.org/abs/1609.02907v4
[22]Le TM, Le V, Venkatesh S, et al., 2020. Hierarchical conditional relation networks for video question answering. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9969-9978.
[23]Li G, Duan N, Fang YJ, et al., 2020. Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. Proc AAAI Conf on Artificial Intelligence, p.11336-11344.
[24]Li LH, Yatskar M, Yin D, et al., 2019. VisualBERT: a simple and performant baseline for vision and language. https://arxiv.org/abs/1908.03557
[25]Liu W, Anguelov D, Erhan D, et al., 2016. SSD: single shot multibox detector. European Conf on Computer Vision, p.21-37.
[26]Lu JS, Xiong CM, Parikh D, et al., 2017. Knowing when to look: adaptive attention via a visual sentinel for image captioning. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.3242-3250.
[27]Lu JS, Batra D, Parikh D, et al., 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. https://arxiv.org/abs/1908.02265
[28]Malinowski M, Doersch C, Santoro A, et al., 2018. Learning visual question answering by bootstrapping hard attention. European Conf on Computer Vision, p.3-20.
[29]Monti F, Boscaini D, Masci J, et al., 2017. Geometric deep learning on graphs and manifolds using mixture model CNNs. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.5425-5434.
[30]Narasimhan M, Lazebnik S, Schwing AG, 2018. Out of the box: reasoning with graph convolution nets for factual visual question answering. Proc 32nd Int Conf on Neural Information Processing Systems, p.2659-2670.
[31]Norcliffe-Brown W, Vafeias ES, Parisot S, 2018. Learning conditioned graph structures for interpretable visual question answering. https://arxiv.org/abs/1806.07243
[32]Pan YH, 2019. On visual knowledge. Front Inform Technol Electron Eng, 20(8):1021-1025.
[33]Pan YH, 2020. Miniaturized five fundamental issues about visual knowledge. Front Inform Technol Electron Eng, online.
[34]Park HJ, Friston K, 2013. Structural and functional brain networks: from connections to cognition. Science, 342(6158):1238411.
[35]Perez E, Strub F, de Vries H, et al., 2017. FiLM: visual reasoning with a general conditioning layer. https://arxiv.org/abs/1709.07871v2
[36]Schwartz I, Yu S, Hazan T, et al., 2019. Factor graph attention. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2039-2048.
[37]Su WJ, Zhu XZ, Cao Y, et al., 2019. VL-BERT: pre-training of generic visual-linguistic representations. https://arxiv.org/abs/1908.08530v1
[38]van der Maaten L, Hinton G, 2008. Visualizing data using t-SNE. J Mach Learn Res, 9:2579-2605.
[39]Vaswani A, Shazeer N, Parmar N, et al., 2017. Attention is all you need. Proc 31st Int Conf on Neural Information Processing Systems, p.6000-6010.
[40]Veličković P, Cucurull G, Casanova A, et al., 2018. Graph attention networks. Proc Int Conf on Learning Representations.
[41]Wu AM, Zhu LC, Han YH, et al., 2019. Connective cognition network for directional visual commonsense reasoning. Proc 33rd Conf on Neural Information Processing Systems, p.5669-5679.
[42]Xu K, Ba JL, Kiros R, et al., 2015. Show, attend and tell: neural image caption generation with visual attention. Proc 32nd Int Conf on Machine Learning, p.2048-2057.
[43]Xu K, Wu LF, Wang ZG, et al., 2018. Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. Proc Conf on Empirical Methods in Natural Language Processing, p.918-924.
[44]Zellers R, Bisk Y, Farhadi A, et al., 2019. From recognition to cognition: visual commonsense reasoning. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6713-6724.
[45]Zhou J, Cui GQ, Zhang ZY, et al., 2018. Graph neural networks: a review of methods and applications. https://arxiv.org/abs/1812.08434v3