JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

Visual knowledge in the big model era: retrospect and prospect

Author(s): Wenguan WANG, Yi YANG, Yunhe PAN
Affiliation(s): College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Corresponding email(s): yangyics@zju.edu.cn
Key Words: Visual knowledge; Artificial intelligence; Foundation model; Deep learning


Wenguan WANG, Yi YANG, Yunhe PAN. Visual knowledge in the big model era: retrospect and prospect[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400250

@article{title="Visual knowledge in the big model era: retrospect and prospect",
author="Wenguan WANG, Yi YANG, Yunhe PAN",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2400250"
}

%0 Journal Article
%T Visual knowledge in the big model era: retrospect and prospect
%A Wenguan WANG
%A Yi YANG
%A Yunhe PAN
%J Frontiers of Information Technology & Electronic Engineering
%P 1-19
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400250

TY - JOUR
T1 - Visual knowledge in the big model era: retrospect and prospect
A1 - Wenguan WANG
A1 - Yi YANG
A1 - Yunhe PAN
JO - Frontiers of Information Technology & Electronic Engineering
SP - 1
EP - 19
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400250
ER -


Abstract: Visual knowledge is a new form of knowledge representation, deeply rooted in cognitive psychology, that can encapsulate visual concepts and their relations in a succinct, comprehensive, and interpretable manner. Because knowledge of the visual world has been identified as an indispensable component of human cognition and intelligence, visual knowledge is poised to play a pivotal role in establishing machine intelligence. With recent advances in artificial intelligence (AI) techniques, large AI models (or foundation models) have emerged as a potent tool capable of extracting versatile patterns from broad data as implicit knowledge and abstracting them into an enormous number of numeric parameters. To pave the way for creating visual-knowledge-empowered AI machines in this coming wave, we present a timely review that investigates the origins and development of visual knowledge in the pre-big-model era and highlights the opportunities for, and unique role of, visual knowledge in the big model era.

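Summary: Visual knowledge is a new form of knowledge representation whose theoretical roots lie deep in cognitive science. It aims to provide a unified, comprehensive, and interpretable theoretical framework and modeling approach for the core elements of visual intelligence, such as visual concepts, visual relations, visual operations, and visual reasoning. Research in cognitive science has demonstrated that vision-related knowledge plays an indispensable role in human cognition and intelligent behavior, from which it can be inferred that representing and learning visual knowledge will be important for developing visual intelligence and machine intelligence. In recent years, artificial intelligence has advanced steadily; in particular, large AI models have exhibited a level of intelligence beyond that of traditional models. Large models can automatically discover general regularities in massive data and encode them into the parameters of very-large-scale neural networks, achieving large-scale automatic knowledge extraction and parameterized storage of implicit knowledge. This new round of the AI technology revolution, driven by large models, will bring new opportunities and challenges for building advanced intelligent agents equipped with visual knowledge. To this end, this paper analyzes the theoretical foundations of visual knowledge in depth and systematically reviews recent developments in fields related to visual knowledge. It also offers forward-looking views on the directions in which visual knowledge may develop in the big model era and the key roles it may play.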



