CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-10-17
Yiming LEI, Jingqi LI, Zilong LI, Yuan CAO, Hongming SHAN. Prompt learning in computer vision: a survey[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300389
Prompt learning in computer vision: a survey

1 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200438, China
2 Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai 200433, China
3 Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
4 Shanghai Center for Brain Science and Brain-inspired Technology, Shanghai 201210, China

Abstract: Prompt learning has attracted broad attention in computer vision since the emergence of large pre-trained vision-language models (VLMs). Building on the close relationship between visual and language information established by VLMs, prompt learning has become a crucial technique in many important applications, such as artificial intelligence generated content (AIGC). This survey provides a progressive and comprehensive review of visual prompt learning as related to AIGC. We first introduce VLMs, the foundation of visual prompt learning. We then review visual prompt learning methods and prompt-guided generative models, and discuss how to improve the efficiency of adapting AIGC models to specific downstream tasks. Finally, we outline some promising research directions in prompt learning.

Key words:
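To make the core idea concrete, the sketch below illustrates CoOp-style prompt learning on a frozen VLM: a small set of learnable context vectors is prepended to the class-name token embeddings and tuned by backpropagation while the pre-trained encoders stay frozen. This is a minimal, hypothetical illustration, not code from the survey; all names, shapes, and the `text_encoder` callable are assumptions.

```python
# Minimal sketch of CoOp-style prompt tuning for a frozen vision-language
# model. Illustrative only: names, shapes, and the text_encoder interface
# are assumptions, not the survey's implementation.
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    def __init__(self, n_ctx: int, embed_dim: int, class_embeds: torch.Tensor):
        super().__init__()
        # Learnable context vectors shared across classes (the soft prompt,
        # replacing a hand-crafted template such as "a photo of a ...").
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Frozen token embeddings of the class names: (n_cls, n_tok, embed_dim).
        self.register_buffer("class_embeds", class_embeds)

    def forward(self) -> torch.Tensor:
        n_cls = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prepend the shared learned context to every class-name sequence.
        return torch.cat([ctx, self.class_embeds], dim=1)

def prompt_tuning_step(image_feats, text_encoder, prompt_learner, labels,
                       optimizer, logit_scale=100.0):
    """One training step: only the context vectors receive gradients;
    image_feats come from the frozen image encoder, and text_encoder is
    the frozen text encoder applied to the prompt embeddings."""
    prompts = prompt_learner()                    # (n_cls, n_ctx + n_tok, d)
    text_feats = text_encoder(prompts)            # (n_cls, d)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = logit_scale * image_feats @ text_feats.t()
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the optimizer is constructed over `prompt_learner.parameters()` only, which is what makes this family of methods parameter-efficient compared with full fine-tuning of the VLM.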