CLC number: TP391.4
On-line Access: 2025-07-28
Received: 2024-10-12
Revision Accepted: 2025-01-24
Crosschecked: 2025-07-30
Qi LIU, Shuanglin YANG, Zejian LI, Lefan HOU, Chenye MENG, Ying ZHANG, Lingyun SUN. Image generation evaluation: a comprehensive survey of human and automatic evaluations[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400904
Image generation evaluation: a comprehensive survey of human and automatic evaluations

1 School of Software Technology, Zhejiang University, Ningbo 315100, China
2 School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
3 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract: Image generation models have made remarkable progress, and image evaluation is crucial for interpreting and advancing these models. Existing studies have extensively explored both human and automatic evaluation of image generation. This paper presents a systematic survey of this research, covering two core parts: evaluation protocols and evaluation methods. First, we summarize 10 categories of image generation tasks, focusing on how they differ from an evaluation perspective. On this basis, we propose a new evaluation protocol that covers the important evaluation aspects required for human and automatic evaluation across different image generation tasks. Second, we review the automatic evaluation methods proposed over the past five years. To the best of our knowledge, this paper provides the first comprehensive summary of human evaluation, covering evaluation methods, tools, evaluation details, and data analysis methods. Finally, we discuss the current challenges and future directions of image generation evaluation. We hope this survey helps researchers systematically understand image generation evaluation, keep abreast of the latest progress in the field, and advance related research.
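Among the automatic evaluation methods the survey reviews are distribution-based scores such as the Fréchet inception distance (FID), which compares Gaussian statistics of real and generated image features. As a hedged illustration only (not a method from this paper), the univariate special case of the Fréchet distance between two Gaussians reduces to a closed form that can be computed directly:

```python
def frechet_distance_1d(mu1: float, sigma1: float,
                        mu2: float, sigma2: float) -> float:
    """Fréchet distance between two univariate Gaussians N(mu1, sigma1^2)
    and N(mu2, sigma2^2).

    FID applies the multivariate form of this formula to Inception-feature
    statistics of real vs. generated images; this 1-D sketch only shows
    the shape of the computation.
    """
    return (mu1 - mu2) ** 2 + sigma1 ** 2 + sigma2 ** 2 - 2.0 * sigma1 * sigma2

# Identical statistics give distance 0; the distance grows as the
# generated statistics drift from the real ones.
print(frechet_distance_1d(0.0, 1.0, 0.0, 1.0))  # 0.0
print(frechet_distance_1d(0.5, 1.0, 0.0, 2.0))  # 1.25
```

In practice, FID is computed over multivariate feature statistics, where the cross term becomes a matrix square root; the function name above is hypothetical and chosen only for this sketch.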
Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.833-842. ![]() [205]Zhang HG, Dai YC, Li HD, et al., 2019. Deep stacked hierarchical multi-patch network for image deblurring. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5971-5979. ![]() [206]Zhang JM, Ma CX, Yang KL, et al., 2022. Transfer beyond the field of view: dense panoramic semantic segmentation via unsupervised domain adaptation. IEEE Trans Intell Transp Syst, 23(7):9478-9491. ![]() [207]Zhang KH, Ren WQ, Luo WH, et al., 2022. Deep image deblurring: a survey. Int J Comput Vis, 130(9):2103-2130. ![]() [208]Zhang PZ, Yang LX, Xie XH, et al., 2022. Lightweight texture correlation network for pose guided person image generation. IEEE Trans Circ Syst Video Technol, 32(7):4584-4598. ![]() [209]Zhang R, Isola P, Efros AA, 2016. Colorful image colorization. 14th European Conf on Computer, p.649-666. ![]() [210]Zhang R, Isola P, Efros AA, et al., 2018. The unreasonable effectiveness of deep features as a perceptual metric. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.586-595. ![]() [211]Zhang SX, Wang BH, Wu JQ, et al., 2024. Learning multi-dimensional human preference for text-to-image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8018-8027. ![]() [212]Zhang XB, Zhai DH, Li TR, et al., 2023. Image inpainting based on deep learning: a review. Inform Fus, 90:74-94. ![]() [213]Zhang YK, Meng CY, Li ZJ, et al., 2023. Learning object consistency and interaction in image generation from scene graphs. Proc 32nd Int Joint Conf on Artificial Intelligence, p.1731-1739. ![]() [214]Zhao B, Meng LL, Yin WD, et al., 2019. Image generation from layout. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8576-8585. ![]() [215]Zhao K, Yuan K, Sun M, et al., 2023. Quality-aware pretrained models for blind image quality assessment. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.22302-22313. 
![]() [216]Zhao SY, Zhang L, Shen Y, et al., 2021. RefineDNet: a weakly supervised refinement framework for single image dehazing. IEEE Trans Image Process, 30:3391-3404. ![]() [217]Zhao Y, Ren DY, Chen Y, et al., 2022. Cartoon image processing: a survey. Int J Comput Vis, 130(11):2733-2769. ![]() [218]Zhao YQ, Ding HH, Huang HJ, et al., 2022. A closer look at few-shot image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9130-9140. ![]() [219]Zhao YQ, Chandrasegaran K, Abdollahzadeh M, et al., 2023. AdAM: few-shot image generation via adaptation-aware kernel modulation. https://arxiv.org/abs/2307.01465 ![]() [220]Zheng BY, Gu JJ, Li SJ, et al., 2024. LM4LV: a frozen large language model for low-level vision tasks. https://arxiv.org/abs/2405.15734 ![]() [221]Zheng WD, Teng JY, Yang ZY, et al., 2024. CogView3: finer and faster text-to-image generation via relay diffusion. 18th European Conf on Computer Vision, p.1-22. ![]() [222]Zhou MQ, Wang YX, Hou J, et al., 2024. SceneX: procedural controllable large-scale scene generation. https://arxiv.org/abs/2403.15698 ![]() [223]Zhou S, Gordon ML, Krishna R, et al., 2019. HYPE: a benchmark for human eye perceptual evaluation of generative models. Proc 33rd Int Conf on Neural Information Processing Systems, p.3449-3461. ![]() [224]Zhou YF, Liu BC, Zhu YZ, et al., 2023. Shifted diffusion for text-to-image generation. Proc of the IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.10157-10166. ![]() [225]Zhou YF, Zhang RY, Zheng KZ, et al., 2024. Toffee: efficient million-scale dataset construction for subject-driven text-to-image generation. https://arxiv.org/abs/2406.09305 ![]() [226]Zhu JY, Ma HM, Chen JS, et al., 2024. High-quality and diverse few-shot image generation via masked discrimination. IEEE Trans Image Process, 33:2950-2965. ![]() [227]Zhu WH, Zhai GT, Hu MH, et al., 2018. Arrow’s impossibility theorem inspired subjective image quality assessment approach. 
Signal Process, 145:193-201. ![]() [228]Zhu Z, Huang TT, Shi BG, et al., 2019. Progressive pose attention transfer for person image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2342-2351. ![]() [229]Zhu Z, Xu ZL, You AS, et al., 2020. Semantically multi-modal image synthesis. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5466-5475. ![]() Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn Copyright © 2000 - 2025 Journal of Zhejiang University-SCIENCE |
Open peer comments: Debate/Discuss/Question/Opinion
<1>