JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

Image generation evaluation: a comprehensive survey of human and automatic evaluations

Author(s): Qi LIU, Shuanglin YANG, Zejian LI, Lefan HOU, Chenye MENG, Ying ZHANG, Lingyun SUN
Affiliation(s): School of Software Technology, Zhejiang University, Ningbo 315100, China; more
Corresponding email(s): zejianlee@zju.edu.cn
Key Words: Image generation evaluation; Human evaluation; Automatic evaluation; Evaluation protocols; Evaluation aspects

Share this article to： More <<< Previous Paper \|Next Paper >>>

Qi LIU, Shuanglin YANG, Zejian LI, Lefan HOU, Chenye MENG, Ying ZHANG, Lingyun SUN. Image generation evaluation: a comprehensive survey of human and automatic evaluations[J]. Frontiers of Information Technology & Electronic Engineering,in press.https://doi.org/10.1631/FITEE.2400904

@article{title="Image generation evaluation: a comprehensive survey of human and automatic evaluations",
author="Qi LIU, Shuanglin YANG, Zejian LI, Lefan HOU, Chenye MENG, Ying ZHANG, Lingyun SUN",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2400904"
}

%0 Journal Article
%T Image generation evaluation: a comprehensive survey of human and automatic evaluations
%A Qi LIU
%A Shuanglin YANG
%A Zejian LI
%A Lefan HOU
%A Chenye MENG
%A Ying ZHANG
%A Lingyun SUN
%J Frontiers of Information Technology & Electronic Engineering
%P 1027-1065
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
doi="https://doi.org/10.1631/FITEE.2400904"

TY - JOUR
T1 - Image generation evaluation: a comprehensive survey of human and automatic evaluations
A1 - Qi LIU
A1 - Shuanglin YANG
A1 - Zejian LI
A1 - Lefan HOU
A1 - Chenye MENG
A1 - Ying ZHANG
A1 - Lingyun SUN
J0 - Frontiers of Information Technology & Electronic Engineering
SP - 1027
EP - 1065
%@ 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
ER -
doi="https://doi.org/10.1631/FITEE.2400904"

Abstract
Chinese Summary
Academic Network
Reviewer Comment

Abstract: Image generation models have made remarkable progress, and image evaluation is crucial for explaining and driving the development of these models. Previous studies have extensively explored human and automatic evaluations of image generation. Herein, these studies are comprehensively surveyed, specifically for two main parts: evaluation protocols and evaluation methods. First, 10 image generation tasks are summarized with focus on their differences in evaluation aspects. Based on this, a novel protocol is proposed to cover human and automatic evaluation aspects required for various image generation tasks. Second, the review of automatic evaluation methods in the past five years is highlighted. To our knowledge, this paper presents the first comprehensive summary of human evaluation, encompassing evaluation methods, tools, details, and data analysis methods. Finally, the challenges and potential directions for image generation evaluation are discussed. We hope that this survey will help researchers develop a systematic understanding of image generation evaluation, stay updated with the latest advancements in the field, and encourage further research.

图像生成评估：人类评估与自动评估的全面综述

刘绮¹，杨双林²，李泽健¹，侯乐凡³，孟辰烨³，张颖¹，孙凌云³
¹浙江大学软件学院，中国宁波市，315100
²东南大学计算机科学与工程学院，中国南京市，211189
³浙江大学计算机科学与技术学院，中国杭州市，310027
摘要：图像生成模型取得了显著进展，其中图像评估在解释和推动这些模型的发展方面至关重要。现有研究广泛探讨了图像生成的人类评估与自动评估。本文对相关研究进行了系统综述，重点涵盖两个核心部分：评估协议与评估方法。首先，总结了10类图像生成任务，重点关注它们在评估方面的差异。基于此，提出一种新的评估协议，以涵盖不同图像生成任务所需的人类与自动评估的重要评估方面。其次，重点回顾过去5年中提出的自动评估方法。据我们所知，本文是对人工评估的首次全面总结，涵盖评估方法、工具、评估细节及数据分析方法。最后，探讨了当前图像生成评估面临的挑战及未来发展方向。希望本综述能够帮助研究人员系统理解图像生成评估，掌握该领域最新进展，并推动相关研究的开展。

关键词组：图像生成评估；人类评估；自动评估；评估协议；评估方面

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Agustsson E, Tschannen M, Mentzer F, et al., 2019. Generative adversarial networks for extreme learned image compression. Proc IEEE/CVF Int Conf on Computer Vision, p.221-231.

[2]Ak K, Kassim A, Lim JH, et al., 2019. Attribute manipulation generative adversarial networks for fashion images. Proc IEEE/CVF Int Conf on Computer Vision, p.10540-10549.

[3]Alhabeeb SK, Al-Shargabi AA, 2024. Text-to-image synthesis with generative models: methods, datasets, performance metrics, challenges, and future direction. IEEE Access, 12:24412-24427.

[4]Alqahtani H, Kavakli-Thorne M, Kumar G, 2021. Applications of generative adversarial networks (GANs): an updated review. Arch Comput Methods Eng, 28(2):525-552.

[5]Andriluka M, Pishchulin L, Gehler P, et al., 2014. 2D human pose estimation: new benchmark and state of the art analysis. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.3686-3693.

[6]Ashual O, Wolf L, 2019. Specifying object attributes and relations in interactive scene generation. Proc IEEE/CVF Int Conf on Computer Vision, p.4560-4568.

[7]Bakr EM, Sun PZ, Shen XQ, et al., 2023. HRS-bench: holistic, reliable and scalable benchmark for text-to-image models. Proc IEEE/CVF Int Conf on Computer Vision, p.19984-19996.

[8]Baraheem SS, Le TN, Nguyen TV, 2023. Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook. Artif Intell Rev, 56(10):10813-10865.

[9]Bar-Tal O, Yariv L, Lipman Y, et al., 2023. MultiDiffusion: fusing diffusion paths for controlled image generation. Proc 40^th Int Conf on Machine Learning, p.1737-1752.

[10]Bau D, Zhu JY, Wulff J, et al., 2019. Seeing what a GAN cannot generate. Proc IEEE/CVF Int Conf on Computer Vision, p.4501-4510.

[11]Benny Y, Galanti T, Benaim S, et al., 2021. Evaluation metrics for conditional image generation. Int J Comput Vis, 129(5):1712-1731.

[12]Betti F, Staiano J, Baraldi L, et al., 2023. Let’s ViCE! Mimicking human cognitive behavior in image generation evaluation. Proc 31^st ACM Int Conf on Multimedia, p.9306-9312.

[13]Bińkowski M, Sutherland DJ, Arbel M, et al., 2021. Demystifying MMD GANs. https://arxiv.org/abs/1801.01401

[14]Blau Y, Michaeli T, 2018. The perception-distortion tradeoff. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.6228-6237.

[15]Borji A, 2019. Pros and cons of GAN evaluation measures. Comput Vis Image Underst, 179:41-65.

[16]Brock A, Donahue J, Simonyan K, 2019. Large scale GAN training for high fidelity natural image synthesis. https://arxiv.org/abs/1809.11096

[17]Cai ZP, Mueller M, Birkl R, et al., 2024. L-MAGIC: language model assisted generation of images with coherence. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7049-7058.

[18]Cao KD, Wei C, Gaidon A, et al., 2019. Learning imbalanced datasets with label-distribution-aware margin loss. Proc 33^rd Int Conf on Neural Information Processing Systems, p.1567-1578.

[19]Chan C, Ginosar S, Zhou TH, et al., 2019. Everybody dance now. Proc IEEE/CVF Int Conf on Computer Vision, p.5932-5941.

[20]Chang XJ, Ren PZ, Xu PF, et al., 2023. A comprehensive survey of scene graphs: generation and application. IEEE Trans Patt Anal Mach Intell, 45(1):1-26.

[21]Chen JS, Ge CJ, Xie EZ, et al., 2024. PIXART-∑: weak-to-strong training of diffusion transformer for 4k text-to-image generation. 18^th European Conf on Computer Vision, p.74-91.

[22]Chen JX, Fan JY, Ye HC, et al., 2023. Exploring kernel-based texture transfer for pose-guided person image generation. IEEE Trans Multim, 25:7337-7349.

[23]Chen P, Li ZJ, Zhang YK, et al., 2022. USIS: a unified semantic image synthesis model trained on a single or multiple samples. Neurocomputing, 514:70-82.

[24]Chen WH, Hu HX, Saharia C, et al., 2022. Re-Imagen: retrieval-augmented text-to-image generator. https://arxiv.org/abs/2209.14491

[25]Chen X, Song J, Hilliges O, 2019. Unpaired pose guided human image generation.

[26]Chen YC, Shen XH, Lin Z, et al., 2019. Semantic component decomposition for face attribute manipulation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9851-9859.

[27]Cheng JX, Liang X, Shi XJ, et al., 2023. LayoutDiffuse: adapting foundational diffusion models for layout-to-image generation. https://arxiv.org/abs/2302.08908

[28]Cho J, Zala A, Bansal M, 2023. DALL-EVAL: probing the reasoning skills and social biases of text-to-image generation models. Proc IEEE/CVF Int Conf on Computer Vision, p.3020-3031.

[29]Cho J, Hu YS, Baldridge J, et al., 2024. Davidsonian scene graph: improving reliability in fine-grained evaluation for text-to-image generation.

[30]Cho SJ, Ji SW, Hong JP, et al., 2021. Rethinking coarse-to-fine approach in single image deblurring. Proc IEEE/CVF Int Conf on Computer Vision, p.4621-4630.

[31]Croitoru FA, Hondru V, Ionescu RT, et al., 2023. Diffusion models in vision: a survey. IEEE Trans Patt Anal Mach Intell, 45(9):10850-10869.

[32]Deng J, Dong W, Socher R, et al., 2009. ImageNet: a large-scale hierarchical image database. IEEE Conf on Computer Vision and Pattern Recognition, p.248-255.

[33]DeVon HA, Block ME, Moyle-Wright P, et al., 2007. A psychometric toolbox for testing validity and reliability. J Nurs Schol, 39(2):155-164.

[34]Ding M, Yang ZY, Hong WY, et al., 2021. CogView: mastering text-to-image generation via Transformers. Proc 35^th Int Conf on Neural Information Processing Systems, Article 1516.

[35]Ding M, Zheng WD, Hong WY, et al., 2022. CogView2: faster and better text-to-image generation via hierarchical Transformers. Proc 36^th Int Conf on Neural Information Processing Systems, Article 1229.

[36]Dinh TM, Nguyen R, Hua BS, 2022. TISE: bag of metrics for text-to-image synthesis evaluation. 17^th European Conf on Computer Vision, p.594-609.

[37]Dong YS, Tan W, Tao DC, et al., 2022. CartoonLossGAN: learning surface and coloring of images for cartoonization. IEEE Trans Image Process, 31:485-498.

[38]Duan YP, Han CY, Tao XM, et al., 2020. Panoramic image generation: from 2-D sketch to spherical image. IEEE J Sel Top Signal Process, 14(1):194-208.

[39]Elasri M, Elharrouss O, Al-Maadeed S, et al., 2022. Image generation: a review. Neur Process Lett, 54(5):4609-4646.

[40]Fan MH, Wang WJ, Yang WH, et al., 2020. Integrating semantic segmentation and retinex model for low-light image enhancement. Proc 28^th ACM Int Conf on Multimedia, p.2317-2325.

[41]Farshad A, Yeganeh Y, Chi Y, et al., 2023. SceneGenie: scene graph guided diffusion models for image synthesis. Proc IEEE/CVF Int Conf on Computer Vision, p.88-98.

[42]Fleisig E, Blodgett SL, Klein D, et al., 2024. The perspectivist paradigm shift: assumptions and challenges of capturing human labels. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.2279-2292.

[43]Foo LG, Rahmani H, Liu J, 2023. AI-generated content (AIGC) for various data modalities: a survey. https://arxiv.org/abs/2308.14177

[44]Frolov S, Hinz T, Raue F, et al., 2021. Adversarial text-to-image synthesis: a review. Neur Netw, 144:187-209.

[45]Frühstück A, Singh KK, Shechtman E, et al., 2022. InsetGAN for full-body image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7713-7722.

[46]Gao CY, Liu Q, Xu Q, et al., 2020. SketchyCOCO: image generation from freehand scene sketches. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5173-5182.

[47]Gao YX, Min XK, Zhu YC, et al., 2022. Image quality assessment: from mean opinion score to opinion score distribution. Proc 30^th ACM Int Conf on Multimedia, p.997-1005.

[48]Graziotin D, Lenberg P, Feldt R, et al., 2021. Psychometrics in behavioral software engineering: a methodological introduction with guidelines. ACM Trans Softw Eng Methodol, 31(1):7.

[49]Grigorev A, Sevastopolsky A, Vakhitov A, et al., 2019. Coordinate-based texture inpainting for pose-guided human image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.12127-12136.

[50]Gu Z, Li WB, Huo J, et al., 2021. LoFGAN: fusing local representations for few-shot image generation. Proc IEEE/CVF Int Conf on Computer Vision, p.8443-8451.

[51]Guo XF, Yang HY, Huang D, 2021. Image inpainting via conditional texture and structure dual generation. Proc IEEE/CVF Int Conf on Computer Vision, p.14114-14123.

[52]Habtegebrial TA, Jampani V, Gallo O, et al., 2020. Generative view synthesis: from single-view semantics to novel-view images. Proc 34^th Int Conf on Neural Information Processing Systems, p.4745-4755.

[53]Hall M, Bell SJ, Ross C, et al., 2024. Towards geographic inclusion in the evaluation of text-to-image models. ACM Conf on Fairness, Accountability, and Transparency, p.585-601.

[54]Hara T, Mukuta Y, Harada T, 2021. Spherical image generation from a single image by considering scene symmetry. Proc AAAI Conf Artif Intell, 35(2):1513-1521.

[55]Hassan MU, Alaliyat S, Hameed IA, 2023. Image generation models from scene graphs and layouts: a comparative analysis. J King Saud Univ-Comput Inform Sci, 35(5): 101543.

[56]He S, Liao WT, Yang MY, et al., 2021. Context-aware layout to image generation with enhanced object appearance. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.15044-15053.

[57]He YT, Salakhutdinov R, Zico Kolter J, 2023. Localized text-to-image generation for free via cross attention control. https://arxiv.org/abs/2306.14636

[58]Hessel J, Holtzman A, Forbes M, et al., 2021. CLIPScore: a reference-free evaluation metric for image captioning. Proc Conf on Empirical Methods in Natural Language Processing, p.7514-7528.

[59]Heusel M, Ramsauer H, Unterthiner T, et al., 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Proc 31^st Int Conf on Neural Information Processing Systems, p.6626-6637.

[60]Hinz T, Heinrich S, Wermter S, 2022. Semantic object accuracy for generative text-to-image synthesis. IEEE Trans Patt Anal Mach Intell, 44(3):1552-1565.

[61]Ho J, Saharia C, Chan W, et al., 2022. Cascaded diffusion models for high fidelity image generation. J Mach Learn Res, 23(1):47.

[62]Ho TT, Virtusio JJ, Chen YY, et al., 2020. Sketch-guided deep portrait generation. ACM Trans Multim Comput Commun Appl, 16(3):1-18.

[63]Hong S, Yan XC, Huang T, et al., 2018. Learning hierarchical semantic image manipulation through structured representations. Proc 32^nd Int Conf on Neural Information Processing Systems, p.2713-2723.

[64]Hong Y, Niu L, Zhang JF, et al., 2020. F2GAN: fusing-and-filling GAN for few-shot image generation. Proc 28^th ACM Int Conf on Multimedia, p.2535-2543.

[65]Hu YS, Liu BL, Kasai J, et al., 2023. TIFA: accurate and interpretable text-to-image faithfulness evaluation with question answering. Proc IEEE/CVF Int Conf on Computer Vision, p.20349-20360.

[66]Hua TY, Zheng HD, Bai YL, et al., 2021. Exploiting relationship for complex-scene image generation. Proc AAAI Conf Artif Intell, 35(2):1584-1592.

[67]Hua XS, Li J, 2015. Prajna: towards recognizing whatever you want from images without image labeling. Proc AAAI Conf Artif Intell, 29(1):137-144.

[68]Huang KY, Sun KY, Xie EZ, et al., 2023. T2I-compBench: a comprehensive benchmark for open-world compositional text-to-image generation. Proc 37^th Int Conf on Neural Information Processing Systems, Article 3443.

[69]Huang ST, Gong B, Feng YT, et al., 2024. Learning disentangled identifiers for action-customized text-to-image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7797-7806.

[70]Huh M, Zhang R, Zhu JY, et al., 2020. Transforming and projecting images into class-conditional generative networks. 16^th European Conf on Computer Vision, p.17-34.

[71]Hulzebosch N, Ibrahimi S, Worring M, 2020. Detecting CNN-generated facial images in real-world scenarios. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops, p.2729-2738.

[72]Ioannou E, Maddock S, 2024. Evaluation in neural style transfer: a review. Comput Graph Forum, 43(6): e15165.

[73]Jayant NS, Noll P, 1984. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice-Hall, Englewood Cliffs, NJ, USA, p.139-140.

[74]Jayasumana S, Ramalingam S, Veit A, et al., 2024. Rethinking FID: towards a better evaluation metric for image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9307-9315.

[75]Jin D, Ma L, Liu RS, et al., 2021. Bridging the gap between low-light scenes: bilevel learning for fast adaptation. Proc 29^th ACM Int Conf on Multimedia, p.2401-2409.

[76]Johnson J, Gupta A, Li FF, 2018. Image generation from scene graphs. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.1219-1228.

[77]Joo D, Kim D, Kim J, 2018. Generating a fusion image: one’s identity and another’s shape. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.1635-1643.

[78]Khashabi D, Stanovsky G, Bragg J, et al., 2022. GENIE: toward reproducible and standardized human evaluation for text generation. Proc Conf on Empirical Methods in Natural Language Processing, p.11444-11458.

[79]Kim T, Song G, Lee S, et al., 2022. L-Verse: bidirectional generation between image and text. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.16505-16515.

[80]Kirstain Y, Polyak A, Singer U, et al., 2023. Pick-a-Pic: an open dataset of user preferences for text-to-image generation. Proc 37^th Int Conf on Neural Information Processing Systems, p.36652-36663.

[81]Klie JC, de Castilho RE, Gurevych I, 2024. Analyzing dataset annotation quality management in the wild. Comput Ling, 50(3):817-866.

[82]Koley S, Bhunia AK, Sain A, et al., 2023. Picture that sketch: photorealistic image generation from abstract sketches. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6850-6861.

[83]Krizhevsky A, 2009. Learning Multiple Layers of Features from Tiny Images. MS Thesis, University of Toronto, Toronto, Canada.

[84]Ku M, Li T, Zhang K, et al., 2024. ImagenHub: standardizing the evaluation of conditional image generation models. https://arxiv.org/abs/2310.01596

[85]Kynkäänniemi T, Karras T, Laine S, et al., 2019. Improved precision and recall metric for assessing generative models. Proc 33^rd Int Conf on Neural Information Processing Systems, p.3927-3936.

[86]Kynkäänniemi T, Karras T, Aittala M, et al., 2023. The role of ImageNet classes in Fréchet inception distance. https://arxiv.org/abs/2203.06026

[87]Ledig C, Theis L, Huszár F, et al., 2017. Photo-realistic single image super-resolution using a generative adversarial network. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.105-114.

[88]Lee T, Yasunaga M, Meng CL, et al., 2023. Holistic evaluation of text-to-image models. Proc 37^th Int Conf on Neural Information Processing Systems, p.69981-70011.

[89]Li BY, Ren WQ, Fu DP, et al., 2019. Benchmarking single-image dehazing and beyond. IEEE Trans Image Process, 28(1):492-505.

[90]Li G, Zhao XF, Cao Y, et al., 2022. Manipulated face detection and localization based on semantic segmentation. Int Workshop on Digital Watermarking, p.98-113.

[91]Li H, Shen CZ, Torr P, et al., 2024. Self-discovering interpretable diffusion latent directions for responsible text-to-image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.12006-12016.

[92]Li HL, Pan SJ, Wang SQ, et al., 2018. Domain generalization with adversarial feature learning. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.5400-5409.

[93]Li J, Yu KW, Zhao YF, et al., 2019. Cross-reference stitching quality assessment for 360° omnidirectional images. Proc 27^th ACM Int Conf on Multimedia, p.2360-2368.

[94]Li JY, Wang N, Zhang LF, et al., 2020. Recurrent feature reasoning for image inpainting. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7757-7765.

[95]Li LJ, Wang G, Li FF, 2007. OPTIMOL: automatic online picture collection via incremental model learning. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1-8.

[96]Li LX, Zhang Y, Wang SH, 2023. The Euclidean space is evil: hyperbolic attribute editing for few-shot image generation. Proc IEEE/CVF Int Conf on Computer Vision, p.22657-22667.

[97]Li SK, Fu JL, Liu KY, et al., 2024. CosmicMan: a text-to-image foundation model for humans. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6955-6965.

[98]Li YJ, Zhang R, Lu JC, et al., 2020. Few-shot image generation with elastic weight consolidation. Proc 34^th Int Conf on Neural Information Processing Systems, p.15885-15896.

[99]Li ZJ, Wu JY, Koh I, et al., 2021. Image synthesis from layout with locality-aware mask adaption. Proc IEEE/CVF Int Conf on Computer Vision, p.13799-13808.

[100]Liang XD, Zhang H, Lin L, et al., 2018. Generative semantic manipulation with mask-contrasting GAN. Proc 15^th European Conf on Computer Vision, p.558-573.

[101]Liang YW, He JF, Li G, et al., 2024. Rich human feedback for text-to-image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.19401-19411.

[102]Lin TY, Maire M, Belongie S, et al., 2014. Microsoft COCO: common objects in context. 13^th European Conf on Computer Vision, p.740-755.

[103]Lin ZQ, Pathak D, Li BQ, et al., 2025. Evaluating text-to-visual generation with image-to-text generation. 18^th European Conf on Computer Vision, p.366-384.

[104]Liu JX, Liu Q, 2024. R3CD: scene graph to image generation with relation-aware compositional contrastive control diffusion. Proc AAAI Conf Artif Intell, 38(4):3657-3665.

[105]Liu RS, Ma L, Ma TY, et al., 2022. Learning with nested scene modeling and cooperative architecture search for low-light vision. IEEE Trans Patt Anal Mach Intell, 45(5):5953-5969.

[106]Liu YF, Qin ZC, Wan T, et al., 2018. Auto-painter: cartoon image generation from sketch by using conditional Wasserstein generative adversarial networks. Neurocomputing, 311:78-87.

[107]Lu JW, Wang H, Shao TJ, et al., 2022. Pose guided image generation from misaligned sources via residual flow based correction. Proc AAAI Conf Artif Intell, 36(2):1863-1871.

[108]Luan FJ, Paris S, Shechtman E, et al., 2017. Deep photo style transfer. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.6997-7005.

[109]Lucic M, Kurach K, Michalski M, et al., 2018. Are GANs created equal? A large-scale study. Proc 32^nd Int Conf on Neural Information Processing Systems, p.698-707.

[110]Luo A, Zhang ZT, Wu JJ, et al., 2020. End-to-end optimization of scene layout. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.3753-3762.

[111]Lv Z, Li X, Li X, et al., 2021. Learning semantic person image generation by region-adaptive normalization. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.10801-10810.

[112]Ma KD, Fang YM, 2021. Image quality assessment in the modern age. Proc 29^th ACM Int Conf on Multimedia, p.5664-5666.

[113]Ma LQ, Jia X, Sun QR, et al., 2017. Pose guided person image generation. Proc 31^st Int Conf on Neural Information Processing Systems, p.406-416.

[114]Machajdik J, Hanbury A, 2010. Affective image classification using features inspired by psychology and art theory. Proc 18^th ACM Int Conf on Multimedia, p.83-92.

[115]Marrinan T, Papka ME, 2021. Real-time omnidirectional stereo rendering: generating 360° surround-view panoramic images for comfortable immersive viewing. IEEE Trans Vis Comput Graph, 27(5):2587-2596.

[116]Meng FQ, Shao WQ, Luo LX, et al., 2024. PhyBench: a physical commonsense benchmark for evaluating text-to-image models. https://arxiv.org/abs/2406.11802

[117]Miyake R, Matsukawa T, Suzuki E, 2024. Image generation from hyper scene graph with multiple types of trinomial hyperedges. SN Comput Sci, 5(5):624.

[118]Mondal AK, Tiwary P, Singla P, et al., 2023. Few-shot cross-domain image generation via inference-time latent-code learning. 11^th Int Conf on Learning Representations.

[119]Naeem MF, Oh SJ, Uh Y, et al., 2020. Reliable fidelity and diversity metrics for generative models. 37^th Int Conf on Machine Learning, p.7176-7185.

[120]Nazeri K, Ng E, Ebrahimi M, 2018. Image colorization using generative adversarial networks. 10^th Int Conf on Articulated Motion and Deformable Objects, p.85-94.

[121]Odena A, Olah C, Shlens J, 2017. Conditional image synthesis with auxiliary classifier GANs. Int Conf on Machine Learning, p.2642-2651.

[122]Oh C, Cho W, Chae Y, et al., 2022. BIPS: bi-modal indoor panorama synthesis via residual depth-aided adversarial learning. Proc 17^th European Conf on Computer Vision, p.352-371.

[123]Ojha U, Li YJ, Lu JW, et al., 2021. Few-shot image generation via cross-domain correspondence. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.10738-10747.

[124]Otani M, Togashi R, Sawai Y, et al., 2023. Toward verifiable and reproducible human evaluation for text-to-image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.14277-14286.

[125]Pang YX, Lin JX, Qin T, et al., 2022. Image-to-image translation: methods and applications. IEEE Trans Multim, 24:3859-3881.

[126]Park T, Liu MY, Wang TC, et al., 2019. Semantic image synthesis with spatially-adaptive normalization. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2332-2341.

[127]Parmar G, Zhang R, Zhu JY, 2022. On aliased resizing and surprising subtleties in GAN evaluation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.11410-11420.

[128]Petsiuk V, Siemenn AE, Surbehera S, et al., 2022. Human evaluation of text-to-image models on a multi-task benchmark. https://arxiv.org/abs/2211.12112

[129]Phaphuangwittayakul A, Guo Y, Ying FL, 2022. Fast adaptive meta-learning for few-shot image generation. IEEE Trans Multim, 24:2205-2217.

[130]Phung Q, Ge SW, Huang JB, 2024. Grounded text-to-image synthesis with attention refocusing. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7932-7942.

[131]Qiao TT, Zhang J, Xu DQ, et al., 2019a. Learn, imagine and create: text-to-image generation from prior knowledge. Proc 33^rd Int Conf on Neural Information Processing Systems, p.885-895.

[132]Qiao TT, Zhang J, Xu DQ, et al., 2019b. MirrorGAN: learning text-to-image generation by redescription. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1505-1514.

[133]Quan FN, Lang B, 2024. Boundary-aware GAN for multiple overlapping objects in layout-to-image generation. Multim Syst, 30(2):88.

[134]Quan WZ, Zhang RS, Zhang Y, et al., 2022. Image inpainting with local and global refinement. IEEE Trans Image Process, 31:2405-2420.

[135]Ramesh A, Dhariwal P, Nichol A, et al., 2022. Hierarchical text-conditional image generation with CLIP latents. https://arxiv.org/abs/2204.06125

[136]Ravuri S, Vinyals O, 2019. Classification accuracy score for conditional generative models. Proc 33^rd Int Conf on Neural Information Processing Systems, p.12268-12279.

[137]Regmi K, Borji A, 2018. Cross-view image synthesis using conditional GANs. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.3501-3510.

[138]Ren YR, Li G, Liu S, et al., 2020. Deep spatial transformation for pose-guided person image generation and animation. IEEE Trans Image Process, 29:8622-8635.

[139]Saharia C, Chan W, Saxena S, et al., 2022. Photorealistic text-to-image diffusion models with deep language understanding. Proc 36^th Int Conf on Neural Information Processing Systems, p.36479-36494.

[140]Sajjadi MSM, Bachem O, Lucic M, et al., 2018. Assessing generative models via precision and recall. Proc 32^nd Int Conf on Neural Information Processing Systems, p.5228-5237.

[141]Salimans T, Goodfellow I, Zaremba W, et al., 2016. Improved techniques for training GANs. Proc 30^th Int Conf on Neural Information Processing Systems, p.2234-2242.

[142]Saseendran A, Skubch K, Keuper M, 2021. Multi-class multi-instance count conditioned adversarial image generation. Proc IEEE/CVF Int Conf on Computer Vision, p.6742-6751.

[143]Sauer A, Schwarz K, Geiger A, 2022. StyleGAN-XL: scaling styleGAN to large diverse datasets. ACM SIGGRAPH, Article 49.

[144]Schroff F, Criminisi A, Zisserman A, 2011. Harvesting image databases from the web. IEEE Trans Patt Anal Mach Intell, 33(4):754-766.

[145]Shen GB, Wang LZ, Lin JT, et al., 2024. SG-Adapter: enhancing text-to-image generation with scene graph guidance. https://arxiv.org/abs/2405.15321

[146]Shen W, Liu RJ, 2017. Learning residual images for face attribute manipulation. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.1225-1233.

[147]Sheynin S, Polyak A, Singer U, et al., 2024. Emu Edit: precise image editing via recognition and generation tasks. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8871-8879.

[148]Shi HY, Wang L, Zheng NN, et al., 2022. Loss functions for pose guided person image generation. Patt Recogn, 122: 108351.

[149]Shibata K, Araki S, Maeda K, et al., 2014. High-quality panoramic image generation using multiple PAL images. Electr Commun Jpn, 97(6):58-66.

[150]Shocher A, Gandelsman Y, Mosseri I, et al., 2020. Semantic pyramid for image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7455-7464.

[151]Siarohin A, Sangineto E, Lathuilière S, et al., 2018. Deformable GANs for pose-based human image generation. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.3408-3416.

[152]Sun W, Wu TF, 2021. Learning layout and style reconfigurable GANs for controllable image synthesis. IEEE Trans Patt Anal Mach Intell, 44(9):5070-5087.

[153]Sushko V, Schönfeld E, Zhang D, et al., 2021. You only need adversarial supervision for semantic image synthesis. https://arxiv.org/abs/2012.04781

[154]Sylvain T, Zhang PC, Bengio Y, et al., 2021. Object-centric image generation from layouts. Proc AAAI Conf Artif Intell, 35(3):2647-2655.

[155]Tan ZT, Chai ML, Chen DD, et al., 2021. Diverse semantic image synthesis via probability distribution modeling. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7958-7967.

[156]Tang H, Xu D, Sebe N, et al., 2019. Multi-channel attention selection GAN with cascaded semantic guidance for cross-view image translation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2412-2421.

[157]Tang H, Xu D, Yan Y, et al., 2020a. Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7867-7876.

[158]Tang H, Bai S, Zhang L, et al., 2020b. XingGAN for person image generation. 16^th European Conf on Computer Vision, p.717-734.

[159]Tang H, Shao L, Torr PHS, et al., 2023. Local and global GANs with semantic-aware upsampling for image generation. IEEE Trans Patt Anal Mach Intell, 45(1):768-784.

[160]Tang JL, Yuan Y, Shao TJ, et al., 2021. Structure-aware person image generation with pose decomposition and semantic correlation. Proc AAAI Conf Artif Intell, 35(3):2656-2664.

[161]Tao X, Gao HY, Shen XY, et al., 2018. Scale-recurrent network for deep image deblurring. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.8174-8182.

[162]Torralba A, Efros AA, 2011. Unbiased look at dataset bias. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1521-1528.

[163]Tripathi S, Sridhar SN, Sundaresan S, et al., 2019a. Compact scene graphs for layout composition and patch retrieval. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops, p.676-683.

[164]Tripathi S, Bhiwandiwalla A, Bastidas A, et al., 2019b. Using scene graph context to improve image generation. https://arxiv.org/abs/1901.03762

[165]Volokitin A, Konukoglu E, van Gool L, 2020. Decomposing image generation into layout prediction and conditional synthesis. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops, p.1530-1538.

[166]Wang J, Chen ZW, Yuan CF, et al., 2023. Hierarchical curriculum learning for no-reference image quality assessment. Int J Comput Vis, 131(11):3074-3093.

[167]Wang JR, Duan HY, Liu J, et al., 2023. AIGCIQA2023: a large-scale image quality assessment database for AI generated images: from the perspectives of quality, authenticity and correspondence. CAAI Int Conf on Artificial Intelligence, p.46-57.

[168]Wang L, Chen W, Yang WJ, et al., 2020. A state-of-the-art review on image synthesis with generative adversarial networks. IEEE Access, 8:63514-63537.

[169]Wang S, Saharia C, Montgomery C, et al., 2023. Imagen Editor and EditBench: advancing and evaluating text-guided image inpainting. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.18359-18369.

[170]Wang Y, Tao X, Qi XJ, et al., 2018. Image inpainting via generative multi-column convolutional neural networks. Proc 32^nd Int Conf on Neural Information Processing Systems, p.331-340.

[171]Wang Y, He YN, Li YZ, et al., 2024. InternVid: a large-scale video-text dataset for multimodal understanding and generation. https://arxiv.org/abs/2307.06942

[172]Wang YH, Wang Q, Zhang DY, 2022. Few-shot generation by modeling stereoscopic priors. IEEE Int Conf on Acoustics, Speech and Signal Processing, p.2370-2374.

[173]Wang Z, Simoncelli EP, Bovik AC, 2003. Multiscale structural similarity for image quality assessment. 37^th Asilomar Conf on Signals, Systems & Computers, p.1398-1402.

[174]Wang Z, Bovik AC, Sheikh HR, et al., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process, 13(4):600-612.

[175]Wang ZH, Chen J, Hoi SCH, 2021. Deep learning for image super-resolution: a survey. IEEE Trans Patt Anal Mach Intell, 43(10):3365-3387.

[176]Wang ZJ, Qi XQ, Yuan K, et al., 2022. Self-supervised correlation mining network for person image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7693-7702.

[177]Wang ZM, Li YX, Huang DL, et al., 2023. DeformSg2im: scene graph based multi-instance image generation with a deformable geometric layout. Neurocomputing, 558: 126684.

[178]Wu JY, Li ZJ, Zhang SY, et al., 2022. Amodal layout completion in complex outdoor scenes. 2^nd CAAI Int Conf on Artificial Intelligence, p.30-41.

[179]Wu JY, Gan WS, Chen ZF, et al., 2023. AI-generated content (AIGC): a survey. https://arxiv.org/abs/2304.06632

[180]Wu SS, Tang H, Jing XY, et al., 2023. Cross-view panorama image synthesis. IEEE Trans Multim, 25:3546-3559.

[181]Wu XS, Sun KQ, Zhu F, et al., 2023a. Human preference score: better aligning text-to-image models with human preference. Proc IEEE/CVF Int Conf on Computer Vision, p.2096-2105.

[182]Wu XS, Hao YM, Sun KQ, et al., 2023b. Human preference score v2: a solid benchmark for evaluating human preferences of text-to-image synthesis. https://arxiv.org/abs/2306.09341

[183]Wu YZ, Wang XT, Li Y, et al., 2021. Towards vivid and diverse image colorization with generative color prior. Proc IEEE/CVF Int Conf on Computer Vision, p.14357-14366.

[184]Wu ZB, Deng HG, Wang Q, et al., 2023. SketchScene: scene sketch to image generation with diffusion models. IEEE Int Conf on Multimedia and Expo, p.2087-2092.

[185]Xia WH, Yang YJ, Xue JH, 2021a. Cali-sketch: stroke calibration and completion for high-quality face image generation from human-like sketches. Neurocomputing, 460:256-265.

[186]Xia WH, Yang YJ, Xue JH, et al., 2021b. TediGAN: text-guided diverse face image generation and manipulation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2256-2265.

[187]Xie Y, Fu YW, Tai Y, et al., 2022. Learning to memorize feature hallucination for one-shot image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9120-9129.

[188]Xu HR, Chen SY, Zhang Y, 2023. Magical Brush: a symbol-based modern Chinese painting system for novices. Proc CHI Conf on Human Factors in Computing Systems, Article 131.

[189]Xu JZ, Liu X, Wu YC, et al., 2023. ImageReward: learning and evaluating human preferences for text-to-image generation. Proc 37^th Int Conf on Neural Information Processing Systems, Article 700.

[190]Xu QQ, Huang QM, Yao Y, 2012. Online crowdsourcing subjective image quality assessment. Proc 20^th ACM Int Conf on Multimedia, p.359-368.

[191]Xu T, Zhang PC, Huang QY, et al., 2018. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1316-1324.

[192]Yan K, Ji L, Wu CF, et al., 2022. Trace controlled text to image generation. 17^th European Conf on Computer Vision, p.59-75.

[193]Yang H, Zhang RM, Guo XB, et al., 2020. Towards photo-realistic virtual try-on by adaptively generating ↔ preserving image content. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7847-7856.

[194]Yang L, Zhang ZL, Song Y, et al., 2023. Diffusion models: a comprehensive survey of methods and applications. ACM Comput Surv, 56(4):105.

[195]Yang S, Jiang LM, Liu ZW, et al., 2022. Unsupervised image-to-image translation with generative prior. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.18311-18320.

[196]Yang Y, Lin YQ, Liu H, et al., 2024. Position: towards implicit prompt for text-to-image models. 41^st Int Conf on Machine Learning, Article 2320.

[197]Yao YZ, Hua XS, Shen FM, et al., 2016. A domain robust approach for image dataset construction. Proc 24^th ACM Int Conf on Multimedia, p.212-216.

[198]Yu JH, Xu YZ, Koh JY, et al., 2022. Scaling autoregressive models for content-rich text-to-image generation. https://arxiv.org/abs/2206.10789

[199]Yu YC, Zhan FN, Lu SJ, et al., 2021. WaveFill: a wavelet-based generation network for image inpainting. Proc IEEE/CVF Int Conf on Computer Vision, p.14094-14103.

[200]Yuan XD, Tang ST, Li KJ, et al., 2024. CamFreeDiff: camera-free image to panorama generation with diffusion model. https://arxiv.org/abs/2407.07174

[201]Zhai GT, Min XK, 2020. Perceptual image quality assessment: a survey. Sci China Inform Sci, 63(11): 211301.

[202]Zhang C, Wu QY, Gambardella CC, et al., 2024. Taming stable diffusion for text to 360° panorama image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6347-6357.

[203]Zhang F, Tian SL, Huang ZQ, et al., 2024. Evaluation agent: efficient and promptable evaluation framework for visual generative models. https://arxiv.org/abs/2412.09645

[204]Zhang H, Koh JY, Baldridge J, et al., 2021. Cross-modal contrastive learning for text-to-image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.833-842.

[205]Zhang HG, Dai YC, Li HD, et al., 2019. Deep stacked hierarchical multi-patch network for image deblurring. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5971-5979.

[206]Zhang JM, Ma CX, Yang KL, et al., 2022. Transfer beyond the field of view: dense panoramic semantic segmentation via unsupervised domain adaptation. IEEE Trans Intell Transp Syst, 23(7):9478-9491.

[207]Zhang KH, Ren WQ, Luo WH, et al., 2022. Deep image deblurring: a survey. Int J Comput Vis, 130(9):2103-2130.

[208]Zhang PZ, Yang LX, Xie XH, et al., 2022. Lightweight texture correlation network for pose guided person image generation. IEEE Trans Circ Syst Video Technol, 32(7):4584-4598.

[209]Zhang R, Isola P, Efros AA, 2016. Colorful image colorization. 14^th European Conf on Computer, p.649-666.

[210]Zhang R, Isola P, Efros AA, et al., 2018. The unreasonable effectiveness of deep features as a perceptual metric. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.586-595.

[211]Zhang SX, Wang BH, Wu JQ, et al., 2024. Learning multi-dimensional human preference for text-to-image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8018-8027.

[212]Zhang XB, Zhai DH, Li TR, et al., 2023. Image inpainting based on deep learning: a review. Inform Fus, 90:74-94.

[213]Zhang YK, Meng CY, Li ZJ, et al., 2023. Learning object consistency and interaction in image generation from scene graphs. Proc 32^nd Int Joint Conf on Artificial Intelligence, p.1731-1739.

[214]Zhao B, Meng LL, Yin WD, et al., 2019. Image generation from layout. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8576-8585.

[215]Zhao K, Yuan K, Sun M, et al., 2023. Quality-aware pretrained models for blind image quality assessment. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.22302-22313.

[216]Zhao SY, Zhang L, Shen Y, et al., 2021. RefineDNet: a weakly supervised refinement framework for single image dehazing. IEEE Trans Image Process, 30:3391-3404.

[217]Zhao Y, Ren DY, Chen Y, et al., 2022. Cartoon image processing: a survey. Int J Comput Vis, 130(11):2733-2769.

[218]Zhao YQ, Ding HH, Huang HJ, et al., 2022. A closer look at few-shot image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9130-9140.

[219]Zhao YQ, Chandrasegaran K, Abdollahzadeh M, et al., 2023. AdAM: few-shot image generation via adaptation-aware kernel modulation. https://arxiv.org/abs/2307.01465

[220]Zheng BY, Gu JJ, Li SJ, et al., 2024. LM4LV: a frozen large language model for low-level vision tasks. https://arxiv.org/abs/2405.15734

[221]Zheng WD, Teng JY, Yang ZY, et al., 2024. CogView3: finer and faster text-to-image generation via relay diffusion. 18^th European Conf on Computer Vision, p.1-22.

[222]Zhou MQ, Wang YX, Hou J, et al., 2024. SceneX: procedural controllable large-scale scene generation. https://arxiv.org/abs/2403.15698

[223]Zhou S, Gordon ML, Krishna R, et al., 2019. HYPE: a benchmark for human eye perceptual evaluation of generative models. Proc 33^rd Int Conf on Neural Information Processing Systems, p.3449-3461.

[224]Zhou YF, Liu BC, Zhu YZ, et al., 2023. Shifted diffusion for text-to-image generation. Proc of the IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.10157-10166.

[225]Zhou YF, Zhang RY, Zheng KZ, et al., 2024. Toffee: efficient million-scale dataset construction for subject-driven text-to-image generation. https://arxiv.org/abs/2406.09305

[226]Zhu JY, Ma HM, Chen JS, et al., 2024. High-quality and diverse few-shot image generation via masked discrimination. IEEE Trans Image Process, 33:2950-2965.

[227]Zhu WH, Zhai GT, Hu MH, et al., 2018. Arrow’s impossibility theorem inspired subjective image quality assessment approach. Signal Process, 145:193-201.

[228]Zhu Z, Huang TT, Shi BG, et al., 2019. Progressive pose attention transfer for person image generation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2342-2351.

[229]Zhu Z, Xu ZL, You AS, et al., 2020. Semantically multi-modal image synthesis. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5466-5475.

Open peer comments: Debate/Discuss/Question/Opinion

<1>

- Go to

图像生成评估：人类评估与自动评估的全面综述

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference