On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-10-13
Wang QI, Huanghuang DENG, Taihao LI. Multistage guidance on the diffusion model inspired by human artists' creative thinking[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300313
Multistage guidance on the diffusion model inspired by human artists' creative thinking

1 Research Center for Cross-media Intelligence, Zhejiang Lab, Hangzhou 311500, China
2 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract: Current text-to-image research has reached a level comparable to that of ordinary painters, but there is still considerable room for improvement relative to artist-level painting. Artist-level paintings usually fuse the features of multiple images into a single image to express multilevel semantic information. We confirmed this in a pre-experiment and consulted three groups with different levels of art appreciation ability to identify what distinguishes artists' paintings from painters'. These opinions were then used to help an AI painting system advance from painter-level to artist-level image generation. Specifically, we propose a text-based multistage guidance method that requires no further pretraining and helps the diffusion model move toward multilevel semantic representation in the generated images. Both machine and human evaluations in our experiments verify the effectiveness of the proposed method. Moreover, unlike previous single-stage guidance methods, our method can control how strongly the features of each image appear in the painting by controlling the number of guidance steps assigned to each stage.
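The page gives no code, so the following is only a minimal, self-contained sketch of the idea as the abstract describes it: split the reverse-diffusion schedule into stages, condition each stage on a different prompt, and let the per-stage step budget control how strongly each prompt's imagery appears. All names here (`encode_prompt`, `denoise_step`, `multistage_sample`, the stage-budget list) are hypothetical placeholders, not the authors' API.

```python
import numpy as np

# Hypothetical stand-ins for a real text encoder and a pretrained
# denoising network; both are toy placeholders, not the paper's models.
def encode_prompt(prompt: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def denoise_step(x: np.ndarray, t: int, cond: np.ndarray) -> np.ndarray:
    # Toy update: nudge the sample toward the conditioning vector,
    # mimicking one text-guided reverse-diffusion step.
    return x + 0.05 * (cond - x)

def multistage_sample(prompts, steps_per_stage, dim: int = 64) -> np.ndarray:
    """Run the denoising loop in stages, one prompt per stage.

    Giving a stage more steps makes its prompt's imagery dominate the
    result -- the control knob the abstract describes.
    """
    assert len(prompts) == len(steps_per_stage)
    x = np.random.default_rng(0).standard_normal(dim)  # x_T: pure noise
    t = sum(steps_per_stage)
    for prompt, n_steps in zip(prompts, steps_per_stage):
        cond = encode_prompt(prompt, dim)
        for _ in range(n_steps):
            x = denoise_step(x, t, cond)
            t -= 1
    return x  # x_0: the final sample

# Example: early steps set the global composition ("a misty mountain"),
# later steps blend in a second imagery ("a lone crane").
sample = multistage_sample(["a misty mountain", "a lone crane"], [30, 20])
print(sample.shape)
```

In a real system `denoise_step` would be a (typically classifier-free-guided) U-Net prediction; the only point of the sketch is the stage-wise switching of the text condition during sampling.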