On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-10-13
Wang QI, Huanghuang DENG, Taihao LI. Multistage guidance on the diffusion model inspired by human artists’ creative thinking[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(1): 170-178.
Abstract: Current text-conditional image generation performs on par with ordinary painters, but still leaves much room for improvement compared with artist-level paintings, which typically express multilevel semantics by gathering the features of multiple objects into one object. After confirming this gap in a preliminary experiment, we collect the opinions of three groups of individuals with varying levels of art-appreciation ability to determine what distinguishes artists from painters, and we use these opinions to advance an artificial intelligence (AI) painting system from painter-level toward artist-level image generation. Specifically, we propose a multistage text-conditioned approach, requiring no further pretraining, that helps the diffusion model (DM) achieve multilevel semantic representation in a generated image. Both machine and manual evaluations in the main experiment verify the effectiveness of our approach. In addition, unlike previous one-stage guidance, our method can control the extent to which an object's features are represented in a painting by controlling the guiding steps between the different stages.
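The page includes no code, but the core mechanism the abstract describes, switching the text condition at chosen points in the denoising trajectory, can be sketched. The Python sketch below is an illustration of that stage-switching idea, not the authors' implementation: encode_prompt and predict_noise are toy stand-ins for a real text encoder and diffusion UNet, and the update rule is a placeholder for a proper DDPM/DDIM step.

    import torch

    def encode_prompt(prompt: str, dim: int = 16) -> torch.Tensor:
        # Deterministic toy "text encoder" (NOT the paper's encoder).
        g = torch.Generator().manual_seed(sum(map(ord, prompt)) % (2**31))
        return torch.randn(dim, generator=g)

    def predict_noise(x: torch.Tensor, t: int, cond: torch.Tensor) -> torch.Tensor:
        # Placeholder for the diffusion model's noise prediction
        # (in practice, a UNet call conditioned on the text embedding).
        return 0.1 * x + 0.01 * cond.mean()

    def multistage_sample(prompts, boundaries, num_steps=50, guidance=7.5):
        """Multistage guidance sketch: the conditioning prompt is swapped at
        the given step boundaries, so different denoising stages are guided
        by different prompts. Where the boundaries sit controls how strongly
        each prompt's object features appear in the final image."""
        assert len(boundaries) == len(prompts) - 1
        conds = [encode_prompt(p) for p in prompts]
        uncond = encode_prompt("")               # for classifier-free guidance
        x = torch.randn(16)                      # start from pure noise
        stage = 0
        for step in range(num_steps):
            if stage < len(boundaries) and step >= boundaries[stage]:
                stage += 1                       # hand over to the next prompt
            eps_c = predict_noise(x, step, conds[stage])
            eps_u = predict_noise(x, step, uncond)
            eps = eps_u + guidance * (eps_c - eps_u)   # classifier-free guidance
            x = x - eps / num_steps              # placeholder update rule
        return x

    # E.g., guide the first 20 of 50 steps with a scene prompt, the rest
    # with an object prompt; shifting the boundary shifts the blend.
    img = multistage_sample(["a misty mountain forest",
                             "a fox made of autumn leaves"], [20])

Moving a boundary earlier gives the later prompt more denoising steps, and hence a stronger imprint on the result; this is the controllability the abstract claims over one-stage guidance.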