CLC number: TP183
On-line Access: 2025-06-04
Received: 2024-03-14
Revision Accepted: 2024-10-25
Crosschecked: 2025-09-04
Shiyuan YANG, Zheng GU, Wenyue HAO, Yi WANG, Huaiyu CAI, Xiaodong CHEN. Few-shot exemplar-driven inpainting with parameter-efficient diffusion fine-tuning[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(8): 1428-1440.
Abstract: Text-to-image diffusion models have demonstrated impressive capabilities in image generation and have been effectively applied to image inpainting. While text prompts provide intuitive guidance for conditional inpainting, users often want to inpaint a specific object with a customized appearance by providing an exemplar image. Unfortunately, existing methods struggle to achieve high fidelity in exemplar-driven inpainting. To address this, we attach a plug-and-play low-rank adaptation (LoRA) module to a pretrained text-driven inpainting model. The LoRA module is dedicated to learning exemplar-specific concepts through few-shot fine-tuning, improving the model's ability to fit customized exemplar images without intensive training on large-scale datasets. Additionally, we introduce GPT-4V prompting and prior noise initialization techniques to further improve the fidelity of the inpainting results. In brief, the denoising diffusion process starts from noise derived from a composite exemplar–background image and is then guided by an expressive prompt generated from the exemplar with the GPT-4V model. Extensive experiments demonstrate that our method achieves state-of-the-art performance, both qualitatively and quantitatively, offering users an exemplar-driven inpainting tool with enhanced customization capability.
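As a rough illustration of the pipeline summarized in the abstract, the sketch below combines a LoRA-adapted text-driven inpainting model, an exemplar prompt of the kind GPT-4V could produce, and a prior noise initialization derived from a composite exemplar–background image. This is a minimal sketch, not the authors' released code: it assumes a recent Hugging Face diffusers release in which StableDiffusionInpaintPipeline supports load_lora_weights and a strength argument, and the checkpoint name, LoRA path, image files, and prompt string are placeholders.

```python
# Minimal sketch (not the authors' released code): exemplar-driven inpainting with
# a LoRA-adapted text-driven inpainting model, a GPT-4V-style exemplar prompt, and
# a prior noise initialization derived from a composite exemplar-background image.
# Assumes a recent Hugging Face `diffusers` release; paths and prompt are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Pretrained text-driven inpainting backbone (kept frozen) ...
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# 2) ... plus a plug-and-play LoRA fine-tuned for a few steps on the exemplar
#    (hypothetical local path to the few-shot LoRA weights).
pipe.load_lora_weights("./lora_exemplar")

# Background image, binary mask of the region to fill, and the exemplar image.
background = Image.open("background.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))
exemplar = Image.open("exemplar.png").convert("RGB").resize((512, 512))

# Naive composite: paste the exemplar into the masked region of the background
# (alignment and scaling of the exemplar are glossed over in this sketch).
composite = Image.composite(exemplar, background, mask)

# 3) Expressive prompt describing the exemplar, e.g. obtained by querying GPT-4V
#    with the exemplar image (placeholder text here).
prompt = "a fluffy brown corgi with a red collar, sitting on grass"

# 4) Prior noise initialization: start denoising from a noised version of the
#    composite rather than from pure Gaussian noise. With strength < 1 the
#    pipeline adds noise to the input-image latents at an intermediate timestep
#    (SDEdit-style), so the composite acts as the prior.
result = pipe(
    prompt=prompt,
    image=composite,      # composite exemplar-background image as the prior
    mask_image=mask,      # only the masked region is regenerated
    strength=0.9,         # < 1.0 retains part of the prior signal
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

result.save("inpainted.png")
```

How the few-shot LoRA is actually trained on the exemplar and how the GPT-4V prompt is constructed are the core contributions of the paper and are not reproduced in this sketch.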