Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2025 Vol.26 No.8 P.1428-1440
Few-shot exemplar-driven inpainting with parameter-efficient diffusion fine-tuning
Abstract: Text-to-image diffusion models have demonstrated impressive capabilities in image generation and have been effectively applied to image inpainting. While text prompts provide intuitive guidance for conditional inpainting, users often want to inpaint a specific object with a customized appearance by providing an exemplar image. Unfortunately, existing methods struggle to achieve high fidelity in exemplar-driven inpainting. To address this, we use a plug-and-play low-rank adaptation (LoRA) module built on a pretrained text-driven inpainting model. The LoRA module is dedicated to learning exemplar-specific concepts through few-shot fine-tuning, which improves its ability to fit customized exemplar images without intensive training on large-scale datasets. Additionally, we introduce GPT-4V prompting and prior noise initialization techniques to further improve the fidelity of the inpainting results. In brief, the denoising diffusion process starts from noise derived from a composite exemplar–background image and is subsequently guided by an expressive prompt generated from the exemplar using the GPT-4V model. Extensive experiments demonstrate that our method achieves state-of-the-art performance, both qualitatively and quantitatively, offering users an exemplar-driven inpainting tool with enhanced customization capability.
Key words: Diffusion model; Image inpainting; Exemplar-driven; Few-shot fine-tuning
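The two mechanisms named in the abstract, low-rank adaptation and noise initialization from a composite image, can be illustrated with their textbook forms. This is a hedged sketch rather than the paper's own formulation: the first line is the standard LoRA update, and the second assumes that "noise derived from a composite exemplar–background image" means applying the usual forward diffusion process to that composite. The symbols W_0, A, B, r, s, x_c, T, and \bar{\alpha}_T are illustrative and do not appear in the abstract.

```latex
% Standard LoRA: the pretrained weight W_0 stays frozen; only A and B are trained.
% s is a scalar scaling factor (assumed notation).
W = W_0 + s\, B A, \qquad
B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)

% Assumed prior noise initialization: forward-diffuse the composite image x_c to step T,
% then start denoising from x_T instead of from pure Gaussian noise.
x_T = \sqrt{\bar{\alpha}_T}\, x_c + \sqrt{1 - \bar{\alpha}_T}\, \epsilon, \qquad
\epsilon \sim \mathcal{N}(0, I)
```

Under these assumptions, the trainable parameter count per adapted layer drops from d·k to r·(d + k), which is consistent with the abstract's point that few-shot fine-tuning suffices, and the starting latent x_T retains coarse structure from the exemplar rather than being uninformative.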
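1 Key Laboratory of Opto-Electronics Information Technology, Ministry of Education, School of Precision Instrument and Opto-Electronics Engineering, Tianjin University, Tianjin 300072, China
2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China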
DOI: 10.1631/FITEE.2400395
CLC number: TP183
On-line Access: 2025-06-04
Received: 2024-03-14
Revision Accepted: 2024-10-25
Crosschecked: 2025-09-04