Abstract
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image content from the unmasked regions. Previous GAN inversion-based methods usually exploit well-trained GAN models as effective priors to generate realistic content for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked regions of the input and the output should be identical, resulting in a gap between GAN inversion and image inpainting that degrades performance. Moreover, existing GAN inversion approaches often consider only a single modality of the input image, neglecting auxiliary cues in images that could aid reconstruction. To address these problems, we propose a novel GAN inversion approach for image inpainting, dubbed MMInvertFill. MMInvertFill consists primarily of a multimodal guided encoder with pre-modulation and a GAN generator with an \( \mathcal {F} \& \mathcal {W}^+\) latent space. Specifically, the multimodal encoder enhances multi-scale structures with additional semantic segmentation, edge, and texture modalities through a gated mask-aware attention module. Afterwards, pre-modulation encodes these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the \( \mathcal {F} \& \mathcal {W}^+\) latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module that captures more diversified in-domain patterns, enabling high-fidelity textures even under massive corruptions.
In extensive experiments on six challenging datasets, namely CelebA-HQ, Places2, OST, CityScapes, MetFaces and Scenery, we show that MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images. Our project webpage, including code and results, will be available at https://yeates.github.io/mm-invertfill.
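The Soft-update Mean Latent module maintains a running mean of in-domain latent codes. The paper's abstract does not spell out the update rule, so the sketch below is a hypothetical illustration: it applies the generic "soft" (exponential moving average) update, familiar from soft target updates in deep RL, to a batch of style vectors. The shapes (14 style layers of width 512) and the rate `tau` are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def soft_update_mean_latent(mean_latent, batch_latents, tau=0.001):
    """Soft (EMA) update of the running mean latent toward the batch average.

    mean_latent:   (L, D) current running mean of style vectors
    batch_latents: (B, L, D) style vectors produced by the encoder for one batch
    tau:           small interpolation rate; larger tau tracks the batch faster
    """
    batch_mean = batch_latents.mean(axis=0)            # (L, D) average over the batch
    return (1.0 - tau) * mean_latent + tau * batch_mean

# usage: keep a running in-domain mean over W+ style vectors during training
mean_latent = np.zeros((14, 512))                      # assumed: 14 layers x 512 dims
batch = np.random.randn(8, 14, 512)                    # encoder latents for a batch of 8
mean_latent = soft_update_mean_latent(mean_latent, batch, tau=0.001)
```

With a small `tau`, the running mean changes slowly, which is consistent with the goal of capturing stable, diversified in-domain patterns rather than chasing any single batch.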
Data Availability
The data supporting the findings of this study are openly available, and our code will be released at https://yeates.github.io/mm-invertfill.
Acknowledgements
Libo Zhang is supported by the National Natural Science Foundation of China (No. 62476266). Heng Fan received no funding for this work at any stage.
Communicated by Maja Pantic.
Cite this article
Zhang, L., Yu, Y., Yao, J. et al. High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion. Int J Comput Vis 133, 5788–5805 (2025). https://doi.org/10.1007/s11263-025-02448-w
