High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

Published in: International Journal of Computer Vision

Abstract

Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image content from the unmasked regions. Previous GAN inversion-based methods typically use well-trained GAN models as effective priors to generate realistic content for the missing holes. Despite their strong results, they ignore a hard constraint: the unmasked regions of the input and the output should be identical. This creates a gap between GAN inversion and image inpainting and degrades performance. Moreover, existing GAN inversion approaches often consider only a single modality of the input image, neglecting auxiliary cues that could yield further improvements. To address these problems, we propose a novel GAN inversion approach for image inpainting, dubbed MMInvertFill. MMInvertFill consists primarily of a multimodal guided encoder with pre-modulation and a GAN generator with an \(\mathcal{F}\&\mathcal{W}^+\) latent space. Specifically, the multimodal encoder enhances multi-scale structures with additional semantic segmentation, edge, and texture modalities through a gated mask-aware attention module. A pre-modulation step then encodes these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the \(\mathcal{F}\&\mathcal{W}^+\) latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module that captures more diversified in-domain patterns for generating high-fidelity textures under massive corruption. In extensive experiments on six challenging datasets, including CelebA-HQ, Places2, OST, CityScapes, MetFaces and Scenery, MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images. Our project webpage, including code and results, will be available at https://yeates.github.io/mm-invertfill.
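The hard constraint noted above, that the output must match the input on every unmasked pixel, can be enforced exactly by compositing the generator's prediction with the known content. Below is a minimal sketch of this standard blending step (PyTorch; the function and variable names are illustrative, not taken from MMInvertFill):

    import torch

    def composite(generated: torch.Tensor,
                  original: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
        # generated, original: (B, 3, H, W) images in the same value range.
        # mask: (B, 1, H, W); 1 inside the missing hole, 0 on known pixels.
        # Known pixels are kept verbatim; only the hole is filled.
        return mask * generated + (1.0 - mask) * original

Plain GAN inversion satisfies this equality only approximately, which is the source of the color discrepancy and semantic inconsistency that the \(\mathcal{F}\&\mathcal{W}^+\) latent space is designed to remove.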

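The abstract does not detail the Soft-update Mean Latent module, but its name suggests an exponential-moving-average (soft) update of the mean latent code toward recently observed style vectors. The following is a speculative sketch under that assumption; the class, the rate tau, and the tensor shapes are hypothetical, not the paper's implementation:

    import torch

    class SoftUpdateMeanLatent:
        def __init__(self, dim: int, tau: float = 0.001):
            self.tau = tau                 # assumed soft-update rate
            self.w_bar = torch.zeros(dim)  # running mean latent code

        @torch.no_grad()
        def update(self, w_batch: torch.Tensor) -> torch.Tensor:
            # w_batch: (B, dim) style vectors from the encoder this step.
            batch_mean = w_batch.mean(dim=0)
            # Soft update: drift w_bar toward the current batch mean so it
            # reflects diverse in-domain patterns rather than a fixed mean.
            self.w_bar.mul_(1.0 - self.tau).add_(self.tau * batch_mean)
            return self.w_bar

A small tau keeps w_bar close to the long-run domain mean; a larger tau lets it track recent batches more aggressively.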


Data Availability

The data supporting the findings of this study are openly available, and our code will be released at https://yeates.github.io/mm-invertfill.


Acknowledgements

Libo Zhang is supported by the National Natural Science Foundation of China (No. 62476266). Heng Fan has not been supported by any funding for this work at any stage.

Author information


Corresponding author

Correspondence to Heng Fan.

Additional information

Communicated by Maja Pantic.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, L., Yu, Y., Yao, J. et al. High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion. Int J Comput Vis 133, 5788–5805 (2025). https://doi.org/10.1007/s11263-025-02448-w
