Abstract
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image content from the unmasked regions. Previous GAN inversion-based methods usually exploit well-trained GAN models as effective priors to generate realistic content for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked regions of the input and the output should be identical, resulting in a gap between GAN inversion and image inpainting that degrades performance. Moreover, existing GAN inversion approaches often consider only a single modality of the input image, neglecting auxiliary cues in images that could aid reconstruction. To address these problems, we propose a novel GAN inversion approach for image inpainting, dubbed MMInvertFill. MMInvertFill consists primarily of a multimodal guided encoder with pre-modulation and a GAN generator with an \( \mathcal {F} \& \mathcal {W}^+\) latent space. Specifically, the multimodal encoder enhances multi-scale structures with additional semantic segmentation, edge, and texture modalities through a gated mask-aware attention module. Afterwards, pre-modulation encodes these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the \( \mathcal {F} \& \mathcal {W}^+\) latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module that captures more diversified in-domain patterns, enabling high-fidelity textures even under massive corruptions.
In extensive experiments on six challenging datasets, namely CelebA-HQ, Places2, OST, CityScapes, MetFaces and Scenery, we show that MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images. Our project webpage, including code and results, will be available at https://yeates.github.io/mm-invertfill.
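The Soft-update Mean Latent module maintains a running mean of in-domain latent codes. The paper's abstract does not spell out the update rule, so the sketch below is a hypothetical illustration: it applies the generic "soft" (exponential moving average) update, familiar from soft target updates in deep RL, to a batch of style vectors. The shapes (14 style layers of width 512) and the rate `tau` are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def soft_update_mean_latent(mean_latent, batch_latents, tau=0.001):
    """Soft (EMA) update of the running mean latent toward the batch average.

    mean_latent:   (L, D) current running mean of style vectors
    batch_latents: (B, L, D) style vectors produced by the encoder for one batch
    tau:           small interpolation rate; larger tau tracks the batch faster
    """
    batch_mean = batch_latents.mean(axis=0)            # (L, D) average over the batch
    return (1.0 - tau) * mean_latent + tau * batch_mean

# usage: keep a running in-domain mean over W+ style vectors during training
mean_latent = np.zeros((14, 512))                      # assumed: 14 layers x 512 dims
batch = np.random.randn(8, 14, 512)                    # encoder latents for a batch of 8
mean_latent = soft_update_mean_latent(mean_latent, batch, tau=0.001)
```

With a small `tau`, the running mean changes slowly, which is consistent with the goal of capturing stable, diversified in-domain patterns rather than chasing any single batch.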
Data Availability
The data supporting the findings of this study are openly available, and our code will be released at https://yeates.github.io/mm-invertfill.
Acknowledgements
Libo Zhang is supported by the National Natural Science Foundation of China (No. 62476266). Heng Fan received no funding for this work at any stage.
Communicated by Maja Pantic.
Cite this article
Zhang, L., Yu, Y., Yao, J. et al. High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion. Int J Comput Vis 133, 5788–5805 (2025). https://doi.org/10.1007/s11263-025-02448-w
