MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao | Yu Zhong | Ziqi Wang | Liang-Jian Deng

University of Electronic Science and Technology of China (UESTC)

paper:

Quick Introduction

Fig. 1: MMAIF framework overview.

Abstract: Image fusion, a fundamental low-level vision task, aims to integrate multiple source images into a single output while preserving as much information from the inputs as possible. However, existing methods face several significant limitations: 1) they require task- or dataset-specific models; 2) they neglect real-world image degradations (e.g., noise), which causes failures when processing degraded inputs; 3) they operate in pixel space, where attention mechanisms are computationally expensive; and 4) they lack user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; and 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which produces a clean fused image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two variants of the model: a regression-based variant and a flow matching-based variant. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines.
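Limitation 3) is easiest to see with a quick token count. The sketch below assumes a 512x512 input, a DiT patch size of 2, and an 8x spatial downsampling autoencoder for the latent case; these numbers are illustrative assumptions, not MMAIF's reported configuration.

```python
# Back-of-the-envelope token counts for self-attention (illustrative assumptions only).

def num_tokens(height: int, width: int, downsample: int, patch: int) -> int:
    """Tokens seen by a transformer after (optional) VAE downsampling and patchification."""
    stride = downsample * patch
    return (height // stride) * (width // stride)


def attention_pairs(tokens: int) -> int:
    """Full self-attention scores every token against every token, i.e. N^2 pairs."""
    return tokens * tokens


H, W = 512, 512
PATCH = 2        # assumed DiT patch size (hypothetical)
VAE_FACTOR = 8   # assumed spatial downsampling of the latent autoencoder (hypothetical)

pixel_tokens = num_tokens(H, W, downsample=1, patch=PATCH)
latent_tokens = num_tokens(H, W, downsample=VAE_FACTOR, patch=PATCH)
ratio = attention_pairs(pixel_tokens) // attention_pairs(latent_tokens)

print(f"pixel space:  {pixel_tokens} tokens")
print(f"latent space: {latent_tokens} tokens")
print(f"attention cost ratio: {ratio}x")
```

Under these assumptions, pixel space yields 65,536 tokens against 1,024 in latent space, so each full self-attention layer computes 4,096 times more attention scores. Conditioning on several source images could add further tokens in both settings.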

We provide an efficient data synthesis pipeline to generate degraded/clean image pairs.

Fig. 2: Data synthesis pipeline overview.

Using this pipeline, we generate around 10k image pairs for training. They cover various degradations, including rain, snow, haze, motion blur, JPEG compression, and Gaussian noise, as well as common image fusion tasks, including multi-exposure, multi-focus, and visible-infrared image fusion.
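The README does not detail the pipeline's operators, so the following is a minimal NumPy sketch of the idea under stated assumptions: it implements only three of the listed degradations (Gaussian noise, haze via the atmospheric scattering model, and a horizontal box-filter motion blur), degrades one randomly chosen source image, and writes a templated text prompt. The function names, severity ranges, and prompt template are hypothetical.

```python
import numpy as np

# Hypothetical degradation operators on 2-D images with values in [0, 1].

def add_gaussian_noise(img, sigma, rng):
    """Additive white Gaussian noise with standard deviation sigma, clipped to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def add_haze(img, transmission, airlight=1.0):
    """Atmospheric scattering model I = J * t + A * (1 - t) with a uniform transmission t."""
    return img * transmission + airlight * (1.0 - transmission)


def horizontal_motion_blur(img, length):
    """Box-filter each row of a 2-D image (zero padding darkens the borders slightly)."""
    kernel = np.ones(length) / length
    return np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, img)


# Each entry samples its own (hypothetical) severity from the random generator.
DEGRADATIONS = {
    "noise": lambda img, rng: add_gaussian_noise(img, rng.uniform(0.02, 0.1), rng),
    "haze": lambda img, rng: add_haze(img, rng.uniform(0.4, 0.8)),
    "motion blur": lambda img, rng: horizontal_motion_blur(img, int(rng.integers(3, 10))),
}


def synthesize_sample(sources, task, rng):
    """Degrade one randomly chosen source image and write a matching text prompt.

    The clean `sources` are left untouched so they can serve as the reference.
    """
    name = str(rng.choice(sorted(DEGRADATIONS)))
    idx = int(rng.integers(len(sources)))
    degraded = [s.copy() for s in sources]
    degraded[idx] = DEGRADATIONS[name](sources[idx], rng)
    prompt = f"Fuse the {task} images and remove the {name} from image {idx + 1}."
    return degraded, prompt


rng = np.random.default_rng(0)
visible = rng.random((64, 64))
infrared = rng.random((64, 64))
degraded, prompt = synthesize_sample([visible, infrared], "visible-infrared", rng)
print(prompt)
```

Other degradations from the list above (rain, snow, JPEG compression) could be added as further entries in `DEGRADATIONS`.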

We train two versions of the model:

Regression-based model
Flow matching model

Both use a mixture-of-experts (MoE) architecture and operate in latent space, which makes them fast and efficient.
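The README does not give the training objectives of the two variants. As a hedged sketch, assuming the flow matching model follows the common rectified-flow formulation with straight paths x_t = (1 - t) x0 + t x1, the two objectives and an Euler sampler can be written as follows. `predict` and `velocity` are hypothetical stand-ins for the DiT; in MMAIF they would act on latents rather than raw arrays.

```python
import numpy as np

# `predict` and `velocity` are hypothetical stand-ins for the DiT; `target` is the clean fused latent.

def regression_loss(predict, cond, target):
    """Regression variant: map the conditions straight to the fused latent in one pass."""
    return float(np.mean((predict(cond) - target) ** 2))


def flow_matching_loss(velocity, cond, target, rng):
    """Flow matching (rectified-flow form): regress the velocity of a straight noise-to-target path."""
    x0 = rng.standard_normal(target.shape)   # noise sample at t = 0
    t = rng.uniform()                         # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * target          # point on the straight path
    return float(np.mean((velocity(xt, t, cond) - (target - x0)) ** 2))


def sample(velocity, cond, shape, steps, rng):
    """Euler integration of dx/dt = velocity(x, t, cond) from noise (t = 0) to t = 1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt, cond)
    return x


def oracle_velocity(x, t, cond):
    """Toy oracle: if `cond` already holds the target, the exact straight-path velocity
    at position x and time t < 1 is (target - x) / (1 - t)."""
    return (cond - x) / (1.0 - t)


rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8, 8))      # e.g. a 4-channel 8x8 latent

print(regression_loss(lambda cond: cond, target, target))
print(flow_matching_loss(oracle_velocity, target, target, rng))
print(np.abs(sample(oracle_velocity, target, target.shape, steps=4, rng=rng) - target).max())
```

The design trade-off shows up in the sampler: the regression variant produces its output in a single forward pass, while the flow matching variant calls the network once per Euler step (`steps` times).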

A single model handles multiple image fusion tasks and different degradations.
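The README says both variants use an MoE architecture but does not describe it. One common design, shown here only as an assumption, is a top-k gated MoE layer: a learned gate scores every expert per token, and only the k best experts run. Intuitively, this could let different experts specialize (for example, on different tasks or degradations) while one set of weights serves every task. All names below are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def moe_layer(tokens, gate_w, experts, k=2):
    """Top-k gated mixture of experts.

    tokens: (num_tokens, dim); gate_w: (dim, num_experts); experts: callables (dim,) -> (dim,).
    Each token is sent only to its k highest-scoring experts, whose outputs are mixed
    with softmax weights renormalized over those k experts.
    """
    logits = tokens @ gate_w                          # (num_tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k largest logits
    out = np.zeros_like(tokens, dtype=float)
    for i, tok in enumerate(tokens):
        weights = softmax(logits[i, topk[i]])
        for w, e in zip(weights, topk[i]):
            out[i] += w * experts[e](tok)
    return out


rng = np.random.default_rng(0)
dim, num_experts = 16, 4
tokens = rng.standard_normal((10, dim))
gate_w = rng.standard_normal((dim, num_experts))
# Hypothetical experts: independent linear maps (a real MoE block would typically use small MLPs).
expert_mats = rng.standard_normal((num_experts, dim, dim)) / np.sqrt(dim)
experts = [lambda x, m=m: x @ m for m in expert_mats]

print(moe_layer(tokens, gate_w, experts, k=2).shape)
```

Because only k experts run per token, compute per token stays roughly constant as more experts are added.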

Fig. 3: Model comparison overview.

Below are visual comparisons between our model and restoration+fusion methods.

Fig. 4: Visual comparisons.


Code will be released soon. Stay tuned for updates.
