Image fusion aims to blend complementary information from diverse modalities, yet most current methods lack robustness in complex fusion scenarios and cannot flexibly accommodate user intent. We present DiTFuse, the first Diffusion-Transformer (DiT) framework for instruction-driven, dynamic fusion control. Guided by natural-language commands, DiTFuse flexibly blends multimodal content to match diverse preferences and scenarios. Training employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ideal reference images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared–visible, multi-focus, and multi-exposure fusion—as well as text-controlled refinement and downstream tasks—within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.
We propose a DiT-based framework with a parallel input structure that jointly fuses text and visual features for multi-modal fusion (IVIF, MFF, and MEF). This design yields a robust backbone that reduces modality redundancy both before and after fusion; a minimal sketch of the parallel-input idea follows the contributions below.
Our self-supervised framework combines three key elements: M3-based noisy pair generation for realistic priors, mean IVIF fusion to bridge modality gaps, and multi-prompt text-conditioned data for global control. This enables scalable multimodal alignment without ground-truth labels.
DiTFuse is the first end-to-end framework that directly controls fusion results via natural-language instructions by aligning text and multi-modal features in the latent space. It supports fine-grained visual control, instruction-following segmentation, and strong zero-shot generalization to related tasks.
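As a rough illustration of the parallel input structure above, the sketch below lets text and visual tokens attend jointly inside a single transformer block, so fusion is conditioned on language at every layer. All names, dimensions, and the block layout are illustrative assumptions rather than the released DiTFuse architecture (which also carries timestep conditioning, omitted here).

```python
import torch
import torch.nn as nn

class JointDiTBlock(nn.Module):
    """Hypothetical block: text and visual tokens share one attention pass."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis_tokens, txt_tokens):
        # Concatenate both streams so self-attention mixes language and image features.
        x = torch.cat([vis_tokens, txt_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Split back into the two parallel streams for the next block.
        n = vis_tokens.shape[1]
        return x[:, :n], x[:, n:]
```

For example, `JointDiTBlock()(torch.randn(2, 256, 512), torch.randn(2, 16, 512))` processes 256 latent-image tokens alongside 16 instruction tokens in one pass.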
Training & Inference Pipeline. Textual control information is encoded by the Text Tokenizer, while image data is mapped into visual embeddings via VAE encoders. These conditional signals guide the DiT backbone during the denoising process. The left half illustrates the training stage with M3-based constraints; the right half shows inference, where the unified framework supports multiple fusion tasks and downstream applications.
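A minimal sketch of one training step as the caption describes it, assuming a standard noise-prediction diffusion objective: sources are VAE-encoded, a noisy latent is formed at a random timestep, and the DiT denoises it under text conditioning. `vae`, `text_encoder`, and `dit` are placeholder modules with assumed signatures, and the linear schedule is a simplification; the mean of the two source latents stands in for the missing reference, echoing the mean-fusion bridge in the contributions.

```python
import torch
import torch.nn.functional as F

def training_step(vae, text_encoder, dit, img_a, img_b, instruction_ids):
    with torch.no_grad():
        z_a = vae.encode(img_a)              # visual embeddings, one per source
        z_b = vae.encode(img_b)
        txt = text_encoder(instruction_ids)  # encoded natural-language command

    z0 = 0.5 * (z_a + z_b)                   # pseudo-target: mean-fusion bridge
    t = torch.randint(0, 1000, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    alpha = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1)  # toy linear schedule
    z_t = alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * noise

    # The DiT predicts the noise, conditioned on both sources and the text.
    pred = dit(z_t, t, cond=torch.cat([z_a, z_b], dim=1), text=txt)
    return F.mse_loss(pred, noise)
```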
Data Construction Pipeline. The upper part shows the Multi-degradation Mask Image Modeling (M3) process, which creates noisy pairs through mixed degradations such as masking, noise, and blur. The lower part builds a Ground Truth Pool by adjusting contrast and illumination or overlaying transparent masks for segmentation targets, enabling self-supervised learning without ideal reference images.
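A minimal sketch of the M3 degradation mix described in the caption, assuming random patch masking plus Gaussian noise plus a light depthwise blur; the patch size, noise level, and kernel are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def m3_degrade(img, patch=16, mask_ratio=0.3, sigma=0.05):
    """img: (B, C, H, W) in [0, 1], with H and W divisible by `patch`."""
    b, c, h, w = img.shape
    out = img.clone()

    # 1) Random patch masking, as in masked image modeling.
    grid = torch.rand(b, 1, h // patch, w // patch, device=img.device) < mask_ratio
    mask = grid.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    out = out.masked_fill(mask, 0.0)

    # 2) Additive Gaussian noise.
    out = out + sigma * torch.randn_like(out)

    # 3) Light blur via a 3x3 depthwise box filter.
    kernel = torch.full((c, 1, 3, 3), 1.0 / 9.0, device=img.device)
    out = F.conv2d(out, kernel, padding=1, groups=c)

    return out.clamp(0.0, 1.0), img  # (degraded input, self-supervised target)
```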
We report quantitative comparisons on infrared–visible fusion (IVIF), multi-focus fusion (MFF), multi-exposure fusion (MEF), and multi-modal segmentation. Arrows in the column headers indicate whether higher (↑) or lower (↓) values are better.
Table I. Quantitative comparison on the MSRS, M3FD, and TNO datasets.
| Method | MSRS MSE↓ | MSRS PSNR↑ | MSRS MANIQA↑ | MSRS LIQE↑ | MSRS CLIP-IQA↑ | M3FD MSE↓ | M3FD PSNR↑ | M3FD MANIQA↑ | M3FD LIQE↑ | M3FD CLIP-IQA↑ | TNO MSE↓ | TNO PSNR↑ | TNO MANIQA↑ | TNO LIQE↑ | TNO CLIP-IQA↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SwinFusion | 0.038 | 64.52 | 0.138 | 1.108 | 0.312 | 0.059 | 61.37 | 0.284 | 1.704 | 0.481 | 0.059 | 61.36 | 0.164 | 1.010 | 0.229 |
| SeAFusion | 0.038 | 64.33 | 0.144 | 1.138 | 0.355 | 0.060 | 61.12 | 0.288 | 1.641 | 0.466 | 0.058 | 61.63 | 0.187 | 1.013 | 0.267 |
| PMGI | 0.066 | 60.32 | 0.142 | 1.030 | 0.244 | 0.038 | 62.91 | 0.277 | 1.359 | 0.420 | 0.044 | 62.23 | 0.162 | 1.013 | 0.212 |
| DDBFusion | 0.021 | 66.93 | 0.138 | 1.102 | 0.285 | 0.032 | 63.81 | 0.274 | 1.445 | 0.440 | 0.039 | 62.97 | 0.199 | 1.019 | 0.244 |
| DDFM | 0.022 | 66.60 | 0.142 | 1.053 | 0.296 | 0.033 | 63.58 | 0.296 | 1.553 | 0.452 | 0.045 | 62.21 | 0.187 | 1.019 | 0.253 |
| DeFusion | 0.026 | 66.08 | 0.132 | 1.042 | 0.318 | 0.036 | 63.52 | 0.276 | 1.425 | 0.433 | 0.040 | 63.40 | 0.185 | 1.015 | 0.253 |
| U2Fusion | 0.022 | 66.46 | 0.154 | 1.096 | 0.327 | 0.033 | 63.61 | 0.282 | 1.423 | 0.506 | 0.038 | 63.08 | 0.201 | 1.014 | 0.256 |
| Text-DiFuse | 0.092 | 58.62 | 0.131 | 0.984 | 0.284 | 0.058 | 60.80 | 0.275 | 1.478 | 0.423 | 0.056 | 61.19 | 0.197 | 1.027 | 0.270 |
| Text-IF | 0.039 | 64.10 | 0.140 | 1.107 | 0.362 | 0.051 | 62.10 | 0.286 | 1.661 | 0.457 | 0.051 | 62.02 | 0.190 | 1.023 | 0.281 |
| DiTFuse | 0.021 | 66.63 | 0.162 | 1.240 | 0.392 | 0.032 | 63.81 | 0.299 | 1.718 | 0.498 | 0.036 | 63.50 | 0.209 | 1.019 | 0.297 |
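For reference, a sketch of how the fidelity metrics above could be computed: fusion has no single ground truth, so MSE/PSNR are often averaged over both source images. The averaging protocol and peak value here are our assumptions (MANIQA, LIQE, and CLIP-IQA are learned no-reference models and are not reproduced here).

```python
import numpy as np

def fusion_mse_psnr(fused, src_a, src_b, peak=255.0):
    # Average squared error against both sources, then convert to PSNR (dB).
    mse = 0.5 * (np.mean((fused - src_a) ** 2) + np.mean((fused - src_b) ** 2))
    psnr = 10.0 * np.log10(peak ** 2 / mse)
    return mse, psnr
```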
Table II. Quantitative comparison on the MFIF, RealMFF, and SICE datasets. MFIF and RealMFF are multi-focus datasets, while SICE is a multi-exposure dataset.
| Method | MFIF SF↑ | MFIF AG↑ | MFIF LIQE↑ | MFIF MUSIQ↑ | MFIF CLIP-IQA↑ | RealMFF SF↑ | RealMFF AG↑ | RealMFF LIQE↑ | RealMFF MUSIQ↑ | RealMFF CLIP-IQA↑ | SICE EN↑ | SICE SD↑ | SICE LIQE↑ | SICE MUSIQ↑ | SICE CLIP-IQA↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZMMF | 16.77 | 5.863 | 2.476 | 49.32 | 0.483 | 14.86 | 5.703 | 3.037 | 53.23 | 0.566 | 7.286 | 0.268 | 3.774 | 69.30 | 0.667 |
| PMGI | 11.36 | 4.226 | 2.195 | 53.63 | 0.581 | 15.43 | 5.884 | 2.610 | 52.55 | 0.511 | 6.677 | 0.156 | 2.592 | 70.72 | 0.599 |
| SwinFusion | 20.16 | 6.941 | 2.994 | 58.53 | 0.628 | 18.32 | 6.577 | 2.943 | 55.72 | 0.543 | 7.133 | 0.271 | 3.571 | 67.50 | 0.665 |
| U2Fusion | 22.11 | 8.060 | 3.119 | 62.14 | 0.658 | 16.64 | 6.370 | 2.021 | 53.40 | 0.501 | 7.328 | 0.262 | 3.346 | 67.67 | 0.673 |
| DDBFusion | 15.51 | 5.374 | 2.795 | 57.33 | 0.623 | 14.19 | 5.302 | 2.683 | 53.31 | 0.534 | 7.453 | 0.268 | 3.684 | 69.30 | 0.686 |
| DeFusion | 12.60 | 4.679 | 2.856 | 58.95 | 0.619 | 12.41 | 4.700 | 2.421 | 51.81 | 0.517 | 7.315 | 0.240 | 3.567 | 68.97 | 0.650 |
| DiTFuse | 23.81 | 8.260 | 3.891 | 68.42 | 0.668 | 18.26 | 6.634 | 3.408 | 58.46 | 0.572 | 7.532 | 0.274 | 4.005 | 70.32 | 0.693 |
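The statistics in Table II have standard closed-form definitions; below are common reference implementations (normalizations and boundary handling may differ slightly from the paper's evaluation code; MUSIQ, LIQE, and CLIP-IQA are learned models and are omitted).

```python
import numpy as np

def spatial_frequency(img):   # SF: row/column gradient energy
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

def average_gradient(img):    # AG: mean local gradient magnitude
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def entropy(img, bins=256):   # EN: histogram entropy in bits
    hist, _ = np.histogram(img, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def standard_deviation(img):  # SD: global contrast
    return float(np.std(img))
```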
Table III. Quantitative comparison of DiTFuse and LISA on multi-modal segmentation.
| Class | DiTFuse | LISA (Fusion) | LISA (VIS) | LISA (IR) |
|---|---|---|---|---|
| Building | 0.5271 | 0.5598 | 0.5654 | 0.5622 |
| Bus | 0.4087 | 0.2871 | 0.3142 | 0.2687 |
| Car | 0.6426 | 0.5829 | 0.5955 | 0.4621 |
| Motorcycle | 0.2903 | 0.1923 | 0.2790 | 0.0877 |
| Person | 0.3776 | 0.2238 | 0.2117 | 0.2684 |
| Pole | 0.2405 | 0.2193 | 0.2289 | 0.1184 |
| Road | 0.7436 | 0.7744 | 0.7640 | 0.7660 |
| Sidewalk | 0.2039 | 0.3143 | 0.3357 | 0.2673 |
| Sky | 0.8915 | 0.8869 | 0.8998 | 0.8674 |
| Truck | 0.3161 | 0.2695 | 0.2825 | 0.2037 |
| Vegetation | 0.5936 | 0.5137 | 0.5624 | 0.4659 |
| Overall | 0.4760 | 0.4386 | 0.4581 | 0.3943 |
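Assuming the per-class scores above are intersection-over-union (an assumption on our part; the caption does not name the metric), they can be reproduced from predicted and ground-truth label maps as follows, with "Overall" as the mean over classes.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious  # overall score: np.nanmean(ious)
```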
@article{ditfuse2025,
title={Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach},
author={Jiayang Li and Chengjie Jiang and Junjun Jiang and Pengwei Liang and Jiayi Ma and Liqiang Nie},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025}
}