Official implementation of Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach (TPAMI 2026)
For any questions, feel free to contact us -> (Email: lijiayang.cs@gmail.com)
Looking forward to your ⭐!
- Release code
- Release checkpoint
- Release arXiv paper
- Release IEEE version paper
The core objective of our work is to demonstrate the superiority of a parallel architecture for information control. In experiments beyond the main paper, we also tried AdaIN-based information injection and T2I-Adapter-style feature-map addition. However, both approaches inevitably entangle the information from the two modalities (mixing it numerically), making it impossible to truly separate the content of the two input images. This is why explicit information disentanglement is necessary, and why a parallel design is the appropriate choice.
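The entanglement argument can be sketched with a toy example (illustrative only, not the paper's implementation; the feature vectors and names below are made up): once two feature maps are summed, neither operand is recoverable, whereas parallel streams remain individually addressable.

```python
# Toy illustration (not the paper's implementation): additive
# feature injection entangles two modalities, while a parallel
# design keeps each modality's features separately addressable.
# The feature vectors below are made up for demonstration.

infrared = [0.9, 0.1, 0.4]  # stand-in features from modality A
visible = [0.2, 0.7, 0.3]   # stand-in features from modality B

# T2I-Adapter-style addition: values are numerically mixed, and
# infinitely many (a, b) pairs produce the same sum, so neither
# input can be recovered from `entangled` alone.
entangled = [a + b for a, b in zip(infrared, visible)]

# Parallel design: each modality keeps its own token stream, so a
# downstream attention layer can still attend to them independently.
parallel = {"modality_a": infrared, "modality_b": visible}

print("entangled:", entangled)
print("modality A preserved:", parallel["modality_a"] == infrared)
```

This is why the parallel branches in our design can apply per-modality control: the original features are never collapsed into a single mixed tensor.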
In addition, the M3-style synthetic fusion data construction pipeline can significantly improve the performance of the fusion task itself. Finally, with the rapid progress of unified models for visual understanding and generation, we believe fusion tasks should also actively embrace this trend, incorporating strong visual priors into fusion frameworks. We look forward to future advances enabled by such unified architectures.
For detailed installation and usage instructions, please refer to ➡️ setup.md.
For testing, please refer to the provided script: ➡️ test.md
This script demonstrates how to run DiTFuse in different modes (single, batch, and multi-prompt).
Inference requires approximately 12 GB of GPU memory and runs efficiently on widely available GPUs such as the NVIDIA RTX 3090, V100, and RTX 4090.
Training follows the same procedure as OmniGen.
If you use DiTFuse in your research, please cite:
@ARTICLE{11297852,
  author={Li, Jiayang and Jiang, Chengjie and Jiang, Junjun and Liang, Pengwei and Ma, Jiayi and Nie, Liqiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach},
  year={2026},
  volume={48},
  number={4},
  pages={3970-3987},
  keywords={Semantics;Image fusion;Training;Image segmentation;Transformers;Optimization;Data models;Visual effects;Feature extraction;Electronic mail;Image fusion;DiT;text control},
  doi={10.1109/TPAMI.2025.3642842}
}
This project is built on OmniGen, a powerful Diffusion Transformer framework developed by VectorSpace Lab.