Official implementation of Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach (TPAMI 2026)
For any questions, feel free to contact us -> (Email: lijiayang.cs@gmail.com)
Looking forward to your ⭐!
- Release code
- Release checkpoint
- Release arXiv paper
- Release IEEE version paper
The core objective of our work is to demonstrate the superiority of a parallel architecture for information control. In experiments beyond the main paper, we also tried AdaIN-based information injection and T2I-Adapter-style feature-map addition. However, both approaches inevitably entangle the information from the two modalities (mixing it numerically), making it impossible to truly separate the content of the two input images. This is why explicit information disentanglement is necessary, and why a parallel design is the appropriate choice.
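The entanglement argument can be sketched with a toy example (illustrative only, not the paper's implementation; the feature vectors and names below are made up): once two feature maps are summed, neither operand is recoverable, whereas parallel streams remain individually addressable.

```python
# Toy illustration (not the paper's implementation): additive
# feature injection entangles two modalities, while a parallel
# design keeps each modality's features separately addressable.
# The feature vectors below are made up for demonstration.

infrared = [0.9, 0.1, 0.4]  # stand-in features from modality A
visible = [0.2, 0.7, 0.3]   # stand-in features from modality B

# T2I-Adapter-style addition: values are numerically mixed, and
# infinitely many (a, b) pairs produce the same sum, so neither
# input can be recovered from `entangled` alone.
entangled = [a + b for a, b in zip(infrared, visible)]

# Parallel design: each modality keeps its own token stream, so a
# downstream attention layer can still attend to them independently.
parallel = {"modality_a": infrared, "modality_b": visible}

print("entangled:", entangled)
print("modality A preserved:", parallel["modality_a"] == infrared)
```

This is why the parallel branches in our design can apply per-modality control: the original features are never collapsed into a single mixed tensor.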
In addition, the M3-style synthetic fusion data construction pipeline can significantly improve the performance of the fusion task itself. Finally, with the rapid progress of unified models for visual understanding and generation, we believe fusion tasks should also actively embrace this trend, incorporating strong visual priors into fusion frameworks. We look forward to future advances enabled by such unified architectures.
For detailed installation and usage instructions, please refer to ➡️ setup.md.
For testing, please refer to the provided script: ➡️ test.md
This script demonstrates how to run DiTFuse in different modes (single, batch, and multi-prompt).
Inference requires approximately 12 GB of GPU memory and runs efficiently on widely available GPUs such as the NVIDIA RTX 3090, V100, and RTX 4090.
Training follows the same procedure as OmniGen.
If you use DiTFuse in your research, please cite:
@ARTICLE{11297852,
  author={Li, Jiayang and Jiang, Chengjie and Jiang, Junjun and Liang, Pengwei and Ma, Jiayi and Nie, Liqiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach},
  year={2026},
  volume={48},
  number={4},
  pages={3970-3987},
  keywords={Semantics;Image fusion;Training;Image segmentation;Transformers;Optimization;Data models;Visual effects;Feature extraction;Electronic mail;Image fusion;DiT;text control},
  doi={10.1109/TPAMI.2025.3642842}
}
This project is built on OmniGen, a powerful Diffusion Transformer framework developed by VectorSpace Lab.