Image fusion aims to blend complementary information from diverse modalities, yet most current methods lack robustness in complex fusion scenarios and cannot flexibly accommodate user intent. We present DiTFuse, the first Diffusion-Transformer (DiT) framework for instruction-driven, dynamic fusion control. Guided by natural-language commands, DiTFuse flexibly blends multimodal content to match diverse preferences and scenarios. Training employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ideal reference images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared–visible, multi-focus, and multi-exposure fusion—as well as text-controlled refinement and downstream tasks—within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.
We propose a DiT-based framework with a parallel input structure that jointly fuses text and visual features for multi-modal fusion (IVIF, MFF, and MEF). This design yields a robust backbone that reduces modality redundancy both before and after fusion; a minimal sketch of the parallel-input idea follows the contributions below.
Our self-supervised framework combines three key elements: M3-based noisy pair generation for realistic priors, mean IVIF fusion to bridge modality gaps, and multi-prompt text-conditioned data for global control. This enables scalable multimodal alignment without ground-truth labels.
DiTFuse is the first end-to-end framework that directly controls fusion results via natural-language instructions by aligning text and multi-modal features in the latent space. It supports fine-grained visual control, instruction-following segmentation, and strong zero-shot generalization to related tasks.
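As a rough illustration of the parallel input structure above, the sketch below lets text and visual tokens attend jointly inside a single transformer block, so fusion is conditioned on language at every layer. All names, dimensions, and the block layout are illustrative assumptions rather than the released DiTFuse architecture (which also carries timestep conditioning, omitted here).

```python
import torch
import torch.nn as nn

class JointDiTBlock(nn.Module):
    """Hypothetical block: text and visual tokens share one attention pass."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis_tokens, txt_tokens):
        # Concatenate both streams so self-attention mixes language and image features.
        x = torch.cat([vis_tokens, txt_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Split back into the two parallel streams for the next block.
        n = vis_tokens.shape[1]
        return x[:, :n], x[:, n:]
```

For example, `JointDiTBlock()(torch.randn(2, 256, 512), torch.randn(2, 16, 512))` processes 256 latent-image tokens alongside 16 instruction tokens in one pass.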
Training & Inference Pipeline. Textual control information is encoded by the Text Tokenizer, while image data is mapped into visual embeddings via VAE encoders. These conditional signals guide the DiT backbone during the denoising process. The left half illustrates the training stage with M3-based constraints; the right half shows inference, where the unified framework supports multiple fusion tasks and downstream applications.
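A minimal sketch of one training step as the caption describes it, assuming a standard noise-prediction diffusion objective: sources are VAE-encoded, a noisy latent is formed at a random timestep, and the DiT denoises it under text conditioning. `vae`, `text_encoder`, and `dit` are placeholder modules with assumed signatures, and the linear schedule is a simplification; the mean of the two source latents stands in for the missing reference, echoing the mean-fusion bridge in the contributions.

```python
import torch
import torch.nn.functional as F

def training_step(vae, text_encoder, dit, img_a, img_b, instruction_ids):
    with torch.no_grad():
        z_a = vae.encode(img_a)              # visual embeddings, one per source
        z_b = vae.encode(img_b)
        txt = text_encoder(instruction_ids)  # encoded natural-language command

    z0 = 0.5 * (z_a + z_b)                   # pseudo-target: mean-fusion bridge
    t = torch.randint(0, 1000, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    alpha = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1)  # toy linear schedule
    z_t = alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * noise

    # The DiT predicts the noise, conditioned on both sources and the text.
    pred = dit(z_t, t, cond=torch.cat([z_a, z_b], dim=1), text=txt)
    return F.mse_loss(pred, noise)
```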
Data Construction Pipeline. The upper part shows the Multi-degradation Mask Image Modeling (M3) process, which creates noisy pairs through mixed degradations such as masking, noise, and blur. The lower part builds a Ground Truth Pool by adjusting contrast and illumination or overlaying transparent masks for segmentation targets, enabling self-supervised learning without ideal reference images.
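A minimal sketch of the M3 degradation mix described in the caption, assuming random patch masking plus Gaussian noise plus a light depthwise blur; the patch size, noise level, and kernel are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def m3_degrade(img, patch=16, mask_ratio=0.3, sigma=0.05):
    """img: (B, C, H, W) in [0, 1], with H and W divisible by `patch`."""
    b, c, h, w = img.shape
    out = img.clone()

    # 1) Random patch masking, as in masked image modeling.
    grid = torch.rand(b, 1, h // patch, w // patch, device=img.device) < mask_ratio
    mask = grid.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    out = out.masked_fill(mask, 0.0)

    # 2) Additive Gaussian noise.
    out = out + sigma * torch.randn_like(out)

    # 3) Light blur via a 3x3 depthwise box filter.
    kernel = torch.full((c, 1, 3, 3), 1.0 / 9.0, device=img.device)
    out = F.conv2d(out, kernel, padding=1, groups=c)

    return out.clamp(0.0, 1.0), img  # (degraded input, self-supervised target)
```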
We report quantitative comparisons on infrared–visible fusion (IVIF), multi-focus fusion (MFF), multi-exposure fusion (MEF), and multi-modal segmentation. Arrows in the column headers indicate whether higher (↑) or lower (↓) values are better.
Table I. Quantitative comparison on the MSRS, M3FD, and TNO datasets.
| Method | MSRS MSE↓ | MSRS PSNR↑ | MSRS MANIQA↑ | MSRS LIQE↑ | MSRS CLIP-IQA↑ | M3FD MSE↓ | M3FD PSNR↑ | M3FD MANIQA↑ | M3FD LIQE↑ | M3FD CLIP-IQA↑ | TNO MSE↓ | TNO PSNR↑ | TNO MANIQA↑ | TNO LIQE↑ | TNO CLIP-IQA↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SwinFusion | 0.038 | 64.52 | 0.138 | 1.108 | 0.312 | 0.059 | 61.37 | 0.284 | 1.704 | 0.481 | 0.059 | 61.36 | 0.164 | 1.010 | 0.229 |
| SeAFusion | 0.038 | 64.33 | 0.144 | 1.138 | 0.355 | 0.060 | 61.12 | 0.288 | 1.641 | 0.466 | 0.058 | 61.63 | 0.187 | 1.013 | 0.267 |
| PMGI | 0.066 | 60.32 | 0.142 | 1.030 | 0.244 | 0.038 | 62.91 | 0.277 | 1.359 | 0.420 | 0.044 | 62.23 | 0.162 | 1.013 | 0.212 |
| DDBFusion | 0.021 | 66.93 | 0.138 | 1.102 | 0.285 | 0.032 | 63.81 | 0.274 | 1.445 | 0.440 | 0.039 | 62.97 | 0.199 | 1.019 | 0.244 |
| DDFM | 0.022 | 66.60 | 0.142 | 1.053 | 0.296 | 0.033 | 63.58 | 0.296 | 1.553 | 0.452 | 0.045 | 62.21 | 0.187 | 1.019 | 0.253 |
| DeFusion | 0.026 | 66.08 | 0.132 | 1.042 | 0.318 | 0.036 | 63.52 | 0.276 | 1.425 | 0.433 | 0.040 | 63.40 | 0.185 | 1.015 | 0.253 |
| U2Fusion | 0.022 | 66.46 | 0.154 | 1.096 | 0.327 | 0.033 | 63.61 | 0.282 | 1.423 | 0.506 | 0.038 | 63.08 | 0.201 | 1.014 | 0.256 |
| Text-DiFuse | 0.092 | 58.62 | 0.131 | 0.984 | 0.284 | 0.058 | 60.80 | 0.275 | 1.478 | 0.423 | 0.056 | 61.19 | 0.197 | 1.027 | 0.270 |
| Text-IF | 0.039 | 64.10 | 0.140 | 1.107 | 0.362 | 0.051 | 62.10 | 0.286 | 1.661 | 0.457 | 0.051 | 62.02 | 0.190 | 1.023 | 0.281 |
| DiTFuse | 0.021 | 66.63 | 0.162 | 1.240 | 0.392 | 0.032 | 63.81 | 0.299 | 1.718 | 0.498 | 0.036 | 63.50 | 0.209 | 1.019 | 0.297 |
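For reference, a sketch of how the fidelity metrics above could be computed: fusion has no single ground truth, so MSE/PSNR are often averaged over both source images. The averaging protocol and peak value here are our assumptions (MANIQA, LIQE, and CLIP-IQA are learned no-reference models and are not reproduced here).

```python
import numpy as np

def fusion_mse_psnr(fused, src_a, src_b, peak=255.0):
    # Average squared error against both sources, then convert to PSNR (dB).
    mse = 0.5 * (np.mean((fused - src_a) ** 2) + np.mean((fused - src_b) ** 2))
    psnr = 10.0 * np.log10(peak ** 2 / mse)
    return mse, psnr
```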
Table II. Quantitative comparison on the MFIF, RealMFF, and SICE datasets. MFIF and RealMFF are multi-focus datasets, while SICE is a multi-exposure dataset.
| Method | MFIF SF↑ | MFIF AG↑ | MFIF LIQE↑ | MFIF MUSIQ↑ | MFIF CLIP-IQA↑ | RealMFF SF↑ | RealMFF AG↑ | RealMFF LIQE↑ | RealMFF MUSIQ↑ | RealMFF CLIP-IQA↑ | SICE EN↑ | SICE SD↑ | SICE LIQE↑ | SICE MUSIQ↑ | SICE CLIP-IQA↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZMMF | 16.77 | 5.863 | 2.476 | 49.32 | 0.483 | 14.86 | 5.703 | 3.037 | 53.23 | 0.566 | 7.286 | 0.268 | 3.774 | 69.30 | 0.667 |
| PMGI | 11.36 | 4.226 | 2.195 | 53.63 | 0.581 | 15.43 | 5.884 | 2.610 | 52.55 | 0.511 | 6.677 | 0.156 | 2.592 | 70.72 | 0.599 |
| SwinFusion | 20.16 | 6.941 | 2.994 | 58.53 | 0.628 | 18.32 | 6.577 | 2.943 | 55.72 | 0.543 | 7.133 | 0.271 | 3.571 | 67.50 | 0.665 |
| U2Fusion | 22.11 | 8.060 | 3.119 | 62.14 | 0.658 | 16.64 | 6.370 | 2.021 | 53.40 | 0.501 | 7.328 | 0.262 | 3.346 | 67.67 | 0.673 |
| DDBFusion | 15.51 | 5.374 | 2.795 | 57.33 | 0.623 | 14.19 | 5.302 | 2.683 | 53.31 | 0.534 | 7.453 | 0.268 | 3.684 | 69.30 | 0.686 |
| DeFusion | 12.60 | 4.679 | 2.856 | 58.95 | 0.619 | 12.41 | 4.700 | 2.421 | 51.81 | 0.517 | 7.315 | 0.240 | 3.567 | 68.97 | 0.650 |
| DiTFuse | 23.81 | 8.260 | 3.891 | 68.42 | 0.668 | 18.26 | 6.634 | 3.408 | 58.46 | 0.572 | 7.532 | 0.274 | 4.005 | 70.32 | 0.693 |
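The statistics in Table II have standard closed-form definitions; below are common reference implementations (normalizations and boundary handling may differ slightly from the paper's evaluation code; MUSIQ, LIQE, and CLIP-IQA are learned models and are omitted).

```python
import numpy as np

def spatial_frequency(img):   # SF: row/column gradient energy
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

def average_gradient(img):    # AG: mean local gradient magnitude
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def entropy(img, bins=256):   # EN: histogram entropy in bits
    hist, _ = np.histogram(img, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def standard_deviation(img):  # SD: global contrast
    return float(np.std(img))
```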
Table III. Quantitative comparison of DiTFuse and LISA on multi-modal segmentation.
| Class | DiTFuse | LISA (Fusion) | LISA (VIS) | LISA (IR) |
|---|---|---|---|---|
| Building | 0.5271 | 0.5598 | 0.5654 | 0.5622 |
| Bus | 0.4087 | 0.2871 | 0.3142 | 0.2687 |
| Car | 0.6426 | 0.5829 | 0.5955 | 0.4621 |
| Motorcycle | 0.2903 | 0.1923 | 0.2790 | 0.0877 |
| Person | 0.3776 | 0.2238 | 0.2117 | 0.2684 |
| Pole | 0.2405 | 0.2193 | 0.2289 | 0.1184 |
| Road | 0.7436 | 0.7744 | 0.7640 | 0.7660 |
| Sidewalk | 0.2039 | 0.3143 | 0.3357 | 0.2673 |
| Sky | 0.8915 | 0.8869 | 0.8998 | 0.8674 |
| Truck | 0.3161 | 0.2695 | 0.2825 | 0.2037 |
| Vegetation | 0.5936 | 0.5137 | 0.5624 | 0.4659 |
| Overall | 0.4760 | 0.4386 | 0.4581 | 0.3943 |
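Assuming the per-class scores above are intersection-over-union (an assumption on our part; the caption does not name the metric), they can be reproduced from predicted and ground-truth label maps as follows, with "Overall" as the mean over classes.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious  # overall score: np.nanmean(ious)
```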
@article{ditfuse2025,
title={Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach},
author={Jiayang Li and Chengjie Jiang and Junjun Jiang and Pengwei Liang and Jiayi Ma and Liqiang Nie},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025}
}