TL;DR: Diffusion models exhibit heavy temporal redundancy, yet we transmit full activations step after step.
Why are we sending near-duplicate data across GPUs?
CompactFusion transmits only compressed residuals, the part of the activation that actually changes, drastically reducing bandwidth with minimal quality loss.
We owe special thanks to the xDiT team—without their excellent open-source framework, this project would simply not exist.
Their work laid the foundation for everything we've built.
We thank the DistriFusion authors for sharing their code and system.
We also thank the authors of common_metrics_on_video_quality for their excellent video-quality evaluation tools.
- Diffusion models generate data step-by-step, but their intermediate activations change slowly and predictably.
- In multi-GPU inference, these large activations are repeatedly transmitted between devices.
- The transmitted data are highly redundant, wasting precious bandwidth on near-duplicate content.
- We ask: Why are we transmitting redundant stuff at all?
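The redundancy claim can be illustrated with a toy experiment. This is a hypothetical sketch, not a measurement on a real model: synthetic tensors stand in for activations, and the 5% per-step drift rate is an assumption chosen for illustration.

```python
import numpy as np

# Activations at consecutive denoising steps tend to be highly correlated,
# so the residual is small relative to the activation itself.
# We mimic this with a slowly drifting random tensor.
rng = np.random.default_rng(0)
act_prev = rng.standard_normal((64, 64))                     # activation at step t-1
act_curr = act_prev + 0.05 * rng.standard_normal((64, 64))   # small drift at step t

full_size = np.linalg.norm(act_curr)
residual_size = np.linalg.norm(act_curr - act_prev)
rel_change = residual_size / full_size
print(f"relative change between steps: {rel_change:.3f}")
# A small ratio means most of a full-activation transmission is redundant.
```

When the ratio is small, sending the full activation every step wastes almost all of the bandwidth on information the receiver already has.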
CompactFusion is a residual compression framework for parallel diffusion model serving.
It compresses only the change (residual) between activations across steps, with optional error feedback to maintain reconstruction quality.
✅ CompactFusion targets only communication, making it:
- 📦 Plug-and-play: no model retraining or modification required
- 🛠 Framework-compatible: integrates with ring attention, patch parallelism, and more
- ⚡ Extremely efficient: up to 100× compression, with <1% of the data transmitted, while still outperforming DistriFusion in quality
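The residual-compression idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual CompactFusion implementation: `ResidualLink` and `topk_compress` are hypothetical names, top-k sparsification stands in for the real compressor, and the untransmitted remainder simply stays in the next step's residual, acting as implicit error feedback.

```python
import numpy as np

def topk_compress(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest
    (a toy stand-in for CompactFusion's compressor)."""
    flat = x.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(x.shape)

class ResidualLink:
    """One direction of a communication link that sends compressed residuals.
    Whatever compression drops remains inside the next residual, which
    acts as implicit error feedback."""
    def __init__(self, shape, k):
        self.prev = np.zeros(shape)  # mirror of the receiver's reconstruction
        self.k = k

    def send(self, activation):
        # residual = new change + everything compression dropped so far
        residual = activation - self.prev
        compressed = topk_compress(residual, self.k)  # what goes on the wire
        self.prev = self.prev + compressed            # receiver-side state
        return compressed

rng = np.random.default_rng(0)
link = ResidualLink((32, 32), k=128)              # ~12.5% of entries per step
act = rng.standard_normal((32, 32))
for _ in range(8):                                # slowly drifting activations
    act = act + 0.02 * rng.standard_normal((32, 32))
    link.send(act)
err = np.linalg.norm(link.prev - act) / np.linalg.norm(act)
print(f"relative reconstruction error after 8 steps: {err:.3f}")
```

Because consecutive activations change slowly, the residual concentrates its energy in few entries, so even aggressive sparsification keeps the receiver's reconstruction close to the true activation over many steps.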
*Figure: Residual Compression Principle*
*Figure: System Architecture*
CompactFusion supports out-of-the-box compression for:
- ✅ Compressed Ring Attention
- ✅ Compressed Patch Parallel
- ✅ DistriFusion migrated to xDiT
- ✅ Patch Parallel migrated to xDiT
- ✅ FLUX, CogVideoX, SD, Pixart-alpha, and other backbones
We build CompactFusion on top of the excellent xDiT framework.
You may simply use the pre-built Docker image from xDiT:
```shell
docker pull thufeifeibear/xdit-dev
```

Example usages are provided in:

- `examples/cogvideox_example.py`
- `examples/flux_example.py`
We do not modify xDiT's setup; refer directly to the xDiT documentation for usage details.
If you use this code for your research, please cite our paper.
```bibtex
@article{luo2025accelerating,
  title={Accelerating Parallel Diffusion Model Serving with Residual Compression},
  author={Luo, Jiajun and Xiao, Yicheng and Xu, Jianru and You, Yangxiu and Lu, Rongwei and Tang, Chen and Jiang, Jingyan and Wang, Zhi},
  journal={arXiv preprint arXiv:2507.17511},
  year={2025}
}
```


