Compared to prior works that rely on implicit alignment or coarse video-level routing, ProPhy introduces a progressive alignment framework. It injects learnable physical priors and employs fine-grained token-level routing, allowing specialized experts to internalize specific physical domains and improve the physical realism of generated videos.
The core of ProPhy consists of two key components:
- Semantic Expert Block: Captures high-level physical categories and initial semantic alignment.
- Refinement Expert Block: Performs fine-grained refinement to ensure precise physical dynamics.
During inference, ProPhy operates end-to-end, dynamically aligning physics categories through these blocks to produce physically consistent video content.
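The fine-grained token-level routing idea can be illustrated as a minimal top-1 mixture-of-experts layer. This is a hedged NumPy sketch under assumed shapes, not the actual ProPhy implementation; the gate and expert weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def route_tokens(tokens, gate_w, expert_ws):
    """Top-1 token-level routing: each token is dispatched to its
    highest-scoring expert, so each expert can specialize in a distinct
    physical domain. (Illustrative only; not the ProPhy code.)"""
    scores = tokens @ gate_w                # (num_tokens, num_experts)
    choice = scores.argmax(axis=-1)         # expert index chosen per token
    out = np.zeros_like(tokens)
    for idx, w in enumerate(expert_ws):
        mask = choice == idx                # tokens routed to expert `idx`
        if mask.any():
            out[mask] = tokens[mask] @ w    # expert-specific transform
    return out

dim, num_experts, num_tokens = 64, 4, 16
tokens = rng.standard_normal((num_tokens, dim))
gate_w = rng.standard_normal((dim, num_experts))
expert_ws = [rng.standard_normal((dim, dim)) for _ in range(num_experts)]
print(route_tokens(tokens, gate_w, expert_ws).shape)  # (16, 64)
```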
- Python: 3.10+ recommended.
- Environment Setup (using uv):

  ```bash
  uv sync
  source .venv/bin/activate  # or .venv\Scripts\activate on Windows
  ```

- All commands below are run from the repository root.
`tools/generate_attention_map.py` produces per-video attention maps for physical phenomena and appearance using a Qwen2.5-VL–based model.

Run from the repository root:

```bash
export PYTHONPATH=$(pwd):$PYTHONPATH
python3 tools/generate_attention_map.py \
    --data_json_path /path/to/dataset.json \
    --video_base_path /path/to/videos \
    --output_dir /path/to/attention_output \
    --model_path /path/to/Qwen2.5-VL-checkpoint
```

For the JSON file passed to `--data_json_path`, each item needs:

- `video_name`: the filename of the video (e.g., `video_001.mp4`), which will be joined with the `--video_base_path` argument.
- `activate_expert`: a list that supports both:
  - integers: built-in phenomenon / appearance IDs defined in `configs/attention_map.py`
  - strings: your own physical attributes
Example with built-in IDs:

```json
[
    {
        "video_name": "video_001.mp4",
        "activate_expert": [0, 3]
    }
]
```

This will generate attention maps for expert IDs 0 and 3.
You can also mix IDs with custom strings:

```json
[
    {
        "video_name": "video_001.mp4",
        "activate_expert": [0, "surface tension", "magnetic attraction"]
    }
]
```

This will generate the default describe map, the built-in map for ID 0, and extra maps for surface tension and magnetic attraction.
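If you prefer to build the dataset JSON programmatically, a few lines of standard-library Python suffice. The entries mirror the format documented above; the filenames are placeholders:

```python
import json

# Build entries in the format expected by tools/generate_attention_map.py;
# built-in IDs and custom attribute strings can be mixed in activate_expert.
entries = [
    {"video_name": "video_001.mp4", "activate_expert": [0, 3]},
    {"video_name": "video_002.mp4",
     "activate_expert": [0, "surface tension", "magnetic attraction"]},
]

# Write the list to disk; pass this file via --data_json_path.
with open("dataset.json", "w") as f:
    json.dump(entries, f, indent=2)
```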
Pretrained backbone checkpoints are available on Hugging Face: CogVideoX and Wan. Our ProPhy checkpoints will be released soon!
CogVideoX:

```bash
export PYTHONPATH=$(pwd):$PYTHONPATH
python3 inference_cogvideox.py \
    --pretrained_checkpoint /path/to/CogVideoX-5b \
    --prophy_checkpoint /path/to/checkpoint \
    --prompt "Your prompt" \
    --output_path /path/to/output.mp4
```

Wan:

```bash
export PYTHONPATH=$(pwd):$PYTHONPATH
python3 inference_wan.py \
    --pretrained_checkpoint /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prophy_checkpoint /path/to/checkpoint \
    --prompt "Your prompt" \
    --output_path /path/to/output.mp4
```
`--output_path` can be a `.mp4` file or a directory (in which case a default filename is used).
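That behavior can be approximated as below; `default_name` is a hypothetical placeholder, not necessarily the default filename the scripts actually use:

```python
import os

def resolve_output_path(output_path, default_name="output.mp4"):
    """If output_path is an existing directory, join a default filename;
    otherwise treat it as the target .mp4 file. (Sketch of the documented
    behavior; default_name is an assumption, not taken from the repo.)"""
    if os.path.isdir(output_path):
        return os.path.join(output_path, default_name)
    return output_path
```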
We would like to thank the following projects for their contributions:
- Wan2.1 and CogVideoX for their excellent backbone models.
- WISA for providing their high-quality dataset.
If you use ProPhy in your work, please cite:
```bibtex
@misc{wang2025prophyprogressivephysicalalignment,
    title={ProPhy: Progressive Physical Alignment for Dynamic World Simulation},
    author={Zijun Wang and Panwen Hu and Jing Wang and Terry Jingchen Zhang and Yuhao Cheng and Long Chen and Yiqiang Yan and Zutao Jiang and Hanhui Li and Xiaodan Liang},
    year={2025},
    eprint={2512.05564},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.05564},
}
```
