DreamWorld is a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm.
Boming Tan, Xiangdong Zhang, Ning Liao, Yuqing Zhang, Shaofeng Zhang, Xue Yang, Qi Fan, Yanyong Zhang
- 🎉 Sept, 2025: Our work VideoREPA has been accepted to NeurIPS 2025. It is the first adaptation of REPA to video generation, transferring physical knowledge via Token Relation Distillation to improve the physical realism of text-to-video models.
- 💥 Feb, 2026: DreamWorld is available on arXiv.
- 🚀 Mar, 2026: The code for DreamWorld has been released. Feel free to try it out.
Despite impressive progress in video generation, existing models remain limited to surface-level plausibility and lack a coherent, unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies.
To address this limitation, we introduce DreamWorld, which jointly predicts video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency.
- Joint World Modeling Paradigm: Integrates temporal dynamics from Optical Flow, spatial geometry from VGGT, and semantic understanding from DINOv2.
- Consistent Constraint Annealing (CCA): A progressive decay mechanism that regulates world-level constraints during training to mitigate visual instability and temporal flickering, ensuring high-fidelity generation.
- Multi-Source Inner-Guidance: Leverages the model's own predicted knowledge features during inference to steer generation toward trajectories that better adhere to real-world dynamics.
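As a rough illustration of how a progressive decay like Consistent Constraint Annealing could schedule the world-level constraint weight over training (the cosine form and the function name below are our own assumptions, not the paper's exact schedule):

```python
import math

def cca_weight(step, total_steps, w_init=1.0, w_final=0.0):
    """Hypothetical progressive decay for the world-level constraint weight.

    The constraint is strong early on (injecting world knowledge) and is
    annealed toward w_final so that late-stage training is not destabilized
    by conflicting constraints (visual instability, temporal flickering).
    """
    t = min(max(step / total_steps, 0.0), 1.0)  # normalized training progress
    return w_final + 0.5 * (w_init - w_final) * (1.0 + math.cos(math.pi * t))

print(cca_weight(0, 1000))     # 1.0 at the start
print(cca_weight(500, 1000))   # ~0.5 midway
print(cca_weight(1000, 1000))  # 0.0 at the end
```

Any monotone decay (linear, exponential) would fit the same role; the key property is that the constraint weight shrinks as training progresses.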
Below is an overview of the DreamWorld training and inference pipeline:
Extensive evaluations show that DreamWorld significantly outperforms baselines and establishes a new standard for world models.
DreamWorld demonstrates significant improvements over baselines, particularly in temporal dynamics, semantic understanding, and spatial relationships. FT. denotes the fine-tuned version, and Reimpl. indicates our re-implementation of the method.
| Method | Quality Score | Semantic Score | Overall Score |
|---|---|---|---|
| Wan2.1-T2V-1.3B | 79.81 | 65.43 | 76.93 |
| Wan2.1-T2V-1.3B(FT.) | 81.26 | 68.47 | 78.71 |
| VideoJAM(Reimpl.) | 81.18 | 69.08 | 78.76 |
| DreamWorld (Ours) | 83.49 | 70.89 | 80.97 |
| Wan2.1(FT.) | VideoJAM(Reimpl.) | DreamWorld (Ours) | Prompt |
|---|---|---|---|
| Wan2.1-1.mp4 | VideoJAM-1.mp4 | DreamWorld-1.mp4 | Gwen Stacy reading a book, tilt up. |
| Wan2.1-2.mp4 | VideoJAM-2.mp4 | DreamWorld-2.mp4 | A hose sprays water onto a burning pile of tires... |
| Wan2.1-3.mp4 | VideoJAM-3.mp4 | DreamWorld-3.mp4 | A person wades through a swamp... |
| Wan2.1-4.mp4 | VideoJAM-4.mp4 | DreamWorld-4.mp4 | Happy dog wearing a yellow turtleneck... |
| Wan2.1-5.mp4 | VideoJAM-5.mp4 | DreamWorld-5.mp4 | A potter's wheel is lightly poked with a tool... |
| Wan2.1-6.mp4 | VideoJAM-6.mp4 | DreamWorld-6.mp4 | Two roller skaters perform a synchronized spin... |
Run the following commands in order.
git clone https://github.com/ABU121111/DreamWorld.git
conda create -n dreamworld python=3.10 -y
conda activate dreamworld
cd DreamWorld
pip install -r requirements.txt

We use the WISA dataset. The metadata JSON will be downloaded to ./data/wisa/, and the video zip parts to ./data/wisa/videos/, where they are extracted during preprocessing. We use parts 10 to 63 and then filter a 32K subset that meets our requirements.
pip install -U huggingface_hub
# Download metadata JSON to ./data/wisa/
huggingface-cli download --repo-type dataset qihoo360/WISA-80K \
--local-dir ./data/wisa \
--include "data/wisa-80k.json"
# Download video zip parts to ./data/wisa/videos/ (parts 10-63)
huggingface-cli download --repo-type dataset qihoo360/WISA-80K \
--local-dir ./data/wisa/videos \
--include "data/videos/1[0-9].zip"
huggingface-cli download --repo-type dataset qihoo360/WISA-80K \
--local-dir ./data/wisa/videos \
--include "data/videos/2[0-9].zip"
huggingface-cli download --repo-type dataset qihoo360/WISA-80K \
--local-dir ./data/wisa/videos \
--include "data/videos/3[0-9].zip"
huggingface-cli download --repo-type dataset qihoo360/WISA-80K \
--local-dir ./data/wisa/videos \
--include "data/videos/4[0-9].zip"
huggingface-cli download --repo-type dataset qihoo360/WISA-80K \
--local-dir ./data/wisa/videos \
--include "data/videos/5[0-9].zip"
huggingface-cli download --repo-type dataset qihoo360/WISA-80K \
--local-dir ./data/wisa/videos \
--include "data/videos/6[0-3].zip"

# Wan2.1-T2V-1.3B-Diffusers
huggingface-cli download --repo-type model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--local-dir ./ckpt/wan-t2v-1.3b-diffusers
# VGGT
huggingface-cli download --repo-type model facebook/VGGT-1B \
--local-dir ./ckpt/vggt
# DINOv2
wget -P ./ckpt/ https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_pretrain.pth
# RAFT
wget -P ./ckpt/ https://dl.dropboxusercontent.com/s/4j4z58wuv8o0mfz/models.zip
cd ckpt
unzip models.zip
rm -rf models.zip
cd ..

# 1) Extract all zip files under ./data/wisa/videos and flatten videos into the videos root
# 2) Remove videos with fewer than 81 frames
# 3) Generate ./data/wisa/prompt.txt and ./data/wisa/video.txt based on wisa-80k.json
python ./data/wisa/preprocess.py \
--video_folder ./data/wisa/videos \
--json_path ./data/wisa/wisa-80k.json \
--output_dir ./data/wisa \
--min_frames 81

You can extract the three feature types independently. The full set of commands is also available in extract/pipline.sh.
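Preprocessing step 3 above (emitting line-aligned prompt/video lists) can be sketched as follows. The JSON field names ("file", "caption") are hypothetical placeholders, not the actual WISA schema; see ./data/wisa/preprocess.py for the real fields.

```python
import json
import os

# Hypothetical records standing in for entries parsed from wisa-80k.json.
records = [
    {"file": "clip_0001.mp4", "caption": "A hose sprays water onto burning tires."},
    {"file": "clip_0002.mp4", "caption": "A person wades through a swamp."},
]

def write_pairs(records, out_dir, video_root="./data/wisa/videos"):
    """Emit line-aligned prompt.txt / video.txt: line i of prompt.txt is the
    caption for the video path on line i of video.txt."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "prompt.txt"), "w") as fp, \
         open(os.path.join(out_dir, "video.txt"), "w") as fv:
        for r in records:
            fp.write(r["caption"] + "\n")
            fv.write(os.path.join(video_root, r["file"]) + "\n")

write_pairs(records, "./data/wisa_demo")
```

The line alignment is what the training dataloader relies on, so videos dropped by the frame-count filter must be removed from both files together.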
Depending on your environment, each extractor may require its own extra dependencies. If you see missing packages or build errors, install the requirements for that specific module first.
mkdir -p ./data/features/dino ./data/features/vggt ./data/features/flow
## Extract features for each video and write to ./data/features
# DINOv2
python ./extract/DINOv2/pca.py \
--video_folder ./data/wisa/videos \
--output_dir ./data/features/dino \
--ckpt_path ./ckpt/dinov2_vitb14_reg4_pretrain.pth
# VGGT
python ./extract/VGGT/fea.py \
--video_dir ./data/wisa/videos \
--output_dir ./data/features/vggt \
--model_path ./ckpt/vggt
# Optical Flow
python ./extract/RAFT/latent.py \
--video_folder ./data/wisa/videos \
--output_dir ./data/features/flow \
--raft_model ./ckpt/models/raft-things.pth \
--vae_checkpoint ./ckpt/wan-t2v-1.3b-diffusers

The training dataset config is script/training/training.json, and the validation set config is script/training/validation.json. Before training, update data_root in training.json to your local dataset path; ./data/wisa is recommended.
# You can edit the script to change GPU count and parallel settings
bash ./script/training/train.sh

Inference script: ./script/inference/inference.sh.
Before running, update the following items in the script:
- DATASET_FILE: you can use ./script/inference/inference.json
- --lora_path: point to the LoRA weights (*.safetensors) produced by training
- --output_dir: directory to save generated videos
# You can edit the script to change GPU count and output paths
bash ./script/inference/inference.sh

Alternatively, run inference directly with our released checkpoints. Please download the weights from Hugging Face:
huggingface-cli download --repo-type model TeanABU/DreamWorld --local-dir ./ckpt

This repo includes two evaluation toolkits:
- WorldScore: see evaluation/WorldScore/README.md and evaluation/WorldScore/evaluate.sh
- VBench: see evaluation/VBench/README.md and evaluation/VBench/evaluate.sh
Reproducibility notes. We use seed 42 by default. Even with the same seed, results may differ across devices due to randomness in video generation and evaluation. For more consistent comparisons, evaluate the released videos or run inference with the released DreamWorld checkpoints.
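Seeding only pins the randomness within a single device and software stack; a minimal sketch of the idea (the helper name is ours, not from the repo):

```python
import random

def seed_everything(seed=42):
    """Seed the RNGs you control. With torch installed, the released scripts
    would additionally need torch.manual_seed(seed) and
    torch.cuda.manual_seed_all(seed); numpy has its own seeding as well."""
    random.seed(seed)

# Same seed, same library versions and device -> identical samples.
seed_everything(42)
a = [random.random() for _ in range(3)]
seed_everything(42)
b = [random.random() for _ in range(3)]
print(a == b)  # True
```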
@misc{tan2026dreamworldunifiedworldmodeling,
title={DreamWorld: Unified World Modeling in Video Generation},
author={Boming Tan and Xiangdong Zhang and Ning Liao and Yuqing Zhang and Shaofeng Zhang and Xue Yang and Qi Fan and Yanyong Zhang},
year={2026},
eprint={2603.00466},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.00466},
}