This repo contains the code, models, and dataset used in **Generative World Renderer** by
Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang.
- [2026.04.03] We have released our paper — discussions and feedback are warmly welcome!
**TL;DR:** We present a large-scale dataset and framework for high-quality inverse and forward rendering of videos using fine-tuned video diffusion models. We extract synchronized RGB videos from two AAA games and five aligned G-buffer channels, and propose a VLM-based evaluation protocol for real-world scenes. Our pipeline consists of two components:
- Inverse Renderer (RGB → G-buffers): Fine-tuned from Cosmos-Transfer1-DiffusionRenderer to decompose RGB videos into G-buffer maps (albedo, normal, depth, roughness, metallic)
- Game Editing (G-buffers + Text → Stylized RGB): Fine-tuned from Wan2.1 1.3B (via DiffSynth-Studio) to synthesize photorealistic RGB videos from G-buffer inputs with controllable lighting and style via text prompts
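The two components share the same G-buffer interface: one RGB clip paired with five per-pixel maps. A minimal sketch of such a container follows; the `GBuffers` class, channel layouts, and shapes are illustrative assumptions (downscaled from 720p), not the repo's actual data format:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GBuffers:
    """Hypothetical container for the five G-buffer maps predicted from
    an RGB clip; shapes are illustrative (T frames, H x W pixels)."""
    albedo: np.ndarray     # (T, H, W, 3) base color
    normal: np.ndarray     # (T, H, W, 3) surface normals
    depth: np.ndarray      # (T, H, W, 1) scene depth
    roughness: np.ndarray  # (T, H, W, 1) material roughness
    metallic: np.ndarray   # (T, H, W, 1) material metalness

    def validate(self) -> None:
        # All five maps must be pixel-aligned with one another.
        t, h, w = self.albedo.shape[:3]
        for name in ("albedo", "normal", "depth", "roughness", "metallic"):
            assert getattr(self, name).shape[:3] == (t, h, w), f"{name} misaligned"

# Toy clip, downscaled from 720p for illustration.
T, H, W = 2, 90, 160
g = GBuffers(
    albedo=np.zeros((T, H, W, 3), np.float32),
    normal=np.zeros((T, H, W, 3), np.float32),
    depth=np.zeros((T, H, W, 1), np.float32),
    roughness=np.zeros((T, H, W, 1), np.float32),
    metallic=np.zeros((T, H, W, 1), np.float32),
)
g.validate()
```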
Key features of our dataset:
- 4M+ frames at 720p / 30 FPS with 6 synchronized channels (RGB + albedo, normal, depth, metallic, roughness)
- 40 hours of gameplay from 2 AAA games (Cyberpunk 2077 & Black Myth: Wukong)
- Long-duration sequences: average 8 min per clip, up to 53 min continuous recording
- Diverse content: urban/outdoor/indoor scenes, varying weather (sunny, rainy, foggy, night, sunset), and realistic motion patterns
- Motion blur variant: offline-generated via sub-frame interpolation and linear-domain temporal averaging
- VLM-based evaluation: reference-free assessment of material predictions using vision-language models
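The motion-blur variant's linear-domain temporal averaging can be sketched as follows. This is a minimal illustration, not the dataset pipeline: the gamma-2.2 sRGB approximation, the 8-sub-frame window, and the linear cross-fade (standing in for real sub-frame interpolation) are all assumptions.

```python
import numpy as np

def srgb_to_linear(x: np.ndarray) -> np.ndarray:
    """Approximate sRGB decoding with a gamma-2.2 curve (assumption)."""
    return np.clip(x, 0.0, 1.0) ** 2.2

def linear_to_srgb(x: np.ndarray) -> np.ndarray:
    """Approximate sRGB encoding (inverse of the above)."""
    return np.clip(x, 0.0, 1.0) ** (1.0 / 2.2)

def motion_blur_frame(f0: np.ndarray, f1: np.ndarray, n_sub: int = 8) -> np.ndarray:
    """Blend two consecutive frames into one motion-blurred frame.

    Sub-frames come from a linear cross-fade between f0 and f1 (a
    stand-in for learned sub-frame interpolation), then are averaged
    in linear light rather than in gamma-encoded sRGB.
    """
    lin0, lin1 = srgb_to_linear(f0), srgb_to_linear(f1)
    ts = np.linspace(0.0, 1.0, n_sub)
    # Temporal average in the linear domain, then re-encode to sRGB.
    avg = np.mean([(1 - t) * lin0 + t * lin1 for t in ts], axis=0)
    return linear_to_srgb(avg)

# Toy example: blurring a white frame into a black frame. Averaging in
# linear light yields a brighter result than naive sRGB-domain averaging.
f0 = np.full((4, 4, 3), 1.0)  # white frame
f1 = np.full((4, 4, 3), 0.0)  # black frame
blurred = motion_blur_frame(f0, f1)
naive = np.mean([f0, f1], axis=0)
print(float(blurred[0, 0, 0]) > float(naive[0, 0, 0]))  # True
```

Averaging in the gamma-encoded domain would darken blur trails around bright highlights, which is why the conversion to linear light matters here.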
This repository contains the Inverse Renderer and Game Editing models. Please follow the instructions below to set up the environment and run inference for each model. We recommend creating separate conda environments for the two models to avoid version conflicts.
```shell
git clone --recurse-submodules https://github.com/ShandaAI/AlayaRenderer.git
cd AlayaRenderer
```

| Model | Base Model | Link |
|---|---|---|
| Inverse Renderer | Cosmos-Transfer1-DiffusionRenderer 7B | HuggingFace |
| Game Editing | Wan2.1 1.3B | HuggingFace |
Our model is fine-tuned from Cosmos-Transfer1-DiffusionRenderer. Please follow the inverse_renderer/ instructions for environment setup and inference. Download the related weights and replace the checkpoint under inverse_renderer/checkpoints/Diffusion_Renderer_Inverse_Cosmos_7B with our fine-tuned checkpoint.
Please follow the DiffSynth-Studio instructions to set up the environment and download the related weights. Download our fine-tuned checkpoint from HuggingFace and place it under game_editing/models/train/Wan2.1-T2V-1.3B_gbuffer/.
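Before running inference, it can help to verify that both fine-tuned checkpoints sit at the paths named above. A minimal sketch (the helper below is illustrative, not part of the repo):

```python
from pathlib import Path

# Checkpoint locations named in this README: the inverse-renderer entry
# is a directory; the game-editing entry is the file passed to
# --checkpoint in the inference command.
EXPECTED = [
    Path("inverse_renderer/checkpoints/Diffusion_Renderer_Inverse_Cosmos_7B"),
    Path("game_editing/models/train/Wan2.1-T2V-1.3B_gbuffer/model.safetensors"),
]

def missing_checkpoints(repo_root: Path) -> list[Path]:
    """Return the expected checkpoint paths that do not exist yet."""
    return [p for p in EXPECTED if not (repo_root / p).exists()]

if __name__ == "__main__":
    for p in missing_checkpoints(Path(".")):
        print(f"missing: {p}")
```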
```shell
cd game_editing
CUDA_VISIBLE_DEVICES=0 python \
    examples/wanvideo/model_inference/inference_gbuffer_caption.py \
    --checkpoint models/train/Wan2.1-T2V-1.3B_gbuffer/model.safetensors \
    --gpu 0 \
    --style snowy_winter \
    --prompt "the scene is set in a frozen, snow-covered environment under cold, pale winter light with falling snowflakes, creating a silent and ethereal winter wonderland atmosphere." \
    --gbuffer_dir test_dataset \
    --save_dir outputs/ \
    --num_frames 81 --height 480 --width 832
```

- Release dataset.
- Release data curation toolkit.
This project builds upon the following excellent works:
- DiffusionRenderer by NVIDIA Toronto AI Lab
- Wan2.1 by Wan-Video
- DiffSynth-Studio by ModelScope
See LICENSE.
If you find this project helpful, please consider citing:
```bibtex
@article{huang2026generativeworldrenderer,
  title={Generative World Renderer},
  author={Zheng-Hui Huang and Zhixiang Wang and Jiaming Tan and Ruihan Yu and Yidan Zhang and Bo Zheng and Yu-Lun Liu and Yung-Yu Chuang and Kaipeng Zhang},
  journal={arXiv preprint arXiv:2604.02329},
  year={2026}
}
```