Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

VEGA-3D turns a frozen video generation model into a Latent World Simulator, extracting geometry-aware spatiotemporal priors and fusing them with semantic visual tokens for 3D scene understanding, spatial reasoning, and embodied manipulation.

Xianjin Wu1, Dingkang Liang1,†, Tianrui Feng1, Kui Xia2, Yumeng Zhang2, Xiaofan Li2, Xiao Tan2, Xiang Bai1

1 Huazhong University of Science and Technology   2 Baidu Inc., China   † Project Lead

VEGA-3D teaser overview
VEGA-3D sidesteps dependence on explicit 3D inputs and heavy geometric supervision by mining implicit spatial priors from large-scale video generation models.

A plug-and-play route to 3D awareness

Multimodal large language models are strong at semantics, but they remain spatially blind: they struggle with fine-grained geometric reasoning, view-consistent localization, and physical dynamics. VEGA-3D starts from a different observation. If modern video generation models can synthesize coherent videos under camera motion and interaction, they must already encode robust 3D structure and motion priors.

We therefore repurpose a pretrained video diffusion model as a frozen latent world simulator, extract intermediate features at informative denoising stages, and align them with semantic visual tokens through adaptive gated fusion. This produces consistent gains across scene understanding, spatial reasoning, and robotic manipulation without requiring explicit 3D supervision.
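To make the recipe concrete, the sketch below shows the query pattern in PyTorch: noise the clean video latent to a mid denoising timestep, run one forward pass through the frozen backbone, and read off an intermediate layer's activations with a forward hook. The toy backbone, block index, and timestep here are illustrative stand-ins, not the exact model or hyperparameters used by VEGA-3D.

import torch
import torch.nn as nn

class VideoDiffusionBackbone(nn.Module):
    # Toy stand-in for a video diffusion denoiser: a stack of transformer
    # blocks over flattened spatiotemporal tokens plus a timestep embedding.
    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(depth)])
        self.t_embed = nn.Embedding(1000, dim)

    def forward(self, tokens, t):
        h = tokens + self.t_embed(t)[:, None, :]
        for blk in self.blocks:
            h = blk(h)
        return h

@torch.no_grad()
def extract_generative_features(model, video_tokens, t_mid=500, layer=2):
    # Grab an intermediate layer's activations at a mid denoising step,
    # where the spatial priors are found to be strongest.
    feats = {}
    handle = model.blocks[layer].register_forward_hook(
        lambda mod, inp, out: feats.update(h=out))
    noise = torch.randn_like(video_tokens)
    alpha = 1.0 - t_mid / 1000.0                  # simple DDPM-style mix
    noisy = alpha ** 0.5 * video_tokens + (1.0 - alpha) ** 0.5 * noise
    t = torch.full((video_tokens.shape[0],), t_mid, dtype=torch.long)
    model(noisy, t)
    handle.remove()
    return feats["h"]                             # geometry-aware tokens

model = VideoDiffusionBackbone().eval()
for p in model.parameters():
    p.requires_grad_(False)                       # the simulator stays frozen
tokens = torch.randn(2, 16, 64)                   # (batch, T*H*W tokens, dim)
print(extract_generative_features(model, tokens).shape)  # (2, 16, 64)

Because the backbone is never finetuned, these features can be precomputed once per scene and reused across questions (see the profiling note below).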

Video generation priors become a reusable visual signal

01 · 3D Awareness Analysis
We use multi-view correspondence to identify which backbones preserve physical geometry across changing viewpoints.

02 · Latent World Simulator
A frozen video diffusion model is queried at intermediate layers and mid-denoising steps, where geometry is strongest.

03 · Adaptive Gated Fusion
Generative and semantic token streams are projected into a shared space and fused with token-level dynamic gating.
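One plausible instantiation of that fusion step is sketched below in PyTorch: both streams are projected to a shared width, and a per-token sigmoid gate decides how much geometry to inject at each position. The projection sizes, gate parameterization, and residual-style injection are illustrative assumptions, not the exact module.

import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    def __init__(self, sem_dim, gen_dim, dim):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)   # semantic visual tokens
        self.gen_proj = nn.Linear(gen_dim, dim)   # generative (geometry) tokens
        self.gate = nn.Sequential(                # per-token scalar gate in (0, 1)
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, sem_tokens, gen_tokens):
        s = self.sem_proj(sem_tokens)             # (B, N, dim)
        g = self.gen_proj(gen_tokens)             # (B, N, dim)
        a = self.gate(torch.cat([s, g], dim=-1))  # (B, N, 1), token-level weight
        return s + a * g                          # inject geometry where useful

fuse = AdaptiveGatedFusion(sem_dim=1024, gen_dim=64, dim=512)
sem = torch.randn(2, 16, 1024)                    # e.g. ViT tokens in the MLLM
gen = torch.randn(2, 16, 64)                      # cached simulator features
print(fuse(sem, gen).shape)                       # (2, 16, 512)

The token-level gate matters: recognition-heavy tokens can keep their semantic features nearly untouched, while geometry-sensitive tokens pull more from the generative stream.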

VEGA-3D pipeline overview
Pipeline. VEGA-3D activates a frozen video generator as a latent world simulator and injects its internal 3D priors into the MLLM visual stream.
Adaptive gated fusion module
Fusion. Token-level adaptive gating balances semantic recognition cues and geometry-aware generative priors.

Why generative features help

Multi-view consistency and attention visualization
Implicit 3D priors. The generative backbone preserves multi-view consistency and sharpens localization for geometry-sensitive reasoning.
Feature-domain comparison and correspondence correlation
Correlation. Higher multi-view correspondence strongly tracks downstream 3D understanding performance.
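Concretely, a common recipe for this correspondence probe (our reading of the setup, not necessarily the paper's exact protocol) is: sample features from two views at geometrically matched pixels, then score how often the nearest neighbor in feature space agrees with the ground-truth match.

import torch
import torch.nn.functional as F

def correspondence_accuracy(feat_a, feat_b, matches):
    # feat_a, feat_b: (N, D) features at matched pixel locations; ground
    # truth pairs row i of view A with row matches[i] of view B.
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.T                       # (N, N) cosine similarities
    pred = sim.argmax(dim=-1)           # nearest neighbor in view B
    return (pred == matches).float().mean().item()

# Sanity check: identical features under a known permutation score 1.0.
feats = torch.randn(100, 32)
perm = torch.randperm(100)
print(correspondence_accuracy(feats, feats[perm], perm.argsort()))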

Consistent gains across understanding, reasoning, and action

ScanRefer Acc@0.5: 56.2 (+4.5 over baseline)
Multi3DRefer F1@0.5: 55.1 (+2.4 over baseline)
VSI-Bench Avg.: 50.5 (+1.6 over baseline)
LIBERO Avg. SR: 97.3% (+0.3 over baseline)

3D scene understanding

VEGA-3D gives the largest improvements on localization-centric tasks, where geometry acts as a strong spatial anchor for MLLMs.

Visual-spatial reasoning

The same generative prior transfers to video-based reasoning tasks such as relative distance, direction, and appearance order.

Embodied manipulation

Physical-world priors also help action models, improving robotic manipulation even in a saturated regime where baseline success rates already approach ceiling.

Better localization in cluttered 3D scenes

ScanRefer success cases
Success cases. VEGA-3D finds the correct object under fine-grained relational descriptions where the baseline drifts to semantically similar distractors.
ScanRefer failure cases
Failure cases. Hard scenes remain, but VEGA-3D's residual localization errors are noticeably more constrained than the baseline's error pattern.

Transfer beyond 3D benchmarks

VSI-Bench relative distance example
Relative distance. VEGA-3D better estimates distance-sensitive relationships across views.
VSI-Bench relative direction example
Relative direction. Generative priors improve directional reasoning in egocentric spatial layouts.
VSI-Bench appearance order example
Appearance order. The same representation also supports temporal ordering and dynamic scene understanding.

Mid-denoising features are the sweet spot

Ablation on denoising step and layer depth
Ablation. Intermediate diffusion steps and selected internal layers consistently provide the most useful spatial priors.
Inference profiling with cached generative features
Profiling. Caching the generative branch per scene substantially reduces latency, memory, and FLOPs overhead.
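A minimal sketch of that caching scheme follows, assuming scenes are identified by an id and the extractor is any callable over video tokens (both assumptions for illustration): the frozen simulator is queried once per scene, and every subsequent question about that scene reuses the cached geometry tokens.

import torch

class GenerativeFeatureCache:
    def __init__(self, extractor):
        self.extractor = extractor      # e.g. extract_generative_features
        self._store = {}

    def get(self, scene_id, video_tokens):
        # Compute once per scene; later queries hit the cache.
        if scene_id not in self._store:
            with torch.no_grad():
                self._store[scene_id] = self.extractor(video_tokens).cpu()
        return self._store[scene_id]

cache = GenerativeFeatureCache(extractor=lambda v: v * 2.0)  # stand-in extractor
scene = torch.randn(1, 16, 64)
f1 = cache.get("scene0001", scene)    # computed on first access
f2 = cache.get("scene0001", scene)    # served from cache
print(torch.equal(f1, f2))            # True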

If you find VEGA-3D useful, please cite

@article{wu2026vega,
  title   = {Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding},
  author  = {Xianjin Wu and Dingkang Liang and Tianrui Feng and Kui Xia and Yumeng Zhang and Xiaofan Li and Xiao Tan and Xiang Bai},
  journal = {arXiv preprint arXiv:2603.19235},
  year    = {2026}
}