Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

VEGA-3D turns a frozen video generation model into a Latent World Simulator, extracting geometry-aware spatiotemporal priors and fusing them with semantic visual tokens for 3D scene understanding, spatial reasoning, and embodied manipulation.

Xianjin Wu1, Dingkang Liang1,†, Tianrui Feng1, Kui Xia2, Yumeng Zhang2, Xiaofan Li2, Xiao Tan2, Xiang Bai1

1 Huazhong University of Science and Technology   2 Baidu Inc., China   † Project Lead

VEGA-3D teaser overview
VEGA-3D sidesteps dependence on explicit 3D inputs and heavy geometric supervision by mining implicit spatial priors from large-scale video generation models.

A plug-and-play route to 3D awareness

Multimodal large language models are strong at semantics, but they remain spatially blind: they struggle with fine-grained geometric reasoning, view-consistent localization, and physical dynamics. VEGA-3D starts from a different observation. If modern video generation models can synthesize coherent videos under camera motion and interaction, they must already encode robust 3D structure and motion priors.

We therefore repurpose a pretrained video diffusion model as a frozen latent world simulator, extract intermediate features at informative denoising stages, and align them with semantic visual tokens through adaptive gated fusion. This produces consistent gains across scene understanding, spatial reasoning, and robotic manipulation without requiring explicit 3D supervision.
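To make the recipe concrete, the sketch below shows the query pattern in PyTorch: noise the clean video latent to a mid denoising timestep, run one forward pass through the frozen backbone, and read off an intermediate layer's activations with a forward hook. The toy backbone, block index, and timestep here are illustrative stand-ins, not the exact model or hyperparameters used by VEGA-3D.

import torch
import torch.nn as nn

class VideoDiffusionBackbone(nn.Module):
    # Toy stand-in for a video diffusion denoiser: a stack of transformer
    # blocks over flattened spatiotemporal tokens plus a timestep embedding.
    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(depth)])
        self.t_embed = nn.Embedding(1000, dim)

    def forward(self, tokens, t):
        h = tokens + self.t_embed(t)[:, None, :]
        for blk in self.blocks:
            h = blk(h)
        return h

@torch.no_grad()
def extract_generative_features(model, video_tokens, t_mid=500, layer=2):
    # Grab an intermediate layer's activations at a mid denoising step,
    # where the spatial priors are found to be strongest.
    feats = {}
    handle = model.blocks[layer].register_forward_hook(
        lambda mod, inp, out: feats.update(h=out))
    noise = torch.randn_like(video_tokens)
    alpha = 1.0 - t_mid / 1000.0                  # simple DDPM-style mix
    noisy = alpha ** 0.5 * video_tokens + (1.0 - alpha) ** 0.5 * noise
    t = torch.full((video_tokens.shape[0],), t_mid, dtype=torch.long)
    model(noisy, t)
    handle.remove()
    return feats["h"]                             # geometry-aware tokens

model = VideoDiffusionBackbone().eval()
for p in model.parameters():
    p.requires_grad_(False)                       # the simulator stays frozen
tokens = torch.randn(2, 16, 64)                   # (batch, T*H*W tokens, dim)
print(extract_generative_features(model, tokens).shape)  # (2, 16, 64)

Because the backbone is never finetuned, these features can be precomputed once per scene and reused across questions (see the profiling note below).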

Video generation priors become a reusable visual signal

01 · 3D Awareness Analysis
We use multi-view correspondence to identify which backbones preserve physical geometry across changing viewpoints.

02 · Latent World Simulator
A frozen video diffusion model is queried at intermediate layers and mid-denoising steps, where geometry is strongest.

03 · Adaptive Gated Fusion
Generative and semantic token streams are projected into a shared space and fused with token-level dynamic gating.
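One plausible instantiation of that fusion step is sketched below in PyTorch: both streams are projected to a shared width, and a per-token sigmoid gate decides how much geometry to inject at each position. The projection sizes, gate parameterization, and residual-style injection are illustrative assumptions, not the exact module.

import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    def __init__(self, sem_dim, gen_dim, dim):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)   # semantic visual tokens
        self.gen_proj = nn.Linear(gen_dim, dim)   # generative (geometry) tokens
        self.gate = nn.Sequential(                # per-token scalar gate in (0, 1)
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, sem_tokens, gen_tokens):
        s = self.sem_proj(sem_tokens)             # (B, N, dim)
        g = self.gen_proj(gen_tokens)             # (B, N, dim)
        a = self.gate(torch.cat([s, g], dim=-1))  # (B, N, 1), token-level weight
        return s + a * g                          # inject geometry where useful

fuse = AdaptiveGatedFusion(sem_dim=1024, gen_dim=64, dim=512)
sem = torch.randn(2, 16, 1024)                    # e.g. ViT tokens in the MLLM
gen = torch.randn(2, 16, 64)                      # cached simulator features
print(fuse(sem, gen).shape)                       # (2, 16, 512)

The token-level gate matters: recognition-heavy tokens can keep their semantic features nearly untouched, while geometry-sensitive tokens pull more from the generative stream.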

VEGA-3D pipeline overview
Pipeline. VEGA-3D activates a frozen video generator as a latent world simulator and injects its internal 3D priors into the MLLM visual stream.
Adaptive gated fusion module
Fusion. Token-level adaptive gating balances semantic recognition cues and geometry-aware generative priors.

Why generative features help

Multi-view consistency and attention visualization
Implicit 3D priors. The generative backbone preserves multi-view consistency and sharpens localization for geometry-sensitive reasoning.
Feature-domain comparison and correspondence correlation
Correlation. Higher multi-view correspondence strongly tracks downstream 3D understanding performance.
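Concretely, a common recipe for this correspondence probe (our reading of the setup, not necessarily the paper's exact protocol) is: sample features from two views at geometrically matched pixels, then score how often the nearest neighbor in feature space agrees with the ground-truth match.

import torch
import torch.nn.functional as F

def correspondence_accuracy(feat_a, feat_b, matches):
    # feat_a, feat_b: (N, D) features at matched pixel locations; ground
    # truth pairs row i of view A with row matches[i] of view B.
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.T                       # (N, N) cosine similarities
    pred = sim.argmax(dim=-1)           # nearest neighbor in view B
    return (pred == matches).float().mean().item()

# Sanity check: identical features under a known permutation score 1.0.
feats = torch.randn(100, 32)
perm = torch.randperm(100)
print(correspondence_accuracy(feats, feats[perm], perm.argsort()))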

Consistent gains across understanding, reasoning, and action

ScanRefer Acc@0.5: 56.2 (+4.5 over baseline)
Multi3DRefer F1@0.5: 55.1 (+2.4 over baseline)
VSI-Bench Avg.: 50.5 (+1.6 over baseline)
LIBERO Avg. SR: 97.3% (+0.3 over baseline)

3D scene understanding

VEGA-3D gives the largest improvements on localization-centric tasks, where geometry acts as a strong spatial anchor for MLLMs.

Visual-spatial reasoning

The same generative prior transfers to video-based reasoning tasks such as relative distance, direction, and appearance order.

Embodied manipulation

Physical-world priors also help action models, improving robotic manipulation even in a saturated regime where baseline success rates already approach ceiling.

Better localization in cluttered 3D scenes

ScanRefer success cases
Success cases. VEGA-3D finds the correct object under fine-grained relational descriptions where the baseline drifts to semantically similar distractors.
ScanRefer failure cases
Failure cases. Hard scenes remain, but VEGA-3D's residual localization errors are noticeably more constrained than the baseline's error pattern.

Transfer beyond 3D benchmarks

VSI-Bench relative distance example
Relative distance. VEGA-3D better estimates distance-sensitive relationships across views.
VSI-Bench relative direction example
Relative direction. Generative priors improve directional reasoning in egocentric spatial layouts.
VSI-Bench appearance order example
Appearance order. The same representation also supports temporal ordering and dynamic scene understanding.

Mid-denoising features are the sweet spot

Ablation on denoising step and layer depth
Ablation. Intermediate diffusion steps and selected internal layers consistently provide the most useful spatial priors.
Inference profiling with cached generative features
Profiling. Caching the generative branch per scene substantially reduces latency, memory, and FLOPs overhead.
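A minimal sketch of that caching scheme follows, assuming scenes are identified by an id and the extractor is any callable over video tokens (both assumptions for illustration): the frozen simulator is queried once per scene, and every subsequent question about that scene reuses the cached geometry tokens.

import torch

class GenerativeFeatureCache:
    def __init__(self, extractor):
        self.extractor = extractor      # e.g. extract_generative_features
        self._store = {}

    def get(self, scene_id, video_tokens):
        # Compute once per scene; later queries hit the cache.
        if scene_id not in self._store:
            with torch.no_grad():
                self._store[scene_id] = self.extractor(video_tokens).cpu()
        return self._store[scene_id]

cache = GenerativeFeatureCache(extractor=lambda v: v * 2.0)  # stand-in extractor
scene = torch.randn(1, 16, 64)
f1 = cache.get("scene0001", scene)    # computed on first access
f2 = cache.get("scene0001", scene)    # served from cache
print(torch.equal(f1, f2))            # True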

If you find VEGA-3D useful, please cite

@article{wu2026vega,
  title   = {Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding},
  author  = {Xianjin Wu and Dingkang Liang and Tianrui Feng and Kui Xia and Yumeng Zhang and Xiaofan Li and Xiao Tan and Xiang Bai},
  journal = {arXiv preprint arXiv:2603.19235},
  year    = {2026}
}