Fengyi Wu1,*,
Yifei Dong1,*,
Zhi-Qi Cheng1,†,
Yilong Dai1,
Guangyu Chen1,
Hang Wang2,
Qi Dai3,
Alexander G Hauptmann4
1UW, 2PolyU, 3Microsoft Research, 4CMU
GoViG introduces a new task in embodied AI: generating navigation instructions directly from egocentric visual observations of the initial and goal states. Unlike previous methods that rely on semantic maps or structured annotations, GoViG operates purely on egocentric visual input—making it highly adaptable to unseen and unstructured environments.
GoViG decomposes the instruction generation task into two interconnected subtasks:
- Navigation Visualization: Predicts intermediate visual states that bridge the initial and goal views.
- Instruction Generation with Visual Cues: Synthesizes linguistically coherent and spatially grounded instructions based on both observed and anticipated visuals.
These components are unified within an autoregressive MLLM, trained with tailored objectives to ensure spatial accuracy and linguistic clarity.
Inspired by human navigation behavior, GoViG supports two multimodal reasoning paradigms:
- One-Pass Reasoning: Generates instructions in a single forward pass.
- Interleaved Reasoning: Alternates between visual prediction and language generation for incremental planning.
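The difference between the two paradigms can be sketched with a toy loop. This is a minimal illustration only: `predict_next_view` and `generate_segment` are hypothetical stand-ins for the MLLM's visual-prediction and text-generation calls, and the "views" are simplified to integer positions rather than egocentric images.

```python
def predict_next_view(current_view, goal_view):
    # Stand-in for visual prediction: move the toy 1-D "view" one step toward the goal.
    return current_view + (1 if goal_view > current_view else -1)

def generate_segment(view):
    # Stand-in for instruction generation conditioned on a predicted view.
    return f"move to position {view}"

def one_pass(initial_view, goal_view):
    # One-Pass Reasoning: a single generation conditioned on both endpoints.
    return f"navigate from {initial_view} to {goal_view}"

def interleaved(initial_view, goal_view):
    # Interleaved Reasoning: alternate visual prediction and instruction
    # generation until the predicted view reaches the goal state.
    view, segments = initial_view, []
    while view != goal_view:
        view = predict_next_view(view, goal_view)
        segments.append(generate_segment(view))
    return "; ".join(segments)

print(interleaved(0, 3))
# → "move to position 1; move to position 2; move to position 3"
```

In the actual model both calls are served by the same autoregressive MLLM; the loop above only shows how interleaved reasoning grounds each instruction segment in a freshly predicted intermediate state, whereas one-pass reasoning commits to the full instruction at once.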
To evaluate GoViG, we introduce R2R-Goal, a dataset combining synthetic and real-world trajectories.
```bash
conda create -n GoViG python=3.10
conda activate GoViG
pip install torch==2.4.0
pip install -r requirements.txt --user
```

We release a partial dataset for debugging and for demonstrating the data format; you can find it in `data_samples`. You can access the full dataset here.

```bash
unzip R2R_Goal.zip
bash train.sh
bash eval.sh
```

You can find the detailed metrics calculation in `taskeval_vis.py`.
We would like to thank ANOLE and MVOT for their publicly available codebases, which we referenced during our implementation of Anole training.
More examples of GoViG results on the Real-world Subset of our R2R-Goal dataset.
If you find this repository or our paper useful, please consider starring this repository and citing our paper:
@article{wu2025govig,
title={GoViG: Goal-Conditioned Visual Navigation Instruction Generation},
author={Wu, Fengyi and Dong, Yifei and Cheng, Zhi-Qi and Dai, Yilong and Chen, Guangyu and Wang, Hang and Dai, Qi and Hauptmann, Alexander G},
journal={arXiv preprint arXiv:2508.09547},
year={2025}
}