Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Official implementation of the CVPR 2025 paper: Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Akhil Perincherry, Jacob Krantz and Stefan Lee

[Project Page] [Paper]

Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and improve navigation performance. To synthesize these visual representations, or “imaginations”, we apply a text-to-image diffusion model to landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues, and an auxiliary loss is added to explicitly encourage relating them to their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of ∼1 point and of up to ∼0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone.
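
For readers less familiar with the two metrics reported above, the following is a minimal sketch of how SR and SPL are commonly computed in the VLN literature. It is not this repository's evaluation code; the function names and the toy episode values are illustrative assumptions.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Percentage of episodes that ended in success."""
    return 100.0 * sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by (shortest path length / path length taken), in percent."""
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not ok:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # A successful episode earns at most 1.0 (agent followed a shortest path).
        total += shortest / denom if denom > 0 else 1.0
    return 100.0 * total / len(successes)


# Toy data (hypothetical): three episodes, two of them successful.
successes = [True, True, False]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 20.0, 5.0]
print("SR :", success_rate(successes))
print("SPL:", spl(successes, shortest, taken))
```

In the toy data, the second episode succeeds but takes a path twice as long as the shortest one, so it contributes 0.5 rather than 1 to SPL. This is why SPL gains are typically smaller than SR gains when added successes come with longer paths.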

Installation

Install the Matterport3D simulator: follow the instructions from here to install the latest version.

export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
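
After setting PYTHONPATH, it can be useful to confirm that the simulator's Python bindings are visible to the interpreter. The sketch below assumes the bindings are exposed as a module named `MatterSim` (the name used by the Matterport3D simulator); the helper name `module_available` is illustrative.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module `name` can be located on sys.path."""
    return importlib.util.find_spec(name) is not None


if module_available("MatterSim"):
    print("MatterSim found on PYTHONPATH")
else:
    print("MatterSim not found; check that PYTHONPATH includes Matterport3DSimulator/build")
```

The check only locates the module without importing it, so it will not catch a build that is present but fails to load.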

Set up VLN-DUET and VLN-HAMT using their official instructions.
Install requirements:

conda create --name vln-imagine python=3.8.5
conda activate vln-imagine
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

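# install timm from source at the pinned commit
git clone https://github.com/rwightman/pytorch-image-models.git
cd pytorch-image-models
git checkout 9cc7dda6e5fcbbc7ac5ba5d2d44050d2a8e3e38d
# Assumption: install this checkout into the active environment so that
# `import timm` picks up the pinned version.
pip install -e .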
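
To confirm that the environment matches the versions pinned in the pip command above, a small check along these lines may help. The helper name `check_pins` is hypothetical, and the comparison is an exact string match, which assumes the installed package metadata reports the same version string (including any `+cu101` suffix) as the one requested.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions pinned in the installation command above.
PINS = {
    "torch": "1.7.1+cu101",
    "torchvision": "0.8.2+cu101",
    "torchaudio": "0.7.2",
}


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package to (expected version, installed version or None)."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed)
    return report


for name, (expected, installed) in check_pins(PINS).items():
    status = "ok" if installed == expected else "MISMATCH"
    print(f"{name}: expected {expected}, installed {installed} [{status}]")
```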

Download checkpoints and features from here. Files include:

Off-the-shelf ViT features for R2R-Imagine.
HAMT ViT features for R2R-Imagine.
HAMT-Imagine R2R checkpoint.
DUET-Imagine R2R checkpoint.

(Optional) Download the imagination generations for R2R from here, and the generation metadata and noun-phrase segments of R2R instructions from here.
Run: adjust the paths to the downloaded files, then run training/inference for HAMT and DUET from their respective folders:

cd <folder-of-HAMT/DUET src>
bash scripts/run_r2r.sh
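
Since the run scripts depend on the paths to the downloaded features and checkpoints being set correctly, a quick pre-flight check can catch typos before a long job starts. This is a minimal sketch: the labels and paths in `EXPECTED` are placeholders invented for illustration, not file names from this repository, and should be replaced with the locations configured in `scripts/run_r2r.sh`.

```python
from pathlib import Path
from typing import Dict, List

# Placeholder locations (hypothetical); substitute the paths you configured.
EXPECTED = {
    "image features": "datasets/R2R/features/example_features.hdf5",
    "model checkpoint": "datasets/R2R/trained_models/example_checkpoint",
}


def missing_paths(expected: Dict[str, str]) -> List[str]:
    """Return the labels whose paths do not exist on disk."""
    return [label for label, path in expected.items() if not Path(path).exists()]


missing = missing_paths(EXPECTED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All configured paths exist.")
```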

License

Our code is MIT licensed. Trained models are considered data derived from the Matterport3D scene dataset and are distributed according to the Matterport3D Terms of Use.

Citing

@InProceedings{aperinch_2025_VLN_Imagine,
          title={Do Visual Imaginations Improve Vision-and-Language Navigation Agents?},
          author={Akhil Perincherry and Jacob Krantz and Stefan Lee},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
          month={June},
          year={2025},
}
