Learning to Navigate using Stereo Cameras with Auxiliary Occupancy Voxels
Hongyu Li, Taskin Padir, Huaizu Jiang
Northeastern University
- Model code release
- Dataset release
- Simulation environment release
StereoNavNet (SNN) is a visual navigation approach that combines stereo vision with 3D occupancy voxel grids for robust, geometry-aware robot navigation. Unlike semantic approaches that learn navigation directly from raw images, SNN uses an explicit 3D geometric representation as an intermediate step, leading to higher learning efficiency and improved generalizability across environments.
SNN follows a modular design with two main components:
```
Stereo Images (I_L, I_R)
           │
           ▼
┌─────────────────────────┐
│   Perception Module     │  (StereoVoxelNet / SVN)
│   Stereo → 3D Voxels    │
│                         │  Loss: IoU between predicted and GT voxel grids
└──────────┬──────────────┘
           │  Occupancy Grid G (64×64×64)
           ▼
┌─────────────────────────┐
│     Policy Module       │  (3D CNN + MLP)
│  Voxels + Goal → Action │
│                         │  Loss: MSE between predicted and expert actions
└──────────┬──────────────┘
           │
           ▼
   Action a = (v, ω)
```
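The modular interface above can be wired roughly as follows. This is an interface sketch, not the released implementation: the stand-in callables, the image resolution, and the module internals are placeholders.

```python
import torch
import torch.nn as nn


class StereoNavNet(nn.Module):
    """Illustrative wiring of the two-module design (sketch only)."""

    def __init__(self, perception, policy):
        super().__init__()
        self.perception = perception   # stereo pair -> 64^3 occupancy grid
        self.policy = policy           # occupancy grid + goal -> (v, ω)

    def forward(self, left, right, goal):
        grid = self.perception(left, right)   # (B, 64, 64, 64)
        return self.policy(grid, goal)        # (B, 2)


# Stand-in callables that only exercise the interface shapes.
def fake_perception(left, right):
    return torch.zeros(left.shape[0], 64, 64, 64)


def fake_policy(grid, goal):
    return torch.zeros(grid.shape[0], 2)


model = StereoNavNet(fake_perception, fake_policy)
action = model(torch.rand(2, 3, 240, 424),   # I_L (resolution is a placeholder)
               torch.rand(2, 3, 240, 424),   # I_R
               torch.rand(2, 2))             # goal = (distance, angle)
```

The design choice worth noting is that the occupancy grid is the only interface between the two modules, which is what allows each to be trained separately before joint fine-tuning.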
The perception module, StereoVoxelNet (SVN), estimates a 3D voxel occupancy grid from a pair of stereo RGB images:
- Feature Extraction: MobileNet-based backbone extracts features from left and right images
- Cost Volume: Constructs a stereo cost volume by interweaving left/right features at multiple disparity levels
- 3D Decoder (UNet): A 2D-to-3D encoder-decoder architecture predicts the 64×64×64 occupancy grid
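The cost-volume step can be sketched as follows. This is a minimal illustration of pairing left features with disparity-shifted right features; the exact channel interleaving and disparity range used in StereoVoxelNet may differ.

```python
import torch


def build_cost_volume(feat_l, feat_r, max_disp):
    """Pair left/right backbone features at each candidate disparity.

    feat_l, feat_r: (B, C, H, W) feature maps from the shared backbone.
    Returns a (B, 2C, max_disp, H, W) volume: at disparity d, the right
    features are shifted d pixels before being stacked with the left.
    """
    B, C, H, W = feat_l.shape
    volume = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = feat_l
            volume[:, C:, d] = feat_r
        else:
            # Shift the right features by d pixels before pairing.
            volume[:, :C, d, :, d:] = feat_l[:, :, :, d:]
            volume[:, C:, d, :, d:] = feat_r[:, :, :, :-d]
    return volume


vol = build_cost_volume(torch.rand(1, 8, 16, 32), torch.rand(1, 8, 16, 32), max_disp=4)
```

The resulting 5D volume is what lets the subsequent encoder-decoder reason jointly over appearance and disparity when predicting the occupancy grid.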
The policy module (Planner) extracts geometric obstacle features from the occupancy grid and predicts velocity actions:
- 3D CNN Feature Extraction: Four 3D convolutional layers with channels [4, 8, 16, 32], kernel size 4, stride 2 process the occupancy grid into a 256-dimensional obstacle feature vector h_o
- Goal Encoding: The relative goal position is represented as (distance, angle)
- Action Prediction: An MLP with layers [258→128→32→8→2] (each with BatchNorm + ReLU) takes concatenated [h_o, goal] and predicts the velocity pair (v, ω)
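A minimal sketch of the policy module with the dimensions stated above. Two details are assumptions where the text is silent: zero padding in the 3D convolutions (chosen so the flattened feature comes out to exactly 256 dimensions: 64 → 31 → 14 → 6 → 2, and 32 × 2³ = 256) and a plain linear output head, since (v, ω) can be negative.

```python
import torch
import torch.nn as nn


class Planner(nn.Module):
    """Sketch of the policy module: 3D CNN encoder + MLP head."""

    def __init__(self):
        super().__init__()
        # Channels [4, 8, 16, 32], kernel 4, stride 2 (padding 0 assumed).
        chans = [1, 4, 8, 16, 32]
        convs = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv3d(cin, cout, kernel_size=4, stride=2),
                      nn.BatchNorm3d(cout), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*convs)   # 64^3 grid -> (32, 2, 2, 2)

        # MLP 258 -> 128 -> 32 -> 8 -> 2, BatchNorm + ReLU on hidden layers.
        dims = [258, 128, 32, 8]
        mlp = []
        for din, dout in zip(dims[:-1], dims[1:]):
            mlp += [nn.Linear(din, dout), nn.BatchNorm1d(dout), nn.ReLU(inplace=True)]
        mlp.append(nn.Linear(dims[-1], 2))     # (v, ω) head, linear (assumed)
        self.mlp = nn.Sequential(*mlp)

    def forward(self, grid, goal):
        # grid: (B, 64, 64, 64) occupancy; goal: (B, 2) = (distance, angle)
        h_o = self.encoder(grid.unsqueeze(1)).flatten(1)   # (B, 256)
        return self.mlp(torch.cat([h_o, goal], dim=1))     # (B, 2)


planner = Planner().eval()
with torch.no_grad():
    action = planner(torch.rand(2, 64, 64, 64), torch.rand(2, 2))
```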
SNN is trained in three stages:
- Stage 1 – Train the perception module (SVN) using stereo images and ground-truth voxel occupancy grids (IoU loss), 150 epochs
- Stage 2 – Train the policy module (Planner) using ground-truth occupancy as input (MSE loss on expert actions), 50 epochs
- Stage 3 – Joint end-to-end fine-tuning with combined loss:
L = L_perception + α · L_policy (α = 0.1), 300 epochs
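The objectives above can be sketched as follows. A soft, differentiable IoU is assumed for the perception loss; the exact formulation in the released code may differ.

```python
import torch
import torch.nn.functional as F


def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss (1 - IoU) between predicted occupancy probabilities
    and the ground-truth voxel grid, shape (B, 64, 64, 64)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()


def joint_loss(pred_grid, gt_grid, pred_action, expert_action, alpha=0.1):
    """Stage-3 objective: L = L_perception + α · L_policy."""
    l_perception = iou_loss(pred_grid, gt_grid)
    l_policy = F.mse_loss(pred_action, expert_action)   # MSE to expert actions
    return l_perception + alpha * l_policy


# Stage 1 uses iou_loss alone, Stage 2 uses the MSE term alone,
# and Stage 3 combines them as above.
grid = (torch.rand(2, 64, 64, 64) > 0.5).float()
loss = joint_loss(grid, grid, torch.zeros(2, 2), torch.zeros(2, 2))
```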
```
stereonavnet/
├── __init__.py       # Package init
├── args.py           # Command-line argument definitions
├── dataset.py        # Dataset loading and preprocessing
├── network.py        # Core model definitions (SVN, Planner, SNN)
├── submodule.py      # MobileNet feature extraction, cost volume utilities
├── train.py          # Training script with modular learning
└── stereonavnet/     # Package directory
```
Stage 1: Train Perception Module (StereoVoxelNet)

```bash
python train.py --stereo_only --epochs 150
```

Stage 2: Train Policy Module (Planner)

```bash
python train.py --planner_only --epochs 50
```

Stage 3: Joint End-to-End Fine-Tuning (SNN)

```bash
python train.py --epochs 300 --alpha 0.1
```

Testing

```bash
python train.py --test
```

| Argument | Default | Description |
|---|---|---|
| `--stereo_only` | False | Train only the perception module (SVN) |
| `--planner_only` | False | Train only the policy module with GT occupancy |
| `--epochs` | 400 | Number of training epochs |
| `--batch_size` | 4 | Training batch size |
| `--lr` | 1e-3 | Learning rate (Adam optimizer) |
| `--alpha` | 0.1 | Weight for velocity loss in joint training |
| `--gpus` | 1 | Number of GPUs |
| `--val_frequency` | 3 | Validate every N epochs |
| `--dataset_portion` | 1.0 | Fraction of training data to use |
| `--test` | False | Run in test mode |
@INPROCEEDINGS{10802385,
author={Li, Hongyu and Padır, Taşkın and Jiang, Huaizu},
booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={StereoNavNet: Learning to Navigate using Stereo Cameras with Auxiliary Occupancy Voxels},
year={2024},
pages={5559-5566},
doi={10.1109/IROS58592.2024.10802385}}