Learning to Navigate using Stereo Cameras with Auxiliary Occupancy Voxels
Hongyu Li, Taskin Padir, Huaizu Jiang
Northeastern University
- Model code release
- Dataset release
- Simulation environment release
StereoNavNet (SNN) is a visual navigation approach that combines stereo vision with 3D occupancy voxel grids for robust, geometry-aware robot navigation. Unlike semantic approaches that learn navigation directly from raw images, SNN uses an explicit 3D geometric representation as an intermediate step, leading to higher learning efficiency and improved generalizability across environments.
SNN follows a modular design with two main components:
```
Stereo Images (I_L, I_R)
           │
           ▼
┌─────────────────────────┐
│   Perception Module     │  (StereoVoxelNet / SVN)
│   Stereo → 3D Voxels    │
│                         │  Loss: IoU between predicted and GT voxel grids
└──────────┬──────────────┘
           │  Occupancy Grid G (64×64×64)
           ▼
┌─────────────────────────┐
│     Policy Module       │  (3D CNN + MLP)
│  Voxels + Goal → Action │
│                         │  Loss: MSE between predicted and expert actions
└──────────┬──────────────┘
           │
           ▼
   Action a = (v, ω)
```
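The modular interface above can be wired roughly as follows. This is an interface sketch, not the released implementation: the stand-in callables, the image resolution, and the module internals are placeholders.

```python
import torch
import torch.nn as nn


class StereoNavNet(nn.Module):
    """Illustrative wiring of the two-module design (sketch only)."""

    def __init__(self, perception, policy):
        super().__init__()
        self.perception = perception   # stereo pair -> 64^3 occupancy grid
        self.policy = policy           # occupancy grid + goal -> (v, ω)

    def forward(self, left, right, goal):
        grid = self.perception(left, right)   # (B, 64, 64, 64)
        return self.policy(grid, goal)        # (B, 2)


# Stand-in callables that only exercise the interface shapes.
def fake_perception(left, right):
    return torch.zeros(left.shape[0], 64, 64, 64)


def fake_policy(grid, goal):
    return torch.zeros(grid.shape[0], 2)


model = StereoNavNet(fake_perception, fake_policy)
action = model(torch.rand(2, 3, 240, 424),   # I_L (resolution is a placeholder)
               torch.rand(2, 3, 240, 424),   # I_R
               torch.rand(2, 2))             # goal = (distance, angle)
```

The design choice worth noting is that the occupancy grid is the only interface between the two modules, which is what allows each to be trained separately before joint fine-tuning.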
The perception module, StereoVoxelNet (SVN), estimates a 3D voxel occupancy grid from a pair of stereo RGB images:
- Feature Extraction: MobileNet-based backbone extracts features from left and right images
- Cost Volume: Constructs a stereo cost volume by interweaving left/right features at multiple disparity levels
- 3D Decoder (UNet): A 2D-to-3D encoder-decoder architecture predicts the 64×64×64 occupancy grid
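The cost-volume step can be sketched as follows. This is a minimal illustration of pairing left features with disparity-shifted right features; the exact channel interleaving and disparity range used in StereoVoxelNet may differ.

```python
import torch


def build_cost_volume(feat_l, feat_r, max_disp):
    """Pair left/right backbone features at each candidate disparity.

    feat_l, feat_r: (B, C, H, W) feature maps from the shared backbone.
    Returns a (B, 2C, max_disp, H, W) volume: at disparity d, the right
    features are shifted d pixels before being stacked with the left.
    """
    B, C, H, W = feat_l.shape
    volume = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = feat_l
            volume[:, C:, d] = feat_r
        else:
            # Shift the right features by d pixels before pairing.
            volume[:, :C, d, :, d:] = feat_l[:, :, :, d:]
            volume[:, C:, d, :, d:] = feat_r[:, :, :, :-d]
    return volume


vol = build_cost_volume(torch.rand(1, 8, 16, 32), torch.rand(1, 8, 16, 32), max_disp=4)
```

The resulting 5D volume is what lets the subsequent encoder-decoder reason jointly over appearance and disparity when predicting the occupancy grid.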
The policy module (Planner) extracts geometric obstacle features from the occupancy grid and predicts velocity actions:
- 3D CNN Feature Extraction: Four 3D convolutional layers with channels [4, 8, 16, 32], kernel size 4, stride 2 process the occupancy grid into a 256-dimensional obstacle feature vector h_o
- Goal Encoding: The relative goal position is represented as (distance, angle)
- Action Prediction: An MLP with layers [258→128→32→8→2] (each with BatchNorm + ReLU) takes concatenated [h_o, goal] and predicts the velocity pair (v, ω)
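A minimal sketch of the policy module with the dimensions stated above. Two details are assumptions where the text is silent: zero padding in the 3D convolutions (chosen so the flattened feature comes out to exactly 256 dimensions: 64 → 31 → 14 → 6 → 2, and 32 × 2³ = 256) and a plain linear output head, since (v, ω) can be negative.

```python
import torch
import torch.nn as nn


class Planner(nn.Module):
    """Sketch of the policy module: 3D CNN encoder + MLP head."""

    def __init__(self):
        super().__init__()
        # Channels [4, 8, 16, 32], kernel 4, stride 2 (padding 0 assumed).
        chans = [1, 4, 8, 16, 32]
        convs = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv3d(cin, cout, kernel_size=4, stride=2),
                      nn.BatchNorm3d(cout), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*convs)   # 64^3 grid -> (32, 2, 2, 2)

        # MLP 258 -> 128 -> 32 -> 8 -> 2, BatchNorm + ReLU on hidden layers.
        dims = [258, 128, 32, 8]
        mlp = []
        for din, dout in zip(dims[:-1], dims[1:]):
            mlp += [nn.Linear(din, dout), nn.BatchNorm1d(dout), nn.ReLU(inplace=True)]
        mlp.append(nn.Linear(dims[-1], 2))     # (v, ω) head, linear (assumed)
        self.mlp = nn.Sequential(*mlp)

    def forward(self, grid, goal):
        # grid: (B, 64, 64, 64) occupancy; goal: (B, 2) = (distance, angle)
        h_o = self.encoder(grid.unsqueeze(1)).flatten(1)   # (B, 256)
        return self.mlp(torch.cat([h_o, goal], dim=1))     # (B, 2)


planner = Planner().eval()
with torch.no_grad():
    action = planner(torch.rand(2, 64, 64, 64), torch.rand(2, 2))
```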
SNN is trained in three stages:
- Stage 1 – Train the perception module (SVN) using stereo images and ground-truth voxel occupancy grids (IoU loss), 150 epochs
- Stage 2 – Train the policy module (Planner) using ground-truth occupancy as input (MSE loss on expert actions), 50 epochs
- Stage 3 – Joint end-to-end fine-tuning with combined loss:
L = L_perception + α · L_policy (α = 0.1), 300 epochs
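The objectives above can be sketched as follows. A soft, differentiable IoU is assumed for the perception loss; the exact formulation in the released code may differ.

```python
import torch
import torch.nn.functional as F


def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss (1 - IoU) between predicted occupancy probabilities
    and the ground-truth voxel grid, shape (B, 64, 64, 64)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()


def joint_loss(pred_grid, gt_grid, pred_action, expert_action, alpha=0.1):
    """Stage-3 objective: L = L_perception + α · L_policy."""
    l_perception = iou_loss(pred_grid, gt_grid)
    l_policy = F.mse_loss(pred_action, expert_action)   # MSE to expert actions
    return l_perception + alpha * l_policy


# Stage 1 uses iou_loss alone, Stage 2 uses the MSE term alone,
# and Stage 3 combines them as above.
grid = (torch.rand(2, 64, 64, 64) > 0.5).float()
loss = joint_loss(grid, grid, torch.zeros(2, 2), torch.zeros(2, 2))
```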
```
stereonavnet/
├── __init__.py       # Package init
├── args.py           # Command-line argument definitions
├── dataset.py        # Dataset loading and preprocessing
├── network.py        # Core model definitions (SVN, Planner, SNN)
├── submodule.py      # MobileNet feature extraction, cost volume utilities
├── train.py          # Training script with modular learning
└── stereonavnet/     # Package directory
```
Stage 1: Train Perception Module (StereoVoxelNet)

```bash
python train.py --stereo_only --epochs 150
```

Stage 2: Train Policy Module (Planner)

```bash
python train.py --planner_only --epochs 50
```

Stage 3: Joint End-to-End Fine-Tuning (SNN)

```bash
python train.py --epochs 300 --alpha 0.1
```

Testing

```bash
python train.py --test
```

| Argument | Default | Description |
|---|---|---|
| `--stereo_only` | False | Train only the perception module (SVN) |
| `--planner_only` | False | Train only the policy module with GT occupancy |
| `--epochs` | 400 | Number of training epochs |
| `--batch_size` | 4 | Training batch size |
| `--lr` | 1e-3 | Learning rate (Adam optimizer) |
| `--alpha` | 0.1 | Weight for velocity loss in joint training |
| `--gpus` | 1 | Number of GPUs |
| `--val_frequency` | 3 | Validate every N epochs |
| `--dataset_portion` | 1.0 | Fraction of training data to use |
| `--test` | False | Run in test mode |
@INPROCEEDINGS{10802385,
author={Li, Hongyu and Padır, Taşkın and Jiang, Huaizu},
booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={StereoNavNet: Learning to Navigate using Stereo Cameras with Auxiliary Occupancy Voxels},
year={2024},
pages={5559-5566},
doi={10.1109/IROS58592.2024.10802385}}