JarvisLee0423/GREAT-Stereo

🚀 GREAT-Stereo (ICCV 2025) 🚀

A significantly extended version of GREAT, termed GREATEN, is available: Paper, Code.

This repository contains the source code for our paper.

Paper | Paper Page | YouTube

Global Regulation and Excitation via Attention Tuning for Stereo Matching (GREAT-Stereo) arxiv

Jiahao Li, Xinhong Chen, Zhengmin Jiang, Qian Zhou, Yung-Hui Li, Jianping Wang

architecture

💡 Abstract

Stereo matching has achieved significant progress with iterative algorithms such as RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless surfaces, or repetitive patterns, because they lack the global context and geometric information needed for effective iterative refinement. To enable existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework, which encompasses three attention modules. Specifically, Spatial Attention (SA) captures global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods, collectively denoted as GREAT-Stereo, and validate it through extensive experiments. The framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, our GREAT-IGEV ranks first among all published methods on the Scene Flow test set and the KITTI 2015 and ETH3D leaderboards, and second on the Middlebury benchmark.

Our main contributions are:

  • We propose a universal framework that can be integrated into existing iterative stereo-matching methods to improve the performance in ill-posed regions.
  • We introduce Spatial (SA), Matching (MA), and Volume (VA) Attentions, designed to mitigate ambiguities in ill-posed regions with global context information.
  • Our method outperforms existing published methods on public leaderboards such as SceneFlow, KITTI, ETH3D, and Middlebury, with especially significant improvements in ill-posed regions.
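The three attention modules are the core of the framework. As a rough illustration of the idea behind Matching Attention, aggregating global context along epipolar lines (which in a rectified stereo pair are the image rows), here is a minimal NumPy sketch; it is hypothetical, single-head, and not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def epipolar_attention(feat):
    """Self-attention restricted to image rows (epipolar lines in a
    rectified pair). feat: (H, W, C) feature map. Illustrative sketch:
    queries, keys, and values are the features themselves."""
    H, W, C = feat.shape
    q = k = v = feat
    # (H, W, W): for each row, attention weights over that row's pixels.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C), axis=-1)
    # Each pixel aggregates context only from its own epipolar line.
    return attn @ v  # (H, W, C)

feat = np.random.default_rng(0).normal(size=(4, 8, 16))
out = epipolar_attention(feat)
assert out.shape == feat.shape
```

Because the attention matrix is row-wise, each output pixel depends only on features along its own epipolar line, which is exactly the search space of disparity matching in rectified stereo.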

✅ To Do List

  • The real-time version of the GREAT Framework. (Hint: this item will be implemented in our extended version of GREAT, termed GREATEN.)
  • The GPU-memory-friendly implementation of Matching Attention. (Hint: see the GREATEN repository.)
  • The Foundation-Model-based experiments.
  • The solid and robust version of the GREAT Framework.
  • The accelerated training and evaluation pipeline.

🆕 Solid Version of GREAT-Stereo

We now provide a solid and robust version of our GREAT Framework, which obtains better performance on the SceneFlow and public KITTI 2012/2015 benchmarks, especially in ill-posed regions such as occlusions. Meanwhile, the Foundation-Model version of our GREAT-IGEV achieves performance comparable to current SOTA Foundation-Model-based architectures.

We have merged the solid and robust version of GREAT-Stereo into the great-stereo folder.

Our main modifications are:

  • We simplify the implementation of Volume Attention.
  • We extend the application of Spatial Attention.
  • We remove the redundant implementation of receptive augmentation.
  • We modify the cost volume construction pipeline with combined cost volume.
  • We implement a Foundation-Model (DepthAny) based GREAT-IGEV, named GREAT-IGEV-DepthAny, by replacing the MobileNetV2 backbone with DepthAnythingV2 (following the implementation in Monster), and conduct the Foundation-Model-based experiments.
  • We accelerate training and evaluation using DistributedDataParallel.
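With DistributedDataParallel, training is typically launched with one process per GPU. A hypothetical torchrun invocation might look like the following; the script name is illustrative (the actual entry points are the launcher scripts under launchers/stereo_matching/train_launcher/), while the flags mirror the parameters documented below:

```shell
# torchrun spawns one process per GPU and sets RANK / WORLD_SIZE /
# LOCAL_RANK for DistributedDataParallel to pick up.
torchrun --nproc_per_node=4 train_stereo.py \
    --train_datasets sceneflow \
    --train_datasets_root your/path/to/sceneflow \
    --logdir your/path/to/save/training/information
```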

The benchmark results and corresponding checkpoints are:

*Columns EPE through Non-Occ-D3 are SceneFlow metrics; Out-Noc/Out-All (2px, 3px) are KITTI 2012; the D1 columns are KITTI 2015.*

**Light-Weight Model**

| Models | EPE | D3 | Occ-EPE | Occ-D3 | Non-Occ-EPE | Non-Occ-D3 | Out-Noc (2px) | Out-All (2px) | Out-Noc (3px) | Out-All (3px) | D1-All | D1-bg | Noc-D1-All | Noc-D1-bg | Params | Run Time | Checkpoints |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LEA-Stereo | 0.78 | - | - | - | - | - | 1.90 | 2.39 | 1.13 | 1.45 | 1.65 | 1.40 | 1.51 | 1.29 | 1.81M | 0.30s | - |
| ACVNet | 0.48 | - | - | - | - | - | 1.83 | 2.34 | 1.13 | 1.47 | 1.65 | 1.37 | 1.52 | 1.26 | 6.20M | 0.20s | - |
| IGEV-Stereo | 0.48 | - | 1.65 | - | 0.19 | - | 1.71 | 2.17 | 1.12 | 1.44 | 1.59 | 1.38 | 1.49 | 1.27 | 12.60M | 0.32s | - |
| Selective-IGEV | 0.45 | - | 1.57 | - | 0.17 | - | 1.59 | 2.05 | 1.07 | 1.38 | 1.55 | 1.33 | 1.44 | 1.22 | 13.14M | 0.24s | - |
| IGEV++ | 0.43 | - | - | - | - | - | 1.56 | 2.03 | 1.04 | 1.36 | 1.51 | 1.31 | 1.42 | 1.20 | 14.53M | 0.28s | - |
| GREAT-IGEV (Ours) | 0.41 | 2.20 | 1.51 | 10.12 | 0.14 | 0.49 | 1.51 | 2.00 | 1.02 | 1.37 | 1.50 | 1.28 | 1.37 | 1.14 | 14.44M | 0.33s | Google Drive |
| GREAT-Selective (Ours) | 0.42 | 2.19 | 1.52 | 10.11 | 0.15 | 0.48 | 1.48 | 1.94 | 1.00 | 1.31 | 1.49 | 1.27 | 1.40 | 1.16 | 14.98M | 0.43s | Google Drive |
| GREAT-IGEV-Solid (Ours) | 0.39 | 2.13 | 1.48 | 9.08 | 0.12 | 0.47 | 1.47 | 1.98 | 0.95 | 1.32 | 1.47 | 1.25 | 1.37 | 1.14 | 18.4M | 0.33s | Google Drive |
| GREAT-Selective-Solid (Ours) | 0.38 | 2.07 | 1.46 | 8.85 | 0.11 | 0.46 | - | - | - | - | - | - | - | - | 18.9M | 0.43s | Google Drive |

**Foundation Model**

| Models | EPE | D3 | Occ-EPE | Occ-D3 | Non-Occ-EPE | Non-Occ-D3 | Out-Noc (2px) | Out-All (2px) | Out-Noc (3px) | Out-All (3px) | D1-All | D1-bg | Noc-D1-All | Noc-D1-bg | Params | Run Time | Checkpoints |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTA-Stereo | 0.34 | - | - | - | - | - | 1.46 | 1.80 | 0.93 | 1.16 | 1.50 | 1.21 | 1.41 | 1.12 | - | - | - |
| AIO-Stereo | - | - | - | - | - | - | 1.58 | 1.94 | 1.05 | 1.29 | 1.54 | 1.34 | 1.43 | 1.22 | - | - | - |
| Foundation-Stereo | 0.34 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| DEFOM-Stereo | 0.42 | - | - | - | - | - | 1.43 | 1.79 | 0.94 | 1.18 | 1.41 | 1.25 | 1.33 | 1.15 | - | 0.30s | - |
| IGEV++ (DepthAny) | - | - | - | - | - | - | 1.36 | 1.74 | 0.89 | 1.13 | 1.43 | 1.15 | 1.36 | 1.07 | 348M | 0.48s | - |
| Monster | 0.37 | 2.00 | 1.35 | 9.18 | 0.14 | 0.44 | 1.36 | 1.75 | 0.84 | 1.09 | 1.41 | 1.13 | 1.33 | 1.05 | 388M | 0.45s | - |
| GREAT-IGEV-DepthAny (Ours) | 0.36 | 2.03 | 1.41 | 8.70 | 0.11 | 0.45 | 1.34 | 1.76 | 0.85 | 1.13 | 1.43 | 1.15 | 1.36 | 1.07 | 386M | 0.43s | Google Drive |

The zero-shot results for Foundation Models are:

| Models | SceneFlow (EPE) | KITTI2012 (D3) | KITTI2015 (D3) | Middlebury (D2) | ETH3D (D1) |
| --- | --- | --- | --- | --- | --- |
| StereoAnywhere | - | 3.90 | 3.93 | 6.96 | 1.66 |
| FoundationStereo | 0.34 | - | - | 5.5 | 1.8 |
| DEFOM-Stereo | 0.42 | 3.76 | 4.99 | 5.91 | 2.35 |
| Monster | 0.38 | 3.37 | 3.44 | 3.67 | 1.10 |
| Monster* | 0.39 | 4.82 | 5.98 | 4.66 | 9.15 |
| GREAT-IGEV-DepthAny (Ours) | 0.36 | 4.31 | 5.48 | 3.35 | 5.82 |
| GREAT-IGEV-DepthAny* (Ours) | 0.39 | 4.34 | 5.56 | 3.26 | 2.48 |

PS: Monster* is the result of reproducing the SceneFlow experiment with the official Monster code; see issue #28 in the official repository for details.

PS: GREAT-IGEV-DepthAny* is the SceneFlow result after zero-shot checkpoint selection, following issue #23 in the official Monster repository.

🎬 Demos & Results

RAFT Demo | IGEV Demo | Selective Demo

sceneflow_vis

Qualitative results of GREAT-IGEV on the Scene Flow test set of occlusion (Row 1), textureless (Row 2), and repetitive texture (Row 3) regions.

sota_comparison
transferability

Comparisons with state-of-the-art stereo methods on different public benchmarks and ablation study of the cross-model transferability of the proposed GREAT framework on the Scene Flow test set.

⚙️ Environment Settings

  • NVIDIA RTX 3090 or 4090
  • Python 3.8

```shell
conda create -n great python=3.8
conda activate great

pip install torch torchvision torchaudio xformers==0.0.22.post3+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm==4.67.1
pip install scipy==1.10.1
pip install opencv-python==4.11.0.86
pip install scikit-image==0.21.0
pip install tensorboard==2.12.0
pip install matplotlib==3.7.5
pip install timm==0.5.4
pip install numpy==1.24.1
pip install einops==0.8.1
pip install open3d==0.19.0
```

💾 Required Data

🧪 Evaluation

  1. Download the Checkpoints from Google Drive.

  2. Change the following parameters in the script located at launchers/stereo_matching/test_launcher/.

    • dataset
      • Choices => [sceneflow, kitti, eth3d, middlebury_(Q | H | F)]
    • dataset_root
      • your/path/to/corresponding/dataset
    • restore_ckpt
      • your/path/to/checkpoint
    • max_disp (Optional)
      • 768 for Middlebury and 192 for others
  3. Run the evaluation (e.g., evaluation of GREAT-IGEV on the Scene Flow test set).

```shell
./launchers/stereo_matching/test_launcher/great_igev_evaluator.sh
```

  4. (Optional) You can also change the eval_mode in the evaluation script to produce different outputs:
    • metric generates quantitative evaluation results (default).
    • pcgen generates a point cloud of the predicted disparity for visualization.
    • cvvis generates a visualization of the cost volume.
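For reference, the edited portion of a test launcher might look like the following shell fragment. The variable names mirror the parameters listed above, but this is an assumption about the script's layout; the actual contents may differ:

```shell
# Hypothetical excerpt of a script in launchers/stereo_matching/test_launcher/.
dataset=sceneflow                      # or: kitti, eth3d, middlebury_(Q|H|F)
dataset_root=your/path/to/sceneflow
restore_ckpt=your/path/to/checkpoint
max_disp=192                           # 768 for Middlebury
eval_mode=metric                       # or: pcgen, cvvis
```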

📚 Training

  1. Change the following parameters in the script located at launchers/stereo_matching/train_launcher/.

    • logdir
      • your/path/to/save/training/information
    • train_datasets
      • Choices => [sceneflow, vkitti2, kitti, eth3d_train, eth3d_finetune, middlebury_train, middlebury_finetune]
    • train_datasets_root
      • your/path/to/corresponding/dataset
    • restore_ckpt (Optional)
      • your/path/to/checkpoint/for/finetuning
  2. Run the training (e.g., training of GREAT-IGEV on the Scene Flow dataset).

```shell
./launchers/stereo_matching/train_launcher/great_igev_trainer.sh
```

  3. (Optional) You can change the trainer in the script from stereo_trainer.py to stereo_resumable_trainer.py, which can resume training if the process is accidentally shut down. stereo_resumable_trainer.py saves checkpoints for the model, optimizer, and learning-rate scheduler so training can be resumed.

  4. (Optional) Thanks to the IGEV-Stereo repository, we also provide a choice of data type for mixed-precision training. You can change it with precision_dtype in the script; choices are float32, float16, and bfloat16, with float16 as the default. NOTE: our provided checkpoints were trained with float16 and float32.
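When picking a precision_dtype, it helps to know what float16 trades away. The plain-NumPy snippet below (illustrative only, unrelated to the repository's code) demonstrates the two classic float16 failure modes: limited mantissa precision and limited range. bfloat16 keeps float32's wide exponent range but an even shorter mantissa, while float32 avoids both issues at higher memory cost:

```python
import numpy as np

# float16 stores a 10-bit mantissa, so integers are exact only up to
# 2048; above that, the spacing between representable values is 2.
assert np.float16(2048.0) == 2048.0   # exactly representable
assert np.float16(2049.0) == 2048.0   # rounded to the nearest even value

# float16's range tops out near 65504; larger values overflow to inf.
assert np.isinf(np.float16(70000.0))

# float32 represents both without trouble.
assert np.float32(2049.0) == 2049.0
```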

📦 Submission

For submission to the KITTI benchmark (e.g. GREAT-IGEV):

```shell
python3 save_disp_kitti.py --name great-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the ETH3D benchmark (e.g. GREAT-IGEV):

```shell
python3 save_disp_eth3d.py --name great-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the Middlebury benchmark (e.g. GREAT-IGEV):

```shell
python3 save_disp_middlebury.py --name great-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

📖 Citation

If you find our work useful in your research, please consider citing our paper.

```bibtex
@inproceedings{li2025global,
  title={Global regulation and excitation via attention tuning for stereo matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Zhou, Qian and Li, Yung-Hui and Wang, Jianping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={25539--25549},
  year={2025}
}
```

Acknowledgements

This project is based on RAFT-Stereo, IGEV-Stereo, and Selective-Stereo. Meanwhile, the core attention modules of this project are modified from CoEx, VOLO, and Swin-Transformer. We thank the original authors for their excellent work.
