Our significantly extended version of GREAT, termed GREATEN, is available at Paper, Code.
This repository contains the source code for our paper.
Paper | Paper Page | YouTube
Global Regulation and Excitation via Attention Tuning for Stereo Matching (GREAT-Stereo)
Jiahao Li, Xinhong Chen, Zhengmin Jiang, Qian Zhou, Yung-Hui Li, Jianping Wang
Stereo matching has achieved significant progress with iterative algorithms such as RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless surfaces, or repetitive patterns, due to a lack of the global context and geometric information needed for effective iterative refinement. To enable existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework, which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods, collectively denoted GREAT-Stereo, and validate it through extensive experiments. The framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, our GREAT-IGEV ranks first among all published methods on the Scene Flow test set and the KITTI 2015 and ETH3D leaderboards, and second on the Middlebury benchmark.
Our main contributions are:
- We propose a universal framework that can be integrated into existing iterative stereo-matching methods to improve the performance in ill-posed regions.
- We introduce Spatial (SA), Matching (MA), and Volume (VA) Attentions, designed to mitigate ambiguities in ill-posed regions with global context information.
- Our method outperforms existing published methods on public leaderboards such as SceneFlow, KITTI, ETH3D, and Middlebury, with especially significant improvements in ill-posed regions.
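The core idea of exciting the cost volume with global context can be illustrated with a minimal NumPy sketch. This is a hypothetical toy (plain correlation volume, a fixed sigmoid gate, made-up shapes), not the actual GREAT modules, which use learned SA/MA/VA attention in PyTorch:

```python
import numpy as np

def build_cost_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume: for each candidate disparity d, correlate
    left features with right features shifted by d along the width axis."""
    C, H, W = feat_l.shape
    volume = np.zeros((max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_l * feat_r).mean(axis=0)
        else:
            volume[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :-d]).mean(axis=0)
    return volume

def excite(volume, attention_logits):
    """Gate the cost volume with a sigmoid attention map, broadcast over the
    disparity dimension (a stand-in for the learned VA excitation)."""
    gate = 1.0 / (1.0 + np.exp(-attention_logits))  # (H, W), values in (0, 1)
    return volume * gate[None]

rng = np.random.default_rng(0)
feat_l = rng.standard_normal((8, 4, 6)).astype(np.float32)
feat_r = rng.standard_normal((8, 4, 6)).astype(np.float32)
vol = build_cost_volume(feat_l, feat_r, max_disp=3)
out = excite(vol, rng.standard_normal((4, 6)).astype(np.float32))
print(out.shape)  # (3, 4, 6): (disparity, height, width)
```

The gating leaves the volume's shape untouched, so a module like this can be dropped in front of any iterative refinement loop that consumes a cost volume.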
- The real-time version of the GREAT Framework. (Hint: this TODO list will be implemented in our significant extension of GREAT, termed GREATEN.)
- The GPU-memory-friendly implementation of the Matching Attention. (Hint: see the GREATEN repository.)
- The Foundation-Model-based experiments.
- The solid and robust version of the GREAT Framework.
- The accelerated training and evaluation pipeline.
We now propose a solid and robust version of our GREAT Framework, which achieves better performance on the SceneFlow and public KITTI 2012/2015 benchmarks, especially in ill-posed regions such as occlusions. Meanwhile, the Foundation-Model version of our GREAT-IGEV also achieves performance comparable to current SOTA Foundation-Model-based architectures.
We merge the solid and robust version of GREAT-Stereo into great-stereo folder.
Our main modifications are:
- We simplify the implementation of Volume Attention.
- We extend the application of Spatial Attention.
- We remove the redundant implementation of receptive augmentation.
- We modify the cost volume construction pipeline with combined cost volume.
- We implement a Foundation-Model (DepthAny) based GREAT-IGEV, named GREAT-IGEV-DepthAny, by replacing the MobileNetV2 backbone with DepthAnythingV2 (following the implementation in Monster), and conduct the Foundation-Model-based experiments.
- We accelerate the training and evaluation with DistributedDataParallel settings.
The benchmark results and corresponding checkpoints are:
| Models | SceneFlow EPE | SceneFlow D3 | SceneFlow Occ-EPE | SceneFlow Occ-D3 | SceneFlow Non-Occ-EPE | SceneFlow Non-Occ-D3 | KITTI2012 Out-Noc (2px) | KITTI2012 Out-All (2px) | KITTI2012 Out-Noc (3px) | KITTI2012 Out-All (3px) | KITTI2015 D1-All | KITTI2015 D1-bg | KITTI2015 Noc-D1-All | KITTI2015 Noc-D1-bg | Params | Run Time | Checkpoints |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Light-Weight Model | |||||||||||||||||
| LEA-Stereo | 0.78 | - | - | - | - | - | 1.90 | 2.39 | 1.13 | 1.45 | 1.65 | 1.40 | 1.51 | 1.29 | 1.81M | 0.30s | - |
| ACVNet | 0.48 | - | - | - | - | - | 1.83 | 2.34 | 1.13 | 1.47 | 1.65 | 1.37 | 1.52 | 1.26 | 6.20M | 0.20s | - |
| IGEV-Stereo | 0.48 | - | 1.65 | - | 0.19 | - | 1.71 | 2.17 | 1.12 | 1.44 | 1.59 | 1.38 | 1.49 | 1.27 | 12.60M | 0.32s | - |
| Selective-IGEV | 0.45 | - | 1.57 | - | 0.17 | - | 1.59 | 2.05 | 1.07 | 1.38 | 1.55 | 1.33 | 1.44 | 1.22 | 13.14M | 0.24s | - |
| IGEV++ | 0.43 | - | - | - | - | - | 1.56 | 2.03 | 1.04 | 1.36 | 1.51 | 1.31 | 1.42 | 1.20 | 14.53M | 0.28s | - |
| GREAT-IGEV (Ours) | 0.41 | 2.20 | 1.51 | 10.12 | 0.14 | 0.49 | 1.51 | 2.00 | 1.02 | 1.37 | 1.50 | 1.28 | 1.37 | 1.14 | 14.44M | 0.33s | Google Drive |
| GREAT-Selective (Ours) | 0.42 | 2.19 | 1.52 | 10.11 | 0.15 | 0.48 | 1.48 | 1.94 | 1.00 | 1.31 | 1.49 | 1.27 | 1.40 | 1.16 | 14.98M | 0.43s | Google Drive |
| GREAT-IGEV-Solid (Ours) | 0.39 | 2.13 | 1.48 | 9.08 | 0.12 | 0.47 | 1.47 | 1.98 | 0.95 | 1.32 | 1.47 | 1.25 | 1.37 | 1.14 | 18.4M | 0.33s | Google Drive |
| GREAT-Selective-Solid (Ours) | 0.38 | 2.07 | 1.46 | 8.85 | 0.11 | 0.46 | - | - | - | - | - | - | - | - | 18.9M | 0.43s | Google Drive |
| Foundation Model | |||||||||||||||||
| ViTA-Stereo | 0.34 | - | - | - | - | - | 1.46 | 1.80 | 0.93 | 1.16 | 1.50 | 1.21 | 1.41 | 1.12 | - | - | - |
| AIO-Stereo | - | - | - | - | - | - | 1.58 | 1.94 | 1.05 | 1.29 | 1.54 | 1.34 | 1.43 | 1.22 | - | - | - |
| Foundation-Stereo | 0.34 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| DEFOM-Stereo | 0.42 | - | - | - | - | - | 1.43 | 1.79 | 0.94 | 1.18 | 1.41 | 1.25 | 1.33 | 1.15 | - | 0.30s | - |
| IGEV++ (DepthAny) | - | - | - | - | - | - | 1.36 | 1.74 | 0.89 | 1.13 | 1.43 | 1.15 | 1.36 | 1.07 | 348M | 0.48s | - |
| Monster | 0.37 | 2.00 | 1.35 | 9.18 | 0.14 | 0.44 | 1.36 | 1.75 | 0.84 | 1.09 | 1.41 | 1.13 | 1.33 | 1.05 | 388M | 0.45s | - |
| GREAT-IGEV-DepthAny (Ours) | 0.36 | 2.03 | 1.41 | 8.70 | 0.11 | 0.45 | 1.34 | 1.76 | 0.85 | 1.13 | 1.43 | 1.15 | 1.36 | 1.07 | 386M | 0.43s | Google Drive |
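For reference, the EPE and outlier metrics reported above can be computed along these lines. This is a hedged sketch with made-up toy arrays; note that the official KITTI D1 metric additionally requires the error to exceed 5% of the ground-truth disparity, which is omitted here:

```python
import numpy as np

def epe(pred, gt, valid):
    """End-Point Error: mean absolute disparity error over valid pixels."""
    return np.abs(pred - gt)[valid].mean()

def outlier_rate(pred, gt, valid, thresh=3.0):
    """Fraction of valid pixels with error > thresh px (e.g. D3 / Out (3px)).
    The official KITTI D1 also requires the error to exceed 5% of gt."""
    return (np.abs(pred - gt)[valid] > thresh).mean()

gt = np.array([[10.0, 20.0], [30.0, 0.0]])
pred = np.array([[10.5, 24.0], [30.0, 7.0]])
valid = gt > 0  # ignore pixels without ground truth
print(epe(pred, gt, valid))           # (0.5 + 4.0 + 0.0) / 3 = 1.5
print(outlier_rate(pred, gt, valid))  # 1 of 3 valid pixels exceeds 3 px
```

Restricting `valid` to occluded or non-occluded masks yields the Occ-/Non-Occ- variants in the table.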
The zero-shot results for Foundation Models are:
| Models | SceneFlow (EPE) | KITTI2012 (D3) | KITTI2015 (D3) | Middlebury (D2) | ETH3D (D1) |
|---|---|---|---|---|---|
| StereoAnywhere | - | 3.90 | 3.93 | 6.96 | 1.66 |
| FoundationStereo | 0.34 | - | - | 5.5 | 1.8 |
| DEFOM-Stereo | 0.42 | 3.76 | 4.99 | 5.91 | 2.35 |
| Monster | 0.38 | 3.37 | 3.44 | 3.67 | 1.10 |
| Monster* | 0.39 | 4.82 | 5.98 | 4.66 | 9.15 |
| GREAT-IGEV-DepthAny (Ours) | 0.36 | 4.31 | 5.48 | 3.35 | 5.82 |
| GREAT-IGEV-DepthAny* (Ours) | 0.39 | 4.34 | 5.56 | 3.26 | 2.48 |
PS: Monster* is the result of our SceneFlow reproduction experiment using the official Monster code; see issue #28 in the official repository for more information.
PS: GREAT-IGEV-DepthAny* is the result of the SceneFlow experiment after zero-shot checkpoint selection, following issue #23 in the official Monster repository.
RAFT Demo | IGEV Demo | Selective Demo
Qualitative results of GREAT-IGEV on the Scene Flow test set of occlusion (Row 1), textureless (Row 2), and repetitive texture (Row 3) regions.
Comparisons with state-of-the-art stereo methods on different public benchmarks and ablation study of the cross-model transferability of the proposed GREAT framework on the Scene Flow test set.
- NVIDIA RTX 3090 or 4090
- python 3.8
```shell
conda create -n great python=3.8
conda activate great
pip install torch torchvision torchaudio xformers==0.0.22.post3+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm==4.67.1
pip install scipy==1.10.1
pip install opencv-python==4.11.0.86
pip install scikit-image==0.21.0
pip install tensorboard==2.12.0
pip install matplotlib==3.7.5
pip install timm==0.5.4
pip install numpy==1.24.1
pip install einops==0.8.1
pip install open3d==0.19.0
```
- SceneFlow
- KITTI
- ETH3D
- Middlebury
- TartanAir
- VKITTI2
- CREStereo Dataset
- FallingThings
- InStereo2K
- Sintel Stereo
- HR-VS
- Download the checkpoints from Google Drive.
- Change the following parameters in the script located at `launchers/stereo_matching/test_launcher/`:
  - `dataset` - Choices => [sceneflow, kitti, eth3d, middlebury_(Q | H | F)]
  - `dataset_root` - your/path/to/corresponding/dataset
  - `restore_ckpt` - your/path/to/checkpoint
  - `max_disp` (Optional) - `768` for Middlebury and `192` for others
- Run the evaluation (e.g., evaluation of GREAT-IGEV on the Scene Flow test set):
  `./launchers/stereo_matching/test_launcher/great_igev_evaluator.sh`
- (Optional) You can also change the `eval_mode` in the evaluation script to get different evaluation results:
  - `metric` generates quantitative evaluation results (default).
  - `pcgen` generates a point cloud of the predicted disparity for visualization.
  - `cvvis` generates a visualization of the cost volume.
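The point-cloud generation behind an evaluation mode like `pcgen` is standard stereo back-projection. The sketch below uses hypothetical camera parameters (`fx`, `baseline`, `cx`, `cy`) and is not the repository's exact implementation:

```python
import numpy as np

def disparity_to_points(disp, fx, baseline, cx, cy):
    """Back-project a disparity map to 3-D points using the pinhole model:
    Z = fx * baseline / disp, X = (u - cx) * Z / fx, Y = (v - cy) * Z / fx."""
    H, W = disp.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = disp > 0  # zero disparity has no finite depth
    Z = np.zeros_like(disp, dtype=np.float64)
    Z[valid] = fx * baseline / disp[valid]
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fx
    return np.stack([X, Y, Z], axis=-1), valid

disp = np.full((2, 3), 2.0)  # toy map: every pixel has disparity 2 px
pts, valid = disparity_to_points(disp, fx=100.0, baseline=0.5, cx=1.0, cy=0.5)
print(pts[0, 1, 2])  # Z = 100 * 0.5 / 2 = 25.0
```

The resulting (H, W, 3) array can be flattened over `valid` pixels and handed to open3d for visualization.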
- Change the following parameters in the script located at `launchers/stereo_matching/train_launcher/`:
  - `logdir` - your/path/to/save/training/information
  - `train_datasets` - Choices => [sceneflow, vkitti2, kitti, eth3d_train, eth3d_finetune, middlebury_train, middlebury_finetune]
  - `train_datasets_root` - your/path/to/corresponding/dataset
  - `restore_ckpt` (Optional) - your/path/to/checkpoint/for/finetuning
- Run the training (e.g., training of GREAT-IGEV on Scene Flow):
  `./launchers/stereo_matching/train_launcher/great_igev_trainer.sh`
- (Optional) You can also change the trainer in the script from `stereo_trainer.py` to `stereo_resumable_trainer.py`, which can resume training if the process has been accidentally shut down. The `stereo_resumable_trainer.py` saves checkpoints for the model, optimizer, and learning-rate scheduler for resuming.
- (Optional) Thanks to the IGEV-Stereo repository, we also provide a choice of data type for mixed-precision training. You can change this data type with `precision_dtype` in the script. Choices are `float32`, `float16`, and `bfloat16`; the default is `float16`. NOTE: our provided checkpoints are trained with `float16` and `float32`.
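A quick illustration of why the `precision_dtype` choice matters. This shows generic float16 behaviour, not anything specific to this repository: float16 saturates above 65504 and keeps only ~10 mantissa bits, which is why `bfloat16` or `float32` can be safer choices for training:

```python
import numpy as np

# float16 saturates: its largest finite value is 65504, so 1e5 overflows to inf.
print(np.isinf(np.float16(1e5)))           # True

# float16 has ~3 decimal digits of precision: a small update to 1.0 rounds away.
print(np.float16(1.0) + np.float16(1e-4))  # 1.0 (the update is lost)

# Keeping master values in float32 preserves the same small update.
print(np.float32(1.0) + np.float32(1e-4))
```

Mixed-precision schemes typically avoid both failure modes with loss scaling and float32 master weights.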
For submission to the KITTI benchmark (e.g., GREAT-IGEV):

```shell
python3 save_disp_kitti.py --name great-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the ETH3D benchmark (e.g., GREAT-IGEV):

```shell
python3 save_disp_eth3d.py --name great-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the Middlebury benchmark (e.g., GREAT-IGEV):

```shell
python3 save_disp_middlebury.py --name great-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

If you find our work useful in your research, please consider citing our paper.
```
@inproceedings{li2025global,
  title={Global regulation and excitation via attention tuning for stereo matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Zhou, Qian and Li, Yung-Hui and Wang, Jianping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={25539--25549},
  year={2025}
}
```

This project is based on RAFT-Stereo, IGEV-Stereo, and Selective-Stereo. Meanwhile, the core attention modules of this project are modified from CoEx, VOLO, and Swin-Transformer. We thank the original authors for their excellent work.






