# Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching (GREATEN-Stereo)

Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang

This repository contains the source code for our paper.
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, three sparse attention designs, Sparse Spatial Attention (SSA), Sparse Dual-Matching Attention (SDMA), and Simple Volume Attention (SVA), preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
Our main contributions are:
- We introduce the Gated Contextual-Geometric Fusion (GCGF) module that effectively fuses stereo-image and surface-normal features to mitigate cross-domain discrepancies and the ill-posed ambiguities inherent in image textures, thereby enhancing Synthetic-to-Realistic (Syn-to-Real) generalization.
- We design the Specular-Transparent Augmentation (STA) strategy to intentionally disturb the texture consistency of training images, forcing the GCGF module to better filter ambiguous image textures and improve fusion reliability.
- We develop sparse attention alternatives, termed Sparse Spatial Attention (SSA), Sparse Dual-Matching Attention (SDMA), and Simple Volume Attention (SVA), that preserve the global feature extraction capability of GREAT-Stereo for handling ambiguities in occluded and texture-related ill-posed regions while significantly reducing computational cost.
- Trained solely on synthetic data, our GREATEN-Stereo outperforms existing published stereo-matching methods in Synthetic-to-Realistic generalization across five major real-world benchmarks: ETH3D, Middlebury, KITTI-2012, KITTI-2015, and Booster.
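To give intuition for the gating idea behind GCGF, here is a minimal NumPy sketch of sigmoid-gated fusion of image and normal features. This is illustrative only: the function name, shapes, projection parameters, and the exact gated formulation are our assumptions, not the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, normal_feat, w_gate, b_gate):
    """Illustrative gated fusion: a per-channel gate computed from both
    feature maps suppresses unreliable image cues and blends in
    normal-driven geometric cues.

    img_feat, normal_feat: (C, H, W) feature maps.
    w_gate: (C, 2C) projection, b_gate: (C,) bias (hypothetical parameters).
    """
    c, h, w = img_feat.shape
    # Concatenate along channels and project to per-pixel, per-channel gates.
    joint = np.concatenate([img_feat, normal_feat], axis=0).reshape(2 * c, -1)
    gate = sigmoid(w_gate @ joint + b_gate[:, None]).reshape(c, h, w)
    # The gate filters image features; normals inject geometric cues.
    return gate * img_feat + (1.0 - gate) * normal_feat

rng = np.random.default_rng(0)
c, h, w = 4, 8, 8
img = rng.standard_normal((c, h, w))
nrm = rng.standard_normal((c, h, w))
fused = gated_fusion(img, nrm, rng.standard_normal((c, 2 * c)) * 0.1,
                     np.zeros(c))
print(fused.shape)  # (4, 8, 8)
```

Because the gate lies in (0, 1), the fused map is an elementwise convex combination of the two inputs; the actual GCGF module learns this suppression end-to-end.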
Demo visualization of our captured stereo pairs. "DA" denotes "DepthAny". All the models are trained exclusively on synthetic datasets.
Synthetic-to-Realistic Zero-Shot visualization of GREATEN-Stereo on the Booster, Middlebury, and ETH3D. "DA" denotes "DepthAny". All the models are trained exclusively on synthetic datasets.
Synthetic-to-Realistic Zero-Shot performance of GREATEN-Stereo on KITTI-2012, KITTI-2015, and Booster. Unless otherwise specified, all models are trained from scratch on SceneFlow. StereoAnywhere is trained from a frozen RAFT-Stereo checkpoint with extra supervision for surface normals, and uses priors from a VFM pretrained on the HyperSim dataset. * denotes training on our Syn-to-Real Mixed datasets.
Comparison of the computational overhead of GREATEN-Stereo on SceneFlow, KITTI-2015, and Middlebury. "DA" denotes "DepthAny". Results in Table (a) are evaluated on an NVIDIA RTX 4090. Results in Table (b) are obtained on an NVIDIA A800-80GB using full-resolution Middlebury with the maximum disparity set to 768.
## Requirements

- NVIDIA RTX 4090
- Python 3.8
## Installation

```shell
conda create -n greaten python=3.8
conda activate greaten
pip install torch torchvision torchaudio xformers==0.0.22.post3+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm==4.67.1
pip install scipy==1.10.1
pip install opencv-python==4.11.0.86
pip install scikit-image==0.21.0
pip install tensorboard==2.12.0
pip install matplotlib==3.7.5
pip install timm==0.5.4
pip install numpy==1.24.1
pip install einops==0.8.1
pip install open3d==0.19.0
pip install kornia==0.7.3
pip install setuptools==69.5.1
cd utils/stereo_matching/cuda_utils/deformable_aggregation && pip install -e .
```

## Datasets

- SceneFlow
- KITTI
- ETH3D
- Middlebury
- Booster
- TartanAir
- VKITTI2
- CREStereo Dataset
- FallingThings
- InStereo2K
- Sintel Stereo
- HR-VS
Download the checkpoints:
| Model | Link |
|---|---|
| DepthAnything V2 | Download |
| GREATEN-IGEV-SceneFlow-192 | Download |
| GREATEN-Selective-SceneFlow-192 | Download |
| GREATEN-DepthAny-IGEV-SceneFlow-192 | Download |
| GREATEN-IGEV-Mixed-192 | Download |
| GREATEN-Selective-Mixed-192 | Download |
| GREATEN-DepthAny-IGEV-Mixed-192 | Download |
| GREATEN-IGEV-RVC-192 | Download |
| GREATEN-DepthAny-IGEV-RVC-192 | Download |
## Evaluation

1. Change the following parameters in the script located at `launchers/stereo_matching/test_launcher/`.
   - `dataset` - Choices => [sceneflow, kitti, booster, eth3d, middlebury_(Q | H | F)]
   - `dataset_root` - your/path/to/corresponding/dataset
   - `restore_ckpt` - your/path/to/checkpoint
   - `max_disp` - (Optional) 768 for Middlebury and 192 for others
2. Run the evaluation (e.g., evaluation of GREATEN-IGEV on the Scene Flow test set).

   ```shell
   ./launchers/stereo_matching/test_launcher/greaten_igev_evaluator.sh
   ```

   (Optional) You can also change the `eval_mode` in the evaluation script to get different evaluation results:
   - `metric` to generate quantitative evaluation results (default).
   - `pcgen` to generate the point cloud of the predicted disparity for visualization.
   - `cvvis` to generate the visualization of the cost volume.
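As a concrete illustration, the parameter block inside an evaluation launcher might look like this after editing. This is a hypothetical excerpt: the variable names follow the parameter list above, but the real script's layout and paths may differ.

```shell
# Hypothetical excerpt of an edited test launcher, e.g.
# launchers/stereo_matching/test_launcher/greaten_igev_evaluator.sh
# (assumed layout; adapt paths to your setup)
dataset=sceneflow
dataset_root=/data/sceneflow
restore_ckpt=/ckpts/greaten_igev_sceneflow_192.pth
max_disp=192
eval_mode=metric
```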
## Training

1. Change the following parameters in the script located at `launchers/stereo_matching/train_launcher/`.
   - `logdir` - your/path/to/save/training/information
   - `train_datasets` - Choices => [sceneflow, vkitti2, kitti, syn_to_real_train, rvc_mix_data_train, eth3d_train, eth3d_finetune, middlebury_train, middlebury_finetune]
   - `train_datasets_root` - your/path/to/corresponding/dataset
   - `restore_ckpt` - (Optional) your/path/to/checkpoint/for/finetuning
2. Run the training (e.g., training of GREATEN-IGEV on the Scene Flow training set).

   ```shell
   ./launchers/stereo_matching/train_launcher/greaten_igev_trainer.sh
   ```

## Submission

For submission to the KITTI benchmark (e.g., GREATEN-IGEV):
```shell
python3 save_disp_kitti.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the ETH3D benchmark (e.g., GREATEN-IGEV):

```shell
python3 save_disp_eth3d.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the Middlebury benchmark (e.g., GREATEN-IGEV):

```shell
python3 save_disp_middlebury.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

## Citation

If you find our work useful in your research, please consider citing our papers.
GREAT-Stereo:

```bibtex
@inproceedings{li2025global,
  title={Global regulation and excitation via attention tuning for stereo matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Zhou, Qian and Li, Yung-Hui and Wang, Jianping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={25539--25549},
  year={2025}
}
```

GREATEN-Stereo:

```bibtex
@article{li2026geometry,
  title={Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Huang, Cheng and Li, Yung-Hui and Wang, Jianping},
  journal={arXiv preprint arXiv:2604.09142},
  year={2026}
}
```

## Acknowledgement

This project is based on IGEV-Stereo, Selective-Stereo, and Monster. The core attention modules are modified from GREAT-Stereo, building on the deformable attention implementation from GaussianFormer. We thank the original authors for their excellent work.





