# Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching (GREATEN-Stereo)

Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang

This repository contains the source code for our paper.
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, three sparse attention designs, Sparse Spatial Attention (SSA), Sparse Dual-Matching Attention (SDMA), and Simple Volume Attention (SVA), preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
Our main contributions are:
- We introduce the Gated Contextual-Geometric Fusion (GCGF) module that effectively fuses stereo-image and surface-normal features to mitigate cross-domain discrepancies and the ill-posed ambiguities inherent in image textures, thereby enhancing Synthetic-to-Realistic (Syn-to-Real) generalization.
- We design the Specular-Transparent Augmentation (STA) strategy to intentionally disturb the texture consistency of training images, forcing the GCGF module to better filter ambiguous image textures and improve fusion reliability.
- We develop sparse attention alternatives, termed Sparse Spatial Attention (SSA), Sparse Dual-Matching Attention (SDMA), and Simple Volume Attention (SVA), that preserve the global feature extraction capability of GREAT-Stereo for handling ambiguities in occluded and texture-related ill-posed regions while significantly reducing computational cost.
- Trained solely on synthetic data, our GREATEN-Stereo outperforms existing published stereo-matching methods in Synthetic-to-Realistic generalization across five major real-world benchmarks: ETH3D, Middlebury, KITTI-2012, KITTI-2015, and Booster.
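To give intuition for the gating idea behind GCGF, here is a minimal NumPy sketch of sigmoid-gated fusion of image and normal features. This is illustrative only: the function name, shapes, projection parameters, and the exact gated formulation are our assumptions, not the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, normal_feat, w_gate, b_gate):
    """Illustrative gated fusion: a per-channel gate computed from both
    feature maps suppresses unreliable image cues and blends in
    normal-driven geometric cues.

    img_feat, normal_feat: (C, H, W) feature maps.
    w_gate: (C, 2C) projection, b_gate: (C,) bias (hypothetical parameters).
    """
    c, h, w = img_feat.shape
    # Concatenate along channels and project to per-pixel, per-channel gates.
    joint = np.concatenate([img_feat, normal_feat], axis=0).reshape(2 * c, -1)
    gate = sigmoid(w_gate @ joint + b_gate[:, None]).reshape(c, h, w)
    # The gate filters image features; normals inject geometric cues.
    return gate * img_feat + (1.0 - gate) * normal_feat

rng = np.random.default_rng(0)
c, h, w = 4, 8, 8
img = rng.standard_normal((c, h, w))
nrm = rng.standard_normal((c, h, w))
fused = gated_fusion(img, nrm, rng.standard_normal((c, 2 * c)) * 0.1,
                     np.zeros(c))
print(fused.shape)  # (4, 8, 8)
```

Because the gate lies in (0, 1), the fused map is an elementwise convex combination of the two inputs; the actual GCGF module learns this suppression end-to-end.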
Demo visualization of our captured stereo pairs. "DA" denotes "DepthAny". All the models are trained exclusively on synthetic datasets.
Synthetic-to-Realistic Zero-Shot visualization of GREATEN-Stereo on the Booster, Middlebury, and ETH3D. "DA" denotes "DepthAny". All the models are trained exclusively on synthetic datasets.
Synthetic-to-Realistic Zero-Shot performance of GREATEN-Stereo on KITTI-2012, KITTI-2015, and Booster. Unless otherwise specified, all models are trained from scratch on SceneFlow. StereoAnywhere is trained from a frozen RAFT-Stereo checkpoint with extra supervision for surface normals, and uses priors from a VFM pretrained on the HyperSim dataset. * denotes training on our Syn-to-Real Mixed datasets.
Comparison of the computational overhead of GREATEN-Stereo on SceneFlow, KITTI-2015, and Middlebury. "DA" denotes "DepthAny". Results in Table (a) are evaluated on an NVIDIA RTX 4090. Results in Table (b) are obtained on an NVIDIA A800-80GB using full-resolution Middlebury with the maximum disparity set to 768.
## Requirements

- NVIDIA RTX 4090
- Python 3.8
## Installation

```shell
conda create -n greaten python=3.8
conda activate greaten
pip install torch torchvision torchaudio xformers==0.0.22.post3+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm==4.67.1
pip install scipy==1.10.1
pip install opencv-python==4.11.0.86
pip install scikit-image==0.21.0
pip install tensorboard==2.12.0
pip install matplotlib==3.7.5
pip install timm==0.5.4
pip install numpy==1.24.1
pip install einops==0.8.1
pip install open3d==0.19.0
pip install kornia==0.7.3
pip install setuptools==69.5.1
cd utils/stereo_matching/cuda_utils/deformable_aggregation && pip install -e .
```

## Datasets

- SceneFlow
- KITTI
- ETH3D
- Middlebury
- Booster
- TartanAir
- VKITTI2
- CREStereo Dataset
- FallingThings
- InStereo2K
- Sintel Stereo
- HR-VS
Download the checkpoints:
| Model | Link |
|---|---|
| DepthAnything V2 | Download |
| GREATEN-IGEV-SceneFlow-192 | Download |
| GREATEN-Selective-SceneFlow-192 | Download |
| GREATEN-DepthAny-IGEV-SceneFlow-192 | Download |
| GREATEN-IGEV-Mixed-192 | Download |
| GREATEN-Selective-Mixed-192 | Download |
| GREATEN-DepthAny-IGEV-Mixed-192 | Download |
| GREATEN-IGEV-RVC-192 | Download |
| GREATEN-DepthAny-IGEV-RVC-192 | Download |
## Evaluation

1. Change the following parameters in the script located at `launchers/stereo_matching/test_launcher/`.
   - `dataset` - Choices => [sceneflow, kitti, booster, eth3d, middlebury_(Q | H | F)]
   - `dataset_root` - your/path/to/corresponding/dataset
   - `restore_ckpt` - your/path/to/checkpoint
   - `max_disp` - (Optional) 768 for Middlebury and 192 for others
2. Run the evaluation (e.g., evaluation of GREATEN-IGEV on the Scene Flow test set).

   ```shell
   ./launchers/stereo_matching/test_launcher/greaten_igev_evaluator.sh
   ```

   (Optional) You can also change the `eval_mode` in the evaluation script to get different evaluation results:
   - `metric` to generate quantitative evaluation results (default).
   - `pcgen` to generate the point cloud of the predicted disparity for visualization.
   - `cvvis` to generate the visualization of the cost volume.
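As a concrete illustration, the parameter block inside an evaluation launcher might look like this after editing. This is a hypothetical excerpt: the variable names follow the parameter list above, but the real script's layout and paths may differ.

```shell
# Hypothetical excerpt of an edited test launcher, e.g.
# launchers/stereo_matching/test_launcher/greaten_igev_evaluator.sh
# (assumed layout; adapt paths to your setup)
dataset=sceneflow
dataset_root=/data/sceneflow
restore_ckpt=/ckpts/greaten_igev_sceneflow_192.pth
max_disp=192
eval_mode=metric
```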
## Training

1. Change the following parameters in the script located at `launchers/stereo_matching/train_launcher/`.
   - `logdir` - your/path/to/save/training/information
   - `train_datasets` - Choices => [sceneflow, vkitti2, kitti, syn_to_real_train, rvc_mix_data_train, eth3d_train, eth3d_finetune, middlebury_train, middlebury_finetune]
   - `train_datasets_root` - your/path/to/corresponding/dataset
   - `restore_ckpt` - (Optional) your/path/to/checkpoint/for/finetuning
2. Run the training (e.g., training of GREATEN-IGEV on the Scene Flow training set).

   ```shell
   ./launchers/stereo_matching/train_launcher/greaten_igev_trainer.sh
   ```

## Submission

For submission to the KITTI benchmark (e.g., GREATEN-IGEV):
```shell
python3 save_disp_kitti.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the ETH3D benchmark (e.g., GREATEN-IGEV):

```shell
python3 save_disp_eth3d.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

For submission to the Middlebury benchmark (e.g., GREATEN-IGEV):

```shell
python3 save_disp_middlebury.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs --output_directory your/path/to/save/submission/results
```

## Citation

If you find our work useful in your research, please consider citing our papers.
GREAT-Stereo:

```bibtex
@inproceedings{li2025global,
  title={Global regulation and excitation via attention tuning for stereo matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Zhou, Qian and Li, Yung-Hui and Wang, Jianping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={25539--25549},
  year={2025}
}
```

GREATEN-Stereo:

```bibtex
@article{li2026geometry,
  title={Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Huang, Cheng and Li, Yung-Hui and Wang, Jianping},
  journal={arXiv preprint arXiv:2604.09142},
  year={2026}
}
```

## Acknowledgement

This project is based on IGEV-Stereo, Selective-Stereo, and Monster. The core attention modules are modified from GREAT-Stereo, building on the deformable attention implementation from GaussianFormer. We thank the original authors for their excellent work.





