
πŸš€ GREATEN-Stereo πŸš€

This repository contains the source code for our paper.

Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching (GREATEN-Stereo) [arXiv]

Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang

*(Figure: main_architecture)*

πŸ’‘ Abstract

Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.

Our main contributions are:

  • We introduce the Gated Contextual-Geometric Fusion (GCGF) module that effectively fuses stereo-image and surface-normal features to mitigate cross-domain discrepancies and the ill-posed ambiguities inherent in image textures, thereby enhancing Synthetic-to-Realistic (Syn-to-Real) generalization.
  • We design the Specular-Transparent Augmentation (STA) strategy to intentionally disturb the texture consistency of training images, forcing the GCGF module to better filter ambiguous image textures and improve fusion reliability.
  • We develop sparse attention alternatives to preserve the global feature extraction capability of GREAT-Stereo for handling ambiguities in occluded and texture-related ill-posed regions, while significantly reducing computational cost, termed as Sparse Spatial Attention (SSA), Sparse Dual-Matching Attention (SDMA), and Simple Volume Attention (SVA).
  • Trained solely on synthetic data, our GREATEN-Stereo outperforms existing published stereo-matching methods in Synthetic-to-Realistic generalization across five major real-world benchmarks: ETH3D, Middlebury, KITTI-2012, KITTI-2015, and Booster.

🎬 Demos & Results

*(Figure: demo_vis)*

Demo visualization of our captured stereo pairs. "DA" denotes "DepthAny". All models are trained exclusively on synthetic datasets.

*(Figure: booster_vis)*

*(Figure: middlebury_eth3d_vis)*

Synthetic-to-Realistic Zero-Shot visualization of GREATEN-Stereo on the Booster, Middlebury, and ETH3D benchmarks. "DA" denotes "DepthAny". All models are trained exclusively on synthetic datasets.

*(Figure: sota_comparison)*

Synthetic-to-Realistic Zero-Shot performance of GREATEN-Stereo on the KITTI-2012, KITTI-2015, and Booster benchmarks. Unless otherwise specified, all models are trained from scratch on SceneFlow. StereoAnywhere is trained from a frozen RAFT-Stereo checkpoint with extra supervision for surface normals and uses priors from a VFM pretrained on the HyperSim dataset. * denotes training on our Syn-to-Real Mixed datasets.

*(Figure: efficiency)*

Comparison of the computational overhead of GREATEN-Stereo on SceneFlow, KITTI-2015, and Middlebury. "DA" denotes "DepthAny". Results in Table (a) are evaluated on an NVIDIA RTX 4090. Results in Table (b) are obtained on an NVIDIA A800-80GB using full-resolution Middlebury with the maximum disparity set to 768.

βš™οΈ Environment Settings

  • NVIDIA RTX 4090
  • Python 3.8

```shell
conda create -n greaten python=3.8
conda activate greaten

pip install torch torchvision torchaudio xformers==0.0.22.post3+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm==4.67.1
pip install scipy==1.10.1
pip install opencv-python==4.11.0.86
pip install scikit-image==0.21.0
pip install tensorboard==2.12.0
pip install matplotlib==3.7.5
pip install timm==0.5.4
pip install numpy==1.24.1
pip install einops==0.8.1
pip install open3d==0.19.0
pip install kornia==0.7.3
pip install setuptools==69.5.1

cd utils/stereo_matching/cuda_utils/deformable_aggregation && pip install -e .
```

πŸ’Ύ Required Data

πŸ§ͺ Evaluation

  1. Download the Checkpoints:
| Model | Link |
| --- | --- |
| DepthAnything V2 | Download 😆 |
| GREATEN-IGEV-SceneFlow-192 | Download 😆 |
| GREATEN-Selective-SceneFlow-192 | Download 😆 |
| GREATEN-DepthAny-IGEV-SceneFlow-192 | Download 😆 |
| GREATEN-IGEV-Mixed-192 | Download 😆 |
| GREATEN-Selective-Mixed-192 | Download 😆 |
| GREATEN-DepthAny-IGEV-Mixed-192 | Download 😆 |
| GREATEN-IGEV-RVC-192 | Download 😆 |
| GREATEN-DepthAny-IGEV-RVC-192 | Download 😆 |
  2. Change the following parameters in the launcher script under launchers/stereo_matching/test_launcher/:

    • dataset
      • Choices => [sceneflow, kitti, booster, eth3d, middlebury_(Q | H | F)]
    • dataset_root
      • your/path/to/corresponding/dataset
    • restore_ckpt
      • your/path/to/checkpoint
    • max_disp (Optional)
      • 768 for Middlebury and 192 for the others
  3. Run the evaluation (e.g., evaluating GREATEN-IGEV on the Scene Flow test set).

```shell
./launchers/stereo_matching/test_launcher/greaten_igev_evaluator.sh
```

  4. (Optional) Change the eval_mode in the evaluation script to produce different outputs:
    • metric generates quantitative evaluation results (default).
    • pcgen generates a point cloud from the predicted disparity for visualization.
    • cvvis generates a visualization of the cost volume.
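As a concrete illustration, the test-launcher parameters described above might be set as in the following sketch. The names come from the parameter list in this section, but whether they appear as shell variables or as CLI flags inside the actual launcher script is an assumption, and all paths are placeholders:

```shell
# Hypothetical parameter block for a script under launchers/stereo_matching/test_launcher/
# (names taken from the list above; their exact in-script form is an assumption):
dataset=sceneflow                                # sceneflow | kitti | booster | eth3d | middlebury_(Q|H|F)
dataset_root=your/path/to/corresponding/dataset  # placeholder path
restore_ckpt=your/path/to/checkpoint             # placeholder path
max_disp=192                                     # use 768 for Middlebury
eval_mode=metric                                 # metric | pcgen | cvvis
```

For a Middlebury full-resolution run, one would switch `dataset` to `middlebury_F` and raise `max_disp` to 768, per the table caption earlier in this README.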

πŸ“š Training

  1. Change the following parameters in the launcher script under launchers/stereo_matching/train_launcher/:

    • logdir
      • your/path/to/save/training/information
    • train_datasets
      • Choices => [sceneflow, vkitti2, kitti, syn_to_real_train, rvc_mix_data_train, eth3d_train, eth3d_finetune, middlebury_train, middlebury_finetune]
    • train_datasets_root
      • your/path/to/corresponding/dataset
    • restore_ckpt (Optional)
      • your/path/to/checkpoint/for/finetuning
  2. Run the training (e.g., training GREATEN-IGEV on the Scene Flow training set).

```shell
./launchers/stereo_matching/train_launcher/greaten_igev_trainer.sh
```
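For illustration, the train-launcher parameters described above might look like the sketch below. The names are taken from the parameter list in this section; their exact form inside the launcher script is an assumption, and all paths are placeholders:

```shell
# Hypothetical parameter block for a script under launchers/stereo_matching/train_launcher/
# (names from the list above; their exact in-script form is an assumption):
logdir=your/path/to/save/training/information           # placeholder path
train_datasets=sceneflow                                # see the choices listed above
train_datasets_root=your/path/to/corresponding/dataset  # placeholder path
# restore_ckpt=your/path/to/checkpoint/for/finetuning   # optional, only when finetuning
```

Leaving `restore_ckpt` commented out corresponds to training from scratch; uncommenting it resumes from a checkpoint for finetuning, as the optional parameter above indicates.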

πŸ“¦ Submission

For submission to the KITTI benchmark (e.g., GREATEN-IGEV):

```shell
python3 save_disp_kitti.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint \
  --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs \
  --output_directory your/path/to/save/submission/results
```

For submission to the ETH3D benchmark (e.g., GREATEN-IGEV):

```shell
python3 save_disp_eth3d.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint \
  --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs \
  --output_directory your/path/to/save/submission/results
```

For submission to the Middlebury benchmark (e.g., GREATEN-IGEV):

```shell
python3 save_disp_middlebury.py --name greaten-igev-stereo --restore_ckpt your/path/to/checkpoint \
  --left_imgs your/path/to/left/imgs --right_imgs your/path/to/right/imgs \
  --output_directory your/path/to/save/submission/results
```

πŸ“– Citation

If you find our work useful in your research, please consider citing our papers.

GREAT-Stereo:

```bibtex
@inproceedings{li2025global,
  title={Global regulation and excitation via attention tuning for stereo matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Zhou, Qian and Li, Yung-Hui and Wang, Jianping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={25539--25549},
  year={2025}
}
```

GREATEN-Stereo:

```bibtex
@article{li2026geometry,
  title={Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching},
  author={Li, Jiahao and Chen, Xinhong and Jiang, Zhengmin and Huang, Cheng and Li, Yung-Hui and Wang, Jianping},
  journal={arXiv preprint arXiv:2604.09142},
  year={2026}
}
```

Acknowledgements

This project is based on IGEV-Stereo, Selective-Stereo, and Monster. The core attention modules of this project are modified from GREAT-Stereo, building on the Deformable Attention implementation from GaussianFormer. We thank the original authors for their excellent work.
