Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

This is the official implementation of our paper accepted to CVPR 2026

Authors: Bowen Wen, Shaurya Dewan, Stan Birchfield

Abstract

Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10× faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods.

[NOTE] This model is designed for real-time applications. For offline computation for the best accuracy, please checkout our earlier work FoundationStereo.

Environment setup

Option 1: Docker

docker build --network host -t ffs -f docker/dockerfile .
bash docker/run_container.sh

Option 2: pip

conda create -n ffs python=3.12 && conda activate ffs
pip install torch==2.6.0 torchvision==0.21.0 xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

Weights and Trade-off

download from here and put under the folder weights/ (e.g. ./weights/23-36-37). Below table compares the differences among some representative models of varying sizes from our trained family. They are sorted from slowest to fastest, with accuracy descending, where runtime is profiled on GPU 3090, image size 640x480.

To trade-off speed and accuracy, there are two options:

Try with different checkpoints.
Tune the config flags (see explanations in the "Run demo" section below).

Checkpoint	valid_iters	Runtime-Pytorch (ms)	Runtime-TRT (ms)	Peak Memory (MB)
`23-36-37`	8	49.4	23.4	653
`23-36-37`	4	41.1	18.4	653
`20-26-39`	8	43.6	19.4	651
`20-26-39`	4	37.5	16.4	651
`20-30-48`	8	38.4	16.6	646
`20-30-48`	4	29.3	14.0	646

Run demo

python scripts/run_demo.py --model_dir weights/23-36-37/model_best_bp2_serialize.pth --left_file assets/left.png --right_file assets/right.png --intrinsic_file assets/K.txt --out_dir output/ --remove_invisible 0 --denoise_cloud 1  --scale 1 --get_pc 1 --valid_iters 8 --max_disp 192 --zfar 100

Flag	Meaning
`--model_dir`	Path to the trained weights/model file
`--left_file`	Path to the left image file
`--right_file`	Path to the right image file
`--intrinsic_file`	Path to the camera intrinsic matrix and baseline file
`--out_dir`	Output directory for saving results
`--remove_invisible`	Whether to ignore non-overlapping region's depth (0: no, 1: yes)
`--denoise_cloud`	Whether to apply denoising to the point cloud (0: no, 1: yes)
`--scale`	Image scaling factor
`--get_pc`	Obtain point cloud output (0: no, 1: yes)
`--valid_iters`	Number of refinement updates during forward pass
`--max_disp`	Maximum disparity for volume encoding, 192 should be enough, unless you need to sense very near objects (e.g. <0.1m). Increasing it runs slower and uses more memory.
`--zfar`	Maximum depth to include in point cloud

Refer to scripts/run_demo.py for comprehensive list of flags.

Tips:

The input left and right images should be rectified and undistorted, which means there should not be fisheye kind of lens distortion and the epipolar lines are horizontal between the left/right images. If you obtain images from stereo cameras such as Zed, they usually have handled this for you.
Do not swap left and right image. The left image should really be obtained from the left-side camera (objects will appear righter in the image).
We recommend to use PNG files with no lossy compression
Our method works best on stereo RGB images. However, we have also tested it on monochrome or IR stereo images (e.g. from RealSense D4XX series) and it works well too.
To get point cloud for your own data, you need to specify the intrinsics. In the intrinsic file in args, 1st line is the flattened 1x9 intrinsic matrix, 2nd line is the baseline (distance) between the left and right camera, unit in meters.
The model performs better for image width size <1000. You can run with smaller scale, e.g. --scale 0.5 to downsize input image, then upsize the output depth to your need with nearest neighbor interpolation.
For faster inference, you can reduce the input image resolution by e.g. --scale 0.5, and reduce refine iterations by e.g. --valid_iters 4.
Note that the 1st time running is slower due to compilation, use a while loop after warm up for live running.

Expect to see results like below:

Disparity/Depth:
Point cloud:

ONNX/TRT

For TRT, we recommend first setup env in docker.

python scripts/make_onnx.py --model_dir weights/23-36-37/model_best_bp2_serialize.pth --save_path output/ --height 448 --width 640 --valid_iters 8 --max_disp 192

Flag	Meaning
`--model_dir`	Path to the trained weights/model file
`--save_path`	Directory to save ONNX outputs and zip file
`--height`	Input image height, better to be divisible by 32. Reduce image size can increase speed.
`--width`	Input image width, better to be divisible by 32. Reduce image size can increase speed.
`--valid_iters`	Number of updates during forward pass, reduce it for faster speed, but may drop quality
`--max_disp`	Maximum disparity for volume encoding, 192 should be enough, unless you need to sense very near objects (e.g. <0.1m). Increasing it runs slower and uses more memory.

Refer to scripts/make_onnx.py for a comprehensive list of available flags. Since some intermediate operation is not supported by TRT conversion. We split around it into 2 onnx files.

Then convert from ONNX to TRT as below.

trtexec --onnx=output/feature_runner.onnx --saveEngine=output/feature_runner.engine --fp16  --useCudaGraph
trtexec --onnx=output/post_runner.onnx --saveEngine=output/post_runner.engine --fp16  --useCudaGraph

To use TRT for inference:

python scripts/run_demo_tensorrt.py --onnx_dir output/ --left_file assets/left.png --right_file assets/right.png --intrinsic_file assets/K.txt --out_dir output/ --remove_invisible 0 --denoise_cloud 1  --get_pc 1 --zfar 100

Internet-Scale Pseudo-Labeling

Real-world data offers greater diversity and realism than synthetic data. However, obtaining real stereo images with ground-truth metric depth annotation is notoriously difficult. To address this challenge, we propose an automatic data curation pipeline to generate pseudo-labels on internet-scale stereo images from Stereo4D dataset. Top: Pseudo-labeling pipeline on in-the-wild internet stereo data. Bottom: Visualization of our generated pseudo-labels.

Below are visualizations of the intermediate results in our pseudo-labeling process. In the rightmost column, green checkmark or red cross denotes whether samples are kept for training or not, based on the percentage of positive pixels in the consistency mask. Our data curation process can automatically discover failures on noisy internet data such as images containing subtitle (bottom), mosaic (2nd last row) and overly challenging samples that are unsuitable for training (top). The final pseudo-labels can also correct erroneous predictions from FoundationStereo on sky regions (5th row).

The dataset is available at HuggingFace: https://huggingface.co/datasets/nvidia/ffs_stereo4d

Citation

@article{wen2026fastfoundationstereo,
  title={{Fast-FoundationStereo}: Real-Time Zero-Shot Stereo Matching},
  author={Bowen Wen and Shaurya Dewan and Stan Birchfield},
  journal={CVPR},
  year={2026}
}

Contact

Please contact Bowen Wen (bowenw@nvidia.com) for questions and commercial inquiries.

Acknowledgement

We would like to thank Xutong Ren, Karsten Patzwaldt, Yonggan Fu, Saurav Muralidharan, Han Cai, Pavlo Molchanov, Yu Wang, Varun Praveen, Joseph Aribido and Jun Gao for their insightful early discussions for this project. We would also like to thank NVIDIA Isaac and TAO teams for their engineering support and valuable discussions. Thanks to the authors of FoundationStereo, Selective-IGEV, Stereo4D and RAFT-Stereo for their code release. Finally, thanks to CVPR reviewers and AC for their appreciation of this work and constructive feedback.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
assets		assets
core		core
demo_data		demo_data
docker		docker
output_docker		output_docker
scripts		scripts
weights		weights
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
Utils.py		Utils.py
model_card.md		model_card.md
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Abstract

Environment setup

Weights and Trade-off

Run demo

ONNX/TRT

Internet-Scale Pseudo-Labeling

Citation

Contact

Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Languages

Folders and files

Latest commit

History

Repository files navigation

Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Abstract

Environment setup

Weights and Trade-off

Run demo

ONNX/TRT

Internet-Scale Pseudo-Labeling

Citation

Contact

Acknowledgement

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Languages

Packages