Abstract
Dynamic stereo matching is the task of estimating consistent disparities from stereo videos with dynamic objects. Recent learning-based methods prioritize optimal performance on a single stereo pair, which results in temporal inconsistencies. Existing video methods apply per-frame matching and window-based cost aggregation across the time dimension, leading to low-frequency oscillations at the scale of the window size. To address this challenge, we develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. We further propose a novel framework, BiDAStereo, that achieves consistent dynamic stereo matching. Unlike existing methods, we model this task as local matching and global aggregation. Locally, we compute correlation in a triple-frame manner to pool information from adjacent frames and improve temporal consistency. Globally, to exploit the entire sequence’s consistency and extract dynamic scene cues for aggregation, we develop a motion-propagation recurrent unit. Extensive experiments show that our method improves prediction quality and achieves state-of-the-art results on commonly used benchmarks.
Notes
1. \(\textrm{TEPE}(\textbf{d}, \textbf{d}_{\textrm{gt}})=\sqrt{\sum _{t=1}^{T-1}((\textbf{d}^{t} - \textbf{d}^{t+1}) - (\textbf{d}_{\textrm{gt}}^{t} - \textbf{d}_{\textrm{gt}}^{t+1}))^{2}}\).
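The TEPE formula in the note above can be read as comparing how the predicted disparity changes between consecutive frames with how the ground truth changes. The following is a minimal pure-Python sketch of that reading, not the authors' evaluation code. The function name `tepe`, the flat per-frame pixel layout, and the choice to return one value per pixel (with no averaging over pixels) are assumptions made for illustration.

```python
import math


def tepe(d, d_gt):
    """Evaluate the footnote formula per pixel.

    d, d_gt: sequences of T frames. Each frame is a flat sequence of
    per-pixel disparities (the flat layout is an assumption of this sketch).
    Returns a list with one value per pixel.
    """
    if len(d) != len(d_gt):
        raise ValueError("prediction and ground truth need the same number of frames")
    if len(d) < 2:
        raise ValueError("at least two frames are required")

    n_pixels = len(d[0])
    squared_sum = [0.0] * n_pixels
    # The sum runs over the T-1 pairs of consecutive frames.
    for t in range(len(d) - 1):
        for i in range(n_pixels):
            pred_change = d[t][i] - d[t + 1][i]
            gt_change = d_gt[t][i] - d_gt[t + 1][i]
            squared_sum[i] += (pred_change - gt_change) ** 2
    return [math.sqrt(s) for s in squared_sum]


# Static ground truth: three frames, two pixels.
gt = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
# A prediction that flickers at the second pixel in the middle frame.
flicker = [[1.0, 2.0], [1.0, 5.0], [1.0, 2.0]]
# A prediction with a constant bias of 0.5 but no flicker.
biased = [[1.5, 2.5], [1.5, 2.5], [1.5, 2.5]]

flicker_error = tepe(flicker, gt)
biased_error = tepe(biased, gt)
```

In this toy input, the flickering pixel accumulates error from both frame transitions, while the constantly biased prediction scores zero. Under this reading, the quantity responds to temporal instability rather than to absolute disparity error.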
Acknowledgments
This work was funded by the Imperial College-China Scholarship Council.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jing, J., Mao, Y., Mikolajczyk, K. (2025). Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15118. Springer, Cham. https://doi.org/10.1007/978-3-031-73027-6_24
DOI: https://doi.org/10.1007/978-3-031-73027-6_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73026-9
Online ISBN: 978-3-031-73027-6
eBook Packages: Computer Science, Computer Science (R0)

