{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T20:40:19Z","timestamp":1770064819894,"version":"3.49.0"},"reference-count":17,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T00:00:00Z","timestamp":1769990400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100018693","name":"European Union","doi-asserted-by":"publisher","award":["101132308"],"award-info":[{"award-number":["101132308"]}],"id":[{"id":"10.13039\/100018693","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Monocular Depth Estimation (MDE) infers per-pixel scene geometry from a single RGB image. Despite recent progress, global MDE models often blur depth discontinuities at object boundaries and fail to capture object-level structure. Segment-aware depth estimation addresses this limitation by exploiting semantic segmentation to decompose depth prediction into simpler, class-specific subproblems. In this work, we study semantic-aware MDE in a multi-branch design where each semantic class is handled by a lightweight Vision Transformer (ViT) branch that predicts dense depth for its class while suppressing interference from other regions. We further examine fusion strategies that merge the branch outputs into a single prediction: (i) a learnable cross-attention fusion module that predicts depth from the stack of per-class proposals and masks, and (ii) a parameter-free stitched summation that sums mask-gated outputs. The proposed architecture is simple, scalable, end-to-end trainable, and compatible with arbitrary transformer backbones. 
Experiments on Virtual KITTI 2, where ground-truth depth and semantic labels are available, show that segment-aware modeling produces sharper depth boundaries and improves standard error metrics compared to a single-branch baseline (AbsRel 0.243\u21920.152; RMSE 11.952\u21929.101). Finally, we find that the parameter-free summation matches, and in most cases improves upon, the accuracy of learned fusion while adding no computational overhead.<\/jats:p>","DOI":"10.3390\/info17020145","type":"journal-article","created":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T09:00:33Z","timestamp":1770022833000},"page":"145","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["On Segment-Aware Monocular Depth Estimation Using Vision Transformers"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4320-3740","authenticated-orcid":false,"given":"Vasileios","family":"Arampatzakis","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece"},{"name":"Athena Research Center, 67100 Xanthi, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9909-1584","authenticated-orcid":false,"given":"George","family":"Pavlidis","sequence":"additional","affiliation":[{"name":"Athena Research Center, 67100 Xanthi, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0898-6102","authenticated-orcid":false,"given":"Nikolaos","family":"Mitianoudis","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2730-0006","authenticated-orcid":false,"given":"Nikos","family":"Papamarkos","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, 
Greece"}]}],"member":"1968","published-online":{"date-parts":[[2026,2,2]]},"reference":[{"key":"ref_1","unstructured":"Lee, S.H., Mo, S., and Yu, S.X. (2025, October 19). SHED Light on Segmentation for Depth Estimation. Proceedings of the Structural Priors for Vision Workshop at ICCV\u201925, Honolulu, HI, USA."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Jiao, J., Cao, Y., Song, Y., and Lau, R. (2018, September 8\u201314). Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01267-0_4"},{"key":"ref_3","unstructured":"Cabon, Y., Murray, N., and Humenberger, M. (2020). Virtual KITTI 2. arXiv."},{"key":"ref_4","unstructured":"Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. (2016, June 26\u2013July 1). Virtual worlds as proxy for multi-object tracking analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wang, L., Zhang, J., Wang, O., Lin, Z., and Lu, H. (2020, June 14\u201319). SDC-Depth: Semantic Divide-and-Conquer Network for Monocular Depth Estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00062"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Gao, N., He, F., Jia, J., Shan, Y., Zhang, H., Zhao, X., and Huang, K. (2022, June 18\u201324). PanopticDepth: A Unified Framework for Depth-Aware Panoptic Segmentation. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00168"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2396","DOI":"10.1109\/TPAMI.2023.3330944","article-title":"Monocular Depth Estimation: A Thorough Review","volume":"46","author":"Arampatzakis","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018, June 18\u201322). Deep Ordinal Regression Network for Monocular Depth Estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00214"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1623","DOI":"10.1109\/TPAMI.2020.3019967","article-title":"Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer","volume":"44","author":"Ranftl","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Chen, P.Y., Liu, A.H., Liu, Y.C., and Wang, Y.C.F. (2019, June 16\u201320). Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-Aware Representation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00273"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"109297","DOI":"10.1016\/j.patcog.2022.109297","article-title":"Learning depth via leveraging semantics: Self-supervised monocular depth estimation with both implicit and explicit semantic guidance","volume":"137","author":"Li","year":"2023","journal-title":"Pattern Recognit."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ladicky, L., Shi, J., and Pollefeys, M. (2014, June 23\u201328). Pulling Things out of Perspective. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.19"},{"key":"ref_13","first-page":"2983","article-title":"ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation","volume":"37","author":"Xing","year":"2023","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Kim, S.Y., Zhang, J., Niklaus, S., Fan, Y., Chen, S., Lin, Z., and Kim, M. (2022, June 18\u201324). Layered Depth Refinement with Mask Guidance. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00383"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Saeedan, F., and Roth, S. (2021, January 5\u20139). Boosting Monocular Depth with Panoptic Segmentation Maps. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual.","DOI":"10.1109\/WACV48630.2021.00390"},{"key":"ref_16","unstructured":"Eigen, D., Puhrsch, C., and Fergus, R. (2014). Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. Proceedings of the 28th International Conference on Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N. (2016, October 25\u201328). Deeper Depth Prediction with Fully Convolutional Residual Networks. 
Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.32"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/2\/145\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T09:21:43Z","timestamp":1770024103000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/2\/145"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,2]]},"references-count":17,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2026,2]]}},"alternative-id":["info17020145"],"URL":"https:\/\/doi.org\/10.3390\/info17020145","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,2]]}}}