{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,18]],"date-time":"2026-04-18T13:32:10Z","timestamp":1776519130745,"version":"3.51.2"},"reference-count":56,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2023,4,16]],"date-time":"2023-04-16T00:00:00Z","timestamp":1681603200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Self-driving vehicles must be controlled by navigation algorithms that ensure safe driving for passengers, pedestrians and other vehicle drivers. One of the key factors to achieve this goal is the availability of effective multi-object detection and tracking algorithms, which allow to estimate position, orientation and speed of pedestrians and other vehicles on the road. The experimental analyses conducted so far have not thoroughly evaluated the effectiveness of these methods in road driving scenarios. To this aim, we propose in this paper a benchmark of modern multi-object detection and tracking methods applied to image sequences acquired by a camera installed on board the vehicle, namely, on the videos available in the BDD100K dataset. The proposed experimental framework allows to evaluate 22 different combinations of multi-object detection and tracking methods using metrics that highlight the positive contribution and limitations of each module of the considered algorithms. The analysis of the experimental results points out that the best method currently available is the combination of ConvNext and QDTrack, but also that the multi-object tracking methods applied on road images must be substantially improved. 
Thanks to our analysis, we conclude that the evaluation metrics should be extended by considering specific aspects of the autonomous driving scenarios, such as multi-class problem formulation and distance from the targets, and that the effectiveness of the methods must be evaluated by simulating the impact of the errors on driving safety.<\/jats:p>","DOI":"10.3390\/s23084024","type":"journal-article","created":{"date-parts":[[2023,4,17]],"date-time":"2023-04-17T02:26:02Z","timestamp":1681698362000},"page":"4024","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":27,"title":["Benchmarking 2D Multi-Object Detection and Tracking Algorithms in Autonomous Vehicle Driving Scenarios"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4271-8160","authenticated-orcid":false,"given":"Diego","family":"Gragnaniello","sequence":"first","affiliation":[{"name":"Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, 84084 Fisciano, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5495-2432","authenticated-orcid":false,"given":"Antonio","family":"Greco","sequence":"additional","affiliation":[{"name":"Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, 84084 Fisciano, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4687-7994","authenticated-orcid":false,"given":"Alessia","family":"Saggese","sequence":"additional","affiliation":[{"name":"Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, 84084 Fisciano, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2948-741X","authenticated-orcid":false,"given":"Mario","family":"Vento","sequence":"additional","affiliation":[{"name":"Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, 84084 Fisciano, 
Italy"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-0371-738X","authenticated-orcid":false,"given":"Antonio","family":"Vicinanza","sequence":"additional","affiliation":[{"name":"Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, 84084 Fisciano, Italy"}]}],"member":"1968","published-online":{"date-parts":[[2023,4,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Ahangar, M.N., Ahmed, Q.Z., Khan, F.A., and Hafeez, M. (2021). A survey of autonomous vehicles: Enabling communication technologies and challenges. Sensors, 21.","DOI":"10.3390\/s21030706"},{"key":"ref_2","first-page":"100551","article-title":"Autonomous Vehicles in 5G and beyond: A Survey","volume":"39","author":"Hakak","year":"2022","journal-title":"Veh. Commun."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"14643","DOI":"10.1109\/ACCESS.2022.3145972","article-title":"On the integration of enabling wireless technologies and sensor fusion for next-generation connected and autonomous vehicles","volume":"10","author":"Butt","year":"2022","journal-title":"IEEE Access"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"58443","DOI":"10.1109\/ACCESS.2020.2983149","article-title":"A survey of autonomous driving: Common practices and emerging technologies","volume":"8","author":"Yurtsever","year":"2020","journal-title":"IEEE Access"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1364","DOI":"10.1109\/TNNLS.2020.3043505","article-title":"A survey of end-to-end driving: Architectures and training methods","volume":"33","author":"Tampuu","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Prakash, A., Chitta, K., and Geiger, A. (2021, January 20\u201325). Multi-modal fusion transformer for end-to-end autonomous driving. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00700"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Greco, A., Rundo, L., Saggese, A., Vento, M., and Vicinanza, A. (2022, January 23). Imitation Learning for Autonomous Vehicle Driving: How Does the Representation Matter? Proceedings of the International Conference on Image Analysis and Processing (ICIAP), Lecce, Italy.","DOI":"10.1007\/978-3-031-06427-2_2"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Tampuu, A., Aidla, R., van Gent, J.A., and Matiisen, T. (2023). Lidar-as-camera for end-to-end driving. Sensors, 23.","DOI":"10.3390\/s23052845"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Alaba, S.Y., and Ball, J.E. (2022). A survey on deep-learning-based lidar 3d object detection for autonomous driving. Sensors, 22.","DOI":"10.36227\/techrxiv.20442858"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"5668","DOI":"10.1109\/JSEN.2020.3041615","article-title":"Multi-object detection and tracking, based on DNN, for autonomous vehicles: A review","volume":"21","author":"Ravindran","year":"2020","journal-title":"IEEE Sensors J."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"8077","DOI":"10.1109\/TITS.2021.3075749","article-title":"Vehicles Detection for Smart Roads Applications on Board of Smart Cameras: A Comparative Analysis","volume":"23","author":"Greco","year":"2021","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Li, J., Ding, Y., Wei, H.L., Zhang, Y., and Lin, W. (2022). SimpleTrack: Rethinking and Improving the JDE Approach for Multi-Object Tracking. Sensors, 22.","DOI":"10.3390\/s22155863"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Lu, Z., Rathod, V., Votel, R., and Huang, J. (2020, January 13\u201319). Retinatrack: Online single stage joint detection and tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01468"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"103514","DOI":"10.1016\/j.dsp.2022.103514","article-title":"A survey of modern deep learning based object detection models","volume":"126","author":"Zaidi","year":"2022","journal-title":"Digit. Signal Process."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"388","DOI":"10.1108\/AA-12-2021-0174","article-title":"A human activity-aware shared control solution for medical human\u2013robot interaction","volume":"42","author":"Su","year":"2022","journal-title":"Assem. Autom."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"6039","DOI":"10.1109\/LRA.2021.3089999","article-title":"Multi-sensor guided hand gesture recognition for a teleoperated robot using a recurrent neural network","volume":"6","author":"Qi","year":"2021","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"43905","DOI":"10.1109\/ACCESS.2018.2864672","article-title":"Multi-object tracking by flying cameras based on a forward-backward interaction","volume":"6","author":"Carletti","year":"2018","journal-title":"IEEE Access"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., and Yu, F. (2021, January 20\u201325). Quasi-dense similarity learning for multiple object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00023"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Carletti, V., Foggia, P., Greco, A., Saggese, A., and Vento, M. (2015, January 25\u201328). Automatic detection of long term parked cars. 
Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Karlsruhe, Germany.","DOI":"10.1109\/AVSS.2015.7301722"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., and Caine, B. (2020, January 13\u201319). Scalability in perception for autonomous driving: Waymo open dataset. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00252"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., and Darrell, T. (2020, January 13\u201319). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00271"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., Liu, W., and Wang, X. (2021). Bytetrack: Multi-object tracking by associating every detection box. arXiv.","DOI":"10.1007\/978-3-031-20047-2_1"},{"key":"ref_23","unstructured":"Li, S., Danelljan, M., Ding, H., Huang, T.E., and Yu, F. (2022). European Conference on Computer Vision (ECCV), Springer."},{"key":"ref_24","unstructured":"Yan, B., Jiang, Y., Sun, P., Wang, D., Yuan, Z., Luo, P., and Lu, H. (2022). European Conference on Computer Vision (ECCV), Springer."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Tan, M., Pang, R., and Le, Q.V. (2020, January 13\u201319). 
Efficientdet: Scalable and efficient object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"ref_27","unstructured":"Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon, Y., Michael, K., Fang, J. (2023, January 01). ultralytics\/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Available online: https:\/\/zenodo.org\/record\/7347926#.ZDZQX3ZBw2w."},{"key":"ref_28","unstructured":"Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). YOLOX: Exceeding YOLO Series in 2021. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"3349","DOI":"10.1109\/TPAMI.2020.2983686","article-title":"Deep High-Resolution Representation Learning for Visual Recognition","volume":"43","author":"Wang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 10\u201317). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022, January 18\u201324). A ConvNet for the 2020s. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Wojke, N., Bewley, A., and Paulus, D. (2017, January 17\u201320). Simple online and realtime tracking with a deep association metric. 
Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8296962"},{"key":"ref_33","first-page":"726","article-title":"Do Different Tracking Tasks Require Different Appearance Models?","volume":"34","author":"Wang","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"3069","DOI":"10.1007\/s11263-021-01513-4","article-title":"Fairmot: On the fairness of detection and re-identification in multiple object tracking","volume":"129","author":"Zhang","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1155\/2008\/246309","article-title":"Evaluating multiple object tracking performance: The clear mot metrics","volume":"2008","author":"Bernardin","year":"2008","journal-title":"EURASIP J. Image Video Process."},{"key":"ref_36","unstructured":"Ristani, E., Solera, F., Zou, R., Cucchiara, R., and Tomasi, C. (2016). European Conference on Computer Vision, Springer."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"548","DOI":"10.1007\/s11263-020-01375-2","article-title":"Hota: A higher order metric for evaluating multi-object tracking","volume":"129","author":"Luiten","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1016\/j.neucom.2019.11.023","article-title":"Deep learning in video multi-object tracking: A survey","volume":"381","author":"Ciaparrone","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Guo, S., Wang, S., Yang, Z., Wang, L., Zhang, H., Guo, P., Gao, Y., and Guo, J. (2022). A Review of Deep Learning-Based Visual Multi-Object Tracking Algorithms for Autonomous Driving. Appl. 
Sci., 12.","DOI":"10.3390\/app122110741"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"6400","DOI":"10.1007\/s10489-021-02293-7","article-title":"Deep learning in multi-object detection and tracking: State of the art","volume":"51","author":"Pal","year":"2021","journal-title":"Appl. Intell."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"116300","DOI":"10.1016\/j.eswa.2021.116300","article-title":"Data association in multiple object tracking: A survey of recent techniques","volume":"192","author":"Rakai","year":"2022","journal-title":"Expert Syst. Appl."},{"key":"ref_42","unstructured":"Wang, Z., Zheng, L., Liu, Y., Li, Y., and Wang, S. (2020). European Conference on Computer Vision, Springer."},{"key":"ref_43","unstructured":"Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., and Wei, Y. (2022). European Conference on Computer Vision (ECCV), Springer."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Chu, P., Wang, J., You, Q., Ling, H., and Liu, Z. (2023, January 2\u20137). Transmot: Spatial-temporal graph transformer for multiple object tracking. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV56688.2023.00485"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Meinhardt, T., Kirillov, A., Leal-Taixe, L., and Feichtenhofer, C. (2022, January 18\u201324). Trackformer: Multi-object tracking with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00864"},{"key":"ref_46","unstructured":"Milan, A., Leal-Taix\u00e9, L., Reid, I., Roth, S., and Schindler, K. (2016). MOT16: A benchmark for multi-object tracking. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Pereira, R., Carvalho, G., Garrote, L., and Nunes, U.J. (2022). 
Sort and deep-SORT based multi-object tracking for mobile robotics: Evaluation with new data association metrics. Appl. Sci., 12.","DOI":"10.3390\/app12031319"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. (2016, January 25\u201328). Simple online and realtime tracking. Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA.","DOI":"10.1109\/ICIP.2016.7533003"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Du, Y., Zhao, Z., Song, Y., Zhao, Y., Su, F., Gong, T., and Meng, H. (2023). StrongSORT: Make DeepSORT Great Again. IEEE Trans. Multimed., 1\u201314.","DOI":"10.1109\/TMM.2023.3240881"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., and Urtasun, R. (2012, January 16\u201321). Are we ready for autonomous driving? The KITTI vision benchmark suite. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"ref_51","unstructured":"Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., and Leal-Taix\u00e9, L. (2020). Mot20: A benchmark for multi object tracking in crowded scenes. arXiv."},{"key":"ref_52","unstructured":"Bergmann, P., Meinhardt, T., and Leal-Taixe, L. (November, January 27). Tracking Without Bells and Whistles. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea."},{"key":"ref_53","unstructured":"Tan, M., and Le, Q. (2019, January 9\u201315). Efficientnet: Rethinking model scaling for convolutional neural networks. Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA."},{"key":"ref_54","unstructured":"Bochkovskiy, A., Wang, C.Y., and Liao, H.Y.M. (2020). Yolov4: Optimal speed and accuracy of object detection. 
arXiv."},{"key":"ref_55","unstructured":"Redmon, J., and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv."},{"key":"ref_56","unstructured":"Jonathon Luiten, A.H. (2023, January 01). TrackEval. Available online: https:\/\/github.com\/JonathonLuiten\/TrackEval."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/8\/4024\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:16:43Z","timestamp":1760123803000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/8\/4024"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,16]]},"references-count":56,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2023,4]]}},"alternative-id":["s23084024"],"URL":"https:\/\/doi.org\/10.3390\/s23084024","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,16]]}}}