Abstract
This study investigates the effectiveness of six prominent machine learning models—random forest, decision trees, K-nearest neighbor, logistic regression, support vector machines, and Naïve Bayes—for intrusion detection systems in industrial Internet of Things environments. The evaluation encompasses the effects of data preprocessing techniques, including feature engineering, data normalization, recoding, and missing data mitigation. Furthermore, the research delves into dataset balancing, examining the effects of six different techniques on model performance. The investigations are conducted using the domain-specific WUSTL-IIOT-2021 dataset, which captures the unique characteristics of IIoT data. The study also investigates multi-class attack identification utilizing an innovative SMOTE-based multi-class balancing approach to tackle dataset imbalances. The results indicate that data preprocessing and intelligent dataset balancing produce consistent enhancements in the classification performance of the selected models across binary and multi-classification tasks. Random forest emerges as the standout algorithm, delivering consistently high performance with computational efficiency.
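The multi-class balancing idea mentioned above can be illustrated with a minimal sketch. It shows generic SMOTE-style interpolation applied per class, not the specific SMOTE-based approach developed in the study, whose details are not given in this abstract. The function name `smote_multiclass`, the parameter `k`, the toy feature values, and the class labels are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def smote_multiclass(X, y, k=3, seed=0):
    """Oversample every non-majority class up to the majority-class count.

    Generic SMOTE-style interpolation (an assumption for illustration),
    NOT the paper's specific method. X is a list of numeric feature
    lists, y the matching class labels.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue  # interpolation needs at least two samples in the class
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            # up to k nearest same-class neighbours of the base sample
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: math.dist(base, r),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: one majority class and two minority classes.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.0, 0.3],
     [5.0, 5.0], [5.2, 5.1],
     [9.0, 0.0], [9.1, 0.2], [9.3, 0.1]]
y = ["normal"] * 6 + ["attack_a"] * 2 + ["attack_b"] * 3

X_bal, y_bal = smote_multiclass(X, y)
print(Counter(y_bal))
```

Each synthetic sample lies on the line segment between two existing samples of the same class, so the minority classes grow without copying records verbatim. In practice, balancing of this kind is normally applied to the training split only, so that the test data keeps its original class distribution.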
Data availability
This study relies on a publicly available dataset (WUSTL-IIOT-2021). The dataset is available from the following reference: Zolanvari, M., Gupta, L., Khan, K. M., & Jain, R. (2021), WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research, Washington University in St. Louis, USA.
Change history
08 May 2024
A Correction to this paper has been published: https://doi.org/10.1007/s00521-024-09841-5
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
The authors would like to convey their thanks and appreciation to the “University of Sharjah” for supporting this work.
Informed consent
This study does not involve any experiments on animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised to correct the third author's name.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Eid, A.M., Soudan, B., Nassif, A.B. et al. Comparative study of ML models for IIoT intrusion detection: impact of data preprocessing and balancing. Neural Comput & Applic 36, 6955–6972 (2024). https://doi.org/10.1007/s00521-024-09439-x
DOI: https://doi.org/10.1007/s00521-024-09439-x