Comparative study of ML models for IIoT intrusion detection: impact of data preprocessing and balancing

  • Original Article
  • Neural Computing and Applications

A Correction to this article was published on 08 May 2024

This article has been updated

Abstract

This study investigates the effectiveness of six prominent machine learning models (random forest, decision trees, K-nearest neighbor, logistic regression, support vector machines, and Naïve Bayes) for intrusion detection systems in industrial Internet of Things (IIoT) environments. The evaluation covers the effects of data preprocessing techniques, including feature engineering, data normalization, recoding, and missing data mitigation. The research also examines dataset balancing, comparing the effects of six different techniques on model performance. The investigations are conducted using the domain-specific WUSTL-IIOT-2021 dataset, which captures the unique characteristics of IIoT data. The study further investigates multi-class attack identification, using an innovative SMOTE-based multi-class balancing approach to tackle dataset imbalances. The results indicate that data preprocessing and intelligent dataset balancing produce consistent enhancements in the classification performance of the selected models across binary and multi-class classification tasks. Random forest emerges as the standout algorithm, delivering consistently high performance with computational efficiency.
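
The abstract does not spell out the SMOTE-based multi-class balancing procedure, so the following is only a minimal sketch of the general SMOTE idea applied per class, not the authors' method. It assumes numeric feature vectors, oversamples every non-majority class up to the majority-class count, and uses only the Python standard library. The function name `smote_multiclass`, the parameter `k`, and the toy class labels are hypothetical and chosen for illustration.

```python
import math
import random
from collections import Counter


def smote_multiclass(X, y, k=3, seed=0):
    """Oversample every non-majority class up to the majority-class count.

    X is a list of numeric feature vectors and y the matching class labels.
    Each synthetic point is a random interpolation between a real sample and
    one of its k nearest same-class neighbours (the core SMOTE idea).
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(row) for row in X], list(y)

    for label, count in counts.items():
        members = [row for row, lab in zip(X, y) if lab == label]
        if count == target or len(members) < 2:
            continue  # majority class, or too few points to interpolate
        for _ in range(target - count):
            base = rng.choice(members)
            # k nearest same-class neighbours of the chosen base sample
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Toy, made-up data: one majority class and two minority classes.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.2, 0.2],
     [5.0, 5.0], [5.2, 5.1],
     [9.0, 0.0], [9.1, 0.3], [8.8, 0.2]]
y = ["normal"] * 6 + ["dos"] * 2 + ["recon"] * 3

X_bal, y_bal = smote_multiclass(X, y)
print(Counter(y_bal))  # every class should now have six samples
```

In practice, balancing of this kind is usually applied to the training split only, so that synthetic samples never leak into the evaluation data. A maintained implementation such as the `SMOTE` class in the imbalanced-learn library would normally be preferred over a hand-rolled version.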


Data availability

This study relies on a publicly available dataset (WUSTL-IIOT-2021). This dataset is available from the following reference: Zolanvari, M., Gupta, L., Khan, K. M., & Jain, R. (2021), WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research, Washington University in St. Louis, USA.

References

  1. Stouffer K, Pillitteri V, Lightman S, et al (2015) Guide to industrial control systems (ICS) security NIST special publication 800–82 revision 2, pp 1–157

  2. Smadi AA, Ajao BT, Johnson BK et al (2021) A comprehensive survey on cyber-physical smart grid testbed architectures: requirements and challenges. Electronics 10:1043. https://doi.org/10.3390/electronics10091043

  3. Bonetto R, Sychev I, Zhdanenko O, et al (2020) Smart grids for smarter cities. In: 2020 IEEE 17th annual consumer communications and networking conference (CCNC). https://doi.org/10.1109/CCNC46108.2020.9045309

  4. Attar H (2023) Joint IoT/ML platforms for smart societies and environments: a review on multimodal information-based learning for safety and security. J Data Inf Qual. https://doi.org/10.1145/3603713

  5. Calabretta M, Pecori R, Vecchio M, Veltri L (2018) MQTT-AUTH: a token-based solution to endow MQTT with authentication and authorization capabilities. J Commun Softw Syst 14:320–331. https://doi.org/10.24138/jcomss.v14i4.604

  6. Calabretta M, Pecori R, Veltri L (2018) A token-based protocol for securing MQTT communications. In: Proceedings of the 26th international conference on software, telecommunications and computer networks, SoftCOM 2018, pp 373–378. https://doi.org/10.23919/SOFTCOM.2018.8555834

  7. Nti IK, Adekoya AF, Narko-Boateng O, Somanathan AR (2022) Stacknet based decision fusion classifier for network intrusion detection. Int Arab J Inf Technol 19:478–490. https://doi.org/10.34028/iajit/19/3A/8

  8. Abdul Rahman Al-chikh Omar A, Soudan B, Ala’ Altaweel (2023) A comprehensive survey on detection of sinkhole attack in routing over low power and Lossy network for internet of things. Internet Things (Netherlands). https://doi.org/10.1016/j.iot.2023.100750

  9. Samara G, Aljaidi M, Alazaidah R, et al (2023) A comprehensive review of machine learning-based intrusion detection techniques for IoT networks. In: Artificial intelligence, Internet of Things, and society 5.0. pp 465–473

  10. Manderna A, Kumar S, Dohare U et al (2023) Vehicular Network Intrusion Detection Using a Cascaded Deep Learning Approach with Multi-Variant Metaheuristic. Sensors 23:8772. https://doi.org/10.3390/s23218772

  11. Alamleh A, Albahri OS, Zaidan AA et al (2023) Federated Learning for IoMT Applications: A Standardization and Benchmarking Framework of Intrusion Detection Systems. IEEE J Biomed Heal Informatics 27:878–887. https://doi.org/10.1109/JBHI.2022.3167256

  12. Surakhi O, García A, Jamoos M, Alkhanafseh M (2022) The Intrusion detection system by deep learning methods: issues and challenges. Int Arab J Inf Technol 19:501–513. https://doi.org/10.34028/iajit/19/3A/10

  13. Keliris A, Salehghaffari H, Cairl B, et al (2016) Machine learning-based defense against process-aware attacks on industrial control systems. In: Proceedings of 2016 IEEE international test conference (ITC), pp 1–10. https://doi.org/10.1109/TEST.2016.7805855

  14. Ullah I, Mahmoud QH (2017) A hybrid model for anomaly-based intrusion detection in SCADA networks. In: Proceedings of 2017 IEEE international conference on big data (big data), pp 2160–2167. https://doi.org/10.1109/BigData.2017.8258164

  15. Vulfin AM, Vasilyev VI, Kuharev SN et al (2021) Algorithms for detecting network attacks in an enterprise industrial network based on data mining algorithms. J Phys Conf Ser. https://doi.org/10.1088/1742-6596/2001/1/012004

  16. Beaver JM, Borges-Hink RC, Buckner MA (2013) An evaluation of machine learning methods to detect malicious SCADA communications. In: Proceedings of 2013 12th international conference on machine learning and applications ICMLA, vol 2, pp 54–59. https://doi.org/10.1109/ICMLA.2013.105

  17. Zhang Y, Ilić MD, Tonguz OK (2011) Mitigating blackouts via smart relays: a machine learning approach. Proc IEEE 99:94–118. https://doi.org/10.1109/JPROC.2010.2072970

  18. Maglaras LA, Jiang J (2014) Intrusion detection in SCADA systems using machine learning techniques. In: Proceedings of 2014 science and information conference, pp 626–631. https://doi.org/10.1109/SAI.2014.6918252

  19. Song Y, Luo W, Li J, et al (2021) SDN-based Industrial Internet Security Gateway. In: 2021 International conference on security, pattern analysis, and cybernetics (SPAC), pp 238–243. https://doi.org/10.1109/SPAC53836.2021.9539961

  20. Zolanvari M, Teixeira MA, Gupta L et al (2019) Machine learning-based network vulnerability analysis of industrial Internet of Things. IEEE Internet Things J 6:6822–6834. https://doi.org/10.1109/JIOT.2019.2912022

  21. Teixeira MA, Gupta L, Khan KM, Machine RJ (2021) WUSTL-IIOT-2021 dataset for IIoT cybersecurity research. Washington University, St. Louis

  22. Siebert J, Joeckel L, Heidrich J et al (2022) Construction of a quality model for machine learning systems. Softw Qual J 30:307–335. https://doi.org/10.1007/s11219-021-09557-y

  23. Sarker IH (2021) Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput Sci. https://doi.org/10.1007/s42979-021-00815-1

  24. Eid AM, Nassif AB, Soudan B, Injadat MN (2023) IIoT network intrusion detection using machine learning. In: 2023 6th International conference on intelligent robotics and control engineering (IRCE). IEEE, pp 196–201

  25. Ting KM (1998) Inducing cost-sensitive trees via instance weighting. Lect Notes Comput Sci (Subser Lect Notes Artif Intell Lect Notes Bioinf) 1510:139–147. https://doi.org/10.1007/bfb0094814

  26. Zhang YP, Zhang LN, Wang YC (2010) Cluster-based majority under-sampling approaches for class imbalance learning. In: Proceedings of 2010 2nd IEEE international conference on information and financial engineering, pp 400–404. https://doi.org/10.1109/ICIFE.2010.5609385

  27. Richman R, Wuthrich MV (2020) Nagging predictors. SSRN Electron J. https://doi.org/10.2139/ssrn.3627163

  28. Mesevage TG (2021) Data cleaning steps and process to prep your data for success. MonkeyLearn, Montevideo

  29. Tableau (2022) Data cleaning: definition, benefits, and how-to. Tableau, Mountain View

  30. Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. https://doi.org/10.1186/s12864-019-6413-7

  31. Chicco D, Jurman G (2023) The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Min. https://doi.org/10.1186/s13040-023-00322-4

  32. Khafajeh H (2020) An efficient intrusion detection approach using light gradient boosting. J Theor Appl Inf Technol 98:825–835

Author information

Corresponding author

Correspondence to Bassel Soudan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

The authors would like to convey their thanks and appreciation to the “University of Sharjah” for supporting this work.

Informed consent

This study does not involve any experiments on animals.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised to correct the third author's name.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Eid, A.M., Soudan, B., Nassif, A.B. et al. Comparative study of ML models for IIoT intrusion detection: impact of data preprocessing and balancing. Neural Comput & Applic 36, 6955–6972 (2024). https://doi.org/10.1007/s00521-024-09439-x

  • DOI: https://doi.org/10.1007/s00521-024-09439-x

Profiles

  1. Ali Bou Nassif
  2. MohammadNoor Injadat