How Data Augmentation Affects Evolutionary Algorithms in Feature Selection: An Experimental Study

  • Original Research
  • Published in SN Computer Science

Abstract

The rapid growth of machine learning has led to an increase in the number of features used to represent data, often resulting in superfluous or irrelevant features that degrade model performance. Feature selection techniques address this issue by identifying the smallest subset of relevant features. This paper examines the integration of a novel data augmentation algorithm with evolutionary algorithms for feature selection across ten datasets from widely varying domains. We investigate the effectiveness of Genetic Algorithms, Particle Swarm Optimization, and Differential Evolution on datasets augmented by 10–50% and compare their performance with standard filter-based and wrapper methods. Our experiments demonstrate that data augmentation significantly boosts the efficacy of evolutionary algorithms, improving accuracy by up to 5% and reducing feature sets by an average of 40%. While Differential Evolution generally outperforms the other algorithms, our findings reveal that the efficacy of combining data augmentation and feature selection varies across datasets. Optimal performance is typically observed at 30–50% augmentation, though excessive augmentation can occasionally lead to slight performance degradation, emphasizing the need for careful calibration. This research paves the way for future studies on the interplay between data augmentation and feature selection, including investigations into explainability and generalizability across different machine learning paradigms. By providing insight into this complex interplay, our study contributes to the development of more robust and efficient algorithms across various domains.
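As a rough illustration of the evolutionary feature selection the abstract describes, the sketch below evolves binary feature masks with a simple genetic algorithm. It is not the authors' implementation: the correlation-based fitness, the size penalty, and all GA parameters are illustrative assumptions (a real wrapper method would score each mask by classifier accuracy on the augmented data, which is omitted here).

```python
import random

def subset_score(mask, X, y):
    """Toy filter-style fitness: mean |correlation| of the selected
    features with the target, minus a small penalty per feature kept."""
    if not any(mask):
        return 0.0
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
        sa = sum((p - ma) ** 2 for p in a) ** 0.5
        sb = sum((q - mb) ** 2 for q in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0
    cols = [[row[j] for row in X] for j, bit in enumerate(mask) if bit]
    return sum(abs(corr(c, y)) for c in cols) / len(cols) - 0.01 * sum(mask)

def ga_feature_select(X, y, pop_size=30, gens=40, p_mut=0.05, seed=0):
    """Evolve a population of binary masks; each 1-bit marks a kept feature."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda m: subset_score(m, X, y), reverse=True)
        children = pop[:2]                      # elitism: keep the two best
        while len(children) < pop_size:
            a, b = rng.sample(pop[:10], 2)      # pick parents among the fittest
            cut = rng.randrange(1, n_feat)      # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip
            children.append(child)
        pop = children
    return max(pop, key=lambda m: subset_score(m, X, y))

# Synthetic demo: feature 0 equals the target, the other four are pure noise.
rng = random.Random(42)
X = [[float(i)] + [rng.random() for _ in range(4)] for i in range(60)]
y = [row[0] for row in X]
best = ga_feature_select(X, y)
```

On this synthetic data, where only feature 0 carries signal, the evolved mask retains feature 0, and the per-feature penalty pushes the GA toward dropping the noise columns.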

Algorithm 1
Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Data availability

The data are available from the corresponding author upon request.

Funding

The research leading to these results received funding from the project "Ecosistema dell'innovazione - Rome Technopole", financed by the EU under the NextGenerationEU plan through MUR Decree no. 1051 of 23.06.2022, CUP H33C22000420001. This study was also partially funded by the EU under NextGenerationEU, Missione 4 Componente 1, CUP B53D23019150006 (PRIN 2022 "LBDigital" project) and CUP H53D23003680006MU (PRIN 2022 "SHAPE-AD" project).

Author information

Contributions

All authors contributed equally.

Corresponding author

Correspondence to Francesco Fontanella.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Additional Results

See Table 8.

Table 8 Data augmentation and feature selection results: average accuracy and standard deviation over 20 runs for every feature selection technique and data augmentation (DA) percentage

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Nardone, E., D’Alessandro, T., De Stefano, C. et al. How Data Augmentation Affects Evolutionary Algorithms in Feature Selection: An Experimental Study. SN COMPUT. SCI. 6, 536 (2025). https://doi.org/10.1007/s42979-025-04049-3
