Abstract
The rapid growth of machine learning has led to an increase in the number of features used to represent data, often resulting in superfluous or irrelevant features that degrade model performance. Feature selection techniques address this issue by identifying the smallest subset of relevant features. This paper examines the integration of a novel data augmentation algorithm with evolutionary algorithms for feature selection across ten datasets from widely varying domains. We investigate the effectiveness of Genetic Algorithms, Particle Swarm Optimization, and Differential Evolution on datasets augmented by 10–50% and compare their performance with standard filter-based and wrapper methods. Our experiments demonstrate that data augmentation significantly boosts the efficacy of evolutionary algorithms, improving accuracy by up to 5% and reducing feature sets by an average of 40%. While Differential Evolution generally outperforms the other algorithms, our findings reveal that the efficacy of combining data augmentation and feature selection varies across datasets. Optimal performance is typically observed at 30–50% augmentation, though excessive augmentation can occasionally cause slight performance degradation, emphasizing the need for careful calibration. This research paves the way for future studies on the interplay between data augmentation and feature selection, including investigations into explainability and generalizability across different machine learning paradigms. By providing insight into this complex interplay, our study contributes to the development of more robust and efficient algorithms across various domains.
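The pipeline summarized above — augment the training data, then run an evolutionary wrapper search for a feature subset — can be illustrated with a minimal sketch. This is not the authors' implementation: the jitter-based augmentation, the function names (`augment`, `ga_select`, `fitness`), the kNN wrapper classifier, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch: jitter-based data augmentation followed by a simple
# genetic-algorithm wrapper for feature selection. Illustrative only.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def augment(X, y, ratio=0.3, noise=0.05):
    """Add `ratio` extra samples by jittering randomly chosen originals."""
    n_new = int(ratio * len(X))
    idx = rng.integers(0, len(X), n_new)
    X_new = X[idx] + rng.normal(0, noise * X.std(axis=0), (n_new, X.shape[1]))
    return np.vstack([X, X_new]), np.concatenate([y, y[idx]])

def fitness(mask, X, y):
    """Wrapper fitness: cross-validated accuracy on the selected features."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

def ga_select(X, y, pop_size=20, gens=15, p_mut=0.1):
    """Evolve a boolean feature mask maximizing wrapper accuracy."""
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5            # random initial masks
    for _ in range(gens):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        elite = pop[np.argsort(fit)[::-1][: pop_size // 2]]  # truncation selection
        children = elite.copy()
        cuts = rng.integers(1, n, len(children))     # one-point crossover
        for child, k in zip(children, cuts):
            child[k:] = elite[rng.integers(len(elite))][k:]
        children ^= rng.random(children.shape) < p_mut   # bit-flip mutation
        pop = np.vstack([elite, children])
    fit = np.array([fitness(ind, X, y) for ind in pop])
    return pop[fit.argmax()]

X, y = load_wine(return_X_y=True)
X_aug, y_aug = augment(X, y, ratio=0.3)   # 30% augmentation
best = ga_select(X_aug, y_aug)
print(f"selected {best.sum()} of {X.shape[1]} features")
```

The same mask-based encoding carries over to PSO and Differential Evolution; only the update rule that produces new candidate masks changes.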






Data availability
The data are available from the corresponding author upon request.
Funding
The research leading to these results has received funding from the project "Ecosistema dell'innovazione - Rome Technopole", financed by the EU under the NextGenerationEU plan through MUR Decree n. 1051 of 23.06.2022 - CUP H33C22000420001. This study was partially funded by the EU under NextGenerationEU Missione 4 Componente 1, CUP B53D23019150006 (PRIN 2022 "LBDigital" project), and CUP H53D23003680006MU (PRIN 2022 "SHAPE-AD" project).
Author information
Contributions
All authors contributed equally.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Additional Results
See Table 8.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nardone, E., D’Alessandro, T., De Stefano, C. et al. How Data Augmentation Affects Evolutionary Algorithms in Feature Selection: An Experimental Study. SN COMPUT. SCI. 6, 536 (2025). https://doi.org/10.1007/s42979-025-04049-3

