Abstract
With the advancement of Industry 4.0 and intelligent manufacturing, there is a growing demand for safe and reliable equipment operation. Abnormal sounds often precede mechanical failures, so their accurate detection is critical for accident prevention and operational efficiency. To address the limitations of existing methods, including heavy reliance on handcrafted acoustic features and model structures, insufficient representation of fine detail, and poor cross-device robustness, this paper proposes an end-to-end detection framework based on Pre-trained Representation-driven and Multi-domain Feature Fusion (PReMFF). Specifically, we fine-tune the large-scale pre-trained model Wav2vec 2.0 to extract generalized acoustic features. To further improve performance, we introduce two specialized modules: an adaptive frequency band enhancement module that highlights key frequency components, and a multi-scale dilated causal temporal modeling module that captures long-range dependencies in the time domain. The three feature streams are then fused through a gating mechanism and jointly supervised by the classifier and loss function, achieving 94.29% AUC and 88.88% pAUC on the DCASE 2020 Task 2 dataset. Experiments on the MIMII dataset verify the framework's ability to adapt quickly and generalize robustly to new equipment and complex noise, indicating that it offers an efficient and practical solution for intelligent monitoring of industrial sites.
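The gated fusion step described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the class name, feature dimensions, and the softmax-normalized per-stream gating are all assumptions made for the sake of a runnable sketch of fusing three pooled feature streams (pre-trained, frequency-enhanced, and temporal).

```python
import torch
import torch.nn as nn


class GatedTriFusion(nn.Module):
    """Illustrative gated fusion of three equally sized feature streams.

    All names and shapes here are hypothetical; the sketch only shows the
    general pattern of learning per-stream gate weights and classifying
    the weighted sum.
    """

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # One gate logit per stream, computed from the concatenated features.
        self.gate = nn.Linear(3 * dim, 3)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, f_pre, f_freq, f_time):
        # Each input: (batch, dim) pooled features from one branch.
        stacked = torch.stack([f_pre, f_freq, f_time], dim=1)   # (B, 3, D)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)  # (B, 3)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)    # (B, D)
        return self.classifier(fused)


# Toy usage: three 256-dim feature streams for a batch of 4 clips.
streams = [torch.randn(4, 256) for _ in range(3)]
logits = GatedTriFusion(256, 6)(*streams)
print(logits.shape)  # torch.Size([4, 6])
```

In practice the gate lets the network down-weight a branch whose features are uninformative for a given clip, which is one common way to combine heterogeneous representations before a shared classifier.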
Data availability
No datasets were generated or analysed during the current study.
References
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460 (2020)
Wang, C., Liu, C., Liao, M., Yang, Q.: An enhanced diagnosis method for weak fault features of bearing acoustic emission signal based on compressed sensing. Math. Biosci. Eng. 18, 1670–1688 (2021)
Suefusa, K., Nishida, T., Purohit, H., Tanabe, R., Endo, T., Kawaguchi, Y.: Anomalous sound detection based on interpolation deep neural network. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 271–275 (2020). https://doi.org/10.1109/ICASSP40776.2020.9054344
Jiang, A., Zhang, W.-Q., Deng, Y., Fan, P., Liu, J.: Unsupervised anomaly detection and localization of machine audio: A gan-based approach. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023). https://doi.org/10.1109/ICASSP49357.2023.10096813
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10002 (2021). https://doi.org/10.1109/ICCV48922.2021.00986
Giri, R., Tenneti, S.V., Cheng, F., Helwani, K., Isik, U., Krishnaswamy, A.: Self-supervised classification for detecting anomalous sounds (2020)
Dohi, K., Endo, T., Purohit, H., Tanabe, R., Kawaguchi, Y.: Flow-based self-supervised density estimation for anomalous sound detection. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 336–340 (2021). https://doi.org/10.1109/ICASSP39728.2021.9414662
Liu, Y., Guan, J., Zhu, Q., Wang, W.: Anomalous sound detection using spectral-temporal information fusion. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 816–820 (2022). https://doi.org/10.1109/ICASSP43922.2022.9747868
Guan, J., Xiao, F., Liu, Y., Zhu, Q., Wang, W.: Anomalous sound detection using audio representation with machine id based contrastive learning pretraining. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023). https://doi.org/10.1109/ICASSP49357.2023.10096054
Zhang, Y., Liu, J., Tian, Y., Liu, H., Li, M.: A dual-path framework with frequency-and-time excited network for anomalous sound detection. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1266–1270 (2024). https://doi.org/10.1109/ICASSP48485.2024.10448126
Han, B., Lv, Z., Jiang, A., Huang, W., Chen, Z., Deng, Y., Ding, J., Lu, C., Zhang, W.-Q., Fan, P., Liu, J., Qian, Y.: Exploring large scale pre-trained models for robust machine anomalous sound detection. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1326–1330 (2024). https://doi.org/10.1109/ICASSP48485.2024.10447183
Jiang, A., Han, B., Lv, Z., Deng, Y., Zhang, W.-Q., Chen, X., Qian, Y., Liu, J., Fan, P.: Anopatch: Towards better consistency in machine anomalous sound detection. arXiv preprint arXiv:2406.11364 (2024)
Yang, H., Liu, Z., Ma, N., Wang, X., Liu, W., Wang, H., Zhan, D., Hu, Z.: Csrm-mim: A self-supervised pre-training method for detecting catenary support components in electrified railways. IEEE Trans. Transp. Electrif. (2025)
Yan, J., Cheng, Y., Zhang, F., Li, M., Zhou, N., Jin, B., Wang, H., Yang, H., Zhang, W.: Research on multimodal techniques for arc detection in railway systems with limited data. Struct. Health Monit. 14759217251336797 (2025)
Chen, S., Liu, Y., Gao, X., Han, Z.: Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In: Zhou, J., Wang, Y., Sun, Z., Jia, Z., Feng, J., Shan, S., Ubul, K., Guo, Z. (eds.) Biometric Recognition, pp. 428–438. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-97909-0_46
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)
Koizumi, Y., Kawaguchi, Y., Imoto, K., Nakamura, T., Nikaido, Y., Tanabe, R., Purohit, H., Suefusa, K., Endo, T., Yasuda, M., et al.: Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring. arXiv preprint arXiv:2006.05822 (2020). https://doi.org/10.48550/arXiv.2006.05822
Neri, M., Carli, M.: Low-complexity attention-based unsupervised anomalous sound detection exploiting separable convolutions and angular loss. IEEE Sensors Letters 8(11), 1–4 (2024). https://doi.org/10.1109/LSENS.2024.3480450
Wang, Y., Zhang, Q., Zhang, W., Zhang, Y.: A lightweight framework for unsupervised anomalous sound detection based on selective learning of time-frequency domain features. Appl. Acoust. 228, 110308 (2025)
Purohit, H., Tanabe, R., Ichige, K., Endo, T., Nikaido, Y., Suefusa, K., Kawaguchi, Y.: MIMII dataset: Sound dataset for malfunctioning industrial machine investigation and inspection. arXiv preprint arXiv:1909.09347 (2019). https://doi.org/10.48550/arXiv.1909.09347
Chandrakala, S., Pidikiti, A., Sai Mahathi, P.: Spectro temporal fusion with clstm-autoencoder based approach for anomalous sound detection. Neural Process. Lett. 56(1), 39 (2024). https://doi.org/10.1007/s11063-024-11485-4
Author information
Contributions
J.W., H.S., and Q.W. conducted research conceptualization and data collection. J.W., C.L., and K.X. performed formal analysis and investigation. J.W. wrote the main manuscript text. J.W., H.S., and C.L. reviewed and edited the manuscript. All authors reviewed the final version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wei, J., Sun, H., Li, C. et al. Pre-trained representation-driven and multi-domain feature fusion method for anomalous sound detection. SIViP 19, 1063 (2025). https://doi.org/10.1007/s11760-025-04666-8

