Pre-trained representation-driven and multi-domain feature fusion method for anomalous sound detection

  • Original Paper
  • Signal, Image and Video Processing

Abstract

With the advancement of Industry 4.0 and intelligent manufacturing, demand is growing for safer and more reliable equipment operation. Abnormal sounds often precede mechanical failures, so detecting them accurately is vital for preventing accidents and maintaining operational efficiency. Existing methods rely heavily on handcrafted acoustic features and model structures, represent fine detail insufficiently, and generalize poorly across devices. To address these limitations, this paper proposes an end-to-end detection framework based on Pre-trained Representation-driven and Multi-domain Feature Fusion (PReMFF). Specifically, we fine-tune a large-scale pre-trained model, Wav2vec 2.0, to extract generalized acoustic features. To further improve performance, we introduce two specialized modules: an adaptive frequency band enhancement module that highlights key frequency components, and a multi-scale dilated causal temporal modeling module that captures long-range dependencies in the time domain. Finally, the features from the three branches are fused through a gating mechanism and jointly supervised by the classifier and loss function, achieving 94.29% AUC and 88.88% pAUC on the DCASE 2020 Task 2 dataset. Experiments on the MIMII dataset further verify that the method adapts quickly and generalizes robustly to new equipment and complex noise, indicating that it offers an efficient and feasible solution for intelligent monitoring of industrial sites.
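
The abstract names two building blocks, dilated causal temporal modeling and gated fusion of three feature branches, without giving implementation details. The following plain-Python sketch illustrates the general techniques only; it is not the authors' PReMFF implementation. All function names, the dilation rates (1, 2, 4), the averaging across scales, and the softmax gating are assumptions made for illustration.

```python
import math

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: y[t] uses only x[t], x[t-d], x[t-2d], ..."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding keeps the output causal
                acc += w * x[idx]
        y.append(acc)
    return y

def multi_scale_temporal(x, kernel, dilations=(1, 2, 4)):
    """Average the outputs of several dilation rates (assumed aggregation)."""
    outs = [causal_dilated_conv(x, kernel, d) for d in dilations]
    return [sum(vals) / len(outs) for vals in zip(*outs)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(branches, gate_scores):
    """Weighted sum of equally sized feature vectors; the gates sum to 1."""
    gates = softmax(gate_scores)
    dim = len(branches[0])
    return [sum(g * b[i] for g, b in zip(gates, branches)) for i in range(dim)]

# Usage: an impulse shows how dilation spreads the receptive field backwards in time.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = causal_dilated_conv(impulse, [1.0, 0.5], dilation=2)

# Three toy branch features (e.g. pre-trained, frequency, temporal) with equal gate scores.
fused = gated_fusion([[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]], [0.0, 0.0, 0.0])
```

In a trained model the gate scores would be produced by learned layers from the branch features rather than passed in by hand; they are fixed inputs here so that the fusion arithmetic stays easy to follow.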

Data availability

No datasets were generated or analysed during the current study.

Author information

Contributions

J.W., H.S., and Q.W. conceptualized the research and collected the data. J.W., C.L., and K.X. performed the formal analysis and investigation. J.W. wrote the main manuscript text. J.W., H.S., and C.L. reviewed and edited the manuscript. All authors reviewed the final version.

Corresponding author

Correspondence to Hongjun Sun.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wei, J., Sun, H., Li, C. et al. Pre-trained representation-driven and multi-domain feature fusion method for anomalous sound detection. SIViP 19, 1063 (2025). https://doi.org/10.1007/s11760-025-04666-8

  • DOI: https://doi.org/10.1007/s11760-025-04666-8

Profiles

  1. Jingwen Wei