Multimodal Unlearning Across Vision, Language, Video, and Audio: Survey of Methods, Datasets, and Benchmarks
Overview of Survey
Multimodal unlearning requires identifying effective intervention points within the model pipeline. Figure 2 illustrates methods spanning data-side, training-time, architecture-constrained, and decoding-time stages, producing an updated model (MFM′). Training-free approaches instead apply direct parameter or representation edits (Δ).
Figure 2: System-level intervention points for multimodal unlearning across the model pipeline.
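To make the training-free pathway concrete, below is a minimal, hypothetical sketch of a direct parameter edit (Δ) in the style of task arithmetic: fine-tune a copy of the model on the forget set, take the weight difference as the "forget direction," and subtract it from the base weights. All function and variable names here are illustrative and do not come from any specific method in the survey.

```python
import numpy as np

def training_free_unlearn(theta, theta_forget_ft, alpha=1.0):
    """Illustrative training-free edit: compute the forget direction
    Delta = theta_forget_ft - theta (per parameter tensor) and subtract
    it from the base weights, scaled by alpha. No gradient steps are
    taken on the base model itself."""
    delta = {k: theta_forget_ft[k] - v for k, v in theta.items()}
    return {k: v - alpha * delta[k] for k, v in theta.items()}

# Toy example with a single 2x2 weight matrix.
theta = {"w": np.ones((2, 2))}                 # base model weights
theta_forget_ft = {"w": np.full((2, 2), 1.5)}  # copy fine-tuned on the forget set
theta_prime = training_free_unlearn(theta, theta_forget_ft, alpha=1.0)
# theta_prime["w"] is uniformly 0.5: the base weights moved away
# from the forget direction.
```

The scaling factor `alpha` trades off forgetting strength against utility retention; in practice it is tuned on a retain set.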
We organize multimodal unlearning via a system-first taxonomy across five intervention stages: Data-Side Interventions (Section 3.1); Training-Time Edits (Section 3.2); Architecture-Constrained Unlearning (Section 3.3); Training-Free Unlearning (Section 3.4); Decoding-Time Unlearning (Section 3.5).
Figure 1: Taxonomy of multimodal unlearning by intervention stage and control pathway.
Evaluation Metrics
Evaluation uses metric suites that assess forgetting, utility retention, robustness, and efficiency, as summarized in Figure 3. We defer detailed metric definitions and evaluation protocols to Appendix B.
Figure 3: Evaluation dimensions and representative metrics for multimodal unlearning.
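As a simplified illustration of how such a metric suite pairs forgetting with utility retention, the sketch below scores a model's answers against a forget set (where accuracy should drop after unlearning) and a retain set (where accuracy should stay high). The function and data names are hypothetical; real benchmarks use richer protocols (membership inference, robustness probes, efficiency measurements) as detailed in Appendix B.

```python
def unlearning_report(model_answers, forget_refs, retain_refs):
    """Illustrative two-metric report: accuracy on forget-set queries
    (lower is better after unlearning) and accuracy on retain-set
    queries (higher is better, measuring utility retention)."""
    def accuracy(refs):
        correct = sum(model_answers.get(q) == a for q, a in refs.items())
        return correct / len(refs)
    return {
        "forget_acc": accuracy(forget_refs),   # forgetting quality
        "retain_acc": accuracy(retain_refs),   # utility retention
    }

# Toy usage: the model no longer answers the forgotten query correctly,
# while unrelated knowledge is preserved.
answers = {"q1": "unknown", "q2": "Paris", "q3": "blue"}
report = unlearning_report(
    answers,
    forget_refs={"q1": "Alice"},               # target of unlearning
    retain_refs={"q2": "Paris", "q3": "blue"}, # should be preserved
)
# report -> {"forget_acc": 0.0, "retain_acc": 1.0}
```

An ideal unlearned model drives `forget_acc` toward the level of a model retrained without the forget data, while keeping `retain_acc` close to the original model's.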
Applications of Multimodal Unlearning
Multimodal unlearning enables selective removal of specific identities, attributes, or concepts without full retraining while preserving overall capability and stability. Detailed use cases and representative studies are provided in Appendix E.
Figure 4: Core application scenarios of multimodal unlearning.
Open Challenges in Multimodal Unlearning
Multimodal unlearning faces key challenges in theoretical guarantees, cross-modal generalization, evaluation reliability, adversarial robustness, utility trade-offs, and unified benchmarking. We provide a more detailed discussion in Appendix F, covering modality-specific limitations, evaluation considerations, and emerging research directions for reliable and scalable multimodal unlearning.
Figure 5: Key open challenges in multimodal unlearning.
Contact
This repository is actively maintained and continuously updated 🚀.
If you notice any issues or would like your work included, please open an issue or contact us:
BibTeX
@inproceedings{sarwar2026mm-unlearning-survey,
title = {{Multimodal Unlearning Across Vision, Language, Video, and Audio: Survey of Methods, Datasets, and Benchmarks}},
author = {Sarwar, Nobin and Roy Dipta, Shubhashis and Liu, Zheyuan and Patil, Vaidehi},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
year = {2026},
month = jul,
publisher = {Association for Computational Linguistics},
url = {https://doi.org/10.36227/techrxiv.176945748.88280394/v1}
}