Baidu Netdisk Images: https://pan.baidu.com/s/1g8jixAr39n06JWPwrBE6lQ?pwd=M2VD
Baidu Netdisk Videos: https://pan.baidu.com/s/1z_kMLxYejPvt_17SNGlOTA?pwd=M2VD
GoogleDrive Videos: https://drive.google.com/file/d/1bRoNhQBzWtj0y8CMGdXQvjbQCGAVddPX/view?usp=sharing
- Images are released in the native per-frame format that is directly used by the VideoFusion project (i.e., frame sequences under each clip folder), so you can plug them into the training/testing pipeline without any extra conversion.
- Videos are additionally provided by packing each frame sequence into a single video file (e.g., .mp4) to reduce file count and avoid storage / hosting limitations (many platforms struggle with extremely large numbers of small image files).
If you download the Videos version and need per-frame image sequences (the format directly used in VideoFusion), please use:
This script converts each .mp4 clip into an ordered frame sequence and restores the dataset layout for training/testing.
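As a rough illustration of what such a conversion involves, the sketch below walks a folder of `.mp4` clips and dumps each one into a frame folder named after the clip, using the `ffmpeg` command-line tool. It is not the project's own script: the function names, the `%06d.png` frame naming, and the mirrored output layout are all assumptions, and `ffmpeg` must be installed separately.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, frames_dir, pattern="%06d.png"):
    """Return the ffmpeg argument list that writes every frame of one clip.

    The frame naming pattern is an assumption, not the dataset's official one.
    """
    return [
        "ffmpeg",
        "-i", str(video_path),            # input clip
        "-start_number", "0",             # first frame is written as 000000.png
        str(Path(frames_dir) / pattern),  # numbered output images
    ]


def video_to_frames(video_path, out_root):
    """Extract one clip into out_root/<clip name>/ as an ordered frame sequence."""
    video_path = Path(video_path)
    frames_dir = Path(out_root) / video_path.stem
    frames_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, frames_dir), check=True)
    return frames_dir


def convert_tree(videos_root, frames_root):
    """Mirror the folder structure of videos_root, one frame folder per .mp4."""
    videos_root = Path(videos_root)
    for video_path in sorted(videos_root.rglob("*.mp4")):
        relative_parent = video_path.parent.relative_to(videos_root)
        video_to_frames(video_path, Path(frames_root) / relative_parent)
```

Calling `convert_tree("M3SVD_videos", "M3SVD_frames")` (both folder names are placeholders) would recreate the directory tree with one frame folder per clip. Because `.mp4` files are usually encoded with lossy compression, the extracted frames may not be bit-identical to the Images release.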
We provide both high-quality (clean/enhanced) data and degraded data for infrared and visible modalities:
- infrared_Enhance: High-quality infrared (IR) frames (clean/enhanced version).
- visible_Enhance: High-quality visible (VI) frames (clean/enhanced version).
- infrared_noise: Degraded infrared (IR) frames with stripe noise (a typical IR sensor degradation).
- visible_Blur: Degraded visible (VI) frames with blur (e.g., motion/defocus blur).
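The four folders above can be combined into aligned IR-VI frame pairs for training or testing. The following is a minimal sketch, assuming each modality folder contains one subfolder per clip and that corresponding frames share the same clip name and file name across modalities; this layout and the helper names `list_frames` and `pair_frames` are assumptions, not part of the released code.

```python
from pathlib import Path


def list_frames(modality_dir):
    """Map 'clip/frame-name' to the file path for one modality folder."""
    modality_dir = Path(modality_dir)
    return {
        p.relative_to(modality_dir).as_posix(): p
        for p in sorted(modality_dir.glob("*/*"))
        if p.is_file()
    }


def pair_frames(root, ir_folder="infrared_noise", vi_folder="visible_Blur"):
    """Return (ir_path, vi_path) tuples for frames present in both modalities.

    Assumes the layout root/<modality>/<clip>/<frame>, which is hypothetical.
    """
    root = Path(root)
    ir = list_frames(root / ir_folder)
    vi = list_frames(root / vi_folder)
    common = sorted(ir.keys() & vi.keys())  # keep only frames found in both
    return [(ir[key], vi[key]) for key in common]
```

Passing `ir_folder="infrared_Enhance"` and `vi_folder="visible_Enhance"` would pair the high-quality frames instead of the degraded ones. Frames that exist in only one modality are skipped rather than raising an error, which makes incomplete downloads easier to spot by comparing pair counts.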
- [2026] Our paper “VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration” has been accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026! [Paper] [Code]
- [2025] M3SVD dataset is officially released.
M3SVD (Multi-Modal Multi-Scene Video Dataset) is a large-scale infrared-visible (IR-VI) video dataset designed for:
- 🔥 Multi-modal video fusion
- 🌙 Low-light / degraded video restoration
- 📹 Spatio-temporal modeling research
Visualization of representative scenarios in M3SVD. The dataset contains 220 temporally synchronized infrared-visible (IR-VI) video pairs with 153,797 aligned frames in total, captured at a resolution of 640×480 and 30 FPS.
Example sequences (GIF previews):
- Current release: Test split
- Full dataset access: Please contact linfeng0419@gmail.com
We are open to academic collaboration and research usage.
VideoFusion (CVPR 2026)
Spatio-temporal collaborative network for multi-modal video fusion and restoration. [Paper] [Code]
If you use M3SVD in your research, please cite:
@inproceedings{Tang2026VideoFusion,
title = {VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration},
author = {Tang, Linfeng and Wang, Yeda and Gong, Meiqi and Li, Zizhuo and Deng, Yuxin and Yi, Xunpeng and Li, Chunyu and Zhang, Hao and Xu, Han and Ma, Jiayi},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}





