Jiongze Yu1, Xiangbo Gao1, Pooja Verlani2, Akshay Gadde2, Yilin Wang2, Balu Adsumilli2, Zhengzhong Tu†,1
1Texas A&M University&nbsp;&nbsp;2YouTube, Google
†Corresponding author
💡 Your ⭐ star means a lot to us and helps support the continuous development of this project!
- 2026.03.17: This repo is released. 🔥🔥🔥
Abstract: Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) inputs, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts and can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users first super-resolve one or, optionally, a small set of keyframes using any off-the-shelf image super-resolution (ISR) model; SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded in the motion of the original LR video. Concretely, we introduce a keyframe-conditioned two-stage (latent-then-pixel) training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework, as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer.
- ✅ Release inference code.
- ✅ Release pre-trained models.
- ✅ Release training code.
- ✅ Release project page.
- ⬜ Release ComfyUI.
- Python 3.10+
- PyTorch >= 2.5.0
- Diffusers
- Other dependencies (see `requirements.txt`)
# Clone the github repo and go to the directory
git clone https://github.com/taco-group/SparkVSR
cd SparkVSR
# Create and activate conda environment
conda create -n sparkvsr python=3.10
conda activate sparkvsr
# Install all required dependencies
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

The installation command may need to be adjusted according to your platform, CUDA version, and desired PyTorch version. Please check the official PyTorch previous versions page for more options.
Our model is trained on the same datasets as DOVE: HQ-VSR and DIV2K-HR. All datasets should be placed in the directory datasets/train/.
| Dataset | Type | # Videos / Images | Download |
|---|---|---|---|
| HQ-VSR | Video | 2,055 | Google Drive |
| DIV2K-HR | Image | 800 | Official Link |
All datasets should follow this structure:
datasets/
└── train/
    ├── HQ-VSR/
    └── DIV2K_train_HR/

We use several real-world and synthetic test datasets for evaluation. All datasets follow a consistent directory structure:
| Dataset | Type | # Videos | Average Frames | Download |
|---|---|---|---|---|
| UDM10 | Synthetic | 10 | 32 | Google Drive |
| SPMCS | Synthetic | 30 | 32 | Google Drive |
| YouHQ40 | Synthetic | 40 | 32 | Google Drive |
| RealVSR | Real-world | 50 | 50 | Google Drive |
| MovieLQ | Old-movie | 10 | 192 | Google Drive |
Make sure the path (datasets/test/) is correct before running inference.
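One quick way to verify this is a small script (our own sketch, not part of the repo) that checks each test dataset for the expected `GT`, `GT-Video`, `LQ`, and `LQ-Video` subfolders shown in the directory tree below:

```python
from pathlib import Path

EXPECTED = ["GT", "GT-Video", "LQ", "LQ-Video"]

def check_test_layout(root: str) -> dict:
    """Return {dataset_name: [missing subfolders]} for a datasets/test/ root."""
    report = {}
    for dataset in sorted(Path(root).iterdir()):
        if not dataset.is_dir():
            continue
        missing = [sub for sub in EXPECTED if not (dataset / sub).is_dir()]
        report[dataset.name] = missing
    return report
```

For example, `check_test_layout("datasets/test")` returns an empty list for every complete dataset and the names of any missing subfolders otherwise.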
The directory structure is as follows:
datasets/
└── test/
    └── [DatasetName]/
        ├── GT/        # Ground Truth: folder of high-quality frames (one per clip)
        ├── GT-Video/  # Ground Truth (video version): lossless MKV format
        ├── LQ/        # Low-Quality Input: folder of degraded frames (one per clip)
        └── LQ-Video/  # Low-Quality Input (video version): lossless MKV format

Before training or testing, you need to generate .txt files containing the relative paths of all valid video and image files in your dataset directories. These text lists act as the index for the dataloader during training and inference. Run the following commands:
# 🔹 Training dataset
python finetune/scripts/prepare_dataset.py --dir datasets/train/HQ-VSR
python finetune/scripts/prepare_dataset.py --dir datasets/train/DIV2K_train_HR
# 🔹 Testing dataset
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/GT-Video
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/LQ-Video
# (You may need to repeat the above for other test datasets as needed)

Our model is built upon the CogVideoX1.5-5B-I2V base model. We provide pretrained weights for SparkVSR at different training stages.
| Model Name | Description | HuggingFace |
|---|---|---|
| CogVideoX1.5-5B-I2V | Base model used for initialization | zai-org/CogVideoX1.5-5B-I2V |
| SparkVSR (Stage-1) | SparkVSR Stage-1 trained weights | JiongzeYu/SparkVSR-S1 |
| SparkVSR (Stage-2) | SparkVSR Stage-2 final weights | JiongzeYu/SparkVSR |
💡 Placement of Models:
- Place the base model (`CogVideoX1.5-5B-I2V`) into the `pretrained_weights/` folder.
- Place the downloaded SparkVSR weights (Stage-1 and Stage-2) into the `checkpoints/` folder.
Note: Training requires 4ΓA100 GPUs.
⚠️ Important: The Stage-1 weight is the intermediate result of our first training stage and is trained only in latent space. We release it mainly for training-time validation and comparison. The Stage-2 model is the final SparkVSR model.
- 🔹 Stage-1 (Latent-Space): Keyframe-Conditioned Adaptation. Enter the `finetune/` directory and start training:

  cd finetune/
  bash sparkvsr_train_s1_ref.sh

  This stage adapts the base model to VSR by learning to fuse LR video latents with sparse HR keyframe latents for robust cross-space propagation.
- 🔹 Stage-2 (Pixel-Space): Detail Refinement. First, convert the Stage-1 checkpoint into a loadable SFT weight format:

  python scripts/prepare_sft_ckpt.py --checkpoint_dir ../checkpoint/SparkVSR-s1/checkpoint-10000

  (Adjust the path and step number to match your actual training output.)
You can skip Stage-1 by downloading our SparkVSR Stage-1 weight as the starting point for Stage-2.
Then, run the second-stage fine-tuning:
bash sparkvsr_train_s2_ref.sh
This stage refines perceptual details in pixel space, ensuring adherence to provided keyframes while simultaneously maintaining strong no-reference blind SR capabilities when keyframes are absent or imperfect.
- Finally, convert the Stage-2 checkpoint for inference:

  python scripts/prepare_sft_ckpt.py --checkpoint_dir ../checkpoint/SparkVSR-s2/checkpoint-500
- For a quick test, you can directly run SparkVSR on the sample videos in `test_input/`.
- The example commands below are configured to use `test_input/` for fast testing.
- Before running inference on benchmark datasets, make sure you have downloaded the corresponding pre-trained models and test datasets.
- The full inference commands are provided in the shell script `sparkvsr_inference.sh`.
SparkVSR supports flexible keyframe propagation through three primary inference modes (--ref_mode).
⚠️ Important: Always use the Stage-2 checkpoint for inference. The Stage-1 checkpoint is only an intermediate latent-space result and is not our final model.

💡 Recommendation: Among the three inference modes, we strongly recommend the two reference-guided settings: `api` mode (with `nano-banana-pro` as the reference generator) and `pisasr` mode (with PiSA-SR as the reference generator). In these modes, SparkVSR injects high-quality spatial details through the reference frames. By contrast, `no_ref` does not use external reference frames and should be treated mainly as a practical fallback and a comparison baseline, rather than the final showcase setting. If you do not have access to the `nano-banana-pro` API, we strongly recommend using `pisasr` as the reference source.
Regardless of the mode you choose, you can customize the temporal propagation behavior using these flags:
- `--ref_indices`: Specifies the indices of the keyframes you want to use as references (0-indexed).
  - Example: `--ref_indices 0 96`
  - ⚠️ Important: The interval between any two reference frame indices must be strictly greater than 4.
- `--ref_guidance_scale`: Controls the strength of the reference keyframes' influence on the output video (default is `1.0`). Increasing this value forces the model to adhere more strictly to the provided keyframes.
- For short video clips (for example, clips within 2 seconds or around 48 frames), we strongly recommend using only the first frame as the reference signal: `--ref_indices 0`.
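The spacing rule above is easy to violate when scripting over many clips. A small helper like the following (our own sketch, not part of the repo) can validate an index list before launching inference:

```python
def validate_ref_indices(indices, num_frames=None):
    """Check that keyframe indices are 0-indexed, within the clip, and spaced
    strictly more than 4 frames apart; return them sorted."""
    idx = sorted(indices)
    if any(i < 0 for i in idx):
        raise ValueError("indices must be non-negative (0-indexed)")
    if num_frames is not None and any(i >= num_frames for i in idx):
        raise ValueError("index exceeds clip length")
    for a, b in zip(idx, idx[1:]):
        if b - a <= 4:
            raise ValueError(f"interval {b - a} between {a} and {b} is not > 4")
    return idx
```

For instance, `validate_ref_indices([0, 96])` passes, while `validate_ref_indices([0, 3])` raises because the interval is only 3.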
Uses keyframes restored by a commercial API as the condition signal. SparkVSR defaults to using the impressive fal-ai/nano-banana-pro/edit endpoint.
⚠️ Setup Requirement:
- Open `finetune/utils/ref_utils.py`.
- Locate the configuration block at the top of the file.
- Replace `'your_fal_key'` with your actual API key.
- (Optional) Customize the `TASK_PROMPT` in the same file to better guide the restoration process.
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft"
CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
--input_dir test_input \
--model_path $MODEL_PATH \
--output_path results/test_input/api_ref \
--is_vae_st \
--ref_mode api \
--ref_prompt_mode fixed \
--ref_guidance_scale 1.0 \
--upscale 4 \
--ref_indices 0

Uses keyframes restored by the open-source PiSA-SR model.
⚠️ Setup Requirement:
- Clone the PiSA-SR repository and follow their instructions to install dependencies in a separate Conda environment.
- Download their pre-trained weights (`stable-diffusion-2-1-base` and `pisa_sr.pkl`).
- Update the `--pisa_*` flags in `sparkvsr_inference.sh` to point to your actual cloned PiSA-SR directory, environment, and desired GPU.
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft"
CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
--input_dir test_input \
--model_path $MODEL_PATH \
--output_path results/test_input/pisa_ref \
--is_vae_st \
--ref_mode pisasr \
--ref_prompt_mode fixed \
--ref_guidance_scale 1.0 \
--upscale 4 \
--ref_indices 0 \
--pisa_python_executable "path/to/your/pisasr/conda/env/bin/python" \
--pisa_script_path "path/to/your/PiSA-SR/test_pisasr.py" \
--pisa_sd_model_path "path/to/your/PiSA-SR/preset/models/stable-diffusion-2-1-base" \
--pisa_chkpt_path "path/to/your/PiSA-SR/preset/models/pisa_sr.pkl" \
--pisa_gpu "0"

Performs blind video super-resolution without any reference keyframes. This mode is useful as a practical fallback and baseline, but it is not the recommended setting for the best visual quality.
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft"
CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
--input_dir test_input \
--model_path $MODEL_PATH \
--output_path results/test_input/no_ref \
--is_vae_st \
--ref_mode no_ref \
--ref_prompt_mode fixed \
--ref_guidance_scale 1.0 \
--upscale 4

💡 Note: All three of the above inference modes and their complete execution commands are fully organized and ready to run in the `sparkvsr_inference.sh` script!
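Conceptually, `--ref_guidance_scale` behaves like a classifier-free-guidance weight that interpolates between the blind (no-reference) prediction and the keyframe-conditioned prediction. This is our reading of the reference-free guidance mechanism described in the abstract, not the repo's actual implementation; a scalar sketch:

```python
def guided_prediction(blind_pred, ref_pred, scale=1.0):
    """CFG-style blend: scale=0 falls back to blind restoration, scale=1
    follows the keyframe-conditioned prediction, and scale>1 extrapolates
    toward stronger keyframe adherence."""
    return [b + scale * (r - b) for b, r in zip(blind_pred, ref_pred)]
```

Under this reading, raising the scale above 1.0 pushes the output toward the reference keyframes, while a missing or unreliable reference naturally degrades to the blind prediction at scale 0.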
To quantitatively evaluate the super-resolved videos, we provide a unified evaluation script: run_eval_all.sh.
⚠️ Evaluation Setup Requirement: To calculate DOVER and FastVQA/FasterVQA scores, you must clone their respective repositories and place them (along with their weights) into the `metrics/` directory.
- Clone VQAssessment/DOVER into `metrics/DOVER`.
- Clone VQAssessment/FAST-VQA-and-FasterVQA into `metrics/FastVQA`.
- Download the pre-trained weights specified in their repositories to their respective nested algorithm folders.
Once the metrics are set up, you can simply run the unified evaluation script run_eval_all.sh to calculate the scores. The evaluation results will be saved as all_metrics_results.json in your specified output directory.
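If you want to post-process scores yourself, aggregation amounts to averaging per-video metric scores and dumping a single JSON file. A sketch assuming a per-video `{metric: score}` layout (the actual schema of `all_metrics_results.json` may differ):

```python
import json
from pathlib import Path
from statistics import mean

def aggregate_metrics(per_video: dict, out_dir: str) -> dict:
    """per_video: {video_name: {metric_name: score}}.
    Writes all_metrics_results.json and returns the averaged summary."""
    metrics = {}
    for scores in per_video.values():
        for name, value in scores.items():
            metrics.setdefault(name, []).append(value)
    summary = {name: mean(values) for name, values in metrics.items()}
    out = Path(out_dir) / "all_metrics_results.json"
    out.write_text(json.dumps({"summary": summary, "per_video": per_video}, indent=2))
    return summary
```

This mirrors the common convention of reporting the mean of each no-reference metric (e.g., MUSIQ, DOVER) over all clips in a benchmark.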
If you find the code helpful in your research or work, please cite the following paper(s).
@misc{yu2026sparkvsrinteractivevideosuperresolution,
title={SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation},
author={Jiongze Yu and Xiangbo Gao and Pooja Verlani and Akshay Gadde and Yilin Wang and Balu Adsumilli and Zhengzhong Tu},
year={2026},
eprint={2603.16864},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.16864},
}Our work is built upon the solid foundations laid by DOVE and CogVideoX. We sincerely thank the authors for their excellent open-source contributions.


