While post-training multimodal LLMs for video understanding, we found that existing RL frameworks were poorly suited to video scenarios. We therefore built EasyVideoR1, whose optimizations are outlined in this report. To the best of our knowledge, it is the most suitable code repository for research on RL post-training for video understanding at the time of this report's release. It supports a wide range of video understanding tasks, provides research-friendly interfaces (mixed off-policy and on-policy training, joint image-video training), improves video RL training efficiency through systematic design, and includes an efficient, comprehensive, and accuracy-aligned evaluation framework. We hope this repository encourages more video understanding research in the multimodal community. We also invite community researchers to help us maintain this codebase and build the most comprehensive and research-friendly repository for video understanding together. We welcome pull requests and will consider merging any valuable contribution.
- Offline Preprocessing and Cache-Based Training: accelerates rollout generation by 1.5× and log-probability computation by 2.9×, achieving a 1.47× overall speedup in both wall-clock time per step and token throughput.
- Task-Aware Prompt and Reward Assignment System: supports 10+ task types and their accuracy scoring/reward methods. Specifically, EasyVideoR1 fully implements the following reward types by default: multiple choice, numerical, temporal grounding, spatial-temporal grounding, and open-ended QA. Prompt formatting is also available for additional task types including spatial grounding, tracking, OCR, boolean QA, math, and code generation.
- More Flexible Video Hyperparameter Settings: video metadata support for precise frame processing.
- Advanced VLMs: supports Qwen2-VL / Qwen2.5-VL / Qwen3-VL / Qwen3.5-VL series vision-language models.
- Rich RL Algorithms: inherited from EasyR1, supports GRPO, DAPO, GSPO, CISPO, Reinforce++, ReMax, RLOO, GDPO and more.
- Mixed-Modality Pipeline Adaptation: supports joint Text-Image-Video training with optimized gradient flow.
- A Lightweight Mix-policy Interface: supports hybrid online-offline training.
- Asynchronous Inference: Precomputed Frame Caching and Asynchronous Pipeline with AsyncLLMEngine ensure that the GPU remains productive at every scheduling step: cached I/O feeds data continuously, asynchronous queuing removes batch-boundary stalls, and chunked prefill prevents any single long sequence from monopolizing compute.
- Comprehensive and Reproducible Evaluation: supports 22+ Video Understanding Benchmarks.
- Accuracy-aligned: for Qwen3-VL series, evaluation results align with official scores (within 1% deviation).

Training with EasyVideoR1 yields consistent improvements over the Qwen3-VL-8B base models across 10 video understanding benchmarks, with an average accuracy gain of +2.3%.
Our video preprocessing cache reduces per-step training time by 1.47x compared to on-the-fly decoding, without sacrificing accuracy.
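
To illustrate the idea behind a preprocessing cache, here is a minimal sketch in which frames are decoded once, stored on disk, and reused on later steps. It is not EasyVideoR1's implementation: `cache_key`, `load_frames`, and `fake_decode` are hypothetical names, the pickle-on-disk layout is an assumption made for brevity, and `fake_decode` stands in for a real video decoder.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List


def cache_key(video_path: str, fps: float, max_frames: int) -> str:
    """Derive a stable key from the video path and the sampling parameters."""
    payload = json.dumps(
        {"path": video_path, "fps": fps, "max_frames": max_frames},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_frames(
    video_path: str,
    fps: float,
    max_frames: int,
    cache_dir: Path,
    decode: Callable[[str, float, int], List[bytes]],
) -> List[bytes]:
    """Return cached frames if present; otherwise decode once and store them."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{cache_key(video_path, fps, max_frames)}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    frames = decode(video_path, fps, max_frames)
    with cache_file.open("wb") as f:
        pickle.dump(frames, f)
    return frames


def fake_decode(video_path: str, fps: float, max_frames: int) -> List[bytes]:
    """Placeholder decoder (hypothetical): counts calls and returns dummy frames."""
    fake_decode.calls += 1
    return [f"{video_path}:frame{i}".encode() for i in range(max_frames)]


fake_decode.calls = 0

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        load_frames("video.mp4", 2.0, 4, Path(tmp), fake_decode)
        load_frames("video.mp4", 2.0, 4, Path(tmp), fake_decode)
        print("decoder calls:", fake_decode.calls)
```

Because the sampling parameters are part of the key, changing `fps` or `max_frames` in this sketch triggers a fresh decode instead of returning stale frames.
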
conda create -n easyvideor1 python=3.11
conda activate easyvideor1
git clone https://github.com/cyuQ1n/EasyVideoR1.git
cd EasyVideoR1
pip install -e .
pip install flash-attn==2.8.3 --no-build-isolation

Below is a minimal 3-step workflow to get training running.
Create a JSON/JSONL file. Each entry should look like:
{
  "problem": "What happens in this video?",
  "answer": "A cat jumps onto the table.",
  "videos": ["path/to/video.mp4"],
  "data_type": "video",
  "problem_type": "open-ended"
}

For multiple-choice tasks, add an options field:
{
  "problem": "What color is the car?",
  "answer": "B",
  "videos": ["path/to/video.mp4"],
  "data_type": "video",
  "problem_type": "multiple choice",
  "options": ["A. Red", "B. Blue", "C. Green", "D. White"]
}

See docs/config_parameters.md for the full list of supported problem_type values and data fields.
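
Before launching a run, it can help to sanity-check the data file. The sketch below is not part of the repository: `validate_entry` and `to_jsonl` are hypothetical helpers, and treating the five fields shown in the examples as required for video entries is an assumption (docs/config_parameters.md remains the authoritative reference).

```python
import json
from typing import Any, Dict, Iterable, List

# Assumption: the fields shown in the examples above are required for video entries.
REQUIRED_FIELDS = ("problem", "answer", "videos", "data_type", "problem_type")


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if "videos" in entry and not isinstance(entry["videos"], list):
        errors.append("videos must be a list of paths")
    if entry.get("problem_type") == "multiple choice" and not entry.get("options"):
        errors.append("multiple choice entries need a non-empty options list")
    return errors


def to_jsonl(entries: Iterable[Dict[str, Any]]) -> str:
    """Serialize entries as JSONL: one JSON object per line."""
    return "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)


if __name__ == "__main__":
    entry = {
        "problem": "What color is the car?",
        "answer": "B",
        "videos": ["path/to/video.mp4"],
        "data_type": "video",
        "problem_type": "multiple choice",
        "options": ["A. Red", "B. Blue", "C. Green", "D. White"],
    }
    print(validate_entry(entry))
    print(to_jsonl([entry]), end="")
```
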
Copy and edit the example config:
cp examples/video_rl/video_rl.yaml my_config.yaml

Update at minimum these fields:
data:
  train_files: /path/to/your/train.jsonl
  val_files: /path/to/your/val.json
worker:
  actor:
    model:
      model_path: Qwen/Qwen3-VL-8B-Instruct
trainer:
  experiment_name: my_first_run
  save_checkpoint_path: ./checkpoints/my_first_run

# Single-node (8 GPUs)
bash examples/video_rl/run_video_rl.sh
# Multi-node: set WORLD_SIZE, RANK, MASTER_ADDR on each node
WORLD_SIZE=2 RANK=0 MASTER_ADDR=<head_ip> bash examples/video_rl/run_video_rl.sh # head
WORLD_SIZE=2 RANK=1 MASTER_ADDR=<head_ip> bash examples/video_rl/run_video_rl.sh # worker

After training, merge FSDP checkpoints to Hugging Face format:
python3 scripts/model_merger.py --local_dir checkpoints/my_first_run/global_step_100/actor

EasyVideoR1/
├── verl/                # Core RL training framework
│   ├── trainer/         # Training loop & Ray orchestration
│   ├── workers/         # Actor, rollout, reward, critic workers
│   ├── models/          # Qwen2-VL / Qwen2.5-VL / Qwen3-VL model support
│   └── utils/           # Dataset, tokenization, FSDP utilities
├── examples/
│   ├── video_rl/        # Video-only RL pipeline (single-file reward)
│   └── unified_rl/      # Mixed image-video pipeline (modular reward)
├── eval/                # Evaluation toolkit (25+ benchmarks)
├── scripts/             # Checkpoint merger, video preprocessing
└── docs/                # Detailed documentation

A self-contained pipeline for video-only RL training. The reward function (video_reward.py) handles all task types in a single file with a simple accuracy * 0.9 + format * 0.1 scoring formula.
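
The weighted formula can be sketched as follows. Only the 0.9 and 0.1 weights come from the description above. The `<answer>...</answer>` tag convention, the exact-match accuracy check, and all function names are assumptions made for illustration, and the actual logic in video_reward.py may differ.

```python
import re

ACCURACY_WEIGHT = 0.9  # from the description above
FORMAT_WEIGHT = 0.1    # from the description above

# Assumption: responses wrap the final answer in <answer>...</answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_score(response: str) -> float:
    """1.0 if the response contains an answer tag pair, else 0.0."""
    return 1.0 if ANSWER_PATTERN.search(response) else 0.0


def accuracy_score(response: str, ground_truth: str) -> float:
    """1.0 on a case-insensitive exact match of the tagged answer, else 0.0."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0


def overall_score(response: str, ground_truth: str) -> float:
    """Combine both components as accuracy * 0.9 + format * 0.1."""
    return (
        ACCURACY_WEIGHT * accuracy_score(response, ground_truth)
        + FORMAT_WEIGHT * format_score(response)
    )


if __name__ == "__main__":
    print(overall_score("The car is blue. <answer>B</answer>", "B"))
    print(overall_score("<answer>C</answer>", "B"))
    print(overall_score("B", "B"))
```

Under these assumptions, a well-formatted but wrong answer still earns the small format component, while an untagged response earns nothing.
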
bash examples/video_rl/run_video_rl.sh

A modular pipeline for mixed image-video training. The reward function routes each sample to a task-specific module (multiple choice, grounding, math, etc.) with independent scoring logic.
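
One way to picture task-specific routing is a lookup table from task type to scoring function. This is a hypothetical sketch, not the repository's code: the scorer names, the `SCORERS` registry, and the `"numerical"` type string are assumptions, and only two simple scorers are shown.

```python
from typing import Callable, Dict


def score_multiple_choice(prediction: str, answer: str) -> float:
    """Case-insensitive match on the option letter."""
    return 1.0 if prediction.strip().upper() == answer.strip().upper() else 0.0


def score_numerical(prediction: str, answer: str, tol: float = 1e-6) -> float:
    """Match two numbers within an absolute tolerance; unparsable input scores 0."""
    try:
        return 1.0 if abs(float(prediction) - float(answer)) <= tol else 0.0
    except ValueError:
        return 0.0


# Hypothetical registry: each task type maps to its own independent scorer.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "multiple choice": score_multiple_choice,
    "numerical": score_numerical,
}


def route_reward(sample: Dict[str, str], prediction: str) -> float:
    """Dispatch a sample to the scorer registered for its problem_type."""
    problem_type = sample["problem_type"]
    scorer = SCORERS.get(problem_type)
    if scorer is None:
        raise KeyError(f"no scorer registered for problem_type={problem_type!r}")
    return scorer(prediction, sample["answer"])


if __name__ == "__main__":
    print(route_reward({"problem_type": "multiple choice", "answer": "B"}, "b"))
    print(route_reward({"problem_type": "numerical", "answer": "3"}, "3.0"))
```

Keeping scorers in a registry means a new task type only needs one new function and one new table entry, which matches the independent scoring logic described above.
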
bash examples/unified_rl/run_unified_rl.sh

| Document | Description |
|---|---|
| Configuration Parameters | Complete reference for all YAML config options |
| RL Training Deep Dive | GRPO algorithm, system architecture, training flow |
| Qwen3-VL Multimodal Processing | Vision-language model internals |
| Token Calculation | Token counting and memory estimation |
Q: Image features and image tokens do not match
A: Increase data.max_prompt_length or decrease data.max_pixels.
Q: CUDA out of memory
A: Decrease worker.rollout.gpu_memory_utilization and enable worker.actor.offload.offload_params.
Q: Multi-node training hangs
A: Run ray status to check the cluster. Ensure all nodes are connected and NCCL ports are open.
This project is built upon the excellent work of:
- EasyR1: Efficient, scalable RL training framework
- veRL: High-performance RL with HybridEngine
- OneThinker: All-in-one Reasoning Model for Image and Video
If you use this project, please cite:
@misc{qin2026easyvideor1,
title = {EasyVideoR1: Easier RL for Video Understanding},
author = {Chuanyu Qin and Chenxu Yang and Qingyi Si and Naibin Gu and Dingyu Yao and Zheng Lin and Peng Fu and Nan Duan and Jiaqi Wang},
howpublished = {\url{https://github.com/cyuQ1n/EasyVideoR1}},
year = {2026}
}
This project follows the same license as EasyR1.
We're hiring multimodal research scientists and interns at JD Explore Academy! If you have top-tier publications and are passionate about video understanding and VLMs, please send your resume to: siqingyi.phoebus@jd.com. We'd love to hear from you!

