This document introduces the verl framework, its purpose as a reinforcement learning (RL) training library for large language models (LLMs), and provides a high-level overview of the system architecture. For detailed information about specific subsystems, see the child pages: System Architecture and HybridFlow Design, Key Innovations and Design Patterns, and Supported Algorithms and Models.
verl (Volcano Engine Reinforcement Learning) is a flexible, efficient, and production-ready framework for post-training LLMs using RL algorithms. It is the open-source implementation of the HybridFlow paper README.md26-27, presented at EuroSys 2025 README.md14.
The framework addresses the challenge of efficiently orchestrating complex RL training workflows that involve multiple distributed components: policy training, inference generation, reward computation, and value estimation. verl enables researchers and practitioners to compose and scale these components flexibly and efficiently.
Sources: README.md22-48 docs/index.rst1-22
verl provides a complete software stack for LLM post-training via reinforcement learning:
| Component | Purpose | Key Classes / Entities |
|---|---|---|
| Programming Model | Define RL algorithm dataflows | HybridFlow (single/multi-controller) docs/index.rst8-9 |
| Training Orchestration | Coordinate distributed execution | RayPPOTrainer verl/trainer/ppo/ray_trainer.py29 |
| Distributed Workers | Execute training/inference tasks | ActorRolloutRefWorker, TrainingWorker verl/trainer/main_ppo.py128-154 |
| Training Engines | Backend for model training | FSDPEngine, MegatronEngine, VeOmniEngine docs/index.rst91-93 docs/index.rst144 |
| Inference Engines | High-throughput generation | vLLM, SGLang, TensorRT-LLM docs/index.rst94-95 |
| Data Pipeline | Load and process training data | RLHFDataset, DataProto verl/trainer/ppo/ray_trainer.py36 docs/index.rst49 |
| Configuration System | Manage complex configurations | Hydra framework with OmegaConf verl/trainer/config/ppo_trainer.yaml1-10 |
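The configuration system in the table above composes nested configs from Hydra-style dotted command-line overrides. The helper below is a minimal, hypothetical re-implementation of that override mechanism for illustration; it is not Hydra/OmegaConf or verl's actual code, and the config values shown are made up.

```python
# Minimal sketch of how Hydra-style dotted overrides (e.g.
# "trainer.total_epochs=3") compose into a nested config dict.
# This is a stand-in for Hydra/OmegaConf, not verl's real machinery.

def apply_override(config: dict, override: str) -> dict:
    """Apply a single 'a.b.c=value' override to a nested dict in place."""
    dotted_key, raw_value = override.split("=", 1)
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Best-effort literal parsing: ints, then floats, else plain strings.
    try:
        value = int(raw_value)
    except ValueError:
        try:
            value = float(raw_value)
        except ValueError:
            value = raw_value
    node[keys[-1]] = value
    return config

# Hypothetical example values, in the spirit of verl's PPO configs.
config = {"actor_rollout_ref": {"model": {"path": "some/model"}}}
apply_override(config, "actor_rollout_ref.actor.optim.lr=1e-6")
apply_override(config, "trainer.total_epochs=3")
```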
The framework supports training on various hardware platforms including NVIDIA GPUs, AMD GPUs (ROCm), and Huawei Ascend NPUs docs/index.rst150-165
Sources: README.md24-48 docs/index.rst4-22 docs/start/install.rst10-31 verl/trainer/ppo/ray_trainer.py16-60
The HybridFlow programming model is the foundation of verl's flexibility. It combines two execution paradigms docs/index.rst8-9: a single-controller paradigm that expresses the RL algorithm's dataflow as sequential code, and a multi-controller paradigm that executes distributed computation across worker groups.
The RayPPOTrainer class manages worker groups through RayWorkerGroup verl/trainer/ppo/ray_trainer.py39. For details on the programming model, see System Architecture and HybridFlow Design.
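The split between a single controller and multiple worker groups can be illustrated with plain Python. The classes below are conceptual stand-ins for Ray remote actors and verl's worker groups, not the real `RayWorkerGroup` API:

```python
# Conceptual sketch of the HybridFlow split: a single controller script
# expresses the dataflow, while a worker group executes sharded
# computation. Plain classes stand in for Ray remote actors here.

class Worker:
    """One shard of a distributed role (actor, critic, ...)."""
    def __init__(self, rank: int):
        self.rank = rank

    def compute(self, shard):
        # Placeholder for a real forward/backward or generation kernel.
        return [x * 2 for x in shard]

class WorkerGroup:
    """Multi-controller side: scatter a batch to ranks, gather results."""
    def __init__(self, world_size: int):
        self.workers = [Worker(r) for r in range(world_size)]

    def run(self, batch):
        n = len(self.workers)
        shards = [batch[r::n] for r in range(n)]
        outs = [w.compute(s) for w, s in zip(self.workers, shards)]
        return [x for out in outs for x in out]

# Single-controller side: the distributed dataflow reads like
# sequential code, which is the point of the programming model.
group = WorkerGroup(world_size=2)
result = group.run([1, 2, 3, 4])
```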
Sources: README.md28-34 docs/index.rst6-10 docs/index.rst42-43 verl/trainer/ppo/ray_trainer.py16-39
The following diagram shows the verl system architecture, mapping high-level concepts to concrete code entities:
Sources: verl/trainer/ppo/ray_trainer.py17-52 verl/trainer/main_ppo.py109-154 verl/workers/engine_workers.py128-154
The RayPPOTrainer class serves as the central orchestrator verl/trainer/ppo/ray_trainer.py29 and is typically launched via a TaskRunner verl/trainer/main_ppo.py109-113. It initializes the Ray cluster, spawns worker groups (actor, critic, reward model) using ResourcePoolManager verl/trainer/ppo/ray_trainer.py39, and implements the main training loop, including rollout, reward extraction verl/trainer/ppo/ray_trainer.py52, and advantage computation verl/trainer/ppo/ray_trainer.py136-144.
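The shape of that orchestration loop can be sketched as follows. All functions here are hypothetical stubs standing in for the rollout, reward, advantage, and update stages; this is the control-flow skeleton, not `RayPPOTrainer.fit` itself:

```python
# Illustrative skeleton of the orchestration loop: generate rollouts,
# score them, estimate advantages, then update the policy. Every callable
# below is a toy stub, not verl's implementation.

def train_loop(num_steps, rollout, reward_fn, compute_advantages, update):
    history = []
    for step in range(num_steps):
        batch = rollout(step)                 # inference engine generates responses
        rewards = reward_fn(batch)            # rule-based or model-based scoring
        advantages = compute_advantages(rewards)
        metrics = update(batch, advantages)   # training engine optimizes the policy
        history.append(metrics)
    return history

# Toy stubs to exercise the control flow end to end.
history = train_loop(
    num_steps=2,
    rollout=lambda step: [f"resp-{step}-{i}" for i in range(4)],
    reward_fn=lambda batch: [1.0] * len(batch),
    compute_advantages=lambda rs: [r - sum(rs) / len(rs) for r in rs],
    update=lambda batch, adv: {"batch_size": len(batch)},
)
```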
Workers are Ray remote actors, each executing a specific role, such as ActorRolloutRefWorker for policy rollout and reference computation, alongside dedicated critic and reward-model workers.
The framework supports multiple training backends via configuration:
- FSDPEngine for sharding and memory management, supporting FSDP and FSDP2 docs/start/install.rst20
- MegatronEngine for model parallelism and scalability, often integrated via mbridge docs/start/install.rst20 setup.py60

verl also integrates with high-performance inference backends such as vLLM, SGLang, and TensorRT-LLM docs/index.rst94-95.
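Selecting a training backend from configuration amounts to a registry lookup keyed on a strategy string. The sketch below uses empty stand-in classes for the engines; verl's real construction path goes through its worker and config machinery, so treat the registry and `build_engine` helper as illustrative assumptions:

```python
# Hedged sketch of backend selection by configuration string.
# FSDPEngine / MegatronEngine here are empty placeholders, not
# verl's actual engine classes.

class FSDPEngine: ...
class MegatronEngine: ...

ENGINE_REGISTRY = {
    "fsdp": FSDPEngine,
    "fsdp2": FSDPEngine,      # FSDP2 shares the FSDP engine in this sketch
    "megatron": MegatronEngine,
}

def build_engine(strategy: str):
    try:
        return ENGINE_REGISTRY[strategy]()
    except KeyError:
        raise ValueError(f"unknown strategy: {strategy!r}") from None

engine = build_engine("fsdp2")
```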
Sources: verl/trainer/ppo/ray_trainer.py16-162 verl/trainer/main_ppo.py109-154 docs/start/install.rst12-31
The following diagram shows how data flows through a training iteration, associating system names with code entities:
Sources: verl/trainer/ppo/ray_trainer.py136-162 verl/trainer/ppo/core_algos.py70-85 verl/trainer/ppo/reward.py52
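The advantage-computation step in this dataflow is commonly implemented with Generalized Advantage Estimation (GAE). The function below is written from the standard GAE recurrence, not copied from core_algos.py, and uses pure Python for clarity:

```python
# Pure-Python sketch of Generalized Advantage Estimation (GAE):
#   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
#   A_t     = delta_t + gamma * lam * A_{t+1}
# Illustrative of the advantage step in the PPO dataflow.

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with 0 after the final step of the trajectory.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

# Toy trajectory: reward only at the last step.
adv = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.7])
```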
verl supports a wide range of algorithms and model architectures:
For details, see Supported Algorithms and Models.
Sources: README.md60-63 docs/index.rst71-84 verl/trainer/ppo/core_algos.py88-112
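Most of the policy-gradient algorithms in this family build on the clipped PPO surrogate objective. The function below states that objective in pure Python as a reference point; it is illustrative math, not verl's loss implementation:

```python
# Clipped PPO objective, negated for gradient descent:
#   L = -mean_t( min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) )
# where r_t = exp(log pi(a_t) - log pi_old(a_t)).

import math

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_ratio=0.2):
    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)
        clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio)
        # Pessimistic bound: take the smaller of the two surrogates.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

loss = ppo_clip_loss(
    log_probs=[-1.0, -1.2],
    old_log_probs=[-1.0, -1.0],
    advantages=[1.0, -0.5],
)
```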
To begin using verl:
- Install dependencies via install.sh docs/start/install.rst1-100
- Launch training through the main_ppo.py entry point verl/trainer/main_ppo.py36-46

For more information, see Getting Started.
Sources: docs/start/install.rst1-100 verl/trainer/main_ppo.py1-108 setup.py81-104