[draft] prime-rl integration by biswapanda · Pull Request #8630 · ai-dynamo/dynamo

biswapanda · 2026-04-23T19:31:47Z

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Added token ID support: token echo in responses and pre-tokenized prompt injection via /v1/chat/completions/tokens
- Added /v1/tokenize and /v1/detokenize endpoints for text encoding and decoding
- Added RL admin endpoints (/v1/rl/*) for generation lifecycle control, cache flushing, weight updates, and LoRA adapter management
- Extended logprobs with token ID resolution
Chores
- Upgraded vLLM to v0.19.1 with prime-rl plugin integration

…for Prime-RL integration

…e uses engine_client directly, NvExtResponse construction

…elds Three fixes from e2e testing: 1. preprocessor.rs: Use nvext.token_data when provided regardless of backend_instance_id. Previously token_data was only used in the GAIE Stage 2 flow (with backend_instance_id). RL/TITO needs token override with normal routing. 2. handlers.py: Use collective_rpc("reload_weights") instead of collective_rpc("update_weights_from_path") -- the latter doesn't exist in vLLM 0.18.0. reload_weights calls GPUModelRunner's built-in layerwise reload from safetensors. 3. rl_admin/service.py: Strip return_token_ids and tokens fields from standard /v1/chat/completions proxy before forwarding to Dynamo frontend (which rejects unknown parameters with 400). E2E verified: - Standard mode: 10/10 steps, real weight updates, wandb synced https://wandb.ai/test232/prime-rl/runs/fe3dntog - TITO mode: 10/10 steps, real weight updates, wandb synced https://wandb.ai/test232/prime-rl/runs/m4gz7gty

Three fixes for baseline parity with OSS vLLM: 1. publisher.py: Guard scheduler_stats is None after weight reload. The stats logger fires before scheduler_stats is populated after flush_cache + update_weights_from_path, causing AttributeError that kills the vLLM engine mid-training. 2. rl_admin/service.py: Return HTTP 502 on engine route failures. Previously returned 200 even when vLLM worker was dead, causing Prime-RL to silently skip weight updates. 3. rl_admin/service.py: Inject prompt_token_ids and choice.token_ids into proxied responses. Prime-RL's verifiers client requires these vLLM-specific fields to extract per-token logprobs. Without them, completion_logprobs are filled with zeros, triggering IPO mask on 76-89% of tokens and destroying training signal. Uses convert_tokens_to_ids() for special tokens like <think> to avoid the re-encode problem where a single special token becomes multiple tokens.

The previous convert_tokens_to_ids check used `tid != tok.unk_token_id` but Qwen3's tokenizer has unk_token_id=None, so `None != None` is False and every token fell through to the encode() fallback. For byte-level tokens like "\n" (vocab entry "Ċ"), encode() can produce different token counts than the logprobs, causing assertion failures in the trainer: AssertionError: input_ids: 209, inference_logprobs: 208 Fix: check `isinstance(tid, int)` only (no unk comparison), and in the byte-level fallback always append exactly 1 token ID (ids[0]) to maintain strict 1:1 alignment with logprobs entries.

Eliminates Rust/Python tokenizer mismatch by having the rl_admin pre-tokenize every prompt with the Python HF tokenizer and inject via nvext.token_data. The Rust frontend bypasses its own tokenizer and passes the Python tokens directly to vLLM. Parity results (20-step reverse-text, Qwen3-0.6B-SFT): - Mean reward: 0.659 (vs baseline 0.666, delta -1.1%) - Peak reward: 0.765 at step 15 - is_masked: 0.0024 (vs baseline 0.0004) - Throughput: 10.0 req/s (vs baseline 10.7, 1.07x)

…7699) Manually cherry-picked the /tokenize and /detokenize endpoint code from ai-dynamo/dynamo PR #7699 (jthomson04/tokenize-endpoint). The full PR had merge conflicts due to divergent chat_completions.rs refactoring, so only the tokenization-specific changes were applied: - New file: protocols/openai/tokenization.rs (request/response types) - Tokenizer trait additions: encode_with_special_tokens, convert_ids_to_tokens - HTTP routes: POST /tokenize, POST /detokenize - Route wiring in service_v2.rs These endpoints follow vLLM's /tokenize and /detokenize semantics, enabling consistent tokenization between the Rust frontend and Python clients for RL training.

Verified that Rust frontend /tokenize produces byte-identical token IDs to the Python HF tokenizer across all test cases (special chars, CJK, whitespace, thinking tokens, multi-turn, empty content). This means: - /tokenize now proxies directly to Rust frontend (no Python tokenizer) - /detokenize proxied to Rust frontend (new) - _pretokenize_request() removed (Rust tokenizer handles it natively) - _get_tokenizer() removed (Python HF tokenizer no longer needed) - _inject_token_ids() now uses Rust /tokenize for prompt_ids Also: - Added /ready deep health check (frontend + system port connectivity) - Added httpx connection pool limits + shutdown hook - Added assertion: len(completion_ids) == len(logprobs) - Removed /load_lora_adapter stub (dead code)

Phase 1 of rl_admin elimination (Eliminate_rl_admin.md): 1. delta.rs: accumulate delta.token_ids per chunk and emit as nvext.completion_token_ids on the final (finish_reason) chunk when extra_fields includes "completion_token_ids". 2. openai.rs (handler_chat_completions): promote return_token_ids and tokens fields from ignored to first-class on the standard /v1/chat/completions endpoint, eliminating the rl_admin stripping hack. - tokens -> nvext.token_data (TITO path, same as /v1/chat/completions/tokens) - return_token_ids=true -> extra_fields["token_ids","completion_token_ids"] + forces logprobs=true 3. openai.rs (tokenization_router): rename /tokenize -> /v1/tokenize and /detokenize -> /v1/detokenize to follow /v1/* convention. 4. openai.rs (rl_router): new /v1/rl/* admin fleet-control endpoints, gated behind DYN_ENABLE_RL env var (or HttpServiceConfig.enable_rl): - GET /v1/rl/ready composite worker health check - POST /v1/rl/pause fan-out pause_generation to all workers - POST /v1/rl/resume fan-out resume_generation to all workers - POST /v1/rl/update_weights flush_cache + update_weights_from_path - GET /v1/rl/weight_version query + consistency-check weight version Worker system URLs from DYN_RL_WORKER_SYSTEM_URLS (default localhost:8081). 5. service_v2.rs: add enable_rl to HttpServiceConfig and mount rl_router when enabled. 6. rl_admin/service.py: update proxy targets /tokenize -> /v1/tokenize and /detokenize -> /v1/detokenize to match new Rust frontend paths.

Prime-RL's check_health() calls GET /health on the admin client. With admin_base_url='http://dynamo:8000/v1/rl', this arrives as GET /v1/rl/health. Add a lightweight 200 OK handler so the readiness wait loop succeeds without needing the deep /v1/rl/ready probe.

…s + choice.token_ids When DYN_ENABLE_RL=true, the Rust frontend now: 1. Auto-enables return_token_ids promotion (even if client doesn't request it) 2. Tokenizes the prompt and injects prompt_token_ids into the response 3. Copies nvext.completion_token_ids to choice.token_ids for Prime-RL/verifiers This brings is_masked from 45% (missing token IDs) down to 0.13% (better than baseline vLLM at 1.2%). Mismatch KL drops from ~1.0 to 0.005. Validated: 20-step training with peak reward 0.825 (vs 0.798 vLLM baseline).

Prime-RL's LoRA training loop broadcasts PEFT-style adapter weights each step (adapter_model.safetensors + adapter_config.json) and expects the inference side to hot-swap them in place. The existing /v1/rl/update_weights path calls vLLM's reload_weights -> qwen2.load_weights, which treats the adapter keys as base-model weights and fails with KeyError: 'layers.0.mlp.gate_up_proj.lora_A.weight' These two new RL admin endpoints give Prime-RL a filesystem-native LoRA swap path that matches prime-rl/src/prime_rl/utils/client.py's load_lora_adapter() branch: POST /v1/rl/load_lora_adapter {"lora_name", "lora_path"} POST /v1/rl/unload_lora_adapter {"lora_name"} On the Rust frontend (lib/llm/src/http/service/openai.rs) these mirror rl_update_weights: validate body, fan out to every worker system port as /engine/<route>, require all_ok for 200 OK. Two new RouteDoc entries and route mounts in rl_router(). On the vLLM worker (components/src/dynamo/vllm/handlers.py) the new load_lora_adapter() handler is distinct from the existing URI-based load_lora() gRPC path: * Reads the adapter directly from the given filesystem path -- no LoRAManager / no URI fetch. * Hot-swap semantics: calling with a lora_name that is already loaded removes the previous adapter (same deterministic int id) and adds the new one, then invalidates the prefix cache so stale KV entries keyed to the previous adapter weights do not poison rollouts. * Publishes a ModelDeploymentCard on first load; Prime-RL's scheduler switches its request model field to the LoRA name after load (self.model_name = self.lora_name), so the frontend needs an MDC to route r16-a32 -> this worker. Skipped on hot-swap since the MDC is already in place. unload_lora_adapter unregisters the MDC. Both handlers are registered as engine routes in worker_factory.py on both decode and prefill worker code paths (sibling to pause_generation / update_weights_from_path / etc.), so the Rust fan-out to /engine/load_lora_adapter resolves identically. vLLM prerequisites: engine must be started with --enable-lora --max-lora-rank R --max-loras N with R >= the adapter rank and N >= the number of distinct lora_name values held at once. For Prime-RL's single-adapter training loop --max-loras 1 is sufficient. Validated end-to-end with Prime-RL bis/dynamo-integration against Qwen/Qwen3-0.6B + math-env on GSM8K: * 20/20 trainer steps SUCCESS (exit 0) * Real gradients throughout (Loss 1e-3 range, Entropy 0.3-0.7, Mismatch KL 1e-3, Grad Norm 0.1-0.3) * Step 1 shows "LoRA adapter loaded"; steps 2+ show "hot-swapped" * Peak trainer memory 29.6 GiB on A6000 (vs 29.8 GiB full FT) * Orchestrator rewards 0.0-0.3 across GSM8K problems Also documented both endpoints in docs/Dynamo-RL-api-draft.md (endpoint summary table plus detail sections mirroring update_weights style).

0.19.1 fixes a device-manager context bug that broke reload_weights for LoRA: `with torch.device(load_device)` in gpu_model_runner left weights on CPU when reloading into a restricted CUDA_VISIBLE_DEVICES. 0.19.1 removes that context manager entirely. Validated locally with prime-rl Full FT (20 steps, Qwen/Qwen3-0.6B, math-env / GSM8K) -- reward range 0.1875-0.6250, peak trainer memory 29.8 GiB on A6000.

…atches After vLLM is installed, pip-install prime-rl from a pinned tag (default: v0.5.1.dev101) with --no-deps so only the inference subtree + the vllm.general_plugins entry-point land in the image. Once registered, vLLM's plugin loader auto-invokes prime_rl.inference.patches:transformers_v5_compat in every worker process (including spawned subprocesses), applying the upstream-maintained patch chain (LoRA adapter load, DP engine pause/resume deadlock, Qwen 3.5 LoRA, transformers v5 config compat, etc.). Trade-offs: - --no-deps keeps the image footprint small (~3 MB of trainer .py left on disk but never imported). - --ignore-requires-python: prime-rl pins ~=3.12.0; Dynamo containers are 3.12 so this is a no-op there. The flag only helps local dev venvs on 3.11. - Override tag at build time via --build-arg PRIME_RL_REF=<tag>. - Sanity probe fails the build if the entry-point is not registered. Alternative considered and rejected: - Vendoring prime-rl's patches directly into Dynamo -> diverges from upstream; prime-rl ships frequent patch updates. - A dynamo-authored vllm.general_plugins shim -> duplicates prime-rl maintenance burden; better to re-use their authoritative version. Validated locally (vllm 0.19.1, prime-rl v0.5.0 tag): - Plugin entry-point registers - vLLM injects NCCLWeightUpdateWorker methods into Worker base at init - init_broadcaster reachable via collective_rpc - NCCL weight broadcast completes over NET/Socket transport (5/5 steps)

uv pip install does not accept --ignore-requires-python; the flag caused the image build to fail at the prime-rl install step: error: unexpected argument '--ignore-requires-python' found Removing the flag is safe for container builds: PYTHON_VERSION=3.12 in all vllm-runtime Dockerfiles matches prime-rl's requires-python = "~=3.12.0". The flag was only relevant for local 3.11 dev venvs; the comment now points those users to regular pip instead of uv.

…eResult Main commit 2cabf44 (feat: Decoder clean-up for incomplete multi-byte sequences, #8022) changed Decoder::decode() to return Result<DecodeResult> instead of Result<String>. The three backend tokenizer impls were updated, but two spots were missed: - lib/llm/src/tokenizers.rs: the default convert_ids_to_tokens trait impl .collect()s an iterator whose items are now DecodeResult, not String. Map via String::from (existing `impl From<DecodeResult> for String`) before collecting. - lib/llm/src/http/service/openai.rs: the /v1/detokenize handler passes DecodeResult directly into DetokenizeResponse { prompt: String }. Extract the inner string via .into() before constructing the response. Verified with `cargo check -p dynamo-py3 --features kv-indexer --lib` (the crate maturin builds in the container).

coderabbitai · 2026-04-23T19:42:30Z

Walkthrough

This pull request introduces Reinforcement Learning (RL) capabilities across the Dynamo platform, adding generation control (pause/resume), weight management, LoRA adapter hot-swapping, and token-in/token-out (TITO) endpoints with expanded tokenization APIs and vLLM plugin integration.

Changes

Cohort / File(s)	Summary
RL Handler & Route Registration `components/src/dynamo/vllm/handlers.py`, `components/src/dynamo/vllm/worker_factory.py`	Added 7 async HTTP handler methods (`pause_generation`, `resume_generation`, `flush_cache`, `update_weights_from_path`, `get_weight_version`, `load_lora_adapter`, `unload_lora_adapter`) in `BaseWorkerHandler` with distributed state management and LoRA lifecycle tracking; registered new routes in worker factory for generation control, cache/weight operations, and adapter management.
Frontend & Engine Communication `components/src/dynamo/frontend/vllm_processor.py`, `components/src/dynamo/vllm/publisher.py`	Added post-processing to populate logprobs from engine responses and stream completion token IDs; added null guard for `scheduler_stats` to prevent dereferences during transitional states.
vLLM Plugin Integration `container/deps/vllm/install_vllm.sh`	Bumped vLLM version to `0.19.1` and added installation of `prime-rl` plugin from Git with runtime verification of entry-point availability.
OpenAI HTTP Service & Routing `lib/llm/src/http/service/openai.rs`	Introduced tokenization (`/v1/tokenize`, `/v1/detokenize`) and TITO (`/v1/chat/completions/tokens`) routers; added RL admin router (`/v1/rl/*`) for fleet control (health, pause/resume, weight updates, LoRA operations); extended chat completions handler to promote RL request fields and rewrite responses with token IDs.
Service Configuration & Routing `lib/llm/src/http/service/service_v2.rs`	Added `enable_rl` configuration flag and wired RL routes and tokenization router into system router under `/v1/rl/*` and system endpoints, gated by flag or environment variable.
Request/Response Data Types `lib/llm/src/protocols/openai/chat_completions.rs`, `lib/llm/src/protocols/openai/nvext.rs`, `lib/llm/src/protocols/openai/tokenization.rs`	Extended chat completion request with `tokens` and `return_token_ids` fields; added `prompt_token_ids` to response; added `completion_token_ids` to nvext extension; created new tokenization module with serde-annotated request/response structs for tokenize/detokenize operations.
Response Processing & Aggregation `lib/llm/src/protocols/openai/chat_completions/aggregator.rs`, `lib/llm/src/protocols/openai/chat_completions/delta.rs`, `lib/llm/src/protocols/openai/completions/delta.rs`	Updated response aggregation to initialize `prompt_token_ids`; extended delta generator to accumulate and emit `completion_token_ids` across streaming chunks; injected `completion_token_ids` into nvext payloads.
Input Processing & Token Handling `lib/llm/src/preprocessor.rs`, `lib/llm/src/entrypoint/input/text.rs`, `lib/llm/src/protocols/openai/responses/mod.rs`	Refactored token preparation to always use `nvext.token_data` when present independent of `backend_instance_id`; added explicit initialization of `tokens` and `return_token_ids` to `None` in request construction.
Protocol Extensions `lib/llm/src/protocols/openai.rs`, `lib/llm/src/protocols/openai/validate.rs`, `lib/llm/src/protocols/anthropic/types.rs`, `lib/llm/src/audit/stream.rs`	Added `get_return_tokens_as_token_ids()` optional method to OpenAI output options; imported tokenization module; allowed `cache_salt` as passthrough field in validation; updated Anthropic and audit stream conversions to set token fields to `None`.
Tokenizer API & Implementations `lib/llm/src/tokenizers.rs`, `lib/llm/src/tokenizers/fastokens.rs`, `lib/llm/src/tokenizers/hf.rs`, `lib/llm/src/tokenizers/tiktoken.rs`	Added `encode_with_special_tokens` and `convert_ids_to_tokens` methods to tokenizer trait and all implementations (FastTokenizer, HuggingFace, TikToken); updated encoding to respect special-token handling flags and support token-to-string conversion.
Documentation `docs/Dynamo-RL-api-draft.md`	New comprehensive API specification document describing RL-oriented features, TITO endpoint, tokenization endpoints, fleet control plane routes, and internal vLLM engine route mappings with limitations and validation tables.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is completely empty—only showing the template structure with placeholder sections and no substantive content in any required fields.	Provide a complete description filling in Overview (what problem is solved), Details (changes made), and Where should the reviewer start (key files to review).
Title check	❓ Inconclusive	The title '[draft] prime-rl integration' is vague and generic, using non-descriptive terms that don't clearly convey the specific changes in this substantial multi-file changeset.	Consider a more descriptive title that highlights the primary change, e.g., 'Add Prime-RL runtime administration and token-id endpoints' or specify the main focus area.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

⚔️ Resolve merge conflicts

Resolve merge conflict in branch bis/parity-tokenize-tcp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 17

🧹 Nitpick comments (10)

lib/llm/src/preprocessor.rs (1)

585-594: Optional: collapse the two tokenizing branches.

The else if has_backend_instance_id { ... } and final else { ... } branches have identical result expressions — only the tracing::warn! differs. They can be flattened with a guard log, reducing the risk of future drift:

♻️ Suggested simplification

-                            } else if has_backend_instance_id {
-                                tracing::warn!(
-                                    "backend_instance_id provided but no token_data; tokenizing prompt"
-                                );
-                                let encoding = self.encode_with_timing(&prompt, tracker)?;
-                                (encoding.token_ids().to_vec(), false)
-                            } else {
-                                let encoding = self.encode_with_timing(&prompt, tracker)?;
-                                (encoding.token_ids().to_vec(), false)
-                            };
+                            } else {
+                                if has_backend_instance_id {
+                                    tracing::warn!(
+                                        "backend_instance_id provided but no token_data; tokenizing prompt"
+                                    );
+                                }
+                                let encoding = self.encode_with_timing(&prompt, tracker)?;
+                                (encoding.token_ids().to_vec(), false)
+                            };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/preprocessor.rs` around lines 585 - 594, The two branches under
the else/else if both call self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false), differing only by a tracing::warn! when
has_backend_instance_id is true; refactor by removing the duplicate branches: in
the outer else, if has_backend_instance_id { tracing::warn!(...); } then call
let encoding = self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false) — this collapses the branches while
preserving the warning behavior and using the existing encode_with_timing,
prompt, tracker, and has_backend_instance_id symbols.

lib/llm/src/protocols/openai/chat_completions/delta.rs (1)

466-486: Nit: use std::mem::take to release the accumulator vector on emit.

self.accumulated_completion_token_ids.clone() retains the backing allocation on the generator even though no more chunks will follow finish_reason. Swapping in std::mem::take releases it promptly:

♻️ Proposed change

-        let completion_token_ids =
-            if self.options.enable_completion_token_ids && finish_reason.is_some() {
-                Some(self.accumulated_completion_token_ids.clone())
-            } else {
-                None
-            };
+        let completion_token_ids =
+            if self.options.enable_completion_token_ids && finish_reason.is_some() {
+                Some(std::mem::take(&mut self.accumulated_completion_token_ids))
+            } else {
+                None
+            };

Also, consider a unit test in this module covering: opt-in via extra_fields: ["completion_token_ids"], mid-stream chunks carry no completion_token_ids, and the final chunk's nvext.completion_token_ids equals the concatenation of per-chunk token_ids (hard invariant called out in nvext.rs).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/protocols/openai/chat_completions/delta.rs` around lines 466 -
486, The code currently clones self.accumulated_completion_token_ids into
completion_token_ids causing the accumulator's allocation to be retained;
replace the clone with std::mem::take(&mut
self.accumulated_completion_token_ids) when building completion_token_ids (only
when self.options.enable_completion_token_ids && finish_reason.is_some()) so the
generator releases the vector's backing buffer immediately; ensure the rest of
the branch (where completion_token_ids is moved into NvExtResponse) still
compiles and preserves behavior of NvExtResponse::completion_token_ids and
related logic involving token_ids, worker_id_info, timing_info, and
routed_experts; optionally add a unit test that opts in via extra_fields:
["completion_token_ids"] confirming mid-stream chunks lack completion_token_ids
and the final nvext.completion_token_ids equals the concatenation of per-chunk
token_ids.

components/src/dynamo/vllm/worker_factory.py (1)

338-353: Extract duplicate engine-route registration to a helper function.

Both _create_decode_worker (lines 338–353) and _create_prefill_worker (lines 586–601) register the same 7 RL routes (pause_generation, resume_generation, flush_cache, update_weights_from_path, get_weight_version, load_lora_adapter, unload_lora_adapter) with an identical log message. This duplication invites drift—e.g., adding a route to one path and forgetting the other. Since both DecodeWorkerHandler and PrefillWorkerHandler inherit from BaseWorkerHandler and all methods are defined there, consider extracting:

♻️ Proposed refactor

_RL_ROUTES: tuple[str, ...] = (
    "pause_generation",
    "resume_generation",
    "flush_cache",
    "update_weights_from_path",
    "get_weight_version",
    "load_lora_adapter",
    "unload_lora_adapter",
)

def _register_lifecycle_and_rl_routes(runtime: DistributedRuntime, handler) -> None:
    runtime.register_engine_route("sleep", handler.sleep)
    runtime.register_engine_route("wake_up", handler.wake_up)
    runtime.register_engine_route("scale_elastic_ep", handler.scale_elastic_ep)
    for name in _RL_ROUTES:
        runtime.register_engine_route(name, getattr(handler, name))
    logger.info(
        "Registered engine routes: sleep, wake_up, scale_elastic_ep, %s",
        ", ".join(_RL_ROUTES),
    )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/worker_factory.py` around lines 338 - 353, Extract
the duplicated engine-route registration in _create_decode_worker and
_create_prefill_worker into a helper function (e.g.,
_register_lifecycle_and_rl_routes) that takes the DistributedRuntime and a
handler instance; define a single _RL_ROUTES tuple containing the seven RL route
names and use a loop to call runtime.register_engine_route(name,
getattr(handler, name)) for each, also registering "sleep", "wake_up", and
"scale_elastic_ep" there, and replace the duplicate
runtime.register_engine_route calls and logger.info in both functions with a
single call to the new helper; reference BaseWorkerHandler (handler methods),
runtime.register_engine_route, _create_decode_worker, and _create_prefill_worker
when making the change.

lib/llm/src/tokenizers/fastokens.rs (1)

49-63: Document (and optionally test) the HF fallback for add_special_tokens=true.

fastokens doesn't support special-token-aware encoding, so this path transparently falls back to HuggingFaceTokenizer, which means callers lose the fastokens speedup whenever they request special tokens. That's a reasonable choice, but the struct-level doc (Lines 20–22) still claims "fast BPE encoding via fastokens, decoding via HuggingFace" which is now incomplete. A one-liner here plus an extension of test_fast_matches_hf_encoding with add_special_tokens=true would make the behavior explicit and guard against regressions (e.g., mismatched special-token IDs between the two backends loaded from the same tokenizer.json).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers/fastokens.rs` around lines 49 - 63, Update the
struct-level documentation for the fastokens tokenizer to state that
encode_with_special_tokens falls back to the HuggingFace tokenizer when
add_special_tokens=true (i.e., special-token-aware encoding is performed by
hf_decoder, so callers lose fastokens speedup in that case), and add/extend the
test_fast_matches_hf_encoding test to call encode_with_special_tokens with
add_special_tokens=true and assert the HF fallback produces matching token
ids/special-token ids compared to the fast path where applicable; locate the
behavior in the encode_with_special_tokens method (uses self.hf_decoder and
self.fast_encoder) and ensure the test compares outputs from
hf_decoder.encode_with_special_tokens(...) and fast_encoder.encode(...) wrapped
as Encoding::Sp (and guards for any differences in special-token id handling).

lib/llm/src/tokenizers.rs (1)

65-71: Silent default for encode_with_special_tokens is error-prone.

The default implementation silently ignores _add_special_tokens and delegates to encode. If a new Encoder backend is added later and the author forgets to override this method, callers requesting add_special_tokens = true will quietly get encoding without special tokens — a correctness bug with no compile-time or runtime signal. All current backends override this, so a stricter contract is free today.

♻️ Two safer alternatives

Either make it a required trait method (no default), or have the default fail loudly when the flag is set:

-        fn encode_with_special_tokens(
-            &self,
-            input: &str,
-            _add_special_tokens: bool,
-        ) -> Result<Encoding> {
-            self.encode(input)
-        }
+        fn encode_with_special_tokens(
+            &self,
+            input: &str,
+            add_special_tokens: bool,
+        ) -> Result<Encoding> {
+            if add_special_tokens {
+                return Err(Error::msg(
+                    "encode_with_special_tokens(add_special_tokens=true) not supported by this backend",
+                ));
+            }
+            self.encode(input)
+        }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers.rs` around lines 65 - 71, The trait's default impl of
encode_with_special_tokens silently ignores the _add_special_tokens flag; remove
the default implementation from the trait so encode_with_special_tokens becomes
a required method (i.e., declare fn encode_with_special_tokens(&self, input:
&str, add_special_tokens: bool) -> Result<Encoding> without a body) and then
update each Encoder implementation that currently relies on the default to
implement encode_with_special_tokens explicitly (search for
encode_with_special_tokens and implementations of the Encoder trait to add the
proper handling).

lib/llm/src/http/service/service_v2.rs (1)

546-550: Promote DYN_ENABLE_RL to an env_llm constant.

The rest of this file references environment variable names through the env_llm module (e.g., env_llm::DYN_ENABLE_STREAMING_TOOL_DISPATCH, env_llm::DYN_HTTP_GRACEFUL_SHUTDOWN_TIMEOUT_SECS). The hardcoded "DYN_ENABLE_RL" string here — plus the DYN_RL_WORKER_SYSTEM_URLS mentioned in the doc comment at Lines 249–251 — breaks that convention and makes these env vars harder to discover/grep and prone to typos.
♻️ Suggested change

Add the constant alongside existing ones in dynamo_runtime::config::environment_names::llm and use it here:
-        if config.enable_rl || env_is_truthy("DYN_ENABLE_RL") {
+        if config.enable_rl || env_is_truthy(env_llm::DYN_ENABLE_RL) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/service_v2.rs` around lines 546 - 550, Add a new
constant DYN_ENABLE_RL to the env_llm module
(dynamo_runtime::config::environment_names::llm) and replace the hardcoded
string usages with that constant; specifically change the condition that now
reads env_is_truthy("DYN_ENABLE_RL") to use
env_is_truthy(env_llm::DYN_ENABLE_RL) in the RL admin routes block alongside
config.enable_rl, and update any doc comments that mention the literal
DYN_RL_WORKER_SYSTEM_URLS to reference the env_llm constant name instead so all
environment names are discoverable and consistent.

components/src/dynamo/rl_admin/service.py (2)

151-168: Swallowed except Exception in /ready should log.

The two except Exception: ...False blocks hide the actual failure reason, so an operator watching a not-ready service has no signal as to whether the frontend is down, DNS is broken, or TLS handshakes are failing. Log at debug level so the reason is retrievable without changing the response shape.

♻️ Proposed fix

     try:
         frontend_ok = (await client.get(f"{DYNAMO_FRONTEND_URL}/v1/models")).status_code == 200
-    except Exception:
+    except httpx.HTTPError as e:
+        logger.debug("ready: frontend probe failed: %s", e)
         frontend_ok = False
     try:
         system_ok = (await client.get(f"{DYNAMO_SYSTEM_URL}/health")).status_code == 200
-    except Exception:
+    except httpx.HTTPError as e:
+        logger.debug("ready: system probe failed: %s", e)
         system_ok = False

As per coding guidelines: "catch only specific exceptions you can handle, and if you catch broadly (except Exception) you must log and re-raise."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/rl_admin/service.py` around lines 151 - 168, The
ready() health check currently swallows exceptions when probing
DYNAMO_FRONTEND_URL and DYNAMO_SYSTEM_URL; update the two try/except blocks that
call _get_http_client().get(...) to catch the exception, log the full exception
details at debug (include context like which URL failed and the exception)
before setting frontend_ok/system_ok = False, and do not change the JSONResponse
shape or status codes; use the existing logger (or create one if none) to emit
the debug-level messages so operators can inspect failures without altering
behavior.

54-59: Use lifespan context manager instead of deprecated @app.on_event("shutdown").

The on_event decorator is deprecated in FastAPI (with deprecation warnings present). While still functional, it is slated for eventual removal. Since this is a new file, implement using the lifespan async context manager from the start to avoid a future migration.

Suggested refactor

from contextlib import asynccontextmanager

`@asynccontextmanager`
async def lifespan(app: FastAPI):
    yield
    global _http_client
    if _http_client is not None:
        await _http_client.aclose()
        _http_client = None

app = FastAPI(title="Dynamo RL Admin", lifespan=lifespan)

Add from contextlib import asynccontextmanager at the top of the file with the other imports (line 24–31).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/rl_admin/service.py` around lines 54 - 59, Replace the
deprecated FastAPI shutdown event handler by adding an asynccontextmanager-based
lifespan and passing it into the FastAPI constructor: import asynccontextmanager
from contextlib, define async def lifespan(app: FastAPI) decorated with
`@asynccontextmanager` that yields and then performs the same cleanup of the
global _http_client (close and set to None), and create app =
FastAPI(title="Dynamo RL Admin", lifespan=lifespan) instead of relying on the
_shutdown function; remove the `@app.on_event`("shutdown") _shutdown definition
afterward.

components/src/dynamo/vllm/handlers.py (2)

705-881: Lock-map growth: new RL LoRA handlers don't prune _lora_load_locks like their URI counterparts.

load_lora (lines 1202–1210) and unload_lora (lines 1323–1330) both wrap the critical section in a try/finally that removes the per-LoRA lock from _lora_load_locks once the adapter is no longer tracked. The new load_lora_adapter / unload_lora_adapter skip that cleanup, so over a long RL run every distinct lora_name ever loaded (or ever attempted on a failed load) leaves a permanent entry in the lock dict. It's not a functional bug, but for a training loop that hot-swaps adapters repeatedly it's worth mirroring the existing pattern for consistency.
♻️ Suggested pattern (apply to both new handlers)
             lock = self._get_lora_lock(lora_name)
             async with lock:
-                ...
+                try:
+                    ...
+                finally:
+                    with self._lora_load_locks_guard:
+                        if (
+                            lora_name not in self.loaded_loras
+                            and self._lora_load_locks.get(lora_name) is lock
+                        ):
+                            self._lora_load_locks.pop(lora_name, None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/handlers.py` around lines 705 - 881, The handlers
load_lora_adapter and unload_lora_adapter leak entries in the _lora_load_locks
map because they don't remove the per-lora lock after the critical section;
update both functions to mirror the existing pattern in load_lora/unload_lora by
wrapping the async with lock block in a try/finally and in the finally remove
the lock from self._lora_load_locks (use the same key lora_name) when the
adapter is no longer tracked (i.e., after successful unload or on failure) so
_get_lora_lock and loaded_loras remain consistent and the lock-map doesn't grow
unbounded.
829-881: unload_lora_adapter state ordering is fine, but note the asymmetric error semantics.

del self.loaded_loras[lora_name] on line 853 happens before unregister_model, so if unregister fails the adapter is gone from both the engine and local tracking yet the caller sees status=error. That matches the idempotent contract in the docstring (a retry hits the lora is None no-op branch and returns ok), but it differs from the older unload_lora which rolls the engine state back on unregister failure (lines 1278–1300). Worth a one-line comment noting the intentional divergence so a future maintainer doesn't "fix" it by copying the old rollback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/handlers.py` around lines 829 - 881, The
unload_lora_adapter method currently deletes the local tracking entry (del
self.loaded_loras[lora_name]) before calling unregister_model, which means if
unregister_model fails the adapter is already removed from engine and local
state but the method returns status=error; add a one-line comment immediately
above the del self.loaded_loras[lora_name] (or above the unregister_model call)
explaining that this ordering is intentional and documents the asymmetric error
semantics (idempotent no-op on retry) and that it intentionally differs from the
older unload_lora rollback behavior to avoid re-registering the adapter on
unregister failure.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/rl_admin/service.py`:
- Around line 92-115: The function _extract_completion_ids_from_logprobs is
reconstructing token IDs from decoded bytes (using _byte_token_cache,
_encode_single, bytes.decode(..., errors="replace")) which is incorrect and can
silently corrupt RL training; replace its use with the authoritative token IDs
produced by the Rust/vLLM path (read choice["token_ids"] or
nvext.completion_token_ids promoted by rl_promote_token_ids_in_response) and
remove or deprecate this reconstruction logic; if you must keep a fallback, do
not use bytes.decode(..., errors="replace") or return 0 — instead populate
_byte_token_cache deterministically, ensure _tokenizer_for_fallback is valid
(propagate tokenizer load errors instead of swallowing), and return a clear
sentinel (e.g., None or raise) so callers can skip or fail fast rather than
training on <unk>.
- Around line 405-430: The service main() currently defaults host to "0.0.0.0",
exposing admin endpoints; change the default for parser.add_argument("--host",
...) to "127.0.0.1" (or "localhost") so DYNAMO admin endpoints are bound to
loopback by default, and add an explicit opt-in (e.g., a boolean flag
--allow-external or require setting an env var) that, when true, allows
overriding to external addresses; update related logging and any usage of
args.host in uvicorn.run(app, host=...) and keep DYNAMO_FRONTEND_URL,
DYNAMO_SYSTEM_URL, and MODEL_NAME handling untouched.
- Around line 315-344: The _inject_token_ids function currently uses an assert
to validate token alignment and catches all Exceptions, which can be stripped
under python -O or be swallowed by the broad except; replace the assert in
_inject_token_ids (and the validation around
_extract_completion_ids_from_logprobs and len comparisons) with an explicit
raise (e.g., raise ValueError or RuntimeError with a clear message) when the
lengths mismatch, and narrow the try/except to only catch and handle
tokenization/HTTP-specific errors coming from _tokenize_via_frontend (e.g.,
TokenizationError, HTTPError) while allowing alignment errors to propagate (or
re-raise after logging) so failures aren’t silently converted to warnings and
missing choice["token_ids"] don’t slip through.
- Around line 118-139: The fallback tokenizer loading in _encode_single
currently imports transformers inside the function, swallows all exceptions, and
silently returns 0 on failure; move the import to module top as a guarded import
(e.g., try: from transformers import AutoTokenizer except ImportError:
AutoTokenizer = None), then in _encode_single check if AutoTokenizer is not None
before attempting to instantiate; narrow the exception handling when creating
the tokenizer to specific exceptions (ImportError, OSError, ValueError) and log
failures with the module logger (e.g., logging.getLogger(__name__)) rather than
using a bare except: pass, and ensure on load failure _tokenizer_for_fallback
remains None so callers can detect fallback absence while errors are recorded;
reference symbols: _encode_single, _tokenizer_for_fallback, AutoTokenizer,
MODEL_NAME.

In `@components/src/dynamo/vllm/handlers.py`:
- Around line 801-804: The rollback's remove_lora call currently swallows all
exceptions (try: await self.engine_client.remove_lora(lora_id) except: pass);
replace this with catching the exception as e and logging it (e.g.,
logger.warning or logger.exception) including lora_id and any engine id/context
so the leak is diagnosable, and do not silently pass; keep the existing
surrounding rollback flow in the method that manipulates self.loaded_loras and
use the existing logger to record the failure instead of suppressing it.
- Around line 738-763: The hot-swap path removes the old adapter from the engine
(engine_client.remove_lora) but leaves the stale LoRAInfo in self.loaded_loras,
so an add_lora failure leaves the worker pointing at an unregistered id; fix by
removing the dict entry as soon as remove_lora succeeds (i.e., delete
self.loaded_loras[lora_name] right after the remove_lora call inside the
is_hot_swap branch) and only populate self.loaded_loras[lora_name] =
LoRAInfo(...) after engine_client.add_lora completes successfully; ensure
add_lora exceptions do not reintroduce the stale entry (handle exceptions around
engine_client.add_lora in the same method that uses _get_lora_lock and
LoRARequest).

In `@container/deps/vllm/install_vllm.sh`:
- Line 295: Replace the non-portable echo invocation that prints a leading
newline and the vLLM plugin message with a portable printf-based call: locate
the echo line containing the literal string "\n=== Installing prime-rl vLLM
plugin (ref=${PRIME_RL_REF}) ===" and replace it so the script emits an actual
newline before and after the message using printf while interpolating
${PRIME_RL_REF}; this avoids shell-dependent echo escape behavior.

In `@docs/Dynamo-RL-api-draft.md`:
- Line 201: Several fenced code blocks are untyped (MD040); update each
triple-backtick fence (``` ...) to include an appropriate language identifier
(e.g., ```json, ```bash, ```toml, ```text) so markdownlint passes and
readability improves; search the document for unannotated fences and replace
each opening ``` with the correct language token based on the block content
(examples: shell commands -> bash, JSON payloads -> json, config snippets ->
toml, plain output -> text).

In `@lib/llm/src/http/service/openai.rs`:
- Around line 206-222: Remove the stale three-line doc comment that describes
"Not Implemented Error" which was accidentally left immediately above the pub fn
bad_request; update the comment block so only the "Bad Request Error" doc
remains directly above bad_request, and ensure the not_implemented_error doc
sits only with the not_implemented_error function (referencing the functions
bad_request and not_implemented_error to locate the misplaced comment).
- Around line 1463-1478: The RL post-processing currently swallows serialization
errors from serde_json::to_value(&response) and returns a half-populated
response; modify the Err branch inside the if let Some(ref messages) =
rl_saved_messages { ... } block so that when serde_json::to_value fails you (a)
log the serialization error (include the error and context like which
model/operation) using the project's logger (e.g., tracing::warn or the existing
logger used elsewhere), and (b) return an internal-server-error response instead
of falling through — this affects the code around rl_tokenize_prompt,
serde_json::to_value(&response), and rl_promote_token_ids_in_response so the
failure is both recorded and surfaced rather than silently dropping promoted
choices[i].token_ids.
- Around line 3427-3451: The rl_router() currently exposes privileged endpoints
(rl_update_weights, rl_load_lora_adapter, rl_unload_lora_adapter, rl_pause,
rl_resume) with no auth; add a bearer/shared-secret check middleware (e.g.,
middleware::from_fn(auth_middleware)) to the router chain before
.with_state(rl_state), update the rl_router() docstring to state this router
must only run on a trusted control-plane/admin listener (not the public
completions port), and implement allow-list validation/canonicalization for path
inputs inside rl_update_weights, rl_load_lora_adapter and rl_unload_lora_adapter
(validate weight_dir and lora_path against an explicit list of allowed root
directories and canonicalize paths before fan-out to workers) so only approved
directories are forwarded.

In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 73-84: The merged_chat_template_kwargs method currently
unconditionally overwrites user-provided keys "add_generation_prompt" and
"continue_final_message"; update the implementation so user-supplied values are
preserved by using
kwargs.entry("add_generation_prompt").or_insert(serde_json::Value::Bool(self.add_generation_prompt))
and similarly for "continue_final_message", or alternatively add a validation in
the struct's validate() that rejects collisions if both chat_template_kwargs
contains those keys and the top-level fields are set; reference the
merged_chat_template_kwargs function (or the validate method if choosing the
reject option) to locate where to change behavior.
- Around line 25-44: Add a short documentation comment above both
TokenizeCompletionRequest::add_special_tokens and
TokenizeChatRequest::add_special_tokens explaining that the differing defaults
are intentional: TokenizeCompletionRequest defaults to true because completion
requests need explicit special-token addition, while TokenizeChatRequest
defaults to false because vLLM chat templates handle special tokens internally
(avoiding duplicate BOS), so future maintainers should not change them to match
each other. Ensure the comments appear directly above the fields in the
respective structs so they are preserved in the serde definitions.

In `@lib/llm/src/tokenizers/hf.rs`:
- Around line 74-81: The current implementation of
HuggingFaceTokenizer::convert_ids_to_tokens silently maps unknown IDs to "" via
unwrap_or_default(); change convert_ids_to_tokens to explicitly detect when
self.tokenizer.id_to_token(id) returns None and return an Err containing a clear
message with the offending id (e.g., format!("unknown token id: {}", id))
instead of producing empty strings, so callers of
Tokenizer::convert_ids_to_tokens (and ultimately TokenizeResponse.token_strs)
can surface the error; locate the logic in the convert_ids_to_tokens method of
the HuggingFaceTokenizer implementation and replace the
iterator/map+unwrap_or_default with a loop or try-fold that returns Err on any
missing id.

In `@lib/llm/src/tokenizers/tiktoken.rs`:
- Around line 102-123: Encoder::encode in TikTokenTokenizer currently calls
encode_with_special_tokens(..., true) which creates asymmetric default behavior
versus HuggingFaceTokenizer::encode and FastTokenizer::encode (which pass
false); update TikTokenTokenizer::encode to match the other backends (pass
false) so all Encoder::encode implementations behave the same, or if you intend
to preserve the asymmetry, add a clear comment next to TikTokenTokenizer::encode
and the encode_with_special_tokens method documenting why TikTokenTokenizer
defaults to true and noting the behavioral difference so callers are aware to
explicitly opt in to special tokens.
- Around line 156-169: convert_ids_to_tokens currently uses
String::from_utf8_lossy on each token and returns empty strings for unknown IDs,
which corrupts single-byte tokens and hides missing IDs; modify
Tokenizer::convert_ids_to_tokens to mirror the decode() strategy: for each
TokenIdType look up bytes via decoder_tokens or special_tokens_decoder, return
Err if an ID is missing, then try strict UTF-8 (String::from_utf8) first and
only fall back to lossy conversion (from_utf8_lossy) for multi-byte sequences
(preserving the original bytes for single-byte tokens instead of letting them
become U+FFFD), using the same helper logic/policy used by decode() to decide
when to use lossy decoding.

---

Nitpick comments:
In `@components/src/dynamo/rl_admin/service.py`:
- Around line 151-168: The ready() health check currently swallows exceptions
when probing DYNAMO_FRONTEND_URL and DYNAMO_SYSTEM_URL; update the two
try/except blocks that call _get_http_client().get(...) to catch the exception,
log the full exception details at debug (include context like which URL failed
and the exception) before setting frontend_ok/system_ok = False, and do not
change the JSONResponse shape or status codes; use the existing logger (or
create one if none) to emit the debug-level messages so operators can inspect
failures without altering behavior.
- Around line 54-59: Replace the deprecated FastAPI shutdown event handler by
adding an asynccontextmanager-based lifespan and passing it into the FastAPI
constructor: import asynccontextmanager from contextlib, define async def
lifespan(app: FastAPI) decorated with `@asynccontextmanager` that yields and then
performs the same cleanup of the global _http_client (close and set to None),
and create app = FastAPI(title="Dynamo RL Admin", lifespan=lifespan) instead of
relying on the _shutdown function; remove the `@app.on_event`("shutdown")
_shutdown definition afterward.

In `@components/src/dynamo/vllm/handlers.py`:
- Around line 705-881: The handlers load_lora_adapter and unload_lora_adapter
leak entries in the _lora_load_locks map because they don't remove the per-lora
lock after the critical section; update both functions to mirror the existing
pattern in load_lora/unload_lora by wrapping the async with lock block in a
try/finally and in the finally remove the lock from self._lora_load_locks (use
the same key lora_name) when the adapter is no longer tracked (i.e., after
successful unload or on failure) so _get_lora_lock and loaded_loras remain
consistent and the lock-map doesn't grow unbounded.
- Around line 829-881: The unload_lora_adapter method currently deletes the
local tracking entry (del self.loaded_loras[lora_name]) before calling
unregister_model, which means if unregister_model fails the adapter is already
removed from engine and local state but the method returns status=error; add a
one-line comment immediately above the del self.loaded_loras[lora_name] (or
above the unregister_model call) explaining that this ordering is intentional
and documents the asymmetric error semantics (idempotent no-op on retry) and
that it intentionally differs from the older unload_lora rollback behavior to
avoid re-registering the adapter on unregister failure.

In `@components/src/dynamo/vllm/worker_factory.py`:
- Around line 338-353: Extract the duplicated engine-route registration in
_create_decode_worker and _create_prefill_worker into a helper function (e.g.,
_register_lifecycle_and_rl_routes) that takes the DistributedRuntime and a
handler instance; define a single _RL_ROUTES tuple containing the seven RL route
names and use a loop to call runtime.register_engine_route(name,
getattr(handler, name)) for each, also registering "sleep", "wake_up", and
"scale_elastic_ep" there, and replace the duplicate
runtime.register_engine_route calls and logger.info in both functions with a
single call to the new helper; reference BaseWorkerHandler (handler methods),
runtime.register_engine_route, _create_decode_worker, and _create_prefill_worker
when making the change.

In `@lib/llm/src/http/service/service_v2.rs`:
- Around line 546-550: Add a new constant DYN_ENABLE_RL to the env_llm module
(dynamo_runtime::config::environment_names::llm) and replace the hardcoded
string usages with that constant; specifically change the condition that now
reads env_is_truthy("DYN_ENABLE_RL") to use
env_is_truthy(env_llm::DYN_ENABLE_RL) in the RL admin routes block alongside
config.enable_rl, and update any doc comments that mention the literal
DYN_RL_WORKER_SYSTEM_URLS to reference the env_llm constant name instead so all
environment names are discoverable and consistent.

In `@lib/llm/src/preprocessor.rs`:
- Around line 585-594: The two branches under the else/else if both call
self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false), differing only by a tracing::warn! when
has_backend_instance_id is true; refactor by removing the duplicate branches: in
the outer else, if has_backend_instance_id { tracing::warn!(...); } then call
let encoding = self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false) — this collapses the branches while
preserving the warning behavior and using the existing encode_with_timing,
prompt, tracker, and has_backend_instance_id symbols.

In `@lib/llm/src/protocols/openai/chat_completions/delta.rs`:
- Around line 466-486: The code currently clones
self.accumulated_completion_token_ids into completion_token_ids causing the
accumulator's allocation to be retained; replace the clone with
std::mem::take(&mut self.accumulated_completion_token_ids) when building
completion_token_ids (only when self.options.enable_completion_token_ids &&
finish_reason.is_some()) so the generator releases the vector's backing buffer
immediately; ensure the rest of the branch (where completion_token_ids is moved
into NvExtResponse) still compiles and preserves behavior of
NvExtResponse::completion_token_ids and related logic involving token_ids,
worker_id_info, timing_info, and routed_experts; optionally add a unit test that
opts in via extra_fields: ["completion_token_ids"] confirming mid-stream chunks
lack completion_token_ids and the final nvext.completion_token_ids equals the
concatenation of per-chunk token_ids.

In `@lib/llm/src/tokenizers.rs`:
- Around line 65-71: The trait's default impl of encode_with_special_tokens
silently ignores the _add_special_tokens flag; remove the default implementation
from the trait so encode_with_special_tokens becomes a required method (i.e.,
declare fn encode_with_special_tokens(&self, input: &str, add_special_tokens:
bool) -> Result<Encoding> without a body) and then update each Encoder
implementation that currently relies on the default to implement
encode_with_special_tokens explicitly (search for encode_with_special_tokens and
implementations of the Encoder trait to add the proper handling).

In `@lib/llm/src/tokenizers/fastokens.rs`:
- Around line 49-63: Update the struct-level documentation for the fastokens
tokenizer to state that encode_with_special_tokens falls back to the HuggingFace
tokenizer when add_special_tokens=true (i.e., special-token-aware encoding is
performed by hf_decoder, so callers lose fastokens speedup in that case), and
add/extend the test_fast_matches_hf_encoding test to call
encode_with_special_tokens with add_special_tokens=true and assert the HF
fallback produces matching token ids/special-token ids compared to the fast path
where applicable; locate the behavior in the encode_with_special_tokens method
(uses self.hf_decoder and self.fast_encoder) and ensure the test compares
outputs from hf_decoder.encode_with_special_tokens(...) and
fast_encoder.encode(...) wrapped as Encoding::Sp (and guards for any differences
in special-token id handling).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa7fafdb-7029-4614-bc70-dad1cad7d5da

📥 Commits

Reviewing files that changed from the base of the PR and between 6076441 and d837fbd.

📒 Files selected for processing (26)

components/src/dynamo/frontend/vllm_processor.py
components/src/dynamo/rl_admin/__init__.py
components/src/dynamo/rl_admin/service.py
components/src/dynamo/vllm/handlers.py
components/src/dynamo/vllm/publisher.py
components/src/dynamo/vllm/worker_factory.py
container/deps/vllm/install_vllm.sh
docs/Dynamo-RL-api-draft.md
lib/llm/src/audit/stream.rs
lib/llm/src/entrypoint/input/text.rs
lib/llm/src/http/service/openai.rs
lib/llm/src/http/service/service_v2.rs
lib/llm/src/preprocessor.rs
lib/llm/src/protocols/anthropic/types.rs
lib/llm/src/protocols/openai.rs
lib/llm/src/protocols/openai/chat_completions.rs
lib/llm/src/protocols/openai/chat_completions/aggregator.rs
lib/llm/src/protocols/openai/chat_completions/delta.rs
lib/llm/src/protocols/openai/completions/delta.rs
lib/llm/src/protocols/openai/nvext.rs
lib/llm/src/protocols/openai/responses/mod.rs
lib/llm/src/protocols/openai/tokenization.rs
lib/llm/src/tokenizers.rs
lib/llm/src/tokenizers/fastokens.rs
lib/llm/src/tokenizers/hf.rs
lib/llm/src/tokenizers/tiktoken.rs

coderabbitai · 2026-04-23T19:42:33Z

+# Cache for byte->token_id mappings (populated lazily)
+_byte_token_cache: dict[tuple, int] = {}
+_tokenizer_for_fallback = None
+
+
+def _encode_single(text: str) -> int:
+    """Encode a single token's text to its ID using a lightweight tokenizer."""
+    global _tokenizer_for_fallback
+    if _tokenizer_for_fallback is None:
+        try:
+            from transformers import AutoTokenizer
+            model = MODEL_NAME or os.getenv("MODEL_NAME", "")
+            if model:
+                _tokenizer_for_fallback = AutoTokenizer.from_pretrained(
+                    model, trust_remote_code=True
+                )
+        except Exception:
+            pass
+    if _tokenizer_for_fallback is not None:
+        ids = _tokenizer_for_fallback.encode(text, add_special_tokens=False)
+        return ids[0] if ids else 0
+    return 0


🛠️ Refactor suggestion | 🟠 Major

Fallback tokenizer load violates multiple Python guidelines.

Line 128 imports transformers inside a function. The import is only conditionally needed (fallback path), so the compliant form is a top-level try: from transformers import AutoTokenizer / except ImportError: AutoTokenizer = None, then guard with if AutoTokenizer is not None.

Lines 134–135 are try/except Exception: pass, which silently hides any failure to load the tokenizer (wrong model name, missing weights, HF auth failure, permission issues). Downstream, every subsequent call will return 0 for every token ID — an outcome that's indistinguishable from success until RL loss goes nonsensical.

The broad except Exception should be narrowed (at least to ImportError, OSError, ValueError) and must log.

♻️ Proposed fix

+try: + from transformers import AutoTokenizer # type: ignore +except ImportError: + AutoTokenizer = None # type: ignore + # Cache for byte->token_id mappings (populated lazily) _byte_token_cache: dict[tuple, int] = {} _tokenizer_for_fallback = None def _encode_single(text: str) -> int: """Encode a single token's text to its ID using a lightweight tokenizer.""" global _tokenizer_for_fallback - if _tokenizer_for_fallback is None: - try: - from transformers import AutoTokenizer - model = MODEL_NAME or os.getenv("MODEL_NAME", "") - if model: - _tokenizer_for_fallback = AutoTokenizer.from_pretrained( - model, trust_remote_code=True - ) - except Exception: - pass + if _tokenizer_for_fallback is None and AutoTokenizer is not None: + model = MODEL_NAME or os.getenv("MODEL_NAME", "") + if model: + try: + _tokenizer_for_fallback = AutoTokenizer.from_pretrained( + model, trust_remote_code=True + ) + except (OSError, ValueError) as e: + logger.warning("Fallback tokenizer load failed for %s: %s", model, e) if _tokenizer_for_fallback is not None:

As per coding guidelines: "Keep all import statements at module top; do not import inside functions/methods/classes." and "Prefer failing fast: don't swallow exceptions."

🧰 Tools

🪛 Ruff (0.15.11)

[error] 134-135: try-except-pass detected, consider logging the exception

(S110)

[warning] 134-134: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/rl_admin/service.py` around lines 118 - 139, The fallback tokenizer loading in _encode_single currently imports transformers inside the function, swallows all exceptions, and silently returns 0 on failure; move the import to module top as a guarded import (e.g., try: from transformers import AutoTokenizer except ImportError: AutoTokenizer = None), then in _encode_single check if AutoTokenizer is not None before attempting to instantiate; narrow the exception handling when creating the tokenizer to specific exceptions (ImportError, OSError, ValueError) and log failures with the module logger (e.g., logging.getLogger(__name__)) rather than using a bare except: pass, and ensure on load failure _tokenizer_for_fallback remains None so callers can detect fallback absence while errors are recorded; reference symbols: _encode_single, _tokenizer_for_fallback, AutoTokenizer, MODEL_NAME.

coderabbitai · 2026-04-23T19:42:34Z

+async def _inject_token_ids(body: dict, response_data: dict) -> dict:
+    """Inject prompt_token_ids and choice.token_ids into the response.
+
+    Uses the Rust frontend's /tokenize for prompt_ids (same tokenizer as inference).
+    Extracts completion_ids from logprobs bytes (1:1 alignment guaranteed).
+    """
+    try:
+        messages = body.get("messages", [])
+        if not messages:
+            return response_data
+
+        # Get prompt token IDs from the Rust frontend (same tokenizer as inference path)
+        prompt_ids = await _tokenize_via_frontend(messages, model=body.get("model", ""))
+        response_data["prompt_token_ids"] = prompt_ids
+
+        # Extract completion token IDs from logprobs
+        for choice in response_data.get("choices", []):
+            lp = choice.get("logprobs")
+            if lp and lp.get("content"):
+                logprobs_content = lp["content"]
+                completion_ids = _extract_completion_ids_from_logprobs(logprobs_content)
+                assert len(completion_ids) == len(logprobs_content), (
+                    f"Token alignment: {len(completion_ids)} ids vs "
+                    f"{len(logprobs_content)} logprobs"
+                )
+                choice["token_ids"] = completion_ids
+    except Exception as e:
+        logger.warning(f"[admin] Failed to inject token_ids: {e}")
+
+    return response_data


⚠️ Potential issue | 🟠 Major

Don't use assert for runtime validation, and don't swallow the resulting failure.

Two problems stacked here:

Line 336 uses assert to validate token alignment. Under python -O the assertion is stripped entirely, so any real mismatch ships silently.

Even without -O, the except Exception on line 341 catches AssertionError and downgrades it to a warning log, after which the response is returned without token_ids on the affected choice — again silently wrong for RL.

Raise a proper exception on mismatch and only catch the narrow tokenization/HTTP errors you can recover from.

🛡️ Proposed fix

- for choice in response_data.get("choices", []): - lp = choice.get("logprobs") - if lp and lp.get("content"): - logprobs_content = lp["content"] - completion_ids = _extract_completion_ids_from_logprobs(logprobs_content) - assert len(completion_ids) == len(logprobs_content), ( - f"Token alignment: {len(completion_ids)} ids vs " - f"{len(logprobs_content)} logprobs" - ) - choice["token_ids"] = completion_ids - except Exception as e: - logger.warning(f"[admin] Failed to inject token_ids: {e}") + for choice in response_data.get("choices", []): + lp = choice.get("logprobs") + if lp and lp.get("content"): + logprobs_content = lp["content"] + completion_ids = _extract_completion_ids_from_logprobs(logprobs_content) + if len(completion_ids) != len(logprobs_content): + raise RuntimeError( + f"Token alignment mismatch: {len(completion_ids)} ids vs " + f"{len(logprobs_content)} logprobs" + ) + choice["token_ids"] = completion_ids + except (httpx.HTTPError, RuntimeError, ValueError, KeyError) as e: + logger.warning("[admin] Failed to inject token_ids: %s", e) + raise

As per coding guidelines: "Flag anti-patterns: ... using assert for runtime validation" and "catch only specific exceptions you can handle, and if you catch broadly (except Exception) you must log and re-raise."

🧰 Tools

🪛 Ruff (0.15.11)

[warning] 341-341: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/rl_admin/service.py` around lines 315 - 344, The _inject_token_ids function currently uses an assert to validate token alignment and catches all Exceptions, which can be stripped under python -O or be swallowed by the broad except; replace the assert in _inject_token_ids (and the validation around _extract_completion_ids_from_logprobs and len comparisons) with an explicit raise (e.g., raise ValueError or RuntimeError with a clear message) when the lengths mismatch, and narrow the try/except to only catch and handle tokenization/HTTP-specific errors coming from _tokenize_via_frontend (e.g., TokenizationError, HTTPError) while allowing alignment errors to propagate (or re-raise after logging) so failures aren’t silently converted to warnings and missing choice["token_ids"] don’t slip through.

coderabbitai · 2026-04-23T19:42:34Z

+def main():
+    import uvicorn
+
+    parser = argparse.ArgumentParser(description="Dynamo RL Admin Service")
+    parser.add_argument("--port", type=int, default=8002, help="Port to listen on")
+    parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind to")
+    parser.add_argument("--model", type=str, default="", help="Model name (for fallback tokenizer)")
+    parser.add_argument("--dynamo-frontend", type=str, default="http://localhost:8000",
+                        help="Dynamo Rust frontend URL")
+    parser.add_argument("--dynamo-system", type=str, default="http://localhost:8081",
+                        help="Dynamo vLLM system port URL")
+    args = parser.parse_args()
+
+    global DYNAMO_FRONTEND_URL, DYNAMO_SYSTEM_URL, MODEL_NAME
+    DYNAMO_FRONTEND_URL = args.dynamo_frontend
+    DYNAMO_SYSTEM_URL = args.dynamo_system
+    if args.model:
+        MODEL_NAME = args.model
+
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s")
+    logger.info(
+        f"Starting RL Admin service on {args.host}:{args.port} "
+        f"(frontend={DYNAMO_FRONTEND_URL}, system={DYNAMO_SYSTEM_URL}, model={MODEL_NAME})"
+    )
+
+    uvicorn.run(app, host=args.host, port=args.port, log_level="info")


⚠️ Potential issue | 🟠 Major

Default host 0.0.0.0 exposes privileged admin endpoints to the network.

This service mounts /update_weights, /pause, /resume, and (downstream) /v1/chat/completions/tokens — none of which have any authentication. Binding to 0.0.0.0 by default means that on any multi-tenant host, any network-reachable client can load checkpoints, halt inference, or swap weights. This matches the Ruff S104 finding.

Change the default to 127.0.0.1 (or localhost) and require operators to opt into external exposure explicitly, ideally behind a reverse proxy that enforces authN/authZ.

🔒 Proposed fix

- parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind to") + parser.add_argument( + "--host", + type=str, + default="127.0.0.1", + help="Host to bind to. Set to 0.0.0.0 only behind an authenticating proxy.", + )

🧰 Tools

🪛 Ruff (0.15.11)

[error] 410-410: Possible binding to all interfaces

(S104)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/rl_admin/service.py` around lines 405 - 430, The service main() currently defaults host to "0.0.0.0", exposing admin endpoints; change the default for parser.add_argument("--host", ...) to "127.0.0.1" (or "localhost") so DYNAMO admin endpoints are bound to loopback by default, and add an explicit opt-in (e.g., a boolean flag --allow-external or require setting an env var) that, when true, allows overriding to external addresses; update related logging and any usage of args.host in uvicorn.run(app, host=...) and keep DYNAMO_FRONTEND_URL, DYNAMO_SYSTEM_URL, and MODEL_NAME handling untouched.

coderabbitai · 2026-04-23T19:42:34Z

+    #[serde(default = "default_true")]
+    pub add_special_tokens: bool,
+    #[serde(default = "default_false")]
+    pub return_token_strs: bool,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+#[serde(deny_unknown_fields)]
+pub struct TokenizeChatRequest {
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub model: Option<String>,
+    pub messages: Vec<dynamo_protocols::types::ChatCompletionRequestMessage>,
+    #[serde(default = "default_true")]
+    pub add_generation_prompt: bool,
+    #[serde(default = "default_false")]
+    pub return_token_strs: bool,
+    #[serde(default = "default_false")]
+    pub continue_final_message: bool,
+    #[serde(default = "default_false")]
+    pub add_special_tokens: bool,


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

vLLM tokenize endpoint add_special_tokens default completion chat

💡 Result:

vLLM's OpenAI-compatible server provides a /v1/tokenize (or /tokenize) endpoint that wraps the Hugging Face tokenizer's encode method. The add_special_tokens parameter is supported with a default value of True, matching the Hugging Face Transformers default and vLLM's internal tokenizer protocol where add_special_tokens: bool = True in encode and call methods. For chat completions (/v1/chat/completions), add_special_tokens defaults to False because the chat template handles special tokens, avoiding double-adding (e.g., BOS token). For regular completions (/v1/completions), it defaults to True. The tokenize endpoint, being a direct tokenizer wrapper, uses True by default, adding special tokens like BOS unless specified otherwise.

Citations:

1: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/

2: https://docs.vllm.ai/en/v0.6.2/serving/openai_compatible_server.html

3: https://docs.vllm.ai/en/v0.10.1/api/vllm/transformers_utils/tokenizer.html

4: https://github.com/vllm-project/vllm/blob/7cc302dd/vllm/tokenizers/protocol.py

5: https://docs.vllm.ai/en/v0.10.1/serving/openai_compatible_server.html

🏁 Script executed:

cd lib/llm/src/protocols/openai && wc -l tokenization.rs && head -50 tokenization.rs

Repository: ai-dynamo/dynamo

Length of output: 1859

Add documentation comment clarifying intentional add_special_tokens asymmetry.

TokenizeCompletionRequest::add_special_tokens defaults to true while TokenizeChatRequest::add_special_tokens defaults to false. This asymmetry is intentional and aligns with vLLM's behavior—chat templates handle special tokens internally (avoiding BOS duplication), whereas completion requests require explicit token addition. Add a doc comment to both fields to prevent future "fixes" that unintentionally align these defaults.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/protocols/openai/tokenization.rs` around lines 25 - 44, Add a short documentation comment above both TokenizeCompletionRequest::add_special_tokens and TokenizeChatRequest::add_special_tokens explaining that the differing defaults are intentional: TokenizeCompletionRequest defaults to true because completion requests need explicit special-token addition, while TokenizeChatRequest defaults to false because vLLM chat templates handle special tokens internally (avoiding duplicate BOS), so future maintainers should not change them to match each other. Ensure the comments appear directly above the fields in the respective structs so they are preserved in the serde definitions.

coderabbitai · 2026-04-23T19:42:34Z

+    pub fn merged_chat_template_kwargs(&self) -> HashMap<String, serde_json::Value> {
+        let mut kwargs = self.chat_template_kwargs.clone().unwrap_or_default();
+        kwargs.insert(
+            "add_generation_prompt".to_string(),
+            serde_json::Value::Bool(self.add_generation_prompt),
+        );
+        kwargs.insert(
+            "continue_final_message".to_string(),
+            serde_json::Value::Bool(self.continue_final_message),
+        );
+        kwargs
+    }


⚠️ Potential issue | 🟡 Minor

merged_chat_template_kwargs silently overrides user-provided keys.

If a client supplies chat_template_kwargs with add_generation_prompt or continue_final_message already set (which is plausible since vLLM accepts these either as top-level fields or nested in chat_template_kwargs), those values are unconditionally overwritten by the struct-level fields here. That can produce surprising results — the user sees their explicit kwarg ignored with no error. Consider either (a) rejecting collisions in validate(), or (b) using entry(...).or_insert(...) so explicit kwargs win.

♻️ Option: reject collisions in `validate`

pub fn validate(&self) -> Result<(), String> { if self.continue_final_message && self.add_generation_prompt { return Err( "Cannot set both `continue_final_message` and `add_generation_prompt` to True." .to_string(), ); } + if let Some(kw) = &self.chat_template_kwargs { + for key in ["add_generation_prompt", "continue_final_message"] { + if kw.contains_key(key) { + return Err(format!( + "`{key}` must be set via the top-level field, not inside `chat_template_kwargs`." + )); + } + } + } Ok(()) }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/protocols/openai/tokenization.rs` around lines 73 - 84, The merged_chat_template_kwargs method currently unconditionally overwrites user-provided keys "add_generation_prompt" and "continue_final_message"; update the implementation so user-supplied values are preserved by using kwargs.entry("add_generation_prompt").or_insert(serde_json::Value::Bool(self.add_generation_prompt)) and similarly for "continue_final_message", or alternatively add a validation in the struct's validate() that rejects collisions if both chat_template_kwargs contains those keys and the top-level fields are set; reference the merged_chat_template_kwargs function (or the validate method if choosing the reject option) to locate where to change behavior.

coderabbitai · 2026-04-23T19:42:34Z

+impl Tokenizer for HuggingFaceTokenizer {
+    fn convert_ids_to_tokens(&self, token_ids: &[TokenIdType]) -> Result<Vec<String>> {
+        Ok(token_ids
+            .iter()
+            .map(|&id| self.tokenizer.id_to_token(id).unwrap_or_default())
+            .collect())
+    }
+}


⚠️ Potential issue | 🟡 Minor

Silently mapping unknown IDs to "" may hide bugs.

id_to_token returns None for any ID outside the vocabulary (e.g., an RL worker returning a stale/mismatched token ID after a weight swap, a LoRA adapter that expanded the vocab, or an off-by-one). unwrap_or_default() papers over all of these as empty strings, which then flow into TokenizeResponse.token_strs indistinguishable from a real empty-string token. Consider either returning an Err on lookup failure, or using a sentinel like format!("<unk:{id}>") so such cases are visible in logs/responses.

🛡️ Proposed change

- Ok(token_ids - .iter() - .map(|&id| self.tokenizer.id_to_token(id).unwrap_or_default()) - .collect()) + token_ids + .iter() + .map(|&id| { + self.tokenizer + .id_to_token(id) + .ok_or_else(|| Error::msg(format!("Unknown token id: {id}"))) + }) + .collect()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/tokenizers/hf.rs` around lines 74 - 81, The current implementation of HuggingFaceTokenizer::convert_ids_to_tokens silently maps unknown IDs to "" via unwrap_or_default(); change convert_ids_to_tokens to explicitly detect when self.tokenizer.id_to_token(id) returns None and return an Err containing a clear message with the offending id (e.g., format!("unknown token id: {}", id)) instead of producing empty strings, so callers of Tokenizer::convert_ids_to_tokens (and ultimately TokenizeResponse.token_strs) can surface the error; locate the logic in the convert_ids_to_tokens method of the HuggingFaceTokenizer implementation and replace the iterator/map+unwrap_or_default with a loop or try-fold that returns Err on any missing id.

coderabbitai · 2026-04-23T19:42:34Z

 impl Encoder for TikTokenTokenizer {
    fn encode(&self, input: &str) -> Result<Encoding> {
-        let token_ids: Vec<u32> = self.bpe.encode_with_special_tokens(input);
-        Ok(Encoding::Sp(token_ids))
+        self.encode_with_special_tokens(input, true)
    }

    fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>> {
        inputs.par_iter().map(|input| self.encode(input)).collect()
    }
+
+    fn encode_with_special_tokens(
+        &self,
+        input: &str,
+        add_special_tokens: bool,
+    ) -> Result<Encoding> {
+        let token_ids: Vec<u32> = if add_special_tokens {
+            self.bpe.encode_with_special_tokens(input)
+        } else {
+            self.bpe.encode_ordinary(input)
+        };
+        Ok(Encoding::Sp(token_ids))
+    }
 }


⚠️ Potential issue | 🟡 Minor

Default add_special_tokens differs across backends.

Encoder::encode() here delegates with add_special_tokens = true, while HuggingFaceTokenizer::encode (hf.rs Line 30) and FastTokenizer::encode (fastokens.rs Line 42) delegate with false. Any code that calls the bare tokenizer.encode(text) through the Tokenizer wrapper will therefore get different special-token behavior depending on which backend is loaded — a subtle footgun for the new /v1/tokenize and TITO paths, which can be configured against either backend.

If the intent is to preserve prior per-backend semantics, please add a comment explaining the asymmetry; otherwise, consider aligning all three backends on the same default and forcing callers to opt in via encode_with_special_tokens.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/tokenizers/tiktoken.rs` around lines 102 - 123, Encoder::encode in TikTokenTokenizer currently calls encode_with_special_tokens(..., true) which creates asymmetric default behavior versus HuggingFaceTokenizer::encode and FastTokenizer::encode (which pass false); update TikTokenTokenizer::encode to match the other backends (pass false) so all Encoder::encode implementations behave the same, or if you intend to preserve the asymmetry, add a clear comment next to TikTokenTokenizer::encode and the encode_with_special_tokens method documenting why TikTokenTokenizer defaults to true and noting the behavioral difference so callers are aware to explicitly opt in to special tokens.

coderabbitai · 2026-04-23T19:42:35Z

+impl Tokenizer for TikTokenTokenizer {
+    fn convert_ids_to_tokens(&self, token_ids: &[TokenIdType]) -> Result<Vec<String>> {
+        Ok(token_ids
+            .iter()
+            .map(|id| {
+                self.decoder_tokens
+                    .get(id)
+                    .or_else(|| self.special_tokens_decoder.get(id))
+                    .map(|bytes| String::from_utf8_lossy(bytes).into_owned())
+                    .unwrap_or_default()
+            })
+            .collect())
+    }
+}


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

vLLM /tokenize endpoint return_token_strs byte fallback format

💡 Result:

The vLLM /tokenize endpoint (at http://localhost:8000/tokenize) supports the "return_token_strs": true parameter to return the decoded strings corresponding to the token IDs in the "tokens" array of the response. When return_token_strs is enabled, vLLM uses the model's tokenizer.decode method on each individual token ID to obtain the token strings. For models using byte-level BPE tokenizers (common in multilingual models like Granite), individual tokens are raw UTF-8 bytes. The tokenizer.decode attempts to interpret these bytes as UTF-8 text. This works correctly for ASCII/Latin characters but fails for non-Latin Unicode characters, resulting in malformed "mojibake" strings (e.g., 'Ðĵ' instead of Cyrillic characters). This is because the tokenizer expects full valid UTF-8 sequences, but single-token bytes may not form complete characters. This is documented as a bug in GitHub issue #27825 for Granite models (vLLM v0.11.0+), referencing the original feature implementation in #18928 (implemented via PR #18941). A fix was proposed in PR #36054 titled "[Bugfix] Fix tokenize endpoint malformed token_strs". There is no explicit "byte fallback format" documented in official sources. The current implementation does not provide raw bytes; it always attempts UTF-8 decoding, leading to garbled output for certain tokens. Workarounds include manual re-encoding client-side or avoiding return_token_strs for affected models. Official docs confirm the /tokenize endpoint as a HuggingFace tokenizer wrapper but lack parameter details: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#tokenizer-api. Endpoint supports both "prompt" (text) and chat "messages", applying chat templates as needed.

Citations:

1: [Bug]: Tokenize endpoint for Granite models returns malformed strings in token_strs for non-Latin characters vllm-project/vllm#27825

2: [Feature]: Return token strings in addition to token ids for /tokenize vllm-project/vllm#18928

3: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/

4: https://discuss.vllm.ai/t/tell-me-about-the-current-status-of-the-tokenize-endpoint-in-vllm/2089

🏁 Script executed:

# Search for test files related to tiktoken tokenizer find . -type f -name "*.rs" | xargs rg -l "TikTokenTokenizer" | head -20

Repository: ai-dynamo/dynamo

Length of output: 220

🏁 Script executed:

# Look for test fixtures or test data files find . -type f \( -name "*test*" -o -name "*fixture*" \) | grep -i token | head -20

Repository: ai-dynamo/dynamo

Length of output: 94

🏁 Script executed:

# Search for references to byte-fallback tokens or 0xE4 in tests rg -i "byte.*fallback|0x[eE]4|rank.*100" --type rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 86

🏁 Script executed:

# Examine the tiktoken.rs file more broadly to understand context wc -l lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 98

🏁 Script executed:

# Check the test file for byte-fallback or problematic token tests cat -n lib/llm/tests/tokenizers.rs | head -100

Repository: ai-dynamo/dynamo

Length of output: 7825

🏁 Script executed:

# Search in test files for byte-fallback, 0xE4, or rank references rg -n "0x[eE]4|byte.*fallback|rank" lib/llm/tests/

Repository: ai-dynamo/dynamo

Length of output: 18207

🏁 Script executed:

# Look at test_token_data.rs to understand test fixture structure cat lib/llm/tests/test_token_data.rs | head -50

Repository: ai-dynamo/dynamo

Length of output: 1219

🏁 Script executed:

# Look at mock-tiktoken directory structure find tests/data/sample-models/mock-tiktoken -type f

Repository: ai-dynamo/dynamo

Length of output: 133

🏁 Script executed:

# Examine tiktoken.rs implementation of decoder initialization sed -n '1,200p' lib/llm/src/tokenizers/tiktoken.rs | head -100

Repository: ai-dynamo/dynamo

Length of output: 4195

🏁 Script executed:

# Look for decoder_tokens and special_tokens_decoder initialization in tiktoken.rs rg -n "decoder_tokens|special_tokens_decoder" lib/llm/src/tokenizers/tiktoken.rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 706

🏁 Script executed:

# Get the exact lines 156-169 from tiktoken.rs sed -n '156,169p' lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 551

🏁 Script executed:

# Search for tests of convert_ids_to_tokens or decode behavior rg -n "convert_ids_to_tokens|decode.*token" lib/llm/tests/tokenizers.rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 1551

🏁 Script executed:

# Look for any tests that verify behavior with unknown token IDs or invalid bytes rg -A 5 -B 5 "unknown|invalid|0xFF|0x" lib/llm/tests/tokenizers.rs

Repository: ai-dynamo/dynamo

Length of output: 42

🏁 Script executed:

# Look for the Decoder trait implementation for TikTokenTokenizer rg -A 30 "impl Decoder for TikTokenTokenizer" lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 1446

🏁 Script executed:

# Search for any comments or documentation about byte-fallback handling rg -B 5 -A 5 "byte|fallback|0xE4|lossy" lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 16351

🏁 Script executed:

# Look for the decode method implementation to compare with convert_ids_to_tokens rg -A 20 "fn decode" lib/llm/src/tokenizers/tiktoken.rs | head -60

Repository: ai-dynamo/dynamo

Length of output: 1162

🏁 Script executed:

# Search for usage of convert_ids_to_tokens in the codebase rg -r "convert_ids_to_tokens" --type rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 456

🏁 Script executed:

# Look for tokenize endpoint implementation rg -n "tokenize|/v1/" lib/llm/src/ --type rs | grep -i "api\|endpoint\|route" | head -20

Repository: ai-dynamo/dynamo

Length of output: 86

🏁 Script executed:

# Search for usage of convert_ids_to_tokens in the codebase rg "convert_ids_to_tokens" lib/llm/src/ | head -30

Repository: ai-dynamo/dynamo

Length of output: 1058

🏁 Script executed:

# Look for tokenize endpoint or API implementation rg -i "tokenize" lib/llm/src/ | grep -v "test" | head -30

Repository: ai-dynamo/dynamo

Length of output: 3210

🏁 Script executed:

# Search in protocols or openai module for tokenize endpoint find lib/llm/src -name "*.rs" | xargs grep -l "tokenize" | head -10

Repository: ai-dynamo/dynamo

Length of output: 391

🏁 Script executed:

# Look at the openai.rs usage of convert_ids_to_tokens rg -B 10 -A 10 "convert_ids_to_tokens" lib/llm/src/http/service/openai.rs | head -80

Repository: ai-dynamo/dynamo

Length of output: 1704

🏁 Script executed:

# Find the tokenize endpoint handler rg -B 5 -A 15 "tokenize" lib/llm/src/http/service/openai.rs | head -100

Repository: ai-dynamo/dynamo

Length of output: 3826

convert_ids_to_tokens corrupts byte-fallback tokens and silently fails on unknown IDs.

Single byte-fallback tokens (e.g., ID 100 → [0xE4] in test fixtures) are not valid UTF-8 on their own, so String::from_utf8_lossy replaces them with U+FFFD. This happens when clients call /v1/tokenize with return_token_strs=true, returning "\u{FFFD}" in place of the actual byte value, making the original tokenization irreversible. Additionally, an unknown ID silently produces "" via unwrap_or_default(), indistinguishable from a real empty token — consider returning an Err instead.

The decode() method already handles this correctly by attempting strict UTF-8 first, then falling back to lossy conversion only for multi-token sequences. Apply a similar pattern to convert_ids_to_tokens.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/tokenizers/tiktoken.rs` around lines 156 - 169, convert_ids_to_tokens currently uses String::from_utf8_lossy on each token and returns empty strings for unknown IDs, which corrupts single-byte tokens and hides missing IDs; modify Tokenizer::convert_ids_to_tokens to mirror the decode() strategy: for each TokenIdType look up bytes via decoder_tokens or special_tokens_decoder, return Err if an ID is missing, then try strict UTF-8 (String::from_utf8) first and only fall back to lossy conversion (from_utf8_lossy) for multi-byte sequences (preserving the original bytes for single-byte tokens instead of letting them become U+FFFD), using the same helper logic/policy used by decode() to decide when to use lossy decoding.

handler_chat_completions_tokens injects a placeholder "(token-in mode)" message into request.inner.messages so Dynamo can perform chat-template selection, but the real prompt token IDs are stored in nvext.token_data. Previously, chat_completions() saved request.inner.messages (containing the placeholder) as rl_saved_messages, then called rl_tokenize_prompt() on that placeholder at post-processing time. This returned the tokenization of the placeholder string rather than the supplied token IDs, producing wrong prompt_token_ids in the response. Fix: capture nvext.token_data before engine.generate() consumes the request, and prefer it over rl_tokenize_prompt() in the RL post-processing block. Standard message-based requests are unaffected (rl_tito_token_ids is None).

All rl_admin functionality has been re-implemented natively in the Dynamo Rust frontend (lib/llm): - Admin routes (/pause, /resume, /update_weights, /health, /ready, /load_lora_adapter, /unload_lora_adapter) → openai.rs /v1/rl/* router - Token injection (prompt_token_ids, choice.token_ids) → handler_chat_completions - Tokenization → /v1/tokenize + /v1/detokenize (cherry from #7699) Prime-RL now points admin_base_url directly at the Dynamo frontend (/v1/rl); no separate Python sidecar is deployed. The module has zero imports across the codebase and is not referenced in any Dockerfile, Helm chart, or K8s manifest.

…zer asymmetry Tier 1 — RL correctness (handlers.py) - Clear loaded_loras[name] immediately after remove_lora() succeeds and BEFORE add_lora(). Previously a successful remove followed by a failed add left a stale entry (id=N-1) in loaded_loras while the engine no longer held N-1 — causing the next rollout to attach a LoRARequest for an unregistered adapter (silent base-model fallback or request failure). Importance ratios π_N/π_old computed from those rollouts are garbage, leading to mismatch_kl spikes and potential reward collapse. - Log rollback remove_lora() failures at ERROR instead of swallowing them with bare `except: pass`. A leaked adapter is not a silent state — it must be visible in logs so on-call can diagnose engine inconsistency. - Upgrade reset_prefix_cache() failure from WARNING to ERROR. A missed cache reset after a LoRA swap means subsequent requests with a shared prefix can reuse KV state computed under the old adapter weights, causing silent logprobs mismatch on those positions. Tier 2 — Silent data-loss (openai.rs) - Replace the bare `if let Ok(...)` on serde_json::to_value() with a match that logs an ERROR on the Err branch. The previous fallthrough left choice.token_ids unpromoted while prompt_token_ids was already set — Prime-RL would see a partially-populated response with no indication of what went wrong. Tier 3 — Tokenization asymmetry (hf.rs, vllm_processor.py) - HuggingFaceTokenizer::encode() and encode_batch() now use add_special_tokens=true, matching TikTokenTokenizer::encode(). Any caller that uses the trait's encode() method gets consistent behaviour regardless of which backend is active. Callers needing no special tokens should call encode_with_special_tokens(_, false). - vllm_processor.py logprobs injection now always populates the `token` and `bytes` fields on every logprob entry, falling back to the numeric token ID string when top_logprobs is absent. Previously entries built without top_logprobs had no `token` key — a KeyError or silent None the moment any consumer reads that field (e.g. a return_token_ids=false run).

Prime-RL's orchestrator injects a cache_salt key into extra_body on every rollout request for KV prefix-cache isolation. Dynamo's generic field validator was rejecting this with a 400 Bad Request. Add a PASSTHROUGH_EXTRA_FIELDS allowlist and filter out known RL-client hints before checking for truly unsupported parameters. This makes Dynamo forward-compatible with prime-rl clients that send cache_salt without requiring a prime-rl config workaround.

github-actions · 2026-04-24T02:41:43Z

🌿 Fern Docs Preview: https://nvidia-preview-13e1cad7-7ac7-4be1-a282-9f8563836467.docs.buildwithfern.com/dynamo/dev

biswapanda · 2026-04-24T03:14:04Z

@CodeRabbit review

coderabbitai · 2026-04-24T03:14:10Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/llm/src/http/service/openai.rs (1)
3674-4082: ⚠️ Potential issue | 🔴 Critical

Test file fails to compile — 8 NvCreateChatCompletionRequest literals missing the new tokens and return_token_ids fields.

The struct now includes required fields tokens: Option<Vec<u32>> and return_token_ids: Option<bool> with #[serde(default)] attributes. Since the struct doesn't derive Default, all 8 test struct literals must be updated at lines 3675, 3702, 3917, 3947, 3976, 4005, 4034, and 4065. Add tokens: None, return_token_ids: None to each. Example:
🐛 Representative fix (apply to all 8 sites)
         let request = NvCreateChatCompletionRequest {
             inner: CreateChatCompletionRequest {
                 model: "test-model".to_string(),
                 messages: vec![],
                 ..Default::default()
             },
             common: Default::default(),
             nvext: None,
             chat_template_args: None,
             media_io_kwargs: None,
+            tokens: None,
+            return_token_ids: None,
             unsupported_fields: Default::default(),
         };
Alternatively, derive Default on NvCreateChatCompletionRequest so that existing ..Default::default() patterns continue working when protocol fields are added.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 3674 - 4082, Tests
constructing NvCreateChatCompletionRequest now fail because the struct gained
required fields tokens: Option<Vec<u32>> and return_token_ids: Option<bool>;
update each test literal (e.g., in
test_validate_chat_completion_required_fields_empty_messages,
test_validate_chat_completion_required_fields_with_messages,
test_bad_base_request_for_chatcompletion and other chat tests) to include
tokens: None, return_token_ids: None in the NvCreateChatCompletionRequest
initializer, or alternatively add a Default impl/derive for
NvCreateChatCompletionRequest so the existing ..Default::default() usages
continue to work.

🧹 Nitpick comments (3)

lib/llm/src/tokenizers.rs (1)
64-72: Default encode_with_special_tokens silently drops the flag — consider making it explicit.

The default impl ignores _add_special_tokens and just calls encode. Today every concrete backend (hf, fastokens, tiktoken) overrides it so this never runs, but if a future Encoder impl forgets to override it, callers will get the wrong token stream with no error. Either delete the default (force every impl to implement it) or make the divergence loud (e.g. debug-log once, or unimplemented!-style default returning an Err). Non-blocking for this PR.
♻️ One way to remove the footgun
     pub trait Encoder: Send + Sync {
         fn encode(&self, input: &str) -> Result<Encoding>;
         fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
-
-        fn encode_with_special_tokens(
-            &self,
-            input: &str,
-            _add_special_tokens: bool,
-        ) -> Result<Encoding> {
-            self.encode(input)
-        }
+
+        /// Each backend MUST implement this; honoring `add_special_tokens`
+        /// is required for consistent behavior between `/v1/tokenize` and the
+        /// request preprocessor.
+        fn encode_with_special_tokens(
+            &self,
+            input: &str,
+            add_special_tokens: bool,
+        ) -> Result<Encoding>;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers.rs` around lines 64 - 72, The default
encode_with_special_tokens currently ignores the _add_special_tokens flag and
just calls encode, which can silently produce incorrect token streams; change
the default implementation of encode_with_special_tokens to return an explicit
Err (or use unimplemented!()) with a clear message so implementors of the
Encoder must override it instead of silently dropping the flag — update the
method body for encode_with_special_tokens (referencing the
encode_with_special_tokens and encode symbols) to return an error describing
that encode_with_special_tokens must be implemented by backend encoders.
lib/llm/src/http/service/openai.rs (2)
1984-1994: This helper duplicates ErrorMessage::bad_request.

ErrorMessage::bad_request is defined at line 211 with identical behaviour. The free bad_request here just forwards to String and reconstructs the same ErrorResponse. Use the existing associated function and drop this duplicate to keep a single source of truth for 400 responses.
♻️ Proposed refactor
-fn bad_request<T: Into<String>>(message: T) -> ErrorResponse {
-    let code = StatusCode::BAD_REQUEST;
-    (
-        code,
-        Json(ErrorMessage {
-            message: message.into(),
-            error_type: map_error_code_to_error_type(code),
-            code: code.as_u16(),
-        }),
-    )
-}
-
 fn resolve_tokenizer_model_name(
     state: &Arc<service_v2::State>,
     requested_model: Option<&str>,
 ) -> Result<String, ErrorResponse> {
     if let Some(model) = requested_model {
         if state.manager().has_model_any(model) {
             return Ok(model.to_string());
         }
         return Err(ErrorMessage::model_not_found());
     }
     let served_models = state.manager().model_display_names();
     if served_models.len() == 1 {
         return Ok(served_models.into_iter().next().unwrap());
     }
-    Err(bad_request(
+    Err(ErrorMessage::bad_request(
         "Model must be specified when more than one model is served.",
     ))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 1984 - 1994, The standalone
function bad_request<T: Into<String>>(...) duplicates ErrorMessage::bad_request;
remove the free function and update any call sites to use the associated
constructor ErrorMessage::bad_request (or ErrorMessage::bad_request(message)) so
there is a single source of truth for 400 responses; ensure the return type
remains ErrorResponse and preserve mapping using map_error_code_to_error_type
where used.
897-958: Factor the RL nvext setup out of the two handlers.

This block and handler_chat_completions_tokens (lines 2307-2354) perform nearly identical work: take tokens → nvext.token_data, push "token_ids"/"completion_token_ids" into nvext.extra_fields, force logprobs = Some(true), and inject the "(token-in mode)" placeholder when messages is empty. The literal placeholder string is also duplicated, which is easy to drift on. Extract a single helper (e.g. fn rl_promote_token_fields(req: &mut NvCreateChatCompletionRequest, want_token_ids: bool) -> Result<(), ErrorResponse>) and call it from both handlers — this removes the risk of the two paths diverging (e.g. one path adds a new extra_field, the other forgets).

One minor concern on the placeholder itself: because validate_chat_completion_required_fields later rejects empty messages (line 1545), the placeholder is load-bearing, yet its content is an English magic string that will leak into downstream chat-template rendering if the template doesn't honour nvext.token_data. Please at least centralise the placeholder as a const next to the helper so it's greppable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 897 - 958, Extract the
duplicated RL nvext setup into a single helper (e.g. fn
rl_promote_token_fields(req: &mut NvCreateChatCompletionRequest, want_token_ids:
bool) -> Result<(), ErrorResponse>) and call it from both the current handler
block and handler_chat_completions_tokens; the helper should: take
request.tokens and set nvext.token_data when present, ensure
request.inner.messages contains the same placeholder when empty, merge
"token_ids" and "completion_token_ids" into nvext.extra_fields when
want_token_ids is true, and set request.inner.logprobs = Some(true) if not
already set; centralize the "(token-in mode)" placeholder as a const next to the
helper and return/propagate an ErrorResponse on failure so callers can handle
validation consistently.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/vllm/handlers.py`:
- Around line 747-768: When a hot-swap fails to remove the old adapter, do not
proceed to add_lora or update self.loaded_loras: the current flow catches
exceptions from self.engine_client.remove_lora(...) and then continues, which
(given deterministic lora_id via lora_name_to_id) can leave the engine bound to
the old weights while the in-memory map points to the new path. Change the
except block around remove_lora in the hot-swap branch so that after logging the
failure you fail fast (raise an exception or return an error) instead of falling
through to the add_lora/assignment; ensure you do not call
self.engine_client.add_lora(...) or set self.loaded_loras[lora_name] when
remove_lora fails (refer to remove_lora, add_lora, self.loaded_loras,
LoRARequest, LoRAInfo).

In `@docs/Dynamo-RL-api-draft.md`:
- Around line 906-926: The doc incorrectly states there are five engine routes
and omits the two LoRA routes; update the table and the registration snippet to
include /engine/load_lora_adapter and /engine/unload_lora_adapter (methods POST)
and show they call the corresponding handlers (e.g., engine route names mapped
via runtime.register_engine_route to handler.load_lora_adapter and
handler.unload_lora_adapter), or explicitly state if those routes are registered
elsewhere; ensure the symbols load_lora_adapter, unload_lora_adapter, and
runtime.register_engine_route appear so readers wiring a backend see the full
seven-route surface.

In `@lib/llm/src/http/service/openai.rs`:
- Around line 3056-3060: The helper all_ok(results: &[serde_json::Value])
currently returns true for an empty slice because Iterator::all yields true on
empty iterators; change all_ok (referenced in fan_out/RlState handlers) to
return false when results.is_empty() and only then check that every result has
status "ok" (e.g., ensure !results.is_empty() && results.iter().all(...)). Also
update any callers that expected the old behavior if necessary, and modify
RlState::from_env so it refuses or fatally logs when the parsed
worker_system_urls list is empty (do not silently start with zero workers).
- Around line 3359-3402: The doc comment block describing promotion of token IDs
is attached to rl_tokenize_prompt but belongs to
rl_promote_token_ids_in_response; move the "Promote token IDs from the Dynamo
`nvext` response object…" paragraph out of the rl_tokenize_prompt docblock and
attach it immediately above the rl_promote_token_ids_in_response function
definition so each function has its correct doc comment (keep the
tokenizer/prompt description on rl_tokenize_prompt and the nvext promotion
paragraph with rl_promote_token_ids_in_response).

---

Outside diff comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 3674-4082: Tests constructing NvCreateChatCompletionRequest now
fail because the struct gained required fields tokens: Option<Vec<u32>> and
return_token_ids: Option<bool>; update each test literal (e.g., in
test_validate_chat_completion_required_fields_empty_messages,
test_validate_chat_completion_required_fields_with_messages,
test_bad_base_request_for_chatcompletion and other chat tests) to include
tokens: None, return_token_ids: None in the NvCreateChatCompletionRequest
initializer, or alternatively add a Default impl/derive for
NvCreateChatCompletionRequest so the existing ..Default::default() usages
continue to work.

---

Nitpick comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 1984-1994: The standalone function bad_request<T:
Into<String>>(...) duplicates ErrorMessage::bad_request; remove the free
function and update any call sites to use the associated constructor
ErrorMessage::bad_request (or ErrorMessage::bad_request(message)) so there is a
single source of truth for 400 responses; ensure the return type remains
ErrorResponse and preserve mapping using map_error_code_to_error_type where
used.
- Around line 897-958: Extract the duplicated RL nvext setup into a single
helper (e.g. fn rl_promote_token_fields(req: &mut NvCreateChatCompletionRequest,
want_token_ids: bool) -> Result<(), ErrorResponse>) and call it from both the
current handler block and handler_chat_completions_tokens; the helper should:
take request.tokens and set nvext.token_data when present, ensure
request.inner.messages contains the same placeholder when empty, merge
"token_ids" and "completion_token_ids" into nvext.extra_fields when
want_token_ids is true, and set request.inner.logprobs = Some(true) if not
already set; centralize the "(token-in mode)" placeholder as a const next to the
helper and return/propagate an ErrorResponse on failure so callers can handle
validation consistently.

In `@lib/llm/src/tokenizers.rs`:
- Around line 64-72: The default encode_with_special_tokens currently ignores
the _add_special_tokens flag and just calls encode, which can silently produce
incorrect token streams; change the default implementation of
encode_with_special_tokens to return an explicit Err (or use unimplemented!())
with a clear message so implementors of the Encoder must override it instead of
silently dropping the flag — update the method body for
encode_with_special_tokens (referencing the encode_with_special_tokens and
encode symbols) to return an error describing that encode_with_special_tokens
must be implemented by backend encoders.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 89572d54-22ae-46aa-8336-7a58ba1c028a

📥 Commits

Reviewing files that changed from the base of the PR and between d837fbd and beaafcb.

📒 Files selected for processing (7)

components/src/dynamo/frontend/vllm_processor.py
components/src/dynamo/vllm/handlers.py
docs/Dynamo-RL-api-draft.md
lib/llm/src/http/service/openai.rs
lib/llm/src/protocols/openai/validate.rs
lib/llm/src/tokenizers.rs
lib/llm/src/tokenizers/hf.rs

🚧 Files skipped from review as they are similar to previous changes (1)

components/src/dynamo/frontend/vllm_processor.py

coderabbitai · 2026-04-24T03:22:11Z

+                if is_hot_swap:
+                    old_id = self.loaded_loras[lora_name].id
+                    try:
+                        await self.engine_client.remove_lora(old_id)
+                        # Invalidate the cache entry immediately after remove succeeds.
+                        # If add_lora below fails, this prevents a stale entry pointing
+                        # at an adapter the engine no longer holds from poisoning future
+                        # rollouts with wrong importance ratios (Tier-1 RL correctness risk).
+                        self.loaded_loras.pop(lora_name, None)
+                    except Exception as e:
+                        logger.warning(
+                            f"[RL] remove_lora({lora_name}, id={old_id}) failed during hot-swap: {e}"
+                        )
+
+                await self.engine_client.add_lora(
+                    LoRARequest(
+                        lora_name=lora_name,
+                        lora_int_id=lora_id,
+                        lora_path=lora_path,
+                    )
+                )
+                self.loaded_loras[lora_name] = LoRAInfo(id=lora_id, path=lora_path)


⚠️ Potential issue | 🔴 Critical

Silent stale-weights bug when remove_lora fails during hot-swap.

lora_id = lora_name_to_id(lora_name) (line 741) is deterministic, so lora_id == old_id on every hot-swap. Combined with the comment on line 744 — "vLLM's add_lora is a no-op when the lora_int_id is already registered" — the current except on line 756-759 is unsafe:

If remove_lora raises, control falls through to add_lora with the same lora_int_id that is still registered, so add_lora is a no-op. The old adapter stays in the engine, but line 768 overwrites self.loaded_loras[lora_name] with the new path. From this point on, _resolve_lora_request hands out a LoRARequest(lora_path=<new_path>) for an engine-id that still points at the old weights, and every subsequent rollout silently runs under the previous step's adapter. For an RL training loop, that's exactly the kind of logprob/importance-ratio contamination the rest of this code path is trying to prevent.

Fail fast on this path instead of continuing: if we cannot remove the old adapter, we cannot safely hot-swap.

🛡️ Proposed fix — bail out when remove_lora fails on hot-swap

if is_hot_swap: old_id = self.loaded_loras[lora_name].id try: await self.engine_client.remove_lora(old_id) # Invalidate the cache entry immediately after remove succeeds. # If add_lora below fails, this prevents a stale entry pointing # at an adapter the engine no longer holds from poisoning future # rollouts with wrong importance ratios (Tier-1 RL correctness risk). self.loaded_loras.pop(lora_name, None) except Exception as e: - logger.warning( - f"[RL] remove_lora({lora_name}, id={old_id}) failed during hot-swap: {e}" - ) + # Cannot proceed: since lora_name_to_id is deterministic the new + # lora_id equals old_id, and vLLM's add_lora is a no-op when the + # int id is already registered. Overwriting loaded_loras with the + # new path would silently leave OLD weights live in the engine — + # exactly the correctness hole hot-swap is meant to close. + logger.error( + f"[RL] remove_lora({lora_name}, id={old_id}) failed during hot-swap; " + f"aborting to avoid running with stale adapter weights: {e}" + ) + return { + "status": "error", + "message": f"remove_lora failed during hot-swap of '{lora_name}': {e}", + "lora_name": lora_name, + }

🧰 Tools

🪛 Ruff (0.15.11)

[warning] 756-756: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/vllm/handlers.py` around lines 747 - 768, When a hot-swap fails to remove the old adapter, do not proceed to add_lora or update self.loaded_loras: the current flow catches exceptions from self.engine_client.remove_lora(...) and then continues, which (given deterministic lora_id via lora_name_to_id) can leave the engine bound to the old weights while the in-memory map points to the new path. Change the except block around remove_lora in the hot-swap branch so that after logging the failure you fail fast (raise an exception or return an error) instead of falling through to the add_lora/assignment; ensure you do not call self.engine_client.add_lora(...) or set self.loaded_loras[lora_name] when remove_lora fails (refer to remove_lora, add_lora, self.loaded_loras, LoRARequest, LoRAInfo).

coderabbitai · 2026-04-24T03:22:11Z

+Five engine route handlers are registered on each vLLM worker's system HTTP port (default `8081` local / `9090` in k8s). These are **internal** routes called by the Rust frontend's `/v1/rl/*` handlers -- not called directly by Prime-RL.
+
+| Route | Method | vLLM API called | Description |
+|-------|--------|-----------------|-------------|
+| `/engine/pause_generation` | POST | `engine_client.pause_generation()` | Drain in-flight requests, keep model loaded in GPU memory |
+| `/engine/resume_generation` | POST | `engine_client.resume_generation()` | Resume accepting inference requests |
+| `/engine/flush_cache` | POST | `engine_client.reset_prefix_cache()` | Invalidate prefix/KV cache (required before weight reload) |
+| `/engine/update_weights_from_path` | POST | `engine_client.collective_rpc("reload_weights", ...)` | Load weights from filesystem (safetensors checkpoint) |
+| `/engine/get_weight_version` | POST | `self._weight_version` | Return current weight version string |
+
+Both decode and prefill worker types register all 5 routes. Route signatures are compatible with SGLang's merged `#6094` routes for backend interoperability.
+
+### Registration (worker_factory.py)
+
+```python
+runtime.register_engine_route("pause_generation", handler.pause_generation)
+runtime.register_engine_route("resume_generation", handler.resume_generation)
+runtime.register_engine_route("flush_cache", handler.flush_cache)
+runtime.register_engine_route("update_weights_from_path", handler.update_weights_from_path)
+runtime.register_engine_route("get_weight_version", handler.get_weight_version)
+```


⚠️ Potential issue | 🟡 Minor

Section 7 undercounts the engine routes — inconsistent with section 4.4 and the PR summary.

Line 906 says "Five engine route handlers" and the register_engine_route snippet (920-926) only lists the 5 weight-lifecycle routes, but:

Line 63 of this same doc states "7 engine route handlers (5 weight-lifecycle + 2 LoRA)".

§4.4 documents /v1/rl/load_lora_adapter and /v1/rl/unload_lora_adapter as first-class fleet-control endpoints.

components/src/dynamo/vllm/handlers.py defines load_lora_adapter / unload_lora_adapter alongside the other 5.

Please extend the table and the registration snippet to include the two LoRA routes (or note explicitly why they're registered elsewhere), otherwise someone wiring up a second backend from this doc will silently miss the LoRA surface.

📝 Suggested additions

-Five engine route handlers are registered on each vLLM worker's system HTTP port (default `8081` local / `9090` in k8s). These are **internal** routes called by the Rust frontend's `/v1/rl/*` handlers -- not called directly by Prime-RL. +Seven engine route handlers are registered on each vLLM worker's system HTTP port (default `8081` local / `9090` in k8s). These are **internal** routes called by the Rust frontend's `/v1/rl/*` handlers -- not called directly by Prime-RL. | Route | Method | vLLM API called | Description | |-------|--------|-----------------|-------------| | `/engine/pause_generation` | POST | `engine_client.pause_generation()` | Drain in-flight requests, keep model loaded in GPU memory | | `/engine/resume_generation` | POST | `engine_client.resume_generation()` | Resume accepting inference requests | | `/engine/flush_cache` | POST | `engine_client.reset_prefix_cache()` | Invalidate prefix/KV cache (required before weight reload) | | `/engine/update_weights_from_path` | POST | `engine_client.collective_rpc("reload_weights", ...)` | Load weights from filesystem (safetensors checkpoint) | | `/engine/get_weight_version` | POST | `self._weight_version` | Return current weight version string | +| `/engine/load_lora_adapter` | POST | `engine_client.add_lora(...)` (+ `remove_lora` on hot-swap, `reset_prefix_cache`) | Load or hot-swap a PEFT LoRA adapter from a filesystem path | +| `/engine/unload_lora_adapter` | POST | `engine_client.remove_lora(...)` | Remove a previously-loaded LoRA adapter by name (idempotent) |

And mirror the additions in the register_engine_route(...) snippet.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/Dynamo-RL-api-draft.md` around lines 906 - 926, The doc incorrectly states there are five engine routes and omits the two LoRA routes; update the table and the registration snippet to include /engine/load_lora_adapter and /engine/unload_lora_adapter (methods POST) and show they call the corresponding handlers (e.g., engine route names mapped via runtime.register_engine_route to handler.load_lora_adapter and handler.unload_lora_adapter), or explicitly state if those routes are registered elsewhere; ensure the symbols load_lora_adapter, unload_lora_adapter, and runtime.register_engine_route appear so readers wiring a backend see the full seven-route surface.

coderabbitai · 2026-04-24T03:22:11Z

+    fn all_ok(results: &[serde_json::Value]) -> bool {
+        results
+            .iter()
+            .all(|r| r.get("status").and_then(|s| s.as_str()) == Some("ok"))
+    }


⚠️ Potential issue | 🔴 Critical

all_ok returns true on an empty worker list — every admin endpoint silently succeeds with zero workers.

Iterator::all on an empty iterator is true, so when worker_system_urls is empty (which happens whenever DYN_RL_WORKER_SYSTEM_URLS is set to an empty string, since split(',').filter(|s| !s.is_empty()) produces no URLs), fan_out returns Vec::new() and RlState::all_ok(&[]) is true. Consequences:

POST /v1/rl/pause → 200 OK "all 0 worker(s) paused" while no worker was contacted.

POST /v1/rl/resume, POST /v1/rl/update_weights, POST /v1/rl/load_lora_adapter, POST /v1/rl/unload_lora_adapter all exhibit the same silent-no-op.

Prime-RL can then race: it thinks generation is drained, pushes new weights, resumes — but workers are serving the old model the whole time, corrupting rollouts without any error surfaced.

rl_ready already guards this with !results.is_empty() (line 3082). The other handlers should too. The fix is best applied at the source.

🐛 Proposed fix

/// Returns true only if all results have `status: "ok"`. fn all_ok(results: &[serde_json::Value]) -> bool { - results + !results.is_empty() + && results .iter() .all(|r| r.get("status").and_then(|s| s.as_str()) == Some("ok")) }

Additionally, RlState::from_env should refuse to start with an empty worker list (warn + exit/return an error), or at least log at error! so this is loud rather than silent:

let worker_system_urls = std::env::var(DYN_RL_WORKER_SYSTEM_URLS_ENV) .unwrap_or_else(|_| "http://localhost:8081".to_string()) .split(',') .map(|s| s.trim().to_string()) .filter(|s| !s.is_empty()) .collect::<Vec<_>>(); + if worker_system_urls.is_empty() { + tracing::error!( + "{DYN_RL_WORKER_SYSTEM_URLS_ENV} resolved to zero worker URLs; \ + all /v1/rl/* admin endpoints will be no-ops" + ); + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/openai.rs` around lines 3056 - 3060, The helper all_ok(results: &[serde_json::Value]) currently returns true for an empty slice because Iterator::all yields true on empty iterators; change all_ok (referenced in fan_out/RlState handlers) to return false when results.is_empty() and only then check that every result has status "ok" (e.g., ensure !results.is_empty() && results.iter().all(...)). Also update any callers that expected the old behavior if necessary, and modify RlState::from_env so it refuses or fatally logs when the parsed worker_system_urls list is empty (do not silently start with zero workers).

coderabbitai · 2026-04-24T03:22:12Z

+/// Promote token IDs from the Dynamo `nvext` response object to the top-level
+/// locations that Prime-RL / verifiers expects:
+///
+///   response.nvext.completion_token_ids  →  response.choices[i].token_ids
+///
+/// Tokenize chat messages using the model's tokenizer and return prompt token IDs.
+/// Used by the RL post-processing path to populate `response.prompt_token_ids`.
+fn rl_tokenize_prompt(
+    state: &Arc<service_v2::State>,
+    model: &str,
+    messages: &[dynamo_protocols::types::ChatCompletionRequestMessage],
+) -> Option<Vec<u32>> {
+    if messages.is_empty() {
+        return None;
+    }
+    let (_, card) = resolve_model_card(state, Some(model)).ok()?;
+    let tokenizer = card.tokenizer().ok()?;
+    let formatter = crate::preprocessor::prompt::PromptFormatter::from_mdc(&card).ok()?;
+    let inner_request = dynamo_protocols::types::CreateChatCompletionRequest {
+        model: model.to_string(),
+        messages: messages.to_vec(),
+        ..Default::default()
+    };
+    let wrapped = crate::protocols::openai::chat_completions::NvCreateChatCompletionRequest {
+        inner: inner_request,
+        common: Default::default(),
+        nvext: None,
+        chat_template_args: None,
+        media_io_kwargs: None,
+        tokens: None,
+        return_token_ids: None,
+        unsupported_fields: Default::default(),
+    };
+    let prompt = match formatter {
+        crate::preprocessor::prompt::PromptFormatter::OAI(f) => f.render(&wrapped),
+    }
+    .ok()?;
+    let encoding = tokenizer.encode_with_special_tokens(&prompt, true).ok()?;
+    Some(encoding.token_ids().to_vec())
+}
+
+/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`
+/// extension structure. Called on non-streaming responses when RL token ID mode
+/// is active.


⚠️ Potential issue | 🟡 Minor

Doc block for rl_promote_token_ids_in_response is attached to the wrong function.

The first paragraph at lines 3359-3363 (Promote token IDs from the Dynamo \nvext` response object …) clearly describes rl_promote_token_ids_in_response(defined at line 3403), notrl_tokenize_prompt. Same failure mode as the stale Not Implemented Errordoc that was flagged abovebad_request` — split it and move the "Promote" paragraph down to the right function.

🧹 Proposed fix

-/// Promote token IDs from the Dynamo `nvext` response object to the top-level -/// locations that Prime-RL / verifiers expects: -/// -/// response.nvext.completion_token_ids → response.choices[i].token_ids -/// /// Tokenize chat messages using the model's tokenizer and return prompt token IDs. /// Used by the RL post-processing path to populate `response.prompt_token_ids`. fn rl_tokenize_prompt( @@ -/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext` -/// extension structure. Called on non-streaming responses when RL token ID mode -/// is active. +/// Promote token IDs from the Dynamo `nvext` response object to the top-level +/// locations that Prime-RL / verifiers expects: +/// +/// response.nvext.completion_token_ids → response.choices[i].token_ids +/// +/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext` +/// extension structure. Called on non-streaming responses when RL token ID mode +/// is active. fn rl_promote_token_ids_in_response(json_val: &mut serde_json::Value) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

/// Promote token IDs from the Dynamo `nvext` response object to the top-level

/// locations that Prime-RL / verifiers expects:

///

/// response.nvext.completion_token_ids → response.choices[i].token_ids

///

/// Tokenize chat messages using the model's tokenizer and return prompt token IDs.

/// Used by the RL post-processing path to populate `response.prompt_token_ids`.

fn rl_tokenize_prompt(

state: &Arc<service_v2::State>,

model: &str,

messages: &[dynamo_protocols::types::ChatCompletionRequestMessage],

) -> Option<Vec<u32>> {

if messages.is_empty() {

return None;

}

let (_, card) = resolve_model_card(state, Some(model)).ok()?;

let tokenizer = card.tokenizer().ok()?;

let formatter = crate::preprocessor::prompt::PromptFormatter::from_mdc(&card).ok()?;

let inner_request = dynamo_protocols::types::CreateChatCompletionRequest {

model: model.to_string(),

messages: messages.to_vec(),

..Default::default()

};

let wrapped = crate::protocols::openai::chat_completions::NvCreateChatCompletionRequest {

inner: inner_request,

common: Default::default(),

nvext: None,

chat_template_args: None,

media_io_kwargs: None,

tokens: None,

return_token_ids: None,

unsupported_fields: Default::default(),

};

let prompt = match formatter {

crate::preprocessor::prompt::PromptFormatter::OAI(f) => f.render(&wrapped),

}

.ok()?;

let encoding = tokenizer.encode_with_special_tokens(&prompt, true).ok()?;

Some(encoding.token_ids().to_vec())

}

/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`

/// extension structure. Called on non-streaming responses when RL token ID mode

/// is active.

/// Tokenize chat messages using the model's tokenizer and return prompt token IDs.

/// Used by the RL post-processing path to populate `response.prompt_token_ids`.

fn rl_tokenize_prompt(

state: &Arc<service_v2::State>,

model: &str,

messages: &[dynamo_protocols::types::ChatCompletionRequestMessage],

) -> Option<Vec<u32>> {

if messages.is_empty() {

return None;

}

let (_, card) = resolve_model_card(state, Some(model)).ok()?;

let tokenizer = card.tokenizer().ok()?;

let formatter = crate::preprocessor::prompt::PromptFormatter::from_mdc(&card).ok()?;

let inner_request = dynamo_protocols::types::CreateChatCompletionRequest {

model: model.to_string(),

messages: messages.to_vec(),

..Default::default()

};

let wrapped = crate::protocols::openai::chat_completions::NvCreateChatCompletionRequest {

inner: inner_request,

common: Default::default(),

nvext: None,

chat_template_args: None,

media_io_kwargs: None,

tokens: None,

return_token_ids: None,

unsupported_fields: Default::default(),

};

let prompt = match formatter {

crate::preprocessor::prompt::PromptFormatter::OAI(f) => f.render(&wrapped),

}

.ok()?;

let encoding = tokenizer.encode_with_special_tokens(&prompt, true).ok()?;

Some(encoding.token_ids().to_vec())

}

/// Promote token IDs from the Dynamo `nvext` response object to the top-level

/// locations that Prime-RL / verifiers expects:

///

/// response.nvext.completion_token_ids → response.choices[i].token_ids

///

/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`

/// extension structure. Called on non-streaming responses when RL token ID mode

/// is active.

fn rl_promote_token_ids_in_response(json_val: &mut serde_json::Value) {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/openai.rs` around lines 3359 - 3402, The doc comment block describing promotion of token IDs is attached to rl_tokenize_prompt but belongs to rl_promote_token_ids_in_response; move the "Promote token IDs from the Dynamo `nvext` response object…" paragraph out of the rl_tokenize_prompt docblock and attach it immediately above the rl_promote_token_ids_in_response function definition so each function has its correct doc comment (keep the tokenizer/prompt description on rl_tokenize_prompt and the nvext promotion paragraph with rl_promote_token_ids_in_response).

hhzhang16 · 2026-04-24T14:29:16Z

+| Pause fleet | `POST /v1/rl/pause` | Drain in-flight requests before weight update |
+| Resume fleet | `POST /v1/rl/resume` | Resume generation after weight update |


vLLM uses pause_generation and resume_generation; it also has 3 modes: "abort", "wait", and "keep". It seems as though your pause is only "wait" -- do you plan to add "abort" or "keep" logic? Could be useful to follow similar logic as vLLM (also note that their default is abort)

The user needs a way to confirm whether in-flight requests are done; /v1/rl/pause itself could be the confirmation point. This could be updated to be a synchronous pause(mode=wait) that returns only when the fleet is quisced, with a response like {"status":"paused","inflight_requests":0,...}. If you need bounded latency, you can return 202 Accepted with an operation ID or expose a GET /v1/rl/state endpoint that reports information like whether or not it's paused, inflight requests, desired weight version, applied weight version, etc. I would strongly prefer one state endpoint over making clients stitch together ready + weight_version + pause status

hhzhang16 · 2026-04-24T14:34:07Z

+| Health | `GET /v1/rl/health` | Lightweight frontend health check |
+| Readiness | `GET /v1/rl/ready` | Deep check: are workers reachable and healthy? |


We already have a broader Dynamo notion of readiness, what makes this different?

hhzhang16 · 2026-04-24T14:35:01Z

+| Update weights | `POST /v1/rl/update_weights` | Atomic flush + reload from checkpoint directory |
+| Load LoRA adapter | `POST /v1/rl/load_lora_adapter` | Hot-load/swap a PEFT-style adapter from filesystem path |
+| Unload LoRA adapter | `POST /v1/rl/unload_lora_adapter` | Remove a previously loaded adapter by name |
+| Weight version | `GET /v1/rl/weight_version` | Query current weight version across workers |


The concept is useful, but the standalone endpoint feels too narrow. I'd fold it into /v1/rl/state or /v1/rl/status rather than keep a dedicated endpoint

hhzhang16 · 2026-04-24T15:06:51Z

+}
+
+impl TokenizeChatRequest {
+    pub fn validate(&self) -> Result<(), String> {


Is this called anywhere?

hhzhang16 · 2026-04-24T15:11:18Z

+        )
+    } else {
+        (
+            StatusCode::OK,


This function returns status OK no matter what -- what if there are failures?

hhzhang16 · 2026-04-24T15:15:23Z

+### 4.2 Token-In / Token-Out (TITO)
+
+```
+POST /v1/chat/completions/tokens
+```
+
+Dedicated endpoint for Prime-RL's pre-tokenized prompt flow (multi-turn RL, turn 2+). The orchestrator sends raw token IDs instead of text messages, bypassing the frontend's tokenizer entirely. This avoids redundant encode/decode round-trips and ensures token-level alignment.


What's the distinction between this new endpoint and passing in tokens into /v1/chat/completions? Why not just use that? Line 213 of this doc

hhzhang16 · 2026-04-24T15:16:40Z

+
+---
+
+#### `POST /v1/rl/load_lora_adapter`


These LoRA endpoints are okay but very Prime-RL-shaped right now. Why not extend the existing /v1/loras model to support local-path sources?

hhzhang16 · 2026-04-24T15:17:40Z

+#### `GET /v1/rl/health`
+
+Lightweight liveness probe. Returns immediately as long as the frontend process is running. Used by prime-rl's `check_health()` on the admin client.
+
+```bash
+curl -s http://localhost:8000/v1/rl/health
+```
+
+```json
+{"status": "ok"}
+```
+
+---
+
+#### `GET /v1/rl/ready`


I mentioned this in another comment, but I'd prefer a single /v1/rl/state or /v1/rl/status endpoint over accumulating health, ready, and weight_version. That would give you one place for readiness, applied version, desired version, pause state, and per-worker detail.

biswapanda · 2026-04-27T17:19:37Z

Agree with the points. I'll address them.

biswapanda and others added 18 commits April 18, 2026 20:31

feat(rl): add vLLM weight lifecycle routes, TITO proxy, and tokenize …

e95c4be

…for Prime-RL integration

fix: engine route URL path (/engine/ not /engine_route/), pause/resum…

38a339d

…e uses engine_client directly, NvExtResponse construction

--wip-- [skip ci]

3b73f24

add draft api doc

d9cd90d

biswapanda requested review from a team as code owners April 23, 2026 19:31

pull-request-size Bot added the size/XXL label Apr 23, 2026

biswapanda marked this pull request as draft April 23, 2026 19:31

github-actions Bot added documentation Improvements or additions to documentation backend::vllm Relates to the vllm backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` container labels Apr 23, 2026

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

biswapanda added 4 commits April 23, 2026 12:57

copy-pr-bot Bot temporarily deployed to GITLAB April 24, 2026 02:39 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 24, 2026 02:40 Inactive

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

hhzhang16 reviewed Apr 24, 2026

View reviewed changes

biswapanda mentioned this pull request May 7, 2026

feat(RL): RL integration layer to support Prime-RL #9131

Closed

biswapanda closed this May 11, 2026

biswapanda mentioned this pull request May 11, 2026

feat(rl): RL integration layer for prime-rl #9382

Closed

		\| Pause fleet \| `POST /v1/rl/pause` \| Drain in-flight requests before weight update \|
		\| Resume fleet \| `POST /v1/rl/resume` \| Resume generation after weight update \|

		\| Health \| `GET /v1/rl/health` \| Lightweight frontend health check \|
		\| Readiness \| `GET /v1/rl/ready` \| Deep check: are workers reachable and healthy? \|

Conversation

biswapanda commented Apr 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

biswapanda commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

biswapanda commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading