Skip to content

[draft] prime-rl integration #8630

Closed
biswapanda wants to merge 22 commits into
mainfrom
bis/parity-tokenize-tcp
Closed

[draft] prime-rl integration #8630
biswapanda wants to merge 22 commits into
mainfrom
bis/parity-tokenize-tcp

Conversation

@biswapanda

@biswapanda biswapanda commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Open in Devin Review

Summary by CodeRabbit

  • New Features

    • Added token ID support: token echo in responses and pre-tokenized prompt injection via /v1/chat/completions/tokens
    • Added /v1/tokenize and /v1/detokenize endpoints for text encoding and decoding
    • Added RL admin endpoints (/v1/rl/*) for generation lifecycle control, cache flushing, weight updates, and LoRA adapter management
    • Extended logprobs with token ID resolution
  • Chores

    • Upgraded vLLM to v0.19.1 with prime-rl plugin integration

biswapanda and others added 18 commits April 18, 2026 20:31
…e uses engine_client directly, NvExtResponse construction
…elds

Three fixes from e2e testing:

1. preprocessor.rs: Use nvext.token_data when provided regardless of
   backend_instance_id. Previously token_data was only used in the
   GAIE Stage 2 flow (with backend_instance_id). RL/TITO needs token
   override with normal routing.

2. handlers.py: Use collective_rpc("reload_weights") instead of
   collective_rpc("update_weights_from_path") -- the latter doesn't
   exist in vLLM 0.18.0. reload_weights calls GPUModelRunner's built-in
   layerwise reload from safetensors.

3. rl_admin/service.py: Strip return_token_ids and tokens fields from
   standard /v1/chat/completions proxy before forwarding to Dynamo
   frontend (which rejects unknown parameters with 400).

E2E verified:
- Standard mode: 10/10 steps, real weight updates, wandb synced
  https://wandb.ai/test232/prime-rl/runs/fe3dntog
- TITO mode: 10/10 steps, real weight updates, wandb synced
  https://wandb.ai/test232/prime-rl/runs/m4gz7gty
Three fixes for baseline parity with OSS vLLM:

1. publisher.py: Guard scheduler_stats is None after weight reload.
   The stats logger fires before scheduler_stats is populated after
   flush_cache + update_weights_from_path, causing AttributeError
   that kills the vLLM engine mid-training.

2. rl_admin/service.py: Return HTTP 502 on engine route failures.
   Previously returned 200 even when vLLM worker was dead, causing
   Prime-RL to silently skip weight updates.

3. rl_admin/service.py: Inject prompt_token_ids and choice.token_ids
   into proxied responses. Prime-RL's verifiers client requires these
   vLLM-specific fields to extract per-token logprobs. Without them,
   completion_logprobs are filled with zeros, triggering IPO mask on
   76-89% of tokens and destroying training signal.

   Uses convert_tokens_to_ids() for special tokens like <think> to
   avoid the re-encode problem where a single special token becomes
   multiple tokens.
The previous convert_tokens_to_ids check used `tid != tok.unk_token_id`
but Qwen3's tokenizer has unk_token_id=None, so `None != None` is False
and every token fell through to the encode() fallback. For byte-level
tokens like "\n" (vocab entry "Ċ"), encode() can produce different token
counts than the logprobs, causing assertion failures in the trainer:

  AssertionError: input_ids: 209, inference_logprobs: 208

Fix: check `isinstance(tid, int)` only (no unk comparison), and in the
byte-level fallback always append exactly 1 token ID (ids[0]) to
maintain strict 1:1 alignment with logprobs entries.
Eliminates Rust/Python tokenizer mismatch by having the rl_admin
pre-tokenize every prompt with the Python HF tokenizer and inject
via nvext.token_data. The Rust frontend bypasses its own tokenizer
and passes the Python tokens directly to vLLM.

Parity results (20-step reverse-text, Qwen3-0.6B-SFT):
- Mean reward: 0.659 (vs baseline 0.666, delta -1.1%)
- Peak reward: 0.765 at step 15
- is_masked: 0.0024 (vs baseline 0.0004)
- Throughput: 10.0 req/s (vs baseline 10.7, 1.07x)
…7699)

Manually cherry-picked the /tokenize and /detokenize endpoint code from
ai-dynamo/dynamo PR #7699 (jthomson04/tokenize-endpoint). The full PR
had merge conflicts due to divergent chat_completions.rs refactoring,
so only the tokenization-specific changes were applied:

- New file: protocols/openai/tokenization.rs (request/response types)
- Tokenizer trait additions: encode_with_special_tokens, convert_ids_to_tokens
- HTTP routes: POST /tokenize, POST /detokenize
- Route wiring in service_v2.rs

These endpoints follow vLLM's /tokenize and /detokenize semantics,
enabling consistent tokenization between the Rust frontend and Python
clients for RL training.
Verified that Rust frontend /tokenize produces byte-identical token IDs
to the Python HF tokenizer across all test cases (special chars, CJK,
whitespace, thinking tokens, multi-turn, empty content). This means:

- /tokenize now proxies directly to Rust frontend (no Python tokenizer)
- /detokenize proxied to Rust frontend (new)
- _pretokenize_request() removed (Rust tokenizer handles it natively)
- _get_tokenizer() removed (Python HF tokenizer no longer needed)
- _inject_token_ids() now uses Rust /tokenize for prompt_ids

Also:
- Added /ready deep health check (frontend + system port connectivity)
- Added httpx connection pool limits + shutdown hook
- Added assertion: len(completion_ids) == len(logprobs)
- Removed /load_lora_adapter stub (dead code)
Phase 1 of rl_admin elimination (Eliminate_rl_admin.md):

1. delta.rs: accumulate delta.token_ids per chunk and emit as
   nvext.completion_token_ids on the final (finish_reason) chunk when
   extra_fields includes "completion_token_ids".

2. openai.rs (handler_chat_completions): promote return_token_ids and
   tokens fields from ignored to first-class on the standard
   /v1/chat/completions endpoint, eliminating the rl_admin stripping hack.
   - tokens -> nvext.token_data (TITO path, same as /v1/chat/completions/tokens)
   - return_token_ids=true -> extra_fields["token_ids","completion_token_ids"]
     + forces logprobs=true

3. openai.rs (tokenization_router): rename /tokenize -> /v1/tokenize and
   /detokenize -> /v1/detokenize to follow /v1/* convention.

4. openai.rs (rl_router): new /v1/rl/* admin fleet-control endpoints,
   gated behind DYN_ENABLE_RL env var (or HttpServiceConfig.enable_rl):
   - GET  /v1/rl/ready           composite worker health check
   - POST /v1/rl/pause           fan-out pause_generation to all workers
   - POST /v1/rl/resume          fan-out resume_generation to all workers
   - POST /v1/rl/update_weights  flush_cache + update_weights_from_path
   - GET  /v1/rl/weight_version  query + consistency-check weight version
   Worker system URLs from DYN_RL_WORKER_SYSTEM_URLS (default localhost:8081).

5. service_v2.rs: add enable_rl to HttpServiceConfig and mount rl_router
   when enabled.

6. rl_admin/service.py: update proxy targets /tokenize -> /v1/tokenize
   and /detokenize -> /v1/detokenize to match new Rust frontend paths.
Prime-RL's check_health() calls GET /health on the admin client.
With admin_base_url='http://dynamo:8000/v1/rl', this arrives as
GET /v1/rl/health. Add a lightweight 200 OK handler so the readiness
wait loop succeeds without needing the deep /v1/rl/ready probe.
…s + choice.token_ids

When DYN_ENABLE_RL=true, the Rust frontend now:
1. Auto-enables return_token_ids promotion (even if client doesn't request it)
2. Tokenizes the prompt and injects prompt_token_ids into the response
3. Copies nvext.completion_token_ids to choice.token_ids for Prime-RL/verifiers

This brings is_masked from 45% (missing token IDs) down to 0.13% (better than
baseline vLLM at 1.2%). Mismatch KL drops from ~1.0 to 0.005.

Validated: 20-step training with peak reward 0.825 (vs 0.798 vLLM baseline).
Prime-RL's LoRA training loop broadcasts PEFT-style adapter weights each
step (adapter_model.safetensors + adapter_config.json) and expects the
inference side to hot-swap them in place. The existing /v1/rl/update_weights
path calls vLLM's reload_weights -> qwen2.load_weights, which treats the
adapter keys as base-model weights and fails with
  KeyError: 'layers.0.mlp.gate_up_proj.lora_A.weight'

These two new RL admin endpoints give Prime-RL a filesystem-native LoRA
swap path that matches prime-rl/src/prime_rl/utils/client.py's
load_lora_adapter() branch:

POST /v1/rl/load_lora_adapter  {"lora_name", "lora_path"}
POST /v1/rl/unload_lora_adapter {"lora_name"}

On the Rust frontend (lib/llm/src/http/service/openai.rs) these mirror
rl_update_weights: validate body, fan out to every worker system port as
/engine/<route>, require all_ok for 200 OK. Two new RouteDoc entries and
route mounts in rl_router().

On the vLLM worker (components/src/dynamo/vllm/handlers.py) the new
load_lora_adapter() handler is distinct from the existing URI-based
load_lora() gRPC path:

  * Reads the adapter directly from the given filesystem path -- no
    LoRAManager / no URI fetch.
  * Hot-swap semantics: calling with a lora_name that is already loaded
    removes the previous adapter (same deterministic int id) and adds
    the new one, then invalidates the prefix cache so stale KV entries
    keyed to the previous adapter weights do not poison rollouts.
  * Publishes a ModelDeploymentCard on first load; Prime-RL's scheduler
    switches its request model field to the LoRA name after load
    (self.model_name = self.lora_name), so the frontend needs an MDC to
    route r16-a32 -> this worker. Skipped on hot-swap since the MDC is
    already in place. unload_lora_adapter unregisters the MDC.

Both handlers are registered as engine routes in worker_factory.py on
both decode and prefill worker code paths (sibling to pause_generation /
update_weights_from_path / etc.), so the Rust fan-out to
/engine/load_lora_adapter resolves identically.

vLLM prerequisites: engine must be started with
  --enable-lora --max-lora-rank R --max-loras N
with R >= the adapter rank and N >= the number of distinct lora_name
values held at once. For Prime-RL's single-adapter training loop
--max-loras 1 is sufficient.

Validated end-to-end with Prime-RL bis/dynamo-integration against
Qwen/Qwen3-0.6B + math-env on GSM8K:
  * 20/20 trainer steps SUCCESS (exit 0)
  * Real gradients throughout (Loss 1e-3 range, Entropy 0.3-0.7,
    Mismatch KL 1e-3, Grad Norm 0.1-0.3)
  * Step 1 shows "LoRA adapter loaded"; steps 2+ show "hot-swapped"
  * Peak trainer memory 29.6 GiB on A6000 (vs 29.8 GiB full FT)
  * Orchestrator rewards 0.0-0.3 across GSM8K problems

Also documented both endpoints in docs/Dynamo-RL-api-draft.md (endpoint
summary table plus detail sections mirroring update_weights style).
0.19.1 fixes a device-manager context bug that broke reload_weights
for LoRA: `with torch.device(load_device)` in gpu_model_runner left
weights on CPU when reloading into a restricted CUDA_VISIBLE_DEVICES.
0.19.1 removes that context manager entirely.

Validated locally with prime-rl Full FT (20 steps, Qwen/Qwen3-0.6B,
math-env / GSM8K) -- reward range 0.1875-0.6250, peak trainer memory
29.8 GiB on A6000.
…atches

After vLLM is installed, pip-install prime-rl from a pinned tag (default:
v0.5.1.dev101) with --no-deps so only the inference subtree + the
vllm.general_plugins entry-point land in the image.

Once registered, vLLM's plugin loader auto-invokes
prime_rl.inference.patches:transformers_v5_compat in every worker process
(including spawned subprocesses), applying the upstream-maintained patch
chain (LoRA adapter load, DP engine pause/resume deadlock, Qwen 3.5 LoRA,
transformers v5 config compat, etc.).

Trade-offs:
- --no-deps keeps the image footprint small (~3 MB of trainer .py left on
  disk but never imported).
- --ignore-requires-python: prime-rl pins ~=3.12.0; Dynamo containers are
  3.12 so this is a no-op there. The flag only helps local dev venvs on
  3.11.
- Override tag at build time via --build-arg PRIME_RL_REF=<tag>.
- Sanity probe fails the build if the entry-point is not registered.

Alternative considered and rejected:
- Vendoring prime-rl's patches directly into Dynamo -> diverges from
  upstream; prime-rl ships frequent patch updates.
- A dynamo-authored vllm.general_plugins shim -> duplicates prime-rl
  maintenance burden; better to re-use their authoritative version.

Validated locally (vllm 0.19.1, prime-rl v0.5.0 tag):
- Plugin entry-point registers
- vLLM injects NCCLWeightUpdateWorker methods into Worker base at init
- init_broadcaster reachable via collective_rpc
- NCCL weight broadcast completes over NET/Socket transport (5/5 steps)
uv pip install does not accept --ignore-requires-python; the flag caused
the image build to fail at the prime-rl install step:

    error: unexpected argument '--ignore-requires-python' found

Removing the flag is safe for container builds: PYTHON_VERSION=3.12 in all
vllm-runtime Dockerfiles matches prime-rl's requires-python = "~=3.12.0".

The flag was only relevant for local 3.11 dev venvs; the comment now points
those users to regular pip instead of uv.
…eResult

Main commit 2cabf44 (feat: Decoder clean-up for incomplete multi-byte
sequences, #8022) changed Decoder::decode() to return Result<DecodeResult>
instead of Result<String>. The three backend tokenizer impls were updated,
but two spots were missed:

- lib/llm/src/tokenizers.rs: the default convert_ids_to_tokens trait impl
  .collect()s an iterator whose items are now DecodeResult, not String.
  Map via String::from (existing `impl From<DecodeResult> for String`)
  before collecting.

- lib/llm/src/http/service/openai.rs: the /v1/detokenize handler passes
  DecodeResult directly into DetokenizeResponse { prompt: String }. Extract
  the inner string via .into() before constructing the response.

Verified with `cargo check -p dynamo-py3 --features kv-indexer --lib`
(the crate maturin builds in the container).
@biswapanda biswapanda requested review from a team as code owners April 23, 2026 19:31
@biswapanda biswapanda marked this pull request as draft April 23, 2026 19:31
@github-actions github-actions Bot added documentation Improvements or additions to documentation backend::vllm Relates to the vllm backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` container labels Apr 23, 2026
@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This pull request introduces Reinforcement Learning (RL) capabilities across the Dynamo platform, adding generation control (pause/resume), weight management, LoRA adapter hot-swapping, and token-in/token-out (TITO) endpoints with expanded tokenization APIs and vLLM plugin integration.

Changes

Cohort / File(s) Summary
RL Handler & Route Registration
components/src/dynamo/vllm/handlers.py, components/src/dynamo/vllm/worker_factory.py
Added 7 async HTTP handler methods (pause_generation, resume_generation, flush_cache, update_weights_from_path, get_weight_version, load_lora_adapter, unload_lora_adapter) in BaseWorkerHandler with distributed state management and LoRA lifecycle tracking; registered new routes in worker factory for generation control, cache/weight operations, and adapter management.
Frontend & Engine Communication
components/src/dynamo/frontend/vllm_processor.py, components/src/dynamo/vllm/publisher.py
Added post-processing to populate logprobs from engine responses and stream completion token IDs; added null guard for scheduler_stats to prevent dereferences during transitional states.
vLLM Plugin Integration
container/deps/vllm/install_vllm.sh
Bumped vLLM version to 0.19.1 and added installation of prime-rl plugin from Git with runtime verification of entry-point availability.
OpenAI HTTP Service & Routing
lib/llm/src/http/service/openai.rs
Introduced tokenization (/v1/tokenize, /v1/detokenize) and TITO (/v1/chat/completions/tokens) routers; added RL admin router (/v1/rl/*) for fleet control (health, pause/resume, weight updates, LoRA operations); extended chat completions handler to promote RL request fields and rewrite responses with token IDs.
Service Configuration & Routing
lib/llm/src/http/service/service_v2.rs
Added enable_rl configuration flag and wired RL routes and tokenization router into system router under /v1/rl/* and system endpoints, gated by flag or environment variable.
Request/Response Data Types
lib/llm/src/protocols/openai/chat_completions.rs, lib/llm/src/protocols/openai/nvext.rs, lib/llm/src/protocols/openai/tokenization.rs
Extended chat completion request with tokens and return_token_ids fields; added prompt_token_ids to response; added completion_token_ids to nvext extension; created new tokenization module with serde-annotated request/response structs for tokenize/detokenize operations.
Response Processing & Aggregation
lib/llm/src/protocols/openai/chat_completions/aggregator.rs, lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs
Updated response aggregation to initialize prompt_token_ids; extended delta generator to accumulate and emit completion_token_ids across streaming chunks; injected completion_token_ids into nvext payloads.
Input Processing & Token Handling
lib/llm/src/preprocessor.rs, lib/llm/src/entrypoint/input/text.rs, lib/llm/src/protocols/openai/responses/mod.rs
Refactored token preparation to always use nvext.token_data when present independent of backend_instance_id; added explicit initialization of tokens and return_token_ids to None in request construction.
Protocol Extensions
lib/llm/src/protocols/openai.rs, lib/llm/src/protocols/openai/validate.rs, lib/llm/src/protocols/anthropic/types.rs, lib/llm/src/audit/stream.rs
Added get_return_tokens_as_token_ids() optional method to OpenAI output options; imported tokenization module; allowed cache_salt as passthrough field in validation; updated Anthropic and audit stream conversions to set token fields to None.
Tokenizer API & Implementations
lib/llm/src/tokenizers.rs, lib/llm/src/tokenizers/fastokens.rs, lib/llm/src/tokenizers/hf.rs, lib/llm/src/tokenizers/tiktoken.rs
Added encode_with_special_tokens and convert_ids_to_tokens methods to tokenizer trait and all implementations (FastTokenizer, HuggingFace, TikToken); updated encoding to respect special-token handling flags and support token-to-string conversion.
Documentation
docs/Dynamo-RL-api-draft.md
New comprehensive API specification document describing RL-oriented features, TITO endpoint, tokenization endpoints, fleet control plane routes, and internal vLLM engine route mappings with limitations and validation tables.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is completely empty—only showing the template structure with placeholder sections and no substantive content in any required fields. Provide a complete description filling in Overview (what problem is solved), Details (changes made), and Where should the reviewer start (key files to review).
Title check ❓ Inconclusive The title '[draft] prime-rl integration' is vague and generic, using non-descriptive terms that don't clearly convey the specific changes in this substantial multi-file changeset. Consider a more descriptive title that highlights the primary change, e.g., 'Add Prime-RL runtime administration and token-id endpoints' or specify the main focus area.
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch bis/parity-tokenize-tcp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

🧹 Nitpick comments (10)
lib/llm/src/preprocessor.rs (1)

585-594: Optional: collapse the two tokenizing branches.

The else if has_backend_instance_id { ... } and final else { ... } branches have identical result expressions — only the tracing::warn! differs. They can be flattened with a guard log, reducing the risk of future drift:

♻️ Suggested simplification
-                            } else if has_backend_instance_id {
-                                tracing::warn!(
-                                    "backend_instance_id provided but no token_data; tokenizing prompt"
-                                );
-                                let encoding = self.encode_with_timing(&prompt, tracker)?;
-                                (encoding.token_ids().to_vec(), false)
-                            } else {
-                                let encoding = self.encode_with_timing(&prompt, tracker)?;
-                                (encoding.token_ids().to_vec(), false)
-                            };
+                            } else {
+                                if has_backend_instance_id {
+                                    tracing::warn!(
+                                        "backend_instance_id provided but no token_data; tokenizing prompt"
+                                    );
+                                }
+                                let encoding = self.encode_with_timing(&prompt, tracker)?;
+                                (encoding.token_ids().to_vec(), false)
+                            };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/preprocessor.rs` around lines 585 - 594, The two branches under
the else/else if both call self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false), differing only by a tracing::warn! when
has_backend_instance_id is true; refactor by removing the duplicate branches: in
the outer else, if has_backend_instance_id { tracing::warn!(...); } then call
let encoding = self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false) — this collapses the branches while
preserving the warning behavior and using the existing encode_with_timing,
prompt, tracker, and has_backend_instance_id symbols.
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)

466-486: Nit: use std::mem::take to release the accumulator vector on emit.

self.accumulated_completion_token_ids.clone() retains the backing allocation on the generator even though no more chunks will follow finish_reason. Swapping in std::mem::take releases it promptly:

♻️ Proposed change
-        let completion_token_ids =
-            if self.options.enable_completion_token_ids && finish_reason.is_some() {
-                Some(self.accumulated_completion_token_ids.clone())
-            } else {
-                None
-            };
+        let completion_token_ids =
+            if self.options.enable_completion_token_ids && finish_reason.is_some() {
+                Some(std::mem::take(&mut self.accumulated_completion_token_ids))
+            } else {
+                None
+            };

Also, consider a unit test in this module covering: opt-in via extra_fields: ["completion_token_ids"], mid-stream chunks carry no completion_token_ids, and the final chunk's nvext.completion_token_ids equals the concatenation of per-chunk token_ids (hard invariant called out in nvext.rs).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/protocols/openai/chat_completions/delta.rs` around lines 466 -
486, The code currently clones self.accumulated_completion_token_ids into
completion_token_ids causing the accumulator's allocation to be retained;
replace the clone with std::mem::take(&mut
self.accumulated_completion_token_ids) when building completion_token_ids (only
when self.options.enable_completion_token_ids && finish_reason.is_some()) so the
generator releases the vector's backing buffer immediately; ensure the rest of
the branch (where completion_token_ids is moved into NvExtResponse) still
compiles and preserves behavior of NvExtResponse::completion_token_ids and
related logic involving token_ids, worker_id_info, timing_info, and
routed_experts; optionally add a unit test that opts in via extra_fields:
["completion_token_ids"] confirming mid-stream chunks lack completion_token_ids
and the final nvext.completion_token_ids equals the concatenation of per-chunk
token_ids.
components/src/dynamo/vllm/worker_factory.py (1)

338-353: Extract duplicate engine-route registration to a helper function.

Both _create_decode_worker (lines 338–353) and _create_prefill_worker (lines 586–601) register the same 7 RL routes (pause_generation, resume_generation, flush_cache, update_weights_from_path, get_weight_version, load_lora_adapter, unload_lora_adapter) with an identical log message. This duplication invites drift—e.g., adding a route to one path and forgetting the other. Since both DecodeWorkerHandler and PrefillWorkerHandler inherit from BaseWorkerHandler and all methods are defined there, consider extracting:

♻️ Proposed refactor
_RL_ROUTES: tuple[str, ...] = (
    "pause_generation",
    "resume_generation",
    "flush_cache",
    "update_weights_from_path",
    "get_weight_version",
    "load_lora_adapter",
    "unload_lora_adapter",
)

def _register_lifecycle_and_rl_routes(runtime: DistributedRuntime, handler) -> None:
    runtime.register_engine_route("sleep", handler.sleep)
    runtime.register_engine_route("wake_up", handler.wake_up)
    runtime.register_engine_route("scale_elastic_ep", handler.scale_elastic_ep)
    for name in _RL_ROUTES:
        runtime.register_engine_route(name, getattr(handler, name))
    logger.info(
        "Registered engine routes: sleep, wake_up, scale_elastic_ep, %s",
        ", ".join(_RL_ROUTES),
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/worker_factory.py` around lines 338 - 353, Extract
the duplicated engine-route registration in _create_decode_worker and
_create_prefill_worker into a helper function (e.g.,
_register_lifecycle_and_rl_routes) that takes the DistributedRuntime and a
handler instance; define a single _RL_ROUTES tuple containing the seven RL route
names and use a loop to call runtime.register_engine_route(name,
getattr(handler, name)) for each, also registering "sleep", "wake_up", and
"scale_elastic_ep" there, and replace the duplicate
runtime.register_engine_route calls and logger.info in both functions with a
single call to the new helper; reference BaseWorkerHandler (handler methods),
runtime.register_engine_route, _create_decode_worker, and _create_prefill_worker
when making the change.
lib/llm/src/tokenizers/fastokens.rs (1)

49-63: Document (and optionally test) the HF fallback for add_special_tokens=true.

fastokens doesn't support special-token-aware encoding, so this path transparently falls back to HuggingFaceTokenizer, which means callers lose the fastokens speedup whenever they request special tokens. That's a reasonable choice, but the struct-level doc (Lines 20–22) still claims "fast BPE encoding via fastokens, decoding via HuggingFace" which is now incomplete. A one-liner here plus an extension of test_fast_matches_hf_encoding with add_special_tokens=true would make the behavior explicit and guard against regressions (e.g., mismatched special-token IDs between the two backends loaded from the same tokenizer.json).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers/fastokens.rs` around lines 49 - 63, Update the
struct-level documentation for the fastokens tokenizer to state that
encode_with_special_tokens falls back to the HuggingFace tokenizer when
add_special_tokens=true (i.e., special-token-aware encoding is performed by
hf_decoder, so callers lose fastokens speedup in that case), and add/extend the
test_fast_matches_hf_encoding test to call encode_with_special_tokens with
add_special_tokens=true and assert the HF fallback produces matching token
ids/special-token ids compared to the fast path where applicable; locate the
behavior in the encode_with_special_tokens method (uses self.hf_decoder and
self.fast_encoder) and ensure the test compares outputs from
hf_decoder.encode_with_special_tokens(...) and fast_encoder.encode(...) wrapped
as Encoding::Sp (and guards for any differences in special-token id handling).
lib/llm/src/tokenizers.rs (1)

65-71: Silent default for encode_with_special_tokens is error-prone.

The default implementation silently ignores _add_special_tokens and delegates to encode. If a new Encoder backend is added later and the author forgets to override this method, callers requesting add_special_tokens = true will quietly get encoding without special tokens — a correctness bug with no compile-time or runtime signal. All current backends override this, so a stricter contract is free today.

♻️ Two safer alternatives

Either make it a required trait method (no default), or have the default fail loudly when the flag is set:

-        fn encode_with_special_tokens(
-            &self,
-            input: &str,
-            _add_special_tokens: bool,
-        ) -> Result<Encoding> {
-            self.encode(input)
-        }
+        fn encode_with_special_tokens(
+            &self,
+            input: &str,
+            add_special_tokens: bool,
+        ) -> Result<Encoding> {
+            if add_special_tokens {
+                return Err(Error::msg(
+                    "encode_with_special_tokens(add_special_tokens=true) not supported by this backend",
+                ));
+            }
+            self.encode(input)
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers.rs` around lines 65 - 71, The trait's default impl of
encode_with_special_tokens silently ignores the _add_special_tokens flag; remove
the default implementation from the trait so encode_with_special_tokens becomes
a required method (i.e., declare fn encode_with_special_tokens(&self, input:
&str, add_special_tokens: bool) -> Result<Encoding> without a body) and then
update each Encoder implementation that currently relies on the default to
implement encode_with_special_tokens explicitly (search for
encode_with_special_tokens and implementations of the Encoder trait to add the
proper handling).
lib/llm/src/http/service/service_v2.rs (1)

546-550: Promote DYN_ENABLE_RL to an env_llm constant.

The rest of this file references environment variable names through the env_llm module (e.g., env_llm::DYN_ENABLE_STREAMING_TOOL_DISPATCH, env_llm::DYN_HTTP_GRACEFUL_SHUTDOWN_TIMEOUT_SECS). The hardcoded "DYN_ENABLE_RL" string here — plus the DYN_RL_WORKER_SYSTEM_URLS mentioned in the doc comment at Lines 249–251 — breaks that convention and makes these env vars harder to discover/grep and prone to typos.

♻️ Suggested change

Add the constant alongside existing ones in dynamo_runtime::config::environment_names::llm and use it here:

-        if config.enable_rl || env_is_truthy("DYN_ENABLE_RL") {
+        if config.enable_rl || env_is_truthy(env_llm::DYN_ENABLE_RL) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/service_v2.rs` around lines 546 - 550, Add a new
constant DYN_ENABLE_RL to the env_llm module
(dynamo_runtime::config::environment_names::llm) and replace the hardcoded
string usages with that constant; specifically change the condition that now
reads env_is_truthy("DYN_ENABLE_RL") to use
env_is_truthy(env_llm::DYN_ENABLE_RL) in the RL admin routes block alongside
config.enable_rl, and update any doc comments that mention the literal
DYN_RL_WORKER_SYSTEM_URLS to reference the env_llm constant name instead so all
environment names are discoverable and consistent.
components/src/dynamo/rl_admin/service.py (2)

151-168: Swallowed except Exception in /ready should log.

The two except Exception: ...False blocks hide the actual failure reason, so an operator watching a not-ready service has no signal as to whether the frontend is down, DNS is broken, or TLS handshakes are failing. Log at debug level so the reason is retrievable without changing the response shape.

♻️ Proposed fix
     try:
         frontend_ok = (await client.get(f"{DYNAMO_FRONTEND_URL}/v1/models")).status_code == 200
-    except Exception:
+    except httpx.HTTPError as e:
+        logger.debug("ready: frontend probe failed: %s", e)
         frontend_ok = False
     try:
         system_ok = (await client.get(f"{DYNAMO_SYSTEM_URL}/health")).status_code == 200
-    except Exception:
+    except httpx.HTTPError as e:
+        logger.debug("ready: system probe failed: %s", e)
         system_ok = False

As per coding guidelines: "catch only specific exceptions you can handle, and if you catch broadly (except Exception) you must log and re-raise."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/rl_admin/service.py` around lines 151 - 168, The
ready() health check currently swallows exceptions when probing
DYNAMO_FRONTEND_URL and DYNAMO_SYSTEM_URL; update the two try/except blocks that
call _get_http_client().get(...) to catch the exception, log the full exception
details at debug (include context like which URL failed and the exception)
before setting frontend_ok/system_ok = False, and do not change the JSONResponse
shape or status codes; use the existing logger (or create one if none) to emit
the debug-level messages so operators can inspect failures without altering
behavior.

54-59: Use lifespan context manager instead of deprecated @app.on_event("shutdown").

The on_event decorator is deprecated in FastAPI (with deprecation warnings present). While still functional, it is slated for eventual removal. Since this is a new file, implement using the lifespan async context manager from the start to avoid a future migration.

Suggested refactor
from contextlib import asynccontextmanager

`@asynccontextmanager`
async def lifespan(app: FastAPI):
    yield
    global _http_client
    if _http_client is not None:
        await _http_client.aclose()
        _http_client = None

app = FastAPI(title="Dynamo RL Admin", lifespan=lifespan)

Add from contextlib import asynccontextmanager at the top of the file with the other imports (line 24–31).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/rl_admin/service.py` around lines 54 - 59, Replace the
deprecated FastAPI shutdown event handler by adding an asynccontextmanager-based
lifespan and passing it into the FastAPI constructor: import asynccontextmanager
from contextlib, define async def lifespan(app: FastAPI) decorated with
`@asynccontextmanager` that yields and then performs the same cleanup of the
global _http_client (close and set to None), and create app =
FastAPI(title="Dynamo RL Admin", lifespan=lifespan) instead of relying on the
_shutdown function; remove the `@app.on_event`("shutdown") _shutdown definition
afterward.
components/src/dynamo/vllm/handlers.py (2)

705-881: Lock-map growth: new RL LoRA handlers don't prune _lora_load_locks like their URI counterparts.

load_lora (lines 1202–1210) and unload_lora (lines 1323–1330) both wrap the critical section in a try/finally that removes the per-LoRA lock from _lora_load_locks once the adapter is no longer tracked. The new load_lora_adapter / unload_lora_adapter skip that cleanup, so over a long RL run every distinct lora_name ever loaded (or ever attempted on a failed load) leaves a permanent entry in the lock dict. It's not a functional bug, but for a training loop that hot-swaps adapters repeatedly it's worth mirroring the existing pattern for consistency.

♻️ Suggested pattern (apply to both new handlers)
             lock = self._get_lora_lock(lora_name)
             async with lock:
-                ...
+                try:
+                    ...
+                finally:
+                    with self._lora_load_locks_guard:
+                        if (
+                            lora_name not in self.loaded_loras
+                            and self._lora_load_locks.get(lora_name) is lock
+                        ):
+                            self._lora_load_locks.pop(lora_name, None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/handlers.py` around lines 705 - 881, The handlers
load_lora_adapter and unload_lora_adapter leak entries in the _lora_load_locks
map because they don't remove the per-lora lock after the critical section;
update both functions to mirror the existing pattern in load_lora/unload_lora by
wrapping the async with lock block in a try/finally and in the finally remove
the lock from self._lora_load_locks (use the same key lora_name) when the
adapter is no longer tracked (i.e., after successful unload or on failure) so
_get_lora_lock and loaded_loras remain consistent and the lock-map doesn't grow
unbounded.

829-881: unload_lora_adapter state ordering is fine, but note the asymmetric error semantics.

del self.loaded_loras[lora_name] on line 853 happens before unregister_model, so if unregister fails the adapter is gone from both the engine and local tracking yet the caller sees status=error. That matches the idempotent contract in the docstring (a retry hits the lora is None no-op branch and returns ok), but it differs from the older unload_lora which rolls the engine state back on unregister failure (lines 1278–1300). Worth a one-line comment noting the intentional divergence so a future maintainer doesn't "fix" it by copying the old rollback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/handlers.py` around lines 829 - 881, The
unload_lora_adapter method currently deletes the local tracking entry (del
self.loaded_loras[lora_name]) before calling unregister_model, which means if
unregister_model fails the adapter is already removed from engine and local
state but the method returns status=error; add a one-line comment immediately
above the del self.loaded_loras[lora_name] (or above the unregister_model call)
explaining that this ordering is intentional and documents the asymmetric error
semantics (idempotent no-op on retry) and that it intentionally differs from the
older unload_lora rollback behavior to avoid re-registering the adapter on
unregister failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/rl_admin/service.py`:
- Around line 92-115: The function _extract_completion_ids_from_logprobs is
reconstructing token IDs from decoded bytes (using _byte_token_cache,
_encode_single, bytes.decode(..., errors="replace")) which is incorrect and can
silently corrupt RL training; replace its use with the authoritative token IDs
produced by the Rust/vLLM path (read choice["token_ids"] or
nvext.completion_token_ids promoted by rl_promote_token_ids_in_response) and
remove or deprecate this reconstruction logic; if you must keep a fallback, do
not use bytes.decode(..., errors="replace") or return 0 — instead populate
_byte_token_cache deterministically, ensure _tokenizer_for_fallback is valid
(propagate tokenizer load errors instead of swallowing), and return a clear
sentinel (e.g., None or raise) so callers can skip or fail fast rather than
training on <unk>.
- Around line 405-430: The service main() currently defaults host to "0.0.0.0",
exposing admin endpoints; change the default for parser.add_argument("--host",
...) to "127.0.0.1" (or "localhost") so DYNAMO admin endpoints are bound to
loopback by default, and add an explicit opt-in (e.g., a boolean flag
--allow-external or require setting an env var) that, when true, allows
overriding to external addresses; update related logging and any usage of
args.host in uvicorn.run(app, host=...) and keep DYNAMO_FRONTEND_URL,
DYNAMO_SYSTEM_URL, and MODEL_NAME handling untouched.
- Around line 315-344: The _inject_token_ids function currently uses an assert
to validate token alignment and catches all Exceptions, which can be stripped
under python -O or be swallowed by the broad except; replace the assert in
_inject_token_ids (and the validation around
_extract_completion_ids_from_logprobs and len comparisons) with an explicit
raise (e.g., raise ValueError or RuntimeError with a clear message) when the
lengths mismatch, and narrow the try/except to only catch and handle
tokenization/HTTP-specific errors coming from _tokenize_via_frontend (e.g.,
TokenizationError, HTTPError) while allowing alignment errors to propagate (or
re-raise after logging) so failures aren’t silently converted to warnings and
missing choice["token_ids"] don’t slip through.
- Around line 118-139: The fallback tokenizer loading in _encode_single
currently imports transformers inside the function, swallows all exceptions, and
silently returns 0 on failure; move the import to module top as a guarded import
(e.g., try: from transformers import AutoTokenizer except ImportError:
AutoTokenizer = None), then in _encode_single check if AutoTokenizer is not None
before attempting to instantiate; narrow the exception handling when creating
the tokenizer to specific exceptions (ImportError, OSError, ValueError) and log
failures with the module logger (e.g., logging.getLogger(__name__)) rather than
using a bare except: pass, and ensure on load failure _tokenizer_for_fallback
remains None so callers can detect fallback absence while errors are recorded;
reference symbols: _encode_single, _tokenizer_for_fallback, AutoTokenizer,
MODEL_NAME.

In `@components/src/dynamo/vllm/handlers.py`:
- Around line 801-804: The rollback's remove_lora call currently swallows all
exceptions (try: await self.engine_client.remove_lora(lora_id) except: pass);
replace this with catching the exception as e and logging it (e.g.,
logger.warning or logger.exception) including lora_id and any engine id/context
so the leak is diagnosable, and do not silently pass; keep the existing
surrounding rollback flow in the method that manipulates self.loaded_loras and
use the existing logger to record the failure instead of suppressing it.
- Around line 738-763: The hot-swap path removes the old adapter from the engine
(engine_client.remove_lora) but leaves the stale LoRAInfo in self.loaded_loras,
so an add_lora failure leaves the worker pointing at an unregistered id; fix by
removing the dict entry as soon as remove_lora succeeds (i.e., delete
self.loaded_loras[lora_name] right after the remove_lora call inside the
is_hot_swap branch) and only populate self.loaded_loras[lora_name] =
LoRAInfo(...) after engine_client.add_lora completes successfully; ensure
add_lora exceptions do not reintroduce the stale entry (handle exceptions around
engine_client.add_lora in the same method that uses _get_lora_lock and
LoRARequest).

In `@container/deps/vllm/install_vllm.sh`:
- Line 295: Replace the non-portable echo invocation that prints a leading
newline and the vLLM plugin message with a portable printf-based call: locate
the echo line containing the literal string "\n=== Installing prime-rl vLLM
plugin (ref=${PRIME_RL_REF}) ===" and replace it so the script emits an actual
newline before and after the message using printf while interpolating
${PRIME_RL_REF}; this avoids shell-dependent echo escape behavior.

In `@docs/Dynamo-RL-api-draft.md`:
- Line 201: Several fenced code blocks are untyped (MD040); update each
triple-backtick fence (``` ...) to include an appropriate language identifier
(e.g., ```json, ```bash, ```toml, ```text) so markdownlint passes and
readability improves; search the document for unannotated fences and replace
each opening ``` with the correct language token based on the block content
(examples: shell commands -> bash, JSON payloads -> json, config snippets ->
toml, plain output -> text).

In `@lib/llm/src/http/service/openai.rs`:
- Around line 206-222: Remove the stale three-line doc comment that describes
"Not Implemented Error" which was accidentally left immediately above the pub fn
bad_request; update the comment block so only the "Bad Request Error" doc
remains directly above bad_request, and ensure the not_implemented_error doc
sits only with the not_implemented_error function (referencing the functions
bad_request and not_implemented_error to locate the misplaced comment).
- Around line 1463-1478: The RL post-processing currently swallows serialization
errors from serde_json::to_value(&response) and returns a half-populated
response; modify the Err branch inside the if let Some(ref messages) =
rl_saved_messages { ... } block so that when serde_json::to_value fails you (a)
log the serialization error (include the error and context like which
model/operation) using the project's logger (e.g., tracing::warn or the existing
logger used elsewhere), and (b) return an internal-server-error response instead
of falling through — this affects the code around rl_tokenize_prompt,
serde_json::to_value(&response), and rl_promote_token_ids_in_response so the
failure is both recorded and surfaced rather than silently dropping promoted
choices[i].token_ids.
- Around line 3427-3451: The rl_router() currently exposes privileged endpoints
(rl_update_weights, rl_load_lora_adapter, rl_unload_lora_adapter, rl_pause,
rl_resume) with no auth; add a bearer/shared-secret check middleware (e.g.,
middleware::from_fn(auth_middleware)) to the router chain before
.with_state(rl_state), update the rl_router() docstring to state this router
must only run on a trusted control-plane/admin listener (not the public
completions port), and implement allow-list validation/canonicalization for path
inputs inside rl_update_weights, rl_load_lora_adapter and rl_unload_lora_adapter
(validate weight_dir and lora_path against an explicit list of allowed root
directories and canonicalize paths before fan-out to workers) so only approved
directories are forwarded.

In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 73-84: The merged_chat_template_kwargs method currently
unconditionally overwrites user-provided keys "add_generation_prompt" and
"continue_final_message"; update the implementation so user-supplied values are
preserved by using
kwargs.entry("add_generation_prompt").or_insert(serde_json::Value::Bool(self.add_generation_prompt))
and similarly for "continue_final_message", or alternatively add a validation in
the struct's validate() that rejects collisions if both chat_template_kwargs
contains those keys and the top-level fields are set; reference the
merged_chat_template_kwargs function (or the validate method if choosing the
reject option) to locate where to change behavior.
- Around line 25-44: Add a short documentation comment above both
TokenizeCompletionRequest::add_special_tokens and
TokenizeChatRequest::add_special_tokens explaining that the differing defaults
are intentional: TokenizeCompletionRequest defaults to true because completion
requests need explicit special-token addition, while TokenizeChatRequest
defaults to false because vLLM chat templates handle special tokens internally
(avoiding duplicate BOS), so future maintainers should not change them to match
each other. Ensure the comments appear directly above the fields in the
respective structs so they are preserved in the serde definitions.

In `@lib/llm/src/tokenizers/hf.rs`:
- Around line 74-81: The current implementation of
HuggingFaceTokenizer::convert_ids_to_tokens silently maps unknown IDs to "" via
unwrap_or_default(); change convert_ids_to_tokens to explicitly detect when
self.tokenizer.id_to_token(id) returns None and return an Err containing a clear
message with the offending id (e.g., format!("unknown token id: {}", id))
instead of producing empty strings, so callers of
Tokenizer::convert_ids_to_tokens (and ultimately TokenizeResponse.token_strs)
can surface the error; locate the logic in the convert_ids_to_tokens method of
the HuggingFaceTokenizer implementation and replace the
iterator/map+unwrap_or_default with a loop or try-fold that returns Err on any
missing id.

In `@lib/llm/src/tokenizers/tiktoken.rs`:
- Around line 102-123: Encoder::encode in TikTokenTokenizer currently calls
encode_with_special_tokens(..., true) which creates asymmetric default behavior
versus HuggingFaceTokenizer::encode and FastTokenizer::encode (which pass
false); update TikTokenTokenizer::encode to match the other backends (pass
false) so all Encoder::encode implementations behave the same, or if you intend
to preserve the asymmetry, add a clear comment next to TikTokenTokenizer::encode
and the encode_with_special_tokens method documenting why TikTokenTokenizer
defaults to true and noting the behavioral difference so callers are aware to
explicitly opt in to special tokens.
- Around line 156-169: convert_ids_to_tokens currently uses
String::from_utf8_lossy on each token and returns empty strings for unknown IDs,
which corrupts single-byte tokens and hides missing IDs; modify
Tokenizer::convert_ids_to_tokens to mirror the decode() strategy: for each
TokenIdType look up bytes via decoder_tokens or special_tokens_decoder, return
Err if an ID is missing, then try strict UTF-8 (String::from_utf8) first and
only fall back to lossy conversion (from_utf8_lossy) for multi-byte sequences
(preserving the original bytes for single-byte tokens instead of letting them
become U+FFFD), using the same helper logic/policy used by decode() to decide
when to use lossy decoding.

---

Nitpick comments:
In `@components/src/dynamo/rl_admin/service.py`:
- Around line 151-168: The ready() health check currently swallows exceptions
when probing DYNAMO_FRONTEND_URL and DYNAMO_SYSTEM_URL; update the two
try/except blocks that call _get_http_client().get(...) to catch the exception,
log the full exception details at debug (include context like which URL failed
and the exception) before setting frontend_ok/system_ok = False, and do not
change the JSONResponse shape or status codes; use the existing logger (or
create one if none) to emit the debug-level messages so operators can inspect
failures without altering behavior.
- Around line 54-59: Replace the deprecated FastAPI shutdown event handler by
adding an asynccontextmanager-based lifespan and passing it into the FastAPI
constructor: import asynccontextmanager from contextlib, define async def
lifespan(app: FastAPI) decorated with `@asynccontextmanager` that yields and then
performs the same cleanup of the global _http_client (close and set to None),
and create app = FastAPI(title="Dynamo RL Admin", lifespan=lifespan) instead of
relying on the _shutdown function; remove the `@app.on_event`("shutdown")
_shutdown definition afterward.

In `@components/src/dynamo/vllm/handlers.py`:
- Around line 705-881: The handlers load_lora_adapter and unload_lora_adapter
leak entries in the _lora_load_locks map because they don't remove the per-lora
lock after the critical section; update both functions to mirror the existing
pattern in load_lora/unload_lora by wrapping the async with lock block in a
try/finally and in the finally remove the lock from self._lora_load_locks (use
the same key lora_name) when the adapter is no longer tracked (i.e., after
successful unload or on failure) so _get_lora_lock and loaded_loras remain
consistent and the lock-map doesn't grow unbounded.
- Around line 829-881: The unload_lora_adapter method currently deletes the
local tracking entry (del self.loaded_loras[lora_name]) before calling
unregister_model, which means if unregister_model fails the adapter is already
removed from engine and local state but the method returns status=error; add a
one-line comment immediately above the del self.loaded_loras[lora_name] (or
above the unregister_model call) explaining that this ordering is intentional
and documents the asymmetric error semantics (idempotent no-op on retry) and
that it intentionally differs from the older unload_lora rollback behavior to
avoid re-registering the adapter on unregister failure.

In `@components/src/dynamo/vllm/worker_factory.py`:
- Around line 338-353: Extract the duplicated engine-route registration in
_create_decode_worker and _create_prefill_worker into a helper function (e.g.,
_register_lifecycle_and_rl_routes) that takes the DistributedRuntime and a
handler instance; define a single _RL_ROUTES tuple containing the seven RL route
names and use a loop to call runtime.register_engine_route(name,
getattr(handler, name)) for each, also registering "sleep", "wake_up", and
"scale_elastic_ep" there, and replace the duplicate
runtime.register_engine_route calls and logger.info in both functions with a
single call to the new helper; reference BaseWorkerHandler (handler methods),
runtime.register_engine_route, _create_decode_worker, and _create_prefill_worker
when making the change.

In `@lib/llm/src/http/service/service_v2.rs`:
- Around line 546-550: Add a new constant DYN_ENABLE_RL to the env_llm module
(dynamo_runtime::config::environment_names::llm) and replace the hardcoded
string usages with that constant; specifically change the condition that now
reads env_is_truthy("DYN_ENABLE_RL") to use
env_is_truthy(env_llm::DYN_ENABLE_RL) in the RL admin routes block alongside
config.enable_rl, and update any doc comments that mention the literal
DYN_RL_WORKER_SYSTEM_URLS to reference the env_llm constant name instead so all
environment names are discoverable and consistent.

In `@lib/llm/src/preprocessor.rs`:
- Around line 585-594: The two branches under the else/else if both call
self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false), differing only by a tracing::warn! when
has_backend_instance_id is true; refactor by removing the duplicate branches: in
the outer else, if has_backend_instance_id { tracing::warn!(...); } then call
let encoding = self.encode_with_timing(&prompt, tracker)? and return
(encoding.token_ids().to_vec(), false) — this collapses the branches while
preserving the warning behavior and using the existing encode_with_timing,
prompt, tracker, and has_backend_instance_id symbols.

In `@lib/llm/src/protocols/openai/chat_completions/delta.rs`:
- Around line 466-486: The code currently clones
self.accumulated_completion_token_ids into completion_token_ids causing the
accumulator's allocation to be retained; replace the clone with
std::mem::take(&mut self.accumulated_completion_token_ids) when building
completion_token_ids (only when self.options.enable_completion_token_ids &&
finish_reason.is_some()) so the generator releases the vector's backing buffer
immediately; ensure the rest of the branch (where completion_token_ids is moved
into NvExtResponse) still compiles and preserves behavior of
NvExtResponse::completion_token_ids and related logic involving token_ids,
worker_id_info, timing_info, and routed_experts; optionally add a unit test that
opts in via extra_fields: ["completion_token_ids"] confirming mid-stream chunks
lack completion_token_ids and the final nvext.completion_token_ids equals the
concatenation of per-chunk token_ids.

In `@lib/llm/src/tokenizers.rs`:
- Around line 65-71: The trait's default impl of encode_with_special_tokens
silently ignores the _add_special_tokens flag; remove the default implementation
from the trait so encode_with_special_tokens becomes a required method (i.e.,
declare fn encode_with_special_tokens(&self, input: &str, add_special_tokens:
bool) -> Result<Encoding> without a body) and then update each Encoder
implementation that currently relies on the default to implement
encode_with_special_tokens explicitly (search for encode_with_special_tokens and
implementations of the Encoder trait to add the proper handling).

In `@lib/llm/src/tokenizers/fastokens.rs`:
- Around line 49-63: Update the struct-level documentation for the fastokens
tokenizer to state that encode_with_special_tokens falls back to the HuggingFace
tokenizer when add_special_tokens=true (i.e., special-token-aware encoding is
performed by hf_decoder, so callers lose fastokens speedup in that case), and
add/extend the test_fast_matches_hf_encoding test to call
encode_with_special_tokens with add_special_tokens=true and assert the HF
fallback produces matching token ids/special-token ids compared to the fast path
where applicable; locate the behavior in the encode_with_special_tokens method
(uses self.hf_decoder and self.fast_encoder) and ensure the test compares
outputs from hf_decoder.encode_with_special_tokens(...) and
fast_encoder.encode(...) wrapped as Encoding::Sp (and guards for any differences
in special-token id handling).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa7fafdb-7029-4614-bc70-dad1cad7d5da

📥 Commits

Reviewing files that changed from the base of the PR and between 6076441 and d837fbd.

📒 Files selected for processing (26)
  • components/src/dynamo/frontend/vllm_processor.py
  • components/src/dynamo/rl_admin/__init__.py
  • components/src/dynamo/rl_admin/service.py
  • components/src/dynamo/vllm/handlers.py
  • components/src/dynamo/vllm/publisher.py
  • components/src/dynamo/vllm/worker_factory.py
  • container/deps/vllm/install_vllm.sh
  • docs/Dynamo-RL-api-draft.md
  • lib/llm/src/audit/stream.rs
  • lib/llm/src/entrypoint/input/text.rs
  • lib/llm/src/http/service/openai.rs
  • lib/llm/src/http/service/service_v2.rs
  • lib/llm/src/preprocessor.rs
  • lib/llm/src/protocols/anthropic/types.rs
  • lib/llm/src/protocols/openai.rs
  • lib/llm/src/protocols/openai/chat_completions.rs
  • lib/llm/src/protocols/openai/chat_completions/aggregator.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
  • lib/llm/src/protocols/openai/nvext.rs
  • lib/llm/src/protocols/openai/responses/mod.rs
  • lib/llm/src/protocols/openai/tokenization.rs
  • lib/llm/src/tokenizers.rs
  • lib/llm/src/tokenizers/fastokens.rs
  • lib/llm/src/tokenizers/hf.rs
  • lib/llm/src/tokenizers/tiktoken.rs

Comment thread components/src/dynamo/frontend/vllm_processor.py
Comment thread components/src/dynamo/rl_admin/service.py Outdated
Comment on lines +118 to +139
# Cache for byte->token_id mappings (populated lazily)
_byte_token_cache: dict[tuple, int] = {}
_tokenizer_for_fallback = None


def _encode_single(text: str) -> int:
"""Encode a single token's text to its ID using a lightweight tokenizer."""
global _tokenizer_for_fallback
if _tokenizer_for_fallback is None:
try:
from transformers import AutoTokenizer
model = MODEL_NAME or os.getenv("MODEL_NAME", "")
if model:
_tokenizer_for_fallback = AutoTokenizer.from_pretrained(
model, trust_remote_code=True
)
except Exception:
pass
if _tokenizer_for_fallback is not None:
ids = _tokenizer_for_fallback.encode(text, add_special_tokens=False)
return ids[0] if ids else 0
return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Fallback tokenizer load violates multiple Python guidelines.

  • Line 128 imports transformers inside a function. The import is only conditionally needed (fallback path), so the compliant form is a top-level try: from transformers import AutoTokenizer / except ImportError: AutoTokenizer = None, then guard with if AutoTokenizer is not None.
  • Lines 134–135 are try/except Exception: pass, which silently hides any failure to load the tokenizer (wrong model name, missing weights, HF auth failure, permission issues). Downstream, every subsequent call will return 0 for every token ID — an outcome that's indistinguishable from success until RL loss goes nonsensical.
  • The broad except Exception should be narrowed (at least to ImportError, OSError, ValueError) and must log.
♻️ Proposed fix
+try:
+    from transformers import AutoTokenizer  # type: ignore
+except ImportError:
+    AutoTokenizer = None  # type: ignore
+
 # Cache for byte->token_id mappings (populated lazily)
 _byte_token_cache: dict[tuple, int] = {}
 _tokenizer_for_fallback = None


 def _encode_single(text: str) -> int:
     """Encode a single token's text to its ID using a lightweight tokenizer."""
     global _tokenizer_for_fallback
-    if _tokenizer_for_fallback is None:
-        try:
-            from transformers import AutoTokenizer
-            model = MODEL_NAME or os.getenv("MODEL_NAME", "")
-            if model:
-                _tokenizer_for_fallback = AutoTokenizer.from_pretrained(
-                    model, trust_remote_code=True
-                )
-        except Exception:
-            pass
+    if _tokenizer_for_fallback is None and AutoTokenizer is not None:
+        model = MODEL_NAME or os.getenv("MODEL_NAME", "")
+        if model:
+            try:
+                _tokenizer_for_fallback = AutoTokenizer.from_pretrained(
+                    model, trust_remote_code=True
+                )
+            except (OSError, ValueError) as e:
+                logger.warning("Fallback tokenizer load failed for %s: %s", model, e)
     if _tokenizer_for_fallback is not None:

As per coding guidelines: "Keep all import statements at module top; do not import inside functions/methods/classes." and "Prefer failing fast: don't swallow exceptions."

🧰 Tools
🪛 Ruff (0.15.11)

[error] 134-135: try-except-pass detected, consider logging the exception

(S110)


[warning] 134-134: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/rl_admin/service.py` around lines 118 - 139, The
fallback tokenizer loading in _encode_single currently imports transformers
inside the function, swallows all exceptions, and silently returns 0 on failure;
move the import to module top as a guarded import (e.g., try: from transformers
import AutoTokenizer except ImportError: AutoTokenizer = None), then in
_encode_single check if AutoTokenizer is not None before attempting to
instantiate; narrow the exception handling when creating the tokenizer to
specific exceptions (ImportError, OSError, ValueError) and log failures with the
module logger (e.g., logging.getLogger(__name__)) rather than using a bare
except: pass, and ensure on load failure _tokenizer_for_fallback remains None so
callers can detect fallback absence while errors are recorded; reference
symbols: _encode_single, _tokenizer_for_fallback, AutoTokenizer, MODEL_NAME.

Comment on lines +315 to +344
async def _inject_token_ids(body: dict, response_data: dict) -> dict:
"""Inject prompt_token_ids and choice.token_ids into the response.

Uses the Rust frontend's /tokenize for prompt_ids (same tokenizer as inference).
Extracts completion_ids from logprobs bytes (1:1 alignment guaranteed).
"""
try:
messages = body.get("messages", [])
if not messages:
return response_data

# Get prompt token IDs from the Rust frontend (same tokenizer as inference path)
prompt_ids = await _tokenize_via_frontend(messages, model=body.get("model", ""))
response_data["prompt_token_ids"] = prompt_ids

# Extract completion token IDs from logprobs
for choice in response_data.get("choices", []):
lp = choice.get("logprobs")
if lp and lp.get("content"):
logprobs_content = lp["content"]
completion_ids = _extract_completion_ids_from_logprobs(logprobs_content)
assert len(completion_ids) == len(logprobs_content), (
f"Token alignment: {len(completion_ids)} ids vs "
f"{len(logprobs_content)} logprobs"
)
choice["token_ids"] = completion_ids
except Exception as e:
logger.warning(f"[admin] Failed to inject token_ids: {e}")

return response_data

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't use assert for runtime validation, and don't swallow the resulting failure.

Two problems stacked here:

  • Line 336 uses assert to validate token alignment. Under python -O the assertion is stripped entirely, so any real mismatch ships silently.
  • Even without -O, the except Exception on line 341 catches AssertionError and downgrades it to a warning log, after which the response is returned without token_ids on the affected choice — again silently wrong for RL.

Raise a proper exception on mismatch and only catch the narrow tokenization/HTTP errors you can recover from.

🛡️ Proposed fix
-        for choice in response_data.get("choices", []):
-            lp = choice.get("logprobs")
-            if lp and lp.get("content"):
-                logprobs_content = lp["content"]
-                completion_ids = _extract_completion_ids_from_logprobs(logprobs_content)
-                assert len(completion_ids) == len(logprobs_content), (
-                    f"Token alignment: {len(completion_ids)} ids vs "
-                    f"{len(logprobs_content)} logprobs"
-                )
-                choice["token_ids"] = completion_ids
-    except Exception as e:
-        logger.warning(f"[admin] Failed to inject token_ids: {e}")
+        for choice in response_data.get("choices", []):
+            lp = choice.get("logprobs")
+            if lp and lp.get("content"):
+                logprobs_content = lp["content"]
+                completion_ids = _extract_completion_ids_from_logprobs(logprobs_content)
+                if len(completion_ids) != len(logprobs_content):
+                    raise RuntimeError(
+                        f"Token alignment mismatch: {len(completion_ids)} ids vs "
+                        f"{len(logprobs_content)} logprobs"
+                    )
+                choice["token_ids"] = completion_ids
+    except (httpx.HTTPError, RuntimeError, ValueError, KeyError) as e:
+        logger.warning("[admin] Failed to inject token_ids: %s", e)
+        raise

As per coding guidelines: "Flag anti-patterns: ... using assert for runtime validation" and "catch only specific exceptions you can handle, and if you catch broadly (except Exception) you must log and re-raise."

🧰 Tools
🪛 Ruff (0.15.11)

[warning] 341-341: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/rl_admin/service.py` around lines 315 - 344, The
_inject_token_ids function currently uses an assert to validate token alignment
and catches all Exceptions, which can be stripped under python -O or be
swallowed by the broad except; replace the assert in _inject_token_ids (and the
validation around _extract_completion_ids_from_logprobs and len comparisons)
with an explicit raise (e.g., raise ValueError or RuntimeError with a clear
message) when the lengths mismatch, and narrow the try/except to only catch and
handle tokenization/HTTP-specific errors coming from _tokenize_via_frontend
(e.g., TokenizationError, HTTPError) while allowing alignment errors to
propagate (or re-raise after logging) so failures aren’t silently converted to
warnings and missing choice["token_ids"] don’t slip through.

Comment on lines +405 to +430
def main():
import uvicorn

parser = argparse.ArgumentParser(description="Dynamo RL Admin Service")
parser.add_argument("--port", type=int, default=8002, help="Port to listen on")
parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind to")
parser.add_argument("--model", type=str, default="", help="Model name (for fallback tokenizer)")
parser.add_argument("--dynamo-frontend", type=str, default="http://localhost:8000",
help="Dynamo Rust frontend URL")
parser.add_argument("--dynamo-system", type=str, default="http://localhost:8081",
help="Dynamo vLLM system port URL")
args = parser.parse_args()

global DYNAMO_FRONTEND_URL, DYNAMO_SYSTEM_URL, MODEL_NAME
DYNAMO_FRONTEND_URL = args.dynamo_frontend
DYNAMO_SYSTEM_URL = args.dynamo_system
if args.model:
MODEL_NAME = args.model

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s")
logger.info(
f"Starting RL Admin service on {args.host}:{args.port} "
f"(frontend={DYNAMO_FRONTEND_URL}, system={DYNAMO_SYSTEM_URL}, model={MODEL_NAME})"
)

uvicorn.run(app, host=args.host, port=args.port, log_level="info")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Default host 0.0.0.0 exposes privileged admin endpoints to the network.

This service mounts /update_weights, /pause, /resume, and (downstream) /v1/chat/completions/tokens — none of which have any authentication. Binding to 0.0.0.0 by default means that on any multi-tenant host, any network-reachable client can load checkpoints, halt inference, or swap weights. This matches the Ruff S104 finding.

Change the default to 127.0.0.1 (or localhost) and require operators to opt into external exposure explicitly, ideally behind a reverse proxy that enforces authN/authZ.

🔒 Proposed fix
-    parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind to")
+    parser.add_argument(
+        "--host",
+        type=str,
+        default="127.0.0.1",
+        help="Host to bind to. Set to 0.0.0.0 only behind an authenticating proxy.",
+    )
🧰 Tools
🪛 Ruff (0.15.11)

[error] 410-410: Possible binding to all interfaces

(S104)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/rl_admin/service.py` around lines 405 - 430, The
service main() currently defaults host to "0.0.0.0", exposing admin endpoints;
change the default for parser.add_argument("--host", ...) to "127.0.0.1" (or
"localhost") so DYNAMO admin endpoints are bound to loopback by default, and add
an explicit opt-in (e.g., a boolean flag --allow-external or require setting an
env var) that, when true, allows overriding to external addresses; update
related logging and any usage of args.host in uvicorn.run(app, host=...) and
keep DYNAMO_FRONTEND_URL, DYNAMO_SYSTEM_URL, and MODEL_NAME handling untouched.

Comment on lines +25 to +44
#[serde(default = "default_true")]
pub add_special_tokens: bool,
#[serde(default = "default_false")]
pub return_token_strs: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(deny_unknown_fields)]
pub struct TokenizeChatRequest {
#[serde(default, skip_serializing_if = "Option::is_none")]
pub model: Option<String>,
pub messages: Vec<dynamo_protocols::types::ChatCompletionRequestMessage>,
#[serde(default = "default_true")]
pub add_generation_prompt: bool,
#[serde(default = "default_false")]
pub return_token_strs: bool,
#[serde(default = "default_false")]
pub continue_final_message: bool,
#[serde(default = "default_false")]
pub add_special_tokens: bool,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

vLLM tokenize endpoint add_special_tokens default completion chat

💡 Result:

vLLM's OpenAI-compatible server provides a /v1/tokenize (or /tokenize) endpoint that wraps the Hugging Face tokenizer's encode method. The add_special_tokens parameter is supported with a default value of True, matching the Hugging Face Transformers default and vLLM's internal tokenizer protocol where add_special_tokens: bool = True in encode and call methods. For chat completions (/v1/chat/completions), add_special_tokens defaults to False because the chat template handles special tokens, avoiding double-adding (e.g., BOS token). For regular completions (/v1/completions), it defaults to True. The tokenize endpoint, being a direct tokenizer wrapper, uses True by default, adding special tokens like BOS unless specified otherwise.

Citations:


🏁 Script executed:

cd lib/llm/src/protocols/openai && wc -l tokenization.rs && head -50 tokenization.rs

Repository: ai-dynamo/dynamo

Length of output: 1859


Add documentation comment clarifying intentional add_special_tokens asymmetry.

TokenizeCompletionRequest::add_special_tokens defaults to true while TokenizeChatRequest::add_special_tokens defaults to false. This asymmetry is intentional and aligns with vLLM's behavior—chat templates handle special tokens internally (avoiding BOS duplication), whereas completion requests require explicit token addition. Add a doc comment to both fields to prevent future "fixes" that unintentionally align these defaults.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/protocols/openai/tokenization.rs` around lines 25 - 44, Add a
short documentation comment above both
TokenizeCompletionRequest::add_special_tokens and
TokenizeChatRequest::add_special_tokens explaining that the differing defaults
are intentional: TokenizeCompletionRequest defaults to true because completion
requests need explicit special-token addition, while TokenizeChatRequest
defaults to false because vLLM chat templates handle special tokens internally
(avoiding duplicate BOS), so future maintainers should not change them to match
each other. Ensure the comments appear directly above the fields in the
respective structs so they are preserved in the serde definitions.

Comment on lines +73 to +84
pub fn merged_chat_template_kwargs(&self) -> HashMap<String, serde_json::Value> {
let mut kwargs = self.chat_template_kwargs.clone().unwrap_or_default();
kwargs.insert(
"add_generation_prompt".to_string(),
serde_json::Value::Bool(self.add_generation_prompt),
);
kwargs.insert(
"continue_final_message".to_string(),
serde_json::Value::Bool(self.continue_final_message),
);
kwargs
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

merged_chat_template_kwargs silently overrides user-provided keys.

If a client supplies chat_template_kwargs with add_generation_prompt or continue_final_message already set (which is plausible since vLLM accepts these either as top-level fields or nested in chat_template_kwargs), those values are unconditionally overwritten by the struct-level fields here. That can produce surprising results — the user sees their explicit kwarg ignored with no error. Consider either (a) rejecting collisions in validate(), or (b) using entry(...).or_insert(...) so explicit kwargs win.

♻️ Option: reject collisions in `validate`
     pub fn validate(&self) -> Result<(), String> {
         if self.continue_final_message && self.add_generation_prompt {
             return Err(
                 "Cannot set both `continue_final_message` and `add_generation_prompt` to True."
                     .to_string(),
             );
         }
+        if let Some(kw) = &self.chat_template_kwargs {
+            for key in ["add_generation_prompt", "continue_final_message"] {
+                if kw.contains_key(key) {
+                    return Err(format!(
+                        "`{key}` must be set via the top-level field, not inside `chat_template_kwargs`."
+                    ));
+                }
+            }
+        }
         Ok(())
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/protocols/openai/tokenization.rs` around lines 73 - 84, The
merged_chat_template_kwargs method currently unconditionally overwrites
user-provided keys "add_generation_prompt" and "continue_final_message"; update
the implementation so user-supplied values are preserved by using
kwargs.entry("add_generation_prompt").or_insert(serde_json::Value::Bool(self.add_generation_prompt))
and similarly for "continue_final_message", or alternatively add a validation in
the struct's validate() that rejects collisions if both chat_template_kwargs
contains those keys and the top-level fields are set; reference the
merged_chat_template_kwargs function (or the validate method if choosing the
reject option) to locate where to change behavior.

Comment on lines +74 to +81
impl Tokenizer for HuggingFaceTokenizer {
fn convert_ids_to_tokens(&self, token_ids: &[TokenIdType]) -> Result<Vec<String>> {
Ok(token_ids
.iter()
.map(|&id| self.tokenizer.id_to_token(id).unwrap_or_default())
.collect())
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Silently mapping unknown IDs to "" may hide bugs.

id_to_token returns None for any ID outside the vocabulary (e.g., an RL worker returning a stale/mismatched token ID after a weight swap, a LoRA adapter that expanded the vocab, or an off-by-one). unwrap_or_default() papers over all of these as empty strings, which then flow into TokenizeResponse.token_strs indistinguishable from a real empty-string token. Consider either returning an Err on lookup failure, or using a sentinel like format!("<unk:{id}>") so such cases are visible in logs/responses.

🛡️ Proposed change
-        Ok(token_ids
-            .iter()
-            .map(|&id| self.tokenizer.id_to_token(id).unwrap_or_default())
-            .collect())
+        token_ids
+            .iter()
+            .map(|&id| {
+                self.tokenizer
+                    .id_to_token(id)
+                    .ok_or_else(|| Error::msg(format!("Unknown token id: {id}")))
+            })
+            .collect()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers/hf.rs` around lines 74 - 81, The current
implementation of HuggingFaceTokenizer::convert_ids_to_tokens silently maps
unknown IDs to "" via unwrap_or_default(); change convert_ids_to_tokens to
explicitly detect when self.tokenizer.id_to_token(id) returns None and return an
Err containing a clear message with the offending id (e.g., format!("unknown
token id: {}", id)) instead of producing empty strings, so callers of
Tokenizer::convert_ids_to_tokens (and ultimately TokenizeResponse.token_strs)
can surface the error; locate the logic in the convert_ids_to_tokens method of
the HuggingFaceTokenizer implementation and replace the
iterator/map+unwrap_or_default with a loop or try-fold that returns Err on any
missing id.

Comment on lines 102 to 123
impl Encoder for TikTokenTokenizer {
fn encode(&self, input: &str) -> Result<Encoding> {
let token_ids: Vec<u32> = self.bpe.encode_with_special_tokens(input);
Ok(Encoding::Sp(token_ids))
self.encode_with_special_tokens(input, true)
}

fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>> {
inputs.par_iter().map(|input| self.encode(input)).collect()
}

fn encode_with_special_tokens(
&self,
input: &str,
add_special_tokens: bool,
) -> Result<Encoding> {
let token_ids: Vec<u32> = if add_special_tokens {
self.bpe.encode_with_special_tokens(input)
} else {
self.bpe.encode_ordinary(input)
};
Ok(Encoding::Sp(token_ids))
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Default add_special_tokens differs across backends.

Encoder::encode() here delegates with add_special_tokens = true, while HuggingFaceTokenizer::encode (hf.rs Line 30) and FastTokenizer::encode (fastokens.rs Line 42) delegate with false. Any code that calls the bare tokenizer.encode(text) through the Tokenizer wrapper will therefore get different special-token behavior depending on which backend is loaded — a subtle footgun for the new /v1/tokenize and TITO paths, which can be configured against either backend.

If the intent is to preserve prior per-backend semantics, please add a comment explaining the asymmetry; otherwise, consider aligning all three backends on the same default and forcing callers to opt in via encode_with_special_tokens.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers/tiktoken.rs` around lines 102 - 123, Encoder::encode
in TikTokenTokenizer currently calls encode_with_special_tokens(..., true) which
creates asymmetric default behavior versus HuggingFaceTokenizer::encode and
FastTokenizer::encode (which pass false); update TikTokenTokenizer::encode to
match the other backends (pass false) so all Encoder::encode implementations
behave the same, or if you intend to preserve the asymmetry, add a clear comment
next to TikTokenTokenizer::encode and the encode_with_special_tokens method
documenting why TikTokenTokenizer defaults to true and noting the behavioral
difference so callers are aware to explicitly opt in to special tokens.

Comment on lines +156 to +169
impl Tokenizer for TikTokenTokenizer {
fn convert_ids_to_tokens(&self, token_ids: &[TokenIdType]) -> Result<Vec<String>> {
Ok(token_ids
.iter()
.map(|id| {
self.decoder_tokens
.get(id)
.or_else(|| self.special_tokens_decoder.get(id))
.map(|bytes| String::from_utf8_lossy(bytes).into_owned())
.unwrap_or_default()
})
.collect())
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

vLLM /tokenize endpoint return_token_strs byte fallback format

💡 Result:

The vLLM /tokenize endpoint (at http://localhost:8000/tokenize) supports the "return_token_strs": true parameter to return the decoded strings corresponding to the token IDs in the "tokens" array of the response. When return_token_strs is enabled, vLLM uses the model's tokenizer.decode method on each individual token ID to obtain the token strings. For models using byte-level BPE tokenizers (common in multilingual models like Granite), individual tokens are raw UTF-8 bytes. The tokenizer.decode attempts to interpret these bytes as UTF-8 text. This works correctly for ASCII/Latin characters but fails for non-Latin Unicode characters, resulting in malformed "mojibake" strings (e.g., 'Ðĵ' instead of Cyrillic characters). This is because the tokenizer expects full valid UTF-8 sequences, but single-token bytes may not form complete characters. This is documented as a bug in GitHub issue #27825 for Granite models (vLLM v0.11.0+), referencing the original feature implementation in #18928 (implemented via PR #18941). A fix was proposed in PR #36054 titled "[Bugfix] Fix tokenize endpoint malformed token_strs". There is no explicit "byte fallback format" documented in official sources. The current implementation does not provide raw bytes; it always attempts UTF-8 decoding, leading to garbled output for certain tokens. Workarounds include manual re-encoding client-side or avoiding return_token_strs for affected models. Official docs confirm the /tokenize endpoint as a HuggingFace tokenizer wrapper but lack parameter details: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#tokenizer-api. Endpoint supports both "prompt" (text) and chat "messages", applying chat templates as needed.

Citations:


🏁 Script executed:

# Search for test files related to tiktoken tokenizer
find . -type f -name "*.rs" | xargs rg -l "TikTokenTokenizer" | head -20

Repository: ai-dynamo/dynamo

Length of output: 220


🏁 Script executed:

# Look for test fixtures or test data files
find . -type f \( -name "*test*" -o -name "*fixture*" \) | grep -i token | head -20

Repository: ai-dynamo/dynamo

Length of output: 94


🏁 Script executed:

# Search for references to byte-fallback tokens or 0xE4 in tests
rg -i "byte.*fallback|0x[eE]4|rank.*100" --type rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 86


🏁 Script executed:

# Examine the tiktoken.rs file more broadly to understand context
wc -l lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 98


🏁 Script executed:

# Check the test file for byte-fallback or problematic token tests
cat -n lib/llm/tests/tokenizers.rs | head -100

Repository: ai-dynamo/dynamo

Length of output: 7825


🏁 Script executed:

# Search in test files for byte-fallback, 0xE4, or rank references
rg -n "0x[eE]4|byte.*fallback|rank" lib/llm/tests/

Repository: ai-dynamo/dynamo

Length of output: 18207


🏁 Script executed:

# Look at test_token_data.rs to understand test fixture structure
cat lib/llm/tests/test_token_data.rs | head -50

Repository: ai-dynamo/dynamo

Length of output: 1219


🏁 Script executed:

# Look at mock-tiktoken directory structure
find tests/data/sample-models/mock-tiktoken -type f

Repository: ai-dynamo/dynamo

Length of output: 133


🏁 Script executed:

# Examine tiktoken.rs implementation of decoder initialization
sed -n '1,200p' lib/llm/src/tokenizers/tiktoken.rs | head -100

Repository: ai-dynamo/dynamo

Length of output: 4195


🏁 Script executed:

# Look for decoder_tokens and special_tokens_decoder initialization in tiktoken.rs
rg -n "decoder_tokens|special_tokens_decoder" lib/llm/src/tokenizers/tiktoken.rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 706


🏁 Script executed:

# Get the exact lines 156-169 from tiktoken.rs
sed -n '156,169p' lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 551


🏁 Script executed:

# Search for tests of convert_ids_to_tokens or decode behavior
rg -n "convert_ids_to_tokens|decode.*token" lib/llm/tests/tokenizers.rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 1551


🏁 Script executed:

# Look for any tests that verify behavior with unknown token IDs or invalid bytes
rg -A 5 -B 5 "unknown|invalid|0xFF|0x" lib/llm/tests/tokenizers.rs

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

# Look for the Decoder trait implementation for TikTokenTokenizer
rg -A 30 "impl Decoder for TikTokenTokenizer" lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 1446


🏁 Script executed:

# Search for any comments or documentation about byte-fallback handling
rg -B 5 -A 5 "byte|fallback|0xE4|lossy" lib/llm/src/tokenizers/tiktoken.rs

Repository: ai-dynamo/dynamo

Length of output: 16351


🏁 Script executed:

# Look for the decode method implementation to compare with convert_ids_to_tokens
rg -A 20 "fn decode" lib/llm/src/tokenizers/tiktoken.rs | head -60

Repository: ai-dynamo/dynamo

Length of output: 1162


🏁 Script executed:

# Search for usage of convert_ids_to_tokens in the codebase
rg -r "convert_ids_to_tokens" --type rs | head -30

Repository: ai-dynamo/dynamo

Length of output: 456


🏁 Script executed:

# Look for tokenize endpoint implementation
rg -n "tokenize|/v1/" lib/llm/src/ --type rs | grep -i "api\|endpoint\|route" | head -20

Repository: ai-dynamo/dynamo

Length of output: 86


🏁 Script executed:

# Search for usage of convert_ids_to_tokens in the codebase
rg "convert_ids_to_tokens" lib/llm/src/ | head -30

Repository: ai-dynamo/dynamo

Length of output: 1058


🏁 Script executed:

# Look for tokenize endpoint or API implementation
rg -i "tokenize" lib/llm/src/ | grep -v "test" | head -30

Repository: ai-dynamo/dynamo

Length of output: 3210


🏁 Script executed:

# Search in protocols or openai module for tokenize endpoint
find lib/llm/src -name "*.rs" | xargs grep -l "tokenize" | head -10

Repository: ai-dynamo/dynamo

Length of output: 391


🏁 Script executed:

# Look at the openai.rs usage of convert_ids_to_tokens
rg -B 10 -A 10 "convert_ids_to_tokens" lib/llm/src/http/service/openai.rs | head -80

Repository: ai-dynamo/dynamo

Length of output: 1704


🏁 Script executed:

# Find the tokenize endpoint handler
rg -B 5 -A 15 "tokenize" lib/llm/src/http/service/openai.rs | head -100

Repository: ai-dynamo/dynamo

Length of output: 3826


convert_ids_to_tokens corrupts byte-fallback tokens and silently fails on unknown IDs.

Single byte-fallback tokens (e.g., ID 100 → [0xE4] in test fixtures) are not valid UTF-8 on their own, so String::from_utf8_lossy replaces them with U+FFFD. This happens when clients call /v1/tokenize with return_token_strs=true, returning "\u{FFFD}" in place of the actual byte value, making the original tokenization irreversible. Additionally, an unknown ID silently produces "" via unwrap_or_default(), indistinguishable from a real empty token — consider returning an Err instead.

The decode() method already handles this correctly by attempting strict UTF-8 first, then falling back to lossy conversion only for multi-token sequences. Apply a similar pattern to convert_ids_to_tokens.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers/tiktoken.rs` around lines 156 - 169,
convert_ids_to_tokens currently uses String::from_utf8_lossy on each token and
returns empty strings for unknown IDs, which corrupts single-byte tokens and
hides missing IDs; modify Tokenizer::convert_ids_to_tokens to mirror the
decode() strategy: for each TokenIdType look up bytes via decoder_tokens or
special_tokens_decoder, return Err if an ID is missing, then try strict UTF-8
(String::from_utf8) first and only fall back to lossy conversion
(from_utf8_lossy) for multi-byte sequences (preserving the original bytes for
single-byte tokens instead of letting them become U+FFFD), using the same helper
logic/policy used by decode() to decide when to use lossy decoding.

handler_chat_completions_tokens injects a placeholder "(token-in mode)"
message into request.inner.messages so Dynamo can perform chat-template
selection, but the real prompt token IDs are stored in nvext.token_data.

Previously, chat_completions() saved request.inner.messages (containing
the placeholder) as rl_saved_messages, then called rl_tokenize_prompt()
on that placeholder at post-processing time.  This returned the tokenization
of the placeholder string rather than the supplied token IDs, producing
wrong prompt_token_ids in the response.

Fix: capture nvext.token_data before engine.generate() consumes the request,
and prefer it over rl_tokenize_prompt() in the RL post-processing block.
Standard message-based requests are unaffected (rl_tito_token_ids is None).
All rl_admin functionality has been re-implemented natively in the Dynamo
Rust frontend (lib/llm):

  - Admin routes (/pause, /resume, /update_weights, /health, /ready,
    /load_lora_adapter, /unload_lora_adapter) → openai.rs /v1/rl/* router
  - Token injection (prompt_token_ids, choice.token_ids) → handler_chat_completions
  - Tokenization → /v1/tokenize + /v1/detokenize (cherry from #7699)

Prime-RL now points admin_base_url directly at the Dynamo frontend
(/v1/rl); no separate Python sidecar is deployed. The module has zero
imports across the codebase and is not referenced in any Dockerfile,
Helm chart, or K8s manifest.
…zer asymmetry

Tier 1 — RL correctness (handlers.py)

- Clear loaded_loras[name] immediately after remove_lora() succeeds and
  BEFORE add_lora().  Previously a successful remove followed by a failed
  add left a stale entry (id=N-1) in loaded_loras while the engine no longer
  held N-1 — causing the next rollout to attach a LoRARequest for an
  unregistered adapter (silent base-model fallback or request failure).
  Importance ratios π_N/π_old computed from those rollouts are garbage,
  leading to mismatch_kl spikes and potential reward collapse.

- Log rollback remove_lora() failures at ERROR instead of swallowing them
  with bare `except: pass`.  A leaked adapter is not a silent state — it
  must be visible in logs so on-call can diagnose engine inconsistency.

- Upgrade reset_prefix_cache() failure from WARNING to ERROR.  A missed
  cache reset after a LoRA swap means subsequent requests with a shared
  prefix can reuse KV state computed under the old adapter weights, causing
  silent logprobs mismatch on those positions.

Tier 2 — Silent data-loss (openai.rs)

- Replace the bare `if let Ok(...)` on serde_json::to_value() with a match
  that logs an ERROR on the Err branch.  The previous fallthrough left
  choice.token_ids unpromoted while prompt_token_ids was already set —
  Prime-RL would see a partially-populated response with no indication of
  what went wrong.

Tier 3 — Tokenization asymmetry (hf.rs, vllm_processor.py)

- HuggingFaceTokenizer::encode() and encode_batch() now use
  add_special_tokens=true, matching TikTokenTokenizer::encode().
  Any caller that uses the trait's encode() method gets consistent
  behaviour regardless of which backend is active.  Callers needing
  no special tokens should call encode_with_special_tokens(_, false).

- vllm_processor.py logprobs injection now always populates the `token`
  and `bytes` fields on every logprob entry, falling back to the numeric
  token ID string when top_logprobs is absent.  Previously entries built
  without top_logprobs had no `token` key — a KeyError or silent None the
  moment any consumer reads that field (e.g. a return_token_ids=false run).
Prime-RL's orchestrator injects a cache_salt key into extra_body on every
rollout request for KV prefix-cache isolation.  Dynamo's generic field
validator was rejecting this with a 400 Bad Request.

Add a PASSTHROUGH_EXTRA_FIELDS allowlist and filter out known RL-client
hints before checking for truly unsupported parameters.  This makes Dynamo
forward-compatible with prime-rl clients that send cache_salt without
requiring a prime-rl config workaround.
@github-actions

Copy link
Copy Markdown
Contributor

@biswapanda

Copy link
Copy Markdown
Contributor Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/http/service/openai.rs (1)

3674-4082: ⚠️ Potential issue | 🔴 Critical

Test file fails to compile — 8 NvCreateChatCompletionRequest literals missing the new tokens and return_token_ids fields.

The struct now includes required fields tokens: Option<Vec<u32>> and return_token_ids: Option<bool> with #[serde(default)] attributes. Since the struct doesn't derive Default, all 8 test struct literals must be updated at lines 3675, 3702, 3917, 3947, 3976, 4005, 4034, and 4065. Add tokens: None, return_token_ids: None to each. Example:

🐛 Representative fix (apply to all 8 sites)
         let request = NvCreateChatCompletionRequest {
             inner: CreateChatCompletionRequest {
                 model: "test-model".to_string(),
                 messages: vec![],
                 ..Default::default()
             },
             common: Default::default(),
             nvext: None,
             chat_template_args: None,
             media_io_kwargs: None,
+            tokens: None,
+            return_token_ids: None,
             unsupported_fields: Default::default(),
         };

Alternatively, derive Default on NvCreateChatCompletionRequest so that existing ..Default::default() patterns continue working when protocol fields are added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 3674 - 4082, Tests
constructing NvCreateChatCompletionRequest now fail because the struct gained
required fields tokens: Option<Vec<u32>> and return_token_ids: Option<bool>;
update each test literal (e.g., in
test_validate_chat_completion_required_fields_empty_messages,
test_validate_chat_completion_required_fields_with_messages,
test_bad_base_request_for_chatcompletion and other chat tests) to include
tokens: None, return_token_ids: None in the NvCreateChatCompletionRequest
initializer, or alternatively add a Default impl/derive for
NvCreateChatCompletionRequest so the existing ..Default::default() usages
continue to work.
🧹 Nitpick comments (3)
lib/llm/src/tokenizers.rs (1)

64-72: Default encode_with_special_tokens silently drops the flag — consider making it explicit.

The default impl ignores _add_special_tokens and just calls encode. Today every concrete backend (hf, fastokens, tiktoken) overrides it so this never runs, but if a future Encoder impl forgets to override it, callers will get the wrong token stream with no error. Either delete the default (force every impl to implement it) or make the divergence loud (e.g. debug-log once, or unimplemented!-style default returning an Err). Non-blocking for this PR.

♻️ One way to remove the footgun
     pub trait Encoder: Send + Sync {
         fn encode(&self, input: &str) -> Result<Encoding>;
         fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
-
-        fn encode_with_special_tokens(
-            &self,
-            input: &str,
-            _add_special_tokens: bool,
-        ) -> Result<Encoding> {
-            self.encode(input)
-        }
+
+        /// Each backend MUST implement this; honoring `add_special_tokens`
+        /// is required for consistent behavior between `/v1/tokenize` and the
+        /// request preprocessor.
+        fn encode_with_special_tokens(
+            &self,
+            input: &str,
+            add_special_tokens: bool,
+        ) -> Result<Encoding>;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/tokenizers.rs` around lines 64 - 72, The default
encode_with_special_tokens currently ignores the _add_special_tokens flag and
just calls encode, which can silently produce incorrect token streams; change
the default implementation of encode_with_special_tokens to return an explicit
Err (or use unimplemented!()) with a clear message so implementors of the
Encoder must override it instead of silently dropping the flag — update the
method body for encode_with_special_tokens (referencing the
encode_with_special_tokens and encode symbols) to return an error describing
that encode_with_special_tokens must be implemented by backend encoders.
lib/llm/src/http/service/openai.rs (2)

1984-1994: This helper duplicates ErrorMessage::bad_request.

ErrorMessage::bad_request is defined at line 211 with identical behaviour. The free bad_request here just forwards to String and reconstructs the same ErrorResponse. Use the existing associated function and drop this duplicate to keep a single source of truth for 400 responses.

♻️ Proposed refactor
-fn bad_request<T: Into<String>>(message: T) -> ErrorResponse {
-    let code = StatusCode::BAD_REQUEST;
-    (
-        code,
-        Json(ErrorMessage {
-            message: message.into(),
-            error_type: map_error_code_to_error_type(code),
-            code: code.as_u16(),
-        }),
-    )
-}
-
 fn resolve_tokenizer_model_name(
     state: &Arc<service_v2::State>,
     requested_model: Option<&str>,
 ) -> Result<String, ErrorResponse> {
     if let Some(model) = requested_model {
         if state.manager().has_model_any(model) {
             return Ok(model.to_string());
         }
         return Err(ErrorMessage::model_not_found());
     }
     let served_models = state.manager().model_display_names();
     if served_models.len() == 1 {
         return Ok(served_models.into_iter().next().unwrap());
     }
-    Err(bad_request(
+    Err(ErrorMessage::bad_request(
         "Model must be specified when more than one model is served.",
     ))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 1984 - 1994, The standalone
function bad_request<T: Into<String>>(...) duplicates ErrorMessage::bad_request;
remove the free function and update any call sites to use the associated
constructor ErrorMessage::bad_request (or ErrorMessage::bad_request(message)) so
there is a single source of truth for 400 responses; ensure the return type
remains ErrorResponse and preserve mapping using map_error_code_to_error_type
where used.

897-958: Factor the RL nvext setup out of the two handlers.

This block and handler_chat_completions_tokens (lines 2307-2354) perform nearly identical work: take tokensnvext.token_data, push "token_ids"/"completion_token_ids" into nvext.extra_fields, force logprobs = Some(true), and inject the "(token-in mode)" placeholder when messages is empty. The literal placeholder string is also duplicated, which is easy to drift on. Extract a single helper (e.g. fn rl_promote_token_fields(req: &mut NvCreateChatCompletionRequest, want_token_ids: bool) -> Result<(), ErrorResponse>) and call it from both handlers — this removes the risk of the two paths diverging (e.g. one path adds a new extra_field, the other forgets).

One minor concern on the placeholder itself: because validate_chat_completion_required_fields later rejects empty messages (line 1545), the placeholder is load-bearing, yet its content is an English magic string that will leak into downstream chat-template rendering if the template doesn't honour nvext.token_data. Please at least centralise the placeholder as a const next to the helper so it's greppable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 897 - 958, Extract the
duplicated RL nvext setup into a single helper (e.g. fn
rl_promote_token_fields(req: &mut NvCreateChatCompletionRequest, want_token_ids:
bool) -> Result<(), ErrorResponse>) and call it from both the current handler
block and handler_chat_completions_tokens; the helper should: take
request.tokens and set nvext.token_data when present, ensure
request.inner.messages contains the same placeholder when empty, merge
"token_ids" and "completion_token_ids" into nvext.extra_fields when
want_token_ids is true, and set request.inner.logprobs = Some(true) if not
already set; centralize the "(token-in mode)" placeholder as a const next to the
helper and return/propagate an ErrorResponse on failure so callers can handle
validation consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/vllm/handlers.py`:
- Around line 747-768: When a hot-swap fails to remove the old adapter, do not
proceed to add_lora or update self.loaded_loras: the current flow catches
exceptions from self.engine_client.remove_lora(...) and then continues, which
(given deterministic lora_id via lora_name_to_id) can leave the engine bound to
the old weights while the in-memory map points to the new path. Change the
except block around remove_lora in the hot-swap branch so that after logging the
failure you fail fast (raise an exception or return an error) instead of falling
through to the add_lora/assignment; ensure you do not call
self.engine_client.add_lora(...) or set self.loaded_loras[lora_name] when
remove_lora fails (refer to remove_lora, add_lora, self.loaded_loras,
LoRARequest, LoRAInfo).

In `@docs/Dynamo-RL-api-draft.md`:
- Around line 906-926: The doc incorrectly states there are five engine routes
and omits the two LoRA routes; update the table and the registration snippet to
include /engine/load_lora_adapter and /engine/unload_lora_adapter (methods POST)
and show they call the corresponding handlers (e.g., engine route names mapped
via runtime.register_engine_route to handler.load_lora_adapter and
handler.unload_lora_adapter), or explicitly state if those routes are registered
elsewhere; ensure the symbols load_lora_adapter, unload_lora_adapter, and
runtime.register_engine_route appear so readers wiring a backend see the full
seven-route surface.

In `@lib/llm/src/http/service/openai.rs`:
- Around line 3056-3060: The helper all_ok(results: &[serde_json::Value])
currently returns true for an empty slice because Iterator::all yields true on
empty iterators; change all_ok (referenced in fan_out/RlState handlers) to
return false when results.is_empty() and only then check that every result has
status "ok" (e.g., ensure !results.is_empty() && results.iter().all(...)). Also
update any callers that expected the old behavior if necessary, and modify
RlState::from_env so it refuses or fatally logs when the parsed
worker_system_urls list is empty (do not silently start with zero workers).
- Around line 3359-3402: The doc comment block describing promotion of token IDs
is attached to rl_tokenize_prompt but belongs to
rl_promote_token_ids_in_response; move the "Promote token IDs from the Dynamo
`nvext` response object…" paragraph out of the rl_tokenize_prompt docblock and
attach it immediately above the rl_promote_token_ids_in_response function
definition so each function has its correct doc comment (keep the
tokenizer/prompt description on rl_tokenize_prompt and the nvext promotion
paragraph with rl_promote_token_ids_in_response).

---

Outside diff comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 3674-4082: Tests constructing NvCreateChatCompletionRequest now
fail because the struct gained required fields tokens: Option<Vec<u32>> and
return_token_ids: Option<bool>; update each test literal (e.g., in
test_validate_chat_completion_required_fields_empty_messages,
test_validate_chat_completion_required_fields_with_messages,
test_bad_base_request_for_chatcompletion and other chat tests) to include
tokens: None, return_token_ids: None in the NvCreateChatCompletionRequest
initializer, or alternatively add a Default impl/derive for
NvCreateChatCompletionRequest so the existing ..Default::default() usages
continue to work.

---

Nitpick comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 1984-1994: The standalone function bad_request<T:
Into<String>>(...) duplicates ErrorMessage::bad_request; remove the free
function and update any call sites to use the associated constructor
ErrorMessage::bad_request (or ErrorMessage::bad_request(message)) so there is a
single source of truth for 400 responses; ensure the return type remains
ErrorResponse and preserve mapping using map_error_code_to_error_type where
used.
- Around line 897-958: Extract the duplicated RL nvext setup into a single
helper (e.g. fn rl_promote_token_fields(req: &mut NvCreateChatCompletionRequest,
want_token_ids: bool) -> Result<(), ErrorResponse>) and call it from both the
current handler block and handler_chat_completions_tokens; the helper should:
take request.tokens and set nvext.token_data when present, ensure
request.inner.messages contains the same placeholder when empty, merge
"token_ids" and "completion_token_ids" into nvext.extra_fields when
want_token_ids is true, and set request.inner.logprobs = Some(true) if not
already set; centralize the "(token-in mode)" placeholder as a const next to the
helper and return/propagate an ErrorResponse on failure so callers can handle
validation consistently.

In `@lib/llm/src/tokenizers.rs`:
- Around line 64-72: The default encode_with_special_tokens currently ignores
the _add_special_tokens flag and just calls encode, which can silently produce
incorrect token streams; change the default implementation of
encode_with_special_tokens to return an explicit Err (or use unimplemented!())
with a clear message so implementors of the Encoder must override it instead of
silently dropping the flag — update the method body for
encode_with_special_tokens (referencing the encode_with_special_tokens and
encode symbols) to return an error describing that encode_with_special_tokens
must be implemented by backend encoders.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 89572d54-22ae-46aa-8336-7a58ba1c028a

📥 Commits

Reviewing files that changed from the base of the PR and between d837fbd and beaafcb.

📒 Files selected for processing (7)
  • components/src/dynamo/frontend/vllm_processor.py
  • components/src/dynamo/vllm/handlers.py
  • docs/Dynamo-RL-api-draft.md
  • lib/llm/src/http/service/openai.rs
  • lib/llm/src/protocols/openai/validate.rs
  • lib/llm/src/tokenizers.rs
  • lib/llm/src/tokenizers/hf.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/src/dynamo/frontend/vllm_processor.py

Comment on lines +747 to +768
if is_hot_swap:
old_id = self.loaded_loras[lora_name].id
try:
await self.engine_client.remove_lora(old_id)
# Invalidate the cache entry immediately after remove succeeds.
# If add_lora below fails, this prevents a stale entry pointing
# at an adapter the engine no longer holds from poisoning future
# rollouts with wrong importance ratios (Tier-1 RL correctness risk).
self.loaded_loras.pop(lora_name, None)
except Exception as e:
logger.warning(
f"[RL] remove_lora({lora_name}, id={old_id}) failed during hot-swap: {e}"
)

await self.engine_client.add_lora(
LoRARequest(
lora_name=lora_name,
lora_int_id=lora_id,
lora_path=lora_path,
)
)
self.loaded_loras[lora_name] = LoRAInfo(id=lora_id, path=lora_path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Silent stale-weights bug when remove_lora fails during hot-swap.

lora_id = lora_name_to_id(lora_name) (line 741) is deterministic, so lora_id == old_id on every hot-swap. Combined with the comment on line 744 — "vLLM's add_lora is a no-op when the lora_int_id is already registered" — the current except on line 756-759 is unsafe:

If remove_lora raises, control falls through to add_lora with the same lora_int_id that is still registered, so add_lora is a no-op. The old adapter stays in the engine, but line 768 overwrites self.loaded_loras[lora_name] with the new path. From this point on, _resolve_lora_request hands out a LoRARequest(lora_path=<new_path>) for an engine-id that still points at the old weights, and every subsequent rollout silently runs under the previous step's adapter. For an RL training loop, that's exactly the kind of logprob/importance-ratio contamination the rest of this code path is trying to prevent.

Fail fast on this path instead of continuing: if we cannot remove the old adapter, we cannot safely hot-swap.

🛡️ Proposed fix — bail out when remove_lora fails on hot-swap
                 if is_hot_swap:
                     old_id = self.loaded_loras[lora_name].id
                     try:
                         await self.engine_client.remove_lora(old_id)
                         # Invalidate the cache entry immediately after remove succeeds.
                         # If add_lora below fails, this prevents a stale entry pointing
                         # at an adapter the engine no longer holds from poisoning future
                         # rollouts with wrong importance ratios (Tier-1 RL correctness risk).
                         self.loaded_loras.pop(lora_name, None)
                     except Exception as e:
-                        logger.warning(
-                            f"[RL] remove_lora({lora_name}, id={old_id}) failed during hot-swap: {e}"
-                        )
+                        # Cannot proceed: since lora_name_to_id is deterministic the new
+                        # lora_id equals old_id, and vLLM's add_lora is a no-op when the
+                        # int id is already registered. Overwriting loaded_loras with the
+                        # new path would silently leave OLD weights live in the engine —
+                        # exactly the correctness hole hot-swap is meant to close.
+                        logger.error(
+                            f"[RL] remove_lora({lora_name}, id={old_id}) failed during hot-swap; "
+                            f"aborting to avoid running with stale adapter weights: {e}"
+                        )
+                        return {
+                            "status": "error",
+                            "message": f"remove_lora failed during hot-swap of '{lora_name}': {e}",
+                            "lora_name": lora_name,
+                        }
🧰 Tools
🪛 Ruff (0.15.11)

[warning] 756-756: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/handlers.py` around lines 747 - 768, When a
hot-swap fails to remove the old adapter, do not proceed to add_lora or update
self.loaded_loras: the current flow catches exceptions from
self.engine_client.remove_lora(...) and then continues, which (given
deterministic lora_id via lora_name_to_id) can leave the engine bound to the old
weights while the in-memory map points to the new path. Change the except block
around remove_lora in the hot-swap branch so that after logging the failure you
fail fast (raise an exception or return an error) instead of falling through to
the add_lora/assignment; ensure you do not call self.engine_client.add_lora(...)
or set self.loaded_loras[lora_name] when remove_lora fails (refer to
remove_lora, add_lora, self.loaded_loras, LoRARequest, LoRAInfo).

Comment on lines +906 to +926
Five engine route handlers are registered on each vLLM worker's system HTTP port (default `8081` local / `9090` in k8s). These are **internal** routes called by the Rust frontend's `/v1/rl/*` handlers -- not called directly by Prime-RL.

| Route | Method | vLLM API called | Description |
|-------|--------|-----------------|-------------|
| `/engine/pause_generation` | POST | `engine_client.pause_generation()` | Drain in-flight requests, keep model loaded in GPU memory |
| `/engine/resume_generation` | POST | `engine_client.resume_generation()` | Resume accepting inference requests |
| `/engine/flush_cache` | POST | `engine_client.reset_prefix_cache()` | Invalidate prefix/KV cache (required before weight reload) |
| `/engine/update_weights_from_path` | POST | `engine_client.collective_rpc("reload_weights", ...)` | Load weights from filesystem (safetensors checkpoint) |
| `/engine/get_weight_version` | POST | `self._weight_version` | Return current weight version string |

Both decode and prefill worker types register all 5 routes. Route signatures are compatible with SGLang's merged `#6094` routes for backend interoperability.

### Registration (worker_factory.py)

```python
runtime.register_engine_route("pause_generation", handler.pause_generation)
runtime.register_engine_route("resume_generation", handler.resume_generation)
runtime.register_engine_route("flush_cache", handler.flush_cache)
runtime.register_engine_route("update_weights_from_path", handler.update_weights_from_path)
runtime.register_engine_route("get_weight_version", handler.get_weight_version)
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Section 7 undercounts the engine routes — inconsistent with section 4.4 and the PR summary.

Line 906 says "Five engine route handlers" and the register_engine_route snippet (920-926) only lists the 5 weight-lifecycle routes, but:

  • Line 63 of this same doc states "7 engine route handlers (5 weight-lifecycle + 2 LoRA)".
  • §4.4 documents /v1/rl/load_lora_adapter and /v1/rl/unload_lora_adapter as first-class fleet-control endpoints.
  • components/src/dynamo/vllm/handlers.py defines load_lora_adapter / unload_lora_adapter alongside the other 5.

Please extend the table and the registration snippet to include the two LoRA routes (or note explicitly why they're registered elsewhere), otherwise someone wiring up a second backend from this doc will silently miss the LoRA surface.

📝 Suggested additions
-Five engine route handlers are registered on each vLLM worker's system HTTP port (default `8081` local / `9090` in k8s). These are **internal** routes called by the Rust frontend's `/v1/rl/*` handlers -- not called directly by Prime-RL.
+Seven engine route handlers are registered on each vLLM worker's system HTTP port (default `8081` local / `9090` in k8s). These are **internal** routes called by the Rust frontend's `/v1/rl/*` handlers -- not called directly by Prime-RL.

 | Route | Method | vLLM API called | Description |
 |-------|--------|-----------------|-------------|
 | `/engine/pause_generation` | POST | `engine_client.pause_generation()` | Drain in-flight requests, keep model loaded in GPU memory |
 | `/engine/resume_generation` | POST | `engine_client.resume_generation()` | Resume accepting inference requests |
 | `/engine/flush_cache` | POST | `engine_client.reset_prefix_cache()` | Invalidate prefix/KV cache (required before weight reload) |
 | `/engine/update_weights_from_path` | POST | `engine_client.collective_rpc("reload_weights", ...)` | Load weights from filesystem (safetensors checkpoint) |
 | `/engine/get_weight_version` | POST | `self._weight_version` | Return current weight version string |
+| `/engine/load_lora_adapter`   | POST | `engine_client.add_lora(...)` (+ `remove_lora` on hot-swap, `reset_prefix_cache`) | Load or hot-swap a PEFT LoRA adapter from a filesystem path |
+| `/engine/unload_lora_adapter` | POST | `engine_client.remove_lora(...)`                                                | Remove a previously-loaded LoRA adapter by name (idempotent) |

And mirror the additions in the register_engine_route(...) snippet.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/Dynamo-RL-api-draft.md` around lines 906 - 926, The doc incorrectly
states there are five engine routes and omits the two LoRA routes; update the
table and the registration snippet to include /engine/load_lora_adapter and
/engine/unload_lora_adapter (methods POST) and show they call the corresponding
handlers (e.g., engine route names mapped via runtime.register_engine_route to
handler.load_lora_adapter and handler.unload_lora_adapter), or explicitly state
if those routes are registered elsewhere; ensure the symbols load_lora_adapter,
unload_lora_adapter, and runtime.register_engine_route appear so readers wiring
a backend see the full seven-route surface.

Comment on lines +3056 to +3060
fn all_ok(results: &[serde_json::Value]) -> bool {
results
.iter()
.all(|r| r.get("status").and_then(|s| s.as_str()) == Some("ok"))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

all_ok returns true on an empty worker list — every admin endpoint silently succeeds with zero workers.

Iterator::all on an empty iterator is true, so when worker_system_urls is empty (which happens whenever DYN_RL_WORKER_SYSTEM_URLS is set to an empty string, since split(',').filter(|s| !s.is_empty()) produces no URLs), fan_out returns Vec::new() and RlState::all_ok(&[]) is true. Consequences:

  • POST /v1/rl/pause → 200 OK "all 0 worker(s) paused" while no worker was contacted.
  • POST /v1/rl/resume, POST /v1/rl/update_weights, POST /v1/rl/load_lora_adapter, POST /v1/rl/unload_lora_adapter all exhibit the same silent-no-op.
  • Prime-RL can then race: it thinks generation is drained, pushes new weights, resumes — but workers are serving the old model the whole time, corrupting rollouts without any error surfaced.

rl_ready already guards this with !results.is_empty() (line 3082). The other handlers should too. The fix is best applied at the source.

🐛 Proposed fix
     /// Returns true only if all results have `status: "ok"`.
     fn all_ok(results: &[serde_json::Value]) -> bool {
-        results
+        !results.is_empty()
+            && results
             .iter()
             .all(|r| r.get("status").and_then(|s| s.as_str()) == Some("ok"))
     }

Additionally, RlState::from_env should refuse to start with an empty worker list (warn + exit/return an error), or at least log at error! so this is loud rather than silent:

         let worker_system_urls = std::env::var(DYN_RL_WORKER_SYSTEM_URLS_ENV)
             .unwrap_or_else(|_| "http://localhost:8081".to_string())
             .split(',')
             .map(|s| s.trim().to_string())
             .filter(|s| !s.is_empty())
             .collect::<Vec<_>>();
+        if worker_system_urls.is_empty() {
+            tracing::error!(
+                "{DYN_RL_WORKER_SYSTEM_URLS_ENV} resolved to zero worker URLs; \
+                 all /v1/rl/* admin endpoints will be no-ops"
+            );
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 3056 - 3060, The helper
all_ok(results: &[serde_json::Value]) currently returns true for an empty slice
because Iterator::all yields true on empty iterators; change all_ok (referenced
in fan_out/RlState handlers) to return false when results.is_empty() and only
then check that every result has status "ok" (e.g., ensure !results.is_empty()
&& results.iter().all(...)). Also update any callers that expected the old
behavior if necessary, and modify RlState::from_env so it refuses or fatally
logs when the parsed worker_system_urls list is empty (do not silently start
with zero workers).

Comment on lines +3359 to +3402
/// Promote token IDs from the Dynamo `nvext` response object to the top-level
/// locations that Prime-RL / verifiers expects:
///
/// response.nvext.completion_token_ids → response.choices[i].token_ids
///
/// Tokenize chat messages using the model's tokenizer and return prompt token IDs.
/// Used by the RL post-processing path to populate `response.prompt_token_ids`.
fn rl_tokenize_prompt(
state: &Arc<service_v2::State>,
model: &str,
messages: &[dynamo_protocols::types::ChatCompletionRequestMessage],
) -> Option<Vec<u32>> {
if messages.is_empty() {
return None;
}
let (_, card) = resolve_model_card(state, Some(model)).ok()?;
let tokenizer = card.tokenizer().ok()?;
let formatter = crate::preprocessor::prompt::PromptFormatter::from_mdc(&card).ok()?;
let inner_request = dynamo_protocols::types::CreateChatCompletionRequest {
model: model.to_string(),
messages: messages.to_vec(),
..Default::default()
};
let wrapped = crate::protocols::openai::chat_completions::NvCreateChatCompletionRequest {
inner: inner_request,
common: Default::default(),
nvext: None,
chat_template_args: None,
media_io_kwargs: None,
tokens: None,
return_token_ids: None,
unsupported_fields: Default::default(),
};
let prompt = match formatter {
crate::preprocessor::prompt::PromptFormatter::OAI(f) => f.render(&wrapped),
}
.ok()?;
let encoding = tokenizer.encode_with_special_tokens(&prompt, true).ok()?;
Some(encoding.token_ids().to_vec())
}

/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`
/// extension structure. Called on non-streaming responses when RL token ID mode
/// is active.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Doc block for rl_promote_token_ids_in_response is attached to the wrong function.

The first paragraph at lines 3359-3363 (Promote token IDs from the Dynamo \nvext` response object …) clearly describes rl_promote_token_ids_in_response(defined at line 3403), notrl_tokenize_prompt. Same failure mode as the stale Not Implemented Errordoc that was flagged abovebad_request` — split it and move the "Promote" paragraph down to the right function.

🧹 Proposed fix
-/// Promote token IDs from the Dynamo `nvext` response object to the top-level
-/// locations that Prime-RL / verifiers expects:
-///
-///   response.nvext.completion_token_ids  →  response.choices[i].token_ids
-///
 /// Tokenize chat messages using the model's tokenizer and return prompt token IDs.
 /// Used by the RL post-processing path to populate `response.prompt_token_ids`.
 fn rl_tokenize_prompt(
@@
-/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`
-/// extension structure. Called on non-streaming responses when RL token ID mode
-/// is active.
+/// Promote token IDs from the Dynamo `nvext` response object to the top-level
+/// locations that Prime-RL / verifiers expects:
+///
+///   response.nvext.completion_token_ids  →  response.choices[i].token_ids
+///
+/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`
+/// extension structure. Called on non-streaming responses when RL token ID mode
+/// is active.
 fn rl_promote_token_ids_in_response(json_val: &mut serde_json::Value) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Promote token IDs from the Dynamo `nvext` response object to the top-level
/// locations that Prime-RL / verifiers expects:
///
/// response.nvext.completion_token_ids → response.choices[i].token_ids
///
/// Tokenize chat messages using the model's tokenizer and return prompt token IDs.
/// Used by the RL post-processing path to populate `response.prompt_token_ids`.
fn rl_tokenize_prompt(
state: &Arc<service_v2::State>,
model: &str,
messages: &[dynamo_protocols::types::ChatCompletionRequestMessage],
) -> Option<Vec<u32>> {
if messages.is_empty() {
return None;
}
let (_, card) = resolve_model_card(state, Some(model)).ok()?;
let tokenizer = card.tokenizer().ok()?;
let formatter = crate::preprocessor::prompt::PromptFormatter::from_mdc(&card).ok()?;
let inner_request = dynamo_protocols::types::CreateChatCompletionRequest {
model: model.to_string(),
messages: messages.to_vec(),
..Default::default()
};
let wrapped = crate::protocols::openai::chat_completions::NvCreateChatCompletionRequest {
inner: inner_request,
common: Default::default(),
nvext: None,
chat_template_args: None,
media_io_kwargs: None,
tokens: None,
return_token_ids: None,
unsupported_fields: Default::default(),
};
let prompt = match formatter {
crate::preprocessor::prompt::PromptFormatter::OAI(f) => f.render(&wrapped),
}
.ok()?;
let encoding = tokenizer.encode_with_special_tokens(&prompt, true).ok()?;
Some(encoding.token_ids().to_vec())
}
/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`
/// extension structure. Called on non-streaming responses when RL token ID mode
/// is active.
/// Tokenize chat messages using the model's tokenizer and return prompt token IDs.
/// Used by the RL post-processing path to populate `response.prompt_token_ids`.
fn rl_tokenize_prompt(
state: &Arc<service_v2::State>,
model: &str,
messages: &[dynamo_protocols::types::ChatCompletionRequestMessage],
) -> Option<Vec<u32>> {
if messages.is_empty() {
return None;
}
let (_, card) = resolve_model_card(state, Some(model)).ok()?;
let tokenizer = card.tokenizer().ok()?;
let formatter = crate::preprocessor::prompt::PromptFormatter::from_mdc(&card).ok()?;
let inner_request = dynamo_protocols::types::CreateChatCompletionRequest {
model: model.to_string(),
messages: messages.to_vec(),
..Default::default()
};
let wrapped = crate::protocols::openai::chat_completions::NvCreateChatCompletionRequest {
inner: inner_request,
common: Default::default(),
nvext: None,
chat_template_args: None,
media_io_kwargs: None,
tokens: None,
return_token_ids: None,
unsupported_fields: Default::default(),
};
let prompt = match formatter {
crate::preprocessor::prompt::PromptFormatter::OAI(f) => f.render(&wrapped),
}
.ok()?;
let encoding = tokenizer.encode_with_special_tokens(&prompt, true).ok()?;
Some(encoding.token_ids().to_vec())
}
/// Promote token IDs from the Dynamo `nvext` response object to the top-level
/// locations that Prime-RL / verifiers expects:
///
/// response.nvext.completion_token_ids → response.choices[i].token_ids
///
/// This lets Prime-RL read `choice.token_ids` without knowing about the `nvext`
/// extension structure. Called on non-streaming responses when RL token ID mode
/// is active.
fn rl_promote_token_ids_in_response(json_val: &mut serde_json::Value) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 3359 - 3402, The doc comment
block describing promotion of token IDs is attached to rl_tokenize_prompt but
belongs to rl_promote_token_ids_in_response; move the "Promote token IDs from
the Dynamo `nvext` response object…" paragraph out of the rl_tokenize_prompt
docblock and attach it immediately above the rl_promote_token_ids_in_response
function definition so each function has its correct doc comment (keep the
tokenizer/prompt description on rl_tokenize_prompt and the nvext promotion
paragraph with rl_promote_token_ids_in_response).

Comment on lines +44 to +45
| Pause fleet | `POST /v1/rl/pause` | Drain in-flight requests before weight update |
| Resume fleet | `POST /v1/rl/resume` | Resume generation after weight update |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vLLM uses pause_generation and resume_generation; it also has 3 modes: "abort", "wait", and "keep". It seems as though your pause is only "wait" -- do you plan to add "abort" or "keep" logic? Could be useful to follow similar logic as vLLM (also note that their default is abort)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user needs a way to confirm whether in-flight requests are done; /v1/rl/pause itself could be the confirmation point. This could be updated to be a synchronous pause(mode=wait) that returns only when the fleet is quisced, with a response like {"status":"paused","inflight_requests":0,...}. If you need bounded latency, you can return 202 Accepted with an operation ID or expose a GET /v1/rl/state endpoint that reports information like whether or not it's paused, inflight requests, desired weight version, applied weight version, etc. I would strongly prefer one state endpoint over making clients stitch together ready + weight_version + pause status

Comment on lines +50 to +51
| Health | `GET /v1/rl/health` | Lightweight frontend health check |
| Readiness | `GET /v1/rl/ready` | Deep check: are workers reachable and healthy? |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a broader Dynamo notion of readiness, what makes this different?

| Update weights | `POST /v1/rl/update_weights` | Atomic flush + reload from checkpoint directory |
| Load LoRA adapter | `POST /v1/rl/load_lora_adapter` | Hot-load/swap a PEFT-style adapter from filesystem path |
| Unload LoRA adapter | `POST /v1/rl/unload_lora_adapter` | Remove a previously loaded adapter by name |
| Weight version | `GET /v1/rl/weight_version` | Query current weight version across workers |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concept is useful, but the standalone endpoint feels too narrow. I'd fold it into /v1/rl/state or /v1/rl/status rather than keep a dedicated endpoint

}

impl TokenizeChatRequest {
pub fn validate(&self) -> Result<(), String> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this called anywhere?

)
} else {
(
StatusCode::OK,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function returns status OK no matter what -- what if there are failures?

Comment on lines +308 to +314
### 4.2 Token-In / Token-Out (TITO)

```
POST /v1/chat/completions/tokens
```

Dedicated endpoint for Prime-RL's pre-tokenized prompt flow (multi-turn RL, turn 2+). The orchestrator sends raw token IDs instead of text messages, bypassing the frontend's tokenizer entirely. This avoids redundant encode/decode round-trips and ensures token-level alignment.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the distinction between this new endpoint and passing in tokens into /v1/chat/completions? Why not just use that? Line 213 of this doc


---

#### `POST /v1/rl/load_lora_adapter`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These LoRA endpoints are okay but very Prime-RL-shaped right now. Why not extend the existing /v1/loras model to support local-path sources?

Comment on lines +468 to +482
#### `GET /v1/rl/health`

Lightweight liveness probe. Returns immediately as long as the frontend process is running. Used by prime-rl's `check_health()` on the admin client.

```bash
curl -s http://localhost:8000/v1/rl/health
```

```json
{"status": "ok"}
```

---

#### `GET /v1/rl/ready`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this in another comment, but I'd prefer a single /v1/rl/state or /v1/rl/status endpoint over accumulating health, ready, and weight_version. That would give you one place for readiness, applied version, desired version, pause state, and per-worker detail.

@biswapanda

Copy link
Copy Markdown
Contributor Author

Agree with the points. I'll address them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend::vllm Relates to the vllm backend container documentation Improvements or additions to documentation frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants