feat: Nemo-rl support#7699
Conversation
WalkthroughThis PR introduces tokenization and detokenization support to the OpenAI HTTP service via new endpoints, adds tokenization protocol types for completion and chat requests, extends tokenizer traits with special-token handling and token-to-string conversion, implements these new capabilities across multiple tokenizer backends, enables chat template overrides during formatter initialization, and adds model card persistence in the engine configuration. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (4)
lib/llm/src/protocols/openai/tokenization.rs (1)
87-92: Untagged enum order may affect ambiguous JSON deserialization.With
#[serde(untagged)], serde attempts deserialization in variant order. SinceTokenizeCompletionRequestis listed first, a JSON object with bothpromptandmessagesfields would deserialize asCompletion(silently ignoringmessages) due to serde tryingCompletionfirst.This should work correctly in practice since valid requests typically contain only the fields for one variant, but consider adding a note in documentation or using
#[serde(deny_unknown_fields)]at the enum level if strict validation is desired.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/protocols/openai/tokenization.rs` around lines 87 - 92, The untagged enum TokenizeRequest (variants TokenizeCompletionRequest and TokenizeChatRequest) can deserialize ambiguously because serde tries variants in order; a JSON with both prompt and messages would match Completion first and drop messages. Fix by either documenting this behavior on TokenizeRequest so callers avoid mixed payloads or make deserialization stricter (e.g., add #[serde(deny_unknown_fields)] to TokenizeCompletionRequest and TokenizeChatRequest or change to a tagged enum) so mixed-field objects are rejected; update the TokenizeRequest definition and related docs accordingly.lib/llm/src/http/service/openai.rs (3)
2054-2068: Consider documenting root-level path choice.The endpoints are registered at
/tokenizeand/detokenize(root level), while other OpenAI-compatible endpoints use/v1/prefix (e.g.,/v1/completions). This appears intentional for vllm compatibility, but a brief comment would clarify this design decision.📝 Suggested documentation
+/// Create an Axum [`Router`] for the tokenization endpoints. +/// Paths are `/tokenize` and `/detokenize` (root level, matching vllm API convention). pub fn tokenization_router(state: Arc<service_v2::State>) -> (Vec<RouteDoc>, Router) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/openai.rs` around lines 2054 - 2068, The tokenization endpoints are registered at root-level paths (tokenize_path and detokenize_path) inside tokenization_router which differs from the other OpenAI-compatible routes that use the "/v1/" prefix; add a concise comment immediately above the pub fn tokenization_router(...) explaining that the root-level paths are intentional (for vllm compatibility or any other reason), and mention the expected behavior/compatibility implications so future readers know why tokenize and detokenize are not under "/v1/".
368-375: Empty join separator may concatenate text parts incorrectly.When
contentis anArray, the text parts are joined with an empty string (""). If the array contains parts like["Hello", " world"](with leading space), this works correctly. However, if parts are["Hello", "world"], they become"Helloworld".Verify this matches the expected format of
ChatCompletionRequestAssistantMessageContentPart::Textfrom the OpenAI spec. If parts represent discrete text segments, consider joining with a space or preserving original spacing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/openai.rs` around lines 368 - 375, The current handling of Content::Array collects Part::Text pieces and joins them with an empty separator, which can incorrectly concatenate words (e.g., "Hello" + "world" => "Helloworld"); update the logic in the Content::Array branch (where Part::Text is mapped) to preserve spacing between segments per the OpenAI ChatCompletionRequestAssistantMessageContentPart::Text semantics—either join with a single space or otherwise preserve original whitespace boundaries between parts so discrete text segments do not run together.
405-414: rfind-based trimming assumes literal text match in rendered output.The function uses
rfind(trimmed_final_message)to locate where to truncate. This works if the chat template renders the assistant content verbatim, but some templates may escape, quote, or transform the content (e.g., XML escaping<to<).If the template transforms content,
rfindwill fail and return an internal server error. Consider adding a debug log or comment explaining this assumption.📝 Suggested documentation
// Use rfind to locate the last occurrence of the assistant content in the rendered prompt, // then truncate everything after it (e.g. EOS tokens, generation prompts). This assumes the // final assistant message text appears only once at the end of the rendered output. + // NOTE: This approach assumes the chat template renders assistant content verbatim without + // escaping or transformation. Templates that modify content (e.g., XML escaping) will fail. let Some(final_msg_loc) = rendered_prompt.rfind(trimmed_final_message) else {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/openai.rs` around lines 405 - 414, The rfind-based trim in the continue-final-message logic (where rendered_prompt.rfind(trimmed_final_message) is used) assumes the assistant content appears verbatim; update the failure branch to log debug information (using the module's logger) including a sanitized preview of rendered_prompt and trimmed_final_message and then return the ErrorMessage::internal_server_error with a clearer hint about possible escaping/transformations; also add an inline comment near the rfind call documenting this assumption so future readers know why a literal match may fail and to guide implementing an escaping-aware fallback if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 310-339: Add a concise in-code comment inside
resolve_tokenizer_model_name explaining the fallback hierarchy: that when no
requested_model is provided the function first prefers
state.manager().model_display_names() (models known to the serving layer), and
only if that yields a single name falls back to
state.manager().get_model_cards() (card metadata which may exist before a model
is fully registered); mention that this is intentional to allow tokenizer
endpoints to use card-only metadata while noting the subtle risk that card names
may not map to serving-capable models and referencing has_model_any,
ErrorMessage::model_not_found, and bad_request so future maintainers understand
why the two checks exist and the order must be preserved.
In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 103-110: The module defines TokenizeResponse but is missing
DetokenizeRequest and DetokenizeResponse, causing imports in openai.rs to fail;
add two struct definitions named DetokenizeRequest and DetokenizeResponse
immediately after TokenizeResponse: DetokenizeRequest should carry a
Vec<TokenIdType> (e.g., tokens) and an optional model info if needed, and
DetokenizeResponse should contain the resulting String (e.g., text) and any
optional metadata (match fields used by the detokenize handler in openai.rs).
Ensure both derive the same traits as TokenizeResponse (Debug, Clone, Serialize,
Deserialize, PartialEq, Eq) and use serde attributes consistent with token_strs
for optional fields so the detokenize handler compiles.
In `@lib/llm/src/tokenizers/tiktoken.rs`:
- Around line 103-105: The TikTokenTokenizer encode() currently calls
self.encode_with_special_tokens(input, true) which includes special tokens by
default and is inconsistent with HuggingFaceTokenizer and FastTokenizer; change
TikTokenTokenizer::encode to call self.encode_with_special_tokens(input, false)
so the default behavior excludes special tokens like the other backends (or
alternatively add explicit documentation on this intentional difference if you
prefer to keep true).
---
Nitpick comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 2054-2068: The tokenization endpoints are registered at root-level
paths (tokenize_path and detokenize_path) inside tokenization_router which
differs from the other OpenAI-compatible routes that use the "/v1/" prefix; add
a concise comment immediately above the pub fn tokenization_router(...)
explaining that the root-level paths are intentional (for vllm compatibility or
any other reason), and mention the expected behavior/compatibility implications
so future readers know why tokenize and detokenize are not under "/v1/".
- Around line 368-375: The current handling of Content::Array collects
Part::Text pieces and joins them with an empty separator, which can incorrectly
concatenate words (e.g., "Hello" + "world" => "Helloworld"); update the logic in
the Content::Array branch (where Part::Text is mapped) to preserve spacing
between segments per the OpenAI
ChatCompletionRequestAssistantMessageContentPart::Text semantics—either join
with a single space or otherwise preserve original whitespace boundaries between
parts so discrete text segments do not run together.
- Around line 405-414: The rfind-based trim in the continue-final-message logic
(where rendered_prompt.rfind(trimmed_final_message) is used) assumes the
assistant content appears verbatim; update the failure branch to log debug
information (using the module's logger) including a sanitized preview of
rendered_prompt and trimmed_final_message and then return the
ErrorMessage::internal_server_error with a clearer hint about possible
escaping/transformations; also add an inline comment near the rfind call
documenting this assumption so future readers know why a literal match may fail
and to guide implementing an escaping-aware fallback if needed.
In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 87-92: The untagged enum TokenizeRequest (variants
TokenizeCompletionRequest and TokenizeChatRequest) can deserialize ambiguously
because serde tries variants in order; a JSON with both prompt and messages
would match Completion first and drop messages. Fix by either documenting this
behavior on TokenizeRequest so callers avoid mixed payloads or make
deserialization stricter (e.g., add #[serde(deny_unknown_fields)] to
TokenizeCompletionRequest and TokenizeChatRequest or change to a tagged enum) so
mixed-field objects are rejected; update the TokenizeRequest definition and
related docs accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b5291155-c3ef-4acf-affa-8df89d991f48
📒 Files selected for processing (10)
lib/llm/src/entrypoint/input/http.rslib/llm/src/http/service/openai.rslib/llm/src/http/service/service_v2.rslib/llm/src/preprocessor/prompt/template.rslib/llm/src/protocols/openai.rslib/llm/src/protocols/openai/tokenization.rslib/llm/src/tokenizers.rslib/llm/src/tokenizers/fastokens.rslib/llm/src/tokenizers/hf.rslib/llm/src/tokenizers/tiktoken.rs
| manager.save_model_card( | ||
| &format!("__local_model_card_{}", model.display_name()), | ||
| model.card().clone(), | ||
| )?; |
There was a problem hiding this comment.
The original motivation was to support the local/in-process models, but we can probably get rid of that.
|
@jthomson04 In the spirit of recent Sync, if you have the agent plan file, could you attach it? |
296ce5a to
084dd6f
Compare
I didn't make a plan file for this one, but I'll make sure to do that in future MRs |
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Add try/except fallbacks for import paths that moved between vLLM 0.18 and 0.19 (TokensPrompt, MultiModalUUIDDict) and relax the version pin to accept both releases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The JailedStream accumulates content while detecting tool-call markers, but previously discarded any logprobs attached to jailed chunks. This meant that when `logprobs` was requested, tool-call responses would come back with `logprobs: null` even though the backend had produced them. Accumulate logprobs alongside content in ChoiceJailState and attach them to the emitted choice when the jail ends (both mid-stream unjail and stream-finish flush). Also propagate logprobs through `create_tool_call_choice` so the final tool-call chunk carries the full token-level log-probability data. Adds three unit tests covering single-chunk, multi-chunk, and mixed text+tool_call scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4cad176 to
d3d39f5
Compare
…from #7699 Cherry-picks the following from jthomson04/tokenize-endpoint: - POST /v1/tokenize and /v1/detokenize HTTP endpoints - Tokenizer trait: encode_with_special_tokens(), convert_ids_to_tokens() - return_tokens_as_token_ids parameter for chat completions - Multi-instance tokenize fix (discovery watcher) - Jail logprobs preservation through tool-call jailing
…robs Add return_tokens_as_token_ids support to the SGLang decode handler, mirroring what PR #7699 added for vLLM. When enabled, logprob token fields are returned as "token_id:<id>" instead of decoded text. Changes: - decode_handler.py: Read return_tokens_as_token_ids from output_options, pass through _process_token_stream to _extract_logprobs, format token strings accordingly - sglang_processor.py: Forward return_tokens_as_token_ids through _build_dynamo_preproc output_options - vllm/handlers.py: Remove debug print left in cherry-picked code
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
…7699) Manually cherry-picked the /tokenize and /detokenize endpoint code from ai-dynamo/dynamo PR #7699 (jthomson04/tokenize-endpoint). The full PR had merge conflicts due to divergent chat_completions.rs refactoring, so only the tokenization-specific changes were applied: - New file: protocols/openai/tokenization.rs (request/response types) - Tokenizer trait additions: encode_with_special_tokens, convert_ids_to_tokens - HTTP routes: POST /tokenize, POST /detokenize - Route wiring in service_v2.rs These endpoints follow vLLM's /tokenize and /detokenize semantics, enabling consistent tokenization between the Rust frontend and Python clients for RL training.
All rl_admin functionality has been re-implemented natively in the Dynamo
Rust frontend (lib/llm):
- Admin routes (/pause, /resume, /update_weights, /health, /ready,
/load_lora_adapter, /unload_lora_adapter) → openai.rs /v1/rl/* router
- Token injection (prompt_token_ids, choice.token_ids) → handler_chat_completions
- Tokenization → /v1/tokenize + /v1/detokenize (cherry from #7699)
Prime-RL now points admin_base_url directly at the Dynamo frontend
(/v1/rl); no separate Python sidecar is deployed. The module has zero
imports across the codebase and is not referenced in any Dockerfile,
Helm chart, or K8s manifest.
Cleanup-only — no behavior change. Strips review-tracker noise that accumulated on top of PR-added text during iteration: - "Closes hhzhang16 HH-19/HH-21/HH-22/HH-23/HH-25/HH-26/HH-27" - "CR-8 / CR-9 / CR-10 closure" prefixes on serde-error / doc-attach fixes - Branch-name references: bis/parity-tokenize-tcp, bis/prime-rl-merged - Internal PR numbers: #6094, #7699, #8197, #9141 - Phase numbers from internal design docs (rl-support.md Phase 1/4/5) - "prime-rl" mentions in narrative copy and mermaid diagrams → generic "RL trainer / RL orchestrator / external client" Technical content (semantics, invariants, why-this-exists rationale) preserved everywhere; only the internal-process scaffolding is removed. Scope verification: every removed line is one this branch ADDED (diff main..HEAD shows the removed text on a `+` line). No edits land on pre-existing main-branch comments. Specifically reverted the nvext.rs cleanup attempt — its target lines (GAIE Stage 1/2, SGLang-specific) live on main, not in this PR's diff. Files touched: components/src/dynamo/vllm/handlers.py +9 -10 components/src/dynamo/vllm/worker_factory.py +6 -4 docs/dynamo-RL-api.md +19 -32 lib/llm/src/http/service/openai.rs +32 -34 lib/llm/src/protocols/openai/chat_completions/delta.rs +4 -4 lib/llm/src/protocols/openai/completions/delta.rs +3 -3 lib/llm/src/protocols/openai/validate.rs +20 -20 cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
…e shutdowns, tier the docs
Stack-ranked review fixes from the latest Graham-style pass on this branch.
Tier 4 (commit message conventions and tokenizer-crate scope) deliberately
deferred.
Tier 1 — correctness / contract claims
- Drop the word "atomic" from the `/v1/rl/update_weights` docblock and
spell out the partial-failure semantics: workers `0..N-1` may have
switched while worker `N` failed; per-worker status is in the
response, true rollback is a follow-up.
- Reject `n > 1` on `/v1/chat/completions` when RL token IDs are
requested. The streaming aggregator's `completion_token_ids` Vec is
shared across all choices, so per-choice promotion downstream cannot
recover which tokens belong to which choice with `n > 1`. Hard-reject
is the interim guard until the keyed-by-index refactor lands.
- `update_weights` empty `weight_dir` (`""`) is now treated the same
as null/missing (NCCL-mode no-op) instead of being forwarded to the
engine as `path=""`.
Tier 2 — hard rules
- Hot-path `[RL]` `logger.info` → `logger.debug` for pause / resume /
flush_cache / weights-load / LoRA load / LoRA unload (8 sites). RL
trainers fire these per training step; info-level was a log flood.
- Extract the duplicated 4-line `EngineDeadError` shutdown stanza into
`BaseWorkerHandler._shutdown_on_engine_dead(e) -> NoReturn` and
collapse the 9 call sites (8 RL handlers + 1 generate path) to a
single line each. ~32 lines removed.
- Strip the remaining internal-tracker comments missed by an earlier
chore-scrub: `service_v2.rs` had `jthomson04 PR #7699`,
`bis-dev/design-docs/rl-support.md §1`, and `hhzhang16 HH-22 / HH-26`
references in two places. Replaced with neutral prose.
- Strip the SGLang-coordination comment in `handlers.py` ("Signatures
intentionally line up with the SGLang RL admin routes"). The kind of
line that goes stale when SGLang's admin set drifts.
- Delete dead-code carcasses for the dropped routes
(`/v1/chat/completions/tokens`, `/v1/tokenize`, `/v1/detokenize`):
remove `tokenize`, `detokenize`, `tokenization_router`,
`chat_completions_tokens_router`, `handler_chat_completions_tokens`
(~240 lines), drop `pub mod tokenization;`, delete the
`tokenization.rs` module file (124 lines), drop unused tokenize-type
imports. All five were behind `#[allow(dead_code)]` and unmounted.
Tier 3 — tests
- Make `RlState::new(...)` a pub(super) test-friendly constructor so
handler-level tests don't need `from_env` / process env vars.
- Convert `RlPauseQuery::mode: Option<String>` to `Option<PauseMode>`,
a typed enum with `serde(rename_all = "lowercase")`. Axum now returns
400 on `mode=foo` before the handler runs; the runtime string match
is gone.
- Add four behavior tests in `mod tests`:
test_pause_mode_serde_roundtrip
test_pause_mode_rejects_unknown_value
test_rl_update_weights_body_defaults
test_rl_state_new_constructs_without_env
- Fix unrelated test struct-init breakage that shipped earlier in
this branch: 26 sites across 11 files were missing the
`prompt_token_ids`, `return_token_ids`, `tokens`, and
`completion_token_ids` fields added to the response/request/nvext
structs. `cargo test -p dynamo-llm` now compiles cleanly.
Tier 5 — design / nits
- `DYN_RL_LIVENESS_TIMEOUT_MS` is read once in `RlState::from_env` and
cached as `RlState.probe_timeout: Duration`; `rl_ready` and
`rl_liveness` use the cached value instead of re-parsing the env on
every request.
- `rl_ready` worker-probe body simplified from a four-link
`.ok().and_then(Result::ok).map(...).unwrap_or(false)` chain to a
`match (timeout, send_result)` that surfaces the timeout/network
distinction for a future log line.
cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --lib: 58 passed (4 new RL tests).
cargo test -p dynamo-llm --tests integration suites: all green.
Adds /tokenize and /detokenize endpoints, following the semantics of vllm. This is needed for RL.
Summary by CodeRabbit
POST /tokenize,POST /detokenize) for encoding prompts and chat messages to tokens and decoding tokens back to text