Skip to content

feat: Nemo-rl support#7699

Draft
jthomson04 wants to merge 20 commits into
mainfrom
jthomson04/tokenize-endpoint
Draft

feat: Nemo-rl support#7699
jthomson04 wants to merge 20 commits into
mainfrom
jthomson04/tokenize-endpoint

Conversation

@jthomson04

@jthomson04 jthomson04 commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

Adds /tokenize and /detokenize endpoints, following the semantics of vllm. This is needed for RL.

Summary by CodeRabbit

  • New Features
    • Added tokenization endpoints (POST /tokenize, POST /detokenize) for encoding prompts and chat messages to tokens and decoding tokens back to text
    • Added support for custom chat template overrides when rendering prompts
    • Added optional token string output in tokenization responses for human-readable token representation

@jthomson04 jthomson04 requested a review from a team as a code owner March 31, 2026 04:35
@github-actions github-actions Bot added feat frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Mar 31, 2026
@coderabbitai

coderabbitai Bot commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR introduces tokenization and detokenization support to the OpenAI HTTP service via new endpoints, adds tokenization protocol types for completion and chat requests, extends tokenizer traits with special-token handling and token-to-string conversion, implements these new capabilities across multiple tokenizer backends, enables chat template overrides during formatter initialization, and adds model card persistence in the engine configuration.

Changes

Cohort / File(s) Summary
HTTP Service Tokenization Endpoints
lib/llm/src/http/service/openai.rs, lib/llm/src/http/service/service_v2.rs
Added tokenization_router with POST /tokenize and POST /detokenize endpoints for completion and chat requests, including model resolution, prompt rendering, token string optional output, validation helpers, and comprehensive unit test coverage. Registered router in base routes list.
Tokenization Protocol Types
lib/llm/src/protocols/openai.rs, lib/llm/src/protocols/openai/tokenization.rs
Introduced new module with TokenizeCompletionRequest, TokenizeChatRequest, TokenizeRequest enum, and TokenizeResponse structures, including validation for conflicting flags and merged template kwargs computation.
Tokenizer Trait Extensions
lib/llm/src/tokenizers.rs, lib/llm/src/tokenizers/fastokens.rs, lib/llm/src/tokenizers/hf.rs, lib/llm/src/tokenizers/tiktoken.rs
Extended Encoder trait with encode_with_special_tokens() and Tokenizer trait with convert_ids_to_tokens(), with concrete implementations across FastTokenizer, HuggingFace, and TikToken backends. TikToken now precomputes token-to-string decoders at construction.
Model & Template Configuration
lib/llm/src/preprocessor/prompt/template.rs, lib/llm/src/entrypoint/input/http.rs
Added PromptFormatter::from_mdc_with_chat_template() supporting optional chat template overrides, and integrated model card saving immediately after model checksum computation in engine initialization.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The PR title 'feat: Nemo-rl support' does not align with the actual changes. The PR implements /tokenize and /detokenize endpoints, not Nemo-rl support. Update the title to reflect the actual changes, e.g., 'feat: Add /tokenize and /detokenize endpoints' or 'feat: Add tokenization endpoints for RL workflows'.
Description check ❓ Inconclusive The description is minimal but incomplete. It lacks required sections (Overview, Details, Where to start, Related Issues) specified in the template. Expand the description to follow the template structure: add an Overview section with context, a detailed Description section explaining the changes, guidance on where to start reviewing, and link related GitHub issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
lib/llm/src/protocols/openai/tokenization.rs (1)

87-92: Untagged enum order may affect ambiguous JSON deserialization.

With #[serde(untagged)], serde attempts deserialization in variant order. Since TokenizeCompletionRequest is listed first, a JSON object with both prompt and messages fields would deserialize as Completion (silently ignoring messages) due to serde trying Completion first.

This should work correctly in practice since valid requests typically contain only the fields for one variant, but consider adding a note in documentation or using #[serde(deny_unknown_fields)] at the enum level if strict validation is desired.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/protocols/openai/tokenization.rs` around lines 87 - 92, The
untagged enum TokenizeRequest (variants TokenizeCompletionRequest and
TokenizeChatRequest) can deserialize ambiguously because serde tries variants in
order; a JSON with both prompt and messages would match Completion first and
drop messages. Fix by either documenting this behavior on TokenizeRequest so
callers avoid mixed payloads or make deserialization stricter (e.g., add
#[serde(deny_unknown_fields)] to TokenizeCompletionRequest and
TokenizeChatRequest or change to a tagged enum) so mixed-field objects are
rejected; update the TokenizeRequest definition and related docs accordingly.
lib/llm/src/http/service/openai.rs (3)

2054-2068: Consider documenting root-level path choice.

The endpoints are registered at /tokenize and /detokenize (root level), while other OpenAI-compatible endpoints use /v1/ prefix (e.g., /v1/completions). This appears intentional for vllm compatibility, but a brief comment would clarify this design decision.

📝 Suggested documentation
+/// Create an Axum [`Router`] for the tokenization endpoints.
+/// Paths are `/tokenize` and `/detokenize` (root level, matching vllm API convention).
 pub fn tokenization_router(state: Arc<service_v2::State>) -> (Vec<RouteDoc>, Router) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 2054 - 2068, The
tokenization endpoints are registered at root-level paths (tokenize_path and
detokenize_path) inside tokenization_router which differs from the other
OpenAI-compatible routes that use the "/v1/" prefix; add a concise comment
immediately above the pub fn tokenization_router(...) explaining that the
root-level paths are intentional (for vllm compatibility or any other reason),
and mention the expected behavior/compatibility implications so future readers
know why tokenize and detokenize are not under "/v1/".

368-375: Empty join separator may concatenate text parts incorrectly.

When content is an Array, the text parts are joined with an empty string (""). If the array contains parts like ["Hello", " world"] (with leading space), this works correctly. However, if parts are ["Hello", "world"], they become "Helloworld".

Verify this matches the expected format of ChatCompletionRequestAssistantMessageContentPart::Text from the OpenAI spec. If parts represent discrete text segments, consider joining with a space or preserving original spacing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 368 - 375, The current
handling of Content::Array collects Part::Text pieces and joins them with an
empty separator, which can incorrectly concatenate words (e.g., "Hello" +
"world" => "Helloworld"); update the logic in the Content::Array branch (where
Part::Text is mapped) to preserve spacing between segments per the OpenAI
ChatCompletionRequestAssistantMessageContentPart::Text semantics—either join
with a single space or otherwise preserve original whitespace boundaries between
parts so discrete text segments do not run together.

405-414: rfind-based trimming assumes literal text match in rendered output.

The function uses rfind(trimmed_final_message) to locate where to truncate. This works if the chat template renders the assistant content verbatim, but some templates may escape, quote, or transform the content (e.g., XML escaping < to &lt;).

If the template transforms content, rfind will fail and return an internal server error. Consider adding a debug log or comment explaining this assumption.

📝 Suggested documentation
     // Use rfind to locate the last occurrence of the assistant content in the rendered prompt,
     // then truncate everything after it (e.g. EOS tokens, generation prompts). This assumes the
     // final assistant message text appears only once at the end of the rendered output.
+    // NOTE: This approach assumes the chat template renders assistant content verbatim without
+    // escaping or transformation. Templates that modify content (e.g., XML escaping) will fail.
     let Some(final_msg_loc) = rendered_prompt.rfind(trimmed_final_message) else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/openai.rs` around lines 405 - 414, The rfind-based
trim in the continue-final-message logic (where
rendered_prompt.rfind(trimmed_final_message) is used) assumes the assistant
content appears verbatim; update the failure branch to log debug information
(using the module's logger) including a sanitized preview of rendered_prompt and
trimmed_final_message and then return the ErrorMessage::internal_server_error
with a clearer hint about possible escaping/transformations; also add an inline
comment near the rfind call documenting this assumption so future readers know
why a literal match may fail and to guide implementing an escaping-aware
fallback if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 310-339: Add a concise in-code comment inside
resolve_tokenizer_model_name explaining the fallback hierarchy: that when no
requested_model is provided the function first prefers
state.manager().model_display_names() (models known to the serving layer), and
only if that yields a single name falls back to
state.manager().get_model_cards() (card metadata which may exist before a model
is fully registered); mention that this is intentional to allow tokenizer
endpoints to use card-only metadata while noting the subtle risk that card names
may not map to serving-capable models and referencing has_model_any,
ErrorMessage::model_not_found, and bad_request so future maintainers understand
why the two checks exist and the order must be preserved.

In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 103-110: The module defines TokenizeResponse but is missing
DetokenizeRequest and DetokenizeResponse, causing imports in openai.rs to fail;
add two struct definitions named DetokenizeRequest and DetokenizeResponse
immediately after TokenizeResponse: DetokenizeRequest should carry a
Vec<TokenIdType> (e.g., tokens) and an optional model info if needed, and
DetokenizeResponse should contain the resulting String (e.g., text) and any
optional metadata (match fields used by the detokenize handler in openai.rs).
Ensure both derive the same traits as TokenizeResponse (Debug, Clone, Serialize,
Deserialize, PartialEq, Eq) and use serde attributes consistent with token_strs
for optional fields so the detokenize handler compiles.

In `@lib/llm/src/tokenizers/tiktoken.rs`:
- Around line 103-105: The TikTokenTokenizer encode() currently calls
self.encode_with_special_tokens(input, true) which includes special tokens by
default and is inconsistent with HuggingFaceTokenizer and FastTokenizer; change
TikTokenTokenizer::encode to call self.encode_with_special_tokens(input, false)
so the default behavior excludes special tokens like the other backends (or
alternatively add explicit documentation on this intentional difference if you
prefer to keep true).

---

Nitpick comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 2054-2068: The tokenization endpoints are registered at root-level
paths (tokenize_path and detokenize_path) inside tokenization_router which
differs from the other OpenAI-compatible routes that use the "/v1/" prefix; add
a concise comment immediately above the pub fn tokenization_router(...)
explaining that the root-level paths are intentional (for vllm compatibility or
any other reason), and mention the expected behavior/compatibility implications
so future readers know why tokenize and detokenize are not under "/v1/".
- Around line 368-375: The current handling of Content::Array collects
Part::Text pieces and joins them with an empty separator, which can incorrectly
concatenate words (e.g., "Hello" + "world" => "Helloworld"); update the logic in
the Content::Array branch (where Part::Text is mapped) to preserve spacing
between segments per the OpenAI
ChatCompletionRequestAssistantMessageContentPart::Text semantics—either join
with a single space or otherwise preserve original whitespace boundaries between
parts so discrete text segments do not run together.
- Around line 405-414: The rfind-based trim in the continue-final-message logic
(where rendered_prompt.rfind(trimmed_final_message) is used) assumes the
assistant content appears verbatim; update the failure branch to log debug
information (using the module's logger) including a sanitized preview of
rendered_prompt and trimmed_final_message and then return the
ErrorMessage::internal_server_error with a clearer hint about possible
escaping/transformations; also add an inline comment near the rfind call
documenting this assumption so future readers know why a literal match may fail
and to guide implementing an escaping-aware fallback if needed.

In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 87-92: The untagged enum TokenizeRequest (variants
TokenizeCompletionRequest and TokenizeChatRequest) can deserialize ambiguously
because serde tries variants in order; a JSON with both prompt and messages
would match Completion first and drop messages. Fix by either documenting this
behavior on TokenizeRequest so callers avoid mixed payloads or make
deserialization stricter (e.g., add #[serde(deny_unknown_fields)] to
TokenizeCompletionRequest and TokenizeChatRequest or change to a tagged enum) so
mixed-field objects are rejected; update the TokenizeRequest definition and
related docs accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b5291155-c3ef-4acf-affa-8df89d991f48

📥 Commits

Reviewing files that changed from the base of the PR and between 4abafff and d7d5857.

📒 Files selected for processing (10)
  • lib/llm/src/entrypoint/input/http.rs
  • lib/llm/src/http/service/openai.rs
  • lib/llm/src/http/service/service_v2.rs
  • lib/llm/src/preprocessor/prompt/template.rs
  • lib/llm/src/protocols/openai.rs
  • lib/llm/src/protocols/openai/tokenization.rs
  • lib/llm/src/tokenizers.rs
  • lib/llm/src/tokenizers/fastokens.rs
  • lib/llm/src/tokenizers/hf.rs
  • lib/llm/src/tokenizers/tiktoken.rs

Comment thread lib/llm/src/http/service/openai.rs
Comment thread lib/llm/src/protocols/openai/tokenization.rs
Comment thread lib/llm/src/tokenizers/tiktoken.rs
Comment thread lib/llm/src/entrypoint/input/http.rs Outdated
manager.save_model_card(
&format!("__local_model_card_{}", model.display_name()),
model.card().clone(),
)?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original motivation was to support the local/in-process models, but we can probably get rid of that.

@grahamking

Copy link
Copy Markdown
Contributor

@jthomson04 In the spirit of recent Sync, if you have the agent plan file, could you attach it?

@jthomson04

Copy link
Copy Markdown
Contributor Author

@jthomson04 In the spirit of recent Sync, if you have the agent plan file, could you attach it?

I didn't make a plan file for this one, but I'll make sure to do that in future MRs

@jthomson04 jthomson04 requested a review from a team as a code owner April 5, 2026 02:23
@jthomson04 jthomson04 requested a review from a team April 5, 2026 02:23
jthomson04 and others added 11 commits April 16, 2026 21:28
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Add try/except fallbacks for import paths that moved between vLLM 0.18
and 0.19 (TokensPrompt, MultiModalUUIDDict) and relax the version pin
to accept both releases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The JailedStream accumulates content while detecting tool-call markers,
but previously discarded any logprobs attached to jailed chunks. This
meant that when `logprobs` was requested, tool-call responses would come
back with `logprobs: null` even though the backend had produced them.

Accumulate logprobs alongside content in ChoiceJailState and attach them
to the emitted choice when the jail ends (both mid-stream unjail and
stream-finish flush). Also propagate logprobs through
`create_tool_call_choice` so the final tool-call chunk carries the full
token-level log-probability data.

Adds three unit tests covering single-chunk, multi-chunk, and mixed
text+tool_call scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
@jthomson04 jthomson04 force-pushed the jthomson04/tokenize-endpoint branch from 4cad176 to d3d39f5 Compare April 17, 2026 05:16
Aphoh added a commit that referenced this pull request Apr 17, 2026
…from #7699

Cherry-picks the following from jthomson04/tokenize-endpoint:
- POST /v1/tokenize and /v1/detokenize HTTP endpoints
- Tokenizer trait: encode_with_special_tokens(), convert_ids_to_tokens()
- return_tokens_as_token_ids parameter for chat completions
- Multi-instance tokenize fix (discovery watcher)
- Jail logprobs preservation through tool-call jailing
Aphoh added a commit that referenced this pull request Apr 17, 2026
…robs

Add return_tokens_as_token_ids support to the SGLang decode handler,
mirroring what PR #7699 added for vLLM. When enabled, logprob token
fields are returned as "token_id:<id>" instead of decoded text.

Changes:
- decode_handler.py: Read return_tokens_as_token_ids from output_options,
  pass through _process_token_stream to _extract_logprobs, format token
  strings accordingly
- sglang_processor.py: Forward return_tokens_as_token_ids through
  _build_dynamo_preproc output_options
- vllm/handlers.py: Remove debug print left in cherry-picked code
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
biswapanda added a commit that referenced this pull request Apr 21, 2026
…7699)

Manually cherry-picked the /tokenize and /detokenize endpoint code from
ai-dynamo/dynamo PR #7699 (jthomson04/tokenize-endpoint). The full PR
had merge conflicts due to divergent chat_completions.rs refactoring,
so only the tokenization-specific changes were applied:

- New file: protocols/openai/tokenization.rs (request/response types)
- Tokenizer trait additions: encode_with_special_tokens, convert_ids_to_tokens
- HTTP routes: POST /tokenize, POST /detokenize
- Route wiring in service_v2.rs

These endpoints follow vLLM's /tokenize and /detokenize semantics,
enabling consistent tokenization between the Rust frontend and Python
clients for RL training.
biswapanda added a commit that referenced this pull request Apr 24, 2026
All rl_admin functionality has been re-implemented natively in the Dynamo
Rust frontend (lib/llm):

  - Admin routes (/pause, /resume, /update_weights, /health, /ready,
    /load_lora_adapter, /unload_lora_adapter) → openai.rs /v1/rl/* router
  - Token injection (prompt_token_ids, choice.token_ids) → handler_chat_completions
  - Tokenization → /v1/tokenize + /v1/detokenize (cherry from #7699)

Prime-RL now points admin_base_url directly at the Dynamo frontend
(/v1/rl); no separate Python sidecar is deployed. The module has zero
imports across the codebase and is not referenced in any Dockerfile,
Helm chart, or K8s manifest.
biswapanda added a commit that referenced this pull request May 8, 2026
Cleanup-only — no behavior change. Strips review-tracker noise that
accumulated on top of PR-added text during iteration:

  - "Closes hhzhang16 HH-19/HH-21/HH-22/HH-23/HH-25/HH-26/HH-27"
  - "CR-8 / CR-9 / CR-10 closure" prefixes on serde-error / doc-attach fixes
  - Branch-name references: bis/parity-tokenize-tcp, bis/prime-rl-merged
  - Internal PR numbers: #6094, #7699, #8197, #9141
  - Phase numbers from internal design docs (rl-support.md Phase 1/4/5)
  - "prime-rl" mentions in narrative copy and mermaid diagrams →
    generic "RL trainer / RL orchestrator / external client"

Technical content (semantics, invariants, why-this-exists rationale)
preserved everywhere; only the internal-process scaffolding is removed.

Scope verification: every removed line is one this branch ADDED
(diff main..HEAD shows the removed text on a `+` line). No edits land
on pre-existing main-branch comments. Specifically reverted the
nvext.rs cleanup attempt — its target lines (GAIE Stage 1/2,
SGLang-specific) live on main, not in this PR's diff.

Files touched:
  components/src/dynamo/vllm/handlers.py            +9  -10
  components/src/dynamo/vllm/worker_factory.py      +6  -4
  docs/dynamo-RL-api.md                             +19 -32
  lib/llm/src/http/service/openai.rs                +32 -34
  lib/llm/src/protocols/openai/chat_completions/delta.rs  +4 -4
  lib/llm/src/protocols/openai/completions/delta.rs       +3 -3
  lib/llm/src/protocols/openai/validate.rs                +20 -20

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
biswapanda added a commit that referenced this pull request May 8, 2026
…e shutdowns, tier the docs

Stack-ranked review fixes from the latest Graham-style pass on this branch.
Tier 4 (commit message conventions and tokenizer-crate scope) deliberately
deferred.

Tier 1 — correctness / contract claims
  - Drop the word "atomic" from the `/v1/rl/update_weights` docblock and
    spell out the partial-failure semantics: workers `0..N-1` may have
    switched while worker `N` failed; per-worker status is in the
    response, true rollback is a follow-up.
  - Reject `n > 1` on `/v1/chat/completions` when RL token IDs are
    requested. The streaming aggregator's `completion_token_ids` Vec is
    shared across all choices, so per-choice promotion downstream cannot
    recover which tokens belong to which choice with `n > 1`. Hard-reject
    is the interim guard until the keyed-by-index refactor lands.
  - `update_weights` empty `weight_dir` (`""`) is now treated the same
    as null/missing (NCCL-mode no-op) instead of being forwarded to the
    engine as `path=""`.

Tier 2 — hard rules
  - Hot-path `[RL]` `logger.info` → `logger.debug` for pause / resume /
    flush_cache / weights-load / LoRA load / LoRA unload (8 sites). RL
    trainers fire these per training step; info-level was a log flood.
  - Extract the duplicated 4-line `EngineDeadError` shutdown stanza into
    `BaseWorkerHandler._shutdown_on_engine_dead(e) -> NoReturn` and
    collapse the 9 call sites (8 RL handlers + 1 generate path) to a
    single line each. ~32 lines removed.
  - Strip the remaining internal-tracker comments missed by an earlier
    chore-scrub: `service_v2.rs` had `jthomson04 PR #7699`,
    `bis-dev/design-docs/rl-support.md §1`, and `hhzhang16 HH-22 / HH-26`
    references in two places. Replaced with neutral prose.
  - Strip the SGLang-coordination comment in `handlers.py` ("Signatures
    intentionally line up with the SGLang RL admin routes"). The kind of
    line that goes stale when SGLang's admin set drifts.
  - Delete dead-code carcasses for the dropped routes
    (`/v1/chat/completions/tokens`, `/v1/tokenize`, `/v1/detokenize`):
    remove `tokenize`, `detokenize`, `tokenization_router`,
    `chat_completions_tokens_router`, `handler_chat_completions_tokens`
    (~240 lines), drop `pub mod tokenization;`, delete the
    `tokenization.rs` module file (124 lines), drop unused tokenize-type
    imports. All five were behind `#[allow(dead_code)]` and unmounted.

Tier 3 — tests
  - Make `RlState::new(...)` a pub(super) test-friendly constructor so
    handler-level tests don't need `from_env` / process env vars.
  - Convert `RlPauseQuery::mode: Option<String>` to `Option<PauseMode>`,
    a typed enum with `serde(rename_all = "lowercase")`. Axum now returns
    400 on `mode=foo` before the handler runs; the runtime string match
    is gone.
  - Add four behavior tests in `mod tests`:
      test_pause_mode_serde_roundtrip
      test_pause_mode_rejects_unknown_value
      test_rl_update_weights_body_defaults
      test_rl_state_new_constructs_without_env
  - Fix unrelated test struct-init breakage that shipped earlier in
    this branch: 26 sites across 11 files were missing the
    `prompt_token_ids`, `return_token_ids`, `tokens`, and
    `completion_token_ids` fields added to the response/request/nvext
    structs. `cargo test -p dynamo-llm` now compiles cleanly.

Tier 5 — design / nits
  - `DYN_RL_LIVENESS_TIMEOUT_MS` is read once in `RlState::from_env` and
    cached as `RlState.probe_timeout: Duration`; `rl_ready` and
    `rl_liveness` use the cached value instead of re-parsing the env on
    every request.
  - `rl_ready` worker-probe body simplified from a four-link
    `.ok().and_then(Result::ok).map(...).unwrap_or(false)` chain to a
    `match (timeout, send_result)` that surfaces the timeout/network
    distinction for a future log line.

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --lib: 58 passed (4 new RL tests).
cargo test -p dynamo-llm --tests integration suites: all green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend::vllm Relates to the vllm backend feat frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` multimodal planner size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants