refactor: vllm v1 examples #1756
Walkthrough

This update restructures the vLLM example deployment system by removing legacy configuration files, utility modules, and component classes and replacing them with a new modular architecture. New scripts and Python modules are introduced for launching and managing distributed and disaggregated deployments, with enhanced argument parsing, protocol handling, and metrics publishing. Documentation is expanded and updated for multi-node and distributed setups.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Ingress
    participant Router
    participant Worker as Worker (Decode/Prefill)
    participant MetricsPublisher
    User->>Ingress: HTTP request (chat/completions)
    Ingress->>Router: Route request (if enabled)
    Router->>Worker: Forward request
    Worker->>Worker: (Optional) Prefill/Decode coordination
    Worker->>MetricsPublisher: Publish metrics/events
    Worker-->>User: Streamed token responses
```
Actionable comments posted: 10
🔭 Outside diff range comments (1)
lib/bindings/python/src/dynamo/_core.pyi (1)
372-376: Inconsistent method rename: update all `create_service()` calls to `create_endpoint()`

We’ve verified that the `create_service` → `create_endpoint` rename in `lib/bindings/python/src/dynamo/_core.pyi` isn’t reflected across the codebase. There are dozens of call sites still invoking `component.create_service()`, for example:

- launch/dynamo-run (`vllm_v1_inc.py`, `vllm_inc.py`, `sglang_inc.py`, `trtllm_inc.py`)
- SDK core (`deploy/sdk/src/dynamo/sdk/core/runner/dynamo.py`, `.../cli/serve_dynamo.py`, `.../core/lib.py`, `.../protocol/interface.py`)
- Python bindings tests (`lib/bindings/python/tests/test_kv_bindings.py`)
- Examples (all under `lib/bindings/python/examples/…` and `examples/…`)

Action items:

- Replace every `await component.create_service()` with `await component.create_endpoint()` (or the equivalent synchronous calls)
- If backward compatibility is needed, add a `create_service = create_endpoint` alias with a deprecation warning in the runtime implementation
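The backward-compatibility suggestion could look roughly like the following sketch. The `Component` class here is a hypothetical stand-in for illustration; the real class lives in the Rust bindings, so the alias would actually be added there:

```python
import asyncio
import warnings


class Component:
    """Stand-in for the bindings' Component class (hypothetical, for illustration)."""

    async def create_endpoint(self, name):
        # New API: create an endpoint on this component.
        return f"endpoint:{name}"

    async def create_service(self, name):
        # Deprecated alias kept for backward compatibility.
        warnings.warn(
            "create_service() is deprecated; use create_endpoint() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return await self.create_endpoint(name)


async def main():
    component = Component()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = await component.create_service("generate")
    assert result == "endpoint:generate"
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
    print("deprecated alias forwards correctly")


asyncio.run(main())
```

Existing callers keep working while the warning nudges them toward the new name; the alias can be dropped in a later release.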
🧹 Nitpick comments (14)
examples/vllm_v1/ds-r1-DEP.md (2)
1-3: Replace placeholder with a minimal working doc before merge

A top-level `TODO` with no description will confuse users browsing the new example. Even one short paragraph (scope, prerequisites, expected topology) is better than an empty stub.
4-12: Spell out variable placeholders & make commands copy-paste-safe

`<node 0 addr>` is ambiguous. Either:

- Explain exactly which interface/IP is expected, or
- Replace it with an environment variable (`$MASTER_ADDR`) so the snippet can be copy-pasted unchanged.

The same comment applies to both node-specific command blocks.
lib/llm/src/kv_router/scheduler.rs (1)
336-343: Good addition, but emit at `debug` to avoid log spam

Including `overlap_blocks` and `new_blocks` in the formula string is valuable for troubleshooting, yet every request now produces an `info`-level line. Unless this data is required in production dashboards, consider downgrading to `debug!` (or gating behind `if tracing::enabled!(Level::DEBUG)`).

```diff
-        tracing::info!(
+        tracing::debug!(
```

examples/vllm_v1/launch/agg.sh (2)
1-10: Robustness: terminate all children & fail fast

The current `cleanup()` only kills the ingress; the worker stays running if it daemonises or ignores SIGINT. The script also continues on any error, which can leave background processes orphaned.

```diff
-#!/bin/bash
+#!/usr/bin/env bash
+set -euo pipefail

 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID 2>/dev/null || true
-    wait $DYNAMO_PID 2>/dev/null || true
+    # send TERM to whole process-group
+    kill -- -$$ 2>/dev/null || true
+    wait 2>/dev/null || true
 }
```
16-18: Capture the worker PID for symmetry

Even if the worker runs in the foreground, saving its PID allows `cleanup()` to `wait` for graceful shutdown and avoids races if the script is backgrounded later.

```diff
-python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager
+python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager &
+WORKER_PID=$!
+wait $WORKER_PID
```

examples/vllm_v1/launch/agg_router.sh (1)
16-21: Make both workers symmetrical & scalable

Running one worker in the foreground limits the script to two GPUs. Background both and push their PIDs into an array for easy extension.

```diff
-CUDA_VISIBLE_DEVICES=0 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager &
-WORKER_PID=$!
-
-CUDA_VISIBLE_DEVICES=1 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager
+declare -a WORKER_PIDS=()
+for gpu in 0 1; do
+    CUDA_VISIBLE_DEVICES=$gpu python3 components/main.py \
+        --model Qwen/Qwen3-0.6B --enforce-eager &
+    WORKER_PIDS+=($!)
+done
+wait "${WORKER_PIDS[@]}"
```

examples/vllm_v1/launch/disagg.sh (1)
16-23: Align with aggregated scripts for consistency

Consider backgrounding both the prefill and decode workers and collecting their PIDs in an array; this keeps the control flow homogeneous across all example launchers and makes scaling to more than two GPUs trivial (loop over the GPU list).

examples/vllm_v1/launch/dep.sh (1)
15-16: Consider using a production build path.

The script uses a debug build path, which may not be appropriate for production deployments. Consider using a release build or making the build type configurable.

```diff
-DYN_LOG=debug $HOME/dynamo/.build/target/debug/dynamo-run in=http out=dyn --router-mode kv &
+DYN_LOG=debug dynamo run in=http out=dyn --router-mode kv &
```

Or make it configurable:

```diff
+DYNAMO_BUILD_TYPE=${DYNAMO_BUILD_TYPE:-release}
-DYN_LOG=debug $HOME/dynamo/.build/target/debug/dynamo-run in=http out=dyn --router-mode kv &
+DYN_LOG=debug $HOME/dynamo/.build/target/$DYNAMO_BUILD_TYPE/dynamo-run in=http out=dyn --router-mode kv &
```

examples/vllm_v1/components/protocol.py (1)
26-30: Consider removing commented-out fields or documenting their purpose.

The commented-out fields suggest future extensions but may create confusion. Consider either removing them or adding a comment explaining their planned usage.

```diff
+    # Future extensions (commented out until needed):
     # lora_request: Optional[LoRARequest] = None
     # encoder_prompt: Optional[str] = None
     # encoder_prompt_token_ids: Optional[List[int]] = None
     # num_cached_tokens: Optional[int] = None
     # multi_modal_placeholders: Optional[MultiModalPlaceholderDict] = None
```

examples/vllm_v1/multi-node.md (1)
87-99: Complete the incomplete TODO section.

The "Large Model Deployment" section contains only a TODO placeholder and a partial example, leaving users without guidance for tensor-parallelism scenarios.

Would you like me to help complete this section with proper examples for large model deployment requiring tensor parallelism across multiple nodes?
examples/vllm_v1/README.md (2)
70-82: Add a language specification to the fenced code block.

The architecture diagram code block should specify a language to address the static-analysis hint and improve rendering.

Apply this diff to specify the language:

````diff
-```
+```text
 +------+      +-----------+      +------------------+              +---------------+
 | HTTP |----->|  dynamo   |----->|   vLLM Worker    |------------->| vLLM Prefill  |
 |      |<-----|  ingress  |<-----|                  |<-------------|    Worker     |
````
84-84: Consider more formal language.

The static-analysis tool suggests replacing "get spawned" with "are spawned" for a more formal documentation style.

Apply this diff:

```diff
-Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
+Note: The above architecture illustrates all the components. The final components that are spawned depend upon the chosen deployment pattern.
```

examples/vllm_v1/components/main.py (2)
52-81: Add type hints for better code clarity.

Consider adding type hints to improve maintainability and IDE support:

```diff
-def setup_vllm_engine(config, stat_logger=None):
+def setup_vllm_engine(config: Config, stat_logger: Optional[StatLoggerFactory] = None) -> tuple[AsyncLLM, VllmConfig, dict]:
```
112-112: Track the TODO for the `register_prefill` implementation.

The TODO comment indicates missing functionality for registering prefill workers, similar to `register_llm`.

Would you like me to create an issue to track implementing the `register_prefill` functionality?
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
- `container/Dockerfile.vllm_v1` (1 hunks)
- `examples/vllm_v1/README.md` (2 hunks)
- `examples/vllm_v1/components/args.py` (1 hunks)
- `examples/vllm_v1/components/frontend.py` (0 hunks)
- `examples/vllm_v1/components/handlers.py` (1 hunks)
- `examples/vllm_v1/components/main.py` (1 hunks)
- `examples/vllm_v1/components/protocol.py` (1 hunks)
- `examples/vllm_v1/components/publisher.py` (1 hunks)
- `examples/vllm_v1/components/simple_load_balancer.py` (0 hunks)
- `examples/vllm_v1/components/worker.py` (0 hunks)
- `examples/vllm_v1/configs/agg.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/agg.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/agg_dp.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/disagg.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/disagg_dp.yaml` (0 hunks)
- `examples/vllm_v1/configs/disagg.yaml` (0 hunks)
- `examples/vllm_v1/configs/disagg_planner.yaml` (0 hunks)
- `examples/vllm_v1/ds-r1-DEP.md` (1 hunks)
- `examples/vllm_v1/graphs/agg.py` (0 hunks)
- `examples/vllm_v1/graphs/disagg.py` (0 hunks)
- `examples/vllm_v1/graphs/disagg_planner.py` (0 hunks)
- `examples/vllm_v1/launch/agg.sh` (1 hunks)
- `examples/vllm_v1/launch/agg_router.sh` (1 hunks)
- `examples/vllm_v1/launch/dep.sh` (1 hunks)
- `examples/vllm_v1/launch/disagg.sh` (1 hunks)
- `examples/vllm_v1/launch/disagg_router.sh` (1 hunks)
- `examples/vllm_v1/launch/dsr1_dep.sh` (1 hunks)
- `examples/vllm_v1/multi-node.md` (1 hunks)
- `examples/vllm_v1/utils/args.py` (0 hunks)
- `examples/vllm_v1/utils/protocol.py` (0 hunks)
- `lib/bindings/python/src/dynamo/_core.pyi` (2 hunks)
- `lib/llm/src/kv_router/publisher.rs` (6 hunks)
- `lib/llm/src/kv_router/scheduler.rs` (1 hunks)
💤 Files with no reviewable changes (15)
- examples/vllm_v1/graphs/disagg_planner.py
- examples/vllm_v1/graphs/agg.py
- examples/vllm_v1/configs/disagg.yaml
- examples/vllm_v1/utils/args.py
- examples/vllm_v1/configs/disagg_planner.yaml
- examples/vllm_v1/configs/deepseek_r1/disagg.yaml
- examples/vllm_v1/configs/deepseek_r1/agg_dp.yaml
- examples/vllm_v1/graphs/disagg.py
- examples/vllm_v1/components/frontend.py
- examples/vllm_v1/utils/protocol.py
- examples/vllm_v1/configs/agg.yaml
- examples/vllm_v1/configs/deepseek_r1/agg.yaml
- examples/vllm_v1/configs/deepseek_r1/disagg_dp.yaml
- examples/vllm_v1/components/simple_load_balancer.py
- examples/vllm_v1/components/worker.py
🧰 Additional context used
🧠 Learnings (15)
examples/vllm_v1/ds-r1-DEP.md (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/launch/agg.sh (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
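As a rough illustration of the calling convention this learning describes, here is a toy version of the injection pattern. The real `@dynamo_worker()` lives in the dynamo runtime and injects an actual `DistributedRuntime`; this sketch only mimics the mechanics:

```python
import asyncio
import functools


def dynamo_worker():
    """Toy decorator: injects a runtime object as the first argument (illustrative only)."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            runtime = object()  # stand-in for the real DistributedRuntime
            return await func(runtime, *args, **kwargs)
        return wrapper
    return decorator


@dynamo_worker()
async def get_metrics(runtime, log_dir):
    # The caller never passes `runtime`; the wrapper supplies it.
    return (runtime is not None, log_dir)


print(asyncio.run(get_metrics("/tmp/logs")))  # → (True, '/tmp/logs')
```

This is why `get_metrics(log_dir)` is the correct call shape even though the function signature lists `runtime` first.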
examples/vllm_v1/launch/disagg.sh (1)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/launch/dep.sh (1)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
lib/bindings/python/src/dynamo/_core.pyi (2)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
examples/vllm_v1/multi-node.md (2)
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
examples/vllm_v1/launch/dsr1_dep.sh (3)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
examples/vllm_v1/launch/agg_router.sh (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
lib/llm/src/kv_router/scheduler.rs (4)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
examples/vllm_v1/launch/disagg_router.sh (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
container/Dockerfile.vllm_v1 (1)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1177
File: container/Dockerfile.vllm:102-105
Timestamp: 2025-05-28T22:54:46.875Z
Learning: In Dockerfiles, when appending to environment variables that may not exist in the base image, Docker validation will fail if you reference undefined variables with ${VARIABLE} syntax. In such cases, setting the environment variable directly (e.g., ENV CPATH=/usr/include) rather than appending is the appropriate approach.
examples/vllm_v1/components/main.py (2)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/README.md (3)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1409
File: examples/router_standalone/worker.py:171-186
Timestamp: 2025-06-08T08:30:45.126Z
Learning: Example code in the `examples/` directory may intentionally use hard-coded values or simplified implementations that wouldn't be appropriate for production code, but are acceptable for demonstration and testing purposes.
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/components/publisher.py (1)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
lib/llm/src/kv_router/publisher.rs (7)
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
🧬 Code Graph Analysis (4)
examples/vllm_v1/launch/agg.sh (3)
- examples/vllm_v1/launch/agg_router.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg_router.sh (1): `cleanup` (7-11)
examples/vllm_v1/launch/disagg.sh (3)
- examples/vllm_v1/launch/agg.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/agg_router.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg_router.sh (1): `cleanup` (7-11)
examples/vllm_v1/launch/disagg_router.sh (3)
- examples/vllm_v1/launch/agg.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/agg_router.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg.sh (1): `cleanup` (5-9)
examples/vllm_v1/components/handlers.py (2)
- examples/vllm_v1/components/protocol.py (1): `MyRequestOutput` (8-31)
- lib/bindings/python/src/dynamo/runtime/logging.py (1): `configure_dynamo_logging` (77-105)
🪛 LanguageTool
examples/vllm_v1/README.md
[style] ~84-~84: The verb “get” can be informal. Consider replacing it with a form of “to be”.
Context: ...e components. The final components that get spawned depend upon the chosen deployment patte...
(GET_USED_ARE_USED)
🪛 markdownlint-cli2 (0.17.2)
examples/vllm_v1/README.md
70-70: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (22)
lib/bindings/python/src/dynamo/_core.pyi (1)
49-49: API simplification looks good. The removal of the `path` parameter from the `namespace` method simplifies the API surface. The method now only requires a `name` parameter, which is more intuitive.
examples/vllm_v1/launch/disagg_router.sh (1)
7-12: Good cleanup function implementation. The cleanup function properly handles arrays of worker PIDs and includes appropriate error suppression. The trap setup covers all relevant signals.
examples/vllm_v1/launch/dep.sh (2)
7-12: Good cleanup function implementation. The cleanup function properly handles arrays of worker PIDs and includes appropriate error suppression. The trap setup covers all relevant signals.
32-34: Proper script termination handling. The script correctly waits for all background processes to finish and provides user feedback about the startup process.
examples/vllm_v1/components/protocol.py (2)
8-16: Well-designed serialization wrapper. The `MyRequestOutput` class provides a clean solution for serializing vLLM's non-serializable `RequestOutput`. The docstring clearly explains the purpose and rationale.
17-17: Confirm necessity of `arbitrary_types_allowed=True`. The `arbitrary_types_allowed=True` setting disables Pydantic’s built-in type validation. I attempted to import the vLLM types in a sandbox (e.g., `CompletionOutput`, `PromptLogprobs`, `RequestMetrics`), but the `vllm` module wasn’t available, so I couldn’t auto-verify their Pydantic compatibility. Please confirm whether these types truly require bypassing validation or if they can be modeled directly by Pydantic.

- File: `examples/vllm_v1/components/protocol.py`, line 17

examples/vllm_v1/multi-node.md (1)
32-38: Validate environment variable setup aligns with multi-node patterns. The environment variable setup looks correct based on learned patterns. Setting `HEAD_NODE_IP` on all nodes and using it for `NATS_SERVER` and `ETCD_ENDPOINTS` follows the established pattern where worker nodes need to connect to the head node's external IP.

lib/llm/src/kv_router/publisher.rs (4)
197-197: Enhanced trace logging improves debugging capabilities. The addition of trace logging with `worker_id` and event data provides valuable debugging information for the event processing flow.
250-251: Good addition of exit tracking variables. The `exit_reason` and `messages_processed` variables enhance observability by providing clear insights into listener lifecycle and processed message counts.
456-458: Properly integrated `data_parallel_rank` field. The new `data_parallel_rank` field with serde alias `dp_rank` is correctly added to support future data parallelism features. The field is appropriately marked as ignored for now and properly integrated into test helpers.
259-259: Appropriate log level adjustment. Changing the cancellation log from info to debug reduces noise in production logs while maintaining debugging capabilities.
examples/vllm_v1/launch/dsr1_dep.sh (3)
56-62: Proper argument validation ensures script reliability. The validation checks for required arguments (`--num-nodes`, `--node-rank`, `--gpus-per-node`) with clear error messages provide a good user experience and prevent runtime errors.
78-83: Robust cleanup function ensures proper process termination. The cleanup function properly handles background process termination with signal handling and wait commands, following best practices for bash scripts managing multiple processes.
95-113: Correct data parallel rank calculation and GPU assignment. The loop correctly calculates data parallel ranks (`dp_rank=$((i + NODE_RANK * GPUS_PER_NODE))`) and assigns CUDA devices appropriately. The environment variables for vLLM are properly configured for the distributed setup.

examples/vllm_v1/components/args.py (3)
41-57: Well-structured Config class encapsulates all necessary parameters. The Config class properly separates Dynamo-specific parameters from vLLM engine arguments, providing a clean abstraction for configuration management.
79-89: Explicit default overwriting approach is reasonable for integration. The decision to always overwrite defaults rather than respecting all user arguments simplifies the integration with vLLM while ensuring Dynamo-specific configurations are properly applied. The error handling for missing attributes is appropriate.
131-137: Proper endpoint validation prevents runtime errors. The endpoint parsing logic correctly validates the expected format and provides clear error messages for invalid inputs, helping users understand the required format.
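The validation described above can be sketched as follows. This is an illustrative stand-in, not the actual code in `args.py`; the `dyn://` prefix and the three-part `namespace.component.endpoint` format are assumptions based on the surrounding examples:

```python
from typing import Tuple

def parse_endpoint(endpoint: str) -> Tuple[str, str, str]:
    """Split an endpoint string into (namespace, component, endpoint_name).

    Accepts either 'dyn://ns.comp.ep' or the bare 'ns.comp.ep' form.
    """
    path = endpoint.removeprefix("dyn://")
    parts = path.split(".")
    if len(parts) != 3:
        # Clear error message so users understand the required format
        raise ValueError(
            f"Invalid endpoint format: '{endpoint}'. "
            "Expected 'dyn://namespace.component.endpoint'."
        )
    namespace, component, name = parts
    return namespace, component, name

print(parse_endpoint("dyn://dynamo.backend.generate"))
```

A malformed input such as `dyn://bad.format` raises `ValueError` immediately at startup instead of failing later at request time.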
examples/vllm_v1/README.md (2)
18-21: Clear title and scope definition. The updated title and description clearly define the scope as LLM deployment examples using vLLM with Dynamo integration, providing good context for users.
87-137: Comprehensive deployment examples with clear requirements. The categorized deployment examples with GPU requirements and shell script usage provide clear, actionable guidance for users. The tip about dynamic worker addition is particularly helpful.
examples/vllm_v1/components/main.py (1)
169-172: Configure ZMQ host and verify endpoint-offset logic. To avoid hard-coding “127.0.0.1” at replace time and to ensure your offset logic handles all deployment scenarios:
• Add at the top of examples/vllm_v1/components/main.py:

```python
import os
```

• Before calling offset_endpoint_port, read from an env var (with fallback):

```python
zmq_host = os.environ.get("VLLM_ZMQ_HOST", "127.0.0.1")
```

• Update the endpoint construction:

```diff
--- a/examples/vllm_v1/components/main.py
+++ b/examples/vllm_v1/components/main.py
@@ -166,6 +166,8 @@
 # …existing imports…
+import os
+
 zmq_endpoint = ZmqEventPublisher.offset_endpoint_port(
     config.engine_args.kv_events_config.endpoint,
     data_parallel_rank=config.engine_args.data_parallel_rank or 0,
-).replace("*", "127.0.0.1")
+).replace("*", zmq_host)
```

• Manually inspect ZmqEventPublisher.offset_endpoint_port (its implementation wasn’t found via search) to confirm it correctly:
– Parses wildcard hosts (e.g. “*”)
– Applies port offsets in multi-node or containerized setups
– Handles IPv6 or non-localhost bindings

Please verify that your host configuration and offset logic cover all intended deployment scenarios.
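For illustration, a minimal version of such an offset helper might look like the sketch below. This is an assumption about the behavior, not the actual `ZmqEventPublisher.offset_endpoint_port` implementation, and it deliberately handles only the simple `scheme://host:port` case:

```python
import os

def offset_endpoint_port(endpoint: str, data_parallel_rank: int = 0) -> str:
    """Shift the port of a ZMQ endpoint like 'tcp://*:5557' by the DP rank.

    Splits on the last ':' so wildcard hosts ('*') pass through unchanged.
    Note: bracketed IPv6 hosts would need extra handling in real code.
    """
    scheme, rest = endpoint.split("://", 1)
    host, port = rest.rsplit(":", 1)
    return f"{scheme}://{host}:{int(port) + data_parallel_rank}"

# Resolve the bind host from the environment instead of hard-coding it;
# VLLM_ZMQ_HOST is a hypothetical variable name used for this sketch.
zmq_host = os.environ.get("VLLM_ZMQ_HOST", "127.0.0.1")
endpoint = offset_endpoint_port("tcp://*:5557", data_parallel_rank=2)
print(endpoint.replace("*", zmq_host))
```

With the default host, rank 2 turns `tcp://*:5557` into `tcp://127.0.0.1:5559`, giving each data-parallel rank its own port.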
examples/vllm_v1/components/publisher.py (1)
59-61: Verify `record` signature consistency with `StatLoggerBase`. All `record` overrides must match the parameter types declared in `StatLoggerBase.record`. Please confirm the base-class signature in the vllm package and then align your implementations accordingly:

• examples/vllm_v1/components/publisher.py – NullStatLogger.record (around line 30)
• examples/vllm_v1/components/publisher.py – DynamoStatLoggerPublisher.record (lines 59–61)

If `StatLoggerBase.record` is defined as:

```python
def record(
    self,
    scheduler_stats: Optional[SchedulerStats],
    iteration_stats: Optional[IterationStats],
): ...
```

then update DynamoStatLoggerPublisher to:
```diff
-    def record(
-        self, scheduler_stats: SchedulerStats, iteration_stats: Optional[IterationStats]
-    ):
+    def record(
+        self,
+        scheduler_stats: Optional[SchedulerStats],
+        iteration_stats: Optional[IterationStats],
+    ):
```

Otherwise, if the base class requires a non-optional `scheduler_stats`, adjust `NullStatLogger.record` to use `SchedulerStats` instead of `Optional[SchedulerStats]`.

Please verify and ensure all overrides exactly match the base-class definition.
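The override-compatibility point can be demonstrated with stand-in types. `SchedulerStats` and `IterationStats` below are placeholders, not the real vLLM classes, and the base signature is the `Optional` variant hypothesized above:

```python
from abc import ABC, abstractmethod
from typing import Optional

class SchedulerStats:  # placeholder for vllm's SchedulerStats
    pass

class IterationStats:  # placeholder for vllm's IterationStats
    pass

class StatLoggerBase(ABC):
    @abstractmethod
    def record(
        self,
        scheduler_stats: Optional[SchedulerStats],
        iteration_stats: Optional[IterationStats],
    ) -> None: ...

class DynamoStatLoggerPublisher(StatLoggerBase):
    def record(
        self,
        scheduler_stats: Optional[SchedulerStats],
        iteration_stats: Optional[IterationStats],
    ) -> None:
        # Matching the Optional contract means callers may pass None for
        # either argument without tripping a type checker or a None-access bug.
        if scheduler_stats is None and iteration_stats is None:
            return
        # ...publish metrics here...

DynamoStatLoggerPublisher().record(None, None)  # accepted under the Optional contract
```

Running `mypy` over both the base and the subclass is a cheap way to catch a signature drift like this before it reaches runtime.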
examples/vllm_v1/components/handlers.py (1)
100-116: Well-implemented background task management. The background task implementation follows best practices:
- Proper error handling with sleep to avoid tight loops
- Clean cancellation in the cleanup method
- Logging for debugging
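The three practices above can be sketched with stdlib asyncio only; the class and method names here are illustrative, not the ones in `handlers.py`:

```python
import asyncio

class MetricsTask:
    """Background loop with error handling, backoff sleep, and clean cancellation."""

    def __init__(self) -> None:
        self._task: "asyncio.Task | None" = None
        self.iterations = 0

    async def _loop(self) -> None:
        while True:
            try:
                self.iterations += 1  # stand-in for real metrics/publishing work
            except Exception as exc:
                print(f"metrics loop error: {exc}")  # log instead of crashing
            await asyncio.sleep(0.01)  # sleep so repeated errors never tight-loop

    def start(self) -> None:
        self._task = asyncio.create_task(self._loop())

    async def cleanup(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass  # expected on cancellation

async def main() -> int:
    task = MetricsTask()
    task.start()
    await asyncio.sleep(0.05)  # let the loop run a few iterations
    await task.cleanup()
    return task.iterations

print(asyncio.run(main()))
```

Awaiting the cancelled task inside `cleanup` guarantees the loop has fully stopped before shutdown proceeds, which avoids "task was destroyed but it is pending" warnings.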
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com> Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com>
made me happy made me smile
Overview:
Revamp the vLLM example to use dynamo-run and be fully Python-runnable.
Key Features: