refactor: vllm v1 examples #1756
Walkthrough

This update restructures the vLLM example deployment system by removing legacy configuration files, utility modules, and component classes and replacing them with a new modular architecture. New scripts and Python modules are introduced for launching and managing distributed and disaggregated deployments, with enhanced argument parsing, protocol handling, and metrics publishing. Documentation is expanded and updated for multi-node and distributed setups.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Ingress
    participant Router
    participant Worker as Worker (Decode/Prefill)
    participant MetricsPublisher
    User->>Ingress: HTTP request (chat/completions)
    Ingress->>Router: Route request (if enabled)
    Router->>Worker: Forward request
    Worker->>Worker: (Optional) Prefill/Decode coordination
    Worker->>MetricsPublisher: Publish metrics/events
    Worker-->>User: Streamed token responses
```
Actionable comments posted: 10
🔭 Outside diff range comments (1)
lib/bindings/python/src/dynamo/_core.pyi (1)
372-376: Inconsistent method rename: update all `create_service()` calls to `create_endpoint()`

We’ve verified that the `create_service` → `create_endpoint` rename in `lib/bindings/python/src/dynamo/_core.pyi` isn’t reflected across the codebase. There are dozens of call sites still invoking `component.create_service()`, for example:

- launch/dynamo-run (`vllm_v1_inc.py`, `vllm_inc.py`, `sglang_inc.py`, `trtllm_inc.py`)
- SDK core (`deploy/sdk/src/dynamo/sdk/core/runner/dynamo.py`, `.../cli/serve_dynamo.py`, `.../core/lib.py`, `.../protocol/interface.py`)
- Python bindings tests (`lib/bindings/python/tests/test_kv_bindings.py`)
- Examples (all under `lib/bindings/python/examples/…` and `examples/…`)

Action items:

- Replace every `await component.create_service()` with `await component.create_endpoint()` (or the equivalent synchronous calls)
- If backward compatibility is needed, add a `create_service = create_endpoint` alias with a deprecation warning in the runtime implementation
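The backward-compatibility suggestion could look roughly like the following sketch. The `Component` class here is a hypothetical stand-in for illustration; the real class lives in the Rust bindings, so the alias would actually be added there:

```python
import asyncio
import warnings


class Component:
    """Stand-in for the bindings' Component class (hypothetical, for illustration)."""

    async def create_endpoint(self, name):
        # New API: create an endpoint on this component.
        return f"endpoint:{name}"

    async def create_service(self, name):
        # Deprecated alias kept for backward compatibility.
        warnings.warn(
            "create_service() is deprecated; use create_endpoint() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return await self.create_endpoint(name)


async def main():
    component = Component()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = await component.create_service("generate")
    assert result == "endpoint:generate"
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
    print("deprecated alias forwards correctly")


asyncio.run(main())
```

Existing callers keep working while the warning nudges them toward the new name; the alias can be dropped in a later release.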
🧹 Nitpick comments (14)
examples/vllm_v1/ds-r1-DEP.md (2)
1-3: Replace placeholder with a minimal working doc before merge

A top-level `TODO` with no description will confuse users browsing the new example. Even one short paragraph (scope, prerequisites, expected topology) is better than an empty stub.
4-12: Spell out variable placeholders & make commands copy-paste-safe

`<node 0 addr>` is ambiguous. Either:

- Explain exactly which interface/IP is expected, or
- Replace it with an environment variable (`$MASTER_ADDR`) so the snippet can be copy-pasted unchanged.

The same comment applies to both node-specific command blocks.
lib/llm/src/kv_router/scheduler.rs (1)
336-343: Good addition, but emit at `debug` to avoid log spam

Including `overlap_blocks` and `new_blocks` in the formula string is valuable for troubleshooting, yet every request now produces an `info`-level line. Unless this data is required in production dashboards, consider downgrading to `debug!` (or gating behind `if tracing::enabled!(Level::DEBUG)`).

```diff
-        tracing::info!(
+        tracing::debug!(
```

examples/vllm_v1/launch/agg.sh (2)
1-10: Robustness: terminate all children & fail fast

The current `cleanup()` only kills the ingress; the worker stays running if it daemonises or ignores SIGINT. The script also continues on any error, which can leave background processes orphaned.

```diff
-#!/bin/bash
+#!/usr/bin/env bash
+set -euo pipefail

 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID 2>/dev/null || true
-    wait $DYNAMO_PID 2>/dev/null || true
+    # send TERM to whole process-group
+    kill -- -$$ 2>/dev/null || true
+    wait 2>/dev/null || true
 }
```
16-18: Capture the worker PID for symmetry

Even if the worker runs in the foreground, saving its PID allows `cleanup()` to `wait` for graceful shutdown and avoids races if the script is backgrounded later.

```diff
-python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager
+python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager &
+WORKER_PID=$!
+wait $WORKER_PID
```

examples/vllm_v1/launch/agg_router.sh (1)
16-21: Make both workers symmetrical & scalable

Running one worker in the foreground limits the script to two GPUs. Background both and push their PIDs into an array for easy extension.

```diff
-CUDA_VISIBLE_DEVICES=0 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager &
-WORKER_PID=$!
-
-CUDA_VISIBLE_DEVICES=1 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager
+declare -a WORKER_PIDS=()
+for gpu in 0 1; do
+    CUDA_VISIBLE_DEVICES=$gpu python3 components/main.py \
+        --model Qwen/Qwen3-0.6B --enforce-eager &
+    WORKER_PIDS+=($!)
+done
+wait "${WORKER_PIDS[@]}"
```

examples/vllm_v1/launch/disagg.sh (1)
16-23: Align with aggregated scripts for consistency

Consider backgrounding both the prefill and decode workers and collecting their PIDs in an array; this keeps the control flow homogeneous across all example launchers and makes scaling to more than two GPUs trivial (loop over the GPU list).

examples/vllm_v1/launch/dep.sh (1)
15-16: Consider using a production build path.

The script uses a debug build path, which may not be appropriate for production deployments. Consider using a release build or making the build type configurable.

```diff
-DYN_LOG=debug $HOME/dynamo/.build/target/debug/dynamo-run in=http out=dyn --router-mode kv &
+DYN_LOG=debug dynamo run in=http out=dyn --router-mode kv &
```

Or make it configurable:

```diff
+DYNAMO_BUILD_TYPE=${DYNAMO_BUILD_TYPE:-release}
-DYN_LOG=debug $HOME/dynamo/.build/target/debug/dynamo-run in=http out=dyn --router-mode kv &
+DYN_LOG=debug $HOME/dynamo/.build/target/$DYNAMO_BUILD_TYPE/dynamo-run in=http out=dyn --router-mode kv &
```

examples/vllm_v1/components/protocol.py (1)
26-30: Consider removing commented-out fields or documenting their purpose.

The commented-out fields suggest future extensions but may create confusion. Consider either removing them or adding a comment explaining their planned usage.

```diff
+    # Future extensions (commented out until needed):
     # lora_request: Optional[LoRARequest] = None
     # encoder_prompt: Optional[str] = None
     # encoder_prompt_token_ids: Optional[List[int]] = None
     # num_cached_tokens: Optional[int] = None
     # multi_modal_placeholders: Optional[MultiModalPlaceholderDict] = None
```

examples/vllm_v1/multi-node.md (1)
87-99: Complete the incomplete TODO section.

The "Large Model Deployment" section contains only a TODO placeholder and a partial example, leaving users without guidance for tensor-parallelism scenarios.

Would you like me to help complete this section with proper examples for large model deployment requiring tensor parallelism across multiple nodes?
examples/vllm_v1/README.md (2)
70-82: Add a language specification to the fenced code block.

The architecture diagram code block should specify a language to address the static-analysis hint and improve rendering.

Apply this diff to specify the language:

````diff
-```
+```text
 +------+      +-----------+      +------------------+              +---------------+
 | HTTP |----->|  dynamo   |----->|   vLLM Worker    |------------->| vLLM Prefill  |
 |      |<-----|  ingress  |<-----|                  |<-------------|    Worker     |
````
84-84: Consider more formal language.

The static-analysis tool suggests replacing "get spawned" with "are spawned" for a more formal documentation style.

Apply this diff:

```diff
-Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
+Note: The above architecture illustrates all the components. The final components that are spawned depend upon the chosen deployment pattern.
```

examples/vllm_v1/components/main.py (2)
52-81: Add type hints for better code clarity.

Consider adding type hints to improve maintainability and IDE support:

```diff
-def setup_vllm_engine(config, stat_logger=None):
+def setup_vllm_engine(config: Config, stat_logger: Optional[StatLoggerFactory] = None) -> tuple[AsyncLLM, VllmConfig, dict]:
```
112-112: Track the TODO for the `register_prefill` implementation.

The TODO comment indicates missing functionality for registering prefill workers, similar to `register_llm`.

Would you like me to create an issue to track implementing the `register_prefill` functionality?
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
- `container/Dockerfile.vllm_v1` (1 hunks)
- `examples/vllm_v1/README.md` (2 hunks)
- `examples/vllm_v1/components/args.py` (1 hunks)
- `examples/vllm_v1/components/frontend.py` (0 hunks)
- `examples/vllm_v1/components/handlers.py` (1 hunks)
- `examples/vllm_v1/components/main.py` (1 hunks)
- `examples/vllm_v1/components/protocol.py` (1 hunks)
- `examples/vllm_v1/components/publisher.py` (1 hunks)
- `examples/vllm_v1/components/simple_load_balancer.py` (0 hunks)
- `examples/vllm_v1/components/worker.py` (0 hunks)
- `examples/vllm_v1/configs/agg.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/agg.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/agg_dp.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/disagg.yaml` (0 hunks)
- `examples/vllm_v1/configs/deepseek_r1/disagg_dp.yaml` (0 hunks)
- `examples/vllm_v1/configs/disagg.yaml` (0 hunks)
- `examples/vllm_v1/configs/disagg_planner.yaml` (0 hunks)
- `examples/vllm_v1/ds-r1-DEP.md` (1 hunks)
- `examples/vllm_v1/graphs/agg.py` (0 hunks)
- `examples/vllm_v1/graphs/disagg.py` (0 hunks)
- `examples/vllm_v1/graphs/disagg_planner.py` (0 hunks)
- `examples/vllm_v1/launch/agg.sh` (1 hunks)
- `examples/vllm_v1/launch/agg_router.sh` (1 hunks)
- `examples/vllm_v1/launch/dep.sh` (1 hunks)
- `examples/vllm_v1/launch/disagg.sh` (1 hunks)
- `examples/vllm_v1/launch/disagg_router.sh` (1 hunks)
- `examples/vllm_v1/launch/dsr1_dep.sh` (1 hunks)
- `examples/vllm_v1/multi-node.md` (1 hunks)
- `examples/vllm_v1/utils/args.py` (0 hunks)
- `examples/vllm_v1/utils/protocol.py` (0 hunks)
- `lib/bindings/python/src/dynamo/_core.pyi` (2 hunks)
- `lib/llm/src/kv_router/publisher.rs` (6 hunks)
- `lib/llm/src/kv_router/scheduler.rs` (1 hunks)
💤 Files with no reviewable changes (15)
- examples/vllm_v1/graphs/disagg_planner.py
- examples/vllm_v1/graphs/agg.py
- examples/vllm_v1/configs/disagg.yaml
- examples/vllm_v1/utils/args.py
- examples/vllm_v1/configs/disagg_planner.yaml
- examples/vllm_v1/configs/deepseek_r1/disagg.yaml
- examples/vllm_v1/configs/deepseek_r1/agg_dp.yaml
- examples/vllm_v1/graphs/disagg.py
- examples/vllm_v1/components/frontend.py
- examples/vllm_v1/utils/protocol.py
- examples/vllm_v1/configs/agg.yaml
- examples/vllm_v1/configs/deepseek_r1/agg.yaml
- examples/vllm_v1/configs/deepseek_r1/disagg_dp.yaml
- examples/vllm_v1/components/simple_load_balancer.py
- examples/vllm_v1/components/worker.py
🧰 Additional context used
🧠 Learnings (15)
examples/vllm_v1/ds-r1-DEP.md (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/launch/agg.sh (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
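As a rough illustration of the calling convention this learning describes, here is a toy version of the injection pattern. The real `@dynamo_worker()` lives in the dynamo runtime and injects an actual `DistributedRuntime`; this sketch only mimics the mechanics:

```python
import asyncio
import functools


def dynamo_worker():
    """Toy decorator: injects a runtime object as the first argument (illustrative only)."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            runtime = object()  # stand-in for the real DistributedRuntime
            return await func(runtime, *args, **kwargs)
        return wrapper
    return decorator


@dynamo_worker()
async def get_metrics(runtime, log_dir):
    # The caller never passes `runtime`; the wrapper supplies it.
    return (runtime is not None, log_dir)


print(asyncio.run(get_metrics("/tmp/logs")))  # → (True, '/tmp/logs')
```

This is why `get_metrics(log_dir)` is the correct call shape even though the function signature lists `runtime` first.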
examples/vllm_v1/launch/disagg.sh (1)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/launch/dep.sh (1)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
lib/bindings/python/src/dynamo/_core.pyi (2)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
examples/vllm_v1/multi-node.md (2)
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
examples/vllm_v1/launch/dsr1_dep.sh (3)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
examples/vllm_v1/launch/agg_router.sh (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
lib/llm/src/kv_router/scheduler.rs (4)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
examples/vllm_v1/launch/disagg_router.sh (2)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
container/Dockerfile.vllm_v1 (1)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1177
File: container/Dockerfile.vllm:102-105
Timestamp: 2025-05-28T22:54:46.875Z
Learning: In Dockerfiles, when appending to environment variables that may not exist in the base image, Docker validation will fail if you reference undefined variables with ${VARIABLE} syntax. In such cases, setting the environment variable directly (e.g., ENV CPATH=/usr/include) rather than appending is the appropriate approach.
examples/vllm_v1/components/main.py (2)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/README.md (3)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1409
File: examples/router_standalone/worker.py:171-186
Timestamp: 2025-06-08T08:30:45.126Z
Learning: Example code in the `examples/` directory may intentionally use hard-coded values or simplified implementations that wouldn't be appropriate for production code, but are acceptable for demonstration and testing purposes.
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
examples/vllm_v1/components/publisher.py (1)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
lib/llm/src/kv_router/publisher.rs (7)
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
🧬 Code Graph Analysis (4)
examples/vllm_v1/launch/agg.sh (3)
- examples/vllm_v1/launch/agg_router.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg_router.sh (1): `cleanup` (7-11)
examples/vllm_v1/launch/disagg.sh (3)
- examples/vllm_v1/launch/agg.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/agg_router.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg_router.sh (1): `cleanup` (7-11)
examples/vllm_v1/launch/disagg_router.sh (3)
- examples/vllm_v1/launch/agg.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/agg_router.sh (1): `cleanup` (5-9)
- examples/vllm_v1/launch/disagg.sh (1): `cleanup` (5-9)
examples/vllm_v1/components/handlers.py (2)
- examples/vllm_v1/components/protocol.py (1): `MyRequestOutput` (8-31)
- lib/bindings/python/src/dynamo/runtime/logging.py (1): `configure_dynamo_logging` (77-105)
🪛 LanguageTool
examples/vllm_v1/README.md
[style] ~84-~84: The verb “get” can be informal. Consider replacing it with a form of “to be”.
Context: ...e components. The final components that get spawned depend upon the chosen deployment patte...
(GET_USED_ARE_USED)
🪛 markdownlint-cli2 (0.17.2)
examples/vllm_v1/README.md
70-70: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (22)
lib/bindings/python/src/dynamo/_core.pyi (1)
49-49: API simplification looks good. The removal of the `path` parameter from the `namespace` method simplifies the API surface. The method now only requires a `name` parameter, which is more intuitive.
examples/vllm_v1/launch/disagg_router.sh (1)
7-12: Good cleanup function implementation. The cleanup function properly handles arrays of worker PIDs and includes appropriate error suppression. The trap setup covers all relevant signals.
examples/vllm_v1/launch/dep.sh (2)
7-12: Good cleanup function implementation. The cleanup function properly handles arrays of worker PIDs and includes appropriate error suppression. The trap setup covers all relevant signals.
32-34: Proper script termination handling. The script correctly waits for all background processes to finish and provides user feedback about the startup process.
examples/vllm_v1/components/protocol.py (2)
8-16: Well-designed serialization wrapper. The `MyRequestOutput` class provides a clean solution for serializing vLLM's non-serializable `RequestOutput`. The docstring clearly explains the purpose and rationale.
17-17: Confirm necessity of `arbitrary_types_allowed=True`. The `arbitrary_types_allowed=True` setting disables Pydantic’s built-in type validation. I attempted to import the vLLM types in a sandbox (e.g., `CompletionOutput`, `PromptLogprobs`, `RequestMetrics`), but the `vllm` module wasn’t available, so I couldn’t auto-verify their Pydantic compatibility. Please confirm whether these types truly require bypassing validation or if they can be modeled directly by Pydantic.

- File: `examples/vllm_v1/components/protocol.py`, line 17

examples/vllm_v1/multi-node.md (1)
32-38: Validate environment variable setup aligns with multi-node patterns. The environment variable setup looks correct based on learned patterns. Setting `HEAD_NODE_IP` on all nodes and using it for `NATS_SERVER` and `ETCD_ENDPOINTS` follows the established pattern where worker nodes need to connect to the head node's external IP.

lib/llm/src/kv_router/publisher.rs (4)
197-197: Enhanced trace logging improves debugging capabilities. The addition of trace logging with `worker_id` and event data provides valuable debugging information for the event processing flow.
250-251: Good addition of exit tracking variables. The `exit_reason` and `messages_processed` variables enhance observability by providing clear insights into listener lifecycle and processed message counts.
456-458: Properly integrated `data_parallel_rank` field. The new `data_parallel_rank` field with serde alias `dp_rank` is correctly added to support future data parallelism features. The field is appropriately marked as ignored for now and properly integrated into test helpers.
259-259: Appropriate log level adjustment. Changing the cancellation log from info to debug reduces noise in production logs while maintaining debugging capabilities.
examples/vllm_v1/launch/dsr1_dep.sh (3)
56-62: Proper argument validation ensures script reliability. The validation checks for required arguments (`--num-nodes`, `--node-rank`, `--gpus-per-node`) with clear error messages provide a good user experience and prevent runtime errors.
78-83: Robust cleanup function ensures proper process termination. The cleanup function properly handles background process termination with signal handling and wait commands, following best practices for bash scripts managing multiple processes.
95-113: Correct data parallel rank calculation and GPU assignment. The loop correctly calculates data parallel ranks (`dp_rank=$((i + NODE_RANK * GPUS_PER_NODE))`) and assigns CUDA devices appropriately. The environment variables for vLLM are properly configured for the distributed setup.

examples/vllm_v1/components/args.py (3)
41-57: Well-structured Config class encapsulates all necessary parameters. The Config class properly separates Dynamo-specific parameters from vLLM engine arguments, providing a clean abstraction for configuration management.
79-89: Explicit default overwriting approach is reasonable for integration. The decision to always overwrite defaults rather than respecting all user arguments simplifies the integration with vLLM while ensuring Dynamo-specific configurations are properly applied. The error handling for missing attributes is appropriate.
131-137: Proper endpoint validation prevents runtime errors. The endpoint parsing logic correctly validates the expected format and provides clear error messages for invalid inputs, helping users understand the required format.
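The validation described above can be sketched as follows. This is an illustrative stand-in, not the actual code in `args.py`; the `dyn://` prefix and the three-part `namespace.component.endpoint` format are assumptions based on the surrounding examples:

```python
from typing import Tuple

def parse_endpoint(endpoint: str) -> Tuple[str, str, str]:
    """Split an endpoint string into (namespace, component, endpoint_name).

    Accepts either 'dyn://ns.comp.ep' or the bare 'ns.comp.ep' form.
    """
    path = endpoint.removeprefix("dyn://")
    parts = path.split(".")
    if len(parts) != 3:
        # Clear error message so users understand the required format
        raise ValueError(
            f"Invalid endpoint format: '{endpoint}'. "
            "Expected 'dyn://namespace.component.endpoint'."
        )
    namespace, component, name = parts
    return namespace, component, name

print(parse_endpoint("dyn://dynamo.backend.generate"))
```

A malformed input such as `dyn://bad.format` raises `ValueError` immediately at startup instead of failing later at request time.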
examples/vllm_v1/README.md (2)
18-21: Clear title and scope definition. The updated title and description clearly define the scope as LLM deployment examples using vLLM with Dynamo integration, providing good context for users.
87-137: Comprehensive deployment examples with clear requirements. The categorized deployment examples with GPU requirements and shell script usage provide clear, actionable guidance for users. The tip about dynamic worker addition is particularly helpful.
examples/vllm_v1/components/main.py (1)
169-172: Configure ZMQ host and verify endpoint-offset logic. To avoid hard-coding “127.0.0.1” at replace time and to ensure your offset logic handles all deployment scenarios:
• Add at the top of examples/vllm_v1/components/main.py:

```python
import os
```

• Before calling offset_endpoint_port, read from an env var (with fallback):

```python
zmq_host = os.environ.get("VLLM_ZMQ_HOST", "127.0.0.1")
```

• Update the endpoint construction:

```diff
--- a/examples/vllm_v1/components/main.py
+++ b/examples/vllm_v1/components/main.py
@@ -166,6 +166,8 @@
 # …existing imports…
+import os
+
 zmq_endpoint = ZmqEventPublisher.offset_endpoint_port(
     config.engine_args.kv_events_config.endpoint,
     data_parallel_rank=config.engine_args.data_parallel_rank or 0,
-).replace("*", "127.0.0.1")
+).replace("*", zmq_host)
```

• Manually inspect ZmqEventPublisher.offset_endpoint_port (its implementation wasn’t found via search) to confirm it correctly:
– Parses wildcard hosts (e.g. “*”)
– Applies port offsets in multi-node or containerized setups
– Handles IPv6 or non-localhost bindings

Please verify that your host configuration and offset logic cover all intended deployment scenarios.
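For illustration, a minimal version of such an offset helper might look like the sketch below. This is an assumption about the behavior, not the actual `ZmqEventPublisher.offset_endpoint_port` implementation, and it deliberately handles only the simple `scheme://host:port` case:

```python
import os

def offset_endpoint_port(endpoint: str, data_parallel_rank: int = 0) -> str:
    """Shift the port of a ZMQ endpoint like 'tcp://*:5557' by the DP rank.

    Splits on the last ':' so wildcard hosts ('*') pass through unchanged.
    Note: bracketed IPv6 hosts would need extra handling in real code.
    """
    scheme, rest = endpoint.split("://", 1)
    host, port = rest.rsplit(":", 1)
    return f"{scheme}://{host}:{int(port) + data_parallel_rank}"

# Resolve the bind host from the environment instead of hard-coding it;
# VLLM_ZMQ_HOST is a hypothetical variable name used for this sketch.
zmq_host = os.environ.get("VLLM_ZMQ_HOST", "127.0.0.1")
endpoint = offset_endpoint_port("tcp://*:5557", data_parallel_rank=2)
print(endpoint.replace("*", zmq_host))
```

With the default host, rank 2 turns `tcp://*:5557` into `tcp://127.0.0.1:5559`, giving each data-parallel rank its own port.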
examples/vllm_v1/components/publisher.py (1)
59-61: Verify `record` signature consistency with `StatLoggerBase`. All `record` overrides must match the parameter types declared in `StatLoggerBase.record`. Please confirm the base-class signature in the vllm package and then align your implementations accordingly:

• examples/vllm_v1/components/publisher.py – NullStatLogger.record (around line 30)
• examples/vllm_v1/components/publisher.py – DynamoStatLoggerPublisher.record (lines 59–61)

If `StatLoggerBase.record` is defined as:

```python
def record(
    self,
    scheduler_stats: Optional[SchedulerStats],
    iteration_stats: Optional[IterationStats],
): ...
```

then update DynamoStatLoggerPublisher to:
```diff
-    def record(
-        self, scheduler_stats: SchedulerStats, iteration_stats: Optional[IterationStats]
-    ):
+    def record(
+        self,
+        scheduler_stats: Optional[SchedulerStats],
+        iteration_stats: Optional[IterationStats],
+    ):
```

Otherwise, if the base class requires a non-optional `scheduler_stats`, adjust `NullStatLogger.record` to use `SchedulerStats` instead of `Optional[SchedulerStats]`.

Please verify and ensure all overrides exactly match the base-class definition.
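The override-compatibility point can be demonstrated with stand-in types. `SchedulerStats` and `IterationStats` below are placeholders, not the real vLLM classes, and the base signature is the `Optional` variant hypothesized above:

```python
from abc import ABC, abstractmethod
from typing import Optional

class SchedulerStats:  # placeholder for vllm's SchedulerStats
    pass

class IterationStats:  # placeholder for vllm's IterationStats
    pass

class StatLoggerBase(ABC):
    @abstractmethod
    def record(
        self,
        scheduler_stats: Optional[SchedulerStats],
        iteration_stats: Optional[IterationStats],
    ) -> None: ...

class DynamoStatLoggerPublisher(StatLoggerBase):
    def record(
        self,
        scheduler_stats: Optional[SchedulerStats],
        iteration_stats: Optional[IterationStats],
    ) -> None:
        # Matching the Optional contract means callers may pass None for
        # either argument without tripping a type checker or a None-access bug.
        if scheduler_stats is None and iteration_stats is None:
            return
        # ...publish metrics here...

DynamoStatLoggerPublisher().record(None, None)  # accepted under the Optional contract
```

Running `mypy` over both the base and the subclass is a cheap way to catch a signature drift like this before it reaches runtime.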
examples/vllm_v1/components/handlers.py (1)
100-116: Well-implemented background task management. The background task implementation follows best practices:
- Proper error handling with sleep to avoid tight loops
- Clean cancellation in the cleanup method
- Logging for debugging
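The three practices above can be sketched with stdlib asyncio only; the class and method names here are illustrative, not the ones in `handlers.py`:

```python
import asyncio

class MetricsTask:
    """Background loop with error handling, backoff sleep, and clean cancellation."""

    def __init__(self) -> None:
        self._task: "asyncio.Task | None" = None
        self.iterations = 0

    async def _loop(self) -> None:
        while True:
            try:
                self.iterations += 1  # stand-in for real metrics/publishing work
            except Exception as exc:
                print(f"metrics loop error: {exc}")  # log instead of crashing
            await asyncio.sleep(0.01)  # sleep so repeated errors never tight-loop

    def start(self) -> None:
        self._task = asyncio.create_task(self._loop())

    async def cleanup(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass  # expected on cancellation

async def main() -> int:
    task = MetricsTask()
    task.start()
    await asyncio.sleep(0.05)  # let the loop run a few iterations
    await task.cleanup()
    return task.iterations

print(asyncio.run(main()))
```

Awaiting the cancelled task inside `cleanup` guarantees the loop has fully stopped before shutdown proceeds, which avoids "task was destroyed but it is pending" warnings.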
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com> Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com>
made me happy made me smile
Overview:
Revamp the vLLM example to use dynamo-run and be fully Python-runnable.
Key Features: