feat: kv aware router + disagg router + prefill queue by tedzhouhk · Pull Request #11 · ai-dynamo/dynamo

tedzhouhk · 2025-03-04T18:58:03Z

Integrate the KV-aware router into vLLM disaggregated serving (NIXL)
Implement a naive heuristic-based disagg router with an etcd watcher in Rust, and integrate it into vllm-nixl disagg via Python bindings
Add a prefill queue and pull-based prefill for load balancing
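
The description does not spell out the routing heuristic, so the following is only a minimal sketch of how a heuristic disagg decision and a pull-based prefill queue could fit together. Everything in it is an assumption made for illustration: the names `DisaggRouterConfig`, `should_prefill_remotely`, `PrefillQueue`, and `route`, the two thresholds, and the in-process `queue.Queue` that stands in for a shared queue. The actual router in this PR is written in Rust and picks up its configuration through an etcd watcher; here the configuration is a plain dataclass.

```python
import queue
from dataclasses import dataclass


@dataclass
class DisaggRouterConfig:
    # Hypothetical thresholds; the PR does not state its actual heuristics.
    max_local_prefill_length: int = 512
    max_prefill_queue_size: int = 8


def should_prefill_remotely(prompt_len, prefix_hit_len, queue_size, cfg):
    """Prefill remotely only if the uncached prompt is long and the queue has room."""
    uncached = prompt_len - prefix_hit_len
    return (
        uncached > cfg.max_local_prefill_length
        and queue_size < cfg.max_prefill_queue_size
    )


class PrefillQueue:
    """In-process stand-in for a queue shared by decode and prefill workers."""

    def __init__(self):
        self._q = queue.Queue()

    def push(self, request_id, prompt_len):
        self._q.put((request_id, prompt_len))

    def pull(self):
        """Called by an idle prefill worker; returns None when the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def size(self):
        return self._q.qsize()


def route(request_id, prompt_len, prefix_hit_len, prefill_queue, cfg):
    """Return "remote" after enqueueing the request, or "local" otherwise."""
    if should_prefill_remotely(prompt_len, prefix_hit_len, prefill_queue.size(), cfg):
        prefill_queue.push(request_id, prompt_len)
        return "remote"
    return "local"
```

In this sketch, load balancing comes from the pull side: prefill workers call `pull()` whenever they are idle, so no component has to track per-worker load, and a full queue pushes work back to local prefill.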

rmccorm4

Approving to unblock, but will review more in depth later.

Please call out any known issues or known areas to follow up on, if any.

rmccorm4

Need to fix copyright and precommits

Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: hongkuan <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>

…onal Dynamo frontend and backend both now run without needing etcd. The next step is making them talk to each other. Some features such as KV routing still require etcd. Discovered and removed old unused `DisaggregatedRouter`. Added in #11 ! Signed-off-by: Graham King <grahamk@nvidia.com>

Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: rrubin <rrubin@nvidia.com>

Quick-win review fixes from PR #9131. Heavy-lift items (#9 prompt_token_ids env-gate, #11 update_weights atomicity, #13 per-choice completion_token_ids) tracked separately as follow-ups.

handlers.py
- Catch EngineDeadError before the generic except in all 8 RL handlers (pause/resume/liveness_probe/get_state/flush_cache/update_weights_from_path/load_lora_adapter/unload_lora_adapter): match the existing shutdown pattern in this file so admin calls also surface engine death instead of leaving a broken worker alive.
- get_state: fall back to a no-op collective_rpc when check_health is absent — same fallback liveness_probe already uses, otherwise older engines without check_health always look alive.
- load_lora_adapter hot-swap path: a remove_lora() failure now returns a 400-style error response (was: silent log warn + continue, leaving add_lora to no-op against the still-registered ID); a reset_prefix_cache() failure after add_lora succeeds also returns error (was: log error and continue, leaving stale KV from the old adapter routable).
- unload_lora_adapter: an unregister_model() failure after engine remove_lora succeeds now returns error (was: log warn and report success, leaving model=<lora_name> still routed to this worker even though _resolve_lora_request would now fall back to the base model).

container/deps/vllm/install_vllm.sh
- Pin prime-rl install to an immutable commit SHA (d49f3939e7dca29bceb9ed515cc1782497b67e81 ↔ tag v0.5.1.dev101) so a re-pointed tag upstream can't change what we ship. PRIME_RL_REF kept in build logs for human readability; PRIME_RL_COMMIT is the authoritative pin.
- Replace `echo "\n=== ..."` with `printf '\n=== ...\n'` (shellcheck SC2028).

lib/llm/src/http/service/openai.rs
- Force `request.inner.logprobs = Some(true)` unconditionally in both RL token-id promotion blocks (was: only when None). RL extraction of completion_token_ids depends on logprobs being on at the engine; an explicit logprobs=false would otherwise silently drop them.
- Bound `/v1/rl/ready` per-worker probes with a 5s timeout (override via DYN_RL_LIVENESS_TIMEOUT_MS). Was reusing the shared 600s http_client, so one wedged worker could block readiness for 10 minutes instead of failing fast as 503.
- Tokenize Chat handler: call `request.validate()?` before `merged_chat_template_kwargs()` so the continue_final_message + add_generation_prompt mutual-exclusion constraint is enforced (validate() existed but was never invoked).

lib/llm/src/protocols/openai/chat_completions.rs
- Update stale doc comments on the legacy `tokens` and `return_token_ids` fields: they pointed callers at the now-404 `/v1/chat/completions/tokens` URI. Direct callers to the canonical top-level `prompt_token_ids` extension and `nvext.extra_fields` instead.

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --test test_common_ext: 15 passed.

Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping: makes the dev environment shareable across teammates by removing personal identifiers, parameterizing all dev-pod / probe references, and folding 11 review fixes into the three Phase 1-4 design documents.

Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
  - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline usage instructions.
  - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
  - 5 `scripts/inspect_*.py` cluster probes parameterized via `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit) rather than a hard-coded namespace.
  - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...` one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally not committed" convention (matches `examples/deployments/powerplanner/.tmp-gp-minimal.yaml`) and to block the four common root-level personal-scratch files from sneaking back in via `git add .`.

dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new `deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an `envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic `kubectl wait --for=jsonpath='{.status.state}'=successful` form in both §4 and the Checklist (removes the manual-`-w` divergence).

Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
  - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 - 590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still draft.` with last-validation date.
  - §3.1: added an explicit note that `aic_interpolation` and `mode` are pre-existing PlannerConfig fields (owned by `monitoring/aic_interpolation.py`) and intentionally absent from the Names Registry.
  - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so it correctly points at the *throughput regression* case rather than reading as a blanket "config revert".
  - §6.5 pseudocode: replaced module-level `self.device_count` with `pynvml.nvmlDeviceGetCount()` (the original was syntactically incorrect outside the daemon class).
  - §13 Open Question #11: corrected the duplicated `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching §5.3.
* powerplanner-testbed-design.md
  - §7: added a "Numeric-suffix convention" note explaining the intentional D21/E21 ID collision (IDs unique by filename + tuple, not renumbered).
  - §11: corrected "Six guards" -> "Seven guards" to match the seven items actually listed.
  - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
  - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper + 30 parametrized) so the listing matches the actual run.

Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py + test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing Windows-only environment limitations (cp1252 codec, `os.killpg` POSIX-only, missing `filterpy`) confirmed via `git stash` to be untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed `tests/fault_tolerance/...` (pre-existing, separate component) and `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>

tedzhouhk and others added 30 commits February 18, 2025 15:51

initial impl of disagg router

eb5adc1

fix precommit

f5f3b8c

init example

d80e1d6

add nixl to dockerfile.vllm

fb357ca

add nixl torch example

e51ff32

wip vllm with nixl

d6fba17

first working nixl conditional prefill

f959b01

add readme

0ff5e41

use callback for remote prefill req

c93a152

wip tp > 1

7c8f728

add nixl metadata struct

1fbc202

update readme with tp > 1

2fe0710

decode run with MQLLMEngine

d976011

decode on triton

be5cd23

triton dummy prefill

96cde49

triton prefill

1438fb4

update readme

40959d7

remove nixl torch example

594161d

update todos

d72fc11

update todos

350b831

add http endpoint

07c00a3

remove remote prefill response

46ecc50

update todos

2edc85d

exchange metadata over fs

8624ac5

do not restrict mem

0a76d5e

update readme

9d33228

update dockerfile with nixl changes

78615eb

update patch

4d6088a

Merge branch 'main' of github.com:triton-inference-server/triton_distributed into ptarasiewicz/vllm-nixl

e46aeda

update trd package name

93250b9

rmccorm4 approved these changes Mar 9, 2025

rmccorm4 reviewed Mar 9, 2025

Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou/disagg_router

b409c20

fix: copyright

139047b

fix: precommit

5285cf0

fix: precommit

f6f8d3f

fix: merge issue

6679a04

tedzhouhk enabled auto-merge (squash) March 9, 2025 00:56

tedzhouhk merged commit 039f9a5 into main Mar 9, 2025

tedzhouhk deleted the hzhou/disagg_router branch March 9, 2025 01:09

grahamking mentioned this pull request Oct 30, 2025

chore: Remove old DisaggregatedRouter, making etcd presence optional #4011

Merged

tanmayv25 mentioned this pull request Apr 15, 2026

DEP: Backend Interface -- LLMEngine ABC and Worker #8251

Open

tedzhouhk merged 82 commits into main from hzhou/disagg_router


6 participants
