feat: kv aware router + disagg router + prefill queue#11
Merged
tedzhouhk
commented
Mar 4, 2025
Contributor
- Integrate the KV-aware router into vLLM disaggregated serving (NIXL).
- Implement a naive heuristic-based disaggregated router with an etcd watcher in Rust, and integrate it into vLLM-NIXL disaggregated serving via Python bindings.
- Add a prefill queue with pull-based prefill for load balancing.
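The routing decision and the pull-based queue described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the real router is written in Rust, the etcd watcher is not modeled, and every name and threshold here (`PrefillRequest`, `should_prefill_remotely`, `max_local_prefill_len`, `max_queue_size`) is a hypothetical chosen for the example.

```python
import queue
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    request_id: str
    prompt_len: int        # total prompt tokens
    prefix_hit_len: int    # tokens assumed to be already cached on the decode worker


# Hypothetical in-process stand-in for the shared prefill queue.
prefill_queue: "queue.Queue[PrefillRequest]" = queue.Queue()


def should_prefill_remotely(req, queue_size, max_local_prefill_len=512, max_queue_size=8):
    """Assumed heuristic: send only long, uncached prefills to remote workers,
    and only while the prefill queue is not backed up."""
    uncached = req.prompt_len - req.prefix_hit_len
    return uncached > max_local_prefill_len and queue_size < max_queue_size


def route(req):
    """Return "remote" if the request was enqueued for a prefill worker, else "local"."""
    if should_prefill_remotely(req, prefill_queue.qsize()):
        prefill_queue.put(req)
        return "remote"
    return "local"


def prefill_worker_step():
    """Pull-based: an idle prefill worker takes the next request itself,
    so work flows to whichever worker has spare capacity."""
    try:
        return prefill_queue.get_nowait()
    except queue.Empty:
        return None
```

With these assumed thresholds, a 100-token prompt is prefilled locally, while a 2000-token uncached prompt is enqueued and later pulled by whichever prefill worker calls `prefill_worker_step()` first.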
…ributed into ptarasiewicz/vllm-nixl
rmccorm4
approved these changes
Mar 9, 2025
rmccorm4
reviewed
Mar 9, 2025
rmccorm4
left a comment
Contributor
Need to fix copyright and pre-commit issues.
…/disagg_router
kylehh
pushed a commit
to kylehh/dynamo
that referenced
this pull request
Apr 11, 2025
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: hongkuan <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
grahamking
added a commit
that referenced
this pull request
Oct 30, 2025
…onal Dynamo frontend and backend both now run without needing etcd. The next step is making them talk to each other. Some features such as KV routing still require etcd.
Discovered and removed old unused `DisaggregatedRouter`. Added in #11!
Signed-off-by: Graham King <grahamk@nvidia.com>
ranrubin
added a commit
that referenced
this pull request
Apr 20, 2026
Fixes all actionable items from the second review:
Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)
Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure)
Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed
Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred; complex due to conftest dependencies, low severity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: rrubin <rrubin@nvidia.com>
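Item #12 above (logging cleanup failures instead of swallowing them) could look roughly like this. It is a sketch only: the helper name `remove_tree_logged`, the log level, and the choice to treat a missing directory as success are assumptions, not taken from the commit.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def remove_tree_logged(path) -> bool:
    """Delete a directory tree and report failures instead of hiding them.

    Returns True when the tree is gone afterwards, False when cleanup failed.
    """
    try:
        shutil.rmtree(path)
        return True
    except FileNotFoundError:
        # Nothing to clean up; treated as success in this sketch.
        return True
    except OSError as exc:
        # Unlike ignore_errors=True, the failure is visible in the logs.
        logger.warning("cleanup of %s failed: %s", path, exc)
        return False
```

Returning a boolean keeps the helper usable in fixtures that should not fail a test on cleanup problems but still need the failure recorded.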
biswapanda
added a commit
that referenced
this pull request
May 8, 2026
Quick-win review fixes from PR #9131. Heavy-lift items (#9 prompt_token_ids env-gate, #11 update_weights atomicity, #13 per-choice completion_token_ids) tracked separately as follow-ups.
handlers.py
- Catch EngineDeadError before the generic except in all 8 RL handlers (pause/resume/liveness_probe/get_state/flush_cache/update_weights_from_path/load_lora_adapter/unload_lora_adapter): match the existing shutdown pattern in this file so admin calls also surface engine death instead of leaving a broken worker alive.
- get_state: fall back to a no-op collective_rpc when check_health is absent; this is the same fallback liveness_probe already uses, otherwise older engines without check_health always look alive.
- load_lora_adapter hot-swap path: a remove_lora() failure now returns a 400-style error response (was: silent log warn + continue, leaving add_lora to no-op against the still-registered ID); a reset_prefix_cache() failure after add_lora succeeds also returns error (was: log error and continue, leaving stale KV from the old adapter routable).
- unload_lora_adapter: an unregister_model() failure after engine remove_lora succeeds now returns error (was: log warn and report success, leaving model=<lora_name> still routed to this worker even though _resolve_lora_request would now fall back to the base model).
container/deps/vllm/install_vllm.sh
- Pin prime-rl install to an immutable commit SHA (d49f3939e7dca29bceb9ed515cc1782497b67e81 ↔ tag v0.5.1.dev101) so a re-pointed tag upstream can't change what we ship. PRIME_RL_REF kept in build logs for human readability; PRIME_RL_COMMIT is the authoritative pin.
- Replace `echo "\n=== ..."` with `printf '\n=== ...\n'` (shellcheck SC2028).
lib/llm/src/http/service/openai.rs
- Force `request.inner.logprobs = Some(true)` unconditionally in both RL token-id promotion blocks (was: only when None). RL extraction of completion_token_ids depends on logprobs being on at the engine; an explicit logprobs=false would otherwise silently drop them.
- Bound `/v1/rl/ready` per-worker probes with a 5s timeout (override via DYN_RL_LIVENESS_TIMEOUT_MS). Was reusing the shared 600s http_client, so one wedged worker could block readiness for 10 minutes instead of failing fast as 503.
- Tokenize Chat handler: call `request.validate()?` before `merged_chat_template_kwargs()` so the continue_final_message + add_generation_prompt mutual-exclusion constraint is enforced (validate() existed but was never invoked).
lib/llm/src/protocols/openai/chat_completions.rs
- Update stale doc comments on the legacy `tokens` and `return_token_ids` fields: they pointed callers at the now-404 `/v1/chat/completions/tokens` URI. Direct callers to the canonical top-level `prompt_token_ids` extension and `nvext.extra_fields` instead.
cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --test test_common_ext: 15 passed.
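The handlers.py change of catching EngineDeadError ahead of the generic handler relies on Python trying `except` clauses in order. A minimal sketch of that pattern follows; the `EngineDeadError` class here is a local stand-in rather than the engine's own type, and `run_admin_call` and `on_engine_dead` are hypothetical names for illustration.

```python
class EngineDeadError(RuntimeError):
    """Local stand-in for the engine's fatal error type (assumption for this sketch)."""


def run_admin_call(fn, on_engine_dead):
    """Run an admin operation, treating engine death differently from ordinary errors.

    The specific clause must come first: a leading `except Exception` would also
    match EngineDeadError and hide the fact that the engine is gone.
    """
    try:
        return {"status": "ok", "result": fn()}
    except EngineDeadError as exc:
        # Surface engine death (for example, by triggering worker shutdown).
        on_engine_dead(exc)
        return {"status": "error", "message": f"engine dead: {exc}"}
    except Exception as exc:
        # Ordinary failures become an error response; the worker stays up.
        return {"status": "error", "message": str(exc)}
```

In this sketch the shutdown hook is passed in as a callback so the ordering can be exercised without a real engine.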
kaim-eng
added a commit
that referenced
this pull request
May 11, 2026
Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping:
makes the dev environment shareable across teammates by removing personal
identifiers, parameterizing all dev-pod / probe references, and folding 11
review fixes into the three Phase 1-4 design documents.
Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID
(`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
- Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and
`Dockerfile.planner-dev` moved to `deploy/planner/dev/` with
`${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline
usage instructions.
- Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads
`DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
- 5 `scripts/inspect_*.py` cluster probes parameterized via
`DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit)
rather than a hard-coded namespace.
- `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to
`<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...`
one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally
not committed" convention (matches `examples/deployments/powerplanner/
.tmp-gp-minimal.yaml`) and to block the four common root-level
personal-scratch files from sneaking back in via `git add .`.
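The "loud failure" namespace lookup described above might be sketched like this. Only the two environment variable names come from the commit message; the helper name `resolve_namespace`, the fallback order, and the error text are assumptions.

```python
import os


def resolve_namespace(env=None) -> str:
    """Return the target namespace from the environment, or exit loudly.

    Checks DYN_PARENT_DGD_K8S_NAMESPACE first, then POD_NAMESPACE, and never
    falls back to a hard-coded personal namespace.
    """
    env = os.environ if env is None else env
    for key in ("DYN_PARENT_DGD_K8S_NAMESPACE", "POD_NAMESPACE"):
        value = env.get(key, "").strip()
        if value:
            return value
    raise SystemExit(
        "Set DYN_PARENT_DGD_K8S_NAMESPACE (or POD_NAMESPACE) before running this script."
    )
```

Accepting an optional `env` mapping keeps the function easy to test without mutating the process environment.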
dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new
`deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the
`${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an
`envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR
`qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a
cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic
`kubectl wait --for=jsonpath='{.status.state}'=successful` form
in both §4 and the Checklist (removes the manual-`-w` divergence).
Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
- Header `Status` flipped from `Draft` to `Validated (Phases 1-3 -
590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still
draft.` with last-validation date.
- §3.1: added an explicit note that `aic_interpolation` and `mode`
are pre-existing PlannerConfig fields (owned by
`monitoring/aic_interpolation.py`) and intentionally absent from
the Names Registry.
- §5.7 / §6.7: reworded the cross-reference to failure mode #5 so
it correctly points at the *throughput regression* case rather
than reading as a blanket "config revert".
- §6.5 pseudocode: replaced module-level `self.device_count` with
`pynvml.nvmlDeviceGetCount()` (the original was syntactically
incorrect outside the daemon class).
- §13 Open Question #11: corrected the duplicated
`scheduled_decode_kv_tokens` typo so the agg-mode gate now reads
`scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching
§5.3.
* powerplanner-testbed-design.md
- §7: added a "Numeric-suffix convention" note explaining the
intentional D21/E21 ID collision (IDs unique by filename + tuple,
not renumbered).
- §11: corrected "Six guards" -> "Seven guards" to match the seven
items actually listed.
- §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
- §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper
+ 30 parametrized) so the listing matches the actual run.
Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented
Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py +
test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing
Windows-only environment limitations (cp1252 codec, `os.killpg`
POSIX-only, missing `filterpy`) confirmed via `git stash` to be
untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed
`tests/fault_tolerance/...` (pre-existing, separate component) and
`examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.
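A syntax check like the `py_compile` step listed above can also be scripted. This is a sketch under assumptions: `find_compile_errors` is a hypothetical helper, and it uses the standard-library `py_compile.compile` function rather than the command-line form quoted in the commit.

```python
import py_compile


def find_compile_errors(paths):
    """Byte-compile each file and collect the ones that fail.

    Returns a dict mapping file path (as a string) to the compiler's error message;
    an empty dict means every file compiled cleanly.
    """
    errors = {}
    for path in paths:
        try:
            # doraise=True turns a syntax error into PyCompileError instead of
            # printing it to stderr and continuing.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors[str(path)] = exc.msg
    return errors
```

Note that `py_compile.compile` writes a cached `.pyc` file next to each source file as a side effect.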
Co-authored-by: Cursor <cursoragent@cursor.com>
kaim-eng
added a commit
that referenced
this pull request
May 12, 2026
kaim-eng
added a commit
that referenced
this pull request
May 12, 2026
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 18, 2026
Signed-off-by: Kai Ma <kaim@nvidia.com>