CI: Cache image on main branch by saturley-hall · Pull Request #7 · ai-dynamo/dynamo

saturley-hall · 2025-03-04T17:13:23Z

No description provided.

github-actions · 2025-03-04T17:47:52Z

Test Results

2 files 2 suites 26s ⏱️
71 tests 71 ✅ 0 💤 0 ❌
89 runs 88 ✅ 1 💤 0 ❌

Results for commit e688847.

saturley-hall · 2025-03-04T17:52:34Z

Closing to reopen

Fixes all actionable items from the second review: Bug fixes: - #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED) - #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original) Test quality: - #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE) - #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure) Design / correctness: - #3: Fix _models_dir_env docstring ("exactly once" → "once per worker") - #4: Add comment noting TRANSFORMERS_CACHE deprecation - #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare - #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture - #11: Raise hub/ detection log to WARNING with guidance - #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred — complex due to conftest dependencies, low severity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>

Bug fixes: - #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache - #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout (bare-layout path was missing the writable-dir check present in HF_HOME test) Minor cleanups: - #4: Move `import pytest` from inside download_lora() to module top-level (lora_utils.py is test-only infra; pytest is always available) - #5: Replace pytestconfig.getoption("--models-dir") re-checks in predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR") (_models_dir_env runs first and sets the var; single source of truth) New coverage tests: - #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test verifying pytest_configure exits with returncode=2 on bad path - #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora() raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set Not addressed: #3 (keep gpu_0 per project guidelines and previous review retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>

…later Reader-first ordering. Old order buried Quick start at position 8 of 9; new order surfaces the runnable commands above all the reference sections (FSx vs EBS, label conventions, aiperf SHA rationale). New section order: 1. Title + 3-config table 2. Pre-requisites (was #3) 3. Quick start (was #8 — promoted) 4. Directory layout (was #7 — now serves as map for the rest) 5. Hardware targets (was #2 — now pure reference; invocation examples moved into Quick start) 6. Storage (was #5) 7. aiperf install (was #6) 8. Naming & ownership (was #4) 9. Notes (unchanged) Also drop the stray "We use ebs by default" sentence — it contradicted both the Storage section and the actual yaml (where the PVC block is fully commented out, no default storage class is set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cycle Address senior-architect review of the metrics infrastructure: #1 (two parallel Prometheus paths): merge `register_prometheus` + `setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings` method on the Rust `LLMEngine` trait. The trait surface shrinks from 7 → 6 methods; both expfmt bridging and the structured component publisher go through one place. The Python ABC keeps its two methods — the PyO3 bridge adapts (engine-author API unchanged for Python). #3 + #8 (framework lifecycle gauges gated on engine opt-in): - `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side `LifecycleGauges` (prometheus crate gauges registered via the runtime's `MetricsRegistry`). Worker constructs them after `engine.start()` succeeds; observes during cleanup/drain. Emits regardless of engine opt-in. - Drop `set_cleanup_time` / `set_drain_time` from the `ComponentMetricsPublisher` trait — engine no longer responsible. - Drop the corresponding methods from `PyComponentMetricsPublisher`. - `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for parity with the legacy entry points (both legacy `main.py` and the unified bridge populate it). The unified bridge now constructs `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when the engine returns no per-rank sources. #5 (kv_cache_hit_rate Optional semantics): clearer docstring on both Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly documented (`None` = no data, `0.0` = legitimate zero-hit). Deferred with rationale (explained in the post-review architect's note): - #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing but no current bug; 5 fields for 5 signals isn't worth the refactor cost yet. Revisit at ~10+ signals. - #4 (per-source cadence): explicit defer in original review. - #7 (conformance in CI): workflow change, separate from trait surface. Conformance kit collapses `check_register_prometheus` + `check_component_metrics` into `check_setup_metrics`. All 68 backend-common unit tests pass. Mocker example updated. PyO3 bridge wheel rebuilt successfully. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live test that already ships with this PR). Validates invariants that single-DGD tests structurally cannot reach: - A2 Three planners patch DISJOINT pod sets in one namespace - A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls) - B2 Power Agent applies 8 distinct caps on one tick - B4 Happy-path zero-counter invariant - B5 Multi-pod conflict + safe-default behaviour (skipped pending Finding #7 — CDI-safe bystander rewrite tracked separately) - 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the live driver state (not via mocks) Layout introduced by this commit: components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 + TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker pods with hardcoded annotations) and skips Path-A-only tests cleanly when running Path B. DCGM tests gated by DCGM_HOSTENGINE_AVAILABLE=1. examples/multi-dgd-live-test/ Manifests + helm values overrides + teardown script + README walk-through that an operator applies before running the test: 00-namespace.yaml — namespace + pull secret 10-power-agent-values-nvml.yaml — Power Agent NVML helm values 11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values 20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS 30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A) 40-conflict-bystander-pod.yaml — multi-pod conflict fixture 50-stub-workers.yaml — Path-B stub workers 99-teardown.sh — idempotent cleanup w/ cap- restore verification README.md — full reproduction playbook including a 2026-05-21 live run record + 11 findings docs/design-docs/multi-tenant-test.md Design doc (§8 — multi-DGD test plan) that establishes the ground-truth topology, the planner / agent contract under test, and the rationale for the artifact split above. pyproject.toml Registers two new test markers consumed by the live test's `pytestmark`: multi_dgd_live — scenario selector for this specific file power_agent — shared with components/power_agent/tests/ Module-level gating fails fast on misconfiguration (missing TEST_NODE, missing kubeconfig) with actionable error messages, and skips silently when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the default unit-test invocation. Verified: pytest --collect-only is a clean no-op without the env vars, and full-import works under RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not ImportError). Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review. Signed-off-by: Kai Ma <kaim@nvidia.com>

`TestMultiPodConflict` previously created a bystander pod with `NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS clusters running the NVIDIA Container Toolkit in CDI device mode (the dpp-dev-env default, and the direction NVIDIA is pushing for newer operator releases) reject that with `unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to admit the pod — so the test class was carrying a `@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled the whole §8.5.7 contract on every run. This commit lands the CDI-mode-safe rewrite: - Bystander now claims `nvidia.com/gpu: 1` from the device plugin (both `requests` and `limits` per K8s extended-resource rules). CDI-mode-agnostic — the device plugin owns the actual GPU wiring via its own machinery. - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml` mirrors the inline Python manifest exactly so YAML-debugging and test-debugging agree (the README explicitly calls this out). - New helper `_per_gpu_compute_pids()` snapshots `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the agent pod (which has hostPID + NVML access) and returns `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new PID on some index after the bystander reaches Running — that index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES pin needed. - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` → `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect the index-agnostic contract. Blast-radius / recovery asserts key off the discovered `assigned_gpu`; the loop over the other GPUs iterates `range(8)` and skips that index instead of hard-coding `range(1, 8)`. - Class-level `@pytest.mark.skip` decorator removed. The supporting `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither in code nor as a runtime branch, since neither path is needed once the CDI-safe flow works. - `bystander_uuid` class fixture removed (no longer needed — the UUID/index discovery is post-creation, not pre). `examples/multi-dgd-live-test/README.md` updated: - Finding #7 marked **Status: fixed** with cross-reference to this commit and an explicit description of the new (a)/(b)/(c) flow. - Expected-outcome counts updated: Path B now expects 5 PASSED / 2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the next live run. - Live-run record row updated to note the rewrite and expected PASS outcome. Verified locally: test module compiles + collects cleanly. The end-to-end CDI-vs-non-CDI assignment behaviour is by construction the device plugin's responsibility, so the failure mode of this rewrite is bounded to "device plugin returns no GPU within 90 s" — handled with a clear pytest.fail message that includes a before/after PID snapshot for debugging. Refs: PR #9683 review (Finding #7 in examples/multi-dgd-live-test/README.md). Signed-off-by: Kai Ma <kaim@nvidia.com>

…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>

Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live test that already ships with this PR). Validates invariants that single-DGD tests structurally cannot reach: - A2 Three planners patch DISJOINT pod sets in one namespace - A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls) - B2 Power Agent applies 8 distinct caps on one tick - B4 Happy-path zero-counter invariant - B5 Multi-pod conflict + safe-default behaviour (skipped pending Finding #7 — CDI-safe bystander rewrite tracked separately) - 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the live driver state (not via mocks) Layout introduced by this commit: components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 + TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker pods with hardcoded annotations) and skips Path-A-only tests cleanly when running Path B. DCGM tests gated by DCGM_HOSTENGINE_AVAILABLE=1. examples/multi-dgd-live-test/ Manifests + helm values overrides + teardown script + README walk-through that an operator applies before running the test: 00-namespace.yaml — namespace + pull secret 10-power-agent-values-nvml.yaml — Power Agent NVML helm values 11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values 20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS 30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A) 40-conflict-bystander-pod.yaml — multi-pod conflict fixture 50-stub-workers.yaml — Path-B stub workers 99-teardown.sh — idempotent cleanup w/ cap- restore verification README.md — full reproduction playbook including a 2026-05-21 live run record + 11 findings docs/design-docs/multi-tenant-test.md Design doc (§8 — multi-DGD test plan) that establishes the ground-truth topology, the planner / agent contract under test, and the rationale for the artifact split above. pyproject.toml Registers two new test markers consumed by the live test's `pytestmark`: multi_dgd_live — scenario selector for this specific file power_agent — shared with components/power_agent/tests/ Module-level gating fails fast on misconfiguration (missing TEST_NODE, missing kubeconfig) with actionable error messages, and skips silently when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the default unit-test invocation. Verified: pytest --collect-only is a clean no-op without the env vars, and full-import works under RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not ImportError). Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review. Signed-off-by: Kai Ma <kaim@nvidia.com>

`TestMultiPodConflict` previously created a bystander pod with `NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS clusters running the NVIDIA Container Toolkit in CDI device mode (the dpp-dev-env default, and the direction NVIDIA is pushing for newer operator releases) reject that with `unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to admit the pod — so the test class was carrying a `@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled the whole §8.5.7 contract on every run. This commit lands the CDI-mode-safe rewrite: - Bystander now claims `nvidia.com/gpu: 1` from the device plugin (both `requests` and `limits` per K8s extended-resource rules). CDI-mode-agnostic — the device plugin owns the actual GPU wiring via its own machinery. - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml` mirrors the inline Python manifest exactly so YAML-debugging and test-debugging agree (the README explicitly calls this out). - New helper `_per_gpu_compute_pids()` snapshots `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the agent pod (which has hostPID + NVML access) and returns `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new PID on some index after the bystander reaches Running — that index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES pin needed. - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` → `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect the index-agnostic contract. Blast-radius / recovery asserts key off the discovered `assigned_gpu`; the loop over the other GPUs iterates `range(8)` and skips that index instead of hard-coding `range(1, 8)`. - Class-level `@pytest.mark.skip` decorator removed. The supporting `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither in code nor as a runtime branch, since neither path is needed once the CDI-safe flow works. - `bystander_uuid` class fixture removed (no longer needed — the UUID/index discovery is post-creation, not pre). `examples/multi-dgd-live-test/README.md` updated: - Finding #7 marked **Status: fixed** with cross-reference to this commit and an explicit description of the new (a)/(b)/(c) flow. - Expected-outcome counts updated: Path B now expects 5 PASSED / 2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the next live run. - Live-run record row updated to note the rewrite and expected PASS outcome. Verified locally: test module compiles + collects cleanly. The end-to-end CDI-vs-non-CDI assignment behaviour is by construction the device plugin's responsibility, so the failure mode of this rewrite is bounded to "device plugin returns no GPU within 90 s" — handled with a clear pytest.fail message that includes a before/after PID snapshot for debugging. Refs: PR #9683 review (Finding #7 in examples/multi-dgd-live-test/README.md). Signed-off-by: Kai Ma <kaim@nvidia.com>

…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>

Two regression tests that pin CONTRACT.md's peer-failure-injection contract: LifecycleEvent::Failed mid-output (P0 #7 on prefill, P0 #9 on decode) must route through cleanup_failed_request EXACTLY ONCE even when a concurrent path (e.g., a spawn-catch failure-sink call) fires for the same request_id. P0 #7 — cd_prefill_frame_error_mid_output_triggers_cleanup - Setup: post-USAA prefill state, observer tracking outputs. - Trigger: session.inject_lifecycle(LifecycleEvent::Failed { .. }) PLUS a concurrent coord.cleanup_failed_request(rid, ..) call (mirroring the spawn-catch / direct-failure-sink race). - Assert: mark_failed_onboarding called EXACTLY ONCE with the full G1 window vLLM allocated; coord.states[rid] removed; mark_onboarding_complete NEVER called. - Probe verified: with the prefill CdRequest.cleanup_claimed CAS guard temporarily disabled, the test fails with 'got 2 calls'. P0 #9 — cd_decode_peer_failure_mid_output_cleans_up - Setup: post-USAA decode state, local-kick spawned but unresolved (so unfilled_g1_block_ids covers both local-match + remote slices). - Trigger: session.inject_lifecycle(LifecycleEvent::Failed { .. }) PLUS a concurrent CdFailureSink::on_session_failure call. - Assert: mark_failed_onboarding called EXACTLY ONCE with the full external G1 slice; inflight_budget fully released; wrapper cd_request_state evicted; coord.states evicted; mark_onboarding_complete NEVER called. - Probe verified: with the decode CdRequestState.cleanup_claimed CAS guard temporarily disabled, the test fails with 'got 2 calls'. Both tests use kvbm_engine::p2p::session::LifecycleEvent (already public) as the deterministic injection knob — no MockSession extension needed. The 'in-flight session calls return Err' branch of the original P0 #7 spec is not pinned here because MockSession's commit / make_available always return Ok regardless of lifecycle state; that would require a separate MockSession enhancement and exercises velo-wire semantics rather than connector semantics.

Codex stop-time review caught: the previous shape of these tests fired LifecycleEvent::Failed AND a concurrent cleanup call in parallel, then asserted exactly-once mark_failed_onboarding. That assertion holds even if the lifecycle path is dead — the direct call alone produces one cleanup. The test gated the CAS guard but NOT the wire from session.lifecycle() → watcher → cleanup. Restructure both tests into two sequential stages: Stage 1 — lifecycle path is the ONLY trigger. Pre-condition: no mark_failed_onboarding calls yet. Inject LifecycleEvent::Failed on the session. Wait. The resulting cleanup can only have come from the watcher → cleanup chain. Stage 2 — duplicate cleanup is a no-op. Fire a second cleanup_failed_request (or on_session_failure) after Stage 1 already claimed the CAS. The duplicate must NOT re-emit mark_failed_onboarding. Probe verification on the prefill side: short-circuiting Self::lifecycle_failure_reason to always return None breaks Stage 1 — the watcher takes the cooperative branch instead, no cleanup is dispatched, and Stage 1 times out waiting for mark_failed_onboarding. The lifecycle wire is gated. Probe on the decode side surfaced an asymmetry worth pinning in the test comment: even with lifecycle_failure_reason short- circuited, the decode test passes because the watcher's cooperative branch calls state.cancel.cancel(), which propagates through run_remote_pipeline's tokio::select! arm → spawn-catch → wrapper.cleanup_failed_request. So the decode test gates the OUTCOME contract (peer Failed → cleanup) but does not exclusively gate the lifecycle-failure-reason route. The shared Self::lifecycle_failure_reason function is gated by the prefill test; documenting the dual-path nature in the decode test comment. Both tests still pass (full suite 242 lib + 15 decode + 29 prefill + others green). The CAS-guard discrimination probes from the previous commit still hold: disabling either CAS makes the respective Stage 2 / mark_failed_onboarding count assertion fail.

/ai-dynamo#8/ai-dynamo#15/ai-dynamo#16) Test-only additions for the seams the review flagged as untested. ai-dynamo#7 _project_scale_to with a real apply outcome (4 cases): both components changed → full ScalingDecision; both equal current → None (PSM-equivalent no-change); single-component proposal → other count stays None; non-apply execute_action → None. Previously every adapter.tick test hit only the None/empty path, so a regression in the projection / no-change detection would have shipped silently. ai-dynamo#8 _tick_input_to_context + FPM encoding: build a TickInput with traffic (incl kv_hit_rate), worker counts (incl scaling-in-progress flags), and a real ForwardPassMetrics; assert the PipelineContext.observations mapping and that the FPM bytes decode back (key format "<worker_id>/<dp_rank>", canonical encoder). This is the ingress glue where the add_observations P1 + the projection live. ai-dynamo#15 registry mutation during an in-flight tick: suspend a PROPOSE plugin mid-gather (asyncio.Event), register a new plugin while suspended, release, and assert the late plugin did NOT join the in-flight stage (pre-tick snapshot) and the tick completed cleanly — then a fresh tick picks it up. Exercises the no-locks invariant that scheduler.py/server.py document but no test covered. ai-dynamo#16 test_tick_diagnostics_extended scope note: clarify in the module docstring that plugin_overrides / reconcile_reasons / held_over_plugins have no production populator in this PR; these tests lock the dataclass contract (defaults / no shared-mutable aliasing), not live behavior. 835 planner tests pass (+6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live test that already ships with this PR). Validates invariants that single-DGD tests structurally cannot reach: - A2 Three planners patch DISJOINT pod sets in one namespace - A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls) - B2 Power Agent applies 8 distinct caps on one tick - B4 Happy-path zero-counter invariant - B5 Multi-pod conflict + safe-default behaviour (skipped pending Finding #7 — CDI-safe bystander rewrite tracked separately) - 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the live driver state (not via mocks) Layout introduced by this commit: components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 + TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker pods with hardcoded annotations) and skips Path-A-only tests cleanly when running Path B. DCGM tests gated by DCGM_HOSTENGINE_AVAILABLE=1. examples/multi-dgd-live-test/ Manifests + helm values overrides + teardown script + README walk-through that an operator applies before running the test: 00-namespace.yaml — namespace + pull secret 10-power-agent-values-nvml.yaml — Power Agent NVML helm values 11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values 20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS 30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A) 40-conflict-bystander-pod.yaml — multi-pod conflict fixture 50-stub-workers.yaml — Path-B stub workers 99-teardown.sh — idempotent cleanup w/ cap- restore verification README.md — full reproduction playbook including a 2026-05-21 live run record + 11 findings docs/design-docs/multi-tenant-test.md Design doc (§8 — multi-DGD test plan) that establishes the ground-truth topology, the planner / agent contract under test, and the rationale for the artifact split above. pyproject.toml Registers two new test markers consumed by the live test's `pytestmark`: multi_dgd_live — scenario selector for this specific file power_agent — shared with components/power_agent/tests/ Module-level gating fails fast on misconfiguration (missing TEST_NODE, missing kubeconfig) with actionable error messages, and skips silently when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the default unit-test invocation. Verified: pytest --collect-only is a clean no-op without the env vars, and full-import works under RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not ImportError). Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review. Signed-off-by: Kai Ma <kaim@nvidia.com>

`TestMultiPodConflict` previously created a bystander pod with `NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS clusters running the NVIDIA Container Toolkit in CDI device mode (the dpp-dev-env default, and the direction NVIDIA is pushing for newer operator releases) reject that with `unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to admit the pod — so the test class was carrying a `@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled the whole §8.5.7 contract on every run. This commit lands the CDI-mode-safe rewrite: - Bystander now claims `nvidia.com/gpu: 1` from the device plugin (both `requests` and `limits` per K8s extended-resource rules). CDI-mode-agnostic — the device plugin owns the actual GPU wiring via its own machinery. - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml` mirrors the inline Python manifest exactly so YAML-debugging and test-debugging agree (the README explicitly calls this out). - New helper `_per_gpu_compute_pids()` snapshots `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the agent pod (which has hostPID + NVML access) and returns `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new PID on some index after the bystander reaches Running — that index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES pin needed. - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` → `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect the index-agnostic contract. Blast-radius / recovery asserts key off the discovered `assigned_gpu`; the loop over the other GPUs iterates `range(8)` and skips that index instead of hard-coding `range(1, 8)`. - Class-level `@pytest.mark.skip` decorator removed. The supporting `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither in code nor as a runtime branch, since neither path is needed once the CDI-safe flow works. - `bystander_uuid` class fixture removed (no longer needed — the UUID/index discovery is post-creation, not pre). `examples/multi-dgd-live-test/README.md` updated: - Finding #7 marked **Status: fixed** with cross-reference to this commit and an explicit description of the new (a)/(b)/(c) flow. - Expected-outcome counts updated: Path B now expects 5 PASSED / 2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the next live run. - Live-run record row updated to note the rewrite and expected PASS outcome. Verified locally: test module compiles + collects cleanly. The end-to-end CDI-vs-non-CDI assignment behaviour is by construction the device plugin's responsibility, so the failure mode of this rewrite is bounded to "device plugin returns no GPU within 90 s" — handled with a clear pytest.fail message that includes a before/after PID snapshot for debugging. Refs: PR #9683 review (Finding #7 in examples/multi-dgd-live-test/README.md). Signed-off-by: Kai Ma <kaim@nvidia.com>

…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>

Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live test that already ships with this PR). Validates invariants that single-DGD tests structurally cannot reach: - A2 Three planners patch DISJOINT pod sets in one namespace - A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls) - B2 Power Agent applies 8 distinct caps on one tick - B4 Happy-path zero-counter invariant - B5 Multi-pod conflict + safe-default behaviour (skipped pending Finding #7 — CDI-safe bystander rewrite tracked separately) - 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the live driver state (not via mocks) Layout introduced by this commit: components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 + TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker pods with hardcoded annotations) and skips Path-A-only tests cleanly when running Path B. DCGM tests gated by DCGM_HOSTENGINE_AVAILABLE=1. examples/multi-dgd-live-test/ Manifests + helm values overrides + teardown script + README walk-through that an operator applies before running the test: 00-namespace.yaml — namespace + pull secret 10-power-agent-values-nvml.yaml — Power Agent NVML helm values 11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values 20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS 30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A) 40-conflict-bystander-pod.yaml — multi-pod conflict fixture 50-stub-workers.yaml — Path-B stub workers 99-teardown.sh — idempotent cleanup w/ cap- restore verification README.md — full reproduction playbook including a 2026-05-21 live run record + 11 findings docs/design-docs/multi-tenant-test.md Design doc (§8 — multi-DGD test plan) that establishes the ground-truth topology, the planner / agent contract under test, and the rationale for the artifact split above. pyproject.toml Registers two new test markers consumed by the live test's `pytestmark`: multi_dgd_live — scenario selector for this specific file power_agent — shared with components/power_agent/tests/ Module-level gating fails fast on misconfiguration (missing TEST_NODE, missing kubeconfig) with actionable error messages, and skips silently when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the default unit-test invocation. Verified: pytest --collect-only is a clean no-op without the env vars, and full-import works under RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not ImportError). Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review. Signed-off-by: Kai Ma <kaim@nvidia.com>

`TestMultiPodConflict` previously created a bystander pod with `NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS clusters running the NVIDIA Container Toolkit in CDI device mode (the dpp-dev-env default, and the direction NVIDIA is pushing for newer operator releases) reject that with `unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to admit the pod — so the test class was carrying a `@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled the whole §8.5.7 contract on every run. This commit lands the CDI-mode-safe rewrite: - Bystander now claims `nvidia.com/gpu: 1` from the device plugin (both `requests` and `limits` per K8s extended-resource rules). CDI-mode-agnostic — the device plugin owns the actual GPU wiring via its own machinery. - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml` mirrors the inline Python manifest exactly so YAML-debugging and test-debugging agree (the README explicitly calls this out). - New helper `_per_gpu_compute_pids()` snapshots `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the agent pod (which has hostPID + NVML access) and returns `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new PID on some index after the bystander reaches Running — that index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES pin needed. - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` → `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect the index-agnostic contract. Blast-radius / recovery asserts key off the discovered `assigned_gpu`; the loop over the other GPUs iterates `range(8)` and skips that index instead of hard-coding `range(1, 8)`. - Class-level `@pytest.mark.skip` decorator removed. The supporting `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither in code nor as a runtime branch, since neither path is needed once the CDI-safe flow works. - `bystander_uuid` class fixture removed (no longer needed — the UUID/index discovery is post-creation, not pre). `examples/multi-dgd-live-test/README.md` updated: - Finding #7 marked **Status: fixed** with cross-reference to this commit and an explicit description of the new (a)/(b)/(c) flow. - Expected-outcome counts updated: Path B now expects 5 PASSED / 2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the next live run. - Live-run record row updated to note the rewrite and expected PASS outcome. Verified locally: test module compiles + collects cleanly. The end-to-end CDI-vs-non-CDI assignment behaviour is by construction the device plugin's responsibility, so the failure mode of this rewrite is bounded to "device plugin returns no GPU within 90 s" — handled with a clear pytest.fail message that includes a before/after PID snapshot for debugging. Refs: PR #9683 review (Finding #7 in examples/multi-dgd-live-test/README.md). Signed-off-by: Kai Ma <kaim@nvidia.com>

…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>

ci: cache image on main for speed of ci

89d25e2

saturley-hall temporarily deployed to GITLAB March 4, 2025 17:13 — with GitHub Actions Inactive

ci: run it on PR before setting back to main

e688847

saturley-hall temporarily deployed to GITLAB March 4, 2025 17:17 — with GitHub Actions Inactive

saturley-hall temporarily deployed to GITLAB March 4, 2025 17:23 — with GitHub Actions Inactive

saturley-hall closed this Mar 4, 2025

dagil-nvidia mentioned this pull request Oct 21, 2025

feat: Add automated dependency version tracking and extraction #3547

Closed

coderabbitai Bot mentioned this pull request Nov 5, 2025

feat: Update Dynamo k8s deployment example to use ModelExpress #4112

Closed

MatejKosec mentioned this pull request Feb 25, 2026

fix(responses): accept assistant output_text messages without id/status in input #6599

Merged

3 tasks

tanmayv25 mentioned this pull request Apr 15, 2026

DEP: Backend Interface -- LLMEngine ABC and Worker #8251

Open

atchernych mentioned this pull request Apr 23, 2026

Proposal: Hybrid Priority Scheduling in Dynamo #8580

Open

keivenchang mentioned this pull request Apr 24, 2026

[INVESTIGATE] sglang + DSv4: kwarg workaround breaks live — needs diagnosis #8671

Closed

4 tasks

dmitry-tokarev-nv mentioned this pull request May 26, 2026

fix(trtllm): restore deps, default CMD, and reduce layer count after upstream base switch #9889

Merged

15 tasks

dagil-nvidia mentioned this pull request Jun 3, 2026

feat(deploy): add LMCache MP aggregated serving example for vLLM #10185

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: Cache image on main branch#7

CI: Cache image on main branch#7
saturley-hall wants to merge 2 commits into
mainfrom
harrison/main_branch_cache

saturley-hall commented Mar 4, 2025

Uh oh!

github-actions Bot commented Mar 4, 2025

Uh oh!

saturley-hall commented Mar 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

saturley-hall commented Mar 4, 2025

Uh oh!

github-actions Bot commented Mar 4, 2025

Test Results

Uh oh!

saturley-hall commented Mar 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant