CI: Cache image on main branch#7
Closed
saturley-hall wants to merge 2 commits into
Closed
Conversation
Contributor
Test Results 2 files 2 suites 26s ⏱️ Results for commit e688847. |
Member
Author
|
Closing to reopen |
3 tasks
ranrubin
added a commit
that referenced
this pull request
Apr 20, 2026
Fixes all actionable items from the second review: Bug fixes: - #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED) - #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original) Test quality: - #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE) - #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure) Design / correctness: - #3: Fix _models_dir_env docstring ("exactly once" → "once per worker") - #4: Add comment noting TRANSFORMERS_CACHE deprecation - #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare - #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture - #11: Raise hub/ detection log to WARNING with guidance - #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred — complex due to conftest dependencies, low severity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin
added a commit
that referenced
this pull request
Apr 20, 2026
Bug fixes: - #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache - #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout (bare-layout path was missing the writable-dir check present in HF_HOME test) Minor cleanups: - #4: Move `import pytest` from inside download_lora() to module top-level (lora_utils.py is test-only infra; pytest is always available) - #5: Replace pytestconfig.getoption("--models-dir") re-checks in predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR") (_models_dir_env runs first and sets the var; single source of truth) New coverage tests: - #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test verifying pytest_configure exits with returncode=2 on bad path - #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora() raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set Not addressed: #3 (keep gpu_0 per project guidelines and previous review retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>
4 tasks
furionw
added a commit
that referenced
this pull request
May 12, 2026
…later Reader-first ordering. Old order buried Quick start at position 8 of 9; new order surfaces the runnable commands above all the reference sections (FSx vs EBS, label conventions, aiperf SHA rationale). New section order: 1. Title + 3-config table 2. Pre-requisites (was #3) 3. Quick start (was #8 — promoted) 4. Directory layout (was #7 — now serves as map for the rest) 5. Hardware targets (was #2 — now pure reference; invocation examples moved into Quick start) 6. Storage (was #5) 7. aiperf install (was #6) 8. Naming & ownership (was #4) 9. Notes (unchanged) Also drop the stray "We use ebs by default" sentence — it contradicted both the Storage section and the actual yaml (where the PVC block is fully commented out, no default storage class is set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tanmayv25
added a commit
that referenced
this pull request
May 16, 2026
…cycle Address senior-architect review of the metrics infrastructure: #1 (two parallel Prometheus paths): merge `register_prometheus` + `setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings` method on the Rust `LLMEngine` trait. The trait surface shrinks from 7 → 6 methods; both expfmt bridging and the structured component publisher go through one place. The Python ABC keeps its two methods — the PyO3 bridge adapts (engine-author API unchanged for Python). #3 + #8 (framework lifecycle gauges gated on engine opt-in): - `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side `LifecycleGauges` (prometheus crate gauges registered via the runtime's `MetricsRegistry`). Worker constructs them after `engine.start()` succeeds; observes during cleanup/drain. Emits regardless of engine opt-in. - Drop `set_cleanup_time` / `set_drain_time` from the `ComponentMetricsPublisher` trait — engine no longer responsible. - Drop the corresponding methods from `PyComponentMetricsPublisher`. - `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for parity with the legacy entry points (both legacy `main.py` and the unified bridge populate it). The unified bridge now constructs `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when the engine returns no per-rank sources. #5 (kv_cache_hit_rate Optional semantics): clearer docstring on both Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly documented (`None` = no data, `0.0` = legitimate zero-hit). Deferred with rationale (explained in the post-review architect's note): - #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing but no current bug; 5 fields for 5 signals isn't worth the refactor cost yet. Revisit at ~10+ signals. - #4 (per-source cadence): explicit defer in original review. - #7 (conformance in CI): workflow change, separate from trait surface. Conformance kit collapses `check_register_prometheus` + `check_component_metrics` into `check_setup_metrics`. All 68 backend-common unit tests pass. Mocker example updated. PyO3 bridge wheel rebuilt successfully. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng
added a commit
that referenced
this pull request
May 22, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:
- A2 Three planners patch DISJOINT pod sets in one namespace
- A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls)
- B2 Power Agent applies 8 distinct caps on one tick
- B4 Happy-path zero-counter invariant
- B5 Multi-pod conflict + safe-default behaviour (skipped pending
Finding #7 — CDI-safe bystander rewrite tracked separately)
- 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the
live driver state (not via mocks)
Layout introduced by this commit:
components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
pods with hardcoded annotations) and skips Path-A-only tests
cleanly when running Path B. DCGM tests gated by
DCGM_HOSTENGINE_AVAILABLE=1.
examples/multi-dgd-live-test/
Manifests + helm values overrides + teardown script + README
walk-through that an operator applies before running the test:
00-namespace.yaml — namespace + pull secret
10-power-agent-values-nvml.yaml — Power Agent NVML helm values
11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values
20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A)
40-conflict-bystander-pod.yaml — multi-pod conflict fixture
50-stub-workers.yaml — Path-B stub workers
99-teardown.sh — idempotent cleanup w/ cap-
restore verification
README.md — full reproduction playbook
including a 2026-05-21 live
run record + 11 findings
docs/design-docs/multi-tenant-test.md
Design doc (§8 — multi-DGD test plan) that establishes the
ground-truth topology, the planner / agent contract under test,
and the rationale for the artifact split above.
pyproject.toml
Registers two new test markers consumed by the live test's
`pytestmark`:
multi_dgd_live — scenario selector for this specific file
power_agent — shared with components/power_agent/tests/
Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).
Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 22, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.
This commit lands the CDI-mode-safe rewrite:
- Bystander now claims `nvidia.com/gpu: 1` from the device plugin
(both `requests` and `limits` per K8s extended-resource rules).
CDI-mode-agnostic — the device plugin owns the actual GPU wiring
via its own machinery.
- `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
mirrors the inline Python manifest exactly so YAML-debugging and
test-debugging agree (the README explicitly calls this out).
- New helper `_per_gpu_compute_pids()` snapshots
`nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
agent pod (which has hostPID + NVML access) and returns
`{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
PID on some index after the bystander reaches Running — that
index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
pin needed.
- Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
`test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
the index-agnostic contract. Blast-radius / recovery asserts key
off the discovered `assigned_gpu`; the loop over the other GPUs
iterates `range(8)` and skips that index instead of hard-coding
`range(1, 8)`.
- Class-level `@pytest.mark.skip` decorator removed. The supporting
`_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
in code nor as a runtime branch, since neither path is needed
once the CDI-safe flow works.
- `bystander_uuid` class fixture removed (no longer needed — the
UUID/index discovery is post-creation, not pre).
`examples/multi-dgd-live-test/README.md` updated:
- Finding #7 marked **Status: fixed** with cross-reference to this
commit and an explicit description of the new (a)/(b)/(c) flow.
- Expected-outcome counts updated: Path B now expects 5 PASSED /
2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
next live run.
- Live-run record row updated to note the rewrite and expected
PASS outcome.
Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.
Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 22, 2026
…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 25, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:
- A2 Three planners patch DISJOINT pod sets in one namespace
- A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls)
- B2 Power Agent applies 8 distinct caps on one tick
- B4 Happy-path zero-counter invariant
- B5 Multi-pod conflict + safe-default behaviour (skipped pending
Finding #7 — CDI-safe bystander rewrite tracked separately)
- 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the
live driver state (not via mocks)
Layout introduced by this commit:
components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
pods with hardcoded annotations) and skips Path-A-only tests
cleanly when running Path B. DCGM tests gated by
DCGM_HOSTENGINE_AVAILABLE=1.
examples/multi-dgd-live-test/
Manifests + helm values overrides + teardown script + README
walk-through that an operator applies before running the test:
00-namespace.yaml — namespace + pull secret
10-power-agent-values-nvml.yaml — Power Agent NVML helm values
11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values
20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A)
40-conflict-bystander-pod.yaml — multi-pod conflict fixture
50-stub-workers.yaml — Path-B stub workers
99-teardown.sh — idempotent cleanup w/ cap-
restore verification
README.md — full reproduction playbook
including a 2026-05-21 live
run record + 11 findings
docs/design-docs/multi-tenant-test.md
Design doc (§8 — multi-DGD test plan) that establishes the
ground-truth topology, the planner / agent contract under test,
and the rationale for the artifact split above.
pyproject.toml
Registers two new test markers consumed by the live test's
`pytestmark`:
multi_dgd_live — scenario selector for this specific file
power_agent — shared with components/power_agent/tests/
Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).
Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 25, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.
This commit lands the CDI-mode-safe rewrite:
- Bystander now claims `nvidia.com/gpu: 1` from the device plugin
(both `requests` and `limits` per K8s extended-resource rules).
CDI-mode-agnostic — the device plugin owns the actual GPU wiring
via its own machinery.
- `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
mirrors the inline Python manifest exactly so YAML-debugging and
test-debugging agree (the README explicitly calls this out).
- New helper `_per_gpu_compute_pids()` snapshots
`nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
agent pod (which has hostPID + NVML access) and returns
`{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
PID on some index after the bystander reaches Running — that
index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
pin needed.
- Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
`test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
the index-agnostic contract. Blast-radius / recovery asserts key
off the discovered `assigned_gpu`; the loop over the other GPUs
iterates `range(8)` and skips that index instead of hard-coding
`range(1, 8)`.
- Class-level `@pytest.mark.skip` decorator removed. The supporting
`_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
in code nor as a runtime branch, since neither path is needed
once the CDI-safe flow works.
- `bystander_uuid` class fixture removed (no longer needed — the
UUID/index discovery is post-creation, not pre).
`examples/multi-dgd-live-test/README.md` updated:
- Finding #7 marked **Status: fixed** with cross-reference to this
commit and an explicit description of the new (a)/(b)/(c) flow.
- Expected-outcome counts updated: Path B now expects 5 PASSED /
2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
next live run.
- Live-run record row updated to note the rewrite and expected
PASS outcome.
Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.
Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 25, 2026
…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>
Merged
15 tasks
ryanolson
added a commit
that referenced
this pull request
May 26, 2026
Two regression tests that pin CONTRACT.md's peer-failure-injection contract: LifecycleEvent::Failed mid-output (P0 #7 on prefill, P0 #9 on decode) must route through cleanup_failed_request EXACTLY ONCE even when a concurrent path (e.g., a spawn-catch failure-sink call) fires for the same request_id. P0 #7 — cd_prefill_frame_error_mid_output_triggers_cleanup - Setup: post-USAA prefill state, observer tracking outputs. - Trigger: session.inject_lifecycle(LifecycleEvent::Failed { .. }) PLUS a concurrent coord.cleanup_failed_request(rid, ..) call (mirroring the spawn-catch / direct-failure-sink race). - Assert: mark_failed_onboarding called EXACTLY ONCE with the full G1 window vLLM allocated; coord.states[rid] removed; mark_onboarding_complete NEVER called. - Probe verified: with the prefill CdRequest.cleanup_claimed CAS guard temporarily disabled, the test fails with 'got 2 calls'. P0 #9 — cd_decode_peer_failure_mid_output_cleans_up - Setup: post-USAA decode state, local-kick spawned but unresolved (so unfilled_g1_block_ids covers both local-match + remote slices). - Trigger: session.inject_lifecycle(LifecycleEvent::Failed { .. }) PLUS a concurrent CdFailureSink::on_session_failure call. - Assert: mark_failed_onboarding called EXACTLY ONCE with the full external G1 slice; inflight_budget fully released; wrapper cd_request_state evicted; coord.states evicted; mark_onboarding_complete NEVER called. - Probe verified: with the decode CdRequestState.cleanup_claimed CAS guard temporarily disabled, the test fails with 'got 2 calls'. Both tests use kvbm_engine::p2p::session::LifecycleEvent (already public) as the deterministic injection knob — no MockSession extension needed. The 'in-flight session calls return Err' branch of the original P0 #7 spec is not pinned here because MockSession's commit / make_available always return Ok regardless of lifecycle state; that would require a separate MockSession enhancement and exercises velo-wire semantics rather than connector semantics.
ryanolson
added a commit
that referenced
this pull request
May 26, 2026
Codex stop-time review caught: the previous shape of these tests fired LifecycleEvent::Failed AND a concurrent cleanup call in parallel, then asserted exactly-once mark_failed_onboarding. That assertion holds even if the lifecycle path is dead — the direct call alone produces one cleanup. The test gated the CAS guard but NOT the wire from session.lifecycle() → watcher → cleanup. Restructure both tests into two sequential stages: Stage 1 — lifecycle path is the ONLY trigger. Pre-condition: no mark_failed_onboarding calls yet. Inject LifecycleEvent::Failed on the session. Wait. The resulting cleanup can only have come from the watcher → cleanup chain. Stage 2 — duplicate cleanup is a no-op. Fire a second cleanup_failed_request (or on_session_failure) after Stage 1 already claimed the CAS. The duplicate must NOT re-emit mark_failed_onboarding. Probe verification on the prefill side: short-circuiting Self::lifecycle_failure_reason to always return None breaks Stage 1 — the watcher takes the cooperative branch instead, no cleanup is dispatched, and Stage 1 times out waiting for mark_failed_onboarding. The lifecycle wire is gated. Probe on the decode side surfaced an asymmetry worth pinning in the test comment: even with lifecycle_failure_reason short- circuited, the decode test passes because the watcher's cooperative branch calls state.cancel.cancel(), which propagates through run_remote_pipeline's tokio::select! arm → spawn-catch → wrapper.cleanup_failed_request. So the decode test gates the OUTCOME contract (peer Failed → cleanup) but does not exclusively gate the lifecycle-failure-reason route. The shared Self::lifecycle_failure_reason function is gated by the prefill test; documenting the dual-path nature in the decode test comment. Both tests still pass (full suite 242 lib + 15 decode + 29 prefill + others green). The CAS-guard discrimination probes from the previous commit still hold: disabling either CAS makes the respective Stage 2 / mark_failed_onboarding count assertion fail.
kangclzjc
added a commit
to kangclzjc/dynamo
that referenced
this pull request
Jun 4, 2026
/ai-dynamo#8/ai-dynamo#15/ai-dynamo#16) Test-only additions for the seams the review flagged as untested. ai-dynamo#7 _project_scale_to with a real apply outcome (4 cases): both components changed → full ScalingDecision; both equal current → None (PSM-equivalent no-change); single-component proposal → other count stays None; non-apply execute_action → None. Previously every adapter.tick test hit only the None/empty path, so a regression in the projection / no-change detection would have shipped silently. ai-dynamo#8 _tick_input_to_context + FPM encoding: build a TickInput with traffic (incl kv_hit_rate), worker counts (incl scaling-in-progress flags), and a real ForwardPassMetrics; assert the PipelineContext.observations mapping and that the FPM bytes decode back (key format "<worker_id>/<dp_rank>", canonical encoder). This is the ingress glue where the add_observations P1 + the projection live. ai-dynamo#15 registry mutation during an in-flight tick: suspend a PROPOSE plugin mid-gather (asyncio.Event), register a new plugin while suspended, release, and assert the late plugin did NOT join the in-flight stage (pre-tick snapshot) and the tick completed cleanly — then a fresh tick picks it up. Exercises the no-locks invariant that scheduler.py/server.py document but no test covered. ai-dynamo#16 test_tick_diagnostics_extended scope note: clarify in the module docstring that plugin_overrides / reconcile_reasons / held_over_plugins have no production populator in this PR; these tests lock the dataclass contract (defaults / no shared-mutable aliasing), not live behavior. 835 planner tests pass (+6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kaim-eng
added a commit
that referenced
this pull request
Jun 8, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:
- A2 Three planners patch DISJOINT pod sets in one namespace
- A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls)
- B2 Power Agent applies 8 distinct caps on one tick
- B4 Happy-path zero-counter invariant
- B5 Multi-pod conflict + safe-default behaviour (skipped pending
Finding #7 — CDI-safe bystander rewrite tracked separately)
- 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the
live driver state (not via mocks)
Layout introduced by this commit:
components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
pods with hardcoded annotations) and skips Path-A-only tests
cleanly when running Path B. DCGM tests gated by
DCGM_HOSTENGINE_AVAILABLE=1.
examples/multi-dgd-live-test/
Manifests + helm values overrides + teardown script + README
walk-through that an operator applies before running the test:
00-namespace.yaml — namespace + pull secret
10-power-agent-values-nvml.yaml — Power Agent NVML helm values
11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values
20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A)
40-conflict-bystander-pod.yaml — multi-pod conflict fixture
50-stub-workers.yaml — Path-B stub workers
99-teardown.sh — idempotent cleanup w/ cap-
restore verification
README.md — full reproduction playbook
including a 2026-05-21 live
run record + 11 findings
docs/design-docs/multi-tenant-test.md
Design doc (§8 — multi-DGD test plan) that establishes the
ground-truth topology, the planner / agent contract under test,
and the rationale for the artifact split above.
pyproject.toml
Registers two new test markers consumed by the live test's
`pytestmark`:
multi_dgd_live — scenario selector for this specific file
power_agent — shared with components/power_agent/tests/
Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).
Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
Jun 8, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.
This commit lands the CDI-mode-safe rewrite:
- Bystander now claims `nvidia.com/gpu: 1` from the device plugin
(both `requests` and `limits` per K8s extended-resource rules).
CDI-mode-agnostic — the device plugin owns the actual GPU wiring
via its own machinery.
- `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
mirrors the inline Python manifest exactly so YAML-debugging and
test-debugging agree (the README explicitly calls this out).
- New helper `_per_gpu_compute_pids()` snapshots
`nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
agent pod (which has hostPID + NVML access) and returns
`{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
PID on some index after the bystander reaches Running — that
index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
pin needed.
- Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
`test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
the index-agnostic contract. Blast-radius / recovery asserts key
off the discovered `assigned_gpu`; the loop over the other GPUs
iterates `range(8)` and skips that index instead of hard-coding
`range(1, 8)`.
- Class-level `@pytest.mark.skip` decorator removed. The supporting
`_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
in code nor as a runtime branch, since neither path is needed
once the CDI-safe flow works.
- `bystander_uuid` class fixture removed (no longer needed — the
UUID/index discovery is post-creation, not pre).
`examples/multi-dgd-live-test/README.md` updated:
- Finding #7 marked **Status: fixed** with cross-reference to this
commit and an explicit description of the new (a)/(b)/(c) flow.
- Expected-outcome counts updated: Path B now expects 5 PASSED /
2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
next live run.
- Live-run record row updated to note the rewrite and expected
PASS outcome.
Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.
Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
Jun 8, 2026
…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
Jun 8, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:
- A2 Three planners patch DISJOINT pod sets in one namespace
- A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls)
- B2 Power Agent applies 8 distinct caps on one tick
- B4 Happy-path zero-counter invariant
- B5 Multi-pod conflict + safe-default behaviour (skipped pending
Finding #7 — CDI-safe bystander rewrite tracked separately)
- 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the
live driver state (not via mocks)
Layout introduced by this commit:
components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
pods with hardcoded annotations) and skips Path-A-only tests
cleanly when running Path B. DCGM tests gated by
DCGM_HOSTENGINE_AVAILABLE=1.
examples/multi-dgd-live-test/
Manifests + helm values overrides + teardown script + README
walk-through that an operator applies before running the test:
00-namespace.yaml — namespace + pull secret
10-power-agent-values-nvml.yaml — Power Agent NVML helm values
11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values
20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A)
40-conflict-bystander-pod.yaml — multi-pod conflict fixture
50-stub-workers.yaml — Path-B stub workers
99-teardown.sh — idempotent cleanup w/ cap-
restore verification
README.md — full reproduction playbook
including a 2026-05-21 live
run record + 11 findings
docs/design-docs/multi-tenant-test.md
Design doc (§8 — multi-DGD test plan) that establishes the
ground-truth topology, the planner / agent contract under test,
and the rationale for the artifact split above.
pyproject.toml
Registers two new test markers consumed by the live test's
`pytestmark`:
multi_dgd_live — scenario selector for this specific file
power_agent — shared with components/power_agent/tests/
Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).
Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
Jun 8, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.
This commit lands the CDI-mode-safe rewrite:
- Bystander now claims `nvidia.com/gpu: 1` from the device plugin
(both `requests` and `limits` per K8s extended-resource rules).
CDI-mode-agnostic — the device plugin owns the actual GPU wiring
via its own machinery.
- `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
mirrors the inline Python manifest exactly so YAML-debugging and
test-debugging agree (the README explicitly calls this out).
- New helper `_per_gpu_compute_pids()` snapshots
`nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
agent pod (which has hostPID + NVML access) and returns
`{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
PID on some index after the bystander reaches Running — that
index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
pin needed.
- Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
`test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
the index-agnostic contract. Blast-radius / recovery asserts key
off the discovered `assigned_gpu`; the loop over the other GPUs
iterates `range(8)` and skips that index instead of hard-coding
`range(1, 8)`.
- Class-level `@pytest.mark.skip` decorator removed. The supporting
`_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
in code nor as a runtime branch, since neither path is needed
once the CDI-safe flow works.
- `bystander_uuid` class fixture removed (no longer needed — the
UUID/index discovery is post-creation, not pre).
`examples/multi-dgd-live-test/README.md` updated:
- Finding #7 marked **Status: fixed** with cross-reference to this
commit and an explicit description of the new (a)/(b)/(c) flow.
- Expected-outcome counts updated: Path B now expects 5 PASSED /
2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
next live run.
- Live-run record row updated to note the rewrite and expected
PASS outcome.
Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.
Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
Jun 8, 2026
…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.